SRX5400/5600/5800 using SPC II (SRX5K-SPC-4-15-320) and running Junos OS:
- 12.1X44-D10/D15/D20/D25/D30/D35
- 12.1X45-D10/D15/D20/D25/D30
- 12.1X46-D10/D15/D20
- 12.1X47-D10
Alert Description: In rare conditions, a cache error exception can occur at random on the SRX5K SPC II (Services
Processing Card, SRX5K-SPC-4-15-320) when the SPC references an
invalid physical address in memory. This causes the flowd process to dump core,
and all SPCs on the local node restart. If the chassis cluster feature
is enabled, the data plane fails over to the other node.
Solution: For example, the following output is shown when this issue occurs.
root@SRX5K> show system core-dumps
-rw-rw---- 1 nobody wheel 941023387 May 12 23:45 /var/tmp/flowd_xlr-SPC7_PIC3.core.0.gz
root@SRX5K> show log messages
...
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: cpuid = 26
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3-ADDRESS_ERR: pid 251 (flowd_xlr), uid 0: pc 0xffffffff802927e0 got a read fault at 0xffffffff802927e0
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: Trapframe Register Dump:
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: zero: 0000000000000000 at: fffffffffffffdff v0: 0000000000000001 v1: 00000001c9322248
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: a0: ffffffff80a00406 a1: 00000000200f09fc a2: 0000000000000000 a3: 00000000243ab1b8
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: t0: 0000000000009e63 t1: 00000001eb0e4a50 t2: 00000002dab903c0 t3: 00000002dab90390
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: ta0: ffffffffd1ebbad8 ta1: 0000000000000000 ta2: 0000000000000000 ta3: 0000000000000000
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: t8: 000000000000003a t9: 0000000020105650 s0: 000000000000001a s1: 00000000243ab288
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: s2: 00000000243ab1b8 s3: 00000000243ab1b8 s4: 000000000000001a s5: 00000000241d0000
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: s6: 0000000021950000 s7: 00000001c9321e88 k0: 0000000000000000 k1: 0000000000000000
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: gp: 0000000000000000 sp: 0000000fdd5eaea0 s8: 000000000000001a ra: 00000000200f0bc8
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: sr: 00000000508198f3 mullo: 0000000000000000 mulhi: 0000000000000000
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: pc: ffffffff802927e0 cause: 0000000000000010 badvaddr: ffffffff802927e0
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: pc address 0xffffffff802927e0 is inaccessible, pte = 0x0
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: flowd core,
May 12 23:43:31 fpc7 Cowra: %PFE-3: XLP3 flowd_xlr core dump, current state SPU_STATE_WORKING.
May 12 23:43:31 fpc7 flowd_xlr coredump start, ecc regs: %PFE-3: 0,0,0,0
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: stop xaui rx and drain packets on lbt cpu 4
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: msgring_drain_process: bind thread to hwtid (4) cpuid(4)
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: [msgring
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: _drain_process]476 msges drained
May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: Kernel thread "msgdrainthr4" (pid 41228) exited prematurely.
May 12 23:43:32 fpc7 Cowra: %PFE-3: XLP3 flowd_xlr down, current state SPU_STATE_CRASH. info: Flowd down, flowd_xlr_statusfound flowd in coredump.
May 12 23:43:32 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: spu_cobar_send_mail_unlocked: New mail (6) tried 2 times to be sent, finally sent
May 12 23:45:05 /kernel: %KERN-4: peer_inputs:4300 VKS0 closing connection peer type 10 indx 31 err 0
May 12 23:45:05 /kernel: %KERN-3: pfe_send_failed(index 31, type 10), err=32
May 12 23:45:05 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-4: peer_inputs:4300 VKS0 closing connection peer type 10 indx 31 err 0
May 12 23:45:05 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: pfe_send_failed(index 31, type 10), err=32
May 12 23:45:10 /kernel: %KERN-3: ###rdp_usr_detach tcb NULL socket 0xc6a824d4
May 12 23:45:15 fpc7 Cowra: %PFE-3: XLP3 flowd_xlr down complete.
May 12 23:45:15 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 init: %AUTH-6: flowd_xlr (PID 173) exited with status=0 Normal Exit
May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_IFDEV_DETACH_PIC: ifdev_detach_pic(7/3)
May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_SNMP_TRAP7: SNMP trap generated: Fru Failed (jnxFruContentsIndex 7, jnxFruL1Index 8, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: SRX5k SPC II @ 7/*/*, jnxFruType 3, jnxFruSlot 7)
May 12 23:45:16 alarmd[975]: %DAEMON-4: Alarm set: PIC color=RED, class=CHASSIS, reason=FPC 7 PIC 3 SPU flowd core dump complete
May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_PIC_OFFLINE_NOTICE: Taking PIC 3 in FPC 7 offline: SPU flowd core dump complete
May 12 23:45:16 craftd[976]: %DAEMON-4: Major alarm set, FPC 7 PIC 3 SPU flowd core dump complete
May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 7 offline: Reset on SPC/SPU failure
May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(7)
May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Reset on SPC/SPU failure
May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(0)
....
- SPU kernel detected user space address error - %USER-3-ADDRESS_ERR: pid 251 (flowd_xlr), uid 0: pc 0xffffffff802927e0 got a read fault at 0xffffffff802927e0
- SPU kernel started to generate flowd core after collecting register values - %USER-2: flowd core
- Chassisd detached affected SPC - %DAEMON-5-CHASSISD_IFDEV_DETACH_PIC: ifdev_detach_pic(7/3)
- Chassisd offlined affected SPC due to SPU flowd core dump - %DAEMON-5-CHASSISD_PIC_OFFLINE_NOTICE: Taking PIC 3 in FPC 7 offline: SPU flowd core dump complete
- Chassisd reset all SPCs - %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Reset on SPC/SPU failure
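The sequence of events listed above can be picked out of a saved copy of the messages log after the fact. The following is a minimal sketch in Python, not a Juniper tool: the three patterns are copied from the sample output in this article, and treating them as a signature for this issue is an assumption. The function name summarize_flowd_crash and the SAMPLE list are hypothetical names used only for illustration.

```python
import re

# Patterns copied from the sample log in this article. Using them as a
# signature for this specific issue is an assumption, not vendor guidance.
ADDRESS_ERR = re.compile(
    r"\(FPC Slot (\d+), PIC Slot (\d+)\).*%USER-3-ADDRESS_ERR: pid \d+ \(flowd_xlr\)"
)
PIC_OFFLINE = re.compile(
    r"CHASSISD_PIC_OFFLINE_NOTICE: Taking PIC (\d+) in FPC (\d+) offline: "
    r"SPU flowd core dump complete"
)
FRU_RESET = re.compile(
    r"CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC (\d+) offline: Reset on SPC/SPU failure"
)


def summarize_flowd_crash(lines):
    """Collect (fpc, pic) pairs and FPC slots mentioned by the three events."""
    summary = {"address_errors": [], "pics_offlined": [], "fpcs_reset": []}
    for line in lines:
        m = ADDRESS_ERR.search(line)
        if m:
            summary["address_errors"].append((int(m.group(1)), int(m.group(2))))
            continue
        m = PIC_OFFLINE.search(line)
        if m:
            # The message names the PIC first, then the FPC; store as (fpc, pic).
            summary["pics_offlined"].append((int(m.group(2)), int(m.group(1))))
            continue
        m = FRU_RESET.search(line)
        if m:
            summary["fpcs_reset"].append(int(m.group(1)))
    return summary


# Four lines taken verbatim from the sample output above.
SAMPLE = [
    "May 12 23:43:31 (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3-ADDRESS_ERR: "
    "pid 251 (flowd_xlr), uid 0: pc 0xffffffff802927e0 got a read fault at 0xffffffff802927e0",
    "May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_PIC_OFFLINE_NOTICE: "
    "Taking PIC 3 in FPC 7 offline: SPU flowd core dump complete",
    "May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: "
    "Taking FPC 7 offline: Reset on SPC/SPU failure",
    "May 12 23:45:16 chassisd[1506]: %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: "
    "Taking FPC 0 offline: Reset on SPC/SPU failure",
]

if __name__ == "__main__":
    print(summarize_flowd_crash(SAMPLE))
```

A match on all three events only suggests that the log resembles the example here. Confirming the root cause still requires analysis of the flowd core file.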
This issue can be tracked via PR1005195.
This issue is fixed in Junos OS 12.1X44-D40, 12.1X46-D25, 12.1X47-D15 and higher versions.
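The affected and fixed releases can be encoded in a small lookup to check a device's reported version. The following is a minimal sketch in Python, assuming version strings of the form 12.1X44-D35 (any trailing suffix after the D number is ignored, which is an assumption). The names AFFECTED, FIXED_FROM and classify are hypothetical. Because this article lists no fixed release for 12.1X45, later 12.1X45 releases are reported as "unknown" rather than guessed.

```python
import re

# Affected releases exactly as listed in this article.
AFFECTED = {
    "12.1X44": {10, 15, 20, 25, 30, 35},
    "12.1X45": {10, 15, 20, 25, 30},
    "12.1X46": {10, 15, 20},
    "12.1X47": {10},
}

# First fixed D release per train, as listed in this article.
# No fixed release is listed for 12.1X45, so it is deliberately absent.
FIXED_FROM = {"12.1X44": 40, "12.1X46": 25, "12.1X47": 15}

VERSION_RE = re.compile(r"^(\d+\.\d+X\d+)-D(\d+)")


def classify(version):
    """Return 'affected', 'fixed', or 'unknown' for a Junos version string."""
    m = VERSION_RE.match(version.strip())
    if not m:
        return "unknown"
    train, drop = m.group(1), int(m.group(2))
    if drop in AFFECTED.get(train, set()):
        return "affected"
    if train in FIXED_FROM and drop >= FIXED_FROM[train]:
        return "fixed"
    # Anything not covered by the lists above is left undecided.
    return "unknown"


if __name__ == "__main__":
    for v in ("12.1X44-D35", "12.1X44-D40", "12.1X45-D35", "12.1X47-D10"):
        print(v, classify(v))
```

The check applies only to SRX5400/5600/5800 devices with SPC II installed; the sketch does not verify the hardware.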
NOTE: There is no known way to detect this condition before the cache error triggers the flowd core, and no workaround is available. If the system reports a flowd core, please open a "Technical Service Request" in Case Manager.