Tuesday, 17 March 2015

Cache Error on SRX5K SPC II causes flowd process core

Product Affected:

SRX5400, SRX5600, and SRX5800 with SPC II (SRX5K-SPC-4-15-320) running Junos OS:
  • 12.1X44-D10/D15/D20/D25/D30/D35
  • 12.1X45-D10/D15/D20/D25/D30
  • 12.1X46-D10/D15/D20
  • 12.1X47-D10
Alert Description:
In rare conditions, a cache error exception can occur at random on the SRX5K SPC II (Services Processing Card, SRX5K-SPC-4-15-320) when the SPC references an invalid physical address in memory. The exception causes the flowd process to dump core, and all SPCs on the local node restart. If the chassis cluster feature is enabled, the data plane fails over to the other node.

For example, output similar to the following appears when this issue occurs:

root@SRX5K> show system core-dumps

-rw-rw----  1 nobody wheel 941023387 May 12 23:45 /var/tmp/flowd_xlr-SPC7_PIC3.core.0.gz

root@SRX5K> show log messages
...
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: cpuid = 26
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3-ADDRESS_ERR: pid 251 (flowd_xlr), uid 0: pc 0xffffffff802927e0 got a read fault at 0xffffffff802927e0
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: Trapframe Register Dump:
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: zero: 0000000000000000  at: fffffffffffffdff  v0: 0000000000000001  v1: 00000001c9322248
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: a0: ffffffff80a00406  a1: 00000000200f09fc  a2: 0000000000000000  a3: 00000000243ab1b8
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: t0: 0000000000009e63  t1: 00000001eb0e4a50  t2: 00000002dab903c0  t3: 00000002dab90390
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: ta0: ffffffffd1ebbad8 ta1: 0000000000000000 ta2: 0000000000000000 ta3: 0000000000000000
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: t8: 000000000000003a  t9: 0000000020105650  s0: 000000000000001a  s1: 00000000243ab288
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: s2: 00000000243ab1b8  s3: 00000000243ab1b8  s4: 000000000000001a  s5: 00000000241d0000
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: s6: 0000000021950000  s7: 00000001c9321e88  k0: 0000000000000000  k1: 0000000000000000
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: gp: 0000000000000000  sp: 0000000fdd5eaea0  s8: 000000000000001a  ra: 00000000200f0bc8
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: sr: 00000000508198f3 mullo: 0000000000000000    mulhi: 0000000000000000
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: pc: ffffffff802927e0 cause: 0000000000000010 badvaddr: ffffffff802927e0
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: pc address 0xffffffff802927e0 is inaccessible, pte = 0x0
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: flowd core, 
May 12 23:43:31   fpc7 Cowra: %PFE-3: XLP3 flowd_xlr core dump, current state SPU_STATE_WORKING. 
May 12 23:43:31   fpc7 flowd_xlr coredump start, ecc regs: %PFE-3: 0,0,0,0 
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: stop xaui rx and drain packets on lbt cpu 4
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: msgring_drain_process: bind thread to hwtid (4) cpuid(4)
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: [msgring
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: _drain_process]476 msges drained
May 12 23:43:31   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-2: Kernel thread "msgdrainthr4" (pid 41228) exited prematurely.
May 12 23:43:32   fpc7 Cowra: %PFE-3: XLP3 flowd_xlr down, current state SPU_STATE_CRASH. info: Flowd down, flowd_xlr_statusfound flowd in coredump.  
May 12 23:43:32   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: spu_cobar_send_mail_unlocked: New mail (6) tried 2 times to be sent, finally sent
May 12 23:45:05   /kernel: %KERN-4: peer_inputs:4300 VKS0 closing connection peer type 10 indx 31 err 0
May 12 23:45:05   /kernel: %KERN-3: pfe_send_failed(index 31, type 10), err=32
May 12 23:45:05   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-4: peer_inputs:4300 VKS0 closing connection peer type 10 indx 31 err 0
May 12 23:45:05   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 kernel: %USER-3: pfe_send_failed(index 31, type 10), err=32
May 12 23:45:10   /kernel: %KERN-3: ###rdp_usr_detach tcb NULL socket 0xc6a824d4
May 12 23:45:15   fpc7 Cowra: %PFE-3: XLP3 flowd_xlr down complete. 
May 12 23:45:15   (FPC Slot 7, PIC Slot 3) SPC7_PIC3 init: %AUTH-6: flowd_xlr (PID 173) exited with status=0 Normal Exit
May 12 23:45:16   chassisd[1506]: %DAEMON-5-CHASSISD_IFDEV_DETACH_PIC: ifdev_detach_pic(7/3)
May 12 23:45:16   chassisd[1506]: %DAEMON-5-CHASSISD_SNMP_TRAP7: SNMP trap generated: Fru Failed (jnxFruContentsIndex 7, jnxFruL1Index 8, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: SRX5k SPC II @ 7/*/*, jnxFruType 3, jnxFruSlot 7)
May 12 23:45:16   alarmd[975]: %DAEMON-4: Alarm set: PIC color=RED, class=CHASSIS, reason=FPC 7 PIC 3 SPU flowd core dump complete
May 12 23:45:16   chassisd[1506]: %DAEMON-5-CHASSISD_PIC_OFFLINE_NOTICE: Taking PIC 3 in FPC 7 offline: SPU flowd core dump complete
May 12 23:45:16   craftd[976]: %DAEMON-4:  Major alarm set, FPC 7 PIC 3 SPU flowd core dump complete
May 12 23:45:16   chassisd[1506]: %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 7 offline: Reset on SPC/SPU failure
May 12 23:45:16   chassisd[1506]: %DAEMON-5-CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(7)
May 12 23:45:16   chassisd[1506]: %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Reset on SPC/SPU failure
May 12 23:45:16   chassisd[1506]: %DAEMON-5-CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(0)
...

The key events in the log above are:
  1. The SPU kernel detected a user-space address error - %USER-3-ADDRESS_ERR: pid 251 (flowd_xlr), uid 0: pc 0xffffffff802927e0 got a read fault at 0xffffffff802927e0
  2. The SPU kernel started to generate the flowd core after collecting register values - %USER-2: flowd core
  3. Chassisd detached the affected SPC - %DAEMON-5-CHASSISD_IFDEV_DETACH_PIC: ifdev_detach_pic(7/3)
  4. Chassisd took the affected SPC offline because of the SPU flowd core dump - %DAEMON-5-CHASSISD_PIC_OFFLINE_NOTICE: Taking PIC 3 in FPC 7 offline: SPU flowd core dump complete
  5. Chassisd reset all SPCs - %DAEMON-5-CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Reset on SPC/SPU failure
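
The event sequence listed here can be sketched as an ordered pattern match over syslog lines. This is a minimal Python sketch, not a Juniper tool: the step names and the match_sequence function are hypothetical, and the patterns are copied from the sample log in this bulletin, so other releases may word these messages differently.

```python
import re

# Signatures taken from the sample log in this bulletin, in order of appearance.
# The step names are labels invented for this sketch, not Junos terminology.
SIGNATURES = [
    ("address_error", re.compile(r"ADDRESS_ERR: pid \d+ \(flowd_xlr\)")),
    ("flowd_core", re.compile(r"%USER-2: flowd core")),
    ("pic_detach", re.compile(r"CHASSISD_IFDEV_DETACH_PIC")),
    ("pic_offline", re.compile(
        r"CHASSISD_PIC_OFFLINE_NOTICE: .*SPU flowd core dump complete")),
    ("spc_reset", re.compile(
        r"CHASSISD_FRU_OFFLINE_NOTICE: .*Reset on SPC/SPU failure")),
]


def match_sequence(lines):
    """Return the step names found, in order, while scanning the lines once."""
    found = []
    idx = 0  # index of the next signature we expect to see
    for line in lines:
        if idx < len(SIGNATURES) and SIGNATURES[idx][1].search(line):
            found.append(SIGNATURES[idx][0])
            idx += 1
    return found


def looks_like_this_issue(lines):
    """True only if all five steps appear in the expected order."""
    return len(match_sequence(lines)) == len(SIGNATURES)
```

A match only shows that the log resembles the sample above. It does not confirm the root cause, so the core file should still be submitted for analysis.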

This issue can be tracked via PR1005195.


Solution:
This issue is fixed in Junos OS 12.1X44-D40, 12.1X46-D25, 12.1X47-D15, and later releases.
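
As a rough aid, the affected and fixed releases named in this bulletin can be encoded in a small lookup. The following Python sketch is illustrative only: the classify function and its return labels are hypothetical, and it assumes that "higher versions" means a higher D-number within the same release branch. Anything the bulletin does not list, including later 12.1X45 releases, is reported as "unknown" rather than guessed.

```python
import re

# Releases listed under "Product Affected": branch -> affected D-numbers.
AFFECTED = {
    "12.1X44": {10, 15, 20, 25, 30, 35},
    "12.1X45": {10, 15, 20, 25, 30},
    "12.1X46": {10, 15, 20},
    "12.1X47": {10},
}

# First fixed D-number per branch, from "Solution".
# The bulletin names no fixed 12.1X45 release, so none is listed here.
FIRST_FIXED = {"12.1X44": 40, "12.1X46": 25, "12.1X47": 15}

# Matches strings such as "12.1X44-D20" or "12.1X46-D25.7".
VERSION_RE = re.compile(r"^(\d+\.\d+X\d+)-D(\d+)")


def classify(version):
    """Return 'affected', 'fixed', or 'unknown' for a Junos version string."""
    m = VERSION_RE.match(version.strip())
    if not m:
        return "unknown"
    branch, d_number = m.group(1), int(m.group(2))
    if d_number in AFFECTED.get(branch, ()):
        return "affected"
    # Assumption: within a branch, any D-number at or above the first fix is fixed.
    if branch in FIRST_FIXED and d_number >= FIRST_FIXED[branch]:
        return "fixed"
    return "unknown"
```

For example, classify("12.1X44-D20") returns "affected" and classify("12.1X46-D25") returns "fixed". A result of "unknown" means the release is not covered by the lists above and should be checked against PR1005195.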

NOTE: There is no known way to detect this condition before the cache error triggers the flowd core, and no known workaround is available. If the system reports a flowd core, please open a "Technical Service Request" in Case Manager.
