
Understanding system crashes and collecting appropriate data

Nitin Vig - JTAC (ERX)

Target Audience
This training is intended for engineers who are supporting the E-series platform in the field. A basic familiarity with the product and good network troubleshooting skills are expected.

Agenda
• Why does an E-series router crash?
• What happens after a crash?
• What can I do once it crashes?
• What information does JTAC need?

Why does the router crash?


• Crashes can happen on the SRP or the Line Module
• Can be due to a software or hardware problem
• Crashes can be a good thing

Why does the router crash?


• A crash happens when:
- We do not know what to do:
  - Received a packet that the software does not know how to handle.
  - A misbehaving piece of hardware causes undesirable behavior.
- Someone does not listen to us:
  - An application is in a deadlock and does not release resources.
  - A line module does not respond to the SRP.

Types of crashes
• Software panics
- Software not designed to handle a specific condition.

• Processor Exceptions
- The processor on the SRP or LM hits a violation while processing data.

• Detector crashes
- Recovery and detection mechanism implemented to address forwarding fault conditions.

Software panics
• Software panics (example)
time of reset: THU NOV 01 2007 00:57:26 CDT
run state: primary
image type: application
location: slot (6)
build date: 0x46392b64 THU MAY 03 2007 00:23:00 UTC
reset type: panic
task: cliActor
file: osSemaphore.cc
line: 153
arg: 38516744
last errno: 0x3d0001
pc: 0x1c480f74: fatalPanic__Fv +0x8
lr: 0x1c532eec: take__11OsSemaphore +0x1c0
<output truncated>

SRP crash seen when clearing an improperly terminated SSH session using the 'clear line vty' command from another SSH session. The fix involved changes to the SSH application behavior when clearing a VTY session. KB 31413

Software panics
• Software panics (example)
time of reset: Thu Aug 2 01:17:18 2007
run state: unknown (0)
image type: application
image id: 0x464c3086
build date: 0x464c3086 Thu May 17 2007 10:37:58 GMT
location: internal slot (1), processor 0, boardId 0x33, boardRev 0x3
reset type: panic
file: dhcpDemux.cc
line: 221
task: scheduler
last errno: 0x30065
pc: 0x3a1f5c -> fatalPanic(void) offset: 0x8
lr: 0x709c4 -> DhcpDemux::receivePacket(Uid, Uid, OsBuffer &, unsigned char)
<output truncated>

Crash noticed on line modules running the DHCP relay proxy application after an SRP failover. The DHCP application on the LM received a DHCP packet before it was ready. The fix involved discarding DHCP packets until the DHCP application is ready. KB 29531

Processor Exceptions
• Processor Exceptions (example)
time of reset: Tue Mar 19 18:35:45 200
location: slot 9 (a), processor 0, boardId 0x19, boardRev 0x0
image type: application
build date: 0x3c7608b1 (Fri Feb 22 09:00:33 2002)
reset type: processor exception 0x200 (machine check)
task: icc
pc: 0x37f2ccc -> memPartAlignedAlloc offset: 0x108
lr: 0x37f2c9c -> memPartAlignedAlloc offset: 0xd8
dar: 0x00000000
cr: 0x24000080
xer: 0x00000000
fpcsr: 0x0369beec
srr1: 0x0010b030
dsisr: 0x00000000
ctr: 0x00000000
<output truncated>

SRP crash due to an L2 cache memory parity error. The SRP CPU encounters a parity error when reading data from the L2 data cache. No software fix is available for this problem. Historically the crash never recurs on the same SRP. KB 2443

Processor Exceptions
• Processor Exceptions (example)
time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
lr: 0x490518f4 -> IpSubscriberMgr::SubscriberEntry::~SubscriberEntry(void) offset: 0x24
<output truncated>

SRP crash due to a data access violation. The IPSM application wrote data to an incorrect location; the SRP CPU is not permitted to read that location, hence it crashes while trying to read from it. The IPSM application was fixed to ensure it does not write to the invalid location. KB 29909

Detector crashes
• Detector crashes
- Internal mechanism developed by Juniper to detect and recover from forwarding faults.
- The objective is to minimize the forwarding impact on the router.
- The system initially tries to recover from the fault without any external impact. If that is not possible, a crash is performed.
- Also records information about these faults for troubleshooting purposes.
- KB 16800: Enhanced PFTE support on E-series

Detector crashes
• Detector crashes
2 basic mechanisms:
- Run by the SRP
  - Commonly known as PIMTE (Ping/Icc Monitoring Threshold Exceeded) or PFTE (Ping Failure Threshold Exceeded)
  - Frequently polls the line modules to check their health (aka ping)
  - Thresholds are defined for applications interacting between the SRP and LM
  - If thresholds are exceeded, the SRP decides on what action should be taken
  - Additional information is written to a file with extension .tsa
  - TSA file generation does not always mean there was a crash !!!
  - These crashes have a generic crash signature. The crash can occur on the standby SRP or a Line Module
  - Reboot.hty, coredump and TSA file (if present) are required in each case to analyse the root cause

Detector crashes
• Detector crashes: PIMTE (example)
time of reset: Fri Nov 10 00:13:00 2006
run state: unknown (0)
image type: application
image id: 0x4550dd8b
build date: 0x4550dd8b Tue Nov 07 2006 19:24:59 GMT
location: internal slot (4), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 775
task: scheduler
last errno: 0x110001
pc: 0x9e4228 -> fatalPanic(void) offset: 0x8
lr: 0x122088 -> Hw2Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0x1308
<output truncated>

time of reset: Tue Aug 29 11:21:41 2006
run state: standby
image type: application
image id: 0x44ed842e
build date: 0x44ed842e Thu Aug 24 2006 10:49:18 Eastern Standard Time
location: internal slot (9), processor 0, boardId 0x3e, boardRev 0
reset type: unknown software error signature (0xfadead4), msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 752
task: cbusSlave
last errno: 0x380003
pc: 0x4cc98f10 -> fatalPanic(void) offset: 0x8
lr: 0x4ce26e7c -> Hw1SrpSlaveControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &) offset: 0x1548
<output truncated>

Detector crashes
• Detector crashes: PFTE (example)
time of reset: Mon Apr 24 12:44:33 2006
run state: unknown (0)
image type: boot
image id: 0x4390bdc4
build date: 0x4390bdc4 Fri Dec 02 2005 21:33:56 GMT
location: internal slot (5), processor 0, boardId 0x33, boardRev 0x3
reset type: panic, msg "ping failure threshold exceeded"
file: ontrolNetwork.cc
line: 1182
task: scheduler
last errno: 0
pc: 0x19235d4 -> fatalPanic(void) offset: 0x8
lr: 0x1999d90 -> Hw1Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0xbf8
<output truncated>

Detector crashes
• Detector crashes
2 basic mechanisms:
- Run by the Line module
  - Commonly known as ic1Detector crashes
  - Various components on the line module inform the IC (line module CPU) about any forwarding faults
  - Based on the severity of the problem, the line module decides what action should be taken
  - The IC (line module CPU) initially attempts to recover the particular component. If it can not be recovered or if the problem recurs, a crash is taken
  - The ic1Detector crashes are seen only on Line modules. They have a generic crash signature
  - Reboot.hty and coredump are required in each case to analyse the root cause

Detector crashes
• Detector crashes: ic1Detector (example)
time of reset: Mon Aug 28 14:55:44 2006
run state: unknown (0)
image type: application
image id: 0x44a294a3
build date: 0x44a294a3 Wed Jun 28 2006 14:39:31 GMT
location: internal slot (5), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 718
task: scheduler
last errno: 0x110001
pc: 0x9b528c -> fatalPanic(void) offset: 0x8
lr: 0x1286544 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa1c
<output truncated>

time of reset: Tue Jan 22 11:32:39 2008
run state: unknown (0)
image type: application
image id: 0x46d571bf
build date: 0x46d571bf Wed Aug 29 2007 13:16:47 GMT
location: internal slot (15), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 738
task: scheduler
last errno: 0x110001
pc: 0xa6b740 -> fatalPanic(void) offset: 0x8
lr: 0x147dc24 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa54
<output truncated>

Agenda
• Why does an E-series router crash?
• What happens after a crash?
• What can I do once it crashes?
• What information does JTAC need?

What happens after a crash?


• After an SRP or Line module crashes, it goes through the boot process
• During this time it writes entries to the reboot.hty file and generates a coredump (if enabled)
• What is the reboot.hty file? What is a coredump?

What is the reboot.hty file?


• Reboot.hty is a file maintained on the SRP's flash which keeps a history of the reboots that happen on the router.
• These include regular reboots performed by the user and unexpected crashes that may happen on the router.
• Each SRP maintains its own copy of the reboot.hty file. This file is NOT synchronized.
• Each SRP keeps a record of its own reboots.
• Additionally, when a line module reboots, its entries are written to the primary SRP's reboot.hty.

What is the reboot.hty file?


• How do I view the contents of the reboot.hty file?
- Primary SRP
  - Use the show reboot-history command
- Standby SRP
  - Make a copy of the Standby SRP's reboot.hty file on the primary SRP's flash:
    copy standby:reboot.hty <filename>.hty
  - Use the show reboot-history <filename>.hty command

• How do I copy the reboot.hty file to an FTP server?
- Primary SRP:
  copy reboot.hty <FTPserver>:<path>/<filename>.hty
- Standby SRP:
  copy standby:reboot.hty <FTPserver>:<path>/<filename>.hty

What is the reboot.hty file?


• What is the format of a crash record in the reboot.hty file?
time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
lr: 0x490518f4 -> IpSubscriberMgr::SubscriberEntry::~SubscriberEntry(void) offset: 0x24
<output truncated>

time of reset: Identifies the time the reset took place.

run state: Relevant only for the SRP. Identifies if the SRP was in primary or standby state when the reset occurred. Set to unknown for line modules.

image type: Identifies the type of image the SRP or line module was running when it reloaded. This can be a boot, diag or application image.

image id: Internal ID used by Juniper to identify the release.

build date: Identifies the date when the release on this router was built.

What is the reboot.hty file?


• What is the format of a crash record in the reboot.hty file?
location: Identifies the slot in which the SRP or line module was present when it reloaded. Note that the physical slot in which the SRP or line module resides differs from the internal slot reported in the reboot.hty file.

Mapping of internal slot to physical slot on the ERX:

Internal slot:   0   1   2   3   4   5   7   9  10  11  12  13  14  15
Physical slot:   0   1   2   3   4   5   6   7   8   9  10  11  12  13

E320: KB 26791: E320 Internal Slot Numbering

The ERX table above is also expressed as a small lookup below.
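
The ERX mapping lends itself to a small lookup when reading reboot.hty records programmatically. The following is a minimal sketch in Python, based only on the table above; the function name and the handling of unmapped slots are my own assumptions, and the E320 scheme (KB 26791) is intentionally not included.

# Internal-to-physical slot mapping for the ERX, taken from the table above.
# (The E320 uses a different numbering scheme; see KB 26791.)
ERX_INTERNAL_TO_PHYSICAL = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 7: 6,
    9: 7, 10: 8, 11: 9, 12: 10, 13: 11, 14: 12, 15: 13,
}

def erx_physical_slot(internal_slot):
    """Return the physical slot for an ERX internal slot, or None if unmapped."""
    return ERX_INTERNAL_TO_PHYSICAL.get(internal_slot)

# Example: "location: internal slot (7)" in a crash record -> physical slot 6.
print(erx_physical_slot(7))   # prints 6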

What is the reboot.hty file?


• What is the format of a crash record in the reboot.hty file?
reset type: Identifies the type of reset. There can be quite a few different reset types, such as processor exception, panic or user reboot.

task: Identifies the task that was running on the SRP or line module when it was reset. On the line module this will always be set to scheduler, as that is the only task that runs on the line module.

A sketch showing how these fields can be extracted from a crash record follows.
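
Because the fields above always appear as "name: value" pairs in the show reboot-history output, a crash record can be turned into a dictionary for quick triage or knowledge-base searching. The sketch below is a rough illustration only, not a Juniper tool: it assumes the record has already been captured as plain text with one field per line, and the field list and function name are my own choices.

import re

# Fields described above; extend as needed (e.g. "arg", "last errno").
FIELDS = ("time of reset", "run state", "image type", "image id", "build date",
          "location", "reset type", "task", "file", "line", "pc", "lr")

def parse_crash_record(text):
    """Parse a reboot.hty crash record (as plain text) into a dict of fields."""
    record = {}
    for raw_line in text.splitlines():
        match = re.match(r"\s*([a-z ]+):\s*(.+)", raw_line)
        if match and match.group(1).strip() in FIELDS:
            record[match.group(1).strip()] = match.group(2).strip()
    return record

sample = """time of reset: Thu Aug 23 20:48:09 2007
run state: primary
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana"""

print(parse_crash_record(sample)["reset type"])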

What is the core dump file?


• A core dump is a snapshot of the memory at the time the crash occurred.
• It is an important tool for JTAC and engineering teams to identify the root cause of a crash.
• The size of the core dump varies based on the amount of memory on the SRP or Line module.
• The core dump is generated during the boot process after the crash.

What is the core dump file?


• How do I check if a core dump was generated?
- Coredumps are stored on the SRP's flash
- Use the dir output to check

• Can core dumps be disabled?
- Yes.
- Disabling line card coredumps:
  E120(config)#exception dump srp-only
- Disabling SRP coredumps:
  E120(config)#exception dump except-srp

• How do I check if core dumps are enabled?
- Use the show exception dump command

• How do I copy a core dump to an FTP server?
- copy <filename>.dmp <FTPserver>:<path>/<filename>.dmp

What is the core dump file?


• Why do customers sometimes disable core dumps?
- A coredump takes a few minutes to complete
- This adds to the boot time of the SRP/line module

• Should I be disabling them?
- It is not recommended to disable core dumps
- In certain cases, when repetitive crashes are seen on a router, this may be considered in consultation with JTAC

• Why does a crash sometimes not generate a core dump?
- Some of the possible causes:
  - Not enough space on flash to store the core dump
  - Core dumps were disabled
  - Power glitches
  - Conditions where the SRP/Line module could not take a dump of the memory

Agenda
• Why does an E-series router crash?
• What happens after a crash?
• What can I do once it crashes?
• What information does JTAC need?

What can I do once it crashes?


• Ensure services are restored
- A brief checklist:
  - Are the SRP and line modules in online state?
  - Did all the routing protocols converge?
  - Have the subscribers started reconnecting?
  - Are the traffic levels restored?
  - Is the CPU utilization normal after some time?

What can I do once it crashes?


• Assess the impact of the crash
- A brief checklist:
  - What crashed?
  - Is the SRP/Line module stable after the crash?
  - Did any customer applications suffer an impact?
  - How many subscribers were impacted? For how long?
- For an SRP crash:
  - Was it the primary or standby SRP?
  - Was high availability enabled? Did the standby SRP take over?
- For a Line module crash:
  - Was the line module in a redundancy group? If yes, did the redundant line module take over?
  - Was it subscriber-facing or core-facing?

What can I do once it crashes?


• Research the cause of the crash
- Ask yourself (or your customer):
  - Any recent changes to the configuration?
  - Any recent changes to the load on the router?
  - Any recent changes in the network?
- Search the knowledge base:
  - All defects found at a customer site have a knowledge base article associated with them
  - Use the knowledge base effectively
  - If you find a match, always double-check with JTAC
- Contact JTAC:
  - It is recommended to contact JTAC whenever there is a crash on the router

Tips on searching the KB


• Some tips on searching the knowledge base:
- The crash record in the reboot.hty is a good pointer to start
- Look for information in the stack trace which seems unique:
  - Filenames, line numbers, reset type
- Search for these in the knowledge base
- Remember:
  - Some crashes have very generic crash records (eg: Detector crashes)
  - In such cases, a match in the KB does NOT necessarily mean you are hitting the same problem
  - Some crash records may match closely but not exactly
    - In some cases this may be the same problem showing up in a different form
    - In some cases, it may be an entirely different issue
  - Read the problem description and solution fields carefully
    - Sometimes these are good pointers to confirm if you are hitting the same issue
  - When in doubt, consult JTAC

http://www.juniper.net/kb

Agenda
• Why does an E-series router crash?
• What happens after a crash?
• What can I do once it crashes?
• What information does JTAC need?

What information does JTAC need?


• Files to be collected
- Collect the following files from the router:
  - reboot.hty (it is usually a good idea to collect the file from both primary and standby SRPs)
  - Core dumps (if any)
  - Files with extension .tsa (if any)
  - system.log file
  - Copy of the router configuration in CNF and SCR format

What information does JTAC need?


• Outputs
- Collect the following outputs:
  sh version
  sh hardware
  sh env all
  dir
  sh log data nv-file
  sh log data severity debug
  sh redundancy
- Depending upon the problem, there may be other outputs that JTAC may require. A rough way to script this collection is sketched below.
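
When the same set of outputs has to be gathered repeatedly, scripting the collection can save time. The sketch below is one possible approach, not a JTAC tool: it uses the Python paramiko library to open an interactive SSH session, and the hostname, credentials, "terminal length 0" paging command, prompt handling and timing are all assumptions that would need adjusting for a real router (a console or Telnet capture works just as well).

import time
import paramiko

# Commands listed above; "terminal length 0" is assumed to disable paging.
COMMANDS = ["terminal length 0", "sh version", "sh hardware", "sh env all", "dir",
            "sh log data nv-file", "sh log data severity debug", "sh redundancy"]

def collect_outputs(host, username, password, outfile="crash-outputs.txt"):
    """Run the show commands over SSH and append all output to one file."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password, look_for_keys=False)
    shell = client.invoke_shell()
    with open(outfile, "w") as f:
        for cmd in COMMANDS:
            shell.send(cmd + "\n")
            time.sleep(5)                     # crude wait; adjust for long outputs
            while shell.recv_ready():
                f.write(shell.recv(65535).decode("ascii", errors="replace"))
    client.close()

# Example (hypothetical host and credentials):
# collect_outputs("192.0.2.1", "jtac", "secret")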

What information does JTAC need?


• Other information
- The following information is very useful to JTAC when troubleshooting crash cases:
  - Services deployed on the router
  - Logical diagram of the network
  - Information about devices connected to the router
  - Number of subscribers connected to the router
  - The amount of traffic (in Mbps) on the router
  - Information about any changes to the router configuration or deployment scenario
  - Information about any changes in the external network

What information does JTAC need?


• When all else fails
- Some crashes are elusive:
  - Crashes are seen on customer routers, however the core dumps do not provide enough information
  - And the crashes are not reproducible in JTAC labs worldwide
- In such cases there may be a need to collect additional data from the customer's router
- Some of the techniques used in the past:
  - Installing a debug image on the customer router
  - Enabling memory debugging
  - Enabling assertions
- This is required in special cases only and JTAC will provide all necessary information in such cases.

Summary
• Crashes can be a good thing
• Assess the severity of the crash and its impact on the network
• Work closely with JTAC to analyze the root cause
• A good understanding of the E-series behavior helps build customer confidence

Questions?

Copyright 2006 Juniper Networks, Inc.

Proprietary and Confidential

www.juniper.net

