Basic System Crashes and Hangs v2

BASIC System crashes and hangs for Sun Partners.
Armand Heijster Sun Microsystems EMEA Madrid
Index:
0. Intro
1. What is a System Crash (panic!)?
2. My System has Hung!
   2.1 What is a system hang?
   2.2 What causes a hang?
   2.3 How do I know a system is hung?
   2.4 What to do?
   2.5 Break but no ok prompt?
   2.6 Analysing why a hung system could not be panicked
3. Unexpected reboot: panic / AFT1
   3.1 Panic but no core files?
   3.2 How to Troubleshoot?
   3.3 Findaft
4. Examples of panics
5. Fatal Reset
   5.1 How to identify?
   5.2 How to troubleshoot?
   5.3 Fatal Reset but no component disabled?
6. Miscellaneous
   6.1 Power outage
   6.2 System dropped to the ok prompt
   6.3 System powered off
7. Debugging Options
8. First contact
9. The AFT (Asynchronous Fault Trap) message format.
10. ACT Output (Example)
11. FindAFT Output (Example)
12. Resources
0. Intro
Unix systems crash. It's a fact of life. Analysis of system crash dumps usually requires the skills and resources of a Unix guru, including a wide set of programming skills, an in-depth knowledge of Unix internals, and access to source code. However, for most system crashes or hangs you don't need to know all this!
Resources used:
- SunSolve
- Panic! by Chris Drake and Kimberley Brown
- Documents from Frederic Capel, Sun France
- Documents from Nico Waltrott, Sun Germany
1. What is a System Crash (panic!)
In general you should consider a panic as a self-protecting event: the system stops itself to prevent data corruption. To provide this mechanism the system runs several safeguards which check the integrity of data and system resources.

panic() can only be called by the operating system while in kernel mode. No user, not even root, can actually write an application that calls panic(). But any program that has a bug might trigger a panic(). A system will also crash if it detects a hardware problem that should not happen. This type of crash is called a bad trap. System administrators see both as the same thing. A Unix system performs millions of traps every day, so don't panic at the word trap, as long as it's not a bad trap.

panic() copies the whole content of memory to the dump device. Looking into this "picture" after the server has been rebooted, sorting out the bad trap and relating it to, for example, a bug or a hardware defect is called CDA (Core Dump Analysis).

By default the dump device is usually the swap area on the hard disk, but it can also be located on a dedicated device. On Solaris 7 and later the dump device is checked and configured with the dumpadm command. On older releases savecore is configured in /etc/rc2.d/S20sysetup and is disabled by default. Removing the hash characters in this script enables the savecore mechanism. If the savecore directory needs to be changed, it can be edited in that file as well.

The crash dump is normally found in /var/crash/<host-name> as vmcore.x and unix.x. The file bounds only contains the count of previously generated core dumps.

The dump is written to the end of the dump device, because the beginning contains special information such as header records and a code called the magic number. The magic number identifies the current contents of the dump device. The header records identify whether the dump image has been overwritten through swap activity.
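As a rough sketch of the check described above, the following shell function summarises what savecore has left in a crash directory. The function name list_crash_dumps is hypothetical, and the code assumes a POSIX shell (on older Solaris releases that would be ksh or /usr/xpg4/bin/sh rather than /bin/sh).

```shell
#!/bin/sh
# Hypothetical helper, not a Sun tool: report which dumps in a savecore
# directory have both halves (vmcore.N and unix.N) needed for analysis.
list_crash_dumps() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "no crash directory: $dir"
        return 1
    fi
    # bounds holds the dump counter maintained by savecore.
    if [ -f "$dir/bounds" ]; then
        echo "bounds: $(cat "$dir/bounds")"
    fi
    found=0
    for core in "$dir"/vmcore.*; do
        [ -f "$core" ] || continue      # glob did not match anything
        n=${core##*.}                   # numeric suffix of vmcore.N
        if [ -f "$dir/unix.$n" ]; then
            echo "dump $n: complete (vmcore.$n + unix.$n)"
        else
            echo "dump $n: unix.$n missing"
        fi
        found=$((found + 1))
    done
    [ "$found" -gt 0 ] || echo "no vmcore files found"
}
```

It would typically be called as list_crash_dumps /var/crash/<host-name>. A dump reported as incomplete is not suitable for CDA, because the symbol table lives in unix.N.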
2. My System has Hung?

System hangs are a big frustration for system administrators. They don't know whether the system is alive, dead or incredibly slow, and after a while they conclude that the only solution is a power cycle because the system is hanging.
2.1 What is a system hang?

The system is no longer completely usable. Unlike a panic, where the system becomes completely unusable and reboots, a system hang slowly eats up system resources until the system is finally useless.
2.2 What causes a hang?

- Deadlocks: one process is waiting for something that is locked by another process.
- Two routines trying to manipulate the same data structure.
- System resources which are no longer available, so the system is waiting for more resources.
- Sometimes a hardware problem: a hung bus is an example, where there is no or bad communication with a drive and the system gets into a loop.
2.3 How do I know a system is hung?

Depending on the hang, some users are no longer able to log in, or only the root user can log in on the system console. If a system is hung you will not see any panic messages on the console that give you information about the cause. The problem could be as simple as a powered-down disk or a disconnected Ethernet cable.

Part of the difficulty in working with system hangs is knowing whether or not the system is actually hung. Sometimes it appears that the system is hung when in fact only a particular application is hung. For example, users of a database may report that "the system is hung" when in fact the system itself is still running but the database application has hung. The most common cause of a system hang is that the system has run out of resources.
2.4 What to do?

Using SunSolve is a good start, and I always point new engineers to the database. There are already very good documents in SunSolve which help you on your way:

Document ID:13258 Title:KERNEL: How to enable deadman kernel code
Document ID:13039 Title:Troubleshooting System Hangs
Document ID:80737 Title:Crash Dump capturing for Redhat Linux
Document ID:15553 Title:Solaris[TM] Operating System: Forcing a kernel core dump on an x86 or x64 system.

If nothing outstanding has been found, we must force this silent event to be verbose by making the system panic! The kernel core files generated are then sent to PTS for analysis, which will determine whether it is a software (perf/kernel locks ...) or a hardware issue.
2.5 Break but no ok prompt?

For servers only, first check that the key is not in the lock position, which prevents the break signal from being trapped by the kernel. Otherwise power the server off and on again. That is the only solution, and unfortunately we won't be able to analyse the hang, at least not this time!
2.6 Analysing why a hung system could not be panicked

There are two reasons which explain why such break signals are without any effect:

Break disabled by Solaris

Document ID:16106 Title:How to disable BREAK and L1-A using kbd and /etc/default/kbd

This feature appeared in Solaris 2.6, for instance to enable people to move their console.

Hard hang

Some hangs, usually hardware only, are so severe that signals sent to the kernel are without effect. One solution is to enable the deadman kernel code:

Document ID:13258 Title:KERNEL: How to enable deadman kernel code

If that is without effect as well, first recommend that the customer update the system (patch cluster) to limit the probability of a software cause, then start replacing one part after the other (System Board, CPU, Memory ...). For servers mainly, this is usually done from the Sun office, because hard hang analysis can last for weeks.
3. Unexpected reboot: panic / AFT1

3.1 How to identify?
/var/adm/messages

One tip: focus on the Copyright line, which is one of the first messages logged when the system boots, and look above it to see the unexpected reboot.

Feb 2 21:39:13 ultra250-5 last message repeated 1 time
Feb 2 23:18:20 ultra250-5 unix: [ID 836849 kern.notice]
Feb 2 23:18:20 ultra250-5 panic [cpu2]/thread=2a1000d7d20:
Feb 2 23:18:20 ultra250-5 unix: [ID 969304 kern.notice] dqput: dqp->dq_cnt == 0
Feb 2 23:18:20 ultra250-5 unix: [ID 100000 kern.notice]
Feb 2 23:18:20 ultra250-5 genunix: [ID 723222 kern.notice] 000002a1000d7570 ufs:real_panic_v+70 (0, 1047a7c0, 2a1000d7810, 0, 8, 8)
Feb 2 23:18:20 ultra250-5 genunix: [ID 179002 kern.notice] %l0-3: 0000000010338694 000000001041f8b0 0000000000000000 000003000ca70290
Feb 2 23:18:20 ultra250-5 %l4-7: 0000030000e6bf28 00000000101746e0 000002a1000d75b8 0000000000000001
Feb 2 23:18:20 ultra250-5 genunix: [ID 723222 kern.notice] 000002a1000d7620 ufs:ufs_fault_v+c8 (300137d2b70, 0, 2a1000d7810, 3000b243330, 5b, 1)
Feb 2 23:18:20 ultra250-5 genunix: [ID 179002 kern.notice] %l0-3: 00000300137d2ac0 000000001047a7c0 000003000ca70290 0000000000000000
Feb 2 23:18:20 ultra250-5 %l4-7: 0000000000000000 0000030000e6bf28 000003000027af70 0000000003b6c790
Feb 2 23:18:20 ultra250-5 genunix: [ID 723222 kern.notice] 000002a1000d76d0 ufs:ufs_fault+1c (3000b243330, 1047a7c0, 20, 10000, 30000e6bf28, 2a1000d7d1c)
Feb 2 23:18:20 ultra250-5 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000000000000000 000003000ca70290 000000001000a408
[...]
Feb 2 23:18:22 ultra250-5 genunix: [ID 672855 kern.notice] syncing file systems...
Feb 2 23:18:23 ultra250-5 genunix: [ID 733762 kern.notice] 58
Feb 2 23:18:24 ultra250-5 genunix: [ID 733762 kern.notice] 36
Feb 2 23:18:25 ultra250-5 genunix: [ID 733762 kern.notice] 32
Feb 2 23:18:49 ultra250-5 last message repeated 20 times
Feb 2 23:18:50 ultra250-5 genunix: [ID 622722 kern.notice] done (not all i/o completed)
Feb 2 23:18:51 ultra250-5 genunix [ID 353387 kern.notice] dumping to /dev/dsk/c0t0d0s1, offset 644153344
Feb 2 23:19:33 ultra250-5 genunix: [ID 409368 kern.notice] ^M100% done: 64623 pages dumped, compression ratio 2.63,
Feb 2 23:19:33 ultra250-5 genunix: [ID 851671 kern.notice] dump succeeded
Feb 2 23:22:26 ultra250-5 genunix: [ID 540533 kern.notice] SunOS Release 5.8 Version Generic_117000-03 64-bit
Feb 2 23:22:26 ultra250-5 genunix: [ID 913632 kern.notice] Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.

The messages above the Copyright line could also have been:

Jul 15 07:11:58 ultra250-5 last message repeated 1 time
Jul 15 07:11:58 ultra250-5 SUNW,UltraSPARC-III+: [ID 809671 kern.info] NOTICE: [AFT0] WDC Event detected by CPU20 at TL>0, errID 0x000027c5.258e90b0
Jul 15 07:11:58 ultra250-5 AFSR 0x00300440<ME,PRIV,UCC,WDC>.000001d4 AFAR 0x00000000.030e64a0 INVALID
Jul 15 07:11:58 ultra250-5 Fault_PC 0x10154800 Esynd 0x01d4 AMBIGUOUS /N0/SB5/P0/E0 J4400
Jul 15 07:11:58 ultra250-5 SUNW,UltraSPARC-III+: [ID 485521 kern.info] [AFT0] errID 0x000027c5.258e90b0 Data Bit 75 was in error and corrected
Jul 15 07:11:59 ultra250-5 SUNW,UltraSPARC-III+: [ID 489146 kern.notice] NOTICE: [ AFT1 ] CPU20 offlined due to UCC Event with ME set
[...]
Jul 15 07:13:00 ultra250-5 genunix: [ID 913632 kern.notice] Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.

Look below the Copyright line to see the kernel core file generation:

Apr 17 13:49:07 sun28 genunix: [ID 913632 kern.notice] Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Apr 17 13:49:07 sun28 unix: Ethernet address = 8:0:20:c4:be:66
Apr 17 13:49:07 sun28 unix: mem = 1048576K (0x40000000)
Apr 17 13:49:07 sun28 unix: avail mem = 1029734400
[...]
Apr 17 13:49:17 sun28 unix: PCI-device: network@1,1, hme0
Apr 17 13:49:17 sun28 unix: hme0 is /pci@1f,4000/network@1,1
Apr 17 13:49:17 sun28 unix: dump on /dev/dsk/c0t0d0s3 size 1028 MB
Apr 17 13:49:20 sun28 unix: SUNW,hme0: Using Internal Transceiver
Apr 17 13:49:20 sun28 unix: SUNW,hme0: 10 Mbps half-duplex Link Up
Apr 17 13:50:11 sun28 savecore reboot after panic: CPU1 Ecache SRAM Data Parity Error: AFSR 0x00000000.00400100 AFAR 0x00000000.000ffff0
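The tip of looking just above the Copyright line can be automated. The sketch below prints the last few lines logged before every boot banner in a messages file. The function name pre_boot_context is hypothetical, the line count must be at least 1, and the code assumes a POSIX awk (nawk on Solaris) and a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper: show the N lines (default 5, minimum 1) that were
# logged just before each boot banner, i.e. the tail of the previous session.
pre_boot_context() {
    file=$1
    n=${2:-5}
    awk -v n="$n" '
        /Copyright/ && /Sun Microsystems/ {
            print "--- boot at line " NR " ---"
            start = NR - n
            if (start < 1) start = 1
            # buf is a ring buffer holding the previous n lines
            for (i = start; i < NR; i++) print buf[i % n]
        }
        { buf[NR % n] = $0 }
    ' "$file"
}
```

For example, pre_boot_context /var/adm/messages 20 would show whether a panic string, AFT1 events or nothing at all preceded each reboot.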
3.1 Panic but no core files?

A panic is always associated with core files, so if you don't have any core files there is a problem in the savecore procedure which you must troubleshoot. Known issues, which generally show up in the savecore command output, are:

- savecore not enabled (2.6 and below in general)
- dump device or save directory too small
- dump device incorrectly set with DiskSuite
- dump device too big (32-bit architecture)
3.2 How to Troubleshoot ?
Do I always need the kernel core files to troubleshoot a panic? No. 80% of panics are easily identifiable just by their panic string in the messages file or on the console. Search the Infodocs/SRDBs (plus FINs and Sun Alerts) first! Escalate the rest, which are mostly the BAD TRAP 0x10, 0x31 and 0x34 panics that need core dump analysis.

/var/adm/messages files:

    # grep -i panic messages messages.0    <-- searches for a panic string within the messages files
    # grep -i fatal messages messages.0    <-- searches for 'fatal' resets
    # grep -i AFT1 messages messages.0     <-- searches for Ecache messages
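The three greps above can be wrapped into one pass that prints a count per keyword, which makes it easy to see at a glance which kind of event dominates. This is only a convenience sketch; the function name scan_messages is hypothetical and a POSIX shell is assumed.

```shell
#!/bin/sh
# Hypothetical helper: count lines matching each keyword across all
# messages files given as arguments (case-insensitive, like grep -i).
scan_messages() {
    for pattern in panic fatal AFT1; do
        # grep -c exits non-zero when the count is 0, hence the "|| true"
        count=$(cat "$@" 2>/dev/null | grep -ci "$pattern" || true)
        echo "$pattern: $count"
    done
}
```

Called as scan_messages messages messages.0, it prints one line per keyword. The counts are matching lines, not distinct events, so a single panic usually contributes more than one line.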
3.3 Findaft
    # findaft <path to message file(s)>    <-- automatically scans the message files for you and reports its findings - VERY USEFUL!!!

Document ID: 80270 Title:Findaft - an AFT, CPU, Memory and PCI ECC error message summary script.

The Findaft script aims to provide a concise summary of AFT, CPU and PCI ECC errors found in a Solaris[TM] Operating System (OS) /var/adm/messages file. This summary can then be used to assist in diagnosing a customer's hardware fault.

Note: Findaft is Sun Internal only and cannot be sent to customers.

Findaft is a standalone perl script. The latest version is runnable from here: /net/cores.uk/export/hotline/hotlocal/bin/findaft or you can download the tool and install it locally, see document.

Solaris 10 FMA removes the requirement for log scrapers like findaft.

Please do not replace multiple FRUs based on the output of findaft alone. Ask an experienced engineer to confirm your diagnosis, send a question to one of the <insert platform>-support@sun.com aliases and CC findaft-interest@sun.com, or escalate if required.
4. Examples of panics
Hardware panics

Jul 10 01:43:34 tronsd81 unix: WARNING: [AFT1] WP event on CPU11, errID 0x00092f64.77726c8d
Jul 10 01:43:34 tronsd81 unix: AFSR 0x00000000.00800100<WP> AFAR 0x00000179.fe75f940
Jul 10 01:43:34 tronsd81 unix: AFSR.PSYND 0x0100(Score 95) AFSR.ETS 0x00 Fault_PC 0x100171b0
Jul 10 01:43:34 tronsd81 unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Jul 10 01:43:58 tronsd81 unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x00092f6a.34dfd7e1
Jul 10 01:43:58 tronsd81 unix: AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000002.95b74000
Jul 10 01:43:58 tronsd81 unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10021058
Jul 10 01:43:58 tronsd81 unix: UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
Jul 10 01:43:58 tronsd81 unix: UDBH Syndrome 0x3 Memory Module Board 4 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801

CPU and memory are the components which cause most of the hardware panics.
=> The first is often betrayed by its External Cache (see FIN I0616); the second is often a consequence of earlier corrected memory errors marked by AFT0/AFT2 (see the cediag tool!).

Freeing free X

panic[cpu0]/thread=e2174a60: 0xe07ddbd0: free: freeing free block, dev:0x740487, block:15408, ino:722790, fs:/export/nightly

Indicates a panic due to corruption in the file system named in the message.
=> Look for hardware messages under /var/adm or at the console before the crash. If nothing is found, consider a possible software issue in the drivers used to write data to this file system.

Document ID: 18890 Analysing free inode, frag, block panics

vfs_mountroot()

panic[cpu32]/thread=0x10404000: vfs_mountroot: cannot mount root

Due to errors in the boot-device configuration (/etc/system, /etc/vfstab ...). Occurs frequently when the system is improperly set up or modified under DiskSuite/Veritas Volume Manager.
=> boot cdrom -s and check the configuration files.

Document ID: 83484 OS 8 Panics with vfs_mountroot:cannot mount root

assertion failed

panic[cpu8]/thread=0x3186be80: assertion failed: cache->sc_nextfree >= 1 && cache->sc_nextfree < SN_NFBAPFLB, file: cache.c, line: 228

Points to a programming error in the source file *.c at line n.
=> Look for known bugs in SunSolve; this is usually a third-party issue.

panic zero

Feb 18 09:25:23 tera13s unix: panic[cpu0]/thread=0x30023ec0: zero

This panic is triggered by a sync command typed at the OBP (see System hangs), or by a customer who pressed STOP-A.

Document ID: 11837 How to force a crash when my machine is hung?
5. Fatal Reset
A Fatal Reset is a sudden hardware failure (a processor not answering kernel requests, for instance) which makes the system reset.

Document ID: 51105 Title: Sun4U Fatal Reset FAQ
5.1 How to identify?
/var/adm/messages

Jun 23 22:54:03 thyme unix: System booting after fatal error FATAL
5.2 How to troubleshoot?

During the OBP reset, hardware components are tested (OBdiag) and the failure can be identified. If the bad component is vital and unique, the system will not reboot and the console trace points to the problem. If there is redundancy, the component is bypassed (Automatic System Recovery mechanism). This shows up in the prtdiag -v output as a component missing from the list, usually together with an error.

Example on an Enterprise Server E3500:

Analysis for Board 2
--------------------
AC: P_FERR error P_REPLY received from UPA Port
    The error could be caused by: CPU Address Controller
AC: Illegal P_REPLY received from UPA Port
    The error could be caused by: CPU Address Controller
5.3 Fatal Reset but no component disabled?

This can be due to insufficient tests at the OBP or to a tricky failure. First check whether some trace can be found at the console (see below). If not, check the OBP level and patch if necessary. Then run diagnostics at level=max:

    ok setenv auto-boot? false
    ok setenv diag-level max
    ok setenv diag-switch? true    (not necessary on servers: put the key in diag mode)
    ok reset

If OBdiag cannot identify the bad component, attach a console to the serial port and wait for the next crash with diag-level still set to max (SunVTS to reproduce?). Get the trace from the console and put it in the FATAL Reset Decoder at: http://pts-platform.uk/twiki/bin/view/Tools/ToolPageFatalResetDecoder
6. Miscellaneous
6.1 Power outage

A power outage does not produce any trace in the messages files. However, when a system is powered off (key, init 5, [intermittent] power outage, etc.) the OBP variable power-cycles (if set, or set to a small number) is incremented by one. It is therefore necessary to compare the eeprom command output taken around two unexpected reboots. The explorer also contains this output:

    # grep power sysconfig/eeprom.out
    diag-trigger=power-reset
    #power-cycles=45

First investigate possible on-site problems with the customer (general outage, plug issue), and if there are none, replace the AC input assembly on the machine.
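Comparing the power-cycles value between two saved eeprom outputs (for example from two explorers) can be scripted. The sketch below is a minimal illustration; the function names power_cycles and compare_power_cycles are hypothetical, and it assumes the variable is printed as power-cycles=N with an optional leading '#', as in the sample above.

```shell
#!/bin/sh
# Hypothetical helpers: extract and compare the OBP power-cycles counter
# from two saved copies of eeprom output.
power_cycles() {
    # Accept both "power-cycles=45" and "#power-cycles=45"
    sed -n 's/^#\{0,1\}power-cycles=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

compare_power_cycles() {
    before=$(power_cycles "$1")
    after=$(power_cycles "$2")
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo "power-cycles not found in both files"
        return 1
    fi
    echo "power-cycles: $before -> $after (delta $((after - before)))"
}
```

A delta that is larger than the number of power-offs the customer knows about is a hint, not proof, that the machine lost power unexpectedly.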
6.2 System dropped to the ok prompt

Document ID: 20767 Title:Why a system would drop to the ok prompt

Usually this is due to a break sent to the console (see hang), and you will find "type go to resume" on the console; just type go to resume Solaris ;-) But it can also be a watchdog reset, see the infodoc above.
6.3 System powered off

This is due either to overtemperature (see the messages files) or to the auto-shutdown script not being disabled (/etc/rc2.d/S85power).
7. Debugging Options
Tools for core dump analysis:

    # strings
    # adb
    # kdb
    # act
    # iscda
    # fm
    # scat

On older revisions of SunOS you will also be able to use commands like:

    # crash
    # dbx or dbxtool

Using adb, kdb, fm or scat is quite heavy going and needs at least a few more training sessions. Therefore I just want to show you some easy and sometimes helpful output of the strings command. You can redirect the output of strings into a file, e.g.:

    # strings vmcore.1 > vmcore1.strings.out

Be careful with the strings output: depending on the size of vmcore.x it can easily run to over 1000 pages when printed. A few examples of using the strings command:

    strings vmcore.1 | less
    strings vmcore.1 | grep SunOS
    strings vmcore.1 | grep <*****>

strings only works on the core file vmcore.x. The unix.x file is a binary file, known as the object file, which contains the symbol table and executable code. The CDA will not succeed without unix.x, because adb needs the symbol table in this file to resolve the entries in vmcore.x.
ACT:
is a tool developed over several years to aid in the process of analysing kernel dumps. It attempts to perform a good first pass on a kernel dump. ACT prints detailed and accurate information about:

- Where the kernel panicked
- A complete list of threads on the system
- The contents of the /etc/system file which was read when the failed system booted
- A list of kernel modules that were loaded at the time of the panic
- The output of the kernel message buffer
- Full deadlock detection relating to threads blocked on mutexes or readers/writer locks
- Threads blocked in either getblk() or biowait()
For a more detailed description of ACT: http://pts-platform/twiki/bin/view/Tools/ToolPageAct
Adb:
    $ adb -k unix.0 vmcore.0
    $<msgbuf        <-- outputs the message buffer
    *panicstr/s     <-- outputs the panic string
    $c              <-- outputs a stack trace
Iscda:
is a tool that was distributed with the SunSolve CDs. It uses adb to collect some basics from a crash dump. Therefore this script cannot be used from Solaris 9 onwards, because only mdb is available there. iscda is available from:

Document ID: 10214 Title:Initial System Crash Dump Analysis for Solaris[TM] 2.X Operating Environment
8. First contact
At the first contact with a customer it is quite helpful to ask about the priority he wants to set, and to clarify the following points:

- Is the system still down, or was the reboot successful?
- Did the panic happen once, or has he seen more panics on this server recently?
- What is the message format in /var/adm/messages and/or in the serial/console output?
- Has a core dump been written? (/var/crash/<hostname>)
- Any changes in hardware? E.g. new PCI devices, hardware maintenance in a previous case, any upgrades, relocation, possible environment issues.
- Any changes in software? E.g. patch updates, device driver updates, customized applications, any 3rd party software.
- The file transfer (*** if a CDA is required ***): ask the customer to transfer the core dumps as a tar.gz <including caseID>.

One way to upload is the web interface: http://supportfiles.sun.com/upload
Customers can choose a directory in the pop-up menu. Engineers can download the files from: https://supportfiles.sun.com/engineer

When a customer wants to know the cause of a panic on his system but there are no messages about a panic and no core dump, then the system more likely suffered a FATAL Reset. In this case I would start with general hardware troubleshooting, as a Fatal Reset is always hardware related.
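For the file transfer step, the customer needs a single tar.gz that carries the case ID in its name. The sketch below shows one way to build it. The function name pack_crash_dump, the file naming scheme and the case ID used in the usage note are illustrative assumptions, not a Sun convention, and a POSIX shell with tar and gzip is assumed.

```shell
#!/bin/sh
# Hypothetical helper: bundle vmcore.N and unix.N from a crash directory
# into <caseID>_crashdump.N.tar.gz in the current directory.
pack_crash_dump() {
    dir=$1
    n=$2
    case_id=$3
    out="${case_id}_crashdump.$n.tar.gz"
    if [ ! -f "$dir/vmcore.$n" ] || [ ! -f "$dir/unix.$n" ]; then
        echo "vmcore.$n or unix.$n missing in $dir" >&2
        return 1
    fi
    # Archive with relative names so the files unpack into any directory
    (cd "$dir" && tar cf - "vmcore.$n" "unix.$n") | gzip > "$out" || return 1
    echo "$out"
}
```

For a made-up case ID 12345678, pack_crash_dump /var/crash/<hostname> 0 12345678 would write 12345678_crashdump.0.tar.gz. Both files are packed together because the analysis needs unix.N as well as vmcore.N.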
9. The AFT (Asynchronous Fault Trap) message format.

Solaris decodes the AFSR and provides a human-readable description of the error, together with the CPU which detected the event. The errID is the time in nanoseconds since the system booted. Solaris determines the bank of memory from the AFAR. The individual DIMM is determined using the Esynd, which for correctable errors can identify where in the bank the data was sourced from. A CE is classified as Intermittent, Persistent or Sticky. For CE errors Solaris converts the Esynd to the human-readable Data or Check bit corrected. The AFT2 lines can be ignored in almost all cases.

Events can be grouped into:

Events of the reported cpu

DPE - D$ parity event
DDSPE - D$ data parity event
DTSPE - D$ physical tag parity event
IPE - I$ parity event
IDSPE - I$ data parity event
ITSPE - I$ physical tag parity event
TSCE - software correctable single-bit E$ tag ECC event
THCE - hardware corrected single-bit E$ tag ECC event
UCC - software correctable E$ ECC event
UCU - uncorrectable E$ ECC event
EDC - hardware corrected E$ ECC event
EDU:ST - uncorrectable E$ ECC event for store merge
EDU:BLD - uncorrectable E$ ECC event for block load
WDC - hardware corrected E$ ECC event for writeback
WDU - uncorrectable E$ ECC event for writeback
CPC - hardware corrected E$ ECC event for copyout
CPU - uncorrectable E$ ECC event for copyout
ETP - uncorrectable E$ tag parity error

These errors are typically internal to the cpu that reported them or to its caches. In some cases the E$ is not physically part of the cpu module. These errors do not necessarily imply that the cpu or its E$ is at fault, as such errors can be caused by random events. Attention must be paid to earlier errors (if any) which may have deposited the errors into the reporting cpu's caches.

Events reported by the CPU, but the reporting cpu is likely innocent

CE - Hardware corrected system bus data ECC event for read from the system bus
UE - Uncorrectable system bus data ECC event for read from the system bus
TO - Unmapped event from the system bus
BERR - Bus Error event response from the system bus
EMC - Hardware corrected system bus MTag ECC event
IVC - Hardware corrected system bus data ECC event for read of interrupt vector
IVU - Uncorrectable system bus data ECC event for read of interrupt vector
DUE - Uncorrectable system bus data ECC event for prefetch or store queue fill read from system bus
DTO - Unmapped event for prefetch or store queue fill read from system bus
DBERR - Bus Error event response for prefetch or store queue fill read from system bus
FRC - foreign read to DRAM incurring correctable ECC event
FRU - foreign read to DRAM incurring uncorrectable ECC event
RCE - correctable ECC event from remote cache/memory
RUE - uncorrectable ECC event from remote cache/memory
OM - JBus transaction event due to cacheable PA bits not supported by L2 tag (out-of-range memory)
UMS - event due to unsupported store
IVPE - interrupt vector parity event
BP - JBus parity event on returned read data
WBP - JBus parity event on data for writeback or block store

The cpu listed with these events is simply the reporting cpu and may be unrelated to the cause of the event. To find the root cause a CDA is necessary.

Fatal Errors

EMU - uncorrectable system bus MTag ECC event
TUE - uncorrectable E$ tag ECC event
PERR - system interface protocol event
IERR - internal processor event
ISAP - system request parity event on incoming address
JETO - system interface protocol event, hardware time-out
SCE - JBus parity event on J_PACK or J_REQ signals
JEIC - system interface protocol event detected
JEIT - system interface protocol event, illegal ADTYPE
ETP - uncorrectable E$ tag parity event
JEIS - system interface protocol event, illegal stall state

These events will result in a complete reset of the system. A core dump should then be available, see also:
Document ID: 72846 Title:Event Messages for UltraSPARC-III[R], UltraSPARCIII+[R], UltraSPARC-IIIi[R], and UltraSPARC-IV[R] CPU Modules
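To get a feel for which events appear in a messages file, the flag names Solaris prints between < and > after the AFSR value can be tallied and mapped onto the three groups above. The following is a sketch only: the function names afsr_flags and classify_afsr_flag are hypothetical, the case patterns cover only part of the lists above (ETP is left out because it appears in two groups), and modifiers such as ME or PRIV are not classified.

```shell
#!/bin/sh
# Hypothetical helpers: tally AFSR flag names found in messages files and
# map a flag onto the event groups listed in this section.
afsr_flags() {
    # Matches e.g. "AFSR 0x00300440<ME,PRIV,UCC,WDC>.000001d4"
    #          and "AFSR 0x00000000.80200000<PRIV,UE>"
    sed -n 's/.*AFSR 0x[0-9a-fA-F.]*<\([A-Z0-9_,:]*\)>.*/\1/p' "$@" |
        tr ',' '\n' | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

classify_afsr_flag() {
    case $1 in
        DPE|IPE|TSCE|THCE|UCC|UCU|EDC|EDU*|WDC|WDU|CPC|CPU)
            echo "reporting cpu or its caches" ;;
        CE|UE|TO|BERR|EMC|IVC|IVU|DUE|DTO|DBERR|FRC|FRU|RCE|RUE)
            echo "reporting cpu likely innocent" ;;
        EMU|TUE|PERR|IERR|ISAP|JETO|SCE|JEIC|JEIT|JEIS)
            echo "fatal" ;;
        *)
            echo "not classified here" ;;
    esac
}
```

afsr_flags messages messages.0 prints one "FLAG: count" line per flag. The classification is only a first orientation and does not replace findaft or a CDA.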
As soon as you have clarified the panic situation and the message format, and have decided that the case is worth escalating for a CDA, you should prepare the following data and keep it available online on cores.uk (set the permissions properly).

Mandatory:
- vmcore.x
- unix.x
- act.out
- Problem statement (either in the esc. template or as a txt file on the server)

Special:
- extd. POST output (if the trap message points to a HW-related panic)
- explorer output (if the trap message points to a SW-related panic), to analyse driver status, modinfo, etc.

Running ACT with KENV (Kernel Environment):

    kenv act 0 >act.out

Sun Hardware Diagnostics

If a panic occurred and the root cause is unknown, the first action should be to run an extd. POST to rule out any hardware issue as the cause and to verify the hardware availability after a power cycle. To perform an extd. POST at the ok> prompt:

    ok> setenv auto-boot? false
    ok> setenv diag-switch? true
    ok> setenv diag-level max
    ok> power-off
    ----power-on--
    (watch the results of the diagnostic tests, or log the console output)

If devices appear to be missing, you can also run the following tests:

    ok> probe-scsi-all
    ok> probe-sbus
    ok> show-sbus
    ok> show-disks
    ok> show-tapes
    ok> show-nets
    ok> show-devs
10. ACT Output (Example)

Loading Symbol Table from: unix.0
Crash Dump Platform: sun4u
Crash Dump Machine Type: SUNW,Sun-Fire-480R
Crash Dump Architecture: 64-Bit
Crash Dump OS Release: 5.8
Crash Dump Version: Generic_117000-05
Crash Dump includes: SDS 4.1
Crash Dump includes: Veritas
Local Machine Platform: sun4u
Local Machine Architecture: 64-Bit
Local Machine OS Release: 5.9
Using FCS Stabs
Using Kernel Patch:
Running command act on Crashdump 0
act -n unix.0 -d vmcore.0
isa_list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc
Hostname: blahblah
Domainname: blahblah
Release: SunOS 5.8 Generic_117000-05
Architecture: sun4u
Hostid: xxxxxxxxxxxxx
System booted at: 2005 Jul 9 14:40:14 GMT
System crashed at: 2006 Jan 21 16:58:52 GMT
Crash dump started at: 2006 Jan 21 16:58:52 GMT
panic: ufs_ifree: freeing free inode
....
lbolt is 0x64fc8972, 0t1694271858
physmem is 0x1f33dd 15.6G
(.......)
No. of cpus: 4
Total no. of threads: 445
No. of threads in use: 405
No. of swapped threads: 0
No. of processes: 119
.......
(listing of all cpus with threads and traces)
........
CPU 0x2...... ### PANIC occurred on this CPU
cpu addr 0x30005430aa0
running thread addr 0x2a1005f7d20
pause thread addr 0x2a1002fbd20
PANIC occurred on this thread
Kernel thread: thread addr 0x2a1005f7d20, proc addr 0x10424d30, lwp addr 0x0
Thread bound to cpu id 0x2
t_state is 0x4 - TS_ONPROC
ufs: ufs_fault+0x1c (0x3000502e628,0x1047c0c8,0x0,0x27e06,0x300050560d4)
ufs: ufs_ifree+0x1bc (0x3000502e598,0x27e06,0x8180)
ufs: ufs_delete+0x1e4 (0x3000014a038,0x3000502e598,0x1)
ufs: ufs_thread_delete+0xc4 (0x10446028,0x0)
unix: thread_start+0x4 (0x10446028,0x0,0x0,0x0,0x0,0x0)
11. Example FindAFT output

This script looks for Hardware errors including all AFT and pci ECC events Written for 108528-16/112233-01 or above. Some tests may fail on other revisions Report bugs,RFEs or if you have questions email findaft-interest@sun.com Version 1.91 homepage http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft Or runnable from /net/cores.uk/export/hotline/hotlocal/bin/findaft Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script ################################################################################ Input file V440_UE_DIMM is 0.6 MB ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Esynd and Msynd errors CE and UE errors are included ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 4 Esynd 0x0071 C3/P0/B0: B0/D0 B0/D1 4 Esynd 0x0071 INVALID J_AID 0 INVALID 4 Esynd 0x0071 INVALID J_AID 1 INVALID 1 Esynd 0x0071 INVALID J_AID 2 INVALID 10 Esynd 0x0071 J_AID 0 8 Esynd 0x0071 J_AID 1 6 Esynd 0x0071 J_AID 2 4 Esynd 0x00a8 C3/P0/B0/D0: B0/D0 2 Esynd 0x00a8 J_AID 1 2 Esynd 0x00a8 J_AID 2 3 Esynd 0x0130 C3/P0/B0/D0: B0/D0 3 Esynd 0x0130 J_AID 0 7 Esynd 0x0130 J_AID 1 5 Esynd 0x0130 J_AID 2 18 Esynd 0x0141 C3/P0/B0/D0: B0/D0 10 Esynd 0x0141 J_AID 0 13 Esynd 0x0141 J_AID 1 3 Esynd 0x0141 J_AID 2 2 Esynd 0x0142 C3/P0/B0/D0: B0/D0 2 Esynd 0x0142 J_AID 0 6 Esynd 0x0142 J_AID 1 3 Esynd 0x0142 J_AID 2 12 Esynd 0x01ea C3/P0/B0: B0/D0 B0/D1 12 Esynd 0x01ea INVALID 10 Esynd 0x01ea INVALID J_AID 1 INVALID 6 Esynd 0x01ea INVALID J_AID 2 INVALID 10 Esynd 0x01ea J_AID 1 6 Esynd 0x01ea J_AID 2 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other AFT Events ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 39 [AFT0] Corrected memory (CE) Event detected by CPU3 81 [AFT0] Corrected memory (FRC) Event detected by CPU3 18 [AFT0] Corrected remote memory/cache (RCE) Event detected by CPU0 33 [AFT0] Corrected remote memory/cache (RCE) 
Event detected by CPU1
17 [AFT0] Corrected remote memory/cache (RCE) Event detected by CPU2
56 [AFT1] Two Bits were in error
40 [AFT1] Uncorrectable memory (FRU) Event detected by CPU3
16 [AFT1] Uncorrectable memory (UE) Event detected by CPU3 Privileged Data Access
10 [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU0 Privileged Data Access
18 [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU1 Privileged Data Access
12 [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU2 Privileged Data Access
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Main Memory Correctable ECC events sorted by date
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 Jan 01 C3/P0/B0/D0: B0/D0 is Persistent
1 Jan 02 C3/P0/B0/D0: B0/D0 is Persistent
17 Dec 29 C3/P0/B0/D0: B0/D0 is Persistent
6 Dec 30 C3/P0/B0/D0: B0/D0 is Persistent
2 Dec 31 C3/P0/B0/D0: B0/D0 is Persistent
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
USIIIi FRC and FRU events - Further info available from
http://gmpweb.uk/~db124859/pdf/UltraSPARCIIIi_CPU_Error_Handling_and_Messages.pdf
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 (FRC) CPU3 0x0071 INVALID J_AID 0 INVALID
4 (FRC) CPU3 0x0071 INVALID J_AID 1 INVALID
1 (FRC) CPU3 0x0071 INVALID J_AID 2 INVALID
2 (FRC) CPU3 0x00a8 J_AID 1
2 (FRC) CPU3 0x00a8 J_AID 2
3 (FRC) CPU3 0x0130 J_AID 0
7 (FRC) CPU3 0x0130 J_AID 1
5 (FRC) CPU3 0x0130 J_AID 2
10 (FRC) CPU3 0x0141 J_AID 0
13 (FRC) CPU3 0x0141 J_AID 1
3 (FRC) CPU3 0x0141 J_AID 2
2 (FRC) CPU3 0x0142 J_AID 0
6 (FRC) CPU3 0x0142 J_AID 1
3 (FRC) CPU3 0x0142 J_AID 2
10 (FRC) CPU3 0x01ea INVALID J_AID 1 INVALID
6 (FRC) CPU3 0x01ea INVALID J_AID 2 INVALID
10 (FRU) CPU3 0x0071 J_AID 0
8 (FRU) CPU3 0x0071 J_AID 1
6 (FRU) CPU3 0x0071 J_AID 2
10 (FRU) CPU3 0x01ea J_AID 1
6 (FRU) CPU3 0x01ea J_AID 2
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
USIIIi RCE and RCU events - Further info available from
http://gmpweb.uk/~db124859/pdf/UltraSPARCIIIi_CPU_Error_Handling_and_Messages.pdf
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
12 (RCE) CPU0 C3/P0/B0: B0/D0 B0/D1 (applicable only if corresponding FRC Event also logged)
19 (RCE) CPU1 C3/P0/B0: B0/D0 B0/D1 (applicable only if corresponding FRC Event also logged)
9 (RCE) CPU2 C3/P0/B0: B0/D0 B0/D1 (applicable only if corresponding FRC Event also logged)
10 (RUE) CPU0 C3/P0/B0: B0/D0 B0/D1 (applicable only if corresponding FRU Event also logged)
18 (RUE) CPU1 C3/P0/B0: B0/D0 B0/D1 (applicable only if corresponding FRU Event also logged)
12 (RUE) CPU2 C3/P0/B0: B0/D0 B0/D1 (applicable only if corresponding FRU Event also logged)
################################################################################
WARNING: Errors found indicate that this system has a UE DIMM in bank CPU3 Bank0: B0/D0 B0/D1
################################################################################
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Correctable memory errors found cediag syntax will be as follows
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
cediag -e explorer_directory/
cediag -c SunOS,viefep20,5.x,sparc -k KUP_rev -u 3i messages
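When a summary like the one above runs to many lines, it can help to total the event counts per AFT class before deciding which component to look at first. The following is a minimal Python sketch, not part of findaft itself. It assumes each event line has the shape "<count> [AFTn] <description>", which is inferred only from the sample output above; the function name tally_aft and the regular expression are hypothetical. In this sample the AFT0 lines are corrected events and the AFT1 lines are uncorrectable ones.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the sample output above:
#   "<count> [AFTn] <description>"
AFT_LINE = re.compile(r"^\s*(\d+)\s+\[(AFT\d)\]\s+(.*)$")


def tally_aft(lines):
    """Sum the leading event counts per AFT class (hypothetical helper)."""
    totals = Counter()
    for line in lines:
        match = AFT_LINE.match(line)
        if match:
            count, aft_class, _description = match.groups()
            totals[aft_class] += int(count)
    return totals


# Event lines copied from the first example above.
sample = [
    "17 [AFT0] Corrected remote memory/cache (RCE) Event detected by CPU2",
    "56 [AFT1] Two Bits were in error",
    "40 [AFT1] Uncorrectable memory (FRU) Event detected by CPU3",
    "16 [AFT1] Uncorrectable memory (UE) Event detected by CPU3 Privileged Data Access",
    "10 [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU0 Privileged Data Access",
    "18 [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU1 Privileged Data Access",
    "12 [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU2 Privileged Data Access",
]

totals = tally_aft(sample)
for aft_class in sorted(totals):
    print(aft_class, totals[aft_class])
```

Lines that do not match the assumed shape (banners, ECC date lines, FRC/FRU lines) are simply skipped, so the same helper can be fed a whole saved summary.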
Second Example:
This script looks for Hardware errors including all AFT and pci ECC events
Written for 108528-16/112233-01 or above. Some tests may fail on other revisions
Report bugs,RFEs or if you have questions email findaft-interest@sun.com
Version 1.91 homepage http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft
Or runnable from /net/cores.uk/export/hotline/hotlocal/bin/findaft
Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script
################################################################################
880_uniq_col_prtdiag Input file 880_uniq_col is 1.2 MB
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Esynd and Msynd errors CE and UE errors are included
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
848 Esynd 0x01d4 Slot A: J8000
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Schizo / Psycho reported CE and UE events
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6 Memory Module Slot A: J8000 id 8.
6 NOTICE: correctable error detected by pci0 (safari id 8) during
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other AFT Events
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
210 [AFT0] Corrected system bus (CE) Event detected by CPU0
327 [AFT0] Corrected system bus (CE) Event detected by CPU1
141 [AFT0] Corrected system bus (CE) Event detected by CPU2
170 [AFT0] Corrected system bus (CE) Event detected by CPU3
1 [AFT0] Sticky Softerr encountered on Memory Module Slot A: J8000
742 [AFT0] Sticky Softerror encountered on Memory Module Slot A: J8000
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Main Memory Correctable ECC events sorted by date
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
105 May 17 Slot A: J8000 is Intermittent
743 May 17 Slot A: J8000 is Sticky
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Correctable memory errors found cediag syntax will be as follows
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
cediag -e explorer_directory/
cediag -c SunOS,corpp802,5.x,sparc -k KUP_rev -u 3+ messages
################################################################################
Start of NGDIMM CE specific checks
################################################################################
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Unique dimms total 1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Slot A: J8000
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Unique Dimm with the Data or Check Bit corrected total 1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
848 Slot A: J8000 0x01d4 (CE) Event Data Bit 75
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Unique Esynds / Data or Check bit total 1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
848 0x01d4 (CE) Event Data Bit 75
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CE Event type reported by each CPU
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Reporting CPU   Intermittent   Persistent   Sticky
CPU0            20             0            190      <<
CPU1            34             0            293      <<
CPU2            24             0            117      <<
CPU3            27             0            143      <<
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
RAS/CAS and Logical Bank decode per Esynd rfe 4990456
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Esynd 0x01d4
############
Slot A: J8000 DIMM size 256MB Total Events 848 Interleave 2
Unique Rows, 165 Columns 1 Logical banks 1 Quadwords 1
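The figures in the second example can be cross-checked against each other, which is a quick way to confirm that the summary is being read correctly. The sketch below is in Python and is only an illustration. It assumes the "CE Event type reported by each CPU" table is read as one row per CPU with the columns Intermittent, Persistent and Sticky; the variable and function names are hypothetical. Each row total should then match the "Corrected system bus (CE)" count for that CPU, and the column totals should match the per-date summary (105 Intermittent, 743 Sticky).

```python
# Per-CPU CE counts from the second example, read as one row per CPU:
# (intermittent, persistent, sticky). The row-wise reading is an assumption.
ce_by_cpu = {
    "CPU0": (20, 0, 190),
    "CPU1": (34, 0, 293),
    "CPU2": (24, 0, 117),
    "CPU3": (27, 0, 143),
}

# "[AFT0] Corrected system bus (CE)" counts from "Other AFT Events".
reported_ce = {"CPU0": 210, "CPU1": 327, "CPU2": 141, "CPU3": 170}


def row_totals(table):
    """Total CE events per reporting CPU."""
    return {cpu: sum(counts) for cpu, counts in table.items()}


def column_totals(table):
    """Total CE events per type, across all CPUs."""
    return tuple(sum(column) for column in zip(*table.values()))


rows = row_totals(ce_by_cpu)
intermittent, persistent, sticky = column_totals(ce_by_cpu)

for cpu, total in rows.items():
    status = "ok" if total == reported_ce[cpu] else "MISMATCH"
    print(cpu, total, status)
print("Intermittent", intermittent, "Persistent", persistent, "Sticky", sticky)
```

With these numbers every CPU reports events against the same DIMM (Slot A: J8000) and the large majority are Sticky, which is consistent with the single-DIMM conclusion in the NGDIMM checks.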
12. Resources
Document ID: 13258 Title: KERNEL: How to enable deadman kernel code
Document ID: 13039 Title: Troubleshooting System Hangs
Document ID: 80737 Title: Crash Dump capturing for Redhat Linux
Document ID: 15553 Title: Solaris[TM] Operating System: Forcing a kernel core dump on an x86 or x64 system.
Document ID: 79609 Title: Ecache events and what to do about them
Document ID: 81451 Title: Analyzing and Troubleshooting UE errors on Enterprise[TM] E3x00-E6x00 systems
Document ID: 80270 Title: Findaft - an AFT, CPU, Memory and PCI ECC error message summary script.
Document ID: 16106 Title: How to disable BREAK and L1-A using kbd and /etc/default/kbd
Document ID: 17152 Title: Watchdog FAQ
Document ID: 46241 Title: Sun Fire[TM] reboots due to TOD-POR reason
Document ID: 45780 Title: Use prtconf to find the reason of unexpected reboots
Document ID: 20767 Title: Why a system would drop to the ok prompt
Document ID: 51105 Title: Sun4U Fatal Reset FAQ
Document ID: 11837 Title: How to force a crash when my machine is hung?
Document ID: 83484 Title: OS 8 Panics with vfs_mountroot:cannot mount root
Document ID: 18890 Title: Analysing free inode, frag, block panics
Document ID: 72846 Title: Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], and UltraSPARC-IV[R] CPU Modules
