Solaris Troubleshooting – System Panics, Hangs and Crashes

There are a number of differing scenarios under which the Solaris operating system may panic, hang, or exhibit other symptoms that lead the administrator to have to restart or reboot the system.

As there are many different failure scenarios and many different classes of hardware, the information and procedures for collection of system information vary from system to system.

1.Hang

The first class of failures is the hang. This is when a system appears to become unresponsive. See the documents below that discuss dealing with hung systems.

Be aware that some systems that appear hung are not! Be sure to verify if the server is hung or not. For example: The display may be non-responsive, because the output has been redirected to the console device.

2.Panic or unexpected reboot

Panics can be caused by a variety of issues, including Solaris Bugs, Hardware errors and Third Party Drivers and Applications. It’s important to collect as much detail as is possible when systems panic.

Information that need to be collected to troubleshoot Kernel Panic:

When logging a new case, provide answers to the following questions as an absolute minimum:

When did the problem start
What changes have been recently made on the system. Important: Anything that has happened since the last reboot is within the scope of this question. Patching, application changes, disk replacements, anything. It’s all important to know when trying to resolve the issues.
How often has this failure occurred
What may have been going on around the time of the panic and if anything out of the ordinary may have been observed

In the case of a panic or reboot, the messages log and prtdiag output are items that can be quickly sent to Oracle Sun, and that can go a long way towards diagnosis of the cause of the problem, however, an explorer is almost always better.

By far, the simplest way to collect the vast majority of details required to resolve a panic is to collect:

Sun Explorer Data Collector output
The crash dump
Any console messages

a. EXPLORER OUTPUT

If Explorer is run on the system after the incident occurs, it will contain most of what will be required to understand the current configuration of the system, and additional information that may be helpful.
b. CRASH DUMP

If the system generated a system crash dump (check /var/crash/`hostname`), create a compressed tar file containing the unix.* and vmcore.* files and transfer that file to Oracle Sun for analysis. Compressing the tar file (using one of the compress, gzip of bzip2 utilities) reduces the size of the file dramatically and so reduces the time taken to transfer the file to Sun.

c. CONSOLE MESSAGES

Console messages are most important when the server is experiencing hardware issues and the OS is not allowed an opportunity to panic. In cases where multiple unexpected reboots are occurring, and no diagnostic infomation is being provided by the system logs in /var/adm/messages, some form of console logging should be setup as soon as possible to capture the diagnostic information from the console on the next failure.

Recommended NVRAM settings , to Collect Console Messages:

Bring system to OBP level from command line using “shutdown” or “init 0″ commands (either will run all RC shutdown scripts), sync file systems and then drop system to OK prompt. DO NOT use a stop+A key press. The following commands can be executed from the OK prompt or from the command line using the “eeprom <variable=parameter>” command.

at OK prompt	# eeprom	Description
setenv diag-level max	diag-level=max	system will run extended POST
printenv boot-device	boot-device	determine what your boot device is….
setenv diag-device <your boot-device>	diag-device=<your bootdev>	prevent attempting net boot w diags on
setenv error-reset-recovery sync	error-reset-recovery=sync	force sync reboot if system drops to OK
setenv diag-switch true	diag-switch =true
reset-all	reboot or init 6	system has to reset for changes to take affect

Exceptions

In some cases, it is difficult to collect an explorer.

In the event that explorer cannot be installed or run in a timely manner, the following data is of tremendous value, and should be collected:

a. MESSAGES LOG

Messages logs from /var/adm directory. If there was a panic, the panic message in the file will help determine if we need to analyze a crash dump to diagnose the cause. In many cases a crash dump is not necessary, and waiting for one to be transferred simply increases the resolution time. For example, if there was a hardware reset rather than a panic, the messages log should show that.

b. PRTDIAG OUTPUT

Output of the prtdiag command.

 /usr/platform/`uname -i`/sbin/prtdiag -v

prtdiag gives a summary of a system’s hardware configuration, so that we would know what part to order in the case of a hardware failure. It also gives hardware error messages that can aid in diagnosis.

c. SHOWREV OUTPUT

The output from ‘/usr/bin/showrev -p’ gives a list of the patches installed on the system. This will help eliminate possible casues of the problem and ensure that the correct versions of source code and analysis tools are used during the investigation.

3.Live Dump

On occasion, it is required that a live crashdump be collected. It’s uncommon, as dumping a live system does not capture a completely consistent snapshot of the system. Data is changing while the dump is being written out. Although live dumps are not always consistent, they are still a great source of information for certain types of issues.

collectiongg Live Dump from the Solaris Machine:

1) Before collecting a live kernel dump, a dedicated, NON SWAP, dump device must be configured using dumpadm(1M). The dedicated dump device must not be used in any other way (i.e., filesystem, databases, etc.). The dump device CANNOT be swap or any part of. If the dump device is part of swap, generation of live kernel dump will corrupt the swap area causing the kernel to eventually panic. Also note that any filesystem or data on the dump device disk will be lost.

2)The kernel is running during generation of live kernel dump, and the linked lists that all kernel debuggers use to traverse those linked structures may fail because the list was in flux when saved. Therefore, always run the following ps command to capture the process addresses:
/usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
You must use those switches to get addresses with a 64 bit kernel. This will allow you to look at processes since you will have the process address. From there you can generate more complete threadlists, etc. Please see the ps(1) manpage for meaning of those options. For example, the following is what /usr/bin/ps -elf outputs on a 64-bit machine:
> /usr/bin/ps -elf
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
19 T root 0 0 0 0 SY 0 Sep 27 0:00 sched
8 S root 1 0 0 41 20 98 Sep 27 0:00 /etc/init -
19 S root 2 0 0 0 SY 0 Sep 27 0:00 pageout
19 S root 3 0 0 0 SY 0 Sep 27 1:05 fsflush
8 S root 267 1 0 41 20 220 Sep 27 0:00 /usr/lib/saf/sac -t 300
8 S root 154 1 0 51 20 325 Sep 27 0:00 /usr/sbin/inetd -s
8 S root 125 1 0 41 20 294 Sep 27 0:00 /usr/sbin/rpcbind
8 S root 48 1 0 47 20 185 Sep 27 0:00 /usr/lib/sysevent/syseventd
8 S root 50 1 0 48 20 160 Sep 27 0:00 /usr/lib/sysevent/syseventconfd
NOTE: The ADDR field is not populated. If you issue the command to capture process addresses as described above, you will see the following output instead:

> /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
UID PID PPID PRI NI ADDR VSZ WCHAN TIME COMMAND
0 0 0 96 SY 10423a60 0 – 0:00 sched
0 1 0 58 20 30000909528 784 30000909848 0:00 init
0 2 0 98 SY 30000908a98 0 10458248 0:00 pageout
0 3 0 60 SY 30000908008 0 104618a0 1:05 fsflush
0 267 1 58 20 30000963530 1760 300009659a8 0:00 sac
0 154 1 48 20 30001708028 2600 30000baaca2 0:00 inetd
0 125 1 58 20 30001709548 2352 30000aa2102 0:00 rpcbind
0 48 1 52 20 300009f0aa8 1480 300009f0dc8 0:00 sysevent
0 50 1 51 20 300009f0018 1280 22d44 0:00 sysevent

Thursday, 29 September 2011