Solaris Troubleshooting – System Panics, Hangs and Crashes
There are a number of differing scenarios under which the Solaris operating system may panic, hang, or exhibit other symptoms that lead the administrator to have to restart or reboot the system.
As there
are many different failure scenarios and many different classes of
hardware, the information and procedures for collection of system
information vary from system to system.
1. Hang
The first
class of failures is the hang. This is when a system appears to become
unresponsive. See the documents below that discuss dealing with hung
systems.
Be aware that some systems that appear hung are not, so verify whether the server is really hung. For example, the display may appear unresponsive simply because the output has been redirected to the console device.
2. Panic or unexpected reboot
Panics can be caused by a variety of issues, including Solaris bugs, hardware errors, and third-party drivers and applications. It’s important to collect as much detail as possible when a system panics.
Information that needs to be collected to troubleshoot a kernel panic:
When logging a new case, provide answers to the following questions as an absolute minimum:
- When did the problem start?
- What changes have recently been made on the system? Important: anything that has happened since the last reboot is within the scope of this question: patching, application changes, disk replacements, anything. It’s all important to know when trying to resolve the issue.
- How often has this failure occurred?
- What was going on around the time of the panic, and was anything out of the ordinary observed?
In the case of a panic or reboot, the messages log and prtdiag output are items that can be sent to Oracle Sun quickly and that can go a long way towards diagnosing the cause of the problem. However, an explorer is almost always better.
By far, the simplest way to collect the vast majority of details required to resolve a panic is to collect:
- Sun Explorer Data Collector output
- The crash dump
- Any console messages
a. EXPLORER OUTPUT
If
Explorer is run on the system after the incident occurs, it will contain
most of what will be required to understand the current configuration
of the system, and additional information that may be helpful.
b. CRASH DUMP
If the system generated a system crash dump (check /var/crash/`hostname`), create a compressed tar file containing the unix.* and vmcore.* files and transfer that file to Oracle Sun for analysis. Compressing the tar file (using one of the compress, gzip or bzip2 utilities) reduces the size of the file dramatically and so reduces the time taken to transfer it.
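The bundling step described above can be sketched as follows. This is a minimal sketch, assuming the dump files sit directly in the crash directory and that gzip is installed; the function name bundle_crash_dump and the archive path are illustrative and not part of any Sun tool.

```shell
#!/bin/sh
# Sketch: bundle the unix.* and vmcore.* files from a crash directory into
# one gzip-compressed tar file. Function and archive names are illustrative.
bundle_crash_dump() {
    crash_dir=$1    # for example /var/crash/`hostname`
    archive=$2      # absolute path of the archive to create

    # Work inside the crash directory so the archive holds relative names.
    # The subshell keeps the caller's working directory unchanged.
    (
        cd "$crash_dir" || exit 1
        # Fail early unless both unix.* and vmcore.* files are present.
        ls unix.* vmcore.* >/dev/null 2>&1 || exit 1
        # Stream the tar archive through gzip instead of relying on a tar -z flag.
        tar cf - unix.* vmcore.* | gzip > "$archive"
    )
}

# Usage (commented out so that sourcing this file has no side effects):
# bundle_crash_dump "/var/crash/`hostname`" "/var/tmp/crash-`hostname`.tar.gz"
```

Writing the archive outside the root file system, or at least checking free space first, is advisable because vmcore files can be very large.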
c. CONSOLE MESSAGES
Console messages are most important when the server is experiencing hardware issues and the OS is not given an opportunity to panic. In cases where multiple unexpected reboots are occurring, and no diagnostic information is being provided by the system logs in /var/adm/messages, some form of console logging should be set up as soon as possible to capture the diagnostic information from the console on the next failure.
Recommended NVRAM settings to collect console messages:
Bring the system to the OBP level from the command line using the “shutdown” or “init 0” commands. Either command runs all RC shutdown scripts, syncs the file systems and then drops the system to the OK prompt. DO NOT use a Stop+A key press. The following settings can be applied from the OK prompt, or from the command line using the “eeprom <variable=parameter>” command.
at OK prompt | # eeprom | Description |
setenv diag-level max | diag-level=max | system will run extended POST |
printenv boot-device | boot-device | determine what your boot device is |
setenv diag-device <your boot-device> | diag-device=<your boot-device> | prevent a net boot attempt with diagnostics on |
setenv error-reset-recovery sync | error-reset-recovery=sync | force a sync and reboot if the system drops to OK |
setenv diag-switch? true | diag-switch?=true | |
reset-all | reboot or init 6 | system has to reset for changes to take effect |
Exceptions
In some cases, it is difficult to collect an explorer.
In the
event that explorer cannot be installed or run in a timely manner, the
following data is of tremendous value, and should be collected:
a. MESSAGES LOG
Messages
logs from /var/adm directory. If there was a panic, the panic message in
the file will help determine if we need to analyze a crash dump to
diagnose the cause. In many cases a crash dump is not necessary, and
waiting for one to be transferred simply increases the resolution time.
For example, if there was a hardware reset rather than a panic, the
messages log should show that.
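A quick first pass over the messages log can be sketched as below. This is only a rough filter: it assumes that panic entries contain the string "panic" in some letter case, and the function name find_panic_lines is illustrative. The exact message wording varies by release and platform, so an empty result does not prove that no panic occurred.

```shell
#!/bin/sh
# Sketch: print panic-related lines from a messages log, with line numbers.
# The search pattern "panic" is an assumption about how the entries are worded.
find_panic_lines() {
    logfile=${1:-/var/adm/messages}
    # -i: match in any letter case; -n: prefix each hit with its line number
    grep -in 'panic' "$logfile"
}

# Usage (commented out so that sourcing this file has no side effects):
# find_panic_lines /var/adm/messages
# find_panic_lines /var/adm/messages.0
```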
b. PRTDIAG OUTPUT
Output of the prtdiag command.
/usr/platform/`uname -i`/sbin/prtdiag -v
prtdiag
gives a summary of a system’s hardware configuration, so that we would
know what part to order in the case of a hardware failure. It also gives
hardware error messages that can aid in diagnosis.
c. SHOWREV OUTPUT
The output from ‘/usr/bin/showrev -p’ gives a list of the patches installed on the system. This will help eliminate possible causes of the problem and ensure that the correct versions of source code and analysis tools are used during the investigation.
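The three items above (messages logs, prtdiag output and showrev output) can be gathered in one pass. The following is a minimal sketch, not a replacement for Explorer; the function name collect_fallback_data and the output file names are illustrative assumptions. Each step only warns on failure so that one missing item does not stop the rest from being collected.

```shell
#!/bin/sh
# Sketch: collect the fallback data into one directory when Explorer
# cannot be run. Function name and output file names are illustrative.
collect_fallback_data() {
    outdir=$1
    mkdir -p "$outdir" || return 1

    # a. Messages logs (current file and any rotated copies).
    cp /var/adm/messages* "$outdir"/ 2>/dev/null ||
        echo "warning: no messages logs copied" >&2

    # b. prtdiag may exit non-zero, so keep whatever it printed.
    /usr/platform/`uname -i`/sbin/prtdiag -v > "$outdir/prtdiag-v.out" 2>&1 ||
        echo "warning: prtdiag exited non-zero" >&2

    # c. List of installed patches.
    /usr/bin/showrev -p > "$outdir/showrev-p.out" 2>&1 ||
        echo "warning: showrev exited non-zero" >&2

    return 0
}

# Usage (commented out so that sourcing this file has no side effects):
# collect_fallback_data "/var/tmp/fallback-`hostname`"
```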
3. Live Dump
On
occasion, it is required that a live crashdump be collected. It’s
uncommon, as dumping a live system does not capture a completely
consistent snapshot of the system. Data is changing while the dump is
being written out. Although live dumps are not always consistent, they
are still a great source of information for certain types of issues.
Collecting a Live Dump from the Solaris Machine:
1) Before collecting a live kernel dump, a dedicated, NON-SWAP dump device must be configured using dumpadm(1M). The dedicated dump device must not be used in any other way (e.g., for filesystems or databases). The dump device CANNOT be swap or any part of swap. If the dump device is part of swap, generating a live kernel dump will corrupt the swap area, causing the kernel to eventually panic. Also note that any filesystem or data on the dump device will be lost.
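The swap check in this step can be sketched as a small guard. It is a minimal sketch under stated assumptions: the function name device_not_in_swap is illustrative, the device path in the usage comment is a placeholder, and the check only compares exact device paths against the first column of `swap -l` output. It does not detect overlapping slices or swap configured on volumes. The `dumpadm -d` and `savecore -L` invocations in the usage comment are not shown in the text above, so verify them against dumpadm(1M) and savecore(1M) before use.

```shell
#!/bin/sh
# Sketch: succeed only if the candidate dump device is NOT listed as a swap
# device. Reads `swap -l` style output on standard input.
device_not_in_swap() {
    candidate=$1
    # The first column of each line is the swap file or device path.
    # The header line never matches a device path, so it needs no special case.
    while read swapfile rest; do
        if [ "$swapfile" = "$candidate" ]; then
            return 1
        fi
    done
    return 0
}

# Usage on the Solaris host (placeholder device; commented out on purpose):
# DUMPDEV=/dev/dsk/c0t1d0s1
# swap -l | device_not_in_swap "$DUMPDEV" &&
#     dumpadm -d "$DUMPDEV" &&
#     savecore -L
```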
Capture a process listing that includes process addresses:
/usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
You must use these switches to get addresses with a 64-bit kernel. This allows you to look at individual processes, since you will have the process address. From there you can generate more complete thread lists, etc. See the ps(1) manpage for the meaning of these options. For example, the following is what /usr/bin/ps -elf outputs on a 64-bit machine:
> /usr/bin/ps -elf
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
19 T root 0 0 0 0 SY 0 Sep 27 0:00 sched
8 S root 1 0 0 41 20 98 Sep 27 0:00 /etc/init -
19 S root 2 0 0 0 SY 0 Sep 27 0:00 pageout
19 S root 3 0 0 0 SY 0 Sep 27 1:05 fsflush
8 S root 267 1 0 41 20 220 Sep 27 0:00 /usr/lib/saf/sac -t 300
8 S root 154 1 0 51 20 325 Sep 27 0:00 /usr/sbin/inetd -s
8 S root 125 1 0 41 20 294 Sep 27 0:00 /usr/sbin/rpcbind
8 S root 48 1 0 47 20 185 Sep 27 0:00 /usr/lib/sysevent/syseventd
8 S root 50 1 0 48 20 160 Sep 27 0:00 /usr/lib/sysevent/syseventconfd
NOTE: The ADDR field is not populated. If you issue the command to capture process addresses as described above, you will see the following output instead:
> /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
UID PID PPID PRI NI ADDR VSZ WCHAN TIME COMMAND
0 0 0 96 SY 10423a60 0 - 0:00 sched
0 1 0 58 20 30000909528 784 30000909848 0:00 init
0 2 0 98 SY 30000908a98 0 10458248 0:00 pageout
0 3 0 60 SY 30000908008 0 104618a0 1:05 fsflush
0 267 1 58 20 30000963530 1760 300009659a8 0:00 sac
0 154 1 48 20 30001708028 2600 30000baaca2 0:00 inetd
0 125 1 58 20 30001709548 2352 30000aa2102 0:00 rpcbind
0 48 1 52 20 300009f0aa8 1480 300009f0dc8 0:00 sysevent
0 50 1 51 20 300009f0018 1280 22d44 0:00 sysevent