unix sysadmin archives

Sunday 19 June 2011

T6320 Host Power Failure

We have a Sun Blade T6320 that rebooted twice in one morning.

reciosys01# last reboot | more
reboot    system boot                   Mon Mar 28 09:45
reboot    system down                   Mon Mar 28 09:38
reboot    system boot                   Mon Mar 28 08:44
reboot    system down                   Mon Mar 28 08:37

The problem is that we cannot find anything in /var/adm/messages that indicates the cause of the reboots.
The day before, an Emulex card was replaced in this box, but no error message links the reboots to that change.

Here are the logs from /var/adm/messages:

Mar 28 09:37:08 reciosys01 inetd[411]: [ID xxxxxx daemon.notice] uptmagnt[xxxxx] from xx.xx.xx.xx xxxxx
Mar 28 09:37:09 reciosys01 inetd[411]: [ID xxxxxx daemon.notice] uptmagnt[xxxxx] from xx.xx.xx.xx xxxxx
Mar 28 09:38:40 reciosys01 inetd[411]: [ID xxxxxx daemon.notice] bgssd[xxxxx] from xx.xx.xx.xx xxxxx
Mar 28 09:44:50 reciosys01 genunix: [ID xxxxxx kern.notice] ^MSunOS Release 5.10 Version Generic_142909-17 64-bit
Mar 28 09:44:50 reciosys01 genunix: [ID xxxxxx kern.notice] Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.
Mar 28 09:44:50 reciosys01 genunix: [ID xxxxxx kern.info] Ethernet address = x:xx:xx:xx:xx:xx
Mar 28 09:44:50 reciosys01 unix: [ID xxxxxx kern.info] NOTICE: Kernel Cage is ENABLED
Mar 28 09:44:50 reciosys01 unix: [ID xxxxxx kern.info] mem = 66977792K (0xff8000000)
Mar 28 09:44:50 reciosys01 unix: [ID xxxxxx kern.info] avail mem = 66732310528
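
The silent gap before a boot banner can be measured with a short script. What follows is a minimal Python sketch for illustration, not the procedure used on this box: the function names and the assumed year (syslog timestamps carry none) are hypothetical, and it relies only on the timestamp format and the "SunOS Release" banner visible in the excerpt above.

```python
from datetime import datetime

BOOT_MARKER = "SunOS Release"  # kernel banner logged at boot (see the log above)

def parse_time(line, year=2011):
    # Syslog timestamps ("Mar 28 09:37:08") carry no year, so it is assumed.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_boot_gaps(lines, year=2011):
    """Return (last_line_before_boot, boot_line, gap) for every boot banner."""
    gaps = []
    previous = None  # (timestamp, text) of the last timestamped line seen
    for raw in lines:
        line = raw.rstrip("\n")
        try:
            stamp = parse_time(line, year)
        except ValueError:
            continue  # blank or wrapped line without a timestamp
        if BOOT_MARKER in line and previous is not None:
            gaps.append((previous[1], line, stamp - previous[0]))
        previous = (stamp, line)
    return gaps

# A few lines copied from the excerpt above.
SAMPLE = [
    "Mar 28 09:37:09 reciosys01 inetd[411]: [ID xxxxxx daemon.notice] uptmagnt[xxxxx] from xx.xx.xx.xx xxxxx",
    "Mar 28 09:38:40 reciosys01 inetd[411]: [ID xxxxxx daemon.notice] bgssd[xxxxx] from xx.xx.xx.xx xxxxx",
    "Mar 28 09:44:50 reciosys01 genunix: [ID xxxxxx kern.notice] ^MSunOS Release 5.10 Version Generic_142909-17 64-bit",
    "Mar 28 09:44:50 reciosys01 unix: [ID xxxxxx kern.info] NOTICE: Kernel Cage is ENABLED",
]

for before, boot, gap in find_boot_gaps(SAMPLE):
    print("last entry before boot:", before[:15])
    print("boot banner at:        ", boot[:15])
    print("silent gap:            ", gap)
```

On the host itself the same function could be fed the open /var/adm/messages file instead of the sample list. For the excerpt above it reports a gap of 6 minutes 10 seconds, with no shutdown or panic messages logged in between.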

We then checked the event logs on the SP through the ILOM.
There we found this specific error: "Host Power Failure: MB_DC_POK Fault".
I suspect this is related to the power supply: the voltage output may not have been at its expected level.


-> cd /SP/logs/event
/SP/logs/event

-> show list

  /SP/logs/event/list
    Targets:

    Properties:

    Commands:
        cd
        show

ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
70701  Mon Mar 28 01:41:50 2011  Chassis   Log       major  
       Host is running
70700  Mon Mar 28 01:38:20 2011  Fault     Repair    minor  
       SP detected fault cleared at time Mon Mar 28 01:38:18 2011. Host Power: MB_DC_POK is OK
70699  Mon Mar 28 01:37:14 2011  Chassis   Log       major  
       Host has been powered on
70698  Mon Mar 28 01:37:03 2011  Chassis   Log       critical
       Host has been powered off
70697  Mon Mar 28 01:37:03 2011  Chassis   Log       major  
       Power cycling Host System.  Please wait.
70696  Mon Mar 28 01:37:01 2011  Fault     Fault     critical
       SP detected fault at time Mon Mar 28 01:37:01 2011. Host Power Failure: MB_DC_POK Fault
Paused: press any key to continue, or 'q' to quit                                                                             
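
To line the SP events up with the host's own logs, the listing can be parsed into records. The Python sketch below is illustrative only: the function name, the 80-column wrap width, and the eight-hour clock offset are assumptions. The offset is inferred from the SP recording the fault at 01:37 while the host went down at about 09:38; it is not confirmed, and the two clocks do not appear to be exactly in step.

```python
import re
from datetime import datetime, timedelta

# Matches an event header such as:
# 70696  Mon Mar 28 01:37:01 2011  Fault     Fault     critical
HEADER = re.compile(
    r"^(\d+)\s+(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)
WRAP_WIDTH = 80  # assumption: the SP wrapped long messages at 80 columns

def parse_events(lines):
    """Turn 'show list' output into a list of event dictionaries."""
    events = []
    previous_length = 0
    for raw in lines:
        line = raw.rstrip("\n")
        header = HEADER.match(line)
        if header:
            event_id, stamp, cls, kind, severity = header.groups()
            events.append({
                "id": int(event_id),
                "time": datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"),
                "class": cls,
                "type": kind,
                "severity": severity,
                "message": "",
            })
        elif events and line.startswith(" ") and line.strip():
            # Indented message text. A preceding line that filled the whole
            # width was cut mid-word, so it is rejoined without a space.
            text = line.strip()
            current = events[-1]["message"]
            if not current:
                events[-1]["message"] = text
            elif previous_length >= WRAP_WIDTH:
                events[-1]["message"] = current + text
            else:
                events[-1]["message"] = current + " " + text
        previous_length = len(line)
    return events

INDENT = " " * 7  # message lines are indented by seven spaces in the listing
SAMPLE = [
    "70698  Mon Mar 28 01:37:03 2011  Chassis   Log       critical",
    INDENT + "Host has been powered off",
    "70697  Mon Mar 28 01:37:03 2011  Chassis   Log       major  ",
    INDENT + "Power cycling Host System.  Please wait.",
    "70696  Mon Mar 28 01:37:01 2011  Fault     Fault     critical",
    INDENT + "SP detected fault at time Mon Mar 28 01:37:01 2011. Host Power Failure: M",
    INDENT + "B_DC_POK Fault",
    "Paused: press any key to continue, or 'q' to quit",
]

# Assumption: the SP clock runs about eight hours behind the host clock.
SP_TO_HOST = timedelta(hours=8)

for event in parse_events(SAMPLE):
    if event["severity"] == "critical":
        print(event["id"], event["time"] + SP_TO_HOST, event["message"])
```

Shifting the critical events by the assumed offset puts the MB_DC_POK fault within a couple of minutes of the 09:38 "system down" entry reported by last reboot.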


We searched the web for related incidents but found no cases specific to the T6320.
We did find something for the T6340, "False Power Failure Faults Might Be Reported (CR 6895793)", but that applies during POST or SunVTS memory testing.
It is not quite related to our case.
Since there was a recent change on this box, it is a good idea to ask our vendor whether the two might be related. We updated the service request for the Emulex replacement with this inquiry.

Hopefully, by our next update we will have a better picture of this problem.

1 comment:

  1. The problem has not recurred.
    Oracle Support asked for the output of showfaults:

    sc> showfaults -v
    Last POST Run:Mon Mar 28 01:55:48 2011

    Post Status: Passed all devices
    No failures found in System
    sc>
