unix sysadmin archives

Thursday 3 May 2012

How to Replace System Board for Sun Fire E6900 Systems

Applies to:
Sun Fire 4800 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire 4810 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire 6800 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire E4900 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire 3800 Server - Version: Not Applicable and later [Release: N/A and later]
Information in this document applies to any platform.

H/W ON-SITE Action Plan #1. Parts: 540-6295
NOTE: THIS ACTION PLAN WAS AUTOGENERATED USING https://actionplans.us.oracle.com/atr/.

***********************************************************
************ Start Hardware Onsite Action Plan ************


A. DISPATCH INSTRUCTIONS

A1. WHAT SKILLS DOES THE ENGINEER NEED (IS A SITE ENGINEER AVAILABLE?):

A2. PARTS REQUIRED: USE INFORMATION IN TASK UNLESS ENTERED BELOW
Part number: [F] 540-6295
Part location: SB4
Quantity: 1
Description: CPU/MEM W/ 4 US IV 1.35GHZ, 0GB (FRU)
Prior part DOA: No
Alternate parts: 540-6803
SPECIAL INSTRUCTIONS: Verify the new board's firmware matches that of the System Controller and other boards in the configuration.
See http://sunsolve.sun.com/search/document.do?assetkey=1-61-214805-1 for details.

A3. DELIVERY REQUIREMENT:
Preferred Onsite Time: Within Service SLA

A4. ONSITE VISIT DETAILS:
Account name:
Contact Name:
Contact Telephone #:
Email address:
Street Address:
City:
State:
Country:
Postal Code:
Alt. Contact name:
Alt. Contact email:
Alt. Contact phone:
Special instructions:

B. FIELD ENGINEER INSTRUCTIONS

NOTE : READ MANDATORY NOTES SECTION OF ACTION PLAN.
This Action Plan is not complete until all mandatory actions outlined below have been completed.

B1. PROBLEM OVERVIEW:
General problem: There is a component failure
Fault for part 540-6295: NA
*** Start System Error Message ***
cat-a:SC> showchs -c SB4 -v
Total # of records: 1
Component : /N0/SB4
Time Stamp : Sun Dec 04 19:26:26 EST 2011
New Status : Faulty
Old Status : OK
Event Code : HW
Initiator : ScApp
Message : 1.E6900.FAULT.ASIC.CHEETAH.AFSR_2_HI_ISAP.71191111.20-16.1

*** End System Error Message ***

B2. WHAT ACTIONS DOES THE ENGINEER NEED TO TAKE:
Oracle Confidential (INTERNAL). Do not distribute to customers

Reason: FRU CAP


Goal
How to Replace System Board for Sun Fire 3800, 4800, 4810, 6800, E4900, and E6900 Systems

******************************************************************************

To report errors or request improvements on this procedure,
please go to http://support.us.oracle.com
and put a comment on Doc ID: 1306577.1

******************************************************************************

Solution
DISPATCH INSTRUCTIONS

WHAT SKILLS DOES ENGINEER NEED:

ScApp, lom

Task Complexity: 4

Time Estimate: 60 minutes

FIELD ENGINEER INSTRUCTIONS

CAP PROBLEM OVERVIEW:
System Board Failure

WHAT STATE SHOULD SYSTEM BE IN TO BE READY TO PERFORM RESOLUTION ACTIVITY?

Examples use a board location of '#'

1) Check whether Dynamic Reconfiguration (DR) can be used. If the board is listed in the output of 'cfgadm -av | grep -i perm', DR cannot be used.

2a) If DR can be used, issue 'cfgadm -c disconnect N0.SB#' on the domain.

2b) If DR cannot be used, issue 'init 0' on the domain, then 'poweroff sb#' at the SC prompt.
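
The DR check in steps 1 and 2 can be sketched in Python as follows. This is an illustrative sketch only: the function names `board_is_permanent` and `next_step` are hypothetical, and the text passed in is assumed to be the captured output of 'cfgadm -av' from the domain. It mirrors 'grep -i perm' by keeping lines that contain "perm" and then checking whether such a line names the board.

```python
import re


def board_is_permanent(cfgadm_av_output: str, board: str) -> bool:
    """Return True if a line containing 'perm' also names the board.

    `cfgadm_av_output` is assumed to be text captured from 'cfgadm -av';
    `board` is an ID such as 'SB4'. Matching is case-insensitive.
    """
    board_pattern = re.compile(rf"\b{re.escape(board)}\b", re.IGNORECASE)
    for line in cfgadm_av_output.splitlines():
        if "perm" in line.lower() and board_pattern.search(line):
            return True
    return False


def next_step(cfgadm_av_output: str, board: str) -> str:
    """Pick step 2a (DR) or step 2b (outage) for the given board."""
    if board_is_permanent(cfgadm_av_output, board):
        return f"init 0, then poweroff {board.lower()} at the SC prompt"
    return f"cfgadm -c disconnect N0.{board}"
```

The helper only reads saved text and returns a suggestion; it does not run any command on the domain or the SC.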

WHAT ACTION DOES ENGINEER NEED TO TAKE:

You will need to move DIMMs from the 'old' board to the 'new' one (same slots).

1) Perform physical SB replacement per Service Manual

2) poweron SB# at SC prompt

3) showchs -b at SC prompt

4) Reset any 'Suspect' or 'Faulty' components to 'ok' from the Main SC or from the lom prompt:
setchs -s OK -r 'SR number' -c <comp>

NOTE: If ScApp 5.20.15 or higher, service mode access IS NO LONGER REQUIRED to execute setchs.
If < 5.20.15, contact service to obtain a Service Mode password or generate one yourself at https://modepass.us.oracle.com
(a backup server is also available from https://modepass-bak.us.oracle.com)
Repeat 'setchs' command until all components are 'ok'.

Verify 'showchs -b' is empty.

5) Verify new board firmware matches existing boards & SC(s) ('showboards -p proms' at SC prompt).

If needed, copy firmware from a like board 'flashupdate -c (source board) (destination board)'

6) Consider running extended POST (On domain issue 'eeprom diag-level=max' or at ok prompt 'setenv diag-level max').

7) If you're replacing a COD (Capacity on Demand) enabled board, refer to
Sun Fire[TM] 12K/15K/E20K/E25K/F3800/Fx800/Ex900/ servers: How to replace a COD CPU/memory board (Doc ID 1002102.1)
for the needed steps to follow.

8a) If able to DR, issue 'cfgadm -c configure N0.SB#' at domain level

8b) If unable to DR, issue 'setkeyswitch -d (domainID) off' followed by 'setkeyswitch -d (domainID) on'.

9) Monitor POST.
* If new errors are detected, collect POST and contact Support.
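
The version threshold and the repeated 'setchs' command from step 4 can be sketched in Python as follows. This is a minimal sketch, not part of the procedure: the function names are hypothetical, the component list is assumed to be copied by hand from 'showchs -b', and the code only builds command strings for review; it does not talk to the SC.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '5.20.15' into (5, 20, 15)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_service_mode(scapp_version: str) -> bool:
    """True when ScApp is older than 5.20.15, per the NOTE in step 4."""
    return parse_version(scapp_version) < (5, 20, 15)


def setchs_commands(components, sr_number: str):
    """Build one 'setchs' line per component still marked Suspect or Faulty."""
    return [f"setchs -s OK -r '{sr_number}' -c {comp}" for comp in components]
```

Comparing versions as integer tuples avoids the pitfall of plain string comparison, where "5.9" would sort after "5.20".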

OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES CUSTOMER NEED TO TAKE TO RETURN SYSTEM TO AN OPERATIONAL STATE:

Boot system if not already booting.

REFERENCE INFORMATION:

Replacement procedures are documented in the Service Manuals:

* Chapter 8, 3800/48x0/6800 Manual http://download.oracle.com/docs/cd/E19095-01/sf3800.srvr/805-7363-15/805-7363-15.pdf

* Chapter 8, E4900/E6900 Manual http://download.oracle.com/docs/cd/E19095-01/sfe4900.srvr/817-4120-13/817-4120-13.pdf


B3. SHOULD DYNAMIC RECONFIGURATION BE USED:

B4. IS OUTAGE REQUIRED AND AGREED TO BY THE CUSTOMER: Yes

B5. NOTICES THAT ENGINEER MUST TAKE INTO ACCOUNT:
ROHS NOTICE: This system has NOT been adequately identified during remote diagnosis for the purposes of RoHS. You must check the system's RoHS compliance by referring to information within FIN 102250, or by verifying with your support centre before commencing service.

B6. ADDITIONAL COMMENTS:

B7. WHAT TROUBLESHOOTING TESTS WERE DONE:


C. GENERAL ACTION PLAN INFORMATION
Action plan for case:
Action plan reference number: 1 (always reference with case id)
Affected product: SUN FIRE E6900
Platform version: N/A
(Please update MOS with correct serial number if necessary)


************* End Hardware Onsite Action Plan *************
***********************************************************


**************** GENERAL INSTRUCTIONS FOR THIS ACTION PLAN ***************
**************************************************************************
Make sure a new explorer [explorer -w all,interactive,scextended] is run and
submitted to proactive.central after installing the new parts. Email explorer to:
explorer-database-americas@sun.com - Americas
explorer-database-emea@sun.com - EMEA (Europe, Middle East, Africa)
explorer-database-apac@sun.com - APAC (Asia, Pacific)
**************************************************************************
***********************************************************************

Monday 13 February 2012

When downtime is inevitable

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY :
Customer should shut OS down gracefully (using "init 0").
Then, at the OK prompt, type #. (as a key sequence) to get into ALOM.
Run the following commands from ALOM:
1. setlocator on
2. poweroff

WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
1. Shutting System Down.
2. Extending Server to Maintenance Position.
3. Performing Electrostatic Discharge Prevention Measures.
4. Disconnecting Power From Server.
5. Removing Top Cover.
6. Remove system controller from chassis.
7. Locate system controller card.

8. Push down on ejector levers on each side of system controller until card releases from socket.

9. Grasp top corners of card and pull it out of socket. Place system
controller card on antistatic mat.

10. Using a small flat-head screwdriver, carefully pry battery from system controller.

11. Unpack the replacement battery and press it into the system controller with the positive side facing upward (away from the card).

12. Re-install system controller. Holding bottom edge of system controller, carefully align system controller so that each of its contacts is centered on a socket pin. Ensure that system controller is correctly oriented and ejector levers are open. A notch along the bottom of system controller corresponds to a tab on socket. Push firmly and evenly on both ends of system controller until it is firmly seated in socket. You hear a click when ejector levers lock into place.

13. Use ALOM CMT setdate command to set day and time. Use setdate command before you power on host system.

14. Place top front cover on chassis. Slide front top cover forward until it snaps into place, being careful to avoid catching cover on intrusion switch.

15. Position bezel on front of chassis and snap into place.

16. Open fan door. Tighten captive screw to secure front bezel to chassis.

OBTAIN CUSTOMER ACCEPTANCE (this needs to be performed by the field engineer)
At ALOM prompt:
1. setlocator off
2. Use SC setdate command to set ALOM day and time (before you power on host system).
3. poweron -c

At OK prompt:
Note: The ALOM date and Solaris date are not in sync. The Solaris date may not be correct and should be set separately.
1. boot -s (to avoid any applications being affected by a wrong OS date).
2. Use date command to set the correct OS date (optional: init 0, boot -s to verify)
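
Because the ALOM date and the Solaris date are set separately, it can help to prepare the date string once and use it in both places. The Python sketch below is illustrative only. It assumes that both the ALOM 'setdate' command and the Solaris 'date' command accept an argument of the form mmddHHMMccyy.SS; verify this against the ALOM guide and date(1) for your release before use. The function name `date_argument` is hypothetical, and time zone handling is left out.

```python
from datetime import datetime


def date_argument(when: datetime) -> str:
    """Format a timestamp as mmddHHMMccyy.SS (assumed argument format)."""
    return when.strftime("%m%d%H%M%Y.%S")


if __name__ == "__main__":
    stamp = date_argument(datetime.now())
    # Printed for copy and paste; nothing is sent to ALOM or the OS.
    print(f"setdate {stamp}")
    print(f"date {stamp}")
```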

WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

Customer restarts software applications per applicable administration guides to resume system operation.

Friday 27 January 2012

How to upload core files to supportfiles.sun.com via ftp

Unsecure File Transfer Options

Oracle recommends that all users employ HTTPS or another secure method to transport files to Oracle. Users that choose to use FTP bear any risk associated with this method of file transport.

Standard method for unsecure upload to supportfiles.sun.com
ftp supportfiles.sun.com
login: anonymous
password: user@machine [ your email address]
cd to the appropriate directory*
binary [ set transfer mode]
put 62001234.tar.Z
quit
*Customers are instructed by the Oracle engineer to choose a destination directory in which to upload their file, based on the customer's location and type of file being uploaded. Choices are:
  • cores
  • iplanetcores
  • explorer
  • explorer-amer
  • explorer-apac
  • explorer-emea

Please Note: Plans are in place to End-Of-Life Supportfiles within the next 12 months. For Oracle Hardware product telemetry and for files greater than 2GB, we recommend Oracle Secure File Transport.

Tuesday 10 January 2012

How to collect critical troubleshooting information in the SC's log buffer

1) Log into a system which has access to the Main System Controller (SC) and open a terminal window.

2) Open a script session so the following SC command output will be captured.
$ script -a /tmp/scdatafile

3) Connect to the platform shell of the Main SC per your configuration's requirements (telnet, console, ssh, tip, etc):
$ console main-sc
$ telnet main-sc
$ ssh main-sc
$ tip main-sc
    NOTE: Do not reboot the main SC before collecting this data. Doing so may erase critical troubleshooting information in the SC's log buffer.

4) From the platform shell, execute the following commands which will be captured in the script session that you opened previously:
showdate
showsc -v
showescape
showkeyswitch
showcodlicense -v
showcodlicense -rv
showcodusage -v
showplatform -v
showplatform -vda
showplatform -vdb
showplatform -vdc
showplatform -vdd
showboards -ev
showcomponent
showfru -r manr
showchs -b (will fail for fw below 5.20.15)
    And for each suspect or faulty component
showchs -vc /N0/IB6 (for example)
showdate -v
showdate -v -d a
showdate -v -d b
showdate -v -d c
showdate -v -d d
showlogs -v
showlogs -vp (the -vp* commands will fail for systems with older SCs)
showlogs -vda
showlogs -vpda
showlogs -vdb
showlogs -vpdb
showlogs -vdc
showlogs -vpdc
showlogs -vdd
showlogs -vpdd
showerrorbuffer
showerrorbuffer -p
showenvironment -ltuv
history
showdate
    NOTE: You might need to use "Control right bracket" ("^]") to disconnect, depending on how you have connected to the SC.

5) Exit the script session to save the collected data:

    * Hit <control> D and you should get the message "script /tmp/scdatafile closed", "script done" or a similar message.
    * Alternatively, you can also type "exit" at the prompt to close the script session.

6) Upload the data file (scdatafile in this example) utilizing the instructions in Document 1020199.1.

    * It is suggested that the SR number be prepended to the file name, for example "SR_Number_scdatafile".
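
Several commands in step 4 are repeated once per domain (a through d). Generating those variants reduces the chance of mistyping a domain letter. The Python sketch below is illustrative only; the function name `per_domain_commands` is hypothetical, and it covers just the per-domain commands, not the full list. The output is meant to be pasted into the platform shell by hand.

```python
DOMAINS = ("a", "b", "c", "d")


def per_domain_commands(domains=DOMAINS):
    """Expand the per-domain command variants from step 4."""
    commands = []
    for d in domains:
        commands.append(f"showplatform -vd{d}")
    for d in domains:
        commands.append(f"showdate -v -d {d}")
    for d in domains:
        commands.append(f"showlogs -vd{d}")
        commands.append(f"showlogs -vpd{d}")  # may fail on older SCs
    return commands


if __name__ == "__main__":
    for command in per_domain_commands():
        print(command)
```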