unix sysadmin archives

Sunday, 12 June 2011

Replacing a bad disk on a Solaris box.

Here is a snapshot of replacing a bad disk on a Solaris box.

It usually starts with your monitoring system cutting a ticket that reports a problematic disk or a volume in the "Needs maintenance" state.

First, verify the alert against the system log files to confirm that the system really logged the error. If the alert is valid, get a clear picture of how severe the hardware failure is, so that you can declare an accurate severity when you raise the service request. Check whether the volume is still online, which is the usual case, since the standard is for every disk to be mirrored for redundancy.
If the volume is completely offline and the affected system is a production server, you can raise a severity 1 service request with Sun/Oracle to get immediate action.

In most cases you will raise a severity 2 or 3 service request, since most systems are built with redundancy.
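As a rough sketch of the triage step above, the small script below lists the submirrors that report "Needs maintenance". It only parses text on standard input; the function name list_failed_submirrors is made up for this example, and the layout it expects ("dNN: Submirror of dN" followed by an indented "State:" line) is an assumption about how metastat prints on your release, so check it against real output first.

```shell
# Hypothetical helper: print submirrors whose State is "Needs maintenance".
# Reads metastat-style text on stdin (assumed layout, see the note above).
list_failed_submirrors() {
    awk '
        # A non-indented line starts a new metadevice section.
        /^[^ \t]/ {
            name = ""
            if ($0 ~ /: Submirror of /) { name = $1; sub(/:$/, "", name) }
        }
        # Only report the state line that belongs to a submirror section.
        /State: Needs maintenance/ && name != "" { print name; name = "" }
    '
}

# On the Solaris box this would be used as:  metastat | list_failed_submirrors
```

If nothing is printed, the alert may be stale or the failure may not be in a Solaris Volume Manager submirror at all.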

Creating the Service Request

To raise a service request you need, of course, a service contract with Sun/Oracle. You will use it to create a support account on Oracle's support system. Since you have a Sun Microsystems box, you most probably already have one.
To sign up, go to this site and find the registration link to get started.

https://support.oracle.com/CSP/ui/flash.html

Note: Registration is a one-time step, and the confirmation process might take a day to finish.

Once your My Oracle Support account has been activated, you can raise your service request immediately. Log in, go to the service request panel, and click the Create SR button. Choose the hardware option. You will first be asked for the serial number of your box. Key it in and the system will search its database for all the information it holds on that machine.

Note: There are cases where the serial number cannot be found. Your Service Identifier may be what is blocking you. Ask whoever registered the box for the correct Service Identifier.

Once all the needed data has been found, it will be presented to you for checking.
Verify that all the pertinent information is valid, especially the location.

Once you have verified it, you will be taken to the details of your request. You will be asked for information such as the severity, a detailed problem statement, the steps you took to reproduce the error messages, recent changes made to the system, and any possible workaround.
At the end of the process you will be asked to upload the explorer file, which the service engineers will analyze.

To generate the explorer file, run this command.
  # /opt/SUNWexplo/bin/explorer -q

After you upload it, you are basically done with the service request. Take note of your service request number for tracking purposes. All you have to do now is wait for a call from the service engineer assigned to it. You can check the status of your service request on the support website at any time. You can also check it by phone: dial the 1800 number and give the service request number, and you will be connected to the engineer in charge.

While you are waiting, you can schedule the activity. Your organization most probably has rules to follow for changes like this one, so contact your change management team and ask for an appropriate outage window.

Once you have the schedule and the service engineers are engaged, you may proceed with the actual replacement of the defective part.

The actual replacement

Note: If you have monitoring on the mirrors, you might want to suppress it for a while during the change to avoid unnecessary alerts.

Start by force-detaching from their mirrors the submirrors that have components needing maintenance.
    E.g.
    # metadetach -f d0 d10
    # metadetach -f d1 d11
    # metadetach -f d3 d13
    # metadetach -f d4 d14
    # metadetach -f d5 d15
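When there are many mirrors, typing each detach by hand invites mistakes. Here is a minimal dry-run sketch that prints the metadetach commands for review instead of running them. The function name print_detach_cmds and the "mirror:submirror" argument format are inventions for this example, not part of Solaris Volume Manager.

```shell
# Hypothetical dry-run helper: print (do not run) one forced detach
# per "mirror:submirror" pair given on the command line.
print_detach_cmds() {
    for pair in "$@"; do
        mirror=${pair%%:*}      # text before the colon, e.g. d0
        submirror=${pair#*:}    # text after the colon, e.g. d10
        echo "metadetach -f $mirror $submirror"
    done
}

# Same pairs as the example above.
print_detach_cmds d0:d10 d1:d11 d3:d13 d4:d14 d5:d15
```

Review the printed lines against your change record, then run them as root once you are satisfied.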


Delete every metadevice that contains a sub-component in an error state.
    E.g.
    # metaclear d10
    # metaclear d11
    # metaclear d13
    # metaclear d14
    # metaclear d15
   

If there are any replicas on this disk, note how many there are, then remove them using the following:
    # metadb -i
    # metadb -d c1t0d0s7
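Since the replica count is needed again when the replicas are re-created, it helps to record it rather than count by eye. The sketch below counts the lines of metadb output that end with a given slice. The name count_replicas is made up for this example, and it assumes that metadb prints one line per replica with the device path as the last column.

```shell
# Hypothetical helper: count state database replicas on one slice.
# $1 = slice name such as c1t0d0s7; metadb output is read from stdin.
count_replicas() {
    awk -v dev="/dev/dsk/$1" '$NF == dev { n++ } END { print n + 0 }'
}

# On the Solaris box, before running metadb -d:
#   metadb | count_replicas c1t0d0s7
```

Write the number down in the change ticket so the same count can be passed to metadb -a -c later.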

Verify there are no existing metadevices or replicas left on the disk.
    E.g. 
    # metastat -p | grep c1t0d0
    # metadb |grep c1t0d0
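If you script the procedure, the same check can act as a guard that stops before the disk is unconfigured. This is only a sketch: disk_still_referenced is a hypothetical name, and it simply looks for any slice of the disk (the disk name followed by "s" and a digit) in whatever text it is given.

```shell
# Hypothetical guard: succeed (exit status 0) if any slice of the disk
# still appears in the text on stdin.
# $1 = disk name such as c1t0d0.
disk_still_referenced() {
    grep -q "${1}s[0-9]"
}

# On the Solaris box this could be used as:
#   if { metastat -p; metadb; } | disk_still_referenced c1t0d0; then
#       echo "c1t0d0 is still in use, stopping" >&2
#       exit 1
#   fi
```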

Unconfigure the disk so that it can be removed. E.g.
    # cfgadm -c unconfigure c1::dsk/c1t0d0

Run the devfsadm cleanup routines to remove the stale device links
    E.g.
    # devfsadm -C -c disk

Verify the disk has been removed.
    E.g.
    # cfgadm -al
    # ls -ld /dev/dsk/c1t0d0*


Then ask the service engineer to physically replace the failed disk. Once the engineer confirms that the failed disk has been replaced, configure the new disk and create its device paths
    E.g.
    # devfsadm -C -v
    # cfgadm -c configure c1::dsk/c1t0d0
    # devfsadm -C -v

Verify the disk
    E.g.
    # cfgadm -al
    # ls -ld /dev/dsk/c1t0d*
    # luxadm inq /dev/rdsk/c1t0d0s2

Put the desired partition table on the new disk by piping the prtvtoc output of a disk with the same layout (c1t1d0, the other disk, in this example) into fmthard
    E.g.
    # prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2


To verify, run format or prtvtoc
    E.g.
    # prtvtoc /dev/rdsk/c1t0d0s2
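Comparing two partition tables by eye is error-prone, so here is a small sketch that compares saved prtvtoc output from the source disk and the new disk. The function names vtoc_table and same_vtoc are made up for this example. It assumes that prtvtoc comment lines start with "*" and that the first six columns (partition, tag, flags, first sector, sector count, last sector) are what should match; the mount directory column is ignored on purpose.

```shell
# Hypothetical helpers for comparing two saved prtvtoc outputs.

# Print the first six columns of every non-comment line in file $1.
vtoc_table() {
    awk '!/^\*/ && NF >= 6 { print $1, $2, $3, $4, $5, $6 }' "$1"
}

# Succeed (exit status 0) if both files describe the same slice layout.
same_vtoc() {
    [ "$(vtoc_table "$1")" = "$(vtoc_table "$2")" ]
}

# On the Solaris box:
#   prtvtoc /dev/rdsk/c1t1d0s2 > /tmp/source.vtoc
#   prtvtoc /dev/rdsk/c1t0d0s2 > /tmp/new.vtoc
#   same_vtoc /tmp/source.vtoc /tmp/new.vtoc && echo "layouts match"
```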


If necessary, re-create the same number of replicas that existed previously
    E.g.
    # metadb -a -c 3 c1t0d0s7

Recreate each metadevice to be used as submirrors
    E.g.
    # metainit d10 1 1 c1t0d0s0
    # metainit d11 1 1 c1t0d0s1
    # metainit d13 1 1 c1t0d0s3
    # metainit d14 1 1 c1t0d0s4
    # metainit d15 1 1 c1t0d0s5
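The metainit lines do not have to be rebuilt from memory. The sketch below assumes you saved the output of metastat -p to a file before the metaclear step, and that each submirror is a simple one-slice metadevice printed as "dNN 1 1 slice", as in the example above. The function name metainit_cmds_for_disk is hypothetical. It prints the commands and does not run them.

```shell
# Hypothetical helper: turn saved "metastat -p" lines into metainit commands.
# $1 = disk name such as c1t0d0; saved metastat -p output is read from stdin.
# Only simple "dNN 1 1 <slice>" entries on that disk are printed.
metainit_cmds_for_disk() {
    awk -v disk="$1" '
        $2 == "1" && $3 == "1" && index($4, disk "s") == 1 {
            print "metainit", $1, $2, $3, $4
        }
    '
}

# On the Solaris box, with a file saved before metaclear:
#   metainit_cmds_for_disk c1t0d0 < /var/tmp/metastat-p.before
```

The file path in the usage comment is only a placeholder; use wherever you keep your pre-change snapshots.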
  


To verify, use the metastat command. Then attach the submirrors to their mirrors to start the resynchronization.
    E.g.
    # metattach d0 d10
    # metattach d1 d11
    # metattach d3 d13
    # metattach d4 d14
    # metattach d5 d15

  
Monitor the syncing until it reaches 100%, and you are done! There you go!
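To keep an eye on the resync without rereading the whole metastat output, a small filter like the one below can help. It is a sketch only: resync_progress is a made-up name, and it assumes metastat reports progress on a line of the form "Resync in progress: 15 % done" inside each mirror's section, so confirm that against your own output.

```shell
# Hypothetical helper: print "<metadevice> <percent>%" for every mirror
# that reports a resync in progress. Reads metastat-style text on stdin.
resync_progress() {
    awk '
        # A non-indented line starts a new metadevice section.
        /^[^ \t]/ { name = $1; sub(/:$/, "", name) }
        /Resync in progress:/ {
            for (i = 1; i < NF; i++)
                if ($i == "progress:") print name, $(i + 1) "%"
        }
    '
}

# On the Solaris box, poll once a minute until no resync is reported:
#   while metastat | grep "Resync in progress" > /dev/null; do
#       metastat | resync_progress
#       sleep 60
#   done
```

Once the loop ends, run metastat one more time and confirm that every submirror shows an Okay state before you re-enable monitoring.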




