unix sysadmin archives

Sunday, 25 December 2011

Ten Points to Check When Joining a New Team

During the initial stage of a new job, focus on learning the history of the current environment from the existing team whenever you get a chance to discuss it.

1. Know the job scope of your team

Team scope is important to understand as soon as you join a new job, because it helps you set your learning priorities for that job.

For example, suppose you join a team in a large organization whose scope is to support a set of servers that run only databases. Your first priority becomes understanding how the database works on Unix and the basics of database terminology. If your team does not support any DNS, NIS, or DHCP servers because a different team controls them, you need not worry about those servers in your initial learning.

2. Know the technical architecture of the environment

The technical architecture of the environment covers the points below:

    a. How many servers in total (commonly called the "server footprint") do we support, and where are they located (i.e. data center information)?

    b. What operating systems are in use right now, and what are the supported hardware models?

    c. Which operating environments does the team support? e.g. Production, Testing, or Development.

    d. What applications currently run in our server environment, and who uses them? e.g. Sybase, ClearCase, WebLogic, etc.

    e. What storage is in use right now, and what console systems do we use to connect to the servers remotely? e.g. EMC, NetApp, Cyclades consoles, etc.

    f. What storage management software is in use on which operating systems? e.g. LVM, VxVM, ZFS, etc.

3. Know the procedures and escalation paths

Ideally, any system administrator deals with three types of operations:

    a. Break/fix activities (widely known as incidents)

    This mainly involves fixing issues that arise in a properly working environment, e.g. a disk failure on a server, a Unix server that crashed due to overload, a network failure caused by a network port problem, etc.

    b. Changes and service requests

    Change operations mainly involve introducing a configuration, hardware, or application change into the currently running environment, either to improve stability or to improve security.

    Service requests involve performing operations on specific user requests, such as creating user accounts, changing permissions, installing a new server (called server commissioning), removing a server (called server decommissioning), etc.

    c. Auditing the server environment to assess the Quality of Service (QoS)

    This mainly involves periodically checking all the servers for configuration or security vulnerabilities that compromise the stability of the server environment, and remediating such vulnerabilities by requesting configuration changes.

To perform these three kinds of operations, every organization has internal rules that define how to act, when to act, and what to act on. These rules vary from job to job, so during the initial stage of your job you should understand them and perform your duties accordingly.

Note: ITIL (Information Technology Infrastructure Library) provides guidelines for defining these rules in a standard way in any IT organization. Nowadays, major companies are streamlining their procedures to follow the ITIL guidelines, so that the environment remains easy to manage even after the people who created it leave the organization. Learning ITIL is always beneficial to system admins (or any infrastructure support person).

4. Supporting tools/applications and your access to them

To perform the support operations discussed in the previous point, organizations need proper tools and applications that let their employees and support staff request and respond in an automated way, following the procedures defined in the organization, e.g. the Remedy ticketing tool, HP Service Manager, etc.

Once you join a new team, make sure you request access to all the related tools in time and test that access.

5. Intercommunication procedures with other support teams and vendors

For a system admin, a major part of the daily job involves communicating with other support teams, such as the database team, network team, application team, hardware vendors, data center support team, etc.

For successful service delivery, it is important for system administrators to have all of their contact details (phone, email, and internal chat IDs) handy. So gather the information and put it into a good document that you can use on the job. It is very important to write this information down and keep it safe, because minor issues often turn into major problems when we don't know whom to contact as soon as we notice the issue.

6. Know where to find the information

Every team has some kind of documentation that explains the operations it performs, and this documentation gives you more information than any individual can share with you. Unfortunately, reading all these documents does not help much in understanding what is actually going on during your initial stage in the team, but the same documents might save your life once you start working actively in the team.

During the initial stage, just find out where the documentation is stored and get access to it. Then skim through the entire documentation (you don't need to remember everything you read), so that you know where to look when you need a specific piece of information related to a specific issue.

7. Know the details of important infrastructure servers

Ideally, system administrators classify their servers into two groups: the first is the servers used by users (e.g. database servers and application servers), and the second is the infrastructure servers used to manage the first set effectively (e.g. JumpStart remote installation servers and DHCP, DNS, NIS, and LDAP servers).

As I explained in point 1, you may or may not manage these infrastructure servers depending on the scope of your team, but you must know their details because every other server in your environment depends on them.

    Below are the important questions to find answers to during the initial stage of the job:

    a. What name servers (DNS / NIS / LDAP) are we using, and what are the names, aliases, and IPs of those servers?

    b. What remote installation (JumpStart / Kickstart) servers are we using, and what access do we have to them?

    c. Is there a DHCP server in the environment, or is addressing managed by customized tools? e.g. QIP, etc.
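
For the name server question, one quick starting point is the resolver configuration of any server you can already log in to. The sketch below is only an illustration: it assumes the host uses a resolv.conf-style file, and the function name list_nameservers is made up for this example.

```shell
# list_nameservers: hypothetical helper. Reads resolv.conf-style text on
# stdin and prints the address from every "nameserver" line.
# Commented-out lines are skipped because their first field is "#".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}

# Typical use on a live host:
#   list_nameservers < /etc/resolv.conf
```

This only covers DNS; NIS and LDAP servers are configured elsewhere and still need to be confirmed with the team.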

8. Get ready with appropriate logistics

Every Unix administrator starts work by requesting access to a Windows product (desktop access / Outlook). The moment you join a new job, start requesting your desktop PC login, VoIP phone (with international dialing if your job requires calls overseas), email account, internal chat messenger access, data center access (if your job requires physical access to the DC), smart cards / security tokens, etc.

The moment you get email access, you may have to manage the flood of emails that reaches your team every day. You might have to create appropriate Outlook rules to filter out emails that you don't have to respond to during the first one or two months of the new job. Later, you can slowly start reading and responding to them once you are actually ready to work on the floor.

9. Areas of automation, and the specific details

A system administrator cannot survive in the job without knowing how to automate, through scripting, the work that is done repeatedly. Whenever you join a new team, you should specifically ask about any automated scripts that are in place and used for the day-to-day job.

Most of the time, system admins write scripts to perform daily or weekly system health checks, and these scripts may run regularly from specific servers using the cron scheduler. It is better to know about them beforehand, which will also help if you want to introduce your own scripts for the team's benefit.
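
To give an idea of what such a health check can look like, here is a minimal sketch of a filesystem capacity check. It is not taken from any particular team's scripts: the function name full_filesystems, the script path in the cron line, and the 90% default are all assumptions, and it expects the usual six-column `df -k` layout. On Solaris you may need nawk or /usr/xpg4/bin/awk instead of the default awk for the -v option.

```shell
# full_filesystems: hypothetical helper. Reads `df -k` output on stdin and
# prints "mountpoint capacity" for every filesystem at or above the limit.
# Assumes one filesystem per line with capacity in column 5.
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # strip the percent sign
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Typical use:
#   df -k | full_filesystems 90
# Example cron entry (path and schedule are assumptions), daily at 06:00:
#   0 6 * * * /usr/local/bin/fs_check.sh
```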

10. Understanding monitoring alerts and response procedures

As I explained in point 8, you will receive tons of mail the moment your email ID is added to the team DL (email distribution list), and a major part of it may come from an automated monitoring system that checks the health of your server environment and informs the system admin team as soon as it notices an issue. If you start receiving such mails, don't ignore them just because you don't know what to do with them. Instead, note these alerts and keep asking your team how to respond to them.

Also set automatic reminders in Outlook for the important alerts that are critical and urgent in nature, so that you won't miss them.

What does your experience say about this? Share it with us.

If you find this post useful, share it, so that friends who are changing jobs can benefit from it.

Thursday, 1 December 2011

How to Force a Crash Dump When the Solaris Operating System is Hung

In most cases, a system crash dump of a hung system can be forced. However, this is not guaranteed to work for all system hang conditions. To force a dump, you often need to drop down to the boot PROM monitor (OBP) prompt, also known as the "OK prompt", suspending all current program execution.

There are several ways to drop a Sun system to the OK prompt.
1. On older Sun systems with a serial (PS2 type) Sun keyboard and monitor attached, this suspension is performed via a "Stop-A". The upper left key on a Sun keyboard is labeled "Stop". While holding down this key, press the A key.

2. On systems using ASCII terminals for the console, the terminal's predefined break sequence can be used to get to the boot PROM monitor.

3. Newer Sun systems with USB keyboards may require an alternate sequence.

4. Some Sun systems have a system controller/SSP (Enterprise 10000/15000, Sun Fire X800) or ALOM/RSC (Vx80/Vx90 and most new Netra servers) instead of serial port/keyboard access. These can be used to break a hanging system or domain.


Note: There are special procedures for Sun SPARC(R) Enterprise Mx000 (OPL) Servers, T1000/T2000 systems, x86 and x64 systems.


The boot PROM monitor will respond with:

Type 'go' to resume
ok

If you don't see this message, you were probably not successful in stopping the system.

Once at the ok prompt, type 'sync' (without the quotes) and press Enter.

The system will immediately panic. Now the hang condition has been converted into a panic, so an image of memory can be collected for later analysis. The system will attempt to reboot after the dump is complete.

The sync command forces the system to illegally use location 0, which causes a "panic: zero". On later revisions of Solaris 8 and above you will see "panic: sync initiated" instead.

Not all hang situations can be interrupted. If Stop-A or Break doesn't work, sometimes a series of the same will do the trick. Some hangs are even more stubborn and can only be interrupted by physically disconnecting the console keyboard or terminal from the system for a minute, and then plugging it back in.

If all these attempts fail, you will have to power down the system, thus sadly losing the contents of memory. With luck, a subsequent hang will be interruptible.


NOTE: On the systems with keyswitches, be sure the key is not in the secure position, as this disables the break interrupt in the zs driver.

Sunday, 20 November 2011

zstat-process

In case you find that the file /var/adm/exacct/zstat-process has grown very large, here is a workaround to reclaim the space: restart the zstat service.


# df -kh /var
Filesystem             Size   Used  Available Capacity  Mounted on
/dev/md/dsk/d3         4.9G   4.4G       529M    90%    /var

# find /var -xdev -type f -size +100000 -ls -exec du -sk {} \;
  273 2740992 -rw-------   1 root     root     2805391217 Nov 20 06:32 /var/adm/exacct/zstat-process
2740992 /var/adm/exacct/zstat-process

# svcs -a | grep zstat
online         Apr_05   svc:/application/xvm/zstat:default

# svcadm restart svc:/application/xvm/zstat:default

# svcs -a | grep zstat
online          6:38:37 svc:/application/xvm/zstat:default

# df -k /var
Filesystem           1024-blocks        Used   Available Capacity  Mounted on
/dev/md/dsk/d3           5166102     1831942     3282499    36%    /var

# df -kh /var
Filesystem             Size   Used  Available Capacity  Mounted on
/dev/md/dsk/d3         4.9G   1.7G       3.1G    36%    /var

# find /var -xdev -type f -size +100000 -ls -exec du -sk {} \;
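
The find command used above can be wrapped in a small function so the same check is easy to repeat on other filesystems. This is only a sketch: the name big_files is made up, and the default of 100000 blocks simply mirrors the value used in the transcript (find counts -size in 512-byte blocks).

```shell
# big_files: hypothetical helper. Lists regular files under DIR that are
# larger than BLOCKS 512-byte blocks, staying on one filesystem (-xdev).
# Output is "size-in-KB path", largest first.
big_files() {
    dir=$1
    blocks=${2:-100000}
    find "$dir" -xdev -type f -size +"$blocks" -exec du -sk {} \; | sort -rn
}

# Typical use:
#   big_files /var
```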

Monday, 24 October 2011

sesudo


Executes commands that require superuser authority on behalf of a regular user.
SYNOPSIS
sesudo [[ -h ] | [command [parameters]]]
DESCRIPTION
The sesudo command borrows the permissions of another user (known as the target user) to perform one or more commands. This enables regular users to perform actions that require superuser authority, such as the mount command. The rules governing the user's authority to perform the command are defined in the SUDO class.
Notes
  • You must define the access rules for the user in the SUDO class. The definition may specify commands that the user can use and commands that the user is prohibited from using.
  • The output depends on the command that is being executed. Error messages are sent to the standard error device (stderr), usually defined as the terminal screen.
  • To execute the sudo command, the user should specify the following command at the UNIX shell prompt:
    sesudo profile_name
    
  • You can choose whether the command is displayed before it is executed. The default value is that commands are not displayed. To display commands, change the value in the echo_command token in the sesudo section of the seos.ini file.
Arguments
-h
Displays the help screen.
command [parameters]
Specifies the command that is to be performed on behalf of the user. The command name must be the name of a record in the SUDO class. Multiple parameters can be specified, provided they are separated by spaces.
Prerequisites: Define SUDO Commands
Several steps must be performed before it is possible to use the sesudo command. The first step needs to be done only once. Other steps need to be done every time a new user is given the authority to execute the sesudo command, or every time a new profile is defined in the SUDO class.
  1. Define the sesudo program as a trusted setuid program owned by root. This step only needs to be done once per TACF installation. The format of the command is:
    newres PROGRAM /usr/seos/bin/sesudo defaccess(NONE)
    
  2. Give a user the authority to execute the sesudo program. Do this once for every user who is entitled to this authority. The format of the command is:
    authorize PROGRAM /usr/seos/bin/sesudo uid(user_name)
    
  3. Permit the user to surrogate to the target user using the sesudo program. Do this for every user who should have this authority, and do it for every target user ID that you want to make available to the user. The format of the command is:
    authorize SURROGATE USER.root uid(user_name) \
    via(pgm(/usr/seos/bin/sesudo))
    
  4. Define new records in the SUDO class for every command to be executed by users. For each command script, you can define permitted and forbidden parameters, permitted users, and password protection. If no parameters are specified as permitted or prohibited, then all parameters are permitted. The format of the command is:
    newres SUDO profile_name \
    data('cmd[;[prohibited-params][;permitted-params]]')
    

    A command can have prohibited and permitted parameters for each operand. The prohibited parameters and the permitted parameters for each operand are separated by the pipe symbol (|). The format is:

    newres SUDO profile_name \
    data('cmd;pro1|pro2|...|proN;per1|per2|...|perN')
    

    sesudo checks each parameter entered by the user in the following manner:
    1. Test if parameter number N matches permitted parameter N. (If permitted parameter N does not exist, the last permitted parameter is used.)
    2. Test if parameter number N matches prohibited parameter N. (If prohibited parameter N does not exist, the last prohibited parameter is used.)

    Only if all the parameters match permitted parameters, and none match prohibited parameters, does sesudo execute the command.
  5. Permit the user to access the profile that has been defined in the SUDO class. Do this for every profile a user should be able to access. The format of the command is:
    authorize SUDO profile_name uid(user_name)
    
    If defaccess is none, specify each user who is granted permission with the authorize command. If defaccess is not set otherwise, use the authorize command to specify each user to whom access is forbidden.
  6. The sesudo command can display the command before executing it. Display depends on the value in the echo_command token in the [sesudo] section of the seos.ini file. The default value calls for no display, but the value can be changed.
  7. The output of the sesudo command depends on the command being performed. Error messages are sent to the standard error device (stderr), usually defined as the terminal's screen.
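
The parameter check described in step 4 can be illustrated with a short shell sketch. This is a simplified model and not sesudo's actual implementation: it assumes one glob pattern per operand, ignores the $-variables, and the function names pattern_for and check_params are made up for this example.

```shell
# pattern_for LIST N: print the Nth '|'-separated pattern of LIST,
# or the last pattern when LIST has fewer than N entries.
pattern_for() {
    printf '%s\n' "$1" | awk -F'|' -v n="$2" '{ i = (n <= NF) ? n : NF; print $i }'
}

# check_params PROHIBITED PERMITTED ARG...
# Succeeds only if every ARG matches its permitted pattern and none
# matches its prohibited pattern. An empty list imposes no restriction.
check_params() {
    prohibited=$1
    permitted=$2
    shift 2
    n=0
    for arg in "$@"; do
        n=$((n + 1))
        per=$(pattern_for "$permitted" "$n")
        pro=$(pattern_for "$prohibited" "$n")
        if [ -n "$per" ]; then
            case $arg in
                $per) ;;             # matches the permitted pattern
                *) return 1 ;;
            esac
        fi
        if [ -n "$pro" ]; then
            case $arg in
                $pro) return 1 ;;    # matches the prohibited pattern
            esac
        fi
    done
    return 0
}
```

For instance, `check_params '-9' '*' -TERM 1234` succeeds, while `check_params '-9' '*' -9 1234` fails because the first parameter matches the prohibited pattern.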
SUDO Record: Parameters and Variables
The special parameters used in connection with the SUDO record are explained in the following list:
profile_name
The name the security administrator gives to the superuser command.
cmd
The superuser command that a normal user can execute.
prohibited parameters
The parameters that you prohibit the regular user from invoking. These parameters may contain patterns or variables.
permitted parameters
The parameters that you specifically allow the regular user to invoke. These parameters may contain patterns or variables.
Prohibited and permitted parameters may also contain variables as described in the following list:
$A
Alphabetic value
$G
Existing TACF group name
$H
Home path pattern of the user
$N
Numeric value
$O
Executor's user name
$U
Existing TACF user name
$f
Existing file name
$g
Existing UNIX group name
$h
Existing host name
$r
Existing UNIX file name with UNIX read permission
$u
Existing UNIX user name
$w
Existing UNIX file name with UNIX write permission
$x
Existing UNIX file name with UNIX exec permission
Return Value
Each time the sesudo command runs, it returns one of the following values:
-2
Target user not found, or command interrupted
-1
Password error
0
Execution successful
10
Problem with usage of parameters
20
Target user error
30
Authorization error
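
When sesudo is called from a script, the return values above can be turned into readable messages. The sketch below is an illustration only, and the function name explain_sesudo_status is made up. It relies on the fact that a shell reports exit statuses modulo 256, so the -2 and -1 values listed above would be seen as 254 and 255.

```shell
# explain_sesudo_status: hypothetical helper. Maps the exit status seen by
# the shell to the meanings in the Return Value list.
explain_sesudo_status() {
    case $1 in
        0)   echo "Execution successful" ;;
        10)  echo "Problem with usage of parameters" ;;
        20)  echo "Target user error" ;;
        30)  echo "Authorization error" ;;
        254) echo "Target user not found, or command interrupted" ;;  # -2
        255) echo "Password error" ;;                                 # -1
        *)   echo "Unknown status $1" ;;
    esac
}

# Typical use (profile_name is a placeholder for a real SUDO record):
#   sesudo profile_name; explain_sesudo_status $?
```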
EXAMPLES
  1. If you do not allow any parameters, define the profile in the following way:
    newres SUDO profile_name data('cmd;*')
    
  2. If you want to allow the user to invoke the name parameter, do the following:
    newres SUDO profile_name data('cmd;;NAME')
    
    In the previous example, the only parameter the user can enter is NAME.
  3. If you want to prevent the user from using -9 and -HUP but you permit the user to use all other parameters, do the following:
    newres SUDO profile_name data('cmd;-9 -HUP;*')
    
  4. If there are two prohibited parameters, the first is the UNIX user name and the second is the UNIX group name, and there are two permitted parameters, the first can be numeric and the second can be alphabetic, enter the following:
    newres SUDO profile_name \
    data('cmd;$u | $g ;$N | $A')
    
    The user cannot enter the UNIX user name, but can enter a numeric parameter for the first operand; and the user cannot enter the UNIX group name but can enter an alphabetic parameter for the second operand.
  5. If there are several prohibited parameters for several operands in the command, enter the following:
    newres SUDO profile_name \
    data('cmd;pro1 pro2 | pro3 pro4 | pro5 pro6')
    
    pro1 and pro2 are the prohibited parameters of the first operand of the command; pro3 and pro4 are the prohibited parameters of the second operand of the command; and pro5 and pro6 are the prohibited parameters of the third operand of the command.

Thursday, 20 October 2011

PICL bug causes Solaris 10 prtdiag to hang

The Solaris PICL framework provides information about the system configuration, which it maintains in the PICL tree. I ran into a case where prtdiag on Solaris 10 was hanging. To fix it, restart picld (stop and start the daemon).

# top
load averages: 1582.95, 1462.52, 1345.91 22:57:54
8548 processes:8532 sleeping, 1 running, 1 zombie, 14 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 8064M real, 2747M free, 4123M swap in use, 9005M swap free
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
13622 root 999 59 0 318M 297M sleep 370.3H 82.48% java
26222 root 1 0 0 0K 0K cpu/6 7:52 5.88% ps
25217 root 1 20 0 0K 0K sleep 11:59 5.49% ps
27618 root 1 0 0 0K 0K cpu/5 1:08 5.43% ps
27101 root 1 0 0 0K 0K cpu/4 3:31 5.16% ps

***You can see here that PID 13622 is using a lot of CPU.
***When you check its child processes, they point to prtdiag

# ps -ef | grep 13622

root 23066 13622 0 Oct 15 ? 0:00 /usr/bin/ctrun -l child -o pgrponly /bin/sh -c /usr/sbin/prtdiag
root 802 13622 0 Oct 15 ? 0:00 /usr/bin/ctrun -l child -o pgrponly /bin/sh -c /usr/sbin/prtdiag
root 28092 13622 0 Oct 15 ? 0:00 /usr/bin/ctrun -l child -o pgrponly /bin/sh -c /usr/sbin/prtdiag

***Restart the PICL
# svcadm restart picl

***Check the load via uptime

# uptime
1:26am up 50 day(s), 11:22, 3 users, load average: 1886.79, 1513.28, 1402.57

***After a couple of minutes check it again
# uptime
1:26am up 50 day(s), 11:23, 3 users, load average: 962.59, 1327.11, 1343.05

***You can observe a dramatic drop in the load
# top
load averages: 3.71, 367.87, 875.60 01:33:23
76 processes: 75 sleeping, 1 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 8064M real, 5630M free, 1103M swap in use, 12G swap free

PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
15726 root 1 42 0 108M 33M sleep 21:44 2.89% bptm
11116 root 1 52 0 108M 33M sleep 10:50 0.76% bptm
13622 root 78 59 0 215M 200M sleep 403.3H 0.12% java
15611 root 1 59 0 108M 67M sleep 5:09 0.10% bptm
16953 root 1 0 0 4432K 2160K cpu/11 0:00 0.08% top
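
If you want to watch the load come down from a script instead of re-running uptime by hand, the 1-minute load average can be pulled out of the uptime output. This is a minimal sketch: the function names load1 and load_above are made up, and it assumes the "load average:" wording shown in the transcript above. On Solaris you may need nawk for the -v option.

```shell
# load1: hypothetical helper. Reads `uptime` output on stdin and prints the
# 1-minute load average (the first number after "load average:").
load1() {
    sed -n 's/.*load average[s]*: *\([0-9.]*\).*/\1/p'
}

# load_above LIMIT: succeeds when the 1-minute load on stdin exceeds LIMIT.
# Fails when no load figure could be parsed.
load_above() {
    load1 | awk -v limit="$1" '
        NR == 1 { found = 1; high = ($1 + 0 > limit + 0) }
        END     { exit (found && high) ? 0 : 1 }'
}

# Typical use:
#   uptime | load_above 100 && echo "load is still high"
```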

Monday, 17 October 2011

Introduction to Veritas Cluster Services

In any organization, every server in the network has a specific purpose, and most of the time these servers provide a stable environment for the software applications that the organization's business requires. Usually these applications are critical to the business, and the organization cannot afford to have them down even for minutes. For example, a bank has an application that handles its internet banking.

If an application is not critical to the business, the organization can consider running it standalone; in other words, when the application goes down it won't impact the actual business.
Usually, the clients of these applications connect to the application server using the server name, the server IP, or a specific application IP.

Now assume the organization has an application that is critical to its business, and any impact on the application will cause a huge loss. In that case, the organization has one option to reduce the impact of an application failure caused by the operating system or hardware: purchase a secondary server with the same hardware configuration, install the same OS and database, and configure it with the same application in passive mode. Then "fail over" the application from the primary server to this secondary server whenever there is an issue with the underlying hardware or operating system of the primary server. We call this an application server with a highly available configuration.

Whenever an issue on the primary server makes the application unavailable to the client machines, the application should be moved to another available server in the network, either manually or automatically. Transferring the application from the primary server to the secondary server and making the secondary server active for the application is called a "failover" operation. The reverse operation (i.e. restoring the application on the primary server) is called "failback". Thus, we can call this configuration an application HA (highly available) setup, compared to the earlier standalone setup.

Now the question is: how does manual failover work when there is an application issue caused by the hardware or operating system?

Manual failover basically involves the steps below:

     1. The application IP should fail over to the secondary node.
     2. The same storage and data should be available on the secondary node.
     3. Finally, the application should fail over to the secondary node.

Challenges in a manual failover configuration

    1. Resources must be monitored continuously.
    2. It is time consuming.
    3. It is technically complex when the application involves many dependent components.

On the other hand, we can use automatic failover software, which does the work without human intervention. It groups the primary and secondary servers related to the application, keeps an eye on the primary server for failures, and fails the application over to the secondary server automatically whenever there is an issue with the primary server.
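
To make the idea of "keeping an eye on the primary server" concrete, here is a toy sketch of the decision such software has to make. It is not how VCS or any real cluster product works: the function name needs_failover, the PROBE_INTERVAL variable, and the rule "fail over only after several consecutive failed checks" are all assumptions made for illustration.

```shell
# needs_failover CHECK_CMD ATTEMPTS
# Runs CHECK_CMD up to ATTEMPTS times. Succeeds (exit 0) only if every
# attempt fails, i.e. the primary never answered. A single good answer
# means no failover is needed.
needs_failover() {
    check=$1
    attempts=$2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if $check >/dev/null 2>&1; then
            return 1                      # primary answered
        fi
        i=$((i + 1))
        sleep "${PROBE_INTERVAL:-0}"      # seconds between probes
    done
    return 0
}

# Typical use (check_primary.sh is a placeholder for your own check script):
#   PROBE_INTERVAL=5
#   needs_failover "/usr/local/bin/check_primary.sh" 3 && echo "start failover"
```

Requiring several failed probes in a row is a common way to avoid failing over on a single lost packet; real cluster software uses far more robust membership and heartbeat mechanisms.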

Although two different servers support the application, both of them serve the same purpose. From the application client's perspective, they should be treated as a single application cluster server (composed of multiple physical servers in the background).

Now you know that a cluster is nothing but "a group of individual servers working together to serve the same purpose, and appearing as a single machine to the external world".

What cluster software is available in the market today? There are many, depending on the operating system and application to be supported. Some are native to the operating system, and others come from third-party vendors.

List of cluster software available in the market

    * Sun Cluster Services – native Solaris cluster
    * Linux Cluster Server – native Linux cluster
    * Oracle RAC – application-level cluster for the Oracle database that works on different operating systems
    * Veritas Cluster Services – third-party cluster software that works on different operating systems such as Solaris, Linux, AIX, and HP-UX
    * HACMP – IBM AIX-based cluster technology
    * HP-UX native cluster technology

Note: This post discusses VCS and its operations. It does not cover the actual implementation or any VCS command syntax, but it does cover the concept of how VCS makes an application highly available (HA).

Veritas Cluster Services Components
VCS has two types of components: physical components and logical components.

Physical Components:
1. Nodes
VCS nodes host the service groups (managed applications). Each system is connected to networking hardware, and usually also to storage hardware. The systems contain components to provide resilient management of the applications, and start and stop agents.
Nodes can be individual systems, or they can be created with domains or partitions on enterprise-class systems. Individual cluster nodes each run their own operating system and possess their own boot device. Each node must run the same operating system within a single VCS cluster.
Clusters can have from 1 to 32 nodes. Applications can be configured to run on specific nodes within the cluster.

2. Shared storage
Storage is a key resource of most applications services, and therefore most service groups. A managed application can only be started on a system that has access to its associated data files. Therefore, a service group can only run on all systems in the cluster if the storage is shared across all systems. In many configurations, a storage area network (SAN) provides this requirement.
You can use I/O fencing technology for data protection. I/O fencing blocks access to shared storage from any system that is not a current and verified member of the cluster.

3. Networking Components
Networking in the cluster is used for the following purposes:
    *Communications between the cluster nodes and the Application Clients and external systems.
    *Communications between the cluster nodes, called Heartbeat network.


Logical Components
1. Resources
Resources are hardware or software entities that make up the application. Resources include disk groups and file systems, network interface cards (NIC), IP addresses, and applications.
    1.1. Resource dependencies
    Resource dependencies indicate resources that depend on each other because of application or operating system requirements. Resource dependencies are graphically depicted in a hierarchy, also called a tree, where the resources higher up (parent) depend on the resources lower down (child).
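
    A dependency tree like this fixes the order in which resources can be brought online: children first, parents last. The sketch below shows that ordering idea with the standard tsort utility. It is not VCS syntax, and the resource names (diskgroup, mount, nic, ip, application) are hypothetical examples.

```shell
# online_order: hypothetical helper. Each input line is "child parent",
# meaning the parent depends on the child. tsort prints an order in which
# every child appears before the parents that depend on it.
online_order() {
    tsort <<'EOF'
diskgroup mount
mount application
nic ip
ip application
EOF
}

# Typical use:
#   online_order     # one resource per line, children before parents
```

    Taking resources offline would follow the reverse order, parents first.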
   
    1.2. Resource types
    VCS defines a resource type for each resource it manages. For example, the NIC resource type can be configured to manage network interface cards. Similarly, all IP addresses can be configured using the IP resource type.
    VCS includes a set of predefined resource types. For each resource type, VCS has a corresponding agent, which provides the logic to control resources.

2. Service groups
A service group is a virtual container that contains all the hardware and software resources that are required to run the managed application. Service groups allow VCS to control all the hardware and software resources of the managed application as a single unit. When a failover occurs, resources do not fail over individually— the entire service group fails over. If there is more than one service group on a system, a group may fail over without affecting the others.

A single node may host any number of service groups, each providing a discrete service to networked clients. If the server crashes, all service groups on that node must be failed over elsewhere.

Service groups can be dependent on each other. For example a finance application may be dependent on a database application. Because the managed application consists of all components that are required to provide the service, service group dependencies create more complex managed applications. When you use service group dependencies, the managed application is the entire dependency tree.

2.1. Types of service groups

VCS service groups fall in three main categories: failover, parallel, and hybrid.

   * Failover service groups
    A failover service group runs on one system in the cluster at a time. Failover groups are used for most applications that do not support multiple systems to simultaneously access the application’s data.

   * Parallel service groups
    A parallel service group runs simultaneously on more than one system in the cluster. A parallel service group is more complex than a failover group. Parallel service groups are appropriate for applications that manage multiple application instances running simultaneously without data corruption.

   * Hybrid service groups
    A hybrid service group is for replicated data clusters and is a combination of the failover and parallel service groups. It behaves as a failover group within a system zone and a parallel group across system zones.

3. VCS Agents
Agents are multi-threaded processes that provide the logic to manage resources. VCS has one agent per resource type. The agent monitors all resources of that type; for example, a single IP agent manages all IP resources.
When the agent is started, it obtains the necessary configuration information from VCS. It then periodically monitors the resources, and updates VCS with the resource status.

4.  Cluster Communications and VCS Daemons
Cluster communications ensure that VCS is continuously aware of the status of each system’s service groups and resources. They also enable VCS to recognize which systems are active members of the cluster, which have joined or left the cluster, and which have failed.

4.1. High availability daemon (HAD)
    The VCS high availability daemon (HAD) runs on each system. Also known as the VCS engine, HAD is responsible for:

       * building the running cluster configuration from the configuration files
       * distributing the information when new nodes join the cluster
       * responding to operator input
       * taking corrective action when something fails.

    The engine uses agents to monitor and manage resources. It collects information about resource states from the agents on the local system and forwards it to all cluster members. The local engine also receives information from the other cluster members to update its view of the cluster.

    The hashadow process monitors HAD and restarts it when required.

4.2.  HostMonitor daemon
    VCS also starts the HostMonitor daemon when the VCS engine comes up. The VCS engine creates a VCS resource VCShm of type HostMonitor and a VCShmg service group. The VCS engine does not add these objects to the main.cf file. Do not modify or delete these components of VCS. VCS uses the HostMonitor daemon to monitor CPU and swap utilization, and reports to the engine log if usage crosses the threshold limits defined for these resources.

4.3.  Group Membership Services/Atomic Broadcast (GAB)
    The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for cluster membership and cluster communications.

    * Cluster Membership
    GAB maintains cluster membership by receiving input on the status of the heartbeat from each node through LLT. When a system no longer receives heartbeats from a peer, it marks the peer as DOWN and excludes the peer from the cluster. In VCS, memberships are sets of systems participating in the cluster.

    * Cluster Communications
    GAB’s second function is reliable cluster communications. GAB provides guaranteed delivery of point-to-point and broadcast messages to all nodes. The VCS engine uses a private IOCTL (provided by GAB) to tell GAB that it is alive.

4.4. Low Latency Transport (LLT)
    VCS uses private network communications between cluster nodes for cluster maintenance. Symantec recommends two independent networks between all cluster nodes. These networks provide the required redundancy in the communication path and enable VCS to discriminate between a network failure and a system failure. LLT has two major functions.

    * Traffic Distribution
    LLT distributes (load balances) internode communication across all available private network links. This distribution means that all cluster communications are evenly distributed across all private network links (maximum eight) for performance and fault resilience. If a link fails, traffic is redirected to the remaining links.

    * Heartbeat
    LLT is responsible for sending and receiving heartbeat traffic over network links. The Group Membership Services function of GAB uses this heartbeat to determine cluster membership.

4.5. I/O fencing module
    The I/O fencing module implements a quorum-type functionality to ensure that only one cluster survives a split of the private network. I/O fencing also provides the ability to perform SCSI-3 persistent reservations on failover. The shared disk groups offer complete protection against data corruption by nodes that are assumed to be excluded from cluster membership.

5. VCS Configuration files.

    5.1. main.cf
    /etc/VRTSvcs/conf/config/main.cf is the key VCS configuration file. It tells the VCS agents and daemons:
      Which nodes are in the cluster?
      Which service groups are configured on each node?
      Which resources belong to each service group, and what are their types and attributes?
      Which dependencies does each resource have on other resources?
      Which dependencies does each service group have on other service groups?
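
To get a quick overview of what a main.cf declares, a small awk filter can pull out the systems, service groups and dependency lines. This is only a sketch: the function name summarize_main_cf is made up for illustration, and it assumes the usual main.cf layout where "system" and "group" start a definition, "parent requires child" links two resources, and "requires group" links service groups.

```shell
# Summarize a VCS main.cf read from standard input.
# Assumes the common layout: "system <name> (", "group <name> (",
# "<parent> requires <child>" and "requires group <name> ..." lines.
summarize_main_cf() {
    awk '
        $1 == "system"                    { print "system " $2 }
        $1 == "group"                     { print "group " $2 }
        $1 == "requires" && $2 == "group" { print "  depends on group " $3 }
        $2 == "requires"                  { print "  " $1 " requires " $3 }
    '
}

# Usage on a cluster node (read-only, does not touch the file):
#   summarize_main_cf < /etc/VRTSvcs/conf/config/main.cf
```

Because the filter only reads the file, it is safe to run on a live node; it is not a substitute for checking the configuration with the VCS tools themselves.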

     5.2. types.cf

    The file types.cf, which is listed in the include statement in the main.cf file, defines the VCS bundled types for VCS resources. The file types.cf is also located in the folder /etc/VRTSvcs/conf/config.

    5.3. Other Important files
        /etc/llthosts—lists all the nodes in the cluster
        /etc/llttab—describes the local system’s private network links to the other nodes in the cluster
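
Since two independent private networks are recommended, it can be handy to check how many links a node's llttab actually defines. The helpers below are a rough sketch with made-up function names; they assume llthosts holds one "ID name" pair per line and that each private link in llttab is declared on a line starting with "link".

```shell
# Print "ID: name" for every node listed in an llthosts-style file (stdin).
list_llt_nodes() {
    awk 'NF >= 2 && $1 !~ /^#/ { print $1 ": " $2 }'
}

# Count the "link" directives in an llttab-style file (stdin).
count_llt_links() {
    awk '$1 == "link" { n++ } END { print n + 0 }'
}

# Usage on a cluster node:
#   list_llt_nodes  < /etc/llthosts
#   count_llt_links < /etc/llttab     # fewer than 2 means no link redundancy
```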

Wednesday, 5 October 2011

VERITAS Volume Manager for Solaris

Veritas Volume Manager (VxVM) is a storage management application from Symantec that allows you to manage physical disks as logical devices called volumes.

VxVM uses two types of objects to perform storage management:
1. Physical objects - direct mappings to physical disks
2. Virtual objects - volumes, plexes, subdisks and disk groups

a. Disk groups contain volumes
b. Volumes are composed of plexes and subdisks
c. Plexes are composed of subdisks
d. Subdisks are segments of actual disk space on a VxVM disk (mapped directly from the physical disks)

1. Physical Disks
A physical disk is the basic storage device where data is ultimately stored. Solaris names physical disks using the convention “c#t#d#”, where c# is the controller/adapter number, t# is the SCSI target ID, and d# is the disk device ID.
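
As a small illustration of the naming convention, the sketch below splits a device name into its three parts. The function name parse_ctd is made up, and it assumes purely numeric controller, target and disk IDs (SAN devices with long WWN-style targets would need a wider pattern).

```shell
# Split a Solaris c#t#d# device name into controller, target and disk numbers.
# A trailing slice suffix such as "s2" is ignored. Prints nothing if the
# name does not follow the numeric c#t#d# pattern.
parse_ctd() {
    printf '%s\n' "$1" |
        sed -n 's/^c\([0-9][0-9]*\)t\([0-9][0-9]*\)d\([0-9][0-9]*\).*$/controller=\1 target=\2 disk=\3/p'
}

# Usage:
#   parse_ctd c0t1d0      # prints: controller=0 target=1 disk=0
```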

Physical disks can come from different sources: disks internal to the server, disks from a disk array, and disks from the SAN.

Check if the disks are recognized by Solaris

# echo | format
Searching for disks…done

AVAILABLE DISK SELECTIONS:
0. c0t0d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
1. c0t1d0 <SUN9.0G cyl 4924 alt 2 hd 27 sec 133>
/sbus@1f,0/SUNW,fas@e,8800000/sd@1,0
 
2. Solaris Native Disk Partitioning

In Solaris, a physical disk is partitioned into slices numbered s0 through s7. Slice s2 is normally called the overlap slice and represents the entire disk. The format utility is used to partition physical disks into slices.

After adding new disks to the server, first make sure Solaris recognizes them before moving on to any storage management utility.

Steps to add a new disk to Solaris:
If recently added disks are not visible, use one of the procedures below.
 
Option 1: Reconfiguration reboot ( for server hardware models that don’t support hot swapping/dynamic addition of disks )

# touch /reconfigure; init 6

or

# reboot -- -r ( only if no applications are running on the machine )

Option 2: Recognize  the disks added to external SCSI, without reboot

# devfsadm

# echo | format <== to check the newly added disks

Option 3: Recognize disks added to internal SCSI, hot-swappable disk connections.

Just run the command “cfgadm -al” and check for any newly added devices in “unconfigured” state, and configure them.

# cfgadm -al
Ap_Id                         Type            Receptacle   Occupant     Condition
c0                                  scsi-bus     connected    configured   unknown
c0::dsk/c0t0d0      disk              connected    configured   unknown
c0::dsk/c0t0d0      disk              connected    configured   unknown
c0::rmt/0                  tape             connected    configured   unknown
c1                                  scsi-bus      connected    configured   unknown
c1::dsk/c1t0d0       unavailable  connected    unconfigured unknown <== disk not configured
c1::dsk/c1t1d0       unavailable  connected    unconfigured unknown <== disk not configured

# cfgadm -c configure c1::dsk/c1t0d0

# cfgadm -c configure c1::dsk/c1t1d0

# cfgadm -al
Ap_Id                         Type            Receptacle   Occupant     Condition
c0                                  scsi-bus     connected    configured   unknown
c0::dsk/c0t0d0      disk              connected    configured   unknown
c0::rmt/0                  tape             connected    configured   unknown
c1                                  scsi-bus      connected    configured   unknown
c1::dsk/c1t0d0       disk              connected    configured unknown  <= Disk configured now
c1::dsk/c1t1d0       disk              connected    configured unknown  <= Disk configured now

# devfsadm

# echo | format <== now you should see all the disks connected to the server
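
When there are many attachment points, picking out the unconfigured ones by eye gets tedious. The filter below is a minimal sketch (the name list_unconfigured is made up) that assumes the five-column "cfgadm -al" layout shown above, with the occupant state in the fourth column.

```shell
# Read "cfgadm -al" output on stdin and print the Ap_Id of every
# attachment point whose Occupant column says "unconfigured".
list_unconfigured() {
    awk 'NR > 1 && $4 == "unconfigured" { print $1 }'
}

# Usage on a Solaris host (review the list before configuring anything):
#   cfgadm -al | list_unconfigured
#   cfgadm -al | list_unconfigured | while read ap; do cfgadm -c configure "$ap"; done
```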


3. Initialize Physical Disks under VxVM control


A formatted physical disk is considered uninitialized until it is initialized for use by VxVM. When a disk is initialized, partitions for the public and private regions are created; VM disk header information is written to the private region, and the public region is where the actual data is stored. During the normal initialization process, any data or partitions that may have existed on the disk are removed.

Note: Encapsulation is another method of placing a disk under VxVM control, in which existing data on the disk is preserved.

An initialized disk is placed into the VxVM free disk pool. The VxVM free disk pool contains disks that have been initialized but not yet assigned to a disk group. These disks are under Volume Manager control but cannot be used by Volume Manager until they are added to a disk group.

Device Naming Schemes
In VxVM, device names can be represented in two ways:

    Using the traditional operating system-dependent format c#t#d#
    Using an operating system-independent format that is based on enclosure names

c#t#d# Naming Scheme
Traditionally, device names in VxVM have been represented in the way that the operating system represents them. For example, Solaris and HP-UX both use the format c#t#d# in device naming, which is derived from the controller, target, and disk number. In VxVM version 3.1.1 and earlier, all disks are named using the c#t#d# format. VxVM parses disk names in this format to retrieve connectivity information for disks.

Enclosure-Based Naming Scheme
With VxVM version 3.2 and later, VxVM provides a new device naming scheme, called enclosure-based naming. With enclosure-based naming, the name of a disk is based on the logical name of the enclosure, or disk array, in which the disk resides.

Steps to Recognize new disks under VxVM control
1. Run the command below to see the disks available under VxVM control

# vxdisk list
In the output you will see one of the statuses below:

    error indicates that the disk has neither been initialized nor encapsulated by VxVM. The disk is uninitialized.
    online indicates that the disk has been initialized or encapsulated.
    online invalid indicates that the disk is visible to VxVM but not controlled by VxVM.

If disks are visible with the “format” command but not with the “vxdisk list” command, run the command below to make VxVM scan for the new disks

# vxdctl enable

Now you should see the new disks with the status “online invalid”

2. Initialize each disk with “vxdisksetup” command

# /etc/vx/bin/vxdisksetup -i <disk_address>

After running this command, “vxdisk list” should show the status “online” for all the newly initialized disks.
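
To find the disks that still need step 2, you can filter the “vxdisk list” output for the “online invalid” status. This is a sketch only: the function name is made up, and it assumes the status is the last column of each line, as in the usual DEVICE / TYPE / DISK / GROUP / STATUS layout.

```shell
# Read "vxdisk list" output on stdin and print the devices whose status
# is "online invalid", i.e. visible to VxVM but not yet initialized.
list_uninitialized_vx_disks() {
    awk 'NR > 1 && $NF == "invalid" && $(NF - 1) == "online" { print $1 }'
}

# Usage on a host with VxVM installed:
#   vxdisk list | list_uninitialized_vx_disks
```

The result is a list of candidates for vxdisksetup; check each one before initializing, since initialization destroys any existing data on the disk.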

4. Virtual Objects (Disk Groups / Volumes / Plexes) in VxVM

Disk Groups
A disk group is a collection of VxVM disks (from here on, VM disks) that share a common configuration. The disks in a disk group are divided into subdisks, subdisks are grouped into plexes, and plexes in turn form the volumes.

Volumes
A volume is a virtual disk device that appears to applications, databases, and file systems like a physical disk device, but does not have the physical limitations of a physical disk device. A volume consists of one or more plexes, each holding a copy of the selected data in the volume.

Plexes
VxVM uses subdisks to create virtual objects called plexes. A plex consists of one or more subdisks located on one or more physical disks.



Key Points on Transformation of Physical disks into Veritas Volumes
1. Make Solaris recognize the disks using devfsadm, cfgadm or a reconfiguration reboot, and verify with the format command
2. Make VxVM recognize the disks using “vxdctl enable”
3. Initialize the disks under VxVM using vxdisksetup
4. Add the disks to a Veritas disk group using vxdg commands
5. Create volumes in the disk group using vxmake or vxassist commands
6. Create a filesystem on top of each volume using mkfs or newfs; it can be either a VxFS or a UFS filesystem
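
Steps 4 to 6 can be sketched as a small "dry run" helper that only prints the commands, so you can review them before running anything. The function name and all the sample names (datadg, datadg01, vol01, /data) are made up for illustration; it assumes a brand-new disk group (for an existing one, "vxdg -g <dg> adddisk" would be used instead of "vxdg init") and a VxFS filesystem.

```shell
# Print (do not run) the commands for: new disk group -> volume -> VxFS -> mount.
# Arguments: disk group, disk media name, device (as shown by "vxdisk list"),
#            volume name, volume size, mount point.
plan_vxvm_volume() {
    dg=$1 disk=$2 dev=$3 vol=$4 size=$5 mnt=$6
    echo "vxdg init $dg $disk=$dev"
    echo "vxassist -g $dg make $vol $size"
    echo "mkfs -F vxfs /dev/vx/rdsk/$dg/$vol"
    echo "mount -F vxfs /dev/vx/dsk/$dg/$vol $mnt"
}

# Usage:
#   plan_vxvm_volume datadg datadg01 c1t0d0s2 vol01 1g /data
```

Printing the plan first is a deliberate choice here: each of these commands changes storage state, so it is worth reading them once before pasting them into a root shell.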

Thursday, 29 September 2011

Cloud Computing

Most of us already know that cloud computing is the new buzzword in the industry, and everyone wants to learn as much about it as possible. I have been reading about and observing the evolution of cloud computing for the past year, and recently I had an opportunity to attend IBM’s SmartCloudCamp session, which gave me some insight into its current state.
I have noticed several questions from the system admin community about the effect of cloud computing on infrastructure support teams. In this post I try to address that question as I understand cloud computing.


Let me tell you a small story before we discuss cloud computing.
My sister and her family live in a small town in the state of Andhra Pradesh, India. Power failures are common in the town: outages of 1 or 2 hours, 2 or 3 times per day. My sister and her neighbors were upset because the continuous outages disturbed the kids’ studies and made life difficult in the evenings. They knew that a power generator would solve the problem as a backup power source, but most of the neighboring families could not afford one and were also worried about the regular maintenance cost.
One fine day, a group of smart minds came up with a solution: purchase a high-capacity power generator, place it in a common location, and provide backup power connections to every home willing to pay usage charges based on the actual usage measured by an electric meter installed at each home. The idea worked very well, and most of the people in the town adopted the backup power source with minimal capital investment and zero maintenance cost.
I believe that by now you have understood the purpose of cloud computing in the IT industry. If it is still unclear, let’s look at it in more detail.
The current definition of cloud computing is “a comprehensive solution which delivers IT as a service”, where “IT” can be expanded as infrastructure, platform, storage and software. At present the IT industry is divided into two groups in terms of cloud computing: cloud computing service providers and cloud computing service consumers (clients).
Cloud Computing in its Basic Form

Quick refresh on Cloud Computing Benefits to a Client/Consumer

1. Reduced Capital Cost to setup IT Infrastructure
Scenario 1:
If an organisation wants to start a new business function that needs IT infrastructure, it need not go through the whole complex process of establishing that infrastructure, starting from data center planning. Instead, it can simply go to a cloud computing service provider whose service catalogue offers a service that meets the IT requirement of the new business function. The requested service could be anything from server/storage/network infrastructure, to a platform environment, to an already built software application that can be customized to your requirement. The organisation pays the service provider only for the resources it has used. No capital investment, no running maintenance cost.
Scenario 2:
If an organisation wants to migrate its existing IT infrastructure (or part of it) for a less critical business function, it can again approach a cloud computing service provider for a solution that fits its actual requirement.
2. Rapid scalability with the help of dynamic infrastructure
Current Challenge:
In any business, the initial design of the IT infrastructure is commonly based on the current potential of the business and its expected growth in the near future. These predictions about future growth may or may not be correct in today’s highly fluctuating business markets. Any large investment in IT infrastructure is wasted if the related business does not do as well as expected. At the same time, insufficient IT infrastructure resources can block business growth if the business progresses better than expected.
Predicting the actual IT infrastructure requirement is always a real challenge for any organisation, and this challenge can be addressed easily if the organisation considers a cloud computing solution.
Using cloud computing, an organisation can easily scale its resources to match the business requirement, which is very dynamic in nature.
3.  Utility Pricing Model
This point is self-explanatory: organisations pay only for the resources they have used. No initial investment to set up infrastructure.
4. Self Service by using Automated Provisioning
I believe this is one key point where cloud computing affects existing IT infrastructure job roles.
By using the automated provisioning feature of cloud computing, organisations can request the services listed in the service catalogue and receive them instantly and dynamically, with minimal or no technology skills.
5. Resource availability from anywhere in the world
Public clouds can be accessed from anywhere in the world over the internet, and this makes cloud computing an attractive solution for many startup companies that run on virtual teams located in different parts of the world.
For more information, you can refer to my other post “Cloud Computing – It’s not just another buzzword, but a near future”, which talks about cloud computing features and benefits.

Cloud Computing Layers

IaaS  -  Infrastructure as a Service
IaaS is basically a paradigm shift from “Infrastructure as an asset” to “Infrastructure as a Service”.
Key Characteristics of IaaS:
  • Infrastructure is Platform independent
  • Infrastructure costs are shared by multiple clients/users
  • Utility Pricing – Clients will pay only for the resources they have consumed
Advantages:
  • Minimal or No Capital investment on Infrastructure Hardware
  • No Maintenance costs for Hardware
  • Reduced ROI risk
  • Avoid the wastage of Computing resources
  • Dynamic in nature
  • Rapid Scalability of Infrastructure to meet sudden peak in business requirements
Drawbacks:
  • Performance of Infrastructure purely depends on Vendor capability to manage resources
  • Consistent  high usage of resources for a long term could lead to higher costs
  • Companies have to introduce a new layer of enterprise security to deal with cloud-computing-related security issues
Note: It is better not to adopt an IaaS solution if the organisation’s capital budget is greater than its operating budget.
PaaS – Platform as a Service
PaaS is a paradigm shift from “purchasing platform environment tools as a licensed product” to “purchasing them as a service”.
Key Characteristics:
  • Deployment purely based on cloud infrastructure
  • caters to agile project management methods
Advantages:
  • It is possible to capture complex testing & development platform requirements and automate the provisioning of a consistent environment.
Drawback:
  • Enterprises have to introduce a new layer of security to deal with security in the cloud computing environment.
SaaS – Software as a Service
SaaS is basically a paradigm shift from “treating software as an asset of the business/consumer” to “using software as a service to achieve the business goals”.
Advantages:
  • Reduced capital expenses for development and testing resources
  • Reduced ROI risk
  • Streamlined and iterative updates of the software
Drawbacks:
  • Enterprises have to introduce a new layer of security to deal with security in the cloud computing environment.

Cloud Computing Solutions for Enterprise

Public Cloud Solution for Enterprise
A public cloud solution allows an enterprise to adopt IaaS, PaaS and SaaS services from a cloud computing service provider over the internet; the actual computing resources remain under the control of the vendor.
Private Cloud Solution for Enterprise
A private cloud solution is nothing but a cloud built within the enterprise data center, to provide more security for physical resources. Internal departments of the enterprise can use and pay for cloud computing resources as if they were using public cloud resources.
Hybrid Cloud Solution for Enterprise
A hybrid cloud solution enables an enterprise to use both public and private cloud resources at the same time, depending on the criticality and importance of the business function.
Virtual Private Cloud Solution
Using a virtual private cloud solution, companies can create their own private cloud environment within the public cloud by using separate network/firewall rules. The purpose is to prevent external access to the enterprise resources.

How Cloud Computing affects the Job roles in the Infrastructure Support Team

Depending on the cloud computing solution that the enterprise adopts, there will be direct and indirect effects on the various job roles within the infrastructure support teams.
If you look at the sysadmin role in general, the actual job role involves four major responsibilities:
  • Hardware administration
  • Operating System Builds
  • Operating System Administration
  • Network Services Administration
Once the organisation adopts a cloud computing solution ( IaaS / PaaS / SaaS ), it is no longer required to retain skilled technical people to deal with hardware-related issues and OS build operations, but it still needs resources to perform OS / network administration and to customize cloud resources to meet the organisation’s requirements. The same effect holds for network support roles.
Cloud computing solutions cannot replace every system administrator in the company, but they will demand a new level of cloud-computing-related expertise in place of isolated hardware maintenance skills. For sure, it’s a call for learning. More importantly, the sysadmin job roles dealing specifically with “hardware & OS builds” are likely to go away in the near future.
For any organisation, the current recruitment strategy for the sysadmin team is “the number of sysadmins is directly proportional to the physical server footprint in the data center”. With IaaS adoption the organisation’s server footprint will reduce drastically, and with it the number of sysadmin positions.
As of now, clouds are deployed to replace server infrastructure running Windows / Linux on x86, but they do not yet have solutions for vendor-specific server operating systems like Solaris on SPARC, IBM AIX and HP-UX. Considering the speed of evolution in cloud computing technologies, it may not take long to provide solutions for all kinds of server infrastructure. On the other hand, if organisations choose to migrate their applications to x86 servers to benefit from economical cloud computing, the change will be even more rapid.
The pictures below will give you an understanding of how roles move out of infrastructure teams depending on the cloud solution adopted by the organisation.

One final story I want to tell you before closing this post.

As most of you are already aware, India is an agriculture-based society where people treat their land like “a mother that feeds you every day” and their cows as “part of the family wealth”. A decade ago, most families followed the traditional way of cultivation, which requires more people and long working hours. This demand for human labor was the main source of jobs in villages for a long time.
With technology innovation in India, many new tools and machines were introduced to Indian agriculture, which in turn reduced the need for human labor. During this technology change, many people back in the villages worried about their livelihood for some time. But the worry didn’t last long, because most of them quickly picked up skills related to the new technologies, such as “regular maintenance of these new tools”, “using the tools for better productivity” and “finding new land to cultivate at low cost using these new machines”, and started living better than before.
I believe the same story applies to any other industry, including IT. Whenever we notice an inevitable change coming our way, it is always wise to understand it and get ready to accept it, instead of worrying about it and trying to resist it.

Solaris Troubleshooting – System Panics, Hangs and Crashes


There are a number of differing scenarios under which the Solaris  operating system may panic, hang, or exhibit other symptoms that lead the administrator to have to restart or reboot the system.
As there are many different failure scenarios and many different classes of hardware, the information and procedures for collection of system information vary from system to system.

1. Hang

The first class of failures is the hang. This is when a system appears to become unresponsive. See the documents below that discuss dealing with hung systems.
Be aware that some systems that appear hung are not! Be sure to verify if the server is hung or not. For example: The display may be non-responsive, because the output has been redirected to the console device.

2. Panic or unexpected reboot

Panics can be caused by a variety of issues, including Solaris Bugs, Hardware errors and Third Party Drivers and Applications. It’s important to collect as much detail as is possible when systems panic.
Information that needs to be collected to troubleshoot a kernel panic:
When logging a new case, provide answers to the following questions as an absolute minimum:
  • When did the problem start
  • What changes have been recently made on the system. Important: Anything that has happened since the last reboot is within the scope of this question. Patching, application changes, disk replacements, anything. It’s all important to know when trying to resolve the issues.
  • How often has this failure occurred
  • What may have been going on around the time of the panic and if anything out of the ordinary may have been observed
In the case of a panic or reboot, the messages log and prtdiag output can be sent to Oracle Sun quickly and go a long way towards diagnosing the cause of the problem. However, an Explorer is almost always better.
By far, the simplest way to collect the vast majority of details required to resolve a panic is to collect:
  • Sun Explorer Data Collector output
  • The crash dump
  • Any console messages
a. EXPLORER OUTPUT
If Explorer is run on the system after the incident occurs, it will contain most of what will be required to understand the current configuration of the system, and additional information that may be helpful.
b. CRASH DUMP
If the system generated a system crash dump (check /var/crash/`hostname`), create a compressed tar file containing the unix.* and vmcore.* files and transfer that file to Oracle Sun for analysis. Compressing the tar file (using one of the compress, gzip or bzip2 utilities) dramatically reduces the size of the file and so reduces the time taken to transfer it to Sun.
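
A minimal sketch of that packaging step is shown below. The function name pack_crash_dump and the output file name are made up for illustration; it assumes the dump files sit directly in the given directory and that gzip is the compressor of choice.

```shell
# Bundle the unix.* and vmcore.* files from a crash directory into one
# gzip-compressed tar file. Arguments: crash directory, output file.
pack_crash_dump() {
    dir=$1 out=$2
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # tar runs inside the crash directory so the archive holds bare file names
    (cd "$dir" && tar cf - unix.* vmcore.*) | gzip -c > "$out"
}

# Usage on the affected host:
#   pack_crash_dump /var/crash/`hostname` /var/tmp/crashdump.tar.gz
```

Write the output to a filesystem with enough free space; a vmcore file can be large even after compression.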
c. CONSOLE MESSAGES
Console messages are most important when the server is experiencing hardware issues and the OS is not given an opportunity to panic. In cases where multiple unexpected reboots are occurring, and no diagnostic information is being provided by the system logs in /var/adm/messages, some form of console logging should be set up as soon as possible to capture the diagnostic information from the console on the next failure.
Recommended NVRAM settings to collect console messages:
Bring the system to OBP level from the command line using the “shutdown” or “init 0” commands (either will run all RC shutdown scripts), sync file systems and then drop the system to the OK prompt. DO NOT use a stop+A key press. The following commands can be executed from the OK prompt or from the command line using the “eeprom <variable=parameter>” command.

at OK prompt                            # eeprom                      Description
setenv diag-level max                   diag-level=max                system will run extended POST
printenv boot-device                    boot-device                   determine what your boot device is
setenv diag-device <your boot-device>   diag-device=<your bootdev>    prevent attempting net boot with diags on
setenv error-reset-recovery sync        error-reset-recovery=sync     force sync reboot if system drops to OK
setenv diag-switch? true                diag-switch?=true
reset-all                               reboot or init 6              system has to reset for changes to take effect

Exceptions
In some cases, it is difficult to collect an explorer.
In the event that explorer cannot be installed or run in a timely manner, the following data is of tremendous value, and should be collected:
a. MESSAGES LOG
Messages logs from /var/adm directory. If there was a panic, the panic message in the file will help determine if we need to analyze a crash dump to diagnose the cause. In many cases a crash dump is not necessary, and waiting for one to be transferred simply increases the resolution time. For example, if there was a hardware reset rather than a panic, the messages log should show that.
b. PRTDIAG OUTPUT
Output of the prtdiag command.
 /usr/platform/`uname -i`/sbin/prtdiag -v
prtdiag gives a summary of a system’s hardware configuration, so that we would know what part to order in the case of a hardware failure. It also gives hardware error messages that can aid in diagnosis.
c. SHOWREV OUTPUT
The output from ‘/usr/bin/showrev -p’ gives a list of the patches installed on the system. This will help eliminate possible causes of the problem and ensure that the correct versions of source code and analysis tools are used during the investigation.

3. Live Dump

On occasion, it is required that a live crashdump be collected. It’s uncommon, as dumping a live system does not capture a completely consistent snapshot of the system. Data is changing while the dump is being written out. Although live dumps are not always consistent, they are still a great source of information for certain types of issues.
Collecting a live dump from a Solaris machine:
1) Before collecting a live kernel dump, a dedicated, NON SWAP, dump device must be configured using dumpadm(1M). The dedicated dump device must not be used in any other way (i.e., filesystem, databases, etc.). The dump device CANNOT be swap or any part of swap. If the dump device is part of swap, generating a live kernel dump will corrupt the swap area, causing the kernel to eventually panic. Also note that any filesystem or data on the dump device disk will be lost.
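
A quick guard for the rule in step 1 could look like the sketch below. It assumes that dumpadm prints a “Dump device:” line ending in “(dedicated)” or “(swap)”, which is the format I have seen on Solaris; the function name is made up, so verify the output on your release before relying on it.

```shell
# Read "dumpadm" output on stdin; succeed only if the configured dump
# device is marked "(dedicated)" rather than "(swap)".
dump_device_is_dedicated() {
    awk '/Dump device:/ { found = 1; ok = ($NF == "(dedicated)") }
         END { exit (found && ok) ? 0 : 1 }'
}

# Usage on the Solaris host: take the live dump only if the check passes.
#   dumpadm | dump_device_is_dedicated && savecore -L
```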

2) The kernel keeps running while the live kernel dump is generated, so kernel debuggers may fail when traversing linked structures that were in flux when they were saved. Therefore, always run the following ps command to capture the process addresses:
/usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
You must use those switches to get addresses with a 64-bit kernel. With the process address in hand you can still look at individual processes in the dump and, from there, generate more complete thread lists, etc. See the ps(1) manpage for the meaning of those options. For example, the following is what /usr/bin/ps -elf outputs on a 64-bit machine:
> /usr/bin/ps -elf
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
19 T root 0 0 0 0 SY 0 Sep 27 0:00 sched
8 S root 1 0 0 41 20 98 Sep 27 0:00 /etc/init -
19 S root 2 0 0 0 SY 0 Sep 27 0:00 pageout
19 S root 3 0 0 0 SY 0 Sep 27 1:05 fsflush
8 S root 267 1 0 41 20 220 Sep 27 0:00 /usr/lib/saf/sac -t 300
8 S root 154 1 0 51 20 325 Sep 27 0:00 /usr/sbin/inetd -s
8 S root 125 1 0 41 20 294 Sep 27 0:00 /usr/sbin/rpcbind
8 S root 48 1 0 47 20 185 Sep 27 0:00 /usr/lib/sysevent/syseventd
8 S root 50 1 0 48 20 160 Sep 27 0:00 /usr/lib/sysevent/syseventconfd
NOTE: The ADDR field is not populated. If you issue the command to capture process addresses as described above, you will see the following output instead:
> /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
UID PID PPID PRI NI ADDR VSZ WCHAN TIME COMMAND
0 0 0 96 SY 10423a60 0 - 0:00 sched
0 1 0 58 20 30000909528 784 30000909848 0:00 init
0 2 0 98 SY 30000908a98 0 10458248 0:00 pageout
0 3 0 60 SY 30000908008 0 104618a0 1:05 fsflush
0 267 1 58 20 30000963530 1760 300009659a8 0:00 sac
0 154 1 48 20 30001708028 2600 30000baaca2 0:00 inetd
0 125 1 58 20 30001709548 2352 30000aa2102 0:00 rpcbind
0 48 1 52 20 300009f0aa8 1480 300009f0dc8 0:00 sysevent
0 50 1 51 20 300009f0018 1280 22d44 0:00 sysevent
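
The two precautions above (a dedicated dump device, and a saved copy of the process addresses) can be checked with a few small helpers. This is a sketch under assumptions: the function names are hypothetical, and the first two helpers assume that dumpadm prints a "Dump device:" line ending in "(dedicated)" or "(swap)"; compare with the output on your own release before relying on them.

```shell
# Sketch only: all function names here are hypothetical.

# Read dumpadm-style output on stdin; return 0 if the dump device
# is reported as dedicated (assumed marker: "(dedicated)").
dump_device_is_dedicated() {
    grep 'Dump device:' | grep '(dedicated)' >/dev/null
}

# Read dumpadm-style output on stdin; print the dump device path.
dump_device_path() {
    awk '/Dump device:/ { print $3 }'
}

# Save the process addresses to file $1 before taking the live dump.
save_process_addresses() {
    /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname > "$1"
}

# Print the ADDR column for PID $1 from saved ps output in file $2.
# The variable is passed as an operand (pid=...) rather than with -v,
# so that older awk implementations also accept it.
proc_addr_for_pid() {
    awk 'NR > 1 && $2 == pid { print $6 }' pid="$1" "$2"
}
```

For example, `dumpadm | dump_device_is_dedicated || echo "dump device is not dedicated"` gives a quick go/no-go check, and after `save_process_addresses /var/tmp/ps.addr` the call `proc_addr_for_pid 3 /var/tmp/ps.addr` would print the address of fsflush on the machine shown above. The live dump itself is normally written with savecore -L; see savecore(1M) for the details on your release.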

Wednesday, 31 August 2011

Sun Hardware Diagnosis at OBP level

Servers often encounter hardware-related issues that cannot be diagnosed with operating-system-level utilities. To perform a preliminary diagnosis and pinpoint the hardware trouble, system admins have to rely on the OBP (Open Boot PROM) diagnostic options.
At the OBP level, a system admin has three options to investigate the issue:
  1. OBP diagnosis commands
  2. OBDiag Outputs
  3. POST Errors

1.Using OBP Commands

Below are some of the OBP commands that system admins with advanced hardware skills can use to investigate the trouble.
banner
Displays the power on banner. The banner includes information such as CPU speed, OBP revision, total system memory, ethernet address and hostid.
.enet-addr
Displays the ethernet address
led-off/led-on
Turns the system led off or on.
nvstore
Copies the contents of the temporary buffer to NVRAM and discards the contents of the temporary buffer.
power-off/power-on
Powers the system off or on.
printenv
Displays all parameters, settings, and values
probe-fcal-all
Identifies Fiber Channel Arbitrated Loop (FCAL) devices on a system.
probe-sbus
Identifies devices attached to all SBUS slots. Note - This command works only on systems with SBUS slots.
probe-scsi
Identifies devices attached to the onboard SCSI bus.
probe-scsi-all
Identifies devices attached to all SCSI buses.
set-default parameter
Resets the value of parameter to the default setting.
set-defaults
Resets the value of all parameters to the default settings. Tip - You can also press the Stop and N keys simultaneously during system power-up to reset the values to their defaults.
setenv parameter value
Sets parameter to specified value. Note - Run the reset-all command to save changes in NVRAM.
show-devs
Displays all the devices recognized by the system.
show-disks
Displays the physical device path for disk controllers.
show-displays
Displays the physical device path for frame buffers.
show-nets
Displays the physical device path for network interfaces
show-post-results
If run after Power On Self Test (POST) is completed, this command displays the findings of POST in a readable format.
show-sbus
Displays devices attached to all SBUS slots. Similar to probe-sbus .
show-tapes
Displays the physical device path for tape controllers.
sifting string
Searches for OBP commands or methods that contain string. For example, the sifting probe command displays probe-scsi, probe-scsi-all, probe-sbus, and so on.
.speed
Displays CPU and bus speeds
test device-specifier
Executes the selftest method for device-specifier. For example, the test net command tests the network connection.
test-all
Tests all devices that have a built-in test method.
.version
Displays OBP and POST version information.
watch-clock
Tests a clock function.
watch-net
Monitors the network connection for the primary interface.
watch-net-all
Monitors all the network connections.

 

2.OBDiag


OBDiag can be used to diagnose the main logic board as well as interface boards (e.g. PCI / SCSI / Ethernet / Serial / Parallel / Keyboard/mouse / NVRAM / Audio / Video).
To run OBDiag, enter:
ok obdiag
You can also set up OBDiag to run automatically when the system is powered on using the following methods:
    1. Set the OBP diagnostics variable:              ok setenv diag-switch? true
    2. Press the Stop and D keys simultaneously while you power on the system
Note: On Ultra Enterprise servers, simply turn the key switch to the diagnostics position and power on the system to start OBDiag.

 

3.POST

POST is a program that resides in the firmware of each board in a system, and it is used to initialize, configure, and test the system boards. POST output is sent to serial port A, and the POST completion status is indicated by the status LEDs.
You can watch POST output in real time by attaching a terminal device to serial port A. If none is available, you can use the OBP command show-post-results to view the results after POST completes.
How To Run POST
  • Attach a terminal device to serial port A.
  • Set the OBP diagnostics variable:
ok setenv diag-switch? true
  • Set the desired testing level. Two different levels of POST can be run, and you can choose to run all tests or some of the tests. Set the OBP variable diag-level to the desired level of testing (max or min), for example:
ok setenv diag-level max
  • If you wish to boot from disk, set the OBP variable diag-device (the system default for this variable is net):
ok setenv diag-device disk
  • Set the auto-boot? variable:
ok setenv auto-boot? false
  • Save the changes
ok reset-all
  • Power cycle the system (turn it off, and then back on).
POST runs as the system powers on, and the output is displayed on the device attached to serial port A. After POST has completed, you can also run the OBP command show-post-results to view the results.
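If the operating system is still up, the same OBP variables can usually be set from Solaris with eeprom(1M) before the power cycle, instead of typing them at the ok prompt. The sketch below assumes the variable names diag-switch? and auto-boot? (with the trailing question mark), diag-level and diag-device; the function name post_settings and the RUN switch are hypothetical. By default it only prints the commands so they can be reviewed first.

```shell
# Sketch only: post_settings and the RUN switch are hypothetical.
# $1 = diag-level (max or min, default max)
# $2 = diag-device (default disk)
# Prints the eeprom commands; runs them instead when RUN=1 is set.
post_settings() {
    level=${1:-max}
    device=${2:-disk}
    for setting in 'diag-switch?=true' "diag-level=$level" \
                   "diag-device=$device" 'auto-boot?=false'; do
        if [ "${RUN:-0}" = 1 ]; then
            # Quoting keeps the shell from treating "?" as a wildcard.
            eeprom "$setting"
        else
            echo "eeprom '$setting'"
        fi
    done
}
```

Running `post_settings min` prints four eeprom commands for review; `RUN=1 post_settings min` applies them (as root). Once the diagnosis is finished, the variables need to be set back to their previous values, otherwise the system will keep running diagnostics and stop at the ok prompt on every power-on.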
LED STATUS
Power LED ( Left position)
Should always be on. If all three LEDs are off, suspect a power problem. If this LED is in any other state than on and steady, it indicates a problem.
Service LED (Middle Position)
This LED should be off in normal operation. If it is on, a component is in an error state and you should check the individual board LEDs. A lit service LED does not imply there is an OS-related problem.
Cycling LED ( Right Position)
This LED should be flashing; this is the normal state.
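The LED rules above amount to a small decision table, sketched below. The function name led_status and the state names on, off and flashing are hypothetical, and the sketch treats any service LED state other than off as an error state, which is an assumption on top of the description above.

```shell
# Sketch only: led_status and the state names are hypothetical.
# Arguments: power, service and cycling LED states,
# each one of: on, off, flashing.
led_status() {
    power=$1
    service=$2
    cycling=$3
    if [ "$power" = off ] && [ "$service" = off ] && [ "$cycling" = off ]; then
        echo "suspect a power problem"
    elif [ "$power" != on ]; then
        echo "power LED is not steady on: problem indicated"
    elif [ "$service" != off ]; then
        echo "component in error state: check individual board LEDs"
    elif [ "$cycling" != flashing ]; then
        echo "cycling LED is not flashing: not the normal state"
    else
        echo "normal"
    fi
}
```

For instance, `led_status on off flashing` reports the normal state, while `led_status on on flashing` points to the board LEDs.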