User Commands powercf(1)
NAME
powercf - PowerPath Configuration Utility
SYNOPSIS
powercf -q|-Z
DESCRIPTION
During system boot on Solaris hosts, the powercf utility configures PowerPath devices by scanning the HBAs for both single-ported and multiported storage system logical devices. (A multiported logical device shows up on two or more HBAs with the same storage system subsystem/device identity. The identity comes from the serial number for the logical device.) For each storage system logical device found in the scan of the HBAs, powercf creates a corresponding emcpower device entry in the emcp.conf file, and it saves a primary path and an alternate primary path to that device.
After PowerPath is installed, you need to run powercf only when the physical configuration of the storage system or the host changes. Configuration changes that require you to reconfigure PowerPath devices include:
* Adding or removing HBAs
* Adding, removing, or changing storage system logical devices
* Changing the cabling routes between HBAs and storage system ports
* Adding or removing storage system interfaces
Refer to the PowerPath Product Guide for instructions on reconfiguring PowerPath devices on Solaris.
Executing powercf
You must have superuser privileges to use powercf.
To run powercf on a Solaris host, type the command, plus any options, at the shell prompt.
emcp.conf File
The /kernel/drv/emcp.conf file lists the primary and alternate path to each storage system logical device and the storage system device serial number for that logical device. The powercf -q command updates the existing emcp.conf file or creates a new one if one does not already exist.
OPTIONS
powercf scans HBAs for single-ported and multiported storage system logical devices and compares those logical devices with PowerPath device entries in emcp.conf.
-q Runs powercf in quiet mode.
Updates the emcp.conf file by removing PowerPath devices not found in the HBA scan and adding new PowerPath devices that were found. Saves a primary and an alternate path to each PowerPath device.
powercf -q runs automatically during system boot.
-Z Configures an SRDF-enabled server to be bootable from an R2 mirror of a Symmetrix-based emcpower boot disk by a remote host.
powercf -Z should be run manually whenever such a server's Symmetrix volume configuration changes due to the addition or deletion of volumes.
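As a quick illustration of what a -q run leaves behind, here is a hedged sketch that inspects emcp.conf (the file path is from the section above; the emcpower-counting check itself is an assumption, and on a host without PowerPath the file is simply absent):

```shell
# After a reconfiguration, powercf -q rewrites /kernel/drv/emcp.conf.
# Hedged sketch: count the emcpower entries it recorded; on a host without
# PowerPath the file is absent, so report that instead of failing.
CONF=/kernel/drv/emcp.conf
if [ -r "$CONF" ]; then
    RESULT="$(grep -c 'emcpower' "$CONF") emcpower entries in $CONF"
else
    RESULT="emcp.conf not present (is PowerPath installed?)"
fi
echo "$RESULT"
```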
Sunday, 5 October 2014
Monday, 8 July 2013
How to support AIX (part1)
Before problems occur:
• Effective problem determination starts with a good understanding of the system and its components.
• The more information you have about the normal operation of a system, the better.
– System configuration
– Operating system level
– Applications installed
– Baseline performance
– Installation, configuration, and service manuals
A few good commands
• lspv Lists physical volumes, PVID, VG membership
• lscfg Provides information regarding system components
• prtconf Displays system configuration information
• lsvg Lists the volume groups
• lsps Displays information about paging spaces
• lsfs Gives file system information
• lsdev Provides device information
• getconf Displays values of system configuration variables
• bootinfo Displays system configuration information (unsupported)
• snap Collects system data
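The commands above can be combined into a simple baseline-collection script, run while the system is healthy so the output can be compared when a problem occurs. This is a sketch: the output file name and command list are illustrative, and each command is guarded since several exist only on AIX.

```shell
# Capture normal-operation data for later comparison during problem determination.
OUT=/tmp/baseline.$(date +%Y%m%d)
: > "$OUT"
for cmd in "lspv" "lsvg" "lsps -a" "lsfs" "lsdev" "prtconf"; do
    echo "=== $cmd ===" >> "$OUT"
    # Guard: not every command exists on every platform.
    $cmd >> "$OUT" 2>&1 || echo "(command unavailable on this host)" >> "$OUT"
done
echo "baseline written to $OUT"
```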
Steps in problem resolution
1. Identify the problem
2. Talk to users to define the problem
3. Collect system data
4. Resolve the problem
Progress and reference codes
• Progress codes
– Checkpoint during a process such as boot, shutdown, or dump
• System reference codes (SRCs)
– Error codes for problems in hardware, firmware, or operating system
• Service request numbers (SRNs)
– Indicates the detecting component and error condition detected
• Obtained from:
– Front panel of system enclosure
– HMC or IVM (for logically partitioned systems)
– Operator console message or diagnostics (diag utility)
Reference codes at IBM Information Center
http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp
Working with AIX support
• Have needed information ready:
– Name, phone #, customer #,
– Machine type model and serial #,
– AIX version, release, technology level, and service pack
– Problem description, including error codes
– Severity level: critical, significant impact, some impact, minimal
• 1-800-IBM-SERV (1-800-426-7378)
• Level 1 will collect information and assign PMR number
• Route to level 2 responsible for the product
• You may be asked to collect additional information to upload
• They may ask you to update to a specific TL or SP
– APAR for your problem already addressed
– Need to have a standard environment for them to investigate
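Before calling, the machine-side details in the checklist above can be pulled together in one place. A minimal sketch (the function name is ours; oslevel and prtconf exist only on AIX/Solaris, so each lookup is guarded):

```shell
# Hedged sketch: collect the details Level 1 will ask for into one report.
support_report() {
    echo "OS level : $(oslevel -s 2>/dev/null || echo unavailable)"
    echo "Serial # : $(prtconf 2>/dev/null | grep -i serial || echo unavailable)"
    echo "Kernel   : $(uname -a)"
}
support_report
```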
AIX support test case data
Run the following (or very similar) commands to gather snap information:
# snap -a    (copy any extra data to the /tmp/ibmsupt/testcase or /tmp/ibmsupt/other directory)
# snap -c    (this step creates /tmp/ibmsupt/snap.pax.Z)
# cd /tmp/ibmsupt
# mv snap.pax.Z \
PMR#.b<branch#>.c<country#>.snap.pax.Z
Upload the information you have captured:
# ftp testcase.software.ibm.com
User: anonymous
Password: <your email address>
ftp> cd /toibm/aix
ftp> bin
ftp> put PMR#.b<branch#>.c<country#>.snap.pax.Z
ftp> quit
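The interactive session above can also be scripted with ftp -n and a here-document. This is a hedged sketch: the email address is a placeholder, the PMR-style file name keeps the elided fields from the steps above, and the actual transfer line is left commented since it needs network access to the IBM server.

```shell
# Build the command list for a non-interactive upload; the path and transfer
# steps mirror the interactive session above.
TESTCASE="PMR#.b<branch#>.c<country#>.snap.pax.Z"   # substitute real values
cat > /tmp/ftp_cmds <<EOF
user anonymous your.email@example.com
cd /toibm/aix
bin
put $TESTCASE
quit
EOF
# ftp -n testcase.software.ibm.com < /tmp/ftp_cmds
echo "ftp command list written to /tmp/ftp_cmds"
```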
AIX software update hierarchy
• Version and release (oslevel)
– Requires new license and migration install
• Fileset updates (lslpp -L will show mod and fix levels)
– Collected changes to files in a fileset
– Related to APARs and PTFs
– Only need to apply the new fileset
• Fix bundles
– Collections of fileset updates
• Technology level and maintenance level (oslevel -r)
– Fix bundle of enhancements and fixes
• Service packs (oslevel -s)
– Fix bundle of important fixes
• Interim fixes
– Special-situation code replacements
– Used when the delay of normal PTF packaging would be too slow
– Managed with the emgr (efix) tool
Thursday, 4 July 2013
ufsrestore -if
Another helpful Unix utility is ufsrestore.
It's very handy, especially when you only need to restore a single file or a specific directory.
It's often preferable to NetBackup for this because of its simplicity.
Of course, an existing ufsdump backup is required to make use of this tool.
Here are the highlights of how to use it. Let's say we want to restore /etc:
# ufsrestore -ifv <dump-device-or-file>
ufsrestore > add /etc
ufsrestore > extract
Specify next volume #: 1        (in most cases this is volume 1)
set owner/mode for '.'? [yn] n  (answer n when restoring to a directory other than the one from which the files were dumped; y when restoring to the same directory)
ufsrestore > quit
Thursday, 3 May 2012
How to Replace System Board for Sun Fire E6900 Systems
Applies to:
Sun Fire 4800 Server - Version: Not Applicable and later [Release: N/A and later ]
Sun Fire 4810 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire 6800 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire E4900 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire 3800 Server - Version: Not Applicable and later [Release: N/A and later]
Information in this document applies to any platform.
H/W ON-SITE Action Plan #1. Parts: 540-6295
NOTE: THIS ACTION PLAN WAS AUTOGENERATED USING https://actionplans.us.oracle.com/atr/.
***********************************************************
************ Start Hardware Onsite Action Plan ************
A. DISPATCH INSTRUCTIONS
A1. WHAT SKILLS DOES THE ENGINEER NEED (IS A SITE ENGINEER AVAILABLE?):
A2. PARTS REQUIRED: USE INFORMATION IN TASK UNLESS ENTERED BELOW
Part number: [F] 540-6295
Part location: SB4
Quantity: 1
Description: CPU/MEM W/ 4 US IV 1.35GHZ, 0GB (FRU)
Prior part DOA: No
Alternate parts: 540-6803
SPECIAL INSTRUCTIONS: Verify the new board's firmware matches that of the System Controller and other boards in the configuration.
See http://sunsolve.sun.com/search/document.do?assetkey=1-61-214805-1 for details.
A3. DELIVERY REQUIREMENT:
Preferred Onsite Time: Within Service SLA
A4. ONSITE VISIT DETAILS:
Account name:
Contact Name:
Contact Telephone #:
Email address:
Street Address:
City:
State:
Country:
Postal Code:
Alt. Contact name:
Alt. Contact email:
Alt. Contact phone:
Special instructions:
B. FIELD ENGINEER INSTRUCTIONS
NOTE : READ MANDATORY NOTES SECTION OF ACTION PLAN.
This Action Plan is not complete until all mandatory actions outlined below have been completed.
B1. PROBLEM OVERVIEW:
General problem: There is a component failure
Fault for part 540-6295: NA
*** Start System Error Message ***
cat-a:SC> showchs -c SB4 -v
Total # of records: 1
Component : /N0/SB4
Time Stamp : Sun Dec 04 19:26:26 EST 2011
New Status : Faulty
Old Status : OK
Event Code : HW
Initiator : ScApp
Message : 1.E6900.FAULT.ASIC.CHEETAH.AFSR_2_HI_ISAP.71191111.20-16.1
*** End System Error Message ***
B2. WHAT ACTIONS DOES THE ENGINEER NEED TO TAKE:
In this Document
Goal
Solution
Oracle Confidential (INTERNAL). Do not distribute to customers
Reason: FRU CAP
Applies to:
Sun Fire 4800 Server - Version: Not Applicable and later [Release: N/A and later ]
Sun Fire 4810 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire 6800 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire E4900 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire 3800 Server - Version: Not Applicable and later [Release: N/A and later]
Information in this document applies to any platform.
Goal
How to Replace System Board for Sun Fire 3800, 4800, 4810, 6800, E4900, and E6900 Systems
******************************************************************************
To report errors or request improvements on this
procedure,
please go to http://support.us.oracle.com
and put a comment on Doc ID: 1306577.1
******************************************************************************
Solution
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES ENGINEER NEED:
ScApp, lom
Task Complexity: 4
Time Estimate: 60 minutes
FIELD ENGINEER INSTRUCTIONS
CAP PROBLEM OVERVIEW:
System Board Failure
WHAT STATE SHOULD SYSTEM BE IN TO BE READY TO PERFORM RESOLUTION ACTIVITY?
Examples use a board location of '#'
1) See if DR can be used; if the board is listed in 'cfgadm -av | grep -i perm' output, you can't use DR.
2a) If able to DR, issue 'cfgadm -c disconnect N0.SB#'
2b) If unable, issue 'init 0' and then 'poweroff sb#' at SC prompt
WHAT ACTION DOES ENGINEER NEED TO TAKE:
You will need to move DIMMs from the 'old' board to the 'new' one (same slots).
1) Perform physical SB replacement per Service Manual
2) poweron SB# at SC prompt
3) showchs -b at SC prompt
4) Reset any 'Suspect' or 'Faulty' components to 'ok' from the Main SC or from the lom prompt:
setchs -s OK -r 'SR number' -c <comp>
NOTE: If ScApp 5.20.15 or higher, service mode access IS NO LONGER REQUIRED to execute setchs.
If < 5.20.15, contact service to obtain a Service Mode password or generate one yourself at https://modepass.us.oracle.com
(a backup server is also available from https://modepass-bak.us.oracle.com)
Repeat 'setchs' command until all components are 'ok'.
Verify 'showchs -b' is empty.
5) Verify new board firmware matches existing boards & SC(s) ('showboards -p proms' at SC prompt).
If needed, copy firmware from a like board 'flashupdate -c (source board) (destination board)'
6) Consider running extended POST (On domain issue 'eeprom diag-level=max' or at ok prompt 'setenv diag-level max').
7) If you're replacing a COD (Capacity on Demand) enabled board, refer to "Sun Fire[TM] 12K/15K/E20K/E25K/F3800/Fx800/Ex900 servers: How to replace a COD CPU/memory board" (Doc ID 1002102.1) for the needed steps.
8a) If able to DR, issue 'cfgadm -c configure N0.SB#' at domain level
8b) If unable to DR, issue 'setkeyswitch -d (domainID) off' followed by 'setkeyswitch -d (domainID) on'.
9) Monitor POST.
* If new errors are detected, collect POST and contact Support.
OBTAIN CUSTOMER ACCEPTANCE
WHAT ACTION DOES CUSTOMER NEED TO TAKE TO RETURN SYSTEM TO AN OPERATIONAL STATE:
Boot system if not already booting.
REFERENCE INFORMATION:
Replacement procedures are documented in the Service Manuals:
* Chapter 8, 3800/48x0/6800 Manual http://download.oracle.com/docs/cd/E19095-01/sf3800.srvr/805-7363-15/805-7363-15.pdf
* Chapter 8, E4900/E6900 Manual http://download.oracle.com/docs/cd/E19095-01/sfe4900.srvr/817-4120-13/817-4120-13.pdf
B3. SHOULD DYNAMIC RECONFIGURATION BE USED:
B4. IS OUTAGE REQUIRED AND AGREED TO BY THE CUSTOMER: Yes
B5. NOTICES THAT ENGINEER MUST TAKE INTO ACCOUNT:
ROHS NOTICE: This system has NOT been adequately identified during remote diagnosis for the purposes of RoHS. You must check the system's RoHS compliance by referring to information within FIN 102250, or by verifying with your support centre before commencing service.
B6. ADDITIONAL COMMENTS:
B7. WHAT TROUBLESHOOTING TESTS WERE DONE:
C. GENERAL ACTION PLAN INFORMATION
Action plan for case:
Action plan reference number: 1 (always reference with case id)
Affected product: SUN FIRE E6900
Platform version: N/A
(Please update MOS with correct serial number if necessary)
************* End Hardware Onsite Action Plan *************
***********************************************************
**************** GENERAL INSTRUCTIONS FOR THIS ACTION PLAN ***************
**************************************************************************
Make sure a new explorer [explorer -w all,interactive,scextended] is run and
submitted to proactive.central after installing the new parts. Email explorer to:
explorer-database-americas@sun.com - Americas
explorer-database-emea@sun.com - EMEA (Europe, Middle East, Africa)
explorer-database-apac@sun.com - APAC (Asia, Pacific)
**************************************************************************
Monday, 13 February 2012
When downtime is inevitable
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY :
Customer should shut the OS down gracefully (using "init 0").
Then, at the ok prompt, type the "#." escape sequence to get to the ALOM prompt.
Run the following commands from ALOM:
1. setlocator on
2. poweroff
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
1. Shutting System Down.
2. Extending Server to Maintenance Position.
3. Performing Electrostatic Discharge Prevention Measures.
4. Disconnecting Power From Server.
5. Removing Top Cover.
6. Remove system controller from chassis.
7. Locate system controller card.
8. Push down on ejector levers on each side of system controller until card releases from socket.
9. Grasp top corners of card and pull it out of socket. Place system controller card on antistatic mat.
10. Using a small flat-head screwdriver, carefully pry battery from system controller.
11. Unpackage the replacement battery and press the new battery into the system controller with the positive side facing upward (away from the card).
12. Re-install system controller. Holding bottom edge of system controller, carefully align system controller so that each of its contacts is centered on a socket pin. Ensure that system controller is correctly oriented and ejector levers are open. A notch along the bottom of system controller corresponds to a tab on socket. Push firmly and evenly on both ends of system controller until it is firmly seated in socket. You hear a click when ejector levers lock into place.
13. Use ALOM CMT setdate command to set day and time. Use setdate command before you power on host system.
14. Place top front cover on chassis. Slide front top cover forward until it snaps into place, being careful to avoid catching cover on intrusion switch.
15. Position bezel on front of chassis and snap into place.
16. Open fan door. Tighten captive screw to secure front bezel to chassis.
OBTAIN CUSTOMER ACCEPTANCE (this needs to be performed by the field engineer)
At ALOM prompt:
1. setlocator off
2. Use SC setdate command to set ALOM day and time (before you power on host system).
3. poweron -c
At OK prompt:
Note: The ALOM date and the Solaris date are not in sync. The Solaris date may not be correct and should be set separately.
1. boot -s (to avoid any applications being affected by a wrong OS date).
2. Use the date command to set the correct OS date (optional: init 0, then boot -s to verify).
WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
Customer restarts software applications per applicable administration guides to resume system operation.
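A quick sanity check of the OS clock after the battery swap can be sketched as follows (the epoch threshold is an arbitrary assumption; a clock whose battery was dead typically resets to a very old date):

```shell
# If the clock reset, the epoch will be implausibly small; flag it so the
# date command can be used to correct it before applications resume.
NOW=$(date +%s)
if [ "$NOW" -lt 1000000000 ]; then    # roughly: any date before Sept 2001
    echo "OS clock looks unset; set it with date before resuming applications"
else
    echo "OS clock plausible: $(date)"
fi
```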
Friday, 27 January 2012
How to upload core files to supportfiles.sun.com via ftp
Unsecure File Transfer Options
Oracle recommends that all users employ HTTPS or another secure method to transport files to Oracle. Users that choose to use FTP bear any risk associated with this method of file transport.
Standard method for unsecure upload to supportfiles.sun.com
Customers are instructed by the Oracle engineer to choose a destination directory in which to upload their file, based on the customer's location and the type of file being uploaded. Choices are:
- cores
- iplanetcores
- explorer
- explorer-amer
- explorer-apac
- explorer-emea
ftp supportfiles.sun.com
login: anonymous
password: user@machine [your email address]
cd to the appropriate directory
binary [set transfer mode]
put 62001234.tar.Z
quit
Please Note:
Plans are in place to End-Of-Life Supportfiles within the next 12
months. For Oracle Hardware product telemetry and for files greater than
2GB, we recommend Oracle Secure File Transport.
Tuesday, 10 January 2012
How to collect critical troubleshooting information in the SC's log buffer
1) Log into a system which has access to the Main System Controller (SC) and open a terminal window.
2) Open a script session so the following SC command output will be captured:
$ script -a /tmp/scdatafile
3) Connect to the platform shell of the Main SC per your configuration's requirements (telnet, console, ssh, tip, etc.):
$ console main-sc
$ telnet main-sc
$ ssh main-sc
$ tip main-sc
NOTE: Do not reboot the Main SC before collecting this data. Doing so may erase critical troubleshooting information in the SC's log buffer.
NOTE: You might need to use a "Control right bracket" (Ctrl-]) to disconnect, depending on how you have connected to the SC.
4) From the platform shell, execute the following commands, which will be captured in the script session you opened previously (run showchs -vc for each suspect or faulty component):
showdate
showsc -v
showescape
showkeyswitch
showcodlicense -v
showcodlicense -rv
showcodusage -v
showplatform -v
showplatform -vda
showplatform -vdb
showplatform -vdc
showplatform -vdd
showboards -ev
showcomponent
showfru -r manr
showchs -b (will fail for fw below 5.20.15)
showchs -vc /N0/IB6 (for example)
showdate -v
showdate -v -d a
showdate -v -d b
showdate -v -d c
showdate -v -d d
showlogs -v
showlogs -vp (the -vp* commands will fail for systems with older SCs)
showlogs -vda
showlogs -vpda
showlogs -vdb
showlogs -vpdb
showlogs -vdc
showlogs -vpdc
showlogs -vdd
showlogs -vpda
showerrorbuffer
showerrorbuffer -p
showenvironment -ltuv
history
showdate
5) Exit the script session to save the collected data:
* Hit <control> D and you should get the message "script /tmp/scdatafile closed", "script done" or a similar message.
* Alternatively, you can also type "exit" at the prompt to close the script session.
6) Upload the data file (scdatafile in this example) utilizing the instructions in Document 1020199.1.
* It is suggested that the SR Number be apended to the beginning of the file, for example "SR_Number_scdatafile".
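When the SC is reachable over ssh, the capture in steps 2 through 4 can be scripted instead of typed interactively. A minimal sketch (the hostname main-sc is the example name from above, and it assumes your SC accepts one command per ssh invocation; the ssh line is left commented so you can adapt it to your platform shell first):

```shell
#!/bin/sh
# Collect a subset of the SC diagnostic commands into one file,
# mirroring the script(1) session described above.
SC=main-sc
OUT=/tmp/scdatafile
: > "$OUT"
for cmd in "showdate" "showsc -v" "showboards -ev" "showlogs -v" "showerrorbuffer"; do
    echo "### $cmd" >> "$OUT"
    # ssh "$SC" "$cmd" >> "$OUT" 2>&1    # uncomment on a live system
done
```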
Sunday, 25 December 2011
Ten Points when joining into a new team
So, during the initial stage of a new job, focus on understanding the history of the current environment by talking with the existing team whenever you get a chance to discuss it.
1. Know the scope of your team's job
Team scope is one of the first things to learn when you join a new team, because it determines your learning priorities for the new job.
For example, if you join a team in a large organization whose scope is to support a set of servers that run only databases, your priority immediately shifts to understanding how the database works on Unix and the basics of database terminology. If, at the same time, your team supports no DNS, NIS, or DHCP servers because those are owned by a different team, you need not worry about them in your initial learning.
2. Know the technical architecture of the environment
The technical architecture of the environment covers points such as:
a. How many servers in total (commonly called the "server footprint") are we supporting, and where are they actually located (i.e., data center information)?
b. What operating systems are in use right now, and what are the supported hardware models?
c. What operating environments does the team support now? E.g., production, testing, or development.
d. What applications currently run on our server environment, and who is using them? E.g., Sybase, ClearCase, WebLogic, etc.
e. What storage is in use right now, and what sort of console systems do we use to connect to the servers remotely? EMC, NetApp, Cyclades consoles, etc.
f. What storage management software is in use on which operating systems? E.g., LVM, VxVM, ZFS, etc.
3. Know the procedures and escalation paths
Ideally, any system administrator deals with three types of operations:
a. Break/fix activities (widely known as incidents)
This mainly involves fixing issues encountered in a properly working environment, e.g., a disk failure on a server, a Unix server crashing due to overload, or the network failing due to a bad network port.
b. Changes and service requests
Change operations mainly involve introducing a configuration, hardware, or application change into the currently running environment, either to improve stability or to improve security.
Service requests involve performing operations on specific user requests, such as creating user accounts, changing permissions, installing a new server (called server commissioning), or removing a server (called server decommissioning).
c. Auditing the server environment to identify the quality of service (QoS)
This mainly involves periodically checking all the servers to identify any configuration or security vulnerabilities that compromise the stability of the server environment, and remediating such vulnerabilities by requesting configuration changes.
To perform the above three kinds of operations, every organization has internal rules defining how to act, when to act, and on what to act. These rules vary from job to job; during the initial stage of your job, you should understand them and perform your duties accordingly.
Note: ITIL (Information Technology Infrastructure Library) provides guidelines for defining the above rules in a standard way in any IT organization. Nowadays, major companies streamline their procedures to meet these ITIL guidelines, so that the environment remains easy to manage even after the people who created it leave the organization. Learning ITIL is always beneficial to system admins (or any infrastructure support person).
4. Supporting tools/applications and your access to them
To perform the support operations discussed in the previous point, organizations need proper tools/applications so that employees and support staff can request and respond in an automated way, per the procedures defined in the organization. E.g., the Remedy ticketing tool, HP Service Manager, etc.
Once you join a new team, make sure you request access to all the related tools in time and test that access.
5. Communication procedures with other support teams and vendors
As a system admin, a major part of the day job involves communicating with other support teams: the database team, network team, application team, hardware vendors, data center support team, etc.
For successful service delivery, it is important for system administrators to have all of their contact details (phone, email, and internal chat IDs) handy. So gather this information and build a good document you can use on the job. Writing this information down and keeping it safe is very important, because minor issues often turn into major problems when we don't know whom to contact the moment we notice an issue.
6. Know where to find information
Every team has some kind of documentation that explains the operations the team performs, and this documentation gives you more information than any individual can share with you. Unfortunately, reading all these documents doesn't help you understand what is actually going on in the job during your initial stage in the team, but the same documents might save your life once you start actively working.
During the initial stage, just find out where the documentation is kept and get access to it. Then go quickly through the entire set (you don't need to remember everything you read), so that you will know where to look when you need a specific piece of information about a specific issue.
7. Know the important infrastructure servers' details
Ideally, system administrators classify their servers into two groups: the servers used by end users (e.g., database servers and application servers), and the infrastructure servers used to manage the first set effectively (e.g., JumpStart remote installation servers and DHCP, DNS, NIS, and LDAP servers).
As explained in point 1, you may or may not manage these infrastructure servers depending on the scope of your team, but you must know their details, because every other server in your environment depends on them.
Below are important questions to try to answer during the initial stage of the job:
a. What name servers (DNS/NIS/LDAP) are we using, and what are their names, aliases, and IPs?
b. What remote installation (JumpStart/Kickstart) servers are we using, and what is our access to them?
c. Is there a DHCP server in the environment, or is DHCP managed by customized tools, e.g., QIP?
8. Get ready with the appropriate logistics
Every Unix administrator starts his work by requesting access to a Windows product (desktop access, Outlook). The moment you join a new job, start requesting access to your desktop PC login, VoIP phone (with international dialing if the job requires calling overseas), email account, internal chat messenger, data center (if the job requires physical access to the DC), smart cards/security tokens, etc.
The moment you get email access, you may have to manage the flood of email coming to your team every day; you might have to create appropriate Outlook rules to filter out email you don't have to respond to during the first one or two months of the new job. Later, you can slowly start reading and responding once you are actually ready to work on the floor.
9. Areas of automation, and their specific details
A system administrator cannot survive in the job without knowing how to automate (using scripting) the work he does repeatedly. Whenever you join a new team, specifically ask for information about any automated scripts in place that are used for the day-to-day job.
Most of the time, system admins write scripts to perform daily or weekly system health checks, and these may run regularly from specific servers via the cron scheduler. It is better to know about them beforehand; this will help if you later want to introduce your own scripts for the team's benefit.
10. Understand monitoring alerts and response procedures
As explained in point 8, you will receive tons of mail the moment your email ID is added to the team DL (email distribution list), and a major part of it may come from automated monitoring systems that check the health of your server environment and inform the system admin team the moment an issue is noticed. If you start receiving such mail, don't just ignore it because you don't know what to do with it. Instead, note these alerts and keep raising questions with your team to learn how to respond to them.
Also keep automatic reminders in your Outlook for the alerts that are critical and urgent in nature, so that you won't miss them.
What does your experience say about this? Just share it with us.
If you find this post useful, share it, so that friends of yours who are changing jobs can benefit from it.
Thursday, 1 December 2011
How to Force a Crash Dump When the Solaris Operating System is Hung
In most cases, a system crash dump of a hung system can be forced. However, this is not guaranteed to work for all system hang conditions. To force a dump, you often need to drop down to the boot PROM monitor (OBP) prompt, also known as the "OK prompt", suspending all current program execution.
There are several ways to drop a Sun system to the OK prompt.
1. On older Sun systems with a serial (PS2 type) Sun keyboard and monitor attached, this suspension is performed via a "Stop-A". The upper left key on a Sun keyboard is labeled "Stop". While holding down this key, press the A key.
2. On systems using ASCII terminals for the console, the terminal's predefined break sequence can be used to get to the boot PROM monitor.
3. Newer Sun systems with USB keyboards may require an alternate sequence.
4. Some Sun systems have a system controller/SSP (Enterprise 10000/15000, Sun Fire X800) or ALOM/RSC (Vx80/Vx90 and most new Netra servers) instead of serial port/keyboard access. These can be used to break a hanging system or domain.
Note: There are special procedures for Sun SPARC(R) Enterprise Mx000 (OPL) Servers, T1000/T2000 systems, and x86 and x64 systems.
The boot PROM monitor will respond with:
Type 'go' to resume
ok
If you don't see this message, you were probably not successful in stopping the system.
Once at the ok prompt, type 'sync' (without the quotes) and press Enter.
The system will immediately panic. Now the hang condition has been converted into a panic, so an image of memory can be collected for later analysis. The system will attempt to reboot after the dump is complete.
The sync command forces the computer to illegally access a memory location, thereby causing a "panic: zero". On later revisions of Solaris 8 and above, you will see "panic: sync initiated" instead.
Not all hang situations can be interrupted. If Stop-A or Break doesn't work, sometimes a series of the same will do the trick. Some hangs are even more stubborn and can only be interrupted by physically disconnecting the console keyboard or terminal from the system for a minute, and then plugging it back in.
If all these attempts fail, you will have to power down the system, sadly losing the contents of memory. With luck, a subsequent hang will be interruptible.
NOTE: On the systems with keyswitches, be sure the key is not in the secure position, as this disables the break interrupt in the zs driver.
Sunday, 20 November 2011
zstat-process
Just in case you encounter a large /var/adm/exacct/zstat-process file, here is a workaround to reclaim the space.
# df -kh /var
Filesystem Size Used Available Capacity Mounted on
/dev/md/dsk/d3 4.9G 4.4G 529M 90% /var
# find /var -xdev -type f -size +100000 -ls -exec du -sk {} \;
273 2740992 -rw------- 1 root root 2805391217 Nov 20 06:32 /var/adm/exacct/zstat-process
2740992 /var/adm/exacct/zstat-process
# svcs -a | grep zstat
online Apr_05 svc:/application/xvm/zstat:default
# svcadm restart svc:/application/xvm/zstat:default
# svcs -a | grep zstat
online 6:38:37 svc:/application/xvm/zstat:default
# df -k /var
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/md/dsk/d3 5166102 1831942 3282499 36% /var
# df -kh /var
Filesystem Size Used Available Capacity Mounted on
/dev/md/dsk/d3 4.9G 1.7G 3.1G 36% /var
# find /var -xdev -type f -size +100000 -ls -exec du -sk {} \;
#
Monday, 24 October 2011
sesudo
Executes commands that require superuser authority on behalf of a regular user.
SYNOPSIS
sesudo [[-h] | [command [parameters]]]
DESCRIPTION
The sesudo command borrows the permissions of another user (known as the target user) to perform one or more commands. This enables regular users to perform actions that require superuser authority, such as the mount command. The rules governing the user's authority to perform the command are defined in the SUDO class.
Notes
- You must define the access rules for the user in the SUDO class. The definition may specify commands that the user can use and commands that the user is prohibited from using.
- The output depends on the command that is being executed. Error messages are sent to the standard error device (stderr), usually defined as the terminal screen.
- To execute the sesudo command, the user should specify the following at the UNIX shell prompt:
sesudo profile_name
- You can choose whether the command is displayed before it is executed. The default value is that commands are not displayed. To display commands, change the value in the echo_command token in the sesudo section of the seos.ini file.
Arguments
- -h
- Displays the help screen.
- command [parameters]
- Specifies the command that is to be performed on behalf of the user. The command name must be the name of a record in the SUDO class. Multiple parameters can be specified, provided they are separated by spaces.
Prerequisites: Define SUDO Commands
Several steps must be performed before it is possible to use the sesudo command. The first step needs to be done only once. The other steps need to be done every time a new user is given the authority to execute the sesudo command, or every time a new profile is defined in the SUDO class.
- Define the sesudo program as a trusted setuid program owned by root. This step only needs to be done once per TACF installation. The format of the command is:
newres PROGRAM /usr/seos/bin/sesudo defaccess(NONE)
- Give a user the authority to execute the sesudo program.
Do this once for every user who is entitled to this authority. The
format of the command is:
authorize PROGRAM /usr/seos/bin/sesudo uid(user_name)
- Permit the user to surrogate to the target user using the
sesudo program. Do this for every user who should have this
authority, and do it for every target user ID that you want to make available
to the user. The format of the command is:
authorize SURROGATE USER.root uid(user_name) via(pgm(/usr/seos/bin/sesudo))
- Define new records in the SUDO class for every command to be executed by
users. For each command script, you can define permitted and forbidden
parameters, permitted users, and password protection. If no parameters
are specified as permitted or prohibited, then all parameters are
permitted. The format of the command is:
newres SUDO profile_name data('cmd[;[prohibited-params][;permitted-params]]')
A command can have prohibited and permitted parameters for each operand. The prohibited parameters and the permitted parameters for each operand are separated by the pipe symbol (|). The format is:
newres SUDO profile_name data('cmd;pro1|pro2|...|proN;per1|per2|...|perN')
sesudo checks each parameter entered by the user in the following manner:
- Test if parameter number N matches permitted parameter N. (If permitted parameter N does not exist, the last permitted parameter is used.)
- Test if parameter number N matches prohibited parameter N. (If prohibited parameter N does not exist, the last prohibited parameter is used.)
Only if all the parameters match permitted parameters, and none match prohibited parameters, does sesudo execute the command.
- Permit the user to access the profile that has been defined in the SUDO class. Do this for every profile a user should be able to access. The format of the command is:
authorize SUDO profile_name uid(user_name)
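The two checks above can be illustrated with a small shell sketch. This is an illustration of the matching rule as described, not the actual sesudo implementation; pattern lists are assumed to be space-separated glob patterns, and the last pattern in a list is reused when the parameter index runs past the end of the list:

```shell
set -f   # keep glob patterns like '*' from expanding to file names

# Illustration only: apply permitted/prohibited pattern lists to
# positional parameters the way the text above describes.
match_nth() {   # match_nth PARAM INDEX PATTERN...
    p=$1; n=$2; shift 2
    # Walk to pattern N, stopping at the last one if the list is short.
    while [ "$n" -gt 1 ] && [ $# -gt 1 ]; do shift; n=$((n - 1)); done
    case $p in $1) return 0 ;; *) return 1 ;; esac
}
check_params() {    # check_params "PERMITTED" "PROHIBITED" PARAM...
    perm=$1; pro=$2; shift 2; i=1
    for p in "$@"; do
        match_nth "$p" "$i" $perm || return 1    # must match a permitted pattern
        if [ -n "$pro" ] && match_nth "$p" "$i" $pro; then
            return 1                             # must not match a prohibited one
        fi
        i=$((i + 1))
    done
    return 0
}
```

For example, with permitted list "*" and prohibited list "-9 -HUP" (as in the kill example later in this post), `check_params "*" "-9 -HUP" restart` succeeds, while `check_params "*" "-9 -HUP" -9` fails.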
If defaccess is none, specify each user who is granted permission with the authorize command. If defaccess is set to allow access by default, use the authorize command to specify each user to whom access is forbidden.
- The sesudo command can display the command before executing it. Display depends on the value of the echo_command token in the [sesudo] section of the seos.ini file. The default value calls for no display, but the value can be changed.
- The output of the sesudo command depends on the command being performed. Error messages are sent to the standard error device (stderr), usually defined as the terminal screen.
SUDO Record: Parameters and Variables
The special parameters used in connection with the SUDO record are explained in the following list:
- profile_name
- The name the security administrator gives to the superuser command.
- cmd
- The superuser command that a normal user can execute.
- prohibited parameters
- The parameters that you prohibit the regular user from invoking. These parameters may contain patterns or variables.
- permitted parameters
- The parameters that you specifically allow the regular user to invoke. These parameters may contain patterns or variables.
- $A
- Alphabetic value
- $G
- Existing TACF group name
- $H
- Home path pattern of the user
- $N
- Numeric value
- $O
- Executor's user name
- $U
- Existing TACF user name
- $f
- Existing file name
- $g
- Existing UNIX group name
- $h
- Existing host name
- $r
- Existing UNIX file name with UNIX read permission
- $u
- Existing UNIX user name
- $w
- Existing UNIX file name with UNIX write permission
- $x
- Existing UNIX file name with UNIX exec permission
Return Value
Each time the sesudo command runs, it returns one of the following values:
- -2
- Target user not found, or command interrupted
- -1
- Password error
- 0
- Execution successful
- 10
- Problem with usage of parameters
- 20
- Target user error
- 30
- Authorization error
- If you do not allow any parameters, define the profile in the following way:
newres SUDO profile_name data('cmd;*')
- If you want to allow the user to invoke the NAME parameter, do the following:
newres SUDO profile_name data('cmd;;NAME')
In the previous example, the only parameter the user can enter is NAME.
- If you want to prevent the user from using -9 and -HUP but permit the user to use all other parameters, do the following:
newres SUDO profile_name data('cmd;-9 -HUP;*')
- If there are two prohibited parameters (the first is the UNIX user name and the second is the UNIX group name) and two permitted parameters (the first can be numeric and the second alphabetic), enter the following:
newres SUDO profile_name \ data('cmd;$u | $g ;$N | $A')
The user cannot enter the UNIX user name, but can enter a numeric parameter for the first operand; and the user cannot enter the UNIX group name, but can enter an alphabetic parameter for the second operand.
- If there are several prohibited parameters for several operands in the command, enter the following:
newres SUDO profile_name \ data('cmd;pro1 pro2 | pro3 pro4 | pro5 pro6')
pro1 and pro2 are the prohibited parameters of the first operand of the command; pro3 and pro4 are the prohibited parameters of the second operand; and pro5 and pro6 are the prohibited parameters of the third operand.
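The return values listed above can be interpreted in a wrapper script. The sketch below is illustrative only: the `explain_sesudo_rc` helper is hypothetical (not part of TACF), and the code-to-message mapping simply follows the Return Value list. Note that shells report negative exit statuses modulo 256, so -2 and -1 appear as 254 and 255.

```shell
#!/bin/sh
# Hypothetical helper: map a sesudo exit status to the messages in the
# Return Value list above. 254/255 are how the shell reports -2/-1.
explain_sesudo_rc() {
    case "$1" in
        254) echo "target user not found, or command interrupted" ;;
        255) echo "password error" ;;
        0)   echo "execution successful" ;;
        10)  echo "problem with usage of parameters" ;;
        20)  echo "target user error" ;;
        30)  echo "authorization error" ;;
        *)   echo "unknown sesudo return code: $1" ;;
    esac
}

# Example usage (run sesudo only if it exists on this system):
# sesudo profile_name; explain_sesudo_rc $?
explain_sesudo_rc 0
explain_sesudo_rc 30
```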
Thursday, 20 October 2011
PICL bug causes Solaris 10 prtdiag to hang
The Solaris PICL framework provides information about the system configuration, which it maintains in the PICL tree. I ran into a case where Solaris 10 prtdiag hangs, leaving stuck child processes behind and driving the load average up. The fix is to stop and start picld.
# top
load averages: 1582.95, 1462.52, 1345.91 22:57:54
8548 processes:8532 sleeping, 1 running, 1 zombie, 14 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 8064M real, 2747M free, 4123M swap in use, 9005M swap free
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
13622 root 999 59 0 318M 297M sleep 370.3H 82.48% java
26222 root 1 0 0 0K 0K cpu/6 7:52 5.88% ps
25217 root 1 20 0 0K 0K sleep 11:59 5.49% ps
27618 root 1 0 0 0K 0K cpu/5 1:08 5.43% ps
27101 root 1 0 0 0K 0K cpu/4 3:31 5.16% ps
***You can see here that PID 13622 is using a lot of CPU.
***When you check its child processes, they point to prtdiag
# ps -ef | grep 13622
root 23066 13622 0 Oct 15 ? 0:00 /usr/bin/ctrun -l child -o pgrponly /bin/sh -c /usr/sbin/prtdiag
root 802 13622 0 Oct 15 ? 0:00 /usr/bin/ctrun -l child -o pgrponly /bin/sh -c /usr/sbin/prtdiag
root 28092 13622 0 Oct 15 ? 0:00 /usr/bin/ctrun -l child -o pgrponly /bin/sh -c /usr/sbin/prtdiag
***Restart the PICL
# svcadm restart picl
***Check the load via uptime
# uptime
1:26am up 50 day(s), 11:22, 3 users, load average: 1886.79, 1513.28, 1402.57
***After a couple of minutes check it again
# uptime
1:26am up 50 day(s), 11:23, 3 users, load average: 962.59, 1327.11, 1343.05
***You can observe a dramatic drop on the load
# top
load averages: 3.71, 367.87, 875.60 01:33:23
76 processes: 75 sleeping, 1 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 8064M real, 5630M free, 1103M swap in use, 12G swap free
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
15726 root 1 42 0 108M 33M sleep 21:44 2.89% bptm
11116 root 1 52 0 108M 33M sleep 10:50 0.76% bptm
13622 root 78 59 0 215M 200M sleep 403.3H 0.12% java
15611 root 1 59 0 108M 67M sleep 5:09 0.10% bptm
16953 root 1 0 0 4432K 2160K cpu/11 0:00 0.08% top
Monday, 17 October 2011
Introduction to Veritas Cluster Services
In any organization, every server in the network has a specific purpose, and most of the time these servers provide a stable environment for the software applications that the organization's business depends on. Usually these applications are critical for the business, and organizations cannot afford to have them down even for minutes. For example: a bank's application that handles its internet banking.
If the application is not business critical, the organization can consider running it standalone; in other words, when the application goes down it does not impact the actual business.
Usually, the application clients connect to the application server using the server name, the server IP, or a dedicated application IP.
Now assume the organization has an application that is critical to its business, where any impact to the application causes a huge loss. In that case, one option to reduce the impact of application failure due to an operating system or hardware fault is to purchase a secondary server with the same hardware configuration, install the same kind of OS and database, and configure it with the same application in passive mode, then "fail over" the application from the primary server to this secondary server whenever there is an issue with the primary server's underlying hardware or operating system. This gives us an application server in a highly available configuration.
Whenever an issue on the primary server makes the application unavailable to the client machines, the application should be moved to another available server in the network, either by manual or automatic intervention. Transferring the application from the primary server to the secondary server, and making the secondary server active for the application, is called a "failover" operation. The reverse operation (restoring the application on the primary server) is called "failback". We can therefore call this configuration an application HA (highly available) setup, in contrast to the earlier standalone setup.
Now the question is: how does a manual failover work when there is an application issue due to hardware or the operating system?
A manual failover basically involves the steps below:
1. The application IP should fail over to the secondary node.
2. The same storage and data should be available on the secondary node.
3. Finally, the application should fail over to the secondary node.
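The three steps above can be sketched as a script. This is a dry-run illustration only: the IP address, interface, device path, mount point, and start command are all hypothetical placeholders, and each step is echoed rather than executed.

```shell
#!/bin/sh
# Manual failover sketch (dry run). All names below are hypothetical:
# the IP, interface, device, mount point, and app command are placeholders.
APP_IP="192.168.1.100"          # application (virtual) IP
APP_MOUNT="/app/data"           # shared storage mount point

run() { echo "WOULD RUN: $*"; } # dry run: print the step instead of executing it

# 1. Bring the application IP up on the secondary node
run ifconfig hme0 addif "$APP_IP" netmask 255.255.255.0 up
# 2. Make the same storage and data available on the secondary node
run mount /dev/dsk/c1t1d0s0 "$APP_MOUNT"
# 3. Finally, start the application on the secondary node
run /opt/app/bin/appstart
```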
Challenges in Manual Failover Configuration
1. Resources must be monitored continuously.
2. It is time consuming.
3. It is technically complex when the application involves many dependent components.
On the other hand, we can use automatic failover software that does this work without human intervention. It groups the primary and secondary servers for the application, always keeps an eye on the primary server for failures, and fails the application over to the secondary server automatically whenever there is an issue with the primary.
Although two different servers support the application, both of them actually serve the same purpose, and from the application client's perspective they should be treated as a single application cluster server (composed of multiple physical servers in the background).
So a cluster is nothing but a "group of individual servers working together to serve the same purpose, appearing as a single machine to the external world".
What cluster software is available in the market today? There are many options, depending on the operating system and the application to be supported. Some are native to the operating system; others come from third-party vendors.
List of Cluster Software available in the market
* SUN Cluster Services – native Solaris cluster
* Linux Cluster Server – native Linux cluster
* Oracle RAC – application-level cluster for the Oracle database that works on different operating systems
* Veritas Cluster Services – third-party cluster software that works on different operating systems such as Solaris, Linux, AIX, and HP-UX
* HACMP – IBM AIX-based cluster technology
* HP-UX native cluster technology
Note: This post discusses VCS and its operations. It does not cover the actual implementation or any VCS command syntax, but it does cover the concepts of how VCS makes an application highly available (HA).
Veritas Cluster Services Components
VCS has two types of components: 1. Physical components 2. Logical components
Physical Components:
1. Nodes
VCS nodes host the service groups (managed applications). Each system is connected to networking hardware, and usually also to storage hardware. The systems contain components to provide resilient management of the applications, and start and stop agents.
Nodes can be individual systems, or they can be created with domains or partitions on enterprise-class systems. Individual cluster nodes each run their own operating system and possess their own boot device. Each node must run the same operating system within a single VCS cluster.
Clusters can have from 1 to 32 nodes. Applications can be configured to run on specific nodes within the cluster.
2. Shared storage
Storage is a key resource of most applications services, and therefore most service groups. A managed application can only be started on a system that has access to its associated data files. Therefore, a service group can only run on all systems in the cluster if the storage is shared across all systems. In many configurations, a storage area network (SAN) provides this requirement.
You can use I/O fencing technology for data protection. I/O fencing blocks access to shared storage from any system that is not a current and verified member of the cluster.
3. Networking Components
Networking in the cluster is used for the following purposes:
*Communications between the cluster nodes and the Application Clients and external systems.
*Communications between the cluster nodes, called Heartbeat network.
Logical Components
1. Resources
Resources are hardware or software entities that make up the application. Resources include disk groups and file systems, network interface cards (NIC), IP addresses, and applications.
1.1. Resource dependencies
Resource dependencies indicate resources that depend on each other because of application or operating system requirements. Resource dependencies are graphically depicted in a hierarchy, also called a tree, where the resources higher up (parent) depend on the resources lower down (child).
1.2. Resource types
VCS defines a resource type for each resource it manages. For example, the NIC resource type can be configured to manage network interface cards. Similarly, all IP addresses can be configured using the IP resource type.
VCS includes a set of predefined resources types. For each resource type, VCS has a corresponding agent, which provides the logic to control resources.
2. Service groups
A service group is a virtual container that contains all the hardware and software resources that are required to run the managed application. Service groups allow VCS to control all the hardware and software resources of the managed application as a single unit. When a failover occurs, resources do not fail over individually— the entire service group fails over. If there is more than one service group on a system, a group may fail over without affecting the others.
A single node may host any number of service groups, each providing a discrete service to networked clients. If the server crashes, all service groups on that node must be failed over elsewhere.
Service groups can be dependent on each other. For example a finance application may be dependent on a database application. Because the managed application consists of all components that are required to provide the service, service group dependencies create more complex managed applications. When you use service group dependencies, the managed application is the entire dependency tree.
2.1. Types of service groups
VCS service groups fall in three main categories: failover, parallel, and hybrid.
* Failover service groups
A failover service group runs on one system in the cluster at a time. Failover groups are used for most applications that do not support multiple systems to simultaneously access the application’s data.
* Parallel service groups
A parallel service group runs simultaneously on more than one system in the cluster. A parallel service group is more complex than a failover group. Parallel service groups are appropriate for applications that manage multiple application instances running simultaneously without data corruption.
* Hybrid service groups
A hybrid service group is for replicated data clusters and is a combination of the failover and parallel service groups. It behaves as a failover group within a system zone and a parallel group across system zones.
3. VCS Agents
Agents are multi-threaded processes that provide the logic to manage resources. VCS has one agent per resource type. The agent monitors all resources of that type; for example, a single IP agent manages all IP resources.
When the agent is started, it obtains the necessary configuration information from VCS. It then periodically monitors the resources, and updates VCS with the resource status.
4. Cluster Communications and VCS Daemons
Cluster communications ensure that VCS is continuously aware of the status of each system’s service groups and resources. They also enable VCS to recognize which systems are active members of the cluster, which have joined or left the cluster, and which have failed.
4.1. High availability daemon (HAD)
The VCS high availability daemon (HAD) runs on each system. Also known as the VCS engine, HAD is responsible for:
* building the running cluster configuration from the configuration files
* distributing the information when new nodes join the cluster
* responding to operator input
* taking corrective action when something fails.
The engine uses agents to monitor and manage resources. It collects information about resource states from the agents on the local system and forwards it to all cluster members. The local engine also receives information from the other cluster members to update its view of the cluster.
The hashadow process monitors HAD and restarts it when required.
4.2. HostMonitor daemon
VCS also starts HostMonitor daemon when the VCS engine comes up. The VCS engine creates a VCS resource VCShm of type HostMonitor and a VCShmg service group. The VCS engine does not add these objects to the main.cf file. Do not modify or delete these components of VCS. VCS uses the HostMonitor daemon to monitor the resource utilization of CPU and Swap. VCS reports to the engine log if the resources cross the threshold limits that are defined for the resources.
4.3. Group Membership Services/Atomic Broadcast (GAB)
The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for cluster membership and cluster communications.
* Cluster Membership
GAB maintains cluster membership by receiving input on the status of each node's heartbeat via LLT. When a system no longer receives heartbeats from a peer, it marks the peer as DOWN and excludes the peer from the cluster. In VCS, memberships are sets of systems participating in the cluster.
* Cluster Communications
GAB’s second function is reliable cluster communications. GAB provides guaranteed delivery of point-to-point and broadcast messages to all nodes. The VCS engine uses a private IOCTL (provided by GAB) to tell GAB that it is alive.
4.4. Low Latency Transport (LLT)
VCS uses private network communications between cluster nodes for cluster maintenance. Symantec recommends two independent networks between all cluster nodes. These networks provide the required redundancy in the communication path and enable VCS to discriminate between a network failure and a system failure. LLT has two major functions.
* Traffic Distribution
LLT distributes (load balances) internode communication across all available private network links. This distribution means that all cluster communications are evenly distributed across all private network links (maximum eight) for performance and fault resilience. If a link fails, traffic is redirected to the remaining links.
* Heartbeat
LLT is responsible for sending and receiving heartbeat traffic over network links. The Group Membership Services function of GAB uses this heartbeat to determine cluster membership.
4.5. I/O fencing module
The I/O fencing module implements a quorum-type functionality to ensure that only one cluster survives a split of the private network. I/O fencing also provides the ability to perform SCSI-3 persistent reservations on failover. The shared disk groups offer complete protection against data corruption by nodes that are assumed to be excluded from cluster membership.
5. VCS Configuration files.
5.1. main.cf
/etc/VRTSvcs/conf/config/main.cf is the key file in the VCS configuration. The main.cf file essentially describes the following information to the VCS agents and daemons:
What are the nodes available in the cluster?
What are the service groups configured for each node?
What are the resources available in each service group, their types, and their attributes?
What dependencies does each resource have on other resources?
What dependencies does each service group have on other service groups?
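As an illustration, a minimal main.cf might look like the following. This is a heavily abbreviated sketch: the cluster, node, group, and resource names are all hypothetical, and a real main.cf carries many more attributes.

```
include "types.cf"

cluster demo_clus ( )

system node1 ( )
system node2 ( )

group app_sg (
        SystemList = { node1 = 0, node2 = 1 }
        AutoStartList = { node1 }
        )

        IP app_ip (
                Device = hme0
                Address = "192.168.1.100"
                )

        NIC app_nic (
                Device = hme0
                )

        app_ip requires app_nic
```

The `requires` line at the end expresses a resource dependency: the IP resource (parent) cannot come online until the NIC resource (child) is up.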
5.2. types.cf
The file types.cf, which is listed in the include statement in the main.cf file, defines the VCS bundled types for VCS resources. The file types.cf is also located in the folder /etc/VRTSvcs/conf/config.
5.3. Other Important files
/etc/llthosts—lists all the nodes in the cluster
/etc/llttab—describes the local system’s private network links to the other nodes in the cluster
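For a two-node cluster, these files might look like the following. The node names, cluster number, and network device paths are illustrative placeholders and are site-specific.

```
# /etc/llthosts - node ID to node name mapping (same on all nodes)
0 node1
1 node2

# /etc/llttab - on node1; two private links for heartbeat redundancy
set-node node1
set-cluster 2
link qfe0 /dev/qfe:0 - ether - -
link qfe1 /dev/qfe:1 - ether - -
```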
Wednesday, 5 October 2011
VERITAS Volume Manager for Solaris
Veritas Volume Manager (VxVM) is a storage management application from Symantec that allows you to manage physical disks as logical devices called volumes.
VxVM uses two types of objects to perform the storage management
1. Physical objects - direct mappings to physical disks
2. Virtual objects - volumes, plexes, subdisks, and disk groups
a. Disk groups are composed of Volumes
b. Volumes are composed of Plexes and Subdisks
c. Plexes are composed of SubDisks
d. Subdisks are actual disk space segments of VxVM disk ( directly mapped from the physical disks)
1. Physical Disks
A physical disk is the basic storage where data is ultimately stored. In Solaris, physical disk names use the convention c#t#d#, where c# refers to the controller/adapter connection, t# refers to the SCSI target ID, and d# refers to the disk device ID.
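The c#t#d# convention can be pulled apart with standard shell tools. A small sketch (the device names below are just examples):

```shell
#!/bin/sh
# Split a Solaris c#t#d# device name (optionally with an s# slice suffix)
# into its controller, target, and disk components using sed.
parse_ctd() {
    dev="$1"
    c=$(echo "$dev" | sed -n 's/^c\([0-9]*\)t[0-9]*d[0-9]*.*/\1/p')
    t=$(echo "$dev" | sed -n 's/^c[0-9]*t\([0-9]*\)d[0-9]*.*/\1/p')
    d=$(echo "$dev" | sed -n 's/^c[0-9]*t[0-9]*d\([0-9]*\).*/\1/p')
    echo "controller=$c target=$t disk=$d"
}

parse_ctd c0t1d0     # -> controller=0 target=1 disk=0
parse_ctd c2t5d3s0   # -> controller=2 target=5 disk=3
```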
Physical disks can come from different sources: disks internal to the server, disks from a disk array, and disks from the SAN.
Check if the disks are recognized by Solaris
# echo | format
Searching for disks…done
AVAILABLE DISK SELECTIONS:
0. c0t0d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
1. c0t1d0 <SUN9.0G cyl 4924 alt 2 hd 27 sec 133>
/sbus@1f,0/SUNW,fas@e,8800000/sd@1,0
2. Solaris Native Disk Partitioning
In Solaris, physical disks are partitioned into slices numbered s0, s1, s3, s4, s5, s6, and s7; slice s2, normally called the overlap slice, points to the entire disk. The format utility is used to partition physical disks into slices.
Once new disks are added to the server, they must first be recognized at the Solaris level before any other storage management utility can use them.
Steps to add a new disk to Solaris:
If recently added disks are not visible to the server, you can use one of the procedures below.
Option 1: Reconfiguration reboot (for server hardware models that do not support hot swapping/dynamic addition of disks)
# touch /reconfigure; init 6
or
# reboot -- -r (only if no applications are running on the machine)
Option 2: Recognize the disks added to external SCSI, without reboot
# devfsadm
# echo | format <== to check the newly added disks
Option 3: Recognize disks added to internal SCSI hot-swappable disk connections.
Run the command "cfgadm -al", check for any newly added devices in the "unconfigured" state, and configure them.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 disk connected configured unknown
c0::rmt/0 tape connected configured unknown
c1 scsi-bus connected configured unknown
c1::dsk/c1t0d0 unavailable connected unconfigured unknown <== disk not configured
c1::dsk/c1t1d0 unavailable connected unconfigured unknown <== disk not configured
# cfgadm -c configure c1::dsk/c1t0d0
# cfgadm -c configure c1::dsk/c1t1d0
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 disk connected configured unknown
c0::rmt/0 tape connected configured unknown
c1 scsi-bus connected configured unknown
c1::dsk/c1t0d0 disk connected configured unknown <== disk configured now
c1::dsk/c1t1d0 disk connected configured unknown <== disk configured now
# devfsadm
# echo | format <== now you should see all the disks connected to the server
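The check-then-configure steps above can be scripted. A sketch that works over the abbreviated cfgadm listing shown above; it only echoes the cfgadm commands it would run, so it is safe to try anywhere (on a real Solaris host you would pipe `cfgadm -al` in and drop the echo):

```shell
# Find attachment points that cfgadm reports as unconfigured and print
# the command that would configure each one. The sample listing below is
# the abbreviated output from the text.
cfgadm_sample='Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 disk connected configured unknown
c1::dsk/c1t0d0 unavailable connected unconfigured unknown
c1::dsk/c1t1d0 unavailable connected unconfigured unknown'

printf '%s\n' "$cfgadm_sample" |
awk '$4 == "unconfigured" { print "cfgadm -c configure " $1 }'
```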
3. Initialize Physical Disks under VxVM control
A formatted physical disk is considered uninitialized until it is initialized for use by VxVM. When a disk is initialized, partitions for the public and private regions are created, VM disk header information is written to the private region, and actual data is written to the public region. During the normal initialization process, any data or partitions that may have existed on the disk are removed.
Note: Encapsulation is another method of placing a disk under VxVM control in which existing data on the disk is preserved
An initialized disk is placed into the VxVM free disk pool. The VxVM free disk pool contains disks that have been initialized but that have not yet been assigned to a disk group. These disks are under Volume Manager control but cannot be used by Volume Manager until they are added to a disk group
Device Naming Schemes
In VxVM, device names can be represented in two ways:
Using the traditional operating system-dependent format c#t#d#
Using an operating system-independent format that is based on enclosure names
c#t#d# Naming Scheme
Traditionally, device names in VxVM have been represented in the way that the operating system represents them. For example, Solaris and HP-UX both use the format c#t#d# in device naming, which is derived from the controller, target, and disk number. In VxVM version 3.1.1 and earlier, all disks are named using the c#t#d# format. VxVM parses disk names in this format to retrieve connectivity information for disks.
Enclosure-Based Naming Scheme
With VxVM version 3.2 and later, VxVM provides a new device naming scheme, called enclosure-based naming. With enclosure-based naming, the name of a disk is based on the logical name of the enclosure, or disk array, in which the disk resides.
Steps to Recognize new disks under VxVM control
1. Run the command below to see the disks available under VxVM control:
# vxdisk list
In the output you will see one of the statuses below:
error - indicates that the disk has neither been initialized nor encapsulated by VxVM; the disk is uninitialized.
online - indicates that the disk has been initialized or encapsulated.
online invalid - indicates that the disk is visible to VxVM but not yet under VxVM control.
If disks are visible with the "format" command but not with "vxdisk list", run the command below to make VxVM scan for the new disks:
# vxdctl enable
Now you should see the new disks with the status "online invalid".
2. Initialize each disk with the "vxdisksetup" command:
# /etc/vx/bin/vxdisksetup -i <disk_address>
After running this command, "vxdisk list" should show the status "online" for all newly initialized disks.
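Initializing several disks at once can be scripted around the vxdisk status field. A sketch over a made-up vxdisk listing (the device and group names are illustrative); it only prints the vxdisksetup commands it would run, so it is safe outside a VxVM host:

```shell
# Pick out disks whose STATUS is "online invalid" (visible to VxVM but
# not initialized) and print the vxdisksetup command for each. The
# listing below is illustrative; on a real host, pipe `vxdisk list` in.
vxdisk_sample='DEVICE TYPE DISK GROUP STATUS
c1t0d0s2 auto:cdsdisk mydg01 mydg online
c1t1d0s2 auto:none - - online invalid
c1t2d0s2 auto:none - - online invalid'

printf '%s\n' "$vxdisk_sample" |
awk '$(NF-1) == "online" && $NF == "invalid" { print "/etc/vx/bin/vxdisksetup -i " $1 }'
```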
4. Virtual Objects (Disk Groups / Volumes / Plexes) in VxVM
Disk Groups
A disk group is a collection of VxVM disks (referred to as VM disks from here on) that share a common configuration. Within a disk group, disk space is divided into subdisks, which are grouped into plexes, which in turn form volumes.
Volumes
A volume is a virtual disk device that appears to applications, databases, and file systems like a physical disk device, but does not have the physical limitations of a physical disk device. A volume consists of one or more plexes, each holding a copy of the selected data in the volume.
Plexes:
VxVM uses subdisks to create virtual objects called plexes. A plex consists of one or more subdisks located on one or more physical disks.
Key Points on Transformation of Physical disks into Veritas Volumes
1. Recognize the disks under Solaris using devfsadm, cfgadm, or a reconfiguration reboot, and verify with the format command
2. Recognize the disks under VxVM using "vxdctl enable"
3. Initialize the disks under VxVM using vxdisksetup
4. Add the disks to a Veritas disk group using vxdg commands
5. Create volumes in the disk group using vxmake or vxassist commands
6. Create a filesystem on top of the volumes using mkfs or newfs; it can be either a VxFS or a UFS filesystem
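The six steps above can be sketched end to end. Every command is echoed rather than executed so the script is safe to read anywhere; the disk, disk group, and volume names (c1t1d0, datadg, datavol) and the 1g size are examples of mine, not values from this post. On a real Solaris/VxVM host you would remove the echo and run the commands in order:

```shell
# Walk the six steps from raw disk to filesystem, echoing each command.
# All names below are illustrative placeholders.
disk=c1t1d0; dg=datadg; vol=datavol
for step in \
    "devfsadm" \
    "vxdctl enable" \
    "/etc/vx/bin/vxdisksetup -i $disk" \
    "vxdg init $dg ${dg}01=$disk" \
    "vxassist -g $dg make $vol 1g" \
    "mkfs -F vxfs /dev/vx/rdsk/$dg/$vol"
do
    echo "step: $step"
done
```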
Thursday, 29 September 2011
Cloud Computing
Most of us already know that Cloud Computing is the new buzzword in the industry, and it is very true that everyone wants to learn as much about it as possible. I have been reading about and observing cloud computing's evolution for the past year, and I recently had an opportunity to attend IBM's SmartCloudCamp session, which gave me some insight into the current state of that evolution.
I have noticed several questions from the system admin community about cloud computing's effect on infrastructure support teams. In this post I try to address that question in the way I understand cloud computing.
Cloud Computing
Let me tell you a small story before we discuss Cloud Computing.
My sister and her family live in a small town in the state of Andhra Pradesh, India. In the town, power failures are common: outages of 1 or 2 hours, 2 or 3 times per day. My sister and her neighbors were upset because these continuous power outages disturbed the kids' studies and made life difficult during the evenings. They knew there was an alternative, a power generator as a backup power source, but most of the neighboring families could not afford one, and they also worried about its regular maintenance cost.
One fine day, a group of smart minds came up with a solution: purchase a high-capacity power generator, place it in a common location, and provide backup power connections to every home willing to pay usage charges based on actual consumption, as measured by an electric meter installed in each home. Interestingly, the idea worked very well, and most of the people in the town adopted the backup power source with minimal capital investment and zero maintenance cost.
I believe by this time you have understood the purpose of cloud computing in the IT industry. If it is still unclear, let's look at it in more detail.
The current definition of Cloud Computing is "a comprehensive solution which delivers IT as a Service", where the term IT can be expanded as Infrastructure, Platform, Storage and Software. At present, the IT industry is classified into two groups in terms of cloud computing: Cloud Computing Service Providers and Cloud Computing Service Consumers (Clients).
- Cloud Computing in its Basic Form
Quick refresh on Cloud Computing Benefits to a Client/Consumer
1. Reduced Capital Cost to setup IT Infrastructure
Scenario 1:
If an organisation wants to start a new business function that needs IT infrastructure, it need not go through the whole complex process of establishing that infrastructure, starting from data center planning. Instead, the company can simply go to a cloud computing service provider whose service catalogue offers the kind of service that meets the organisation's IT requirement for the new business function. The requested service could be anything: server/storage/network infrastructure, a platform environment, or an already-built software application that can be customized to the requirement. The organisation pays the service provider only for the resources that have been utilized. No capital investment, no running maintenance cost.
Scenario 2:
If an organisation wants to migrate its existing IT infrastructure (or the part of it related to less critical business functions), it can again approach a cloud computing service provider for a solution that works for its actual requirement.
2. Rapid scalability with the help of dynamic infrastructure
Current Challenge:
In any business, the initial design of the IT infrastructure is commonly based on the current potential of the business and its expected growth in the near future. These expectations/predictions about future growth may or may not be correct in today's highly fluctuating markets. A large investment in IT infrastructure is wasted if the related business does not do as well as expected; at the same time, insufficient IT resources can block growth if the business progresses better than expected.
It is always a real challenge for any organisation to predict the actual requirement for IT infrastructure, and this challenge is easily addressed if the organisation considers a cloud computing solution.
Using cloud computing, organisations can easily scale resources to match the business requirement, which is very dynamic in nature.
3. Utility Pricing Model
This point is self-explanatory: organisations pay only for the resources they have used. No initial investment to set up infrastructure.
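As a toy illustration of utility pricing (all numbers below are made up, not from any provider's price list), the bill tracks usage rather than installed capacity:

```shell
# Toy utility-pricing arithmetic; rate and hours are invented numbers.
rate=2            # cost per server-hour
hours_light=150   # a quiet month
hours_heavy=600   # a busy month
echo "quiet month: $((rate * hours_light))"   # 300
echo "busy month:  $((rate * hours_heavy))"   # 1200
```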
4. Self Service by using Automated Provisioning
I believe this is one key point where cloud computing affects existing IT infrastructure job roles.
By using the automated provisioning feature of cloud computing, organisations can request the services mentioned in the service catalogue and receive them instantly and dynamically, with minimal or no technology skills.
5. Resource availability from anywhere in the world
Public clouds can be accessed from anywhere in the world using the internet, and this feature makes cloud computing a beautiful solution for many startup companies that run with virtual teams located in different parts of the world.
For more information, you can refer to my other post "Cloud Computing – It's not just another buzzword, but a near future", which talks about cloud computing features and benefits.
Cloud Computing Layers
IaaS - Infrastructure as a Service
IaaS is basically a paradigm shift from "infrastructure as an asset" to "infrastructure as a service".
Key Characteristics of IaaS:
- Infrastructure is Platform independent
- Infrastructure costs are shared by multiple clients/users
- Utility Pricing – Clients will pay only for the resources they have consumed
Advantages:
- Minimal or No Capital investment on Infrastructure Hardware
- No Maintenance costs for Hardware
- Reduced ROI risk
- Avoid the wastage of Computing resources
- Dynamic in nature
- Rapid Scalability of Infrastructure to meet sudden peak in business requirements
Drawbacks:
- Performance of Infrastructure purely depends on Vendor capability to manage resources
- Consistent high usage of resources for a long term could lead to higher costs
- Companies have to introduce a new layer of enterprise security to deal with cloud computing security issues
Note: It is better not to adopt an IaaS solution if the organisation's capital budget is greater than its operating budget.
PaaS – Platform as a Service
PaaS is a paradigm shift from "purchasing platform environment tools as a licensed product" to "purchasing them as a service".
Key Characteristics:
- Deployment purely based on cloud infrastructure
- caters to agile project management methods
Advantages:
- It is possible to capture complex testing & development platform requirements and automate the tasks for provisioning a consistent environment.
Drawback:
- Enterprises have to introduce a new layer of security to deal with security in the cloud computing environment.
SaaS – Software as a Service
SaaS is basically a paradigm shift from "treating software as an asset of the business/consumer" to "using software as a service to achieve business goals".
Advantages:
- Reduced capital expenses for development and testing resources
- Reduced ROI risk
- Streamlined and iterative updates of the software
Drawbacks:
- Enterprises have to introduce a new layer of security to deal with security in the cloud computing environment.
Cloud Computing Solutions for Enterprise
Public Cloud Solution for Enterprise
A public cloud solution allows an enterprise to adopt IaaS, PaaS and SaaS services from a cloud computing service provider over the internet; the actual computing resources remain under the vendor's control.
Private Cloud Solution for Enterprise
A private cloud solution for the enterprise is nothing but constructing the cloud solution within the enterprise datacenter, to provide more security for the physical resources. The internal departments of the enterprise can then utilise and pay for cloud computing resources as if they were using public cloud resources.
Hybrid Cloud Solution for Enterprise
A hybrid cloud solution enables the enterprise to use both public cloud and private cloud resources at the same time, depending on the criticality and importance of the business function.
Virtual Private Cloud Solution
Using a virtual private cloud solution, companies can create their own private cloud environment within the public cloud by using different network/firewall rules. The purpose is to prevent external access to the enterprise resources.
How Cloud Computing affects the Job roles in the Infrastructure Support Team
Depending on the cloud computing solution the enterprise adopts, there will be direct and indirect effects on various job roles within the infrastructure support teams.
If you look at the sysadmin role in general, the job involves four major responsibilities:
- Hardware administration
- Operating System Builds
- Operating System Administration
- Network Services Administration
Once the organisation adopts a cloud computing solution (IaaS / PaaS / SaaS), it is no longer required to maintain skilled technical people to deal with hardware-related issues and OS build operations, but it still needs resources to perform OS and network administration and to customize cloud resources to meet the organisation's requirements. The same effect is true for the network support roles.
Cloud computing solutions cannot replace every system administrator in the company, but they will demand a new level of cloud-computing expertise instead of hardware maintenance skills that would otherwise become isolated. For sure, it's a call for learning. More importantly, the sysadmin job roles specifically dealing with hardware and OS builds will have to go away in the near future.
For any organisation, the current recruitment strategy for the sysadmin team is "the number of sysadmins is directly proportional to the physical server footprint in the data center". With IaaS adoption, the organisation's server footprint will shrink drastically, and hence so will the number of sysadmin positions.
As of now, clouds have been deployed to replace server infrastructure on the Windows/Linux-on-x86 model, but there are not yet solutions for vendor-specific server OSes like Solaris on SPARC, IBM AIX, HP-UX, etc. Considering the speed of evolution in cloud computing technologies, it may not take long to provide solutions for all kinds of server infrastructure. On the other side, if organisations choose to migrate their applications to x86 servers to receive the benefits of economical cloud computing, the change will be even more rapid.
The pictures below give you an understanding of how the roles move out of infra teams depending on the cloud solution adopted by the organisation.
One final story before closing this post.
As most of you are already aware, India is an agriculture-based society where people treat their land like "the mother that feeds you every day" and cows as "part of the family wealth". A decade ago, most families followed the traditional way of cultivation, which requires many people and long working hours, and this demand for human labor was the main source of jobs in the villages for a long time.
With technology innovations in India, many new tools and machines were introduced to the Indian agricultural industry, which in turn reduced the demand for human labor. During this technology change, many people back in the villages worried about their livelihood for some time. But the worry didn't last long, because most of them quickly picked up the skills related to these new technologies, like maintaining the new tools, using them for better productivity, and finding new land to cultivate at low cost with the new machines, and started living better than before.
And I believe the same story applies to any other industry, including IT. Whenever we notice an inevitable change coming our way, it is always wise to understand it and get ready to accept it, instead of worrying about it and trying to resist it.