unix sysadmin archives
Donation will make us pay more time on the project:

Thursday, 29 September 2011

Cloud Computing

Most of us already know that Cloud Computing is a new Buzz word in the industry and it is very true that  everyone want to learn about it as much as possible.  For myself, I have been reading and observing cloud computing evolution for past one year,  and recently I had an opportunity to  attend for IBM’s SmartCloudCamp session which has given me some insight on current state of cloud computing evolution.
I have noticed several questions from System Admin community about the Cloud computing’s effect on Infrastructure Support Teams.  In this post I am just trying to address the same question in a way that I understand cloud computing.

Cloud Computing

Let me tell you a small story  before we go to discuss about t the  Cloud Computing.
My Sister and her family is living in a small town  in the state of Andhra Pradesh, India.  In the town,  the power failures are so common and it is like 1 or 2 hours of power outage with a frequency of 2 or 3 times per day.  My sister and her neighbors were so upset because these continuous power outages disturbing the kid’s studies and also making life difficult during the evenings. They know that there is an alternative to solve the problem by having  power generator as  a backup power source but most of the neighbor  families are not in a position to afford for it and also they are worried about the regular  maintenance cost of these devices.
One fine day, a group of smart minds came up with a solution to purchase a high capacity power generator ,  place it in some common place and to provide backup power connections to every home who ever ready to pay for the usage charges as per the the actual usage calculated by the electric meter plugged in at every home.  Interestingly, the  idea worked very well, and most of the people in the town were adapted the backup power source with the minimum capital investment and  zero maintenance cost.
I believe, by this time,  you might have understood the purpose of  cloud computing in IT industry. If it is still unclear, lets go forward to look at it in more detailed terms
The Current definition of Cloud Computing is ” A Comprehensive solution which delivers the IT as a Service. Here the term IT can be expanded as Infrastructure, Platform, Storage and Software”.  . At present the IT industry classified  into two groups in terms of cloud computing , first one is Cloud Computing Service Providers and the other one is Cloud Computing Service Consumers ( Client).
Cloud Computing in its Basic Form

Quick refresh on Cloud Computing Benefits to a Client/Consumer

1. Reduced Capital Cost to setup IT Infrastructure
Scenario 1:
If any organisation want to start a new business function that needs IT infrastructure, the organisation need not go through the all the complex process of establishing IT infrastructure starting from the Data center planning. Instead the company simply can go for a Cloud Computing service provider who is providing the kind of service , in his service catalogue, that meets the organisation’s IT requirement for the new business function. The requested service could be anything  like  Server/Storage/Network Infrastructure, Platform Environment or already built software application which can be customized to your requirement.  And the organisation will pay,  to the service provider, only for the resources that has been utilized. No Capital investment, no running maintenance cost.
Scenerio 2:
If any organisation want to migrate it’s existing IT infrastructure ( or part of it ) related to less critical business function, it can again approach the Cloud Computing Service provider for a solution that works for their actuation requirement.
2. Rapid scalability with the help of dynamic infrastructure
Current Challenge:
In any business, it is very common that, the initial design of IT infrastructure happens  considering the current potential of business and expected growth of business in near future. And these expectations / predictions about the future growth may or may not be correct, in current day high fluctuating business markets.  Any large Investment in IT infra setup will be wasted if the related business not doing well , as expected. And at the same time insufficient IT infra resources could block the business growth if the business was progressing better than expected.
It is always a real challenge to any organisation to predict the actual requirement of IT infrastructure , and this challenge can easily addressable if the organisation considering the cloud computing solution.
Using Cloud Computing, organisations can easily scale it’s resources to the level it matches the business requirement  which is very dynamic in nature.
3.  Utility Pricing Model
This point is self explanatory, organisations will pay for the only resources that they have used. No Initial investment to setup infra.
4. Self Service by using Automated Provisioning
I believe, this is one key point where cloud computing affecting the existing IT infrastructure  job roles.
By using automated provisioning feature of Cloud Computing , organisations can request the services mentioned in Service Catalogue and could receive the services  instantly and dynamically with minimum or no technology skills.
5. Resource availability  from anywhere of the world
Public clouds can be accessed from anywhere of the world using the internet, and this feature makes cloud computing as beautiful solution for many startup companies which are running using virtual teams located in different parts of world.
for more inforamtoin, you can refer my other post ” Cloud Computing – It’s not just another buzzword, but a near future “,  which talks about cloud computing features and benefits.

Cloud Computing Layers

IaaS  -  Infrastructure as a Service
Iaas   is basically a paradigm shift from “Infrastructure as an asset” to “Infrastructure as a Service”
Key Characteristics of Iaas:
  • Infrastructure is Platform independent
  • Infrastructure costs are shared by multiple clients/users
  • Utility Pricing – Clients will pay only for the resources they have consumed
  • Minimal or No Capital investment on Infrastructure Hardware
  • No Maintenance costs for Hardware
  • Reduced ROI risk
  • Avoid the wastage of Computing resources
  • Dynamic in nature
  • Rapid Scalability of Infrastructure to meet sudden peak in business requirements
  • Performance of Infrastructure purely depends on Vendor capability to manage resources
  • Consistent  high usage of resources for a long term could lead to higher costs
  • Companies have to introduce new layer of Enterprise security to deal with the cloud computing related to security issues
Note: It is better not to adapt Iaas Solution, if the oraganisation capital budget is greater than the Operating budget
PaaS – Platform as a Service
Paas is a Paradigm shift from ” purchasing platform environment tools as a licensing product ”  to “purchasing as a service”.
Key Characteristics:
  • Deployment purely based on cloud infrastructure
  • caters to agile project management methods
  • It is possible capture the complex testing & development platform  requirement and automate the tasks for provisioning of consistent environment.
  • Enterprises have to introduce new layer of security to deal with the security in cloud computing environment.
SaaS – Software as a Service
SaaS is basically paradigm shift from treating “treating software as an asset of  business/consumer” to “using software as a service achieve the business goals”
  • reduce Capital expenses required for the development and testing resources
  • Reduced ROI risk
  • Streamlines and Iterative updates of the software
  • Enterprises have to introduce new layer of security to deal with the security in cloud computing environment.

Cloud Computing Solutions for Enterprise

Public Cloud Solution for Enterprise
Public Cloud solution allows enterprise to adapt Iass, Pass and Saas services from a cloud computing service provide on the internet, and actual computing resources are available under control of Vendor.
Private Cloud Solution for Enterprise
Private Cloud Solution for Enterprise nothing but constructing cloud solution within the enterprise datacenter, to provide more security on physical resources. And the internal departments of the enterprise within the organisation can utilise and pay for cloud computing resources as if they are using public cloud resources.
Hybrid Cloud Solution for Enterprise
Hybrid cloud solution enables enterprise use both public cloud and private cloud resources same time depending on the criticality and importance of the business function.
Virtual Private Cloud Solution
Using Virtual Private Cloud Solution Companies can create their own private cloud environment with in the public cloud by using different network/firewall rules. And the purpose is to avoid external access to the enterprise resources.

How Cloud Computing affects the Job roles in the Infrastructure Support Team

Depending on the Clod computing Solution that enterprise adapted, there will be direct and indirect effect on the various job roles with in the infrastructure support teams.
If you look at the Sysadmin role in general , the actual job role involves three major responsibilities:
  • Hardware administration
  • Operating System Builds
  • Operating System Administration
  • Network Services Administration
Once the organisation adapted the Cloud Computing solution ( IaaS / PaaS / SaaS ) , it no longer required to maintain the skillful technical people to deal with hardware related issues and OS Build operations but they still need resources to perform OS / Network administration and to customize cloud resources to meet the organisation requirements. And the same effect is true for the Network Support roles.
Cloud Computing solutions cannot replace every system administrator in the company but it will expect new level  cloud computing related expertise instead of ” to be isolated hardware maintenance skills”. For sure, it’s a call for learning. And more importantly the sysadmin job roles specifically dealing with the “Hardware & OS builds” has to go away, in near future.
For any organisation, the current  recruitment strategy for the SysAdmin Team  is “No. of Sysadmins are directly proportional to the physical server foot print in the data center “.  With IaaS adaption organisation’s server footprint will reduce drastically, and hence the no. of sysadmin positions.
As of now the Clouds were deployed to replace the Server infrastructure with windows / linux on X86 model, but not yet having solutions for Vendor Specific Server OS like Solaris on Sparc, IBM AIX and HP UX …etc.  Considering the speed of evolution in cloud computing technologies, it may not take long time to provide solutions for all kinds of server infrastructure. From the other side, if the Organisation choose to migrate their applications to X86 model servers to receive the benefits of economic cloud computing  then the change is more rapid.
Below pictures will give you an understanding how the roles are moving out of Infra Teams depending on the Cloud solution adapted by the organisation.

Final and one more story, i want to tell you,  before closing this post.

As most of you already aware, India is an agricultural based society where people treat their land  like “mother that feeds you everyday ” and cows like “part of family wealth”.  A decade before, most of the families used to follow the traditional way of cultivation that requires more number people and long working hours . And this requirement for the human labor is the main source for the jobs , in villages,  for longtime
With technology innovations in India, there were many new tools/machines  had been introduced to the indian agricultural industry which in turn reduced the requirement for the human labor. During  this technology change,  many people back at villages worried about their  livelihood for sometime. But, the worry didn’t last longtime because most of them quickly adapted the skills related to these new technologies like “regular maintenance of these new tools” , “using the tools for better productivity” and “finding new lands to cultivate using these new machines with low cost” etc., and started living better than earlier.
And I believe, same story applies for any other industry including IT.  And whenever we notice an inevitable change in our way, it is always wise to understand and get ready to accept it,  instead of worrying about and trying to resist it.

Solaris Troubleshooting – System Panics, Hangs and Crashes

Solaris Troubleshooting – System Panics, Hangs and Crashes

There are a number of differing scenarios under which the Solaris  operating system may panic, hang, or exhibit other symptoms that lead the administrator to have to restart or reboot the system.
As there are many different failure scenarios and many different classes of hardware, the information and procedures for collection of system information vary from system to system.


The first class of failures is the hang. This is when a system appears to become unresponsive. See the documents below that discuss dealing with hung systems.
Be aware that some systems that appear hung are not! Be sure to verify if the server is hung or not. For example: The display may be non-responsive, because the output has been redirected to the console device.

2.Panic or unexpected reboot

Panics can be caused by a variety of issues, including Solaris Bugs, Hardware errors and Third Party Drivers and Applications. It’s important to collect as much detail as is possible when systems panic.
Information that need to be collected to troubleshoot Kernel Panic:
When logging a new case, provide answers to the following questions as an absolute minimum:
  • When did the problem start
  • What changes have been recently made on the system. Important: Anything that has happened since the last reboot is within the scope of this question. Patching, application changes, disk replacements, anything. It’s all important to know when trying to resolve the issues.
  • How often has this failure occurred
  • What may have been going on around the time of the panic and if anything out of the ordinary may have been observed
In the case of a panic or reboot, the messages log and prtdiag output are items that can be quickly sent to Oracle Sun, and that can go a long way towards diagnosis of the cause of the problem, however, an explorer is almost always better.
By far, the simplest way to collect the vast majority of details required to resolve a panic is to collect:
  • Sun Explorer Data Collector output
  • The crash dump
  • Any console messages
If Explorer is run on the system after the incident occurs, it will contain most of what will be required to understand the current configuration of the system, and additional information that may be helpful.
If the system generated a system crash dump (check /var/crash/`hostname`), create a compressed tar file containing the unix.* and vmcore.* files and transfer that file to Oracle Sun for analysis. Compressing the tar file (using one of the compress, gzip of bzip2 utilities) reduces the size of the file dramatically and so reduces the time taken to transfer the file to Sun.
Console messages are most important when the server is experiencing hardware issues and the OS is not allowed an opportunity to panic. In cases where multiple unexpected reboots are occurring, and no diagnostic infomation is being provided by the system logs in /var/adm/messages, some form of console logging should be setup as soon as possible to capture the diagnostic information from the console on the next failure.
Recommended NVRAM settings , to Collect Console Messages:
Bring system to OBP level from command line using “shutdown” or “init 0″ commands (either will run all RC shutdown scripts), sync file systems and then drop system to OK prompt. DO NOT use a stop+A key press. The following commands can be executed from the OK prompt or from the command line using the “eeprom <variable=parameter>” command.
at OK prompt # eeprom Description
setenv diag-level max diag-level=max system will run extended POST
printenv boot-device boot-device determine what your boot device is….
setenv diag-device <your boot-device> diag-device=<your bootdev> prevent attempting net boot w diags on
setenv error-reset-recovery sync error-reset-recovery=sync force sync reboot if system drops to OK
setenv diag-switch  true diag-switch =true
reset-all reboot or init 6 system has to reset for changes to take affect

In some cases, it is difficult to collect an explorer.
In the event that explorer cannot be installed or run in a timely manner, the following data is of tremendous value, and should be collected:
Messages logs from /var/adm directory. If there was a panic, the panic message in the file will help determine if we need to analyze a crash dump to diagnose the cause. In many cases a crash dump is not necessary, and waiting for one to be transferred simply increases the resolution time. For example, if there was a hardware reset rather than a panic, the messages log should show that.
Output of the prtdiag command.
 /usr/platform/`uname -i`/sbin/prtdiag -v
prtdiag gives a summary of a system’s hardware configuration, so that we would know what part to order in the case of a hardware failure. It also gives hardware error messages that can aid in diagnosis.
The output from ‘/usr/bin/showrev -p’ gives a list of the patches installed on the system. This will help eliminate possible casues of the problem and ensure that the correct versions of source code and analysis tools are used during the investigation.

3.Live Dump

On occasion, it is required that a live crashdump be collected. It’s uncommon, as dumping a live system does not capture a completely consistent snapshot of the system. Data is changing while the dump is being written out. Although live dumps are not always consistent, they are still a great source of information for certain types of issues.
collectiongg Live Dump from the Solaris Machine:
1) Before collecting a live kernel dump, a dedicated, NON SWAP, dump device must be configured using dumpadm(1M). The dedicated dump device must not be used in any other way (i.e., filesystem, databases, etc.). The dump device CANNOT be swap or any part of. If the dump device is part of swap, generation of live kernel dump will corrupt the swap area causing the kernel to eventually panic. Also note that any filesystem or data on the dump device disk will be lost.

2)The kernel is running during generation of live kernel dump, and the linked lists that all kernel debuggers use to traverse those linked structures may fail because the list was in flux when saved. Therefore, always run the following ps command to capture the process addresses:
/usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
You must use those switches to get addresses with a 64 bit kernel. This will allow you to look at processes since you will have the process address. From there you can generate more complete threadlists, etc. Please see the ps(1) manpage for meaning of those options. For example, the following is what /usr/bin/ps -elf outputs on a 64-bit machine:
> /usr/bin/ps -elf
19 T root 0 0 0 0 SY 0 Sep 27 0:00 sched
8 S root 1 0 0 41 20 98 Sep 27 0:00 /etc/init -
19 S root 2 0 0 0 SY 0 Sep 27 0:00 pageout
19 S root 3 0 0 0 SY 0 Sep 27 1:05 fsflush
8 S root 267 1 0 41 20 220 Sep 27 0:00 /usr/lib/saf/sac -t 300
8 S root 154 1 0 51 20 325 Sep 27 0:00 /usr/sbin/inetd -s
8 S root 125 1 0 41 20 294 Sep 27 0:00 /usr/sbin/rpcbind
8 S root 48 1 0 47 20 185 Sep 27 0:00 /usr/lib/sysevent/syseventd
8 S root 50 1 0 48 20 160 Sep 27 0:00 /usr/lib/sysevent/syseventconfd
NOTE: The ADDR field is not populated. If you issue the command to capture process addresses as described above, you will see the following output instead:
> /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
0 0 0 96 SY 10423a60 0 – 0:00 sched
0 1 0 58 20 30000909528 784 30000909848 0:00 init
0 2 0 98 SY 30000908a98 0 10458248 0:00 pageout
0 3 0 60 SY 30000908008 0 104618a0 1:05 fsflush
0 267 1 58 20 30000963530 1760 300009659a8 0:00 sac
0 154 1 48 20 30001708028 2600 30000baaca2 0:00 inetd
0 125 1 58 20 30001709548 2352 30000aa2102 0:00 rpcbind
0 48 1 52 20 300009f0aa8 1480 300009f0dc8 0:00 sysevent
0 50 1 51 20 300009f0018 1280 22d44 0:00 sysevent