unix sysadmin archives

Sunday, 25 December 2011

Ten Points When Joining a New Team

During the initial stage of a new job, focus on learning the history of the current environment from the existing team whenever you get a chance to discuss it.

1. Know the job scope of your team

The team’s scope is important to know as soon as you join a new job, because it helps you decide your learning priorities for that job.

For example, suppose you join a team in a large organization whose scope is to support a set of servers that run only databases. Your priority immediately becomes understanding how the database works on Unix and the basics of DB terminology. If your team does not support any DNS, NIS, or DHCP servers, and those are under the control of a different team, you need not worry about them in your initial learning.

2. Know the technical architecture of the environment

The technical architecture of the environment covers the points below:

    a. How many servers in total (commonly called the “server footprint”) are we supporting, and where are they located (i.e. data center information)?

    b. What operating systems are in use right now, and what are the supported hardware models?

    c. Which operating environments does the team support now? e.g. Production, Testing or Development

    d. What applications are currently running in our server environment, and who is using them? e.g. Sybase, ClearCase, WebLogic, etc.

    e. What storage is in use right now, and what sort of console systems do we use to connect to the servers remotely? e.g. EMC, NetApp, Cyclades consoles, etc.

    f. Which storage management software is in use on which operating systems? e.g. LVM, VxVM, ZFS, etc.
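
One way to keep the answers to these questions in a form you can query is a small inventory record per server. The sketch below is only an illustration: the field names, host names, data center names, and sample values are all hypothetical and not taken from any real environment.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """One row of a hypothetical server inventory (all fields are assumptions)."""
    name: str
    datacenter: str      # point a: where the server is located
    os: str              # point b: operating system
    environment: str     # point c: Production, Testing or Development
    application: str     # point d: main application on the server
    volume_manager: str  # point f: LVM, VxVM, ZFS


def footprint(servers, field):
    """Count servers grouped by one inventory field, e.g. 'os' or 'datacenter'."""
    return Counter(getattr(server, field) for server in servers)


# Made-up sample data, for illustration only.
INVENTORY = [
    Server("db01", "dc-east", "Solaris 10", "Production", "sybase", "VxVM"),
    Server("db02", "dc-east", "Solaris 10", "Testing", "sybase", "ZFS"),
    Server("app01", "dc-west", "Linux", "Production", "weblogic", "LVM"),
]

if __name__ == "__main__":
    print("Total servers:", len(INVENTORY))
    for field in ("datacenter", "os", "environment"):
        print(field, dict(footprint(INVENTORY, field)))
```

Even a flat list like this is enough to answer "how many servers, where, and running what" while you are still learning the environment.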

3. Know about procedures and escalation

Ideally, any system administrator should deal with three types of operations:

    a. Break/fix activities (widely known as incidents)

    This mainly involves fixing issues encountered in a properly working environment, e.g. a disk failure on a server, a Unix server crashing due to overload, or a network failure caused by a network port problem.

    b. Changes and Service Requests

    Change operations mainly involve introducing a configuration, hardware, or application change into the currently running environment, either for improved stability or for improved security.

    Service requests involve performing operations on specific user requests, such as creating user accounts, changing permissions, installing a new server (called server commissioning), or removing a server (called server decommissioning).

    c. Auditing the Server environment to identify the Quality of Service (QoS)

    This mainly involves periodically checking all servers for configuration or security vulnerabilities that compromise the stability of the server environment, and remediating such vulnerabilities by requesting configuration changes.

To perform these three kinds of operations, every organization has internal rules that define how to act, when to act, and what to act on. These rules vary from job to job. During the initial stage of your job you should understand them and perform your duties accordingly.

Note: ITIL (Information Technology Infrastructure Library) provides guidelines for defining the above rules in a standard way in any IT organization. Nowadays, major companies are streamlining their procedures to meet these ITIL guidelines, so that the environment stays easy to manage even when the people who created it leave the organization. Learning ITIL is always beneficial to system admins (or any infrastructure support person).

4. Supporting tools/applications and your access to them

To perform the support operations discussed in the previous point, organizations need proper tools and applications that let their employees and support staff request and respond in an automated way, following the procedures defined in the organization, e.g. the Remedy ticketing tool, HP Service Manager, etc.

Once you join a new team, make sure you request access to all the related tools in time and test that access.

5. Intercommunication Procedures with Other Support Teams and Vendors

For a system admin, a major part of the daily job involves communicating with other support teams, such as the database team, network team, application team, hardware vendors, and data center support team.

For successful service delivery, it is important for system administrators to have all of these teams’ contact details (phone, email, and internal chat IDs) handy. So gather the information and put it in a good document that you can use on the job. It is very important to write this information down and keep it safe, because most of the time minor issues turn into major problems when we don’t know whom to contact as soon as we notice the issue.

6. Know where to find the information

Every team has some kind of documentation that explains the operations it performs, and this documentation gives you more information than any individual can share with you. Unfortunately, reading all these documents doesn’t help you understand what is actually going on in the job during your initial stage in the team, but the same documents might save your life once you start working actively in the team.

During the initial stage, find out where the documentation is saved and get access to it. Then quickly go through the entire documentation (you don’t need to remember everything you read), so that you know where to look when you need a specific piece of information related to a specific issue.

7. Know the details of important infrastructure servers

Ideally, system administrators classify their servers into two groups: the first is the servers used by users (e.g. database servers and application servers), and the second is the infrastructure servers used to manage the first group effectively (e.g. Jumpstart remote installation servers, DHCP, DNS, NIS, and LDAP servers).

As I explained in point 1, you may or may not manage these infrastructure servers, depending on the scope of your team. But you must know their details, because every other server in your environment depends on them.

    Below are the important questions to try to answer during the initial stage of the job:

    a. Which name servers (DNS / NIS / LDAP) are we using, and what are their names, aliases, and IPs?

    b. Which remote installation (Jumpstart / Kickstart) servers are we using, and do we have access to them?

    c. Is there a DHCP server in the environment, or is addressing managed by customized tools, e.g. QIP, etc.?
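
For question (a), the DNS servers a Unix host is configured to use are normally listed on "nameserver" lines in /etc/resolv.conf. A minimal sketch for pulling them out follows. The function name and the sample addresses are made up for illustration, and NIS or LDAP servers are configured elsewhere and are not covered here.

```python
def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, which start with '#' or ';'.
        for marker in ("#", ";"):
            line = line.split(marker, 1)[0]
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Sample content using documentation-range addresses (not real servers).
SAMPLE = """\
# managed by the network team
search example.com
nameserver 192.0.2.10
nameserver 192.0.2.11   ; secondary
"""

if __name__ == "__main__":
    print(parse_nameservers(SAMPLE))
    # On a real host, pass in the contents of /etc/resolv.conf instead.
```

Running the same function against a handful of servers is a quick way to see whether they all point at the same name servers.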

8. Get ready with the appropriate logistics

Every Unix administrator starts work by requesting access to a Windows product (desktop access / Outlook). The moment you join a new job, start requesting access to your desktop PC login, VoIP phone (with international dialing if your job requires calling overseas), email account, internal chat messenger, data center (if your job requires physical access to the DC), and smart cards or security tokens.

Once you get email access, you may have to manage the flood of email that comes to your team every day. You might create Outlook rules to filter out the emails you don’t have to respond to during the first one or two months of the new job. Later, you can slowly start reading and responding to them once you are actually ready to work on the floor.

9. Areas of Automation, and the specific details

A system administrator cannot survive in the job without knowing how to automate (using scripting) the work that is done repeatedly. Whenever you join a new team, specifically ask about any automated scripts that are in place and used for the day-to-day job.

Most of the time, system admins write scripts to perform daily or weekly system health checks, and these may run regularly from specific servers under the cron scheduler. It is better to know about them beforehand, as that will help if you want to introduce your own scripts for the team’s benefit.
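
As an illustration of the kind of health check meant here, the sketch below flags filesystems that are nearly full. It is an assumed example, not a script from any particular team: the 90% limit, the function names, and the list of mount points are placeholders to adjust.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total (0.0 if total is zero)."""
    if total_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def over_limit(usages, limit_percent=90.0):
    """Return (path, percent) pairs at or above the limit.

    usages maps a path to a (total_bytes, used_bytes) pair.
    """
    alerts = []
    for path, (total, used) in usages.items():
        pct = percent_used(total, used)
        if pct >= limit_percent:
            alerts.append((path, round(pct, 1)))
    return alerts


def collect_usage(paths):
    """Read real usage figures for each path with shutil.disk_usage."""
    usages = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # fields: total, used, free (bytes)
        usages[path] = (usage.total, usage.used)
    return usages


if __name__ == "__main__":
    # Hypothetical list of mount points to check.
    for path, pct in over_limit(collect_usage(["/"]), limit_percent=90.0):
        print(f"WARNING: {path} is {pct}% full")
```

Keeping the threshold logic separate from the part that reads the disks makes the script easy to test before it is scheduled to run unattended.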

10. Understanding monitoring alerts and response procedures

As I explained in point 8, you will receive tons of mail the moment your email ID is added to the team DL (email distribution list). A major part of it may come from the automated monitoring system, which checks the health of your server environment and informs the system admin team as soon as it notices an issue. If you start receiving such mails, don’t ignore them just because you don’t know what to do with them. Instead, note these alerts and keep asking your team how to respond to them.

Also set automatic reminders in Outlook for the alerts that are critical and urgent in nature, so that you won’t miss them.
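
The same sorting can be sketched in code. The example below classifies alert subject lines by keyword; the keywords, severity labels, and sample subjects are assumptions for illustration, since real alert formats depend on the monitoring tool in use.

```python
import re

# Hypothetical keyword-to-severity rules, checked in order.
RULES = [
    (re.compile(r"\b(critical|down|panic)\b", re.IGNORECASE), "urgent"),
    (re.compile(r"\b(warning|degraded)\b", re.IGNORECASE), "review"),
]


def classify(subject):
    """Return the severity of the first matching rule, or 'unknown'."""
    for pattern, severity in RULES:
        if pattern.search(subject):
            return severity
    return "unknown"


if __name__ == "__main__":
    # Made-up subject lines, for illustration only.
    for subject in (
        "CRITICAL: db01 filesystem /var 98% full",
        "Warning: app01 load average high",
        "Weekly capacity report",
    ):
        print(classify(subject), "-", subject)
```

Anything that comes back as "unknown" is a good candidate for a question to the team about how that alert should be handled.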

What does your experience say about this? Share it with us.

If you find this post useful, share it so that friends who are changing jobs can benefit from it.

Thursday, 1 December 2011

How to Force a Crash Dump When the Solaris Operating System is Hung

In most cases, a system crash dump of a hung system can be forced. However, this is not guaranteed to work for all system hang conditions. To force a dump, you often need to drop down to the boot PROM monitor (OBP) prompt, also known as the "OK prompt", suspending all current program execution.

There are several ways to drop a Sun system to the OK prompt.
1. On older Sun systems with a serial (PS2 type) Sun keyboard and monitor attached, this suspension is performed via a "Stop-A". The upper left key on a Sun keyboard is labeled "Stop". While holding down this key, press the A key.

2. On systems using ASCII terminals for the console, the terminal's predefined break sequence can be used to get to the boot PROM monitor.

3. Newer Sun systems with USB keyboards may require an alternate sequence.

4. Some Sun systems have a system controller/SSP (Enterprise 10000/15000, Sun Fire X800) or ALOM/RSC (Vx80/Vx90 and most new Netra servers) instead of serial port/keyboard access. These can be used to break a hanging system or domain.


Note: There are special procedures for Sun SPARC(R) Enterprise Mx000 (OPL) Servers, T1000/T2000 systems, x86 and x64 systems.


The boot PROM monitor will respond with:

Type 'go' to resume
ok

If you don't see this message, you were probably not successful in stopping the system.

Once at the ok prompt, type 'sync' (without the quotes) and press Enter.

The system will immediately panic. Now the hang condition has been converted into a panic, so an image of memory can be collected for later analysis. The system will attempt to reboot after the dump is complete.

The sync command forces the computer to illegally use location 0, therefore causing a "panic: zero". On later revisions of Solaris 8 and above you will instead see "panic: sync initiated".

Not all hang situations can be interrupted. If Stop-A or Break doesn't work, sometimes a series of the same will do the trick. Some hangs are even more stubborn and can only be interrupted by physically disconnecting the console keyboard or terminal from the system for a minute, and then plugging it back in.

If all these attempts fail, you will have to power down the system, thus sadly losing the contents of memory. With luck, a subsequent hang will be interruptible.


NOTE: On the systems with keyswitches, be sure the key is not in the secure position, as this disables the break interrupt in the zs driver.
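
Before relying on a forced panic, it helps to confirm that the system is set up to save the dump. On Solaris, the dumpadm command prints the current crash dump configuration as "label: value" lines. The sketch below parses such output. The sample text, device, and host name are illustrative assumptions, and the exact set of lines varies between Solaris releases.

```python
def parse_dumpadm(output):
    """Parse 'label: value' lines, as printed by dumpadm, into a dict."""
    config = {}
    for line in output.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            config[label.strip()] = value.strip()
    return config


def savecore_enabled(config):
    """True if the parsed configuration reports savecore as enabled."""
    return config.get("Savecore enabled", "").lower() == "yes"


# Illustrative sample only; the device and directory are made up.
SAMPLE = """\
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/host1
  Savecore enabled: yes
"""

if __name__ == "__main__":
    config = parse_dumpadm(SAMPLE)
    print("Dump device:", config.get("Dump device"))
    print("Savecore enabled:", savecore_enabled(config))
```

If savecore is not enabled or the dump device is missing, forcing a panic may still reboot the system without leaving a usable dump behind.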