
September 1999


Stop your fire-fighting

Why fight fires when you can prevent them?

Summary
Peter identifies some common problems system administrators face -- and more importantly, how to prevent them from happening repeatedly. (2,000 words)

PETE'S SUPER SYSTEMS

By Peter Baer Galvin

Scenario: A systems problem arises, and you begin work on it, but then get distracted by other help requests. An hour later you return to the problem, solve it, and move on to other tasks. Suddenly, another problem flares up. Sound familiar?

It's time to get out of the fire-fighting mode and into the firehouse-building mode -- unless, of course, you find an ever-growing to-do list and plenty of interruptions entertaining.

I've often been asked to review my clients' system administration methods and activities in order to determine why they get stuck in the "fire-fighting cycle from hell." My recommendations vary wildly depending on the circumstances, but some common themes have emerged, and they might serve as the keys to your new firehouse.

Putting out fires
How does this frustrating cycle get started and remain in place? Consider the following traps:

Solving the same problems repeatedly: Some problems are recurring. Postmaster mail arrives and needs to be handled. Disks get full and space needs to be juggled. Users delete files and need help retrieving them. New systems arrive and each one needs to be installed, configured, tested, debugged, and delivered to the users. You know the common system administrator credo: "If it weren't for these dang users, everything would run smoothly." Unfortunately, blaming users for being users is counterproductive and can lead to sysadmin burnout. Rather, look to the methods being used, because solving a problem more than once is a waste of time.

Systems having problems: Disks crash, programs fail or run amok, system administrators make mistakes (hey, it happens), networks don't network -- in this business, a wide variety of things can go haywire. They all take time to resolve. (Another system administrator credo: "Hey, if it were easy, everyone would be a sysadmin.")
Consider disk failure with no RAID protection. The disk must be replaced, a service call must be made, and service information and contact and contract numbers must be known. Next, the proper tape must be located and possibly brought on-site. Then the contents of the tape must be understood and the appropriate data located and restored. If the failed disk was a boot disk, the boot blocks must be written to the disk, and the EEPROM settings could possibly require change.
Many people might be needed to solve this problem. For instance, you might need someone from operations to schedule downtime or physically access the system. If the disk held program data, then someone from the application area may be involved. If the disk held a database, DBAs may be needed. And if the failure took down a production server, phones will be ringing, memos will need to be written, and the system's users will need appeasement. Bottom line: every extra person and step is another source of delay, which is why projects usually take longer than they should.

Interrupts occurring: The bane of system administrators is a lack of work continuity. Just when you start making progress on a project, the phone rings, email arrives, or people stop by your office or stick their heads into your cube asking for "just a quick favor." The result is that short tasks get done, but longer projects get buried.

Receiving pushed information: Some sites automate the delivery of information to system administrators. System log files can be emailed to the sysadmins each day, or problem reports can arrive from the help desk. Sysadmins also contribute to this glut of information by adding themselves to mailing lists. All this information can be helpful, but it often results in time-wasting activity.

Pulling information: On the other hand, gathering information requires work. Logging in to systems to read log files, viewing monitoring and performance information, and reading newsgroups all suck up the cycles. Are these the best ways to spend your time?

The causes of fires
There are specific fire causes, but it's more practical to evaluate the big picture in order to prevent the isolated incident.

Too much to do in too little time: Tasks almost always take longer than they should, and unreasonable expectations lead to too much work being planned for too little time. Other causes of this phenomenon are equally simple, and include: too many tasks for too few people, ever-changing environments, new bugs popping up, and users and administrators making mistakes.

Lack of focus: System administrators who focus on specific technologies (for example, Solaris, storage, or networking) become more efficient by virtue of more advanced knowledge, more experience with specific issues, and less need for technical support. (How much time do you spend on hold?) Unfortunately for the to-do list, it's human nature to be helpful. By trying to be all things to all people, and by not saying no (or not being allowed to do so), administrators lose their focus. The nature of the job also encourages fuzziness -- the simple task of installing a system can include the need to know the operating system, networking, storage, security, backup methods, patch methods, performance tuning, and problem debugging.

Too much specialization: On the other hand, too much specialization can prolong resolution times. If there's a different worker on each part of a job, inefficiencies can be huge. Each person may be a pro, but time spent in meetings and hallways costs money. Sometimes just trying to schedule a meeting can delay a project by a week or more.

Mixture of short, medium, and long-term projects: It's human nature to work on the shorter, easier tasks first, especially when a nice person submitted the request. What happens to the long, hard projects then? Take a look at your to-do list, and you'll see all of them!

Lack of preparation: One common problem is not planning for problems. For instance, on a new system install, did all the parts arrive? Is the network ready? Do you have all the configuration information you need (hostname, network, disk partitions, licenses, rack space, power and cooling)? Spare parts such as cables, hubs, and switch ports, if available at install time, can be the difference between an installation that proceeds and an installation that stalls. A lack of communication with the project owner or users can lead to a reinstall, and frustration for everyone.

Poor technology choices: There are many good reasons that people make technology choices, but, unfortunately, there are just as many bad ones. Politics, the desire to simplify, poor past choices, fear of the unknown, and familiarity with old technologies are all among them. An important phrase to watch out for: "We made this mistake before, so let's do it again!"

Engineering for today or yesterday, not tomorrow: Experienced system administrators have learned that system use grows. Disk space is eaten up, CPU cycles disappear, and network traffic increases. Failure to keep this in mind while implementing a new system results in one that is soon overworked, and thus tricky to maintain.
On the other hand, the best solution is sometimes rejected because it's too complex, so a simpler one is implemented, even though it does not adequately solve the problem or lacks needed flexibility. The phrase to keep in mind here is: "Make it as simple as possible."

Complicating the systems work to save money: There are many good, free tools available for increasing system administrator leverage and decreasing system administration time. There are also commercial tools that solve most problems. Sometimes programmers think they can create a needed tool themselves in no time; in practice, they usually don't, a free one is never found, and a commercial one is never bought. The result is a problem that remains unsolved.

Fire prevention
The antidote for system administration fires is finding their causes and sources, then solving them. Solutions vary depending on the circumstances, but hopefully your site already implements some of those recommended here. This section is designed to give you new ideas and provide a starting point to help you put out your own fires.

Most solutions cannot be implemented in a vacuum; they need coordination with other staff members, management buy-in, planning, and, sometimes, money. The first part of this column could serve as justification for the effort needed to implement these changes.

Schedule office hours: These can be used for drop-in visits, with email and phone calls included. Then enforce those office hours. Make time outside your office hours "getting stuff done time."

Rotate first call duty: If there are four system administrators, each can take first call one week per month. Any help desk problem reports or incidents should be routed to this person. Likewise, this person would be responsible for monitoring tools, log files, and postmaster mail. Of course, not all problems are likely to be solved by the on-call person, but even if half are, expect the productivity of other staff members to more than double.
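
A weekly rotation like this is easy to compute rather than negotiate. The following is only a minimal sketch: the administrator names and the rotation start date are made up for illustration, and a real site would also need to account for vacations and swaps.

```python
from datetime import date

def first_call_admin(admins, day, start):
    """Return who holds first call on `day`, rotating weekly from `start`."""
    weeks_elapsed = (day - start).days // 7
    return admins[weeks_elapsed % len(admins)]

# Hypothetical staff list and rotation start (a Monday).
admins = ["alice", "bob", "carol", "dave"]
start = date(1999, 9, 6)

if __name__ == "__main__":
    print(first_call_admin(admins, date.today(), start))
```

With four names in the list, each person comes up one week in four, which matches the one-week-per-month arrangement described above.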

Use a cool tool: Maintain a to-do list, whether on paper or with something like a Palm Pilot. This allows for the addition and subtraction of projects, without taking a time-out at the moment of interrupt.

Don't receive what you don't want or need: This includes magazines, mailing lists, log files, email, and phone calls. Be brutal, and use time-management techniques to optimize dealing with these items (for instance, only touch a document once).

Automate the routine: Try to decrease the time spent on necessary but routine tasks, like logging in every day to read information, watching a screen to check for problems, or reading logs, Web pages, and newsgroups. Rather, automate these as much as possible. Avoid looking at anything that is working; use tools to alert you when there is a problem.
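
As one illustration of "avoid looking at anything that is working," here is a rough sketch of a disk-space check that stays silent unless a filesystem crosses a threshold. The 90 percent threshold, the function name, and the paths checked are assumptions for the example, not recommendations; the script would typically be run from cron, with its output mailed to whoever holds first call.

```python
import shutil

def full_filesystems(paths, threshold=0.90, usage=shutil.disk_usage):
    """Return (path, fraction_used) for each path at or above `threshold`."""
    alerts = []
    for path in paths:
        total, used, _free = usage(path)
        if total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        fraction = used / total
        if fraction >= threshold:
            alerts.append((path, fraction))
    return alerts

if __name__ == "__main__":
    # Prints nothing when everything is fine; no news is good news.
    for path, fraction in full_filesystems(["/"]):
        print(f"WARNING: {path} is {fraction:.0%} full")
```

The `usage` parameter exists only so the check can be exercised with fake numbers; in normal use the default, `shutil.disk_usage`, reads the real filesystem.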

Track your time: Spend a week or two intermittently tracking your time use to determine just where those cycles go and why they go there. If you're spending 20 percent of your time dealing with pulled information, look for ways to automate.
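
Turning a week of rough notes into percentages takes only a few lines. The sketch below assumes you have jotted down (category, minutes) pairs; the category names and numbers are invented for illustration.

```python
from collections import defaultdict

def time_shares(entries):
    """Map each category to its fraction of the total tracked minutes."""
    totals = defaultdict(int)
    for category, minutes in entries:
        totals[category] += minutes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: minutes / grand_total for category, minutes in totals.items()}

# Hypothetical log entries: (category, minutes spent).
log = [
    ("interrupts", 120),
    ("reading logs", 60),
    ("projects", 90),
    ("reading logs", 30),
]

if __name__ == "__main__":
    for category, share in sorted(time_shares(log).items(), key=lambda kv: -kv[1]):
        print(f"{category}: {share:.0%}")
```

Any category that turns out to be a large share and consists mostly of pulled information is a candidate for automation.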

Become an expert: Develop expertise in specific areas to economize time spent on fires in those areas. For instance, if your site frequently has storage problems, become a storage expert. Then develop a breadth of knowledge so that you no longer have to ask others for help. Encourage management to hire enough staff to allow for specialization. The benefits should be obvious, especially if quality of service matters to your organization.

Use service level agreements (SLAs): These agreements can benefit staff members as much as they benefit users. It is quite reasonable for both sides to agree on a specific set of services and uptime goals; such agreements allow the staff to concentrate on areas users deem important. Just as importantly, such agreements also allow staff members to say no to requests for support outside of the scope of the SLA. An SLA also reminds users what is important and why system administrators frequently have so much to do.

Implement system log books: These should contain needed disaster and fire information. Create all documents in a sustainable way (for example, on an intranet site). Before starting a project, scope out all of its aspects and be sure all parts are in hand.

Watch out for hidden expenses: Free tools require installation, patching, and ongoing maintenance, and they have a learning curve. For that matter, so do commercial tools. Home-grown tools require time and effort to develop. Choose the correct solution for the problem and account for hidden, as well as apparent, costs. Make the right decision at the outset, even if the payoff may ultimately take a while.

Summary
Fires will always occur, and time spent fighting them can mean the difference between being productive and stress-free, or becoming prematurely burned-out. Let me know if you choose to implement any of these ideas, and how they work for you.

Next month's column will continue this theme, and include examples of system administrators pouring gasoline on fires, and system administration methods that avoided fire altogether.

The information in this column was first presented as a keynote talk at the 1999 SAGEAU conference. Thanks to the participants there for their feedback.

About the author
Peter Baer Galvin is the chief technologist for Corporate Technologies, a systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column for SunWorld. Peter is coauthor of Operating System Concepts and Applied Operating System Concepts. As a consultant and trainer, Peter has taught tutorials on security and system administration and given talks at many conferences and institutions.


Resources and Related Links

Send Peter your best sysadmin tips and check out his:
http://www.sunworld.com/swol-06-1999/swol-06-supersys.html
Full listing of previous Pete's Super Systems columns:
http://www.sunworld.com/sunworldonline/common/swol-backissues-columns.html#supersys
Hal Stern's archived SysAdmin columns:
http://www.sunworld.com/sunworldonline/common/swol-backissues-columns.html#sysadmin
Peter's Solaris Security FAQ (recently updated!):
http://www.sunworld.com/sunworldonline/common/security-faq.html
Peter's Unix Secure Programming FAQ:
http://www.sunworld.com/swol-08-1998/swol-08-security.html
Other SunWorld resources
The SunWorld Topical Index -- a comprehensive listing of all SunWorld articles by subject:
http://www.sunworld.com/common/swol-siteindex.html
Visit sunWHERE -- launchpad to hundreds of online resources for Sun users:
http://www.sunworld.com/sunwhere.html
Explore SunWorld's back issues:
http://www.sunworld.com/common/swol-backissues.html
IDG.net, your one-stop IT resource:
http://www.idg.net


URL: http://www.sunworld.com/swol-09-1999/swol-09-supersys.html