Stop your fire-fighting
Why fight fires when you can prevent them?
Peter identifies some common problems system administrators face -- and more importantly, how to prevent them from happening repeatedly. (2,000 words)
Scenario: A systems problem arises, and you begin work on it, but
then get distracted by other help requests. An hour later you return
to the problem, solve it, and move on to other tasks. Suddenly,
another problem flares up. Sound familiar?
PETE'S SUPER SYSTEMS
By Peter Baer Galvin
It's time to get out of the fire-fighting mode and into the
firehouse-building mode -- unless, of course, you find
an ever-growing to-do list and plenty of interruptions entertaining.
I've often been asked to review my clients' system administration
methods and activities in order to determine why they get stuck in
the "fire-fighting cycle from hell." My recommendations vary wildly
depending on the circumstances, but some common themes have emerged,
and they might serve as the keys to your new firehouse.
Putting out fires
How does this frustrating cycle get started and remain in place?
Consider the following traps:
- Solving the same problems repeatedly: Some problems are recurring. Postmaster mail arrives
and needs to be handled. Disks get full and space needs to be juggled.
Users delete files and need help retrieving them. New systems arrive
and each one needs to be installed, configured, tested, debugged, and
delivered to the users. You know the common system administrator credo: "If it weren't
for these dang users, everything would run smoothly." Unfortunately,
blaming users for being users is counterproductive and can lead to
sysadmin burnout. Rather, look to the methods being used, because
solving a problem more than once is a waste of time.
- Systems having problems: Disks crash, programs run amok or fail, system
administrators make mistakes (hey, it happens), networks don't network -- in this business, a wide variety of things can go haywire. They all take time to
resolve. (Another system administrator credo: "Hey, if it were easy,
everyone would be a sysadmin.")
Consider disk failure with no RAID protection. The disk must be
replaced, a service call must be made, and service information and
contact and contract numbers must be known. Next, the proper tape must be
located and possibly brought on-site. Then the contents of the tape
must be understood and the appropriate data located and restored. If
the failed disk was a boot disk, the boot blocks must be written to the disk, and the EEPROM settings could possibly require change.
Many people might be needed to solve this problem. For instance, you might need
someone from operations to schedule downtime or physically access the
system. If the disk held program data, then someone from the application
area may be involved. If the disk held a database, DBAs may be
needed. And if the failure took down a production server, phones will be
ringing, memos will need to be written, and the system's users will
need appeasement. Bottom line: there are many reasons a project takes
longer to complete than it should, which is why projects usually take
longer than they should.
- Interrupts occurring: The bane of system administrators is a lack of work continuity. Just
when you start making progress on a project, the phone rings, email
arrives, or people stop by your office or stick their heads into
your cube asking for "just a quick favor." The result is that short
tasks get done, but longer projects get buried.
- Receiving pushed information: Some sites automate the delivery of information to system
administrators. System log files can be emailed to the sysadmins each
day, or problem reports can arrive from the help desk. Sysadmins also
contribute to this glut of information by adding themselves to mailing lists. All this information can be helpful, but it often results in time-wasting activity.
- Pulling information: On the other hand, gathering information requires work. Logging in to
systems to read log files, viewing monitoring and performance
information, and reading newsgroups all suck up the cycles. Are these
the best ways to spend your time?
The causes of fires
There are specific fire causes, but it's more practical to evaluate
the big picture in order to prevent the isolated incident.
- Too much to do in too little time: Tasks almost always take longer than they should, and unreasonable
expectations lead to too much work being planned for too little time.
Other causes of this phenomenon are equally simple, and include: too
many tasks for too few people, ever-changing environments, new bugs
popping up, and users and administrators making mistakes.
- Lack of focus: System administrators who focus on specific technologies (for example,
Solaris, storage, or networking) become more efficient by virtue of
more advanced knowledge, more experience with specific issues, and less
need for technical support. (How much time do you spend on hold?)
Unfortunately for the to-do list, it's human nature to be helpful. By
trying to be all things to all people, and by not saying no (or not
being allowed to do so), administrators lose their focus. The
nature of the job also encourages fuzziness -- the simple task of
installing a system can include the need to know the operating system,
networking, storage, security, backup methods, patch methods,
performance tuning, and problem debugging.
- Too much specialization: On the other hand, too much specialization can prolong resolution
times. If there's a different worker on each part of a job,
inefficiencies can be huge. Each person may be a pro, but time spent in
meetings and hallways costs money. Sometimes just trying to schedule a
meeting can delay a project by a week or more.
- Mixture of short, medium, and long-term projects: It's human nature to work on the shorter, easier tasks first,
especially when a nice person submitted the request. What happens
to the long, hard projects then? Take a look at your to-do list, and you'll
see all of them!
- Lack of preparation: One common problem is not planning for problems. For instance, on a new
system install, did all the parts arrive? Is the network ready? Do you
have all the configuration information you need (hostname, network,
disk partitions, licenses, rack space, power and cooling)? Spare parts
such as cables and hubs and switch ports, if available at install time,
can be the difference between an installation that proceeds and an installation that stalls. A lack of
communication with the project owner or users can cause a reinstall, and
frustration for everyone.
- Poor technology choices: There are many good reasons that people make technology choices, but, unfortunately,
there are just as many bad ones. Politics, the desire to simplify, poor past
choices, fear of the unknown, and familiarity with old technologies
are all among them. An important phrase to watch out for: "We made this
mistake before, so let's do it again!"
- Engineering for today or yesterday, not tomorrow: Experienced system administrators have learned that system use
grows. Disk space is eaten up, CPU cycles disappear, and network traffic
increases. Failure to keep this in mind while implementing a new
system results in one that is soon overworked, and thus tricky to maintain.
On the other hand, the best solution is sometimes rejected because it's too complex,
so a simpler one is implemented, even though it does not adequately
solve the problem or lacks needed flexibility. The
phrase to keep in mind here is: "Make it as simple as possible, but no simpler."
- Complicating the systems work to save money: There are many good, free tools available for increasing system
administrator leverage and decreasing system administration time.
There are also commercial tools that solve most problems. Sometimes
programmers think they can create a needed tool themselves in no time; in practice, they
usually don't, a free one is never found, and a commercial one is
never bought. The result is a problem that remains unsolved.
The antidote for system administration fires is finding their causes
and sources, then solving them. Solutions vary depending on the
circumstances, but hopefully your site already implements some of those
recommended here. This section is designed to give you new ideas and
provide a starting point to help you put out your own fires.
Most solutions cannot be implemented in a vacuum; they need
coordination with other staff members, management buy-in, planning, and,
sometimes, money. The first part of this column could serve as
justification for the effort needed to implement these changes.
- Schedule office hours: These can be used for drop-in visits, with
email and phone calls included. Then enforce those office hours. Make
time outside your office hours "getting stuff done time."
- Rotate first call duty: If there are four system
administrators, each can take first call one week per month. Any help
desk problem reports or incidents should be routed to this person.
Likewise, this person would be responsible for monitoring tools, log
files, and postmaster mail. Of course, not all problems are likely to
be solved by the on-call person, but even if half are, expect the
productivity of other staff members to rise noticeably.
- Use a cool tool: Maintain a to-do list, either on paper or with something like a Palm Pilot. This
allows for the addition and subtraction of projects, without taking a
time-out at the moment of interrupt.
- Don't receive what you don't want or need: This includes magazines,
mailing lists, log files, email, and phone calls. Be brutal, and use
time-management techniques to optimize dealing with these items (for
instance, only touch a document once).
- Automate the routine: Try to decrease the time spent on necessary but routine tasks, like
logging in every day to read information, watching a screen to check
for problems, or reading logs, Web pages, and newsgroups. Rather,
automate these as much as possible. Avoid looking at anything that is
working; use tools to alert you when there is a problem.
- Track your time: Spend a week or two intermittently tracking your time use to determine
just where those cycles go and why they go there. If
you're spending 20 percent of your time dealing with pulled
information, look for ways to automate.
- Become an expert: Develop expertise in specific areas to economize time spent on fires in
those areas. For instance, if your site frequently has storage problems, become a storage expert. Then develop a breadth of knowledge so that you no longer have to
ask others for help. Encourage management to hire enough staff
to allow for specialization. The benefits should be obvious, especially
if quality of service matters to your organization.
- Use service level agreements (SLAs): These agreements can benefit staff members as much as
they benefit users. It is quite reasonable for both sides to agree on
a specific set of services and uptime goals; such agreements allow the
staff to concentrate on areas users deem important. Just as importantly, such agreements also
allow staff members to say no to requests for support outside of the
scope of the SLA. Keeping in mind what is important and why system
administrators frequently have so much to do are good reminders to the
- Implement system log books: These should contain needed disaster and fire
information. Create all documents in a sustainable way (for example, on an
intranet site). Before starting a project, scope out all the aspects of
the project and be sure all parts are in hand before beginning.
- Watch out for hidden expenses: Free tools take effort to install,
patch, and maintain, and they have a learning curve.
For that matter, so do commercial tools. Home-grown tools require time
and effort to develop. Choose the correct solution for the problem and
account for hidden, as well as apparent, costs. Make the right decision
at the outset, even if the payoff may ultimately take a while.
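As one illustration of the "automate the routine" advice above, here is a minimal Python sketch that checks how full a set of filesystems is and reports only the ones over a limit, so nothing that is working needs a look. The function names, the 90 percent threshold, and the message format are assumptions made for this example, not tools or settings named in the column.

```python
import shutil
from typing import Callable, Iterable, List, NamedTuple


class Usage(NamedTuple):
    """Size and consumption of one filesystem, in bytes."""
    total: int
    used: int


def read_usage(path: str) -> Usage:
    """Return total and used bytes for the filesystem holding `path`."""
    stats = shutil.disk_usage(path)
    return Usage(total=stats.total, used=stats.used)


def full_filesystems(
    paths: Iterable[str],
    threshold_percent: float = 90.0,  # assumed limit; tune per site
    usage_fn: Callable[[str], Usage] = read_usage,
) -> List[str]:
    """Return one alert line per filesystem at or above the threshold.

    Healthy filesystems produce no output at all, so a person only
    has to read the result when something needs attention.
    """
    alerts = []
    for path in paths:
        usage = usage_fn(path)
        if usage.total == 0:
            continue  # skip filesystems that report zero size
        percent = 100.0 * usage.used / usage.total
        if percent >= threshold_percent:
            alerts.append(f"{path} is {percent:.0f}% full")
    return alerts


if __name__ == "__main__":
    # Hypothetical list of mount points to watch.
    for line in full_filesystems(["/"]):
        print(line)
```

Run from a scheduler such as cron, which on many systems mails any output to the job's owner, a script like this stays silent on good days and speaks up only when a disk is filling. The `usage_fn` parameter exists so the check can be exercised with made-up numbers instead of real disks.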
Fires will always occur, and time spent fighting them can mean the
difference between being productive and stress-free, or becoming prematurely burned-out. Let me know if you choose to implement any of these ideas, and how they work for you.
Next month's column will continue this theme, and include
examples of system administrators pouring gasoline on fires, and
system administration methods that avoided fire altogether.
The information in this column was first presented as a keynote talk
at the 1999 SAGEAU conference. Thanks to the participants there for
About the author
Peter Baer Galvin is the chief technologist for Corporate Technologies, a systems integrator and VAR. Before that, Peter was the systems manager for
Brown University's Computer Science Department. He has written articles
for Byte and other magazines, and previously wrote Pete's
Wicked World, the security column for SunWorld. Peter is
coauthor of Operating Systems Concepts and Applied Operating Systems Concepts. As a
consultant and trainer, Peter has taught tutorials on security and system
administration and given talks at many conferences and institutions.