Is mainframe-class availability possible in a Unix environment?
What does it take? We give you guidelines for maximizing your uptime
Who doesn't strive for 99.8 percent availability or higher? But is this a realistic goal for your Unix environment? Sure it is -- but only with proper management. We outline the essential steps for achieving continuous uptime. (2,200 words)
Certainly, a stable Unix environment can run for a long time without crashing. Unfortunately, most systems require periodic rebooting and service interruptions, and most admins are accustomed to bouncing the box when things get particularly ugly. Runaway processes and hung daemons are common, and there is often little recourse but a quick reboot to set things straight. Still, it's possible to achieve mainframe availability in a Unix environment. All it takes is a little discipline, attention to detail, and the willingness to correct your own bad habits.
To be completely honest, most mainframe shops are not available 100 percent of the time. Most shops will define a service level goal that determines their minimum acceptable availability. Depending on the end users and the type of computing performed, that goal may range from as low as 98 percent availability to 99.8 percent and beyond.
If 98 percent availability is your goal, you can have up to 28.8 minutes of downtime every day, an easy mark to hit with almost any system, even NT. Raise the bar to 99.8 percent, and you're limited to 2 minutes, 52.8 seconds of downtime on any given day.
Of course, uptime is usually cumulative over periods of at least a week and usually a month. In these cases, 98 percent uptime allows about 15 hours of downtime per month, while 99.8 percent gives you about 88 minutes of downtime. That's enough to bounce the box, and maybe even swap a failing disk drive while you're at it.
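The arithmetic is easy to check with a throwaway script. This sketch (the function name and the 30-day month are assumptions) turns an availability target into allowed downtime per day and per month:

```shell
#!/bin/sh
# Convert an availability target (in percent) into allowed downtime,
# printed as minutes per day and minutes per month.
# Assumption: a 30-day month.
allowed_downtime() {
    awk -v t="$1" 'BEGIN {
        day = 24 * 60                    # minutes in a day
        month = 30 * day                 # minutes in a 30-day month
        printf "%.1f %.1f\n", day * (100 - t) / 100, month * (100 - t) / 100
    }'
}

allowed_downtime 98      # -> 28.8 864.0
allowed_downtime 99.8    # -> 2.9 86.4
```

The 864 minutes per month at 98 percent works out to 14.4 hours, in line with the monthly figures above.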
To make things worse, uptime often varies by the time of day. Prime time, usually 8 a.m. to 5 p.m., often requires 100 percent availability, without exception. During non-prime hours, you can relax and drop back to a more leisurely 99.8 percent availability. Again, the issue here is user expectations. You must explore and define the requirements of your customers and then build a system to meet those requirements. Once you understand what is required of your systems, it's possible to build a Unix environment that achieves mainframe-class availability.
Measuring your performance
The first step in achieving 24x7 uptime in your Unix environment is to build the tools that will measure and report your availability. At the very least, you need to keep track of when your systems go down and when they come back up. You must be diligent and accurate, and you need to report your uptime to your users and your management. There is an old adage that is especially true for system availability: If you measure something, it will improve. Once your administrative staff knows that availability is being measured and tracked, they'll work harder to make those numbers look good.
Automated availability tracking is fairly easy and doesn't require any sophisticated tools. On your system, create a script that writes a timestamp into a file once per minute. Start the script at boot time and let it run. If your system crashes, that file will contain a timestamp from within a minute of the crash. Create another script that runs at boot and compares the time in the file with the current time. The difference between those two times is the duration of the outage. Have the script e-mail the outage time and duration to a central collection point, where you can log the outage into a database of some sort.
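A minimal sketch of that pair of scripts might look like the following. The stamp-file path is an assumption, the echo stands in for the e-mail to the central collection point, and `date +%s` is a GNU-ism that older systems will need a substitute for:

```shell
#!/bin/sh
# Availability tracking as described above: one function runs forever,
# stamping a file every minute; the other runs once at boot and reports
# how long the machine was down. /var/tmp/heartbeat is an assumption.
STAMP=${STAMP:-/var/tmp/heartbeat}

heartbeat() {                 # start at boot and leave running
    while :; do
        date +%s > "$STAMP"   # seconds since the epoch (GNU date)
        sleep 60
    done
}

report_outage() {             # run at boot, before the heartbeat restarts
    [ -f "$STAMP" ] || return 0
    last=$(cat "$STAMP")
    now=$(date +%s)
    echo "$(hostname): down approx $(( (now - last) / 60 )) minutes"
}
```

A real implementation would pipe the report through mail to the collection point; the one-minute heartbeat interval also bounds the measurement error of each outage at about a minute.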
It's also important to monitor your systems so that outages can be detected and corrected quickly. Although a crash is a bad thing, period, a short outage is better than a long one. For systems running a lot of batch applications with few direct users, it's possible that a failed system can go undetected for a long time. That long outage can ruin an otherwise good uptime record.
While many expensive network management tools exist to perform this kind of monitoring, you can implement a simple monitor using systems you are already running. Have one system ping every other box on your network once per minute, raising an alarm if a system fails to respond for two minutes in a row. If a lot of systems stop responding at once, you can assume that a portion of the network has failed and alarm accordingly. Don't forget to use one other system to monitor the main monitoring system!
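One pass of that monitor could be sketched as follows. The host list, probe command, state-file path, and alarm action are all assumptions; the probe is injectable because ping's flags vary by platform (Solaris ping takes a timeout argument rather than the Linux-style `-w`), and a real site would page someone rather than echo:

```shell
#!/bin/sh
# One sweep of the monitor: probe each host, keep a per-host miss
# counter on disk, and raise an alarm after two straight misses.
# Run this once a minute from cron or a loop.
HOSTS=${HOSTS:-"web1 db1 batch1"}       # hypothetical host names
PROBE=${PROBE:-"ping -c 1 -w 5"}        # Linux-style flags; adjust per OS
STATE=${STATE:-/var/tmp/pingmiss}       # counters live in $STATE.<host>

check_hosts() {
    for h in $HOSTS; do
        if $PROBE "$h" >/dev/null 2>&1; then
            rm -f "$STATE.$h"           # host answered; clear its counter
        else
            misses=$(( $(cat "$STATE.$h" 2>/dev/null || echo 0) + 1 ))
            echo "$misses" > "$STATE.$h"
            if [ "$misses" -ge 2 ]; then
                echo "ALARM: $h has missed $misses probes in a row"
            fi
        fi
    done
}
```

The two-strikes counter is what keeps a single dropped ping from waking anyone up; counting many simultaneous alarms as a network failure can be layered on top of the same output.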
Once you have your monitoring tools in place, you need to become much more disciplined in deciding when systems will be taken down for maintenance, tuning, and patching. These planned outages are often not counted against your collective downtime, or at least aren't as serious as the unplanned outages that disrupt service without warning. Often, users will tolerate planned outages because they can work around the outage window.
You probably aren't helping yourself if you plan to have a few hours of downtime each week. Instead, schedule one outage per month during your non-prime computing hours. Set a starting time and duration for the outage and publish your outage schedule to the whole user community.
A typical outage schedule might allow one planned outage every six weeks -- Sunday morning at 6 a.m. -- lasting up to four hours. Except for those hours, you're not allowed to reboot the machine. This schedule should be published to your users a full year in advance, so they can plan their processing schedules for the entire year. Once you learn to live with this kind of outage schedule (and it can be done), you can set goals to reduce the total outage time. You might plan to shift to seven-week intervals the next year or reduce the outage duration to three hours instead of four. Again, once you start measuring these things, you can begin to improve your performance.
A controlled outage schedule forces you to control change in your environment. You can't make changes willy-nilly; you must plan each change and ensure that it will work correctly before you implement it during an outage. A disciplined computing shop has a defined change-management procedure, ensuring that all system changes are reviewed, tested, and approved before they are applied to the system. Changes usually originate from two sources: your users and your administrators. Regardless of the source, all requests for changes should be recorded and reviewed on a weekly or biweekly basis. This will be easy for your users, who can't change the system themselves and are already used to turning to you for support; it will be agony for your Unix administrators, who are accustomed to finding problems, fixing them, and moving on without intervention from anyone else.
Even the smallest system change should be reviewed and approved. Need to tweak a system parameter in /etc/system? Change the sendmail.cf? Retune a file system? Put it in writing and get it approved. This need not be a lengthy, cumbersome process. The most important point is that changes get reviewed by the administration staff so that they are understood by everyone. The benefit is two-fold: Several people agree the change is correct and beneficial, and everyone is aware that the change is going to be made. Moreover, if the process includes people from the user community, they'll be aware of the change and know how it will affect their applications.
It may be that many changes are approved and implemented without waiting for an outage. Changing sendmail.cf can be done without a reboot, and you may decide to let your staff proceed in a disciplined way. Tweaking /etc/system is another story, and that will have to wait until the next scheduled outage. The bottom line is that you need to define a process and a methodology that fits your shop, your systems, and your users -- and stick to it without exception.
The hardest part of bringing Unix into the world of 24x7 availability is breaking your administrators of all their bad habits. It is unforgivable to make changes to a production system on the fly, and it is even worse to make changes and not document them. It should be a capital offense to make a change without first testing it on a non-production system.
This is an important distinction that separates a truly disciplined environment from the computational wannabes. In the mainframe world, the environment either includes a separate mainframe system that mimics the production environment, or the production system is divided into several regions, each of which runs as a separate logical machine. You can test in one region while running production jobs in another. Multiple regions (called "domains" by Sun) were generally not available on Unix systems until recently, with the advent of Solaris 2.6 and the partitionable architecture of the Enterprise 10000 server. Fortunately, Unix systems are much cheaper than mainframes, so a test system is easy to provide to your administrators. That system need not be as big as your production box; it just has to run the same versions of your operating system and application software.
Before any change is made to the production environment, you must make that change to a test system and verify that it works as expected. This is especially important for patches and changes to system software, where even the most innocent change can wreak havoc with an application. Since you can only take your production system down for patching infrequently, based upon your outage schedule, it is easy to accumulate dozens of patches between outages. As any experienced admin can tell you, many patches will interact in bad ways, breaking your environment. It's much better to evaluate those conflicts on a test system than it is to be sitting in front of a broken box at 7 a.m. on a Sunday, trying to get someone from tech support on the phone as the clock ticks toward the end of your outage window.
Systems administrators also have a natural inclination to upgrade systems to the latest and greatest versions of everything, usually trying to get ahead of bugs and problems as they get fixed in later releases. This inclination will always get you in trouble in a controlled environment. New releases and upgrades are almost always driven by users, not administrators. Need to move from Solaris 2.5 to 2.6? You'll make the change when the users say you can, based upon how their applications support and exploit that version. Be prepared to sit on the current version of your operating system for several years in this kind of environment.
Eliminating bad hardware
Once you get your system change and maintenance processes under control, you need to attack the next biggest source of problems in your environment: failing hardware. It seems logical that if bad software isn't crashing your system, your hardware must be the culprit. Indeed, most unplanned outages in controlled environments are the result of failing disk drives, blown power supplies, and faulty cabling. Luckily, there isn't anything here that big bags of money can't fix.
If you have any money left over after building your systems, spend it on redundant disk architectures. Sudden disk failure is the most common cause of downtime in most systems, and the outages can be lengthy as you struggle to replace the bad drive and restore lost data. If you can eliminate these problems using mirrored or RAID 5 disk subsystems, you'll sleep much better, safe in the knowledge that you are immune to disk failure.
You can't imagine how satisfying it is to detect a failed disk drive and have the system chug right along, unaware of the event. It's even more exciting to pull a dead drive from a hot-swappable disk subsystem, pop in a new drive, and watch the system recover and move on without a blink. And, if that failed drive happened to be holding your root file system and half your swap space, you'll be inclined to hug and kiss your disk sales reps the next time they stop by.
Once you resolve your redundant disk issues, consider adding multiple connections to your disk arrays and installing redundant network connections to your machines. With appropriate failover software, you can make yourself immune to that other popular failure mode: tripping over wires.
Finally, consider adding uninterruptible power to your environment, either in the form of battery or generator power. Needless to say, all of this equipment is very expensive, but the demands placed upon you by your user community may require that it be installed and its cost absorbed by the users. Nothing is free, and serious uptime requirements come with serious price tags.
Where to begin?
We've covered a broad range of problems and solutions, all of which will help to bring your Unix uptime in line with your mainframe systems. Some are expensive, but most are free. The price in those cases is paid in discipline, methodology, and control. Your most difficult task in achieving good Unix uptime is in re-educating your Unix admins to approach availability with a mainframe mindset. Change management and planned outages aren't as fun as seat-of-your-pants administration, but you're guaranteed to spend much less time in front of your users, hat in hand, as a result. Begin with a few changes to your administrative style, and start measuring your uptime. You'll be amazed by the results.
About the author
Chuck Musciano has been running various Web sites, including the HTML Guru Home Page, since early 1994, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics. Chuck is currently CIO at the American Kennel Club. Reach Chuck at firstname.lastname@example.org.