When disaster strikes, will you be ready?
Disaster recovery may be the most expensive project you'll never put into action. But your career could be jeopardized if that day comes and you're unprepared. Here's what your plans should include.
Surprisingly, disaster recovery may be easier in Unix than with a mainframe, but only if you take the time to map out your requirements in advance. What should be the most crucial elements in your disaster recovery game plan? Think you've got everything covered? Now's the time to make sure. (2,700 words)
They are the tales that make us cringe: stories of horrific system outages, machines beset by unresolvable bugs, data centers demolished by natural disasters. The cast usually includes the brave and tireless system administrators, the angry and demanding users, and the incredulous and unforgiving managers. The stories always involve great pain and anguish, wasted time and money, and often a crippled career or two. And none of them would need to be told, if only appropriate disaster recovery planning had taken place.
What constitutes a disaster? Is it something as simple as a failed disk drive or a non-responsive router? Do machines have to burst into flames before the word disaster comes into play? In reality, a disaster is any interruption of service that results from some force beyond your control: malicious attack, act of God, human error. Disaster recovery is how you react to and recover from that ominous external force. And disaster recovery planning is the time and money you spend getting ready to recover from a disaster, even though you hope and pray that day never arrives.
It may be that disaster recovery planning is the one area where both Unix and mainframe systems are on equal footing. There is nothing special about a mainframe that makes it more recoverable than a Unix system. In fact, as we'll see, Unix systems are easier to recover in many respects. The only advantage that mainframes have is that they exist in a well-run, well-managed world where disaster recovery planning is a normal part of business. Being cheaper, more distributed, and less disciplined, most Unix systems are not part of an existing disaster recovery plan. If you're serious about bringing your Unix shop up to mainframe standards, it's time to think about disaster recovery planning right away.
Know your data
Disaster recovery planning is often known as business continuity planning. Ideally, a meteor could fall from the sky, turn your data center into a smoking crater, and your users will never know. Your company's business continues as if nothing had happened, although your life, especially if you were in the data center when the meteor hit, might take a decidedly hectic turn.
At the heart of any business is data: budgets, invoices, engineering documents, research results, e-mail, and Web pages. Any information needed to run your business needs to survive a disaster in the same form in which it existed before the disaster. The first step in building a disaster recovery plan is to inventory your data and understand how it is used.
Right off the bat, Unix systems are at a disadvantage to mainframes because of their distributed nature. While all the storage in a mainframe environment may be concentrated around one or two large systems, data in a Unix environment is often spread across dozens of systems in all sorts of formats. Your inventory process will be more complicated as a result, and the ways the data interact may be more convoluted.
Keep in mind that you will be cataloguing two kinds of data: system data needed to run your machines and user data needed to run your business. It's a rare installation where every single box is running the same copy of the operating system down to the individual patch level, and you'll need to catalogue each nuance of every system you'll be preserving in a disaster. It's not enough to simply perform full backups on every root volume; you'll need to keep track of the services each machine provides to the rest of your environment. This will be critical as you formulate your recovery plan.
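A simple script can kick off this system-level inventory. The following is a minimal sketch, not the author's tooling: the output file name and command set are assumptions, and you'd adjust them per Unix flavor (on Solaris, for example, patch levels come from showrev -p). Run it on each host and archive the reports alongside your recovery plan.

```shell
#!/bin/sh
# Minimal per-host inventory sketch (hypothetical file names; the
# exact commands vary by Unix flavor). Captures OS release, hardware
# type, and the network services this machine currently offers.
OUT=/tmp/sysinfo.$(hostname).txt

{
  echo "== Host:     $(hostname)"
  echo "== OS:       $(uname -sr)"
  echo "== Hardware: $(uname -m)"
  echo "== Date:     $(date)"
  echo "== Listening network services:"
  netstat -an 2>/dev/null | grep -i listen
} > "$OUT"

echo "Inventory written to $OUT"
```

Collecting these snapshots nightly, rather than once, means the copy you restore from is never badly out of date.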
User data comes in every flavor, from flat files to huge databases to documents stored on PC file servers. Every user of your systems has unique data requirements, and you must understand every single one before you can adequately recover from a disaster. Early on, you'll need to perform data triage: Which data is so critical that it must come back online immediately, and which can wait a day or two to be recovered? You'll make no friends with these decisions: Every user believes his or her data to be the most important thing in your data center.
The best way to make these decisions is to understand your users' business processes and how their data is used. If your company processes invoices once a month, invoicing data can probably wait a day or two to be recovered. Data used to drive daily investment processes needs to go back online much more quickly. Data coupling is also important: An otherwise low priority system may feed a critical process and thus inherits the same urgency for rapid recovery.
Once you've determined how data is used and needs to be recovered, you can create appropriate backup strategies. Again, you'll be making tough trades here, as the two functions of backups, fixing human errors and supporting disaster recovery, are diametrically opposed. Human errors (deleting files, losing diskettes, fat fingers) require quick access to recent backups, implying that yesterday's data is stored within your data center. Disaster recovery, assuming that the data center is somehow missing, wants yesterday's data safely stored far away, readily available by courier.
Most sites cannot afford to meet both needs and must instead strike a compromise: They keep some data (usually a week's worth) local and rotate older data to safe off-site storage. This hedge implies that any disaster recovery will cost you a week of processing. As tape media has grown denser and drives faster, some sites have started mirroring backups, duplicating tapes during the day and shipping the copies off-site, reducing their window of exposure to a day. If that still isn't enough, consider live off-site data duplication using high-speed network connections to remote disk farms. Which is best for you? It all depends on your business needs and the size of your wallet.
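The week-local, older-off-site compromise can be automated with a nightly job. Here is a hedged sketch of that rotation; every path is hypothetical (staged under /tmp for illustration), and /etc merely stands in for whatever filesystems your triage marked as critical.

```shell
#!/bin/sh
# Sketch of the compromise strategy: keep roughly a week of backups
# on local disk, stage anything older for off-site courier pickup.
# Paths are hypothetical; replace /etc with your real critical data.
LOCAL=/tmp/backup/local        # recent backups kept on site
OFFSITE=/tmp/backup/offsite    # staging area for the courier
STAMP=$(date +%Y%m%d)

mkdir -p "$LOCAL" "$OFFSITE"

# Nightly backup (tar used for portability; your site may prefer
# a native dump utility instead).
tar cf "$LOCAL/backup-$STAMP.tar" /etc 2>/dev/null

# Anything older than seven days rotates to the off-site staging area.
find "$LOCAL" -name 'backup-*.tar' -mtime +7 -exec mv {} "$OFFSITE" \;

ls "$LOCAL"
```

A site mirroring its backups would instead duplicate each tape as it is written and courier the copy out the same day.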
Know your systems
Your business will not run without the data, but the data is useless without a system to support it. When you've finished your data inventory, turn your attention to your systems and begin cataloguing their services and functionality.
Some systems exist only to provide user services, while others provide the glue that holds your data center together. You should be able to construct a dependency diagram that shows how different systems rely upon one another. Invariably, this graph narrows down to one or two machines that support the entire environment with basic services: DNS, NIS, NFS, e-mail, and Web services. These are your most important machines and should be recovered first, before you try to bring back the big database servers your users will be clamoring for. Once you've established your basic network environment, the rest of the recovery will go much more smoothly.
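Building that dependency diagram starts with knowing, for each host, which infrastructure servers it leans on. This crude probe is a sketch only: the config file locations are typical but vary by Unix flavor, and the sendmail check is an assumption about your mail setup.

```shell
#!/bin/sh
# Crude dependency probe for one host: pull the infrastructure
# servers it relies on out of typical config files. Locations vary
# by Unix flavor; empty sections mean nothing was found there.
OUT=/tmp/deps.$(hostname).txt

{
  echo "== DNS servers (/etc/resolv.conf):"
  grep '^nameserver' /etc/resolv.conf 2>/dev/null
  echo "== NFS mounts (/etc/fstab):"
  grep nfs /etc/fstab 2>/dev/null
  echo "== Mail relay (/etc/mail/sendmail.cf):"
  grep '^DS' /etc/mail/sendmail.cf 2>/dev/null
} > "$OUT"

echo "Dependency report written to $OUT"
```

Merging these per-host reports shows you at a glance which one or two machines everything else points at, and those machines go first in your recovery sequence.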
Don't forget that the first systems you'll need online, after your infrastructure boxes, are your backup servers. Most big shops use centralized backup machines that manage and catalogue your daily backups. In the event of a disaster, you'll often start out with a single machine, one tape drive, and a mountain of tapes. Feeding tapes by hand to a single drive is not the best way to get things back up and running, so get your backup libraries online quickly; every subsequent restore will go that much faster.
In the course of creating your recovery plan, you'll need to know which equipment to buy to replace destroyed systems. This is one place where Unix systems make life easier. Because Unix hardware has an average life cycle of less than two years, you could be faced with replacing everything from that old SPARC 1 to the Enterprise 6000 you bought last week. Instead of trying to find exact replacement systems, use these rapid advancements in Unix systems to your advantage.
After cataloguing your systems, group them into small, medium, and large configurations. When disaster strikes, all the small systems get replaced by a generic system big enough to handle any of them, say a SPARC 5. All your medium systems get replaced by a midrange unit, perhaps an Ultra 30. Your big iron boxes get replaced by generic big iron: Enterprise 6000s for everyone! Acquiring such generic systems during a disaster is easier, especially if you have a standing order on file with your vendors. Even as your systems grow and change, you need only review your recovery inventory once a year to ensure that your generic replacements still fit the bill.
Finally, make sure you know where you'll recover after a disaster. Mainframes traditionally use cold, warm, and hot sites for recovery. Cold sites are usually just a reserved space in a data center, ready to be loaded with equipment if disaster strikes. It can take days to populate a cold site, so some companies choose warm sites: a data center configured with identical hardware, ready to be booted when the need arises. Sometimes warm sites are used for non-critical processing that can be stopped when the site is activated. Warm site activation can take hours, and even that is too slow for some folks. In those cases, hot sites are configured with identical hardware that is always up and running with mirrored data, ready to go on a moment's notice.
Site management is where Unix brings a big advantage to the recovery process. All but the very largest Unix machines can run anywhere: an office, a hotel room, your garage. With the need for raised flooring, special power, and massive air handlers eliminated, finding a recovery site is much easier. In general, the biggest requirement for your Unix recovery site will be adequate power, possibly with 220V feeds. You'll still need to figure equipment availability into your recovery plan, but the cost of that cold, warm, or hot site may be much less for your Unix systems.
Know your plan
After all this self-assessment, you're ready to create your disaster recovery plan. Everyone's first thought is to plan for the meteor strike, but a real disaster plan covers all the bases, from tiny disasters to the big dramatic ones. To create the right plan, create a list of the most likely disasters to confront your environment.
Small disasters might entail the loss of one or two critical systems, or the failure of a key network component. This kind of small problem may be labeled as an outage rather than a disaster, but you'll recover nicely if you plan ahead for it. To create your list of small disasters, identify all the single points of failure in your environment. Do you have a single main network switch at the heart of your company? What if your DNS server blows a power supply? Suppose the cleaning crew dumps a pail of water into your power distribution unit? Often, these kinds of failures can be cured by having appropriate spare parts on hand or by creating secondary servers for critical functions.
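Once you've listed your single points of failure, a trivial watchdog can tell you the moment one goes dark. This is a sketch under stated assumptions: the host names are hypothetical placeholders, and the ping options shown are the common BSD/Linux syntax (Solaris ping differs).

```shell
#!/bin/sh
# Tiny watchdog sketch for single points of failure: ping each
# critical host once and flag any that fail to answer. Host names
# are hypothetical placeholders; substitute your own.
CRITICAL="dns1 mailhub nfs1 web1"
REPORT=/tmp/spof-report.txt

for host in $CRITICAL; do
  if ping -c 1 "$host" >/dev/null 2>&1; then
    echo "OK   $host"
  else
    echo "DOWN $host"
  fi
done > "$REPORT"

cat "$REPORT"
```

Run from cron every few minutes and mailed on failure, even something this simple turns a mystery outage into a named, known problem.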
With the small problems solved, turn your attention to medium disasters. How will you handle an extended power outage? What if your telecom room catches on fire? These disasters are often the toughest to handle. Most of your environment is intact, with just a critical piece missing. You can't rebuild the failed systems from spare parts, but the pain of shifting operations to the recovery site exceeds the cost of the outage. Recovering from these kinds of problems requires forethought, good planning, and a bit of luck.
Big disasters are easy: tornadoes, meteors, fire, floods, plagues of locusts, and anything else that renders your whole facility useless. Called footprint disasters, this variety leaves you no choice but to shift processing to your recovery site and begin rebuilding from scratch. You may be able to salvage some equipment, but you'll need to rely on getting replacement systems for most machines, and all your data will probably come from your off-site backups.
Every site has its share of small, medium, and large disasters, but you also need to consider problems unique to your facility or business. I used to work at a site in Florida, where we had a detailed hurricane disaster plan that went into effect when predicted landfall was still 72 hours away. As the storm approached, we would make critical system backups, fly tapes to a remote site in Atlanta, and finally move operations personnel and their families into the secure data center buildings. That site was also adjacent to an airport, and we actively planned for a disaster involving a plane crashing into our building.
What unique situations present themselves to your site? Sun Microsystems, in California, installed special anchor straps that bolt every system to the floor to prevent movement during an earthquake. Is flooding more prevalent in your area? What about lightning strikes? Do local road patterns increase the chance of a car driving into your building? Such scenarios might sound far-fetched, but a little planning now might pay off big later.
Once your plans are made, you must actively practice them. At least every other year, test your plan by going to your recovery site and trying to bring up a few systems, if not your whole data center. While such testing is often expensive, it can find holes in your plan that would otherwise cripple your business during a real disaster. If nothing else, you'll gain an appreciation for the logistics of shifting your operation and your people to another site.
Finally, publish your plan and make sure your users understand it. If a system is last on the list for recovery, it is better to deal with that user when things are running smoothly than it is to explain your decision amidst the rubble of your data center. If your users understand how your plan works and why you'll be doing what you do, they'll be far more supportive during a real crisis. In the meantime, you'll earn well-deserved brownie points for planning ahead while things are still calm.
Disaster recovery is an industry unto itself, and this article has hardly scratched the surface of the problem. If you are really interested in building a robust business continuity plan, you can turn to any one of several established disaster recovery specialists for help.
While a quick search of the Web turns up dozens of companies with varying levels of expertise, I would be remiss in not mentioning two leaders in the field: SunGard and Comdisco. Well-known among the mainframe community, SunGard and Comdisco have provided end-to-end business continuity and planning services for many years. You might also consider browsing through the Disaster Recovery Journal for further references and related information.
Regardless of where you start, get started now. If nothing else, conducting data and systems inventories will make you aware of where your problems lie and will help in the overall management of your environment. When disaster does strike, you'll be that much closer to a quick and manageable solution.
About the author
Chuck Musciano has been running various Web sites, including the HTML Guru Home Page, since early 1994, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics. Chuck is currently CIO at the American Kennel Club. Reach Chuck at email@example.com.