Questions of integrity

By Hal Stern

System administration crises come in all colors and intensities. Dealing with the loss of data is certainly the most severe. Backup systems insure you against user carelessness and catastrophes but still leave your daily operations exposed. A disk head crash at 2:00 pm can erase a few hours of transactions that may involve millions of dollars of goods and services.

To run a full 7-day-a-week, 24-hour operation without losing a single transaction to hardware failure, you need to address disk redundancy, system and service redundancy, and application design. Reconstructing work from a paper trail is sure to cause a managerial conflagration, and you can bet some of the heat will be directed your way whether you're responsible for the failure or not. Spectacular failures occur only in live, fielded systems, so those who feed and care for the operational systems always share the blame for mishaps. This month, we'll explore strategies for protecting yourself and your users. Combining a review of disk redundancy techniques with lessons learned about disk behavior will help you design (and warranty) reliable, high-performance systems.

Arrays of possibilities
Planning a disk redundancy configuration relies on the same reconstruction evaluations used in designing a backup system. You want to be sure that you have at least two copies of every byte, that you can automatically regenerate redundant data, and that you'll eliminate data exposures during periods of limited or lost redundancy. Almost every data availability product protects you against single points of failure -- that is, the untimely demise of a single disk, controller, cable, or power supply. As you plan a redundancy configuration, think about the possibility of multiple failures. How would your configuration react? Would you lose data if two disks went south in quick succession? What happens when you suffer a failure while rebuilding a redundant volume? There's a dizzying array of possibilities and configuration options, all relying on logical or physical arrays of disks.

Before getting into the what-if scenarios, let's review the history of Unix disk redundancy schemes. For simplicity, we'll stick to the common Redundant Array of Inexpensive Disks (RAID) numbering scheme for everything, even though some of these techniques predate the acronym by at least a generation of technology. To get a sense of where RAID came from, think back to the mid-1980s. Storage module devices (SMDs) offered an impressive 500 megabytes on a 10 1/2-inch platter. SCSI disks were small, slow, and cheap compared with their SMD cousins. SCSI disks were acceptable as desktop or deskside expansion, but the economics and performance of SMD disks made them attractive for fileservers. The disparity between SMD and local SCSI disk random I/O performance even helped mask the network overhead of NFS until SCSI emerged as the fastest, dominant disk technology.

Realizing that sometimes one and one make three, the computer science researchers at University of California-Berkeley introduced the world to the phyla of RAID disk arrays. Aggregating the "ones," they were able to build configurations that minimized performance and reliability concerns of single SMD disks. Six RAID levels were defined in the original research, although only a few are actually implemented and of interest today.

RAID level 0 is disk striping. It doesn't improve reliability, but it enhances disk I/O by spreading the load over as many spindles as possible. For random I/O, access to a single logical filesystem utilizes more independent disk heads while sequential I/O benefits from parallel disk transfers.
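To see how striping spreads the load, here is a minimal sketch of the address arithmetic, assuming a simple round-robin layout. The function name locate_block and the parameters num_disks and stripe_unit_blocks are made up for illustration; real volume managers use their own layouts and terminology.

```python
def locate_block(logical_block, num_disks, stripe_unit_blocks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    stripe_unit = logical_block // stripe_unit_blocks    # which stripe unit overall
    offset_in_unit = logical_block % stripe_unit_blocks  # position inside that unit
    disk = stripe_unit % num_disks                       # units rotate across the disks
    row = stripe_unit // num_disks                       # complete rows before this unit
    return disk, row * stripe_unit_blocks + offset_in_unit


# Sixteen consecutive logical blocks on a four-disk stripe, two blocks per unit.
for block in range(16):
    disk, offset = locate_block(block, num_disks=4, stripe_unit_blocks=2)
    print(f"logical block {block:2d} -> disk {disk}, block {offset}")
```

A sequential run of eight blocks touches all four disks, which is where the parallel transfers come from. Scattered random requests tend to land on different spindles for the same reason.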

RAID level 1 is simple disk mirroring, a technique that predates its name in the vernacular. Every byte is copied onto two disks, or in the case of triplexing, onto three disks. RAID level 1 is implemented at a logical disk driver level, so that a filesystem or database built on top of the mirrored device can't "feel" the duplication happening at the physical disk driver layer. Replicating typically incurs a performance penalty on disk write operations.

Mirroring more than doubles the cost of a disk farm, because you need data-center cabinets, cables, host interfaces and power, and office space for all of those redundant disks. However, 100 percent duplication is not the only solution to the data-reconstruction problem. Memory systems use parity and error-correcting code (ECC) schemes to guarantee data integrity with an overhead ranging from 20 to 40 percent. Trading 100 percent redundancy for parity disks, RAID level 3 offers data integrity at a much lower cost than RAID1. In a RAID3 array, data are striped across n-1 disks in each row, and the nth disk is used for parity data. The single parity disk can become a performance bottleneck, so a variation on RAID3 interleaves parity bytes with data bytes across the array. This striped data and parity arrangement is the popular RAID level 5. Faced with a plethora of RAID5 and disk-mirroring products, what's a cautious system administrator to do? (Our September 1994 Buyers Guide references more than 100 RAID products.) Making the choice requires thinking about how you access your data and what trade-offs are most important.
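The parity idea behind RAID3 and RAID5 is easier to trust once you see it rebuild something. The sketch below assumes plain XOR parity over one row of equal-sized blocks. The helper names xor_blocks and parity_disk_for_row are hypothetical, and the rotation shown is only one possible placement, not any particular vendor's layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk_for_row(row, num_disks):
    """Hypothetical RAID5 rotation: parity steps back one disk per row."""
    return (num_disks - 1 - row) % num_disks


# One row of a five-disk array: four data blocks plus one parity block.
data = [b"ACCT", b"0042", b"PAID", b"$100"]
parity = xor_blocks(data)

# Lose the third data disk, then rebuild it from the survivors and the parity.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)

# RAID3 keeps parity on one disk; a RAID5-style rotation spreads it around.
print([parity_disk_for_row(row, 5) for row in range(6)])
```

Any single missing block in a row can be regenerated this way. Two missing blocks in the same row cannot, which is why the multiple-failure questions matter.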

Smokin' mirrors
Doesn't prudence dictate that you always use RAID5, due to its lower cost? Reductions in disk counts and complexity are strong points in favor of RAID5. If cost is your tightest constraint, evaluate the real expense of RAID5 and disk-mirroring configurations. In small sizes, the packaging and RAID5 software sometimes represent a large fraction of the purchase price, putting RAID5 and mirroring on the same cost plane. For large configurations, say, more than 20 gigabytes, RAID5 generally has a lower cost.

What could disk mirroring offer that justifies a 70 percent or more premium in disk space? For starters, it is more reliable because it offers protection from multiple failures. For a fair comparison, consider a five-disk RAID5 array using 2-gigabyte disks, yielding about 8 gigabytes of usable disk space with 2 gigabytes of parity information. The equivalent mirrored configuration uses eight disks, mirrored as four pairs that are then striped together to get the same I/O characteristics as the RAID5 array. This combination of striping and mirroring is referred to as RAID level 10 or RAID0+1, from its composition of functions. You could conceivably tolerate up to four sequential failures in the mirrored sets, as long as each one strikes a different pair, while a single failure leaves you with minimal data coverage in the RAID5 model. The trade-off is that the mean time between failures is probably shorter with mirroring, due to the larger number of devices with moving parts, and combining two levels of logical disk drivers adds a nontrivial administrative burden.
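The comparison above can be checked with a few lines of arithmetic. This sketch assumes the same two configurations (five 2-gigabyte disks in RAID5, eight in four mirrored pairs) and a simplified survival rule: RAID5 survives one failed disk, and the mirrored stripe survives as long as no pair loses both members. All function names are invented for the example.

```python
from itertools import combinations


def raid5_usable_gb(disks, disk_gb):
    """One disk's worth of space goes to parity."""
    return (disks - 1) * disk_gb


def raid10_usable_gb(disks, disk_gb):
    """Half of the disks hold mirror copies."""
    return (disks // 2) * disk_gb


def raid5_survives(failed):
    return len(failed) <= 1


def raid10_survives(failed, pairs):
    """Data is lost only when both members of some pair have failed."""
    return all(not (a in failed and b in failed) for a, b in pairs)


pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]
print(raid5_usable_gb(5, 2), raid10_usable_gb(8, 2))

# Count how many of the possible two-disk failures each layout survives.
raid10_doubles = list(combinations(range(8), 2))
raid10_ok = sum(raid10_survives(set(f), pairs) for f in raid10_doubles)
raid5_doubles = list(combinations(range(5), 2))
raid5_ok = sum(raid5_survives(set(f)) for f in raid5_doubles)
print(f"mirrored stripe: {raid10_ok} of {len(raid10_doubles)} double failures survived")
print(f"RAID5: {raid5_ok} of {len(raid5_doubles)} double failures survived")
```

The four-failure case is the lucky one, with one loss in each pair. The count of double failures gives a more sober picture of what the extra disks buy you.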

Mirroring's 100 percent redundancy also provides an alternative for system backups where raw devices are the medium of choice. After checkpointing your database to ensure that all pending writes have been flushed, break the mirror to produce a stable, consistent image of the database on the "broken" half of the mirror. Roll this snapshot to tape while continuing to use the other mirror half for on-line database operations. When the backup is done, let the disk mirroring software resynchronize the broken half and restore full duplication.

The cost of achieving parity
Performance contends with cost issues for the top-ranking spot in redundancy evaluation criteria. Many mirroring schemes let you round-robin read requests to the mirror components, or establish geometric read patterns that direct "front half" and "back half" requests to different disks in the mirrored pair. The former approach emulates the benefits of striping for random I/O while the latter approach sometimes minimizes seek times. RAID5 arrays handle reads like a striped set of disks. If you are mirroring without striping, RAID5 may offer performance benefits over the round-robin read scheme simply by putting more disks to work at once. Be sure you can tune the stripe size in the RAID5 array, or you may be inadvertently optimizing for sequential access when your access patterns are random in nature.
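The two mirror read policies differ only in how they pick a side for each request. Here is a rough sketch, assuming a two-way mirror and a volume of known size. The names round_robin_reader and geometric_reader are illustrative and do not correspond to any product's tunables.

```python
from itertools import count


def round_robin_reader(num_sides=2):
    """Alternate between mirror sides regardless of which block is requested."""
    ticket = count()

    def pick(block):
        return next(ticket) % num_sides

    return pick


def geometric_reader(total_blocks, num_sides=2):
    """Send "front half" blocks to side 0 and "back half" blocks to side 1."""

    def pick(block):
        return min(block * num_sides // total_blocks, num_sides - 1)

    return pick


requests = [10, 900, 11, 901]
rr = round_robin_reader()
geo = geometric_reader(total_blocks=1000)
print([rr(block) for block in requests])   # sides alternate
print([geo(block) for block in requests])  # each side stays in its own region
```

With the geometric policy, each disk's heads stay within half of the platter, which is where the occasional seek-time savings come from.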

RAID5 and mirroring exhibit salient differences in write performance. In contrast to the small and relatively constant penalty for multiplying writes to a mirrored pair, the drag on a RAID5 array increases as the read/write mixture shifts toward an update-intensive workload. Each modification of data in a RAID5 array requires reading the existing data and parity information, calculating a new parity block, and then writing both the new parity block and the new data block. A RAID5 array may produce less than half of its peak I/O operations per second when subjected to a write-heavy load.
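To make the write penalty concrete, the following sketch counts the disk operations behind one small update. It assumes XOR parity and the usual read-modify-write shortcut, in which the new parity is computed from the old parity, the old data, and the new data. The function names and the io_log list are hypothetical bookkeeping for the example.

```python
from functools import reduce


def xor_pair(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_row_parity(row):
    return reduce(xor_pair, row)


def raid5_small_write(row, parity, disk, new_data, io_log):
    """Update one data block and return the new parity block."""
    old_data = row[disk]
    io_log.append("read old data")
    old_parity = parity
    io_log.append("read old parity")
    new_parity = xor_pair(xor_pair(old_parity, old_data), new_data)
    row[disk] = new_data
    io_log.append("write new data")
    io_log.append("write new parity")
    return new_parity


def mirror_write(sides, block, new_data, io_log):
    """Write the same block to every side of the mirror."""
    for side in sides:
        side[block] = new_data
        io_log.append("write copy")


row = [b"ACCT", b"0042", b"PAID", b"$100"]
parity = full_row_parity(row)
raid5_ios = []
parity = raid5_small_write(row, parity, 2, b"VOID", raid5_ios)

sides = [[b"PAID"], [b"PAID"]]
mirror_ios = []
mirror_write(sides, 0, b"VOID", mirror_ios)

print(len(raid5_ios), "operations for RAID5 versus", len(mirror_ios), "for the mirror")
```

The mirror's two writes can proceed in parallel. The RAID5 reads must finish before the writes can start, so the cost is in latency as well as in operation count.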

Till someone gets hurt
Neither RAID5 nor mirroring protects against users who wish to render themselves redundant. If you have a heavy-handed file executioner who hasn't figured out that rm * doesn't ask for verification before ravaging a directory, you need to alias rm or run backups more frequently. Adding data and disk integrity features doesn't remove the need for a robust and redundant backup system. Improving the overall level of reliability requires that you build on a firm foundation, planning your backup strategy in conjunction with your redundancy configuration.

Here are some safety guidelines for configuring redundant disk arrays:

Trace the paths from CPU to disk block for each disk in a mirrored pair or row of a RAID5 array. Don't traverse any single component twice. SCSI buses, disk enclosures, cables, and power supplies should be as independent as possible in each half of the mirror.

Mirror between disks in the same enclosure at your own risk. If the power supply flames out, you may or may not lose data, but you'll definitely lose access to all of the data in the redundant set.

Don't stripe or mirror between disks on the same bus, even if they're in separate enclosures. The primary concern is reliability: the SCSI bus could fail and take out both halves of a mirror or a quorum of bytes in a RAID5 array. Performance is a secondary concern: if you are multiplying writes on a single SCSI bus, you may run out of SCSI bus bandwidth during sequential transfers.

Define your sparing and disk replacement strategies up front. Some RAID5 systems use hot-plug disks, which can be replaced and resynchronized on the fly. Others require that you bring the disk down to a state of inactivity, known as quiescing the drive. At a minimum, you'll have to unmount the filesystem or break the mirroring, then do an on-line drive replacement. Know where your spare drives are, and outline the sparing procedure with your vendor.

Hot-spare disks are an alternative to hot-plug disks. The spare disks are installed and spinning in the system in standby mode. When a disk fails in a redundant configuration, one of the hot-spare disks replaces it and is resynchronized with the remaining live disk(s). If you use hot spares, complete the datapath-tracing exercise for the default configuration and for each failure mode in which a hot-spare disk substitutes for a live one. If needed, segregate the hot spares into pools so you can eliminate end-stage configurations that would have mirrored disks on the same SCSI bus or in the same enclosure.

Explore what fielded, production systems do to gain insight into the kinds of policies and availability that your data center has already implemented.
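The path-tracing exercise in these guidelines lends itself to a simple audit. This is a minimal sketch, assuming you can write down the set of components each disk depends on. Every disk, controller, bus, enclosure, and power supply name below is invented for illustration.

```python
def shared_components(paths, disk_a, disk_b):
    """List the components that both disks depend on (ideally none)."""
    return sorted(paths[disk_a] & paths[disk_b])


# Hypothetical inventory: each disk maps to the components on its data path.
paths = {
    "disk0": {"controller-0", "scsi-bus-0", "enclosure-A", "power-A"},
    "disk1": {"controller-1", "scsi-bus-1", "enclosure-B", "power-B"},
    "spare0": {"controller-1", "scsi-bus-1", "enclosure-B", "power-B"},
}

# Default configuration: disk0 mirrored with disk1.
print("default mirror shares:", shared_components(paths, "disk0", "disk1"))

# Failure mode: disk0 dies and spare0 takes its place alongside disk1.
print("after sparing shares: ", shared_components(paths, "disk1", "spare0"))
```

In this made-up inventory the default mirror is clean, but the post-failure mirror sits entirely behind one bus, enclosure, and power supply. That is the end-stage configuration that pooling the hot spares is meant to rule out.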

Roads to recovery
The time to recover a failed mirror or array will affect the overall level of system availability you ultimately achieve. Look at the time to rebuild a disk after replacement. Does the vendor offer an optimized, dirty-region copy scheme, or must every data block be copied when a new disk enters a redundant set? Most of the optimized copy routines work only if a limited number of updates have occurred. If you plan on breaking a mirror to do backups while running the database against a single set of disks, be sure you'll be able to benefit from an optimized region copy when you re-attach the mirrored devices.
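To get a feel for why an optimized copy matters, here is a rough sketch of dirty-region tracking, assuming the mirroring software notes which fixed-size regions were written while the mirror was broken. The class DirtyRegionLog and its methods are hypothetical and are not drawn from any vendor's implementation.

```python
class DirtyRegionLog:
    """Remember which regions changed while one mirror half was detached."""

    def __init__(self, total_blocks, region_blocks):
        self.total_blocks = total_blocks
        self.region_blocks = region_blocks
        self.dirty = set()

    def note_write(self, block):
        """Mark the region containing this block as needing a copy."""
        self.dirty.add(block // self.region_blocks)

    def blocks_to_copy(self):
        """Blocks the resync must copy: every block in every dirty region."""
        copied = 0
        for region in self.dirty:
            start = region * self.region_blocks
            end = min(start + self.region_blocks, self.total_blocks)
            copied += end - start
        return copied


# A million-block volume tracked in 1,000-block regions.
log = DirtyRegionLog(total_blocks=1_000_000, region_blocks=1_000)
for block in (5, 17, 999, 250_000):   # writes made during the backup window
    log.note_write(block)

print("dirty regions:", sorted(log.dirty))
print("optimized resync copies", log.blocks_to_copy(), "blocks")
print("full resync copies", log.total_blocks, "blocks")
```

The savings shrink as updates spread across the volume. Once every region is dirty, the optimized copy degenerates into a full copy, which is why these schemes help only after a limited number of updates.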

Should you take a filesystem or database off line during a volume resynchronization? If you modify substantial portions of the database and then lose one of your primary disks, you've irrevocably lost data. When using the mirror-to-tape raw disk backup mechanism, keep database transaction logs on mirrored devices as well, so you can do a roll-forward from the checkpoint saved on the broken-mirror half if a primary disk fails. For filesystems, running naked with only half a mirror or a partial array is also acceptable, provided you can reconstruct the current state from backups plus file-edit records, source code control system files, or transaction records.

If you need to go beyond disk redundancy, multiple-host disk-sharing schemes can let you run a hot and standby server, clustered servers can arbitrate on-line access to shared disks, and replicated servers can provide multiple paths to the entire service. In future columns, we'll revisit these in detail.

The key for disk redundancy is to identify each failure mode you experience and estimate the time to recover from the failure, including the time to restore full redundancy and data integrity to the system. Starting at the disk and SCSI-bus failure level, work up through failures in the I/O system, host, database, network, and user application. Are your applications resilient in the face of disk errors? Does your code check errno religiously, looking for write operations that failed on full filesystems or reads that encountered media errors? Whether you use disk redundancy or not, be certain your applications understand failure modes and how to work around them by retrying or aborting operations that fail.
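What does checking errno religiously look like in practice? The sketch below shows one possible policy, assuming that interrupted or would-block conditions are worth retrying and that everything else, including a full filesystem (ENOSPC) or a media error (EIO), should abort the operation. The names classify and write_fully and the retry limits are made up, and the same pattern applies to a C program testing errno after write().

```python
import errno
import os
import tempfile
import time

# Assumption for this sketch: only these conditions are worth retrying.
TRANSIENT = {errno.EINTR, errno.EAGAIN}


def classify(err):
    """Decide how to react to a failed I/O call (hypothetical policy)."""
    if err.errno in TRANSIENT:
        return "retry"
    return "abort"   # ENOSPC, EIO, and anything unrecognized


def write_fully(fd, data, max_retries=3, delay_seconds=0.1):
    """Write every byte of data, retrying transient errors, raising on the rest."""
    remaining = memoryview(data)
    retries = 0
    while len(remaining) > 0:
        try:
            written = os.write(fd, remaining)
        except OSError as err:
            if classify(err) == "retry" and retries < max_retries:
                retries += 1
                time.sleep(delay_seconds)
                continue
            raise                      # let the caller abort the transaction
        remaining = remaining[written:]  # short writes are not errors; keep going
    os.fsync(fd)  # ask the OS to flush to the device so late errors surface here


# Usage: record one transaction in a scratch file and read it back.
fd, path = tempfile.mkstemp()
try:
    write_fully(fd, b"debit 100\n")
finally:
    os.close(fd)
with open(path, "rb") as handle:
    readback = handle.read()
os.remove(path)
print(readback)
```

The important habit is the one in the except clause: every failure is either retried deliberately or passed up to code that can abort cleanly, and none is silently ignored.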

The fine print
Implementing this social contract between system management and application development is politically complex. The application side of the house relies on you to deliver a stable, reliable, and robust platform. You're making an implicit promise not to miss a single data beat. In accepting that responsibility, you must also assume some authority to dictate how applications will be designed and rolled out, so users are getting the best value out of the total system.

As Unix is relied upon in more business-critical situations, systems administrators must necessarily get involved early in the development cycle. Failure to consider all aspects of data integrity and reliability has dire consequences. Nobody wins if you provide the highest quality availability and access to the lowest quality data.

About the Author
Hal Stern is an area technology manager for Sun. He can be reached at hal.stern@sunworld.com.

URL: http://www.sunworld.com/asm-11-1994/asm-11-sysadmin.html
Last updated: 1 December 1994