RAID: What does it mean to me?
With its cost more reasonable than ever, a redundant array of inexpensive disks (RAID) may be (or perhaps should be) in your future. Here's a primer on the pros and cons of various RAID options.
Rather than endure the inherent limitations of single disk drives, many companies are turning to RAID configurations such as RAID-5 to boost I/O performance and improve data storage reliability. Here's a technical look at the various levels of RAID, including RAID-5, and some guidance on which will best serve your data needs. (4,800 words)
Unfortunately, a substantial mythology has grown up around RAID, and the technical misunderstandings engendered by these myths have led to many a surprise when attempting to deploy these advanced disk subsystems. This article describes the most common RAID configurations and explains what to expect from them.
The goal of RAID units in general is to overcome the inherent limitations of a single disk drive. Various RAID configurations are designed to address either performance or reliability. (See the sidebar "RAID levels zero through five" for basic definitions of the various RAID levels.) Sometimes both are enhanced. However, when all is said and done, there is no free lunch.
RAID-0: Divide and conquer or striping for performance
RAID-0, also known as striping, is the basis for the vast majority of RAID configurations in use. It is crucial to understand RAID-0 before addressing RAID-5. Under the RAID-0 organization, the data from the logical volume is spread evenly across all of the member disks by dividing the data into blocks. RAID-0 assigns blocks to a logical volume's physical member disks by rotation: the first block goes on the first disk, the second to the second disk, and so on. The size of the block is called the chunk size, while the number of disks in the logical volume is called the stripe width. Both parameters can have a significant impact on performance.
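The rotation rule can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation; the function names are hypothetical, and disks and chunks are numbered from zero for convenience.

```python
def locate_chunk(logical_chunk: int, stripe_width: int) -> tuple:
    """Map a logical chunk number to (member disk, chunk index on that disk).

    Chunks are dealt out by rotation: chunk 0 goes to disk 0, chunk 1 to
    disk 1, and so on, wrapping around after the last member disk.
    """
    disk = logical_chunk % stripe_width
    chunk_on_disk = logical_chunk // stripe_width
    return disk, chunk_on_disk


def disks_touched(offset: int, length: int, chunk_size: int, stripe_width: int) -> set:
    """Return the set of member disks activated by a byte-range request."""
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return {
        locate_chunk(chunk, stripe_width)[0]
        for chunk in range(first_chunk, last_chunk + 1)
    }
```

With a 16-kilobyte chunk size and a stripe width of four, a 64-kilobyte request starting at offset zero touches all four members, while a 2-kilobyte request touches only one.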
With RAID-0, a request for user data larger than the chunk size is serviced by more than one member disk. This permits -- and requires -- the activation of multiple disks and use of multiple data paths. Activating multiple disks results in a major performance enhancement, particularly if the limiting factor on performance is the disk's internal transfer time. This is the case for sequential disk access, which requires very little disk head seek time.
RAID-0 breaks the logical request into smaller physical requests, which are dispatched serially but serviced concurrently. In other words, a single thread of execution generates the commands for the physical I/O subsystem, services the interrupts associated with those physical I/Os, and (for reads) assembles the results of the various member operations into a uniform logical I/O buffer as expected by the user. Because this process requires just a few milliseconds (a couple hundred microseconds to generate the I/O requests and a few milliseconds to service them), nearly perfect scaling of I/O performance is achieved as disks are added. Naturally, the logical I/O is not complete until all of the constituent physical I/Os are complete.
Random access blues
Unfortunately, this optimization does not work for the more common case, which is random access. When the disk arm must seek to a new location before reading or writing on the platter, the I/O time is dominated by head seek and disk rotation time, while internal transfer time represents less than 2 percent of the elapsed time for the I/O. The net effect is that striping doesn't improve the performance of random-access disk I/O requests -- at least, not directly.
Here's where one of the aforementioned myths comes into play. In the case of striping, it is said that "it's always best to split up I/O requests so that pieces can be transferred in parallel." Unfortunately, the notion of dividing a single I/O request into multiple parallel transfers seems so intuitive and appealing that it discourages people from looking any further. Thus, people often overlook the fixed, non-zero -- and in fact non-trivial -- amount of work that must be done to accomplish any disk I/O, regardless of the size of the I/O. Each request to a member disk requires the generation of a SCSI command packet with all of its associated overhead; every SCSI I/O request typically involves four to seven discrete operations. For small requests, the effort of transferring the data is trivial compared to the amount of work required to set up and process the I/O. Typically the break-even point comes when the chunk size is 4 to 8 kilobytes. For chunk sizes smaller than this, overhead certainly increases, there is no benefit to individual requests, and performance will probably degrade.
Another RAID myth is that "striping doesn't improve random I/O." In fact, even though striping does not benefit individual random access requests, it can dramatically improve overall disk I/O throughput in environments dominated by random access. The rationale behind this seemingly paradoxical situation is similar to opening more checkout lines at the grocery store: each customer spends less time waiting in a queue.
Even top-performing disk drives deliver about 100 random I/Os per second. This is much less than demanded on a routine basis by most servers, even low-end systems such as a Sun SPARCserver 5. When more I/O requests are submitted than the disk can actually handle, a queue forms and requests must wait their turn.
Striping spreads the data from a logical volume evenly across all of the volume's member disks, so it provides a perfect mechanism for avoiding long queues. The vast majority of systems with disk I/O performance problems suffer from a drastic imbalance of I/O requests. Often the majority of all requests are directed to a single disk!
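To put rough numbers on the checkout-line effect, here is a small sketch. It assumes the textbook single-server queue approximation (response time equals service time divided by one minus utilization), which is a modeling assumption and not something measured in this article; the function name and the default of 100 I/Os per second per disk are likewise illustrative.

```python
def response_time_ms(request_rate: float, disks: int, disk_iops: float = 100.0) -> float:
    """Estimate the average response time of a random I/O, in milliseconds.

    Assumes requests are spread evenly over the member disks and that each
    disk behaves like a simple single-server queue (an approximation).
    """
    per_disk_rate = request_rate / disks
    utilization = per_disk_rate / disk_iops
    if utilization >= 1.0:
        return float("inf")  # the queue grows without bound
    service_ms = 1000.0 / disk_iops
    return service_ms / (1.0 - utilization)
```

Under these assumptions, 90 requests per second aimed at one disk yields a response time near 100 milliseconds; spread across a four-disk stripe, the same load is serviced in under 13 milliseconds each, even though no individual disk access got any faster.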
To review: Striping improves sequential I/O performance by permitting multiple member disks to operate in parallel; it also improves random access I/O performance on servers by reducing disk utilization and disk queues. What's missing from this pretty picture? Reliability. That's where other RAID configurations come in.
RAID-1: Data redundancy/mirroring
The problem with a stripe is that it depends on every single one of its member disks for survival: if any one of them fails, the entire volume is useless. RAID-1, the obvious response to RAID-0's reliability issues, simply arranges for each disk to be duplicated in its entirety. Although conceptually clean, this approach can be expensive because it obviously requires twice as many disk drives as an unprotected storage system.
Performance of a mirror is also the subject of disk mythology. Under most circumstances, the performance of mirrored reads is improved by lowering disk utilization: Single-threaded reads aren't improved at all, but multithreaded reads benefit from having more ways to get at the data. Most mirroring implementations have an option to select a read policy; usually this amounts to a choice between round-robin scheduling and geometric scheduling.
Round-robin causes each successive read to the mirror to be dispatched to a different submirror. For random reads this is a simple way to lower disk utilization, but in some cases it degrades sequential performance because alternating the access between multiple drives can defeat the read-ahead optimization implemented in each disk's embedded target controller.
Geometric scheduling divides the mirror into regions (one for each submirror), and each member disk services only requests within its assigned region. For example, a two-way mirror might direct all accesses to data in the first half of the mirror to the first disk, and the remainder to the second disk. Geometric reads tend to benefit single-threaded sequential access for precisely the same reason that round-robin occasionally degrades this type of access. Geometric reads are not a very good idea when the logical mirror is only half full, for example.
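The two read policies can be sketched as follows. The class and method names are hypothetical, and real volume managers apply more elaborate rules; this only illustrates the selection logic described above.

```python
import itertools


class RoundRobinPolicy:
    """Send each successive read to the next submirror in turn."""

    def __init__(self, submirrors: int):
        self._turn = itertools.cycle(range(submirrors))

    def choose(self, block: int) -> int:
        # The block address is ignored; only the order of arrival matters.
        return next(self._turn)


class GeometricPolicy:
    """Assign each submirror one contiguous region of the mirror."""

    def __init__(self, submirrors: int, total_blocks: int):
        self.submirrors = submirrors
        self.region_size = -(-total_blocks // submirrors)  # ceiling division

    def choose(self, block: int) -> int:
        return min(block // self.region_size, self.submirrors - 1)
```

On a two-way mirror of 1,000 blocks, the geometric policy sends every read below block 500 to the first submirror. If only the first half of the mirror holds data, the second submirror never services a read, which is why the geometric policy suits a half-full mirror so poorly.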
While mirrored reads are not well understood, it is mirrored writes that attract the most disk mythology. An extremely common misperception is that writes to a mirror are half the speed of writes to an individual member. Fortunately, this is only rarely the case. Although the logical write operation is not complete until both of the member disk writes are complete, that doesn't mean that the two writes must be serviced sequentially. Most RAID-1 implementations handle mirrors the same way they handle stripes: the writes are dispatched sequentially to both members, but the physical writes are serviced in parallel. As a result, typical writes to mirrors take only about 15 to 20 percent longer than writes to a single member.
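A quick simulation shows why waiting for the slower of two parallel writes costs far less than doing the writes back to back. The sketch below assumes that each member's service time is drawn independently and uniformly from 3 to 15 milliseconds; that distribution, the sample count, and the function name are all assumptions made for illustration, and dispatch overhead is ignored.

```python
import random


def mean_write_times(samples: int = 20000, low_ms: float = 3.0,
                     high_ms: float = 15.0, seed: int = 1) -> tuple:
    """Return mean times (single disk, parallel mirror, serial mirror) in ms."""
    rng = random.Random(seed)
    single = parallel = serial = 0.0
    for _ in range(samples):
        first = rng.uniform(low_ms, high_ms)
        second = rng.uniform(low_ms, high_ms)
        single += first
        parallel += max(first, second)  # done when the slower member finishes
        serial += first + second        # the mythical "half speed" case
    return single / samples, parallel / samples, serial / samples
```

Under these assumptions the parallel mirror averages roughly 20 percent longer than a single disk, while the serial case takes about twice as long. The exact penalty depends on how widely service times vary on real hardware.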
RAID-0+1: Making mirroring more flexible
One drawback with true mirrors is that strictly speaking, they only deal with single disks -- unacceptably small storage units in most modern configurations, from either a performance (disk utilization) or a disk management perspective. Mirrors most often exist with striping: Mirroring is layered on top of striping -- each of the submirrors is a stripe. This effectively combines the strengths and weaknesses of each. It eliminates the effects of the read policy, but the other parameters are what one would expect from the combined configuration. Sequential reads improve through striping and random reads improve through the use of many member disks. Writes degrade by 15 to 20 percent from the performance of a submirror, but this is more than offset by the striped nature of a RAID-0+1 (or RAID-10) submirror.
RAID-3: Parity check
Mirroring and striping are the backdrop for RAID-5. The drawback to mirroring is, of course, the sheer expense of having twice as many disks as a non-mirrored configuration. Parity-computation forms of RAID, including RAID-3 and RAID-5, use a reversible parity function to avoid this necessity.
RAID-3 leverages the basic organization of RAID-0, but adds an additional disk to the stripe width. It uses the additional disk literally to hold the parity data. It divides the data into stripes as in RAID-0, and the parity disk contains corresponding blocks that are the computed parity for the data blocks residing on each of the members of the stripe unit:
parity = b1 xor b2 xor ... xor bn
The XOR function has the unique property of being reversible; given all but one of the blocks and the parity block, the missing block can be computed. Although this sounds like a lot of computation, it isn't: For a typical 8-kilobyte file system I/O, parity generation takes much less than a millisecond.
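The reversibility is easy to demonstrate. The following is a minimal sketch with hypothetical function names, operating on small byte strings in place of real disk blocks.

```python
from functools import reduce


def xor_blocks(blocks) -> bytes:
    """Byte-wise XOR of a sequence of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity: bytes) -> bytes:
    """Recover the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

For example, with data blocks `b"\x01\x02"`, `b"\x10\x20"`, and `b"\xff\x00"`, the parity block is `b"\xee\x22"`; XORing the parity with the first and third blocks returns the second.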
The data and parity blocks are arranged in a straightforward manner. For example, a five-wide stripe (4 data disks and one disk for parity) looks like this:
disk1    disk2    disk3    disk4    disk5
d1       d2       d3       d4       parity0
d5       d6       d7       d8       parity1
d9       d10      d11      d12      parity2
d13      d14      d15      d16      parity3
d17      d18      d19      d20      parity4
Thus, disk1 handles the first data block (d1), disk2 handles the second data block (d2), and so on; once every data disk has had a turn, the cycle begins again with disk1. Meanwhile, disk5 handles the corresponding parity data.
When servicing large quantities of single-stream sequential I/O, RAID-3 delivers efficiency and high performance at a relatively low cost. Bulk I/O is normally requested in large increments, and RAID-3 performance is maximized for sequential I/O that exceeds the data stripe width (the number of member disks excluding the parity member). RAID-3 has been popular in the supercomputing arena for years because it delivers very good performance to that type of application.
The problem with RAID-3 is that it does not work well for random access I/O, including most DBMS workloads and virtually all timesharing workloads. Random access suffers under RAID-3 because of the physical organization of a RAID-3 stripe: every write to a RAID-3 volume necessarily involves accessing the parity disk. RAID-3 employs only one parity disk, and it becomes the bottleneck for the entire volume.
Another myth purports that in parity RAID volumes, the parity members are accessed on reads. Yet in most implementations, reads to the volume do not require access to the parity data unless there is an error reading from a member disk. Thus, from a performance standpoint, RAID-3 reads are exactly equivalent to RAID-0 reads. (Of course RAID-0 and RAID-3 work quite differently from a reliability and recovery perspective.)
In practice, few vendors actually implement RAID-3, since RAID-5 efficiently addresses the shortcomings of RAID-3, while retaining its strengths.
RAID-4: Independent access
Strictly speaking, RAID-3 is implemented at a byte level (the chunk size is technically one byte or one word). This requires that every member disk be involved in literally every I/O.
RAID-4 is a variant of RAID-3 in which the chunk size is much larger, typically on the order of a disk sector or a typical disk I/O size. This permits requests for small I/Os to be satisfied from individual disks, rather than requiring every member disk to participate. In actuality, virtually every commercial "RAID-3" device that uses SCSI disks as components actually implements RAID-4. Neither RAID-3 nor RAID-4 is common in new products today since both suffer from the same bottleneck: the single parity disk.
RAID-5: Rotated parity
The primary weakness of RAID-3 is that writes overutilize the parity disk. RAID-5 overcomes this problem by distributing the parity blocks across all of the member disks; thus all member disks contain some data and some parity. With RAID-3, for any given stripe unit, the location of data and parity is fixed. RAID-5 spreads the parity by putting the parity block for each stripe unit in successive -- different -- locations. The same set of data shown above looks like this when organized in RAID-5:
disk1    disk2    disk3    disk4    disk5
d1       d2       d3       d4       parity0
d5       d6       d7       parity1  d8
d9       d10      parity2  d11      d12
d13      parity3  d14      d15      d16
parity4  d17      d18      d19      d20
Thus, both the parity and data functions rotate from disk to disk.
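The rotation in the layout above can be generated with a short function. This sketch reproduces the particular pattern shown here, with parity starting on the last disk and moving one disk to the left on each row; other implementations may rotate in a different order, and the function name is hypothetical.

```python
def raid5_row(row: int, width: int) -> list:
    """Return the block labels held by each disk for one stripe row."""
    parity_disk = (width - 1 - row) % width  # moves left by one disk per row
    next_data = row * (width - 1) + 1        # data blocks are numbered from d1
    layout = []
    for disk in range(width):
        if disk == parity_disk:
            layout.append(f"parity{row}")
        else:
            layout.append(f"d{next_data}")
            next_data += 1
    return layout
```

Calling `raid5_row(1, 5)` returns `['d5', 'd6', 'd7', 'parity1', 'd8']`, matching the second row of the layout.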
If the data accesses are full-width, all member disks get equal utilization. Random access I/Os, almost always small (typically 2 to 8 kilobytes in size), only utilize some members for each I/O. For example, writing block d1 under RAID-3 activates disk1 and disk5, and writing block d6 activates disk2 and disk5; disk5 has twice the utilization of the other members. RAID-5 also activates disk1 and disk5 to modify block d1, but block d6 uses disk2 and disk4; this results in equal utilization on all four disks. Obviously, this also permits multiple writes to be serviced simultaneously.
As with RAID-3, reads from RAID-5 volumes do not involve access to the parity member unless one of the member disks is unable to provide its data. Also as with RAID-3, RAID-5 maximizes sequential read performance when the I/O request size is a multiple of the stripe width (excluding parity). Because parity is neither retrieved nor computed for reads, and because RAID-5 distributes the data across one more member disk than a RAID-0 stripe, sequential reads from RAID-5 match those of RAID-0, and RAID-5 handles random reads slightly faster than RAID-0. For writes, however, RAID-5 differs greatly from RAID-0.
Consider the process of writing block d7. With RAID-0, this is a simple write to a single disk. A RAID-5 write, however, involves much more work. On the RAID-5 volume shown above, for example, at a minimum the data block itself must be written, and much worse, the parity block must be updated to reflect the new contents of the stripe unit [d5,d6,d7,parity1,d8]. To do that, the old d7 and parity1 must be read, and the new parity1 is computed; finally, the revised d7 and parity1 can be written to disk. All in all, the "simple" write operation actually results in two reads and two writes, a four-fold increase in the I/O load! When the I/O request involves half of d6 and half of d7, things get still more cumbersome. The RAID software has to read d6 and d7 to obtain the unchanged portions of each chunk, modify the changed sections, and finally write both d6 and d7 back.
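The two-reads-and-two-writes sequence can be sketched as follows. The new parity is obtained by XORing the old data out of the old parity and XORing the new data in, so the other data chunks in the stripe never need to be read. The in-memory `disks` structure (a list of per-disk lists of chunks) and the function names are hypothetical stand-ins for real devices.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(disks, data_disk: int, parity_disk: int, row: int, new_data: bytes) -> int:
    """Update one chunk and its parity; return the number of physical I/Os."""
    old_data = disks[data_disk][row]      # read 1: old data
    old_parity = disks[parity_disk][row]  # read 2: old parity
    # Remove the old data's contribution to parity, then add the new data's.
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    disks[data_disk][row] = new_data      # write 1: new data
    disks[parity_disk][row] = new_parity  # write 2: new parity
    return 4
```

The result is the same parity that a full recomputation over every data chunk in the stripe would produce, at the cost of four physical I/Os for one logical write.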
Beyond quadrupled overhead, RAID-5 introduces data integrity issues. Specifically, if the system power fails after the new d7 is actually written to the disk, but before the new parity1 block is computed and written to the disk, the parity block will not match the data actually on the disk when the system is restored.
To avoid this catastrophic situation, most array software or firmware uses a two-phase commit protocol similar to that used by DBMS systems. This protocol involves the use of a log area. Array software normally services write requests like this:
1) The old data block and the old parity block are read from the member disks.
2) The new parity block is computed. (This process is commonly and erroneously thought to be the most expensive part of RAID-5 overhead, but parity computation consumes less than a millisecond, a figure dwarfed by the typical 3-15 millisecond service times for I/O to member disks.)
3) The modified data and new parity are written to the log, along with their block identities in case the log must be replayed after a failure. This is accomplished in a single I/O operation.
4) The modified data and parity are written to the member disks. Data and parity are written in parallel.
5) The last operation removes the modified parts of the log associated with this operation.
Although all of the physical I/Os in each step can be (and normally are) serviced in parallel, full integrity requires each step to fully complete before the next one begins. The cost of all this is that the "simple" write operation balloons into two reads to obtain parity data, the parity computation, two writes to manage the two-phase commit log, one write for the parity disk, and one or two writes to the data disks. All told, this represents a six-fold increase in actual I/O load and (typically) about a 60 percent degradation in actual I/O speed.
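The logged write sequence can be sketched as below, counting one I/O for each physical read or write. This is an illustrative model only: the in-memory `disks` and `log` structures and the function names are hypothetical, the initial read of old data and parity is the one that the parity computation requires, and real array firmware handles many details omitted here.

```python
def logged_write(disks, log, data_disk, parity_disk, row, new_data) -> int:
    """Apply one small write through a log; return the physical I/O count."""
    ios = 0
    # Read the old data and old parity (two reads).
    old_data = disks[data_disk][row]
    old_parity = disks[parity_disk][row]
    ios += 2
    # Compute the new parity (no I/O).
    new_parity = bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))
    # Record data, parity, and their identities in the log (one write).
    log.append((row, data_disk, new_data, parity_disk, new_parity))
    ios += 1
    # Write data and parity to the member disks (two writes).
    disks[data_disk][row] = new_data
    disks[parity_disk][row] = new_parity
    ios += 2
    # Remove the completed entry from the log (one write).
    log.pop()
    ios += 1
    return ios


def replay(disks, log) -> None:
    """After a crash, reapply every logged update, then empty the log."""
    for row, data_disk, new_data, parity_disk, new_parity in log:
        disks[data_disk][row] = new_data
        disks[parity_disk][row] = new_parity
    log.clear()
```

The function returns six for a single-chunk update, in line with the six-fold increase described above. If power fails between the data write and the parity write, `replay` restores consistency from the surviving log entry.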
For obvious reasons, array software tries very hard to optimize these operations. In particular, non-volatile memory is frequently used to accelerate the write operations. This has one obvious and one not-so-obvious benefit.
The obvious benefit is speed: buffering the modified data and parity blocks in non-volatile memory is an order of magnitude faster than writing them all the way to physical disk (about 0.5 milliseconds compared to 5-12 milliseconds). Non-volatile memory also speeds up writes to the RAID-5 log, as well as writes of other kinds of data. The not-so-obvious benefit is write cancellation: the delayed "lazy" write operation of a non-volatile cache means that the array software can avoid physically (re)writing data blocks that are written frequently. For example, the blocks that contain the RAID-5 log are written twice within about 20 milliseconds for a single-write logical operation, and even more frequently when operating in most real environments. Since the non-volatile buffer normally is flushed to disk much less frequently than that, many physical write operations are avoided.
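Write cancellation falls out naturally from a write-back cache that keeps only the latest contents of each block. The class below is a minimal sketch with hypothetical names; a dictionary stands in for both the non-volatile memory and the disk.

```python
class WriteBackCache:
    """Hold dirty blocks in (notionally non-volatile) memory until flushed."""

    def __init__(self):
        self.dirty = {}           # block id -> most recent contents
        self.logical_writes = 0
        self.physical_writes = 0

    def write(self, block_id, data: bytes) -> None:
        # A later write to the same block replaces the earlier one in place,
        # so the earlier version never has to reach the disk.
        self.dirty[block_id] = data
        self.logical_writes += 1

    def flush(self, disk: dict) -> None:
        # Each dirty block costs one physical write, however often it changed.
        for block_id, data in self.dirty.items():
            disk[block_id] = data
            self.physical_writes += 1
        self.dirty.clear()
```

Ten updates to the same log block between flushes result in a single physical write.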
Array caches typically accelerate read operations much less than one might expect. Most caches are 4 to 64 megabytes in size, compared to typical disk farm sizes of 20 to 400 gigabytes. This is a much smaller proportion (about 1:1000) than found in CPU caches (1:100) or even main-memory disk buffers (about 1:200). Additionally, disk access patterns usually are much more random in nature than memory access, and the net result is that read hit rates on an array cache are far lower than those normally associated with the better-measured CPU caches. Read caches in RAID help primarily when applications write sequentially in small (less than a stripe width) units. Creation of a database is a good example; many databases issue 2-kilobyte I/Os, and DBMS creation is basically a sequential process.
RAID operation in degraded mode
Most users never consider what happens to a RAID volume in the face of a failure, but a brief look at this topic reveals a lot about how to configure RAID volumes. The failure behavior of a mirror is simple. When a disk goes down, the submirror that contains it is taken offline. Until the disk is replaced and the submirror brought back online (possibly by hot sparing), all access goes to the surviving submirror. The performance of a degraded system is then equivalent to the performance of the submirror.
Degraded RAID-3 and RAID-5, however, are distinctly not straightforward, because they store data redundancy in an encoded form. Writing on parity RAID volumes is the same -- slow -- whether in degraded mode or in normal mode. Reads from degraded parity RAID volumes, however, are quite a different matter. When a member disk fails, it becomes impossible to obtain that member's data without recovering the data from every other member disk and regenerating the parity computation. For very wide volumes, this can result in an astonishing amount of overhead: for a 30-disk volume, a read from the failed disk results in 29 physical I/O operations! Since writes to such a volume also are very expensive, it is usually wise to limit the width of a parity RAID volume to about six disks. Most array software permits the concatenation of multiple RAID volumes if larger-capacity volumes are required.
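The cost of a degraded read is easy to see in a sketch. The function below rebuilds the failed member's chunk by reading and XORing the corresponding chunk from every surviving member; the in-memory `disks` layout and the function name are hypothetical.

```python
def degraded_read(disks, failed_disk: int, row: int) -> tuple:
    """Return (reconstructed chunk, number of physical reads required)."""
    survivors = [
        chunks[row] for index, chunks in enumerate(disks) if index != failed_disk
    ]
    result = bytes(len(survivors[0]))  # start from all zeros
    for chunk in survivors:
        result = bytes(a ^ b for a, b in zip(result, chunk))
    return result, len(survivors)
```

On a 30-disk volume the function performs 29 reads to return a single chunk; on a six-disk volume it performs only five.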
Putting it all together
By now it should be clear that each of the various RAID organizations has clear advantages and disadvantages. Striping solves most of the performance problems associated with individual disks, but for configurations involving realistic dataset sizes, striping alone is far too subject to member disk failure to be of practical use. Mirroring provides reliability, and does so with a reasonable tradeoff in performance. Unfortunately, the sheer cost of mirroring often makes it unacceptable in the very situations that most require high reliability: very large configurations. RAID-5 provides reliability comparable to that of mirroring, combined with substantially lower capital expenditure. However, for some applications, primarily those which have any non-trivial amount of writing, RAID-5 exacts an impractically high cost in terms of performance.
Fortunately, there is no reason why users can't mix or match different types of RAID volumes in a single configuration, and in fact this is usually the most appropriate strategy. For example, consider a large DBMS system used to support both online transaction processing (OLTP) and decision support (DSS). Decision support is characterized by heavy sequential I/O, most of which is dominated by read-only activity; meanwhile, random access reigns with OLTP, which usually sustains a considerable mix of updates. In such a system, the OLTP tables that are heavily updated should clearly avoid RAID-5 -- particularly if the DBMS is configured to operate in disk units of 2 kilobytes. On the other hand, the primary tables that are heavily accessed by the DSS applications are likely to be read a lot, not written very much, and tend to be larger and thus much more expensive to mirror. For those tables, RAID-5 is the right solution. However, most DSS applications make extensive use of sort areas and temporary tables -- data areas that endure extensive writing; if this storage must be protected against disk failure, RAID-0+1 is probably the best option. However, since this storage is very transitory (the data on it is only valid during the execution of a sort or join), a higher-performance alternative is a simple RAID-0 stripe.
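The reasoning in this example can be condensed into a rule of thumb. The helper below is a deliberately simplified sketch with hypothetical names; it encodes only the two factors discussed here and is no substitute for measuring actual workloads.

```python
def suggest_raid_level(write_heavy: bool, must_survive_disk_failure: bool) -> str:
    """Suggest a RAID organization for one dataset, following the example above."""
    if not must_survive_disk_failure:
        return "RAID-0"    # transitory data such as sort areas
    if write_heavy:
        return "RAID-0+1"  # heavily updated tables, protected temporary storage
    return "RAID-5"        # large, read-mostly tables
```

For instance, heavily updated OLTP tables map to RAID-0+1, the large read-mostly DSS tables map to RAID-5, and unprotected sort areas map to a plain RAID-0 stripe.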
There isn't any panacea for configuring high-performance disk subsystems. Fortunately, configuration planners can mix and match the various alternatives to optimize storage subsystem requirements. There seems to be a place for most of the available RAID configurations in today's large-scale systems. Choosing the right one for each part of an application is a balancing act that requires the administrator to carefully research the requirements of each disk dataset and to match the various disk configuration options to actual usage.
Sidebar: RAID levels zero through five
RAID comes in several flavors, levels zero through five. Each level is optimized for various capabilities, including improved performance of read or write operations, and improved data availability through redundant copies or parity checking. While RAID-2 and RAID-4 are seldom used, all other RAID levels are supported by many different vendors.
RAID-0 is a high-performance/low-availability level. It provides basic disk striping without parity protection to catch errors, so while throughput is high, no redundancy is provided. Therefore, RAID-0 is considered a pseudo-RAID. Implementations should be relatively inexpensive because no extra storage is provided for parity information and because the software and firmware are not as complex as those of other RAID levels. If one disk in the array happens to fail, however, all data in the array is unavailable.
RAID-1 is a disk-mirroring strategy for high availability. All data is written twice to separate drives. The cost per megabyte of storage is higher, of course, but if one drive fails, normal operations can continue with the duplicate data. And if the RAID device permits hot-swapping of drives, the bad drive can be replaced without interrupting operations. Performance with RAID-1 is moderate, performing faster on reads than on writes.
Combining RAID-0 and -1 provides two sets of striped disks, and is not uncommon. Striping increases throughput, and simultaneous reads from the two sets will mitigate the performance drag caused by writing everything two times.
RAID-2 performs disk striping at the bit level and uses one or more disks to store parity information. RAID-2 is not used very often because it is considered to be slow and expensive.
RAID-3 uses data striping, generally at the byte level, and uses one disk to store parity information. Striping improves the throughput of the system, and using only one disk per set for parity information reduces the cost per megabyte of storage. Striping data in small chunks provides excellent performance when transferring large amounts of data, because all disks operate in parallel. Two disks must fail within a set before data would become unavailable.
RAID-4 stripes data in larger chunks, which provides better performance than RAID-3 when transferring small amounts of data.
RAID-5 stripes data in blocks sequentially across all disks in an array and writes parity data on all disks as well. By distributing parity information across all disks, RAID-5 eliminates the bottleneck sometimes created by a single parity disk. RAID-5 is increasingly popular and is well suited to transaction environments.
-- Barry D. Bowen (firstname.lastname@example.org) is an industry analyst and writer with the Bowen Group Inc., based in Bellingham, WA.
About the author
Brian Wong (email@example.com) is a staff engineer in the SMCC Server Products Group. His current writing project is a book on configuration and capacity planning. Reach Brian at firstname.lastname@example.org.