Optimize your RAID configurations for maximum performance
We give you the crucial factors to consider in configuring your disk subsystems most effectively
Users are often faced with a myriad of choices when configuring disk subsystems. Performance, reliability, operation under degraded mode, cost, access times, and application complexity contribute to a bewildering array of considerations. I/O workloads must be matched with the strengths and weaknesses of various RAID organizations and their implementations. We detail the key considerations so you can determine the best strategy. (2,500 words)
Probably the most significant information you need to choose or optimize a RAID configuration relates to the pattern of your disk I/O. Disk access patterns have two basic attributes: I/O size and randomness.
The most common I/O sizes are 2 kilobytes, 8 kilobytes, and 56 kilobytes; in a few environments, 1 MB is common. Most database applications use 2 kilobytes, because the data being manipulated is normally quite small. All of the common database implementations (Oracle, Informix, Sybase, and DB2) default to 2 kilobytes. The small 2-kilobyte block size is appropriate for virtually all transaction processing applications. Many database applications involving decision support can benefit substantially by using very large I/O blocks (Oracle can be configured with a block size of 64 kilobytes) and by clustering these together in maximum size I/O operations. Under some circumstances these I/O operations reach 1 MB in size.
Note 1: [...] for each file system can be changed using [...], and it can be set initially on a file system with the -C [...]
For most file system work, 8-kilobyte I/Os dominate. This is because the default block size of both UFS and VxFS is 8 kilobytes, and most access occurs in this size. When relatively large files are created, UFS attempts to cluster several blocks together on the disk in order to take advantage of the greater efficiency of larger I/O sizes. By default, the file system attempts to gather up to maxcontig 8-kilobyte blocks into a single physical I/O. The default maxcontig value is 7 (note 1). If maxcontig blocks are written to a file in short order, they will be clustered together and written in a single large I/O of 7 x 8 = 56 kilobytes. Thus the second most common I/O size for file systems is 56 kilobytes.
There are two fundamental I/O patterns, sequential and random access. Sequential access is more familiar to most people. For example, a compiler reads source code and header files sequentially; it also writes its generated object code sequentially. Common examples of random access are indexed retrievals from the database, such as obtaining your current account balance at the bank.
One of the most commonly misunderstood facts about I/O patterns is that a system servicing many concurrent users nearly always has a purely random access I/O pattern. This holds true even when each user operates on their own data sequentially. Except in very rare cases, the aggregate of user I/O is directed at a single collection of physical disks, and the random arrival nature of the multiuser environment has a shuffling effect on the I/O patterns as they arrive at the disks.
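The shuffling effect is easy to see in a toy model. The sketch below is only an illustration under stated assumptions: twenty hypothetical users each read 100 consecutive blocks from their own region of a shared disk, and requests arrive in a random order across users. The function names and block numbering are invented for the example.

```python
import random

def interleave(streams, seed=0):
    """Merge per-user block lists in random arrival order.

    Each user's own requests stay in their original (sequential) order;
    only the ordering between users is shuffled.
    """
    rng = random.Random(seed)
    cursors = [0] * len(streams)
    live = [i for i, s in enumerate(streams) if s]
    merged = []
    while live:
        user = rng.choice(live)              # whoever happens to arrive next
        merged.append(streams[user][cursors[user]])
        cursors[user] += 1
        if cursors[user] == len(streams[user]):
            live.remove(user)                # this user has finished
    return merged

def sequential_fraction(blocks):
    """Fraction of requests that address the block right after the previous one."""
    if len(blocks) < 2:
        return 0.0
    hits = sum(1 for a, b in zip(blocks, blocks[1:]) if b == a + 1)
    return hits / (len(blocks) - 1)

# Twenty hypothetical users, each reading 100 consecutive blocks.
users = [list(range(u * 10_000, u * 10_000 + 100)) for u in range(20)]

print(sequential_fraction(users[0]))            # one user alone: fully sequential
print(sequential_fraction(interleave(users)))   # what the disk sees: mostly random
```

Every user is perfectly sequential, yet only a small fraction of the requests reaching the disk follow directly on the previous one.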
Determining I/O size
You can get a pretty good approximation of the typical I/O size of a workload if you can observe a running system. While the system is running the target workload, run the command iostat -x 5. The average I/O size is the sum of the "Kr/s" and "Kw/s" columns divided by the sum of the "r/s" and "w/s" columns. Be sure to average the data over a reasonably long period of time to avoid workload skew.
This is a good time to observe your actual I/O bandwidth demand, which can be surprisingly low. The system-wide disk I/O throughput is simply the sum of the "Kr/s" and "Kw/s" columns across all disks. Most users perceive the necessity for enormous bandwidth (on the order of megabytes per second), although this is rarely at issue. We'll discuss this more later.
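The arithmetic is simple enough to script. The following is a minimal sketch, assuming you have already collected the reads per second, writes per second, kilobytes read per second, and kilobytes written per second from several iostat intervals; the sample numbers are invented for illustration and the function names are not part of any tool.

```python
def average_io_size_kb(samples):
    """Average I/O size in kilobytes over a list of iostat intervals.

    samples: list of (reads_per_s, writes_per_s, kb_read_per_s, kb_written_per_s).
    Summing before dividing weights busy intervals more heavily, which
    is what averaging over a long period is meant to achieve.
    """
    total_ops = sum(r + w for r, w, _, _ in samples)
    total_kb = sum(kr + kw for _, _, kr, kw in samples)
    return total_kb / total_ops if total_ops else 0.0

def average_throughput_kb_s(samples):
    """Mean disk throughput in kilobytes per second over the intervals."""
    if not samples:
        return 0.0
    return sum(kr + kw for _, _, kr, kw in samples) / len(samples)

# Three hypothetical 5-second intervals.
samples = [
    (120, 40, 960, 320),
    (100, 60, 800, 480),
    (80, 80, 640, 640),
]

print(average_io_size_kb(samples))       # 8.0
print(average_throughput_kb_s(samples))  # 1280.0
```

In this made-up case the workload is dominated by 8-kilobyte I/Os and needs barely more than 1 MB per second.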
There are two basic strategies for getting the most performance out of RAID configurations. The most appropriate strategy for your environment depends primarily on what type of I/O dominates your workload. Because random access to a disk requires waiting for both seek and rotation, these physical motions dominate random access I/O time. Seek and rotation consume about 80 percent of disk service time, and in a random access environment these are essentially unavoidable. In particular, splitting a single, small I/O across more than one RAID member disk does little to improve performance because data transfer time represents a very small proportion of disk service time (often as little as 0.5 percent). So for random access environments, the optimization strategy boils down to avoiding a long wait in the queue behind other I/O.
The best way to use RAID techniques to optimize random access I/O is to use RAID to distribute I/O as evenly as possible across all member disks. For example, access to a database table can be distributed across many members by placing that table on a RAID-0 or RAID-5 volume. The RAID data organization spreads accesses evenly across the members, thereby minimizing the amount of time spent in the device queue. This is the same as when you go to the bank and find 10 people in line ahead of you. If only one teller window is open, you'll be waiting a long time; if five tellers are open, you won't wait nearly as long -- even though each teller helps customers at the same rate.
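The teller analogy can be put into numbers. This is a deliberately simple sketch, assuming every server (teller or disk) is idle at time zero, every request takes the same time, and the queued work is spread perfectly evenly; real disks are less tidy. The function name and the 3-minute and 10-millisecond service times are illustrative assumptions.

```python
def wait_before_service(ahead: int, servers: int, service_time: float) -> float:
    """Time a new arrival waits before its own service starts.

    ahead:        requests already queued in front of the new arrival
    servers:      tellers or member disks working in parallel
    service_time: time each request takes, identical for every server
    """
    if servers < 1:
        raise ValueError("need at least one server")
    # Servers work through the queue in rounds of `servers` requests.
    return (ahead // servers) * service_time

# Ten customers ahead of you, 3 minutes per customer.
print(wait_before_service(10, 1, 3.0))   # 30.0 minutes with one teller
print(wait_before_service(10, 5, 3.0))   # 6.0 minutes with five tellers

# Ten queued I/Os at 10 ms each, on one disk versus five member disks.
print(wait_before_service(10, 1, 10.0))  # 100.0 ms
print(wait_before_service(10, 5, 10.0))  # 20.0 ms
```

No individual disk got any faster; the improvement comes entirely from the shorter queue.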
Although often overlooked, RAID-1 mirroring also improves the common multithreaded read case. Because there are two (and occasionally even more) copies of the data, reads are usually dispatched to alternate submirrors, thus dividing the I/O queue evenly across all of the members. Writes, of course, are always issued to both members.
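A small model shows the effect on per-disk load. The sketch below assumes a plain round-robin read policy, which is only one possibility; real volume managers and controllers may choose a submirror by queue length or head position instead. The Mirror class is hypothetical and tracks counts only, not data.

```python
from itertools import cycle

class Mirror:
    """Counts the physical I/Os each submirror receives."""

    def __init__(self, submirrors: int = 2):
        self.reads = [0] * submirrors
        self.writes = [0] * submirrors
        self._next = cycle(range(submirrors))   # round-robin read dispatch

    def read(self) -> int:
        """Send one logical read to the next submirror in turn."""
        member = next(self._next)
        self.reads[member] += 1
        return member

    def write(self) -> None:
        """Send one logical write to every submirror."""
        for member in range(len(self.writes)):
            self.writes[member] += 1

mirror = Mirror(2)
for _ in range(100):
    mirror.read()
for _ in range(20):
    mirror.write()

print(mirror.reads)   # [50, 50]
print(mirror.writes)  # [20, 20]
```

Each disk services half of the reads but all of the writes, so the benefit grows with the read share of the workload.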
Note 2: Strictly speaking, RAID-3 is not suitable for random access I/O. For this reason, nearly all RAID-3 implementations are actually RAID-5 under [...]
The optimum RAID configuration for random access I/O uses enough member disks to keep the I/O queue short, combined with a large enough chunk size to ensure that typical I/O operations are serviced by only a single member. For the typical 2-kilobyte or 8-kilobyte I/O sizes, chunks larger than 16 kilobytes are suitable. This applies to both RAID-0 and RAID-5 (note 2).
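To check whether a chunk size keeps typical I/Os on a single member, you can count the chunks an I/O overlaps. This is a rough sketch assuming a simple RAID-0 style layout in which consecutive chunks go to consecutive members; it ignores parity placement, and the function name and sample sizes are illustrative.

```python
def members_touched(offset_kb: int, size_kb: int, chunk_kb: int, members: int) -> int:
    """Number of distinct member disks one I/O lands on in a striped layout."""
    if size_kb <= 0:
        return 0
    first_chunk = offset_kb // chunk_kb
    last_chunk = (offset_kb + size_kb - 1) // chunk_kb
    # Consecutive chunks sit on consecutive members, so the count of
    # chunks overlapped is the count of disks, capped at the volume width.
    return min(last_chunk - first_chunk + 1, members)

# 8-kilobyte I/Os aligned on 8-kilobyte boundaries, four-member volume.
offsets = range(0, 1024, 8)
print(max(members_touched(o, 8, 32, 4) for o in offsets))  # 1: one member per I/O
print(max(members_touched(o, 8, 4, 4) for o in offsets))   # 2: every I/O is split
```

With a 32-kilobyte chunk no aligned 8-kilobyte I/O is ever split, while a 4-kilobyte chunk forces two disks to seek for every request.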
Heavy I/O environments
Considerations affecting configuration for a heavy I/O load are the larger number of member disks and the performance during failure of a member disk. While mirroring requires more physical disks, those additional disks provide materially improved throughput and response time during normal operation.
There is also an even larger differential between RAID-1 and RAID-5 in degraded mode. Read performance of a degraded mirror is lower than normal mode because fewer members are available to satisfy I/O requests. At the same time, write performance actually improves -- logical writes have to be expanded to fewer members. Degraded RAID-5 volumes marginally improve for writes, but reads are drastically slower as well as less efficient. Reads to functioning member disks operate as normal, but reading data that was on the failed member requires reading every other member drive and subsequent parity recomputation. For a typical 4+1 RAID-5 volume, the read overhead requires four physical reads instead of one. In extreme cases, such as a 30-wide volume, the overhead for degraded mode is a factor of 29! Obviously this affects the missing members, but this also affects the present members, because it places much greater utilization on them, as well as substantially greater utilization on disk and array controllers.
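The cost of a degraded read follows directly from how parity works. The sketch below uses byte-wise XOR parity on a single stripe held in memory; the chunk contents and function names are made up for the example, and a real controller does this on disk blocks rather than Python byte strings.

```python
from functools import reduce

def parity(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def degraded_read(stripe, failed):
    """Rebuild the chunk that lived on the failed member.

    Every surviving member (data and parity) must be read, so the
    number of physical reads equals the number of survivors.
    Returns (rebuilt_chunk, physical_reads).
    """
    survivors = [chunk for i, chunk in enumerate(stripe) if i != failed]
    return parity(survivors), len(survivors)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # four data chunks
stripe = data + [parity(data)]                # a 4+1 RAID-5 stripe

rebuilt, reads = degraded_read(stripe, failed=2)
print(rebuilt, reads)                         # b'CCCC' 4
```

On a 30-wide stripe the same call would report 29 physical reads for one logical read.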
This can be particularly important with low-end array controllers, which may be easily capable of handling typical I/O load but may nonetheless be inadequate under normal load with degraded or reconstructing RAID-5. For example, a typical four- to eight-way UltraSPARC system with 250-MHz processors generates about 800-2000 I/Os per second under DBMS transaction processing workloads. This load is well within the capabilities of most disk array controllers, but with very wide RAID-5 logical units (LUNs) degraded overhead approaches 100 percent for the controller and 25 to 30 times for the volume. There are quite a few disk array controllers that can't handle 4000 I/Os per second without noticeable degradation. Fortunately, wide LUNs are not very common, because they often suffer from single points of failure. With more normal LUN configurations, such as 4+1, overall controller overhead is more typically 10 to 15 percent.
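One way to see why very wide volumes push degraded overhead toward 100 percent is to average the cost over all reads. The sketch below is an assumption-laden estimate, not a measurement: it supposes that reads are spread uniformly over every member of a single degraded RAID-5 volume, so one read in every N targets the missing disk and must be rebuilt from the other N-1. The function name is invented for the example.

```python
def degraded_read_multiplier(members: int) -> float:
    """Average physical reads per logical read on a degraded RAID-5 volume.

    members: total disks in the volume, including parity (5 for a 4+1).
    Assumes uniform access, so 1/members of reads hit the failed disk
    and each of those costs one read on every surviving member.
    """
    if members < 3:
        raise ValueError("RAID-5 needs at least three members")
    survivors = members - 1
    hit_failed = 1 / members
    return (1 - hit_failed) * 1 + hit_failed * survivors

print(round(degraded_read_multiplier(5), 2))    # 1.6 for a 4+1 volume
print(round(degraded_read_multiplier(30), 2))   # 1.93 for a 30-wide volume
```

Under these assumptions the read load on a very wide degraded volume nearly doubles. A controller that also serves several healthy volumes sees a proportionally smaller increase overall.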
Parity and data checking
A common myth is that RAID implementations perform integrity checks on every I/O. I'm continually surprised by the number of people who believe controllers will satisfy a read to a RAID-1 volume by reading both (or all) submirrors and comparing. Some people expect RAID-5 implementations to recompute parity on every read to ensure data integrity, but commercial implementations don't do this. The reason is performance: reading both submirrors would obviate the disk utilization advantages of additional physical members (because every I/O would then involve every member), in addition to requiring that the logical I/O be delayed waiting for both sub-reads and the comparison operation to complete. Recomputing parity on RAID-5 reads would be even worse: the controller would have to read every member of the volume for every I/O. I/O load wouldn't be doubled as in the mirrored case, but quintupled or worse! Fortunately, this effort isn't necessary because disk drives store extensive error correcting code (ECC) data on the drive media and use this to detect errors on read operations. (The ECC data accounts for most of the difference between the formatted and unformatted capacity of a disk.) The intra-disk ECC data eliminates the largest class of disk read errors, making RAID recomputation and comparison unnecessary.
One of the easiest configuration choices is provision of sufficient I/O bandwidth. Most applications use random access I/O with small or fairly small I/O sizes. Because a disk typically delivers only 200 to 800 kilobytes/sec under random access conditions, a single fast/wide SCSI bus can handle 25 to 100 disks, even allowing for 100 percent utilization on each. At the recommended 65 percent maximum utilization, this is 40 to 150 drives -- far more than can be physically configured.
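These drive counts can be reproduced with a few lines of arithmetic. The sketch below assumes the nominal 20 MB per second of a fast/wide SCSI bus, a figure the text does not state, and treats the utilization figure as a cap on how busy each disk is allowed to be. The names are illustrative.

```python
import math

FAST_WIDE_SCSI_KB_S = 20_000   # assumed nominal bus bandwidth, 20 MB/sec

def disks_per_bus(per_disk_kb_s: float,
                  disk_utilization: float = 1.0,
                  bus_kb_s: float = FAST_WIDE_SCSI_KB_S) -> int:
    """Whole disks whose combined random-access throughput fits on one bus.

    per_disk_kb_s:    what one disk delivers when 100 percent busy
    disk_utilization: how busy each disk is allowed to get (0 to 1)
    """
    return math.floor(bus_kb_s / (per_disk_kb_s * disk_utilization))

print(disks_per_bus(800), disks_per_bus(200))              # 25 100
print(disks_per_bus(800, 0.65), disks_per_bus(200, 0.65))  # 38 153
```

The second line rounds to the 40 to 150 drives quoted above. Either way the bus is not the limiting resource for random access workloads.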
These guidelines apply to most common applications. The exceptions are very large scientific (HPC) computing systems, most decision support systems, and a few data-intensive NFS servers. These applications are able to move data to and from disk in large blocks and with sequential access patterns.
HPC and data-intensive NFS machines customarily process long (often hundreds of megabytes) streams of sequential I/O, with little interference. Decision support systems operating on behalf of single users have the same characteristics. Most DSS systems handle multiple users, and while they usually operate on large blocks, they aren't typically sequential by the time the disk sees them.
Unlike the much more common workloads, these data-intensive applications require substantial I/O bandwidth, often consuming whatever the system is able to offer. For example, a highly parallelized scan of a large Oracle table was benchmarked at an I/O rate of 1.4 GB/sec on a 64-processor Starfire.
The most effective RAID configurations for these applications that consume enormous bandwidth are those that carefully spread I/O transfer time across many member disks. This is accomplished by aligning the typical I/O size with the data width of the RAID volume. For RAID-0 volumes (or RAID-1+0 volumes), the data width is the chunk size multiplied by the number of members.
Note 3: For example, using a 64-K block size with a multiblock read setting of 16.
For RAID-5 volumes, the data width is the chunk size multiplied by the number of members minus 1 (to account for parity storage). For example, an Oracle tablespace that will be scanned sequentially by an important query can be configured to be read with 1 MB blocks (note 3). A typical RAID-5 volume uses a 4+1 or 3+1 configuration to avoid single points of failure. To optimize throughput for this table with a 4+1 RAID-5 LUN, the data width should be 1 MB; because there are four data disks, the chunk size should be 256 kilobytes. On a RAID-1+0 volume, optimum throughput would be achieved with a 4+4 configuration, with each of the four disks striped with a chunk size of 256 kilobytes.
In the case of a data-intensive NFS or HPC system, you'll be reading and writing files through the file system, and this provides one more opportunity to configure the I/O size. By setting maxcontig to a larger size, say 32 blocks, file system clusters can be made larger and more I/O efficient. With 32 blocks in a single cluster, 256 kilobytes are read or written, so a 4+1 RAID-5 volume is optimal with a chunk size of 256 kilobytes/4 = 64 kilobytes.
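The chunk-size rule for sequential workloads reduces to one division. The sketch below is a small helper, with invented names, that derives the chunk size from the typical I/O size and the number of data-bearing members. For a RAID-1+0 volume, pass the number of disks in one side of the mirror as a RAID-0 stripe.

```python
def data_disks(members: int, level: str) -> int:
    """Members that hold data in each stripe."""
    if level == "raid0":
        return members
    if level == "raid5":
        return members - 1          # one chunk per stripe holds parity
    raise ValueError(f"unsupported RAID level: {level}")

def chunk_size_kb(io_size_kb: float, members: int, level: str) -> float:
    """Chunk size that makes one typical I/O span the full data width."""
    return io_size_kb / data_disks(members, level)

# 1 MB Oracle reads on a 4+1 RAID-5 LUN and on a 4+4 RAID-1+0 volume.
print(chunk_size_kb(1024, 5, "raid5"))    # 256.0
print(chunk_size_kb(1024, 4, "raid0"))    # 256.0

# File system clusters of 32 blocks at 8 kilobytes each, on a 4+1 RAID-5 LUN.
print(chunk_size_kb(32 * 8, 5, "raid5"))  # 64.0
```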
One of the keys to an optimal configuration is not making blanket decisions. Many administrators choose to uniformly adopt a particular RAID organization or disk configuration: for example, choosing RAID-5 over RAID-1, or small disks for performance over large disks. Either of these choices is optimal in some circumstances and distinctly suboptimal in others. What many people overlook is that different or even opposite conditions may arise in a single system.
Data sets within a system rarely have the same requirements for performance, availability, and cost; consequently, their storage should be handled accordingly. For example, a large DBMS system is critically dependent on its redo logs and archive segments. Both of these data sets are often heavily written; between the performance requirements and the critical nature of the data, both should nearly always be mirrored. By contrast, if the same system is used for transaction processing, its historical data is both relatively unimportant and seldom updated -- and thus not worthy of the high cost of mirroring. This is particularly true when historical data represents the overwhelming majority of the data.
In a file server environment, a file system that supplies application binaries is a function ideally suited to an inexpensive RAID-5 configuration. The RAID-5 write issues don't come into play, because the only writes to the file system would be to update the applications -- a rare occurrence. Furthermore, because the same files are handled over and over, there's a good chance that the important files will be cached by the file server anyway, eliminating most issues with degraded mode performance. On the same server, a file system that supplies operational file space for many clients is usually better suited to a mirrored configuration, especially if access is very intensive.
With this many details to consider, it's easy to get overwhelmed in the process of planning a big configuration. The most effective strategy is usually to consider the importance of each data set and choose an appropriate RAID level for each. Then simply spread everything across as many drives as is reasonable and be done with it. In the relatively rare circumstance that you have a bandwidth issue, it's worth spending somewhat more time to match individual I/O patterns to their storage, but even here it doesn't pay to spend a lot of time. Most often you'll leave the last two percent on the table, but you'll save a lot of head scratching and worrying in the meantime. I/O access is almost always distributed unevenly, and once the system is up and running (ideally in production), you can look for the one or two hot spots and rebalance those devices or volumes.
About the author
Brian L. Wong is Chief Scientist for the Enterprise Engineering group at Sun Microsystems. He has been a data processing consultant and software architect and has held other positions at Sun. Reach Brian at email@example.com.