The right disk configurations for servers
RAID? Journals? Stripes? Mirrors? SLEDs? There are many ways to configure disks for servers. Here's how to choose the right mix
To get high levels of disk performance, you need to combine multiple disks together into an array. There are many possible configurations and tuning variables. The best setup depends on your workload. This month I'll show you how to figure out your workload and guide you through the various options. (2,200 words)
I'm setting up a large server, and I'm not sure how to configure the
disks. The new server replaces several smaller ones so it will do a bit
of everything: NFS, a couple of large databases, home directories,
number crunching, intranet home pages. Can I just make one big
filesystem and throw the lot in, or do I need to set up lots of
different collections of disks?
--Clueless in Crivitz
A: There are several trade-offs, and no single right solution. The main factors you need to balance are the administrative complexity, resilience to disk failure, and performance requirements of each workload component.
There are two underlying factors to take into account: the filesystem type, and whether the access pattern is primarily random or sequential. I've shown this as a two-by-two table, with the typical workloads for each combination. To cover everything you mentioned, I would combine NFS, home directories, and intranet home pages into the random access filesystem category. Databases have less overhead, and less interaction with other workloads, if they use separate raw disks. Databases that don't use raw disks should be given their own filesystem setup, as should number crunching applications that read and write large files.
|Primary workload options|
|Storage type||Random access||Sequential access|
|Raw disk||Database indexes||Database table scans|
|Filesystem||Home directories||Number crunching|
Managing the trade-offs
We need to make a trade-off between performance and complexity. The best solution is to configure a small number of separate disk spaces, each optimized for a type of workload. You can get more performance by increasing the number of separate disk spaces, but with fewer spaces it is easier to share spare capacity and to add more disks in the future without going through a major reorganization.
Another trade-off is between cost, performance, and availability. You can be fast, cheap, or safe; pick one, balance two, but you can't have all three. The drawback of combining disks into a stripe is that the more disks you have, the more likely it is that one will fail and take out the whole stripe.
If each disk has a mean time between failures (MTBF) of 500,000 hours, then 100 disks have a combined MTBF of only 5,000 hours (about 30 weeks). If you have 1,000 disks you can expect a failure every three weeks on average. It is also worth noting that the very latest, fastest disks are much more likely to fail than disks that have been well debugged over several years, regardless of the MTBF quoted on the specification sheet.
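The arithmetic behind those figures can be sketched in a few lines. This is a minimal illustration that assumes identical disks failing independently at a constant rate; the function name is my own.

```python
HOURS_PER_WEEK = 24 * 7


def combined_mtbf_hours(disk_mtbf_hours: float, disk_count: int) -> float:
    """Expected time between failures across a pool of identical disks.

    Assumes independent failures at a constant rate, so the failure rates
    of the individual disks simply add up.
    """
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    return disk_mtbf_hours / disk_count


for count in (1, 100, 1000):
    hours = combined_mtbf_hours(500_000, count)
    weeks = hours / HOURS_PER_WEEK
    print(f"{count:>5} disks: {hours:>8.0f} hours, about {weeks:.1f} weeks")
```

The same function answers the planning question in reverse: divide the quoted MTBF by the number of disks you intend to buy, and decide whether you can live with a failure that often.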
There are two consequences of failure: one is loss of data, and the other is downtime. In some cases, data can be regenerated (e.g. database indexes, number crunching output) or restored from backup tapes. If you can afford the time it takes to restore the data, and it is unlikely to happen often, there is no need to provide a resilient disk subsystem. This lets you configure for the highest performance. If data integrity or high availability is important, there are two common approaches: mirroring and parity (typically RAID-5). Mirroring has the highest performance, especially for write-intensive workloads, but requires twice as many disks. Parity uses one extra disk in each stripe to hold the redundant information required to reconstruct data after a failure. Writes require read-modify-write operations, and there is the extra overhead of calculating the parity.
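To make the parity idea concrete, here is a minimal sketch using XOR parity, which is the usual RAID-5 scheme. It works on small in-memory byte strings rather than real disks, and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_after_small_write(old_data, old_parity, new_data):
    """Read-modify-write: new parity = old parity XOR old data XOR new data.

    The old data and old parity must be read before the new data and new
    parity can be written, which is the extra cost of a small write.
    """
    return xor_blocks([old_parity, old_data, new_data])


# One stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# If the disk holding data[1] fails, XOR the survivors with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

# Updating one block changes the parity without reading the other data disks.
new_block = b"ZZZZ"
parity = parity_after_small_write(data[0], parity, new_block)
data[0] = new_block
print(parity == xor_blocks(data))
```

Both checks print True. The reconstruction step also shows why a degraded array is slow: every read of the missing block touches all the surviving disks in the stripe.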
Sometimes the cost of a high-performance RAID-5 array controller exceeds the cost of the extra disks you would need to do simple mirroring. To get high performance, these controllers use non-volatile memory to perform write-behind safely, and to coalesce adjacent writes into single operations. Implementing RAID-5 without non-volatile memory will give you very poor write performance. The other problem with parity-based arrays is that when a disk has failed, extra work is needed to reconstruct the missing data, and performance is seriously degraded.
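A quick way to frame the mirroring-versus-controller decision is to count disks. The sketch below assumes identical disks and uses hypothetical function names; it deliberately leaves prices out, since those are yours to fill in.

```python
import math


def mirrored_disk_count(data_disks: int) -> int:
    """Mirroring stores every block twice."""
    return 2 * data_disks


def raid5_disk_count(data_disks: int, disks_per_array: int) -> int:
    """Each RAID-5 array gives up one disk's worth of space to parity."""
    if disks_per_array < 3:
        raise ValueError("a RAID-5 array needs at least 3 disks")
    arrays = math.ceil(data_disks / (disks_per_array - 1))
    return data_disks + arrays


data_disks = 27
mirrored = mirrored_disk_count(data_disks)
raid5 = raid5_disk_count(data_disks, disks_per_array=10)
print(f"mirrored: {mirrored} disks, RAID-5: {raid5} disks, "
      f"difference: {mirrored - raid5} disks")
```

If the RAID-5 controller with non-volatile memory costs more than the disks it saves (24 in this example), mirroring is the cheaper way to get protection.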
The home, NFS, and Web choice
In this particular case, I will assume that the filesystem dedicated to home directories, NFS, and Web server home pages is mostly read-intensive, and that there is some kind of array controller available (a SPARCstorage Array or one of the many third party RAID controllers) that has non-volatile memory configured and enabled. Note that SPARCstorage Arrays default to having it disabled. You need to use the
command to turn on fast writes. The non-volatile, fast writes greatly
speed up NFS response times for writes, and as long as high-throughput
applications do not saturate this filesystem, it is a good candidate
for a RAID-5 configuration. The extra resilience saves you from data
loss and keeps users happy without wasting disk space on mirroring.
The default UFS filesystem parameters should be tuned slightly, as there is no need to waste 10 percent on free space, and almost as much again on inodes. I would configure 1 or 2 percent free space (the default is 10 percent) and an 8-kilobyte average file size per inode (the default is 2 kilobytes), unless you are configuring a filesystem that is under 1 gigabyte in size.
# newfs -i 8192 -m 1 /dev/raw_big_disk_device
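The effect of those two newfs options can be estimated roughly as follows. This is a back-of-the-envelope sketch: the 128-byte on-disk inode size is my assumption, the two overheads are simply added together, and other metadata is ignored.

```python
INODE_BYTES = 128  # assumed on-disk size of one UFS inode


def ufs_overhead_fraction(bytes_per_inode, minfree_percent,
                          inode_bytes=INODE_BYTES):
    """Approximate fraction of the filesystem unavailable for user data.

    bytes_per_inode corresponds to newfs -i, minfree_percent to newfs -m.
    """
    inode_share = inode_bytes / bytes_per_inode
    free_reserve = minfree_percent / 100
    return inode_share + free_reserve


default = ufs_overhead_fraction(bytes_per_inode=2048, minfree_percent=10)
tuned = ufs_overhead_fraction(bytes_per_inode=8192, minfree_percent=1)
print(f"default settings: about {default:.1%} overhead")
print(f"tuned settings:   about {tuned:.1%} overhead")
```

Under these assumptions the tuned settings recover well over a tenth of the raw capacity, which is worth having on a filesystem built from dozens of disks.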
big_disk_device itself should be created by combining groups of
disks together into RAID-5 protected arrays, then concatenating the
arrays to make the final filesystem. If you need to extend its size in
the future, make up a new group of disks into a RAID-5 array and extend
the filesystem onto it. It is possible to grow a filesystem on-line if
necessary, so there is no need to rebuild the whole array and restore
it from backup tapes. Each RAID-5 array should contain between 5 and 30
disks. I've used a 25-disk, RAID-5 setup on a SPARCstorage
Array. For highest performance, keep it to the lower end of this range,
and concatenate a larger number of smaller arrays. We have found that a
128-kilobyte interlace is optimal for this largely random access workload.
Another issue to consider is the filesystem check required on
reboot. If the system shut down cleanly, fsck can tell it is safe to
skip the check. If it went down in a power outage or crash, it could
take tens of minutes to more than an hour to check a really huge
filesystem. The solution is to use a logging filesystem, where a
separate disk stores all the changes. On reboot, fsck just reads the
log in a few seconds and it is done.
With Solstice DiskSuite (SDS), this is set up using a "metatrans"
device and the normal UFS filesystem. In fact, an existing SDS-hosted
filesystem can have the metatrans log added without any
disruption to the data. With the Veritas Volume Manager, it is necessary
to use the Veritas filesystem, VxFS, as the logging hooks in UFS are
For good performance, the log should live on a dedicated disk. For resilience, the log should be mirrored. In extreme cases, the log disk might saturate and require striping over more than one disk. In low usage cases, the log can be situated in a small partition at the start of a data disk.
An example result is shown below. To extend the capacity, you would make up another array of disks and concatenate it. There is nothing to prevent you from making each array a different size, either. Unless the log disk maxes out, a mirrored pair of log disks should not need to be extended.
2 disks combined as mirrored filesystem log
|Log 1||Log 2|
|Array 1||Array 1||Array 1||Array 1||Array 1||Array 1||Array 1||Array 1||Array 1||Array 1|
|Array 2||Array 2||Array 2||Array 2||Array 2||Array 2||Array 2||Array 2||Array 2||Array 2|
|Array 3||Array 3||Array 3||Array 3||Array 3||Array 3||Array 3||Array 3||Array 3||Array 3|
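The reason a concatenation can grow without disturbing existing data is easy to see in a sketch. The class below is a hypothetical model, not how any particular volume manager is implemented, and the figure of 1,000 blocks per disk is an arbitrary illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def raid5_usable_blocks(disks: int, blocks_per_disk: int) -> int:
    """One disk's worth of each RAID-5 array is consumed by parity."""
    return (disks - 1) * blocks_per_disk


class Concatenation:
    """Arrays laid end to end; new arrays are only ever added at the end."""

    def __init__(self):
        self.sizes = []

    def add_array(self, usable_blocks: int) -> None:
        self.sizes.append(usable_blocks)

    def locate(self, logical_block: int):
        """Return (array index, block within that array)."""
        ends = list(accumulate(self.sizes))
        if not ends or not 0 <= logical_block < ends[-1]:
            raise IndexError("block is outside the concatenation")
        index = bisect_right(ends, logical_block)
        start = ends[index - 1] if index else 0
        return index, logical_block - start


volume = Concatenation()
for _ in range(3):  # three 10-disk arrays, as in the layout above
    volume.add_array(raid5_usable_blocks(10, blocks_per_disk=1000))

before = volume.locate(9000)
volume.add_array(raid5_usable_blocks(6, blocks_per_disk=1000))  # a smaller one
print(before, volume.locate(9000), volume.locate(27000))
```

Block 9000 maps to the same place before and after the fourth array is added, and the new array can be a different size from the others.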
The high performance number crunching choice
High throughput applications such as number crunching that want to do large, high-speed, sequential reads and writes should use a completely separate collection of disks on their own controllers. In most cases, data can be regenerated if there is a disk failure, or occasional snapshots of the data could be compressed and archived into the home directory space. The key thing is to off-load frequent, I/O-intensive activity from the home directories into this high performance "scratch-pad" area.
Configure as many fast-wide (20 megabytes per second) or Ultra-SCSI (40 megabytes per second) disk controllers as you can. Each disk should be able to stream sequential data at between 3 and 5 megabytes per second, so don't put too many on each bus. Non-volatile memory in array controllers may help in some cases, but it may also get in the way. The cost may also tempt you to use too few controllers. A large number of SBus SCSI interfaces is a better investment for this particular workload.
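A rough sizing sketch for this, using the bus and disk rates quoted above. It assumes the bus delivers its full rated bandwidth, which real buses do not quite manage, so treat the results as upper bounds. The 100 megabytes per second target is just an example figure.

```python
import math

BUS_MB_PER_S = {"fast-wide SCSI": 20, "Ultra-SCSI": 40}


def max_streaming_disks(bus_mb_per_s: int, disk_mb_per_s: int) -> int:
    """Disks that can stream at full speed without saturating the bus."""
    return bus_mb_per_s // disk_mb_per_s


def controllers_needed(target_mb_per_s: int, bus_mb_per_s: int) -> int:
    """Controllers required to carry a target sequential throughput."""
    return math.ceil(target_mb_per_s / bus_mb_per_s)


for name, rate in BUS_MB_PER_S.items():
    fast = max_streaming_disks(rate, 5)
    slow = max_streaming_disks(rate, 3)
    needed = controllers_needed(100, rate)
    print(f"{name}: {fast} to {slow} disks per bus, "
          f"{needed} controllers for 100 MB/s")
```

More disks than this on one bus still add capacity, but they cannot all stream at once.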
If you need to run at sustained rates of more than 20 to 30 megabytes per second of sequential activity on a single file, you will run into problems with the default UFS filesystem. The UFS indirect block structure and data layout strategy work well for general purpose accesses such as the home directories, but cause too many random seeks for high-speed sequential performance. The Veritas VxFS filesystem is an extent-based structure, which avoids the indirect block problem. It also allows individual files to be designated as "direct" for raw unbuffered access. This bypasses the problems caused by UFS trying to cache all files in RAM, which is inappropriate for large sequential access files and stresses the pager. It is possible to get 100 megabytes per second or more with a carefully set up VxFS configuration.
A log-based filesystem may slow down high-speed sequential operations by limiting you to the throughput of the log. It should only log synchronous updates, such as directory changes and file creation/deletion, so see how it goes with and without a log for your own workload. If it doesn't get in the way of the performance you need, use a log to keep reboot times down.
Database workloads are very different again. Reads may be done in small random blocks (when looking up indexes), or large sequential blocks (when doing a full table scan). Writes are normally synchronous for safe commits of new data. On a mixed workload system, running databases through the filesystem can cause virtual memory "churning" due to the high levels of paging and scanning associated with filesystem I/O. This can affect other applications adversely, so where possible it is best to use raw disks or direct unbuffered I/O to a filesystem that supports it (such as VxFS).
Both Oracle and Sybase default to a 2-kilobyte block size. A small block size keeps the disk service time low for random lookups of indexes and small amounts of data. When a full table scan occurs, the database may read multiple blocks in one operation, causing larger I/O sizes and sequential patterns.
Databases have two characteristics that are greatly assisted by an array controller that contains non-volatile RAM. One is that a large proportion of the writes are synchronous, and are on the critical path for user response times. The service time for a 2-kilobyte write is often reduced from between 10 and 15 milliseconds to between 1 and 2 milliseconds. The other is that synchronous sequential writes often occur as a stream of small blocks, typically of only 2 kilobytes at a time. The array controller can coalesce multiple adjacent writes into a smaller number of much larger operations, which can be written to disk far faster. Throughput can increase by as much as three to four times on a per-disk basis.
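The coalescing step can be sketched as a simple interval merge. This is an illustration of the idea only, with writes modeled as hypothetical (offset, length) pairs; it says nothing about how any real controller's firmware works.

```python
def coalesce_writes(writes):
    """Merge (offset, length) writes that touch or overlap into larger ones."""
    merged = []
    for offset, length in sorted(writes):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged


BLOCK = 2 * 1024  # the 2-kilobyte database block size

# Eight sequential 2-kilobyte writes, plus one unrelated write elsewhere.
pending = [(i * BLOCK, BLOCK) for i in range(8)] + [(1024 * 1024, BLOCK)]
operations = coalesce_writes(pending)
print(f"{len(pending)} writes become {len(operations)}: {operations}")
```

Because the pending writes are held in non-volatile memory, the controller can acknowledge each one immediately and still be safe to delay and merge them before they reach the disk.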
Data integrity is important, but some sections of a database can be regenerated after a failure. You can trade off performance against availability by making temporary tablespaces and perhaps indexes out of wide unprotected stripes of disks. Tables that are largely read-only or not on the critical performance path can be assigned to RAID-5 stripes. Safe, high-performance writes should be handled with mirrored stripes.
The same basic techniques described in previous sections can be used to configure the arrays. Use concatenations of stripes, either unprotected, RAID-5, or mirrored as appropriate, with a 128-kilobyte interlace.
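For reference, this is what the interlace means for a plain unprotected stripe: consecutive 128-kilobyte units go to consecutive disks. The sketch is a simplified model with a hypothetical function name. RAID-5 additionally rotates a parity unit through each row, and a mirrored stripe writes the same unit to both halves of the mirror.

```python
INTERLACE = 128 * 1024  # 128-kilobyte interlace, in bytes


def locate_in_stripe(offset: int, disks: int, interlace: int = INTERLACE):
    """Map a byte offset in the stripe to (disk index, byte offset on disk)."""
    unit, within = divmod(offset, interlace)
    row, disk = divmod(unit, disks)
    return disk, row * interlace + within


# A 2-kilobyte random read lands on one disk; a 1-megabyte sequential
# read on a 4-disk stripe is spread across all of them.
print(locate_in_stripe(5 * INTERLACE + 100, disks=4))
touched = {locate_in_stripe(off, disks=4)[0]
           for off in range(0, 1024 * 1024, INTERLACE)}
print(sorted(touched))
```

With an interlace this much larger than the 2-kilobyte database block, a small random read almost always involves a single disk, which leaves the others free to serve other requests.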
I have scratched the surface of a large and possibly contentious subject here. I hope this gives you the basis for a solution.
The important thing is to divide the problem into subproblems by separating the workloads according to their performance characteristics. Balance your solution on appropriate measures of performance, cost, and availability.
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at email@example.com.