Looking for Mr. Good Block

A little information gathering and distillation can go a long way toward optimizing I/O resources.

By Hal Stern

Performance optimization is an enjoyable pursuit when you have the budget to spend on incremental memory, disk, and network resources. Last month, we looked at disk utilization and I/O-load distribution, with the implicit assumption that you could spend money to generate results. Now, let's impose constraints on the problem: You live with filesystems, not databases. Your manager's login is "cutcosts."

Demonstrable improvements in today's environment are a precursor to even seeing the budget for next year. You need a pile of stats, a plan, and some wizard's magic.

Before resigning yourself to earning a degree in magnetic media alchemy, learn how your users are using or abusing the filesystems. Armed with a history of access patterns and a basic knowledge of filesystem mechanics, you can optimize your resource usage to provide a better environment. This month, we'll start with a distillation of disk and filesystem basics. From there, it's into information gathering mode. You'll need to know what kind of numbers to collect, so you can state your problems, propose fixes, and maybe even impress the bean counters along the way. We'll wrap up with some approaches for improving filesystem performance in place, without additional disks.

On the right track
Before setting performance goals, it's crucial to understand the speed limits of your disk system. SCSI disks peak at about 4 1/2 to 5 megabytes per second for sequential transfers, and have average seek times of 10 to 12 milliseconds. See the sidebar "SCSI basics: theme and variations" for a quick and dirty overview of the SCSI bus. Average seek time is something of a misnomer, because seeks fall into two broad categories: track-to-track and long operations. Long seeks take slightly more than the average service time and comprise the bulk of the requests handled. If you had to seek in between each SCSI-disk operation, you wouldn't come close to seeing the maximum transfer rate. Moving the disk head between consecutive tracks takes far less time than the average. The quick track-to-track seek is not particular to SCSI disks and has been a determinant of sequential throughput rates for most disks.
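A rough calculation shows how far a seek before every operation pulls you from the peak rate. The sketch below is a simplified model, not a measurement: it assumes 8-kilobyte transfers, counts only seek time plus data-transfer time, and ignores rotational delay and command overhead. The function name effective_kbps and the 1-millisecond track-to-track figure are illustrative assumptions; the 12-millisecond seek and 5-megabyte-per-second peak come from the figures above.

```shell
#!/bin/sh
# effective_kbps SEEK_MS XFER_KB PEAK_MB_PER_SEC
# Prints the throughput in KB/s when every transfer is preceded by one seek.
# Simplified model (an assumption): time per operation = seek + transfer.
effective_kbps() {
    awk -v seek_ms="$1" -v xfer_kb="$2" -v peak_mb="$3" 'BEGIN {
        xfer_ms = xfer_kb / (peak_mb * 1024) * 1000    # time to move the data
        printf "%d\n", xfer_kb / ((seek_ms + xfer_ms) / 1000)
    }'
}

effective_kbps 12 8 5    # a long seek before each 8 KB transfer
effective_kbps 1 8 5     # hypothetical 1 ms track-to-track seek
```

With a 12-millisecond seek in front of every block, the model lands well under 1 megabyte per second, a small fraction of the 5-megabyte-per-second peak.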

Variations in seek times color the design and implementation of the Berkeley Fast Filesystem (FFS), used in some form by most Unix vendors today. The heart of the FFS organization is the inode, a disk structure that contains information about the file and the addresses of its first few datablocks. The original Unix filesystem had all of the inodes at the beginning of the disk, and the datablocks placed through the remainder of the platters. An unfortunate side effect was that common sequences, like grabbing a file's permissions from the inode, followed by reading the first datablock, caused a long seek operation. The FFS improved inode allocation by scattering inodes throughout the disk in cylinder groups. Keeping the inodes near the associated datablocks cuts the average seek time for filesystem access. In addition to distributing inodes, the FFS places datablocks to minimize rotational delays and to keep parts of the same file close to each other.

Indirect consequences
Inodes contain the permissions, owner, and access/modification times for the file as well as the disk addresses of the first 12 datablocks. Filesystems with an 8-kilobyte block size therefore address about the first 100 kilobytes within the inode. Larger files rely on indirect blocks. The inode contains the first few indirect-block pointers, and the indirect blocks point to the datablocks. Each indirect block (in an 8-kilobyte filesystem) addresses 2,048 datablocks, or about 16 megabytes per indirect block. To go above 50 megabytes, you need to hit a double-indirect block, and there are mythical triple-indirect blocks to handle files the size of Texas.
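The arithmetic behind these figures can be checked with a few lines of shell. This is a sketch only: the 4-byte disk address is an assumption (it is what makes an 8-kilobyte indirect block hold 2,048 pointers), and the variable names are illustrative.

```shell
#!/bin/sh
# Assumed parameters: 8 KB blocks and 12 direct pointers (both from the text),
# plus 4-byte disk addresses (an assumption consistent with 2,048 pointers).
block_size=8192
direct_ptrs=12
addr_size=4

ptrs_per_block=$((block_size / addr_size))               # pointers in one indirect block
direct_kb=$((direct_ptrs * block_size / 1024))           # reach of the direct pointers
indirect_mb=$((ptrs_per_block * block_size / 1048576))   # reach of one indirect block

echo "pointers per indirect block: $ptrs_per_block"
echo "direct blocks cover:         $direct_kb KB"
echo "one indirect block covers:   $indirect_mb MB"
```

The direct pointers cover 96 kilobytes, which is the "about 100 kilobytes" quoted above.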

Splitting the disk-block addresses into direct and indirect blocks effects a trade-off between time and space. Inodes retain a fixed size and are kept small and manageable, but large files require multiple disk accesses to locate their datablocks. Writing to the tail end of a 100-megabyte file may require four disk writes: 1. Update the inode with a new modification time. 2. Add a new indirect-block pointer to the double-indirect block. 3. Update the indirect block with the address of the new datablock. 4. Write the file's data. The last write occurs asynchronously, when the update (sync) daemon flushes out dirty filesystem pages.

What about fragmentation? In the DOS world, fragmentation is a very real issue and can impair disk performance by scattering disk accesses over an already busy surface. In theory, the Unix filesystem has little fragmentation, since fragments live only in the last block of the file. But theory and practice diverge when a filesystem is at 98 percent utilization and the users show no signs of saving fewer mail messages. The optimal block-placement policy starts to break down and you end up incurring more rotational and seek latencies just to get those hard-to-reach datablocks. There are some third-party tools, such as Eagle Software Inc.'s Disk_Pak (see "Disk_Pak defragments disks," November 1993), that will undo disk fragmentation in place, although historically this has involved dumping the filesystem to tape and reloading it to get the best placement as each file is recreated.

To improve Unix filesystem performance, organize your data to maximize the number of sequential transfers and minimize the number of long seeks. Knowing something about how the Unix filesystem is laid out is essential to interpreting statistics about filesystem usage. Data on how and what files are accessed is even more important, because it is the input for a performance improvement plan.

Statistical multiplexing
One shortcoming of the Unix world is that its tools give you a gross, high-level picture of the system but don't associate particular work flows with system activity. Pointing out that your disks are 95 percent full is about as useful as showing your friends a closet full of suits. If none fit or all are out of style, the closet is just wasted space. It looks impressive from the high-level view, but it is infrequently accessed. If you want to build a profile of your typical filesystem user, look for lower-level statistics that tell you about access patterns:

How much dust collects on your disks? What is the average time between accesses to your files? What is the utility of a hierarchical storage system? If your data is infrequently accessed but occupies a large magnetic-disk plaza, consider trading access time for capacity and putting some of it on optical storage. To look for files that haven't been accessed in 180 days, use find:
# find /home -type f -atime +180 -print > /usr/tmp/180d-log
The access times on files aren't changed by this command, so you can run it repeatedly to look for those forlorn bytes that are there "just in case." (Beware, however, that this method may not work if your backup software modifies the access times of files.)
Repeat the process for the modification time to judge the rate of data change. Add up the total size of files modified daily in order to do some preliminary sizing of an incremental backup system. Gauge average age by sorting files into buckets for daily change, changed within a week, and changed within the month.
How much data does each user own, on average, and how much is accessed in a day? In addition to telling you about the incremental load posed by a new user, this information gives you a feel for how many files are used about as infrequently as the brown polyester suits buried in the back of your closet.
What are your access patterns? Determining the access pattern is not easy because it requires watching actual processes at work, or sifting through NFS request traces. Use trace or truss to watch the read and write patterns of commonly run applications, and use nfswatch (available via anonymous ftp from gatekeeper.dec.com in /.3/net/ip/nfs) or a similar tool to generate NFS request histories. If you see increasing offsets into a file, you're looking at sequential accesses. Be sure to consider the average size of a file when classifying activity as random or sequential. A compiler accesses source files sequentially, but if those source files are each 10 kilobytes and are scattered all over the filesystem, the higher-order access pattern appears random. Most NFS servers experience random file accesses, even if the individual files are read and written sequentially.
What are the "favorite files" on the system? Shared executables like those in /usr/local are popular, and some common header files or libraries may be accessed frequently in a development environment. Some of this data can be obtained from application traces, looking for calls to open(). In the NFS world, you need to watch the read and write requests because there is no open() function over the wire. The publicly available ofiles tool lets you take snapshots of the system's open file table and see what processes have a file in use at any time, if you have concerns about specific directories.
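The modification-time buckets described above can be gathered with find and awk. This is a sketch rather than a finished tool: the function names summarize and age_report are made up for illustration, the buckets are cumulative, and -exec ... {} + assumes a POSIX find (on older systems, pipe the output of find to xargs instead).

```shell
#!/bin/sh
# summarize DIR FIND-TEST...
# Prints "<file count> <total KB>" for the regular files under DIR that
# match the given find(1) test. Sizes come from field 5 of "ls -ldn".
summarize() {
    dir=$1; shift
    find "$dir" -type f "$@" -exec ls -ldn {} + |
        awk '{ n++; bytes += $5 } END { printf "%d %d\n", n, bytes / 1024 }'
}

# age_report DIR
# Cumulative buckets by modification time: day, week, month, and older.
age_report() {
    echo "changed in last day:   $(summarize "$1" -mtime -1)"
    echo "changed in last week:  $(summarize "$1" -mtime -7)"
    echo "changed in last month: $(summarize "$1" -mtime -30)"
    echo "older than a month:    $(summarize "$1" -mtime +29)"
}

# Example: age_report /home
```

The first line of the report is a starting point for sizing daily incremental backups. Substituting -atime for -mtime in age_report profiles access times instead, subject to the same caveat about backup software that resets them.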

You need a bag of tricks to squeeze more performance from your filesystems. Some of the magic comes from the vendor (see the sidebar "Turbocharging filesystems"), but most configuration changes will be tailored for your environment.

Avoiding the circular file
Realistically, what can you do to improve filesystem performance? Most of the techniques involve better resource allocation or partitioning; those that involve fine-grain movement of files may not yield large results.

Optimize different disk volumes for different types of work. If you have some files that are accessed sequentially, put them on a striped, sequential access volume with a higher transfer rate. Those filesystems that are accessed in random fashion can be striped with a different interleave, as discussed last month (see "When enough is not enough," April 1994).
Cut down on synchronous writes. Use file locking to ensure consistency, or use a disk write accelerator like Sun's Prestoserve to minimize the number of physical-disk I/O operations.
Move users' favorite files close together so you cut down on seeks. This is not pretty. It involves building a preference list, and then pumping it into restore to extract from a dump tape. If you want to do this on a system with the old and new disks attached, you can use tar to build the archive in the desired order and then extract it onto the new disk. Here's a typical tar script that uses a popularity list to gather the files together in preference order, and install them on a new, mounted filesystem:

# (cd /home; tar cf - -I /usr/tmp/filelist) | (cd /newhome; tar xvf -)

Avoid large files, if possible. It's hard to segregate a 30-megabyte CAD parts file, but if you are writing out logs, it's reasonable to truncate the log after 1 megabyte and open a new file. Also keep your directories small and fairly shallow. Name resolution will go faster, since you don't spend as much time (on average) reading directory entries. Applications that open or access many small files benefit the most. To handle a large number of files in a few directories, consider using a mapping function, such as a hash table, to avoid many levels of directories or long file names. It's always refreshing to use those computer-science fundamentals for something practical.
Eschew symbolic links. They can introduce extra disk I/O to read the link, although System V.4 and other versions of Unix cache the link in the inode. They can be confusing to users and cause additional NFS lookup requests for each pathname resolution. Use them when you need to preserve a path that is hard-coded into an application; do not use them to create convenient shorthand notations.
Replicate read-only data onto several filesystems. Spread the load over several servers, or different disks on the same machine. Just as striping improves sequential-I/O performance by aggregating the transfer rates of several disks, replication adds to the total transfer rate available to a set of files.
Don't let your disks overflow. This seems obvious, but full disks are the sources of numerous problems only resolved through tortuous discovery. NFS write errors, applications that can't print because there's no room for temporary files, and "denial of service" problems all result from full filesystems.
Use compression wisely and frequently. Watch for files that aren't accessed in a few months or are larger than your threshold for reasonable use. Do you think users really review mail in a 10-megabyte folder? Compression eases space crunches, makes room for replication, and frees more of the preferred parking spaces for new block allocation.
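Finding compression candidates is another job for find. The sketch below only lists files and leaves the decision to you. The function name compress_candidates and the thresholds in the example are illustrative assumptions, not recommendations. It relies on find counting -size in 512-byte blocks when no unit is given.

```shell
#!/bin/sh
# compress_candidates DIR DAYS MIN_KB
# Lists regular files under DIR that have not been accessed in more than
# DAYS days and are larger than MIN_KB kilobytes, skipping files that
# already carry a .Z or .gz suffix.
compress_candidates() {
    find "$1" -type f -atime +"$2" -size +"$(( $3 * 2 ))" \
        ! -name '*.Z' ! -name '*.gz' -print
}

# Example: files over 1 MB untouched for 90 days
#   compress_candidates /home 90 1024
```

Review the list before compressing anything, and remember that backup software that resets access times will skew the results.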

Here's the test: Could you handle the incremental load of 10 new users, without existing users complaining of slowdowns or disk-space shortages? Transparent growth of the environment maximizes everyone's happiness; quiet users give you more time to run Xmosaic. Collecting and interpreting filesystem statistics does more than give you experience using Lotus or Excel to make pretty charts. Those multicolored bars are the justifications for additional purchases, and the key to proactively managing system performance.

About the author
Hal Stern, an area technology manager for Sun, can be reached at hal.stern@sunworld.com.


SCSI basics: theme and variations

Vendor literature is full of claims about blazing SCSI-bus speed, touting fast, wide, and differential SCSI options as the keys to excellent performance. What's so great about SCSI buses that run at 10 to 20 megabytes per second when disks are cruising along at 1/2 to 1/4 that speed? Most of the differences lie in configuration guidelines and the SCSI bus utilization during sequential transfers:

SCSI-2 defines a bus protocol and a Common Command Set (CCS) that is essentially an API for talking to SCSI devices. Most host adaptors -- those cards you plug into your system -- implement the bus protocol and the CCS in a combination of SCSI processor hardware and SCSI-target device drivers.
"Narrow" SCSI buses use 8-bit data paths and can address eight devices: the host adapter and seven disk drives, tape drives, or other targets. "Wide" SCSI refers to 16-bit data paths, although some vendors use wide SCSI to mean 32 bits. 16-bit wide SCSI buses handle up to 15 targets. Obviously, the disk or tape devices and the host adapter must agree on the width of the data bus.
"Fast" SCSI uses 10-megabytes-per-second transfers on a narrow bus and 20-megabytes-per-second transfers on a wide bus. Running a SCSI bus at this higher rate requires good cabling, high noise immunity, and well-behaved devices.
"Differential" buses use two lines for each data bit while "single-ended" SCSI uses only a single wire. Single-ended buses can run a maximum of 6 meters while differential cabling can extend 25 meters. A differential bus has better noise elimination because it sends the signal and its inverse for each data bit. (Think of watching a friend holding a lantern high or low in a simple signaling experiment. If your friend stands a few feet away, it's easy to distinguish the two states. Put your friend on a boat at night, inject noise in the form of waves, and you'll have trouble identifying the lantern positions. Have your friend hold two lanterns at the same height for 0, and far apart for 1, and you can easily determine the signal value. Voilà. You've created a differential bus.)

How do fast and wide buses help if random I/O is bound by disk speeds? First, the faster SCSI bus makes the complete SCSI transaction a bit faster, because it reduces the time spent copying data from the disk's buffer back to the host. Reducing bus bandwidth utilization allows you to put more disks on a bus and thus to increase disk connectivity to the system. SCSI-bus arbitration and command setup consume some bus cycles, but SCSI-bus bandwidth shouldn't be a constraining factor for random I/O. Bus speed is more of an issue for sequential transfers because you can run several disks in parallel across a 20-megabyte-per-second bus, but that same workload would saturate a 10-megabyte-per-second interface.
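The sequential case comes down to simple division. As a rough sketch, ignoring arbitration and command overhead, the function below (disks_per_bus, a name invented for illustration) estimates how many disks can stream at once before the bus saturates, using the 4 1/2-megabyte-per-second disk rate quoted in the main article.

```shell
#!/bin/sh
# disks_per_bus BUS_MB_PER_SEC DISK_MB_PER_SEC
# Whole number of disks that can stream sequentially before the bus is full.
# Ignores arbitration and command overhead (a simplifying assumption).
disks_per_bus() {
    awk -v bus="$1" -v disk="$2" 'BEGIN { printf "%d\n", bus / disk }'
}

disks_per_bus 10 4.5    # fast, narrow bus
disks_per_bus 20 4.5    # fast, wide bus
```

Under these assumptions the narrow bus fills up with two streaming disks, while the wide bus has room for four.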

Turbocharging filesystems

Redundant information like cylinder groups and inode data ensures the Unix filesystem can be made consistent even after a system crash in which some writes did not complete. The workhorse that examines every file and rebuilds the filesystem is fsck. The longest event in any large server boot is the filesystem preening done by fsck, which can run into the 10-minute-or-more time frame. Compare this scenario to that of restarting a database. The DBMS maintains a transaction log, which is reviewed to determine if transactions can be rolled forward or need to be rolled back to keep the database consistent. A DBMS may be back up in under a minute because it only has to replay the log, not examine every row in every table.

Unix logging filesystems marry the database log and the FFS data to optimize filesystem recovery time. A logging, or journaling, filesystem maintains a log of writes and updates, so fsck can be replaced with a recovery process that rolls the log. Logging filesystems also speed synchronous writes by coalescing multiple disk block writes into a single log-journal entry. Logging filesystems are available as part of the DCE distributed filesystem package, IBM's AIX, the widely licensed Veritas software, and Windows NT.

The true bastion of all data management -- the mainframe -- also contributes to Unix filesystem performance. A group of engineers at Sun applied extents (fixed-length, pre-allocated sections of the disk) to the FFS, providing up to a doubling of sequential throughput. The extent size is matched to the number of disk blocks cached by the SCSI controller, giving Sun's UFS (called UFS+) the ability to handle 56-kilobyte chunks of a file at a time.

URL: http://www.sunworld.com/asm-05-1994/asm-05-sysadmin.html.
Last updated: 1 May 1994.