Originally published in the May 1994 issue of Advanced Systems.

SysAdmin

Looking for Mr. Good Block

A little information gathering and distillation can go a long way toward optimizing I/O resources.

By Hal Stern

Performance optimization is an enjoyable pursuit when you have the budget to spend on incremental memory, disk, and network resources. Last month, we looked at disk utilization and I/O-load distribution, with the implicit assumption that you could spend money to generate results. Now, let's impose constraints on the problem: You live with filesystems, not databases. Your manager's login is "cutcosts."

Demonstrable improvements in today's environment are a prerequisite to even seeing the budget for next year. You need a pile of stats, a plan, and some wizard's magic.

Before resigning yourself to earning a degree in magnetic media alchemy, learn how your users are using or abusing the filesystems. Armed with a history of access patterns and a basic knowledge of filesystem mechanics, you can optimize your resource usage to provide a better environment. This month, we'll start with a distillation of disk and filesystem basics. From there, it's into information gathering mode. You'll need to know what kind of numbers to collect, so you can state your problems, propose fixes, and maybe even impress the bean counters along the way. We'll wrap up with some approaches for improving filesystem performance in place, without additional disks.

On the right track
Before setting performance goals, it's crucial to understand the speed limits of your disk system. SCSI disks peak at about 4 1/2 to 5 megabytes per second for sequential transfers, and have average seek times of 10 to 12 milliseconds. See the sidebar "SCSI basics: theme and variations" for a quick and dirty overview of the SCSI bus. Average seek time is something of a misnomer, because seeks fall into two broad categories: track to track and long operations. Long seeks take slightly more than the average service time and comprise the bulk of the requests handled. If you had to seek in between each SCSI-disk operation, you wouldn't come close to seeing the maximum transfer rate. Moving the disk head between consecutive tracks takes far less time than the average. The quick track-to-track seek is not particular to SCSI disks and has been a determinant of sequential throughput rates for most disks.
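To see how far a seek before every operation pulls you from the peak transfer rate, here is a back-of-the-envelope sketch. The 8-kilobyte request size is an assumption chosen for illustration, rotational latency is ignored, and the function name is hypothetical; the 11-millisecond seek and the 4.5-megabyte-per-second rate are taken from the ranges quoted above.

```python
def effective_throughput(request_bytes, seek_ms, transfer_bytes_per_sec):
    """Bytes per second delivered when every request pays one seek first."""
    transfer_sec = request_bytes / transfer_bytes_per_sec
    total_sec = seek_ms / 1000.0 + transfer_sec
    return request_bytes / total_sec


# Assumed workload: 8 KB requests, each preceded by an average 11 ms seek.
PEAK = 4.5e6  # bytes/sec, the low end of the sequential rate quoted above
rate = effective_throughput(8 * 1024, 11.0, PEAK)
print(f"{rate / 1e6:.2f} MB/s, {100 * rate / PEAK:.0f}% of peak")
```

Under these assumptions the disk delivers well under a megabyte per second, which is why the ratio of long seeks to sequential transfers matters so much.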

Variations in seek times color the design and implementation of the Berkeley Fast Filesystem (FFS), used in some form by most Unix vendors today. The heart of the FFS organization is the inode, a disk structure that contains information about the file and the addresses of its first few datablocks. The original Unix filesystem had all of the inodes at the beginning of the disk, and the datablocks placed through the remainder of the platters. An unfortunate side effect was that common sequences, like grabbing a file's permissions from the inode, followed by reading the first datablock, caused a long seek operation. The FFS improved inode allocation by scattering inodes throughout the disk in cylinder groups. Keeping the inodes near the associated datablocks cuts the average seek time for filesystem access. In addition to distributing inodes, the FFS places datablocks to minimize rotational delays and to keep parts of the same file close to each other.

Indirect consequences
Inodes contain the permissions, owner, and access/modification times for the file as well as the disk addresses of the first 12 datablocks. Filesystems with an 8-kilobyte block size therefore address about the first 100 kilobytes within the inode. Larger files rely on indirect blocks. The inode contains the first few indirect-block pointers, and the indirect blocks point to the datablocks. Each indirect block (in an 8-kilobyte filesystem) addresses 2,048 datablocks, or about 16 megabytes per indirect block. To go above 50 megabytes, you need to hit a double-indirect block, and there are mythical triple-indirect blocks to handle files the size of Texas.
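The addressing arithmetic can be checked with a short sketch. The 4-byte disk address is an assumption (it is what makes an 8-kilobyte indirect block hold 2,048 pointers), and the function names are hypothetical.

```python
BLOCK_SIZE = 8 * 1024   # 8-kilobyte filesystem block, as in the text
DIRECT_BLOCKS = 12      # direct datablock addresses held in the inode
ADDRESS_SIZE = 4        # assumed: each disk address occupies 4 bytes


def pointers_per_block(block_size=BLOCK_SIZE, address_size=ADDRESS_SIZE):
    """Number of datablock addresses that fit in one indirect block."""
    return block_size // address_size


def direct_bytes(block_size=BLOCK_SIZE):
    """Bytes reachable through the inode's direct addresses alone."""
    return DIRECT_BLOCKS * block_size


def single_indirect_bytes(block_size=BLOCK_SIZE):
    """Bytes reachable through one indirect block."""
    return pointers_per_block(block_size) * block_size


def double_indirect_bytes(block_size=BLOCK_SIZE):
    """Bytes reachable through one double-indirect block."""
    return pointers_per_block(block_size) ** 2 * block_size


print(direct_bytes() // 1024, "KB via direct blocks")
print(pointers_per_block(), "pointers per indirect block")
print(single_indirect_bytes() // 2**20, "MB per indirect block")
print(double_indirect_bytes() // 2**30, "GB per double-indirect block")
```

The jump from megabytes to gigabytes at the double-indirect level is the reason triple-indirect blocks rarely come into play.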

Splitting the disk-block addresses into direct and indirect blocks effects a trade-off between time and space. Inodes retain a fixed size and are kept small and manageable, but large files require multiple disk accesses to locate their datablocks. Writing to the tail end of a 100-megabyte file may require four disk writes:

1. Update the inode with a new modification time.
2. Add a new double indirect-block pointer to the indirect block.
3. Update the double-indirect block.
4. Write the file's data.

The last write occurs asynchronously, when the update (sync) daemon flushes out dirty filesystem pages.

What about fragmentation? In the DOS world, fragmentation is a very real issue and can impair disk performance by scattering disk accesses over an already busy surface. In theory, the Unix filesystem has little fragmentation, since fragments live only in the last block of the file. But theory and practice diverge when a filesystem is at 98 percent utilization and the users show no signs of saving fewer mail messages. The optimal block-placement policy starts to break down and you end up incurring more rotational and seek latencies just to get those hard-to-reach datablocks. There are some third-party tools, such as Eagle Software Inc.'s Disk_Pak (see "Disk_Pak defragments disks," November 1993), that will undo disk fragmentation in place, although historically this has involved dumping the filesystem to tape and reloading it to get the best placement as each file is recreated.
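A quick way to spot filesystems approaching the range where block placement starts to suffer is to compute utilization directly. This is a minimal sketch using Python's standard `shutil.disk_usage`; the 90 percent threshold and the function names are assumptions, and the figure may differ slightly from what `df` reports because of reserved space.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Utilization as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def nearly_full(path, threshold=90.0):
    """True if the filesystem holding `path` is at or above `threshold` percent."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) >= threshold


print(percent_used(1000, 980))  # the 98-percent case from the text
```

Calling `nearly_full("/home")` (a hypothetical mount point) on each filesystem gives a short list of candidates for cleanup or rebalancing.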

To improve Unix filesystem performance, organize your data to maximize the number of sequential transfers and minimize the number of long seeks. Knowing something about how the Unix filesystem is laid out is essential to interpreting statistics about filesystem usage. Data on how and what files are accessed is even more important, because it is the input for a performance improvement plan.

Statistical multiplexing
One shortcoming of the Unix world is that its tools give you a gross, high-level picture of the system but don't associate particular work flows with system activity. Pointing out that your disks are 95 percent full is about as useful as showing your friends a closet full of suits. If none fit or all are out of style, the closet is just wasted space. It looks impressive from the high-level view, but it is infrequently accessed. If you want to build a profile of your typical filesystem user, look for lower-level statistics that tell you about access patterns:
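One illustrative statistic of this kind is how much of a filesystem's space sits in files nobody has touched recently. The sketch below walks a directory tree and totals the bytes in regular files whose last access time is older than a cutoff. The function name and the 90-day cutoff in the usage note are assumptions, and access times are only meaningful if the filesystem is mounted so that it records them.

```python
import os
import stat
import time

SECONDS_PER_DAY = 86400


def idle_space(root, idle_days, now=None):
    """Return (total_bytes, idle_bytes) for regular files under `root`.

    A file counts as idle when its last access time is more than
    `idle_days` days before `now`.
    """
    now = time.time() if now is None else now
    cutoff = now - idle_days * SECONDS_PER_DAY
    total = idle = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symbolic links
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            total += st.st_size
            if st.st_atime < cutoff:
                idle += st.st_size
    return total, idle
```

For example, `total, idle = idle_space("/home", 90)` (a hypothetical path and cutoff) tells you what share of the home directories is the closet full of suits.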

You need a bag of tricks to squeeze more performance from your filesystems. Some of the magic comes from the vendor (see the sidebar "Turbocharging filesystems"), but most configuration changes will be tailored for your environment.

Avoiding the circular file
Realistically, what can you do to improve filesystem performance? Most of the techniques involve better resource allocation or partitioning; those that involve fine-grain movement of files may not yield large results.

# (cd /home; tar cfF . - /usr/tmp/filelist ) | ( cd /newhome; tar xvf - )

Here's the test: Could you handle the incremental load of 10 new users, without existing users complaining of slowdowns or disk-space shortages? Transparent growth of the environment maximizes everyone's happiness; quiet users give you more time to run Xmosaic. Collecting and interpreting filesystem statistics does more than give you experience using Lotus or Excel to make pretty charts. Those multicolored bars are the justifications for additional purchases, and the key to proactively managing system performance.

About the author
Hal Stern, an area technology manager for Sun, can be reached at hal.stern@sunworld.com.





SCSI basics: theme and variations


Vendor literature is full of claims about blazing SCSI-bus speed, touting fast, wide, and differential SCSI options as the keys to excellent performance. What's so great about SCSI buses that run at 10 to 20 megabytes per second when disks are cruising along at 1/2 to 1/4 that speed? Most of the differences lie in configuration guidelines and the SCSI bus utilization during sequential transfers:

How do fast and wide buses help if random I/O is bound by disk speeds? First, the faster SCSI bus makes the complete SCSI transaction a bit faster, because it reduces the time spent copying data from the disk's buffer back to the host. Reducing bus bandwidth utilization allows you to put more disks on a bus and thus to increase disk connectivity to the system. SCSI-bus arbitration and command setup consume some bus cycles, but SCSI-bus bandwidth shouldn't be a constraining factor for random I/O. Bus speed is more of an issue for sequential transfers because you can run several disks in parallel across a 20-megabyte-per-second bus, but that same workload would saturate a 10-megabyte-per-second interface.
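The sequential case reduces to simple division. This sketch assumes each disk streams at 5 megabytes per second (the sequential peak cited in the main story) and ignores arbitration and command overhead; the function name is hypothetical.

```python
def disks_before_saturation(bus_mb_per_sec, disk_mb_per_sec):
    """Whole number of disks streaming at full rate that fit within the bus."""
    return int(bus_mb_per_sec // disk_mb_per_sec)


for bus in (10.0, 20.0):
    print(f"{bus:.0f} MB/s bus: {disks_before_saturation(bus, 5.0)} disks")
```

With these figures, the faster bus carries twice as many concurrent sequential streams before the bus, rather than the disks, becomes the limit.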


Turbocharging filesystems


Redundant information like cylinder groups and inode data ensures the Unix filesystem can be made consistent even after a system crash in which some writes did not complete. The workhorse that examines every file and rebuilds the filesystem is fsck. The longest event in any large server boot is the filesystem preening done by fsck, which can take 10 minutes or more.

Compare this scenario to that of restarting a database. The DBMS maintains a transaction log, which is reviewed to determine if transactions can be rolled forward or need to be rolled back to keep the database consistent. A DBMS may be back up in under a minute because it only has to replay the log, not examine every row in every table.

Unix logging filesystems marry the database log and the FFS data to optimize filesystem recovery time. A logging, or journaling, filesystem maintains a log of writes and updates, so fsck can be replaced with a recovery process that rolls the log. Logging filesystems also speed synchronous writes by coalescing multiple disk-block writes into a single log-journal entry. Logging filesystems are available as part of the DCE distributed filesystem package, IBM's AIX, the widely licensed Veritas software, and Windows NT.

The true bastion of all data management -- the mainframe -- also contributes to Unix filesystem performance. A group of engineers at Sun applied extents (fixed-length, pre-allocated sections of the disk) to the FFS, providing up to a doubling of sequential throughput. The extent size is matched to the number of disk blocks cached by the SCSI controller, giving Sun's UFS (called UFS+) the ability to handle 56-kilobyte chunks of a file at a time.
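The log-replay recovery described in this sidebar can be illustrated with a toy model. This is a sketch only: the record layout, the names, and the dictionary standing in for the disk are all hypothetical, and none of it reflects the on-disk format of any product named above.

```python
def recover(disk, log):
    """Replay a write-ahead log against `disk` (a dict of block -> data).

    Log records are tuples:
      ("write", txid, block, data)  -- an intended block update
      ("commit", txid)              -- the transaction completed
    Writes from committed transactions are rolled forward; writes from
    transactions with no commit record are ignored.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _kind, _txid, block, data = rec
            disk[block] = data
    return disk


# Crash scenario: transaction 1 committed, transaction 2 did not.
disk = {10: "old inode", 20: "old data"}
log = [
    ("write", 1, 10, "new inode"),
    ("write", 1, 20, "new data"),
    ("commit", 1),
    ("write", 2, 30, "half-finished"),
]
recover(disk, log)
print(disk)
```

The work done at recovery is proportional to the length of the log, not the size of the filesystem, which is the property that lets a logging filesystem skip the full fsck pass.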



[Copyright 1995 Web Publishing Inc.]

URL: http://www.sunworld.com/asm-05-1994/asm-05-sysadmin.html.
Last updated: 1 May 1994.