Performance Q & A by Adrian Cockcroft

How do disks really work?

Figure out your disks' behavior

June  1996

Modern disks have minds of their own. When combined with an intelligent disk array or RAID unit, their behavior can be quite confusing. To make it more challenging, the disk monitoring utilities print a seemingly random selection of metrics. But don't worry, this mess can be sorted out, and all you need are the standard utilities. (3,100 words)


Q: I monitor my disks with iostat and sar, but these two tools don't print the same numbers. There seem to be many measurements available, but it's not clear what they all mean. Finally, what difference does a disk array make?
-- Diskless in Dodgeville

There are several options to iostat, and there are indeed slightly different metrics reported by sar. To get to the bottom of this I'll start by describing how disks really work, and the low-level measurements that the kernel collects. The numbers printed by iostat and sar make sense when you see how they are derived from the underlying measurements.

If you've ever tried to get the same numbers from two copies of the same performance command, let alone two different commands, you've discovered that it is impossible to synchronize the measurements. Every disk access updates the metrics, and it's too hard to start the commands at exactly the same time and keep them running in step.

What is a disk, really?
Let's start by trying to understand the things that make up a modern disk. There is more to today's disks than you may think. For a start, the disk drive itself contains a CPU and memory. The CPU does more than copy data to and from the SCSI bus. It can handle perhaps 64 commands at the same time, and will figure out the best order to perform them in. If there are two files being read from the disk at once, the requests will be interleaved as they arrive at the disk drive. The drive can sort them out, and read a larger amount from one file before seeking to the other file. Since the overall number of seeks is reduced, overall performance is better. From the point of view of a single file, however, the access times are less predictable. Some accesses will be faster than expected, and some will be slower.

The RAM on the drive is used to hold the data for all those commands. It may also be used to hold prefetched data; that is, when the drive is lightly loaded, its CPU guesses what you will ask for next and tries to fetch it in advance. When the disk gets busy, a prefetch "guess" could get in the way of a real request, so prefetching stops, and you may find that a busy disk with several competing sequential access streams performs worse than you would expect.

With SCSI, a disk is accessed by block number only. There is no way you can tell which sector, cylinder, or head is involved -- you just ask for the block. For some time now, SCSI disks have taken advantage of this to get higher densities. Even though the operating system and the Unix file system still work in cylinders, the entries in /etc/format.dat that define the disk geometry are almost pure fiction. The total size in blocks is all that matters.

On the "spinning rust" itself, the number of sectors per track varies. If you have ever played around with an old-style open reel audio tape drive you know that the tape speed can be varied, and that the higher the tape speed, the better the sound. On a disk, the outermost track passes the head faster than the innermost track. They both take the same time, but the circumference is longer. Modern disks store more data on the higher quality outermost track than they do on the innermost track, by turning up the data rate and putting more sectors on it. Why does this matter? Well, the disk blocks are numbered from outside edge in. When you slice a disk for use, you should find slice 0 faster than slice 7 because at 0 there are more sectors per track and a higher data rate. The performance difference from start to end can be as much as 30 percent, but performance falls off quite slowly during the first two-thirds of the disk.


Every now and again the disk's CPU will ignore the commands you are sending it and go through a thermal recalibration cycle. This ensures that the heads and the tracks stay perfectly aligned, but if it happens at the moment you are trying to access the disk, you may wait much longer than usual for your data. Some disk vendors sell special "multimedia" disk drives that avoid this problem; they are useful if you are trying to do real-time I/O or play back video from disk smoothly.

If you don't do anything for a while, many newer drives start to power themselves down. This is a result of their internal Energy Star modes, which turn off sections of circuitry when they are not in use. One firmware bug caused occasional errors to be reported on lightly used disk drives: the drive's CPU would go to sleep and, upon awakening, forget which speed zone it was in. After a while it would reset, seek, and recover, but there would be a worrying console message about read retries.

I don't have space here to talk about the SCSI bus itself, and there are no direct SCSI bus-specific metrics in the current OS.

The SCSI Host Bus Adapter
You might think of this as the SCSI controller, but all devices on the SCSI bus act as controllers. The name Host Bus Adapter (HBA) indicates that it connects the SCSI bus to the host computer bus. For SPARC computers this is the SBus.

The SCSI HBA can be quite a complex and intelligent device, or it can be quite simple minded. The difference, from a performance perspective, is that a simple minded device keeps interrupting the OS so it can be told the next thing to do. The simplest (and most common) SBus SCSI HBA uses the "esp" device driver, and every SCSI command takes several interrupts to complete. The most intelligent (and expensive) SBus SCSI HBA uses the "isp" device driver; the SWI/S and DWI/S cards are "isp" based. An "isp" completes a whole SCSI operation on its own and interrupts just once, when it is finished, which saves a great deal of system CPU time on the host. The latest SBus SCSI HBA uses the "fas" driver; it is found on UltraSPARC systems and obsoletes the "esp." It is an improvement on the "esp," but less complex and expensive than the "isp," which remains the high-end option.

The SPARCstorage Array (SSA) and Fibre Channel
The SSA contains six "isp"-style SCSI buses and a dedicated SPARC processor that keeps them fed, communicates with the host over Fibre Channel, and manages non-volatile storage. This processor exercises similar control over each disk drive's CPU. Commands waiting to be completed are sorted into queues, one per device, and adjacent I/Os can be coalesced into a single I/O to the disk. If fast writes are enabled for the disk, writes are put into non-volatile storage (and written to disk later) and the host system is immediately told that the I/O has completed, which provides a dramatic speedup for many operations. The Fibre Channel's Serial Optical Controller (SOC) acts like an intelligent SCSI HBA with a limit of 256 outstanding commands.

The Solaris 2 operating system
There are two levels in the OS: the generic "sd" (or "ssd" for the SPARCstorage Array) SCSI disk driver, and the HBA-specific "esp," "fas," "isp," and "soc" drivers that send SCSI commands to the device. At the generic level a read or write system call becomes an entry in a queue of commands waiting to be sent to a device. If the device's own queue is full, or the SCSI bus is very busy, the command may wait a long time.

When a read command is sent to the disk it becomes active, and inside the disk the queue of active commands is processed. When the disk has the data ready it is sent back to the HBA, which uses DMA to copy the data into memory. When the transfer is done the HBA interrupts the OS, which does some housekeeping work then returns from the read system call.

Solaris maintains a full set of counters and high-resolution timers that are updated by each command. The initial arrival of a command causes the wait queue length to be incremented and the time spent at the previous queue length is accumulated as the product of length and time. When a command is issued to the disk, another set of metrics count the length and time spent in the active queue. The time that each queue is empty is also noted, as is the size of the transfer. The data structure maintained by the kernel is described in the kstat(3) manual page as follows:

     typedef struct kstat_io {
        /*
         * Basic counters.
         */
        u_longlong_t   nread;              /* number of bytes read */
        u_longlong_t   nwritten;           /* number of bytes written */
        ulong_t        reads;              /* number of read operations */
        ulong_t        writes;             /* number of write operations */
        /*
         * Accumulated time and queue length statistics.
         *
         * Time statistics are kept as a running sum of "active" time.
         * Queue length statistics are kept as a running sum of the
         * product of queue length and elapsed time at that length --
         * i.e., a Riemann sum for queue length integrated against time.
         *
         *              ^
         *              |                       _________
         *              8                       | i4    |
         *              |                       |       |
         *      Queue   6                       |       |
         *      Length  |       _________       |       |
         *              4       | i2    |_______|       |
         *              |       |       i3              |
         *              2_______|                       |
         *              |    i1                         |
         *              |_______________________________|
         *              Time->  t1      t2      t3      t4
         *
         * At each change of state (entry or exit from the queue),
         * we add the elapsed time (since the previous state change)
         * to the active time if the queue length was non-zero during
         * that interval; and we add the product of the elapsed time
         * times the queue length to the running length*time sum.
         *
         * This method is generalizable to measuring residency
         * in any defined system: instead of queue lengths, think
         * of "outstanding RPC calls to server X."
         *
         * A large number of I/O subsystems have at least two basic
         * "lists" of transactions they manage: one for transactions
         * that have been accepted for processing but for which processing
         * has yet to begin, and one for transactions which are actively
         * being processed (but not done). For this reason, two cumulative
         * time statistics are defined here: pre-service (wait) time,
         * and service (run) time.
         *
         * The units of cumulative busy time are accumulated nanoseconds.
         * The units of cumulative length*time products are elapsed time
         * multiplied by queue length.
         */
        hrtime_t     wtime;            /* cumulative wait (pre-service) time */
        hrtime_t     wlentime;         /* cumulative wait length*time product */
        hrtime_t     wlastupdate;      /* last time wait queue changed */
        hrtime_t     rtime;            /* cumulative run (service) time */
        hrtime_t     rlentime;         /* cumulative run length*time product */
        hrtime_t     rlastupdate;      /* last time run queue changed */
        ulong_t      wcnt;             /* count of elements in wait state */
        ulong_t      rcnt;             /* count of elements in run state */
     } kstat_io_t;

From these measures it is possible to calculate all the numbers that are printed by iostat and sar -d. For some example code that does these calculations you could look at the per-disk iostat class in the SE toolkit. I'll summarize the math in the next section.

The basic requirement is to have two copies of the data, separated by a time interval. You can then work out the statistics for that time interval. Every disk on the system has its own copy of this data, so you need to store the data for every disk, wait a while, then read it again for every disk. Each measurement has its own high-resolution 64-bit timestamp, and the timestamp is guaranteed to be stable and monotonic.

Some older operating systems (e.g., SunOS 4) use the system's clock tick counter. This is subject to change if someone sets the date, and may be temporarily sped up or slowed down if network-wide time synchronization is in use. When this occurs some performance metrics can become inaccurate. The hires counter (which can be accessed by way of the gethrtime(3C) library call) solves this problem for Solaris 2.

Disk statistics

% sar -d 1 1

SunOS hostname 5.5 Generic sun4u    05/23/96

11:34:56   device        %busy   avque   r+w/s  blks/s  avwait  avserv
11:34:57   sd3               0     0.0       0       0     0.0     0.0

% iostat
      tty          sd3          cpu
 tin tout Kps tps serv  us sy wt id
   0    0   2   0   80   1  1  1 98
% iostat -D
 rps wps util 
   0   0  0.9 
% iostat -x
                                 extended disk statistics 
disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b 
sd3       0.2  0.1    1.3    1.2  0.0  0.0   79.6   0   1 

I have shown sar -d and the three forms of iostat. As you can see, some of the headings match, and some look as if they might be the same metric under a different name. They mostly are: sar's %busy corresponds to iostat's %b, r+w/s to tps, and blks/s counts 512-byte blocks where iostat's Kps counts kilobytes; avwait and avserv are the average wait and service times in milliseconds, and their sum corresponds to iostat's svc_t.

That's probably enough for most people. If you want more details of the mathematics that translates the raw data into the printed metrics, the best thing to do is read the source code for the per-disk iostat class, which is part of the SE toolkit. Of course, if you don't like the combinations of metrics that iostat or sar print, you can use SE to build your own custom utility. It really is only a few minutes' work if you use the script that clones iostat -x as a starting point.

An extreme example
Here is a SPARCstorage Array disk being hit by an extreme write load benchmark. I want to show a different set of metrics from any of the above tools -- I'll leave the required SE script as an exercise for the reader.

    Disk  %busy  avque   await avserv  rps  wps   krps   kwps
    ssd0 100.00  907.1 17254.9 6784.1    0   38      0   4219

The disk is ssd0, which is the system disk. It is 100 percent busy, and there are 907 commands queued on it. On average, a command waits 17 seconds in the wait queue, then is sent to the SPARCstorage Array, where it sits in the active queue for a further 6.8 seconds. Thirty-eight commands are processed per second, adding up to 4219 kilobytes per second being written.

This works out to 111 KB per write (4219/38) on average. The total queue time is 17254.9 + 6784.1 = 24039 ms; divided across the 907.1 queued commands, that is 26.5 ms per I/O, which is about the length of time you would expect a large I/O to take. The problem is that the benchmark keeps issuing writes faster than the disk can complete them. The time spent in the SPARCstorage Array queue is 6784 ms; at 26.5 ms per I/O this is exactly 256 commands, which is the active queue limit for the SOC interface, as described earlier.

The symptom reported to me as a problem was that the system locked up for the duration of the test. The reason it locked is that just about any command needs to page in its code from the system disk, and those page-ins were being squeezed out by the benchmark. When the test was run on another disk, the system was fine.

Next month
I'll take a look at compiler options for regular use and also some extra options that improve floating point performance on UltraSPARC systems.
