Performance Q & A by Adrian Cockcroft

Clarifying disk measurements and terminology

iostat metrics and related naming conventions have become more and more complicated. We straighten things out by explaining how disk measurements evolved and by properly defining the terms as they're used today.

SunWorld
September 1997

Abstract
Last month I introduced some new ways to look at disk performance in Solaris 2.6. Some of the terminology used by iostat is not strictly correct, so this month we will start with the base-level measurements and work our way up using the proper terms. (1,900 words)



Q The metrics reported by iostat include service time, and in Solaris 2.6 there is now something called wait-service time. What does this mean? Wait time and service time are supposed to be two separate things, aren't they?

A That's true. The confusion arises because disks used to be very simple things, and when they became more complex the instrumentation was extended, but some of the names were left the same. I'll start with a history lesson to show how we got here, and follow it up with a mathematics lesson to show how it works today.


Disk measurements history lesson
In the old days disk controllers really did control the disks directly. If you still remember the old Sun machines with SMD disks and Xylogics controllers you may know what I mean. All the intelligence was in the controller, which was a large VMEbus card inside the system cabinet. The disk heads were directly connected to the controller, and the device driver knew exactly which track the disk was reading. As each bit was read from disk it was buffered in the controller until a whole disk block was ready to be passed to the device driver.

The device driver maintained a queue of waiting requests that were serviced one at a time by the disk. From this the system could report the service time directly as milliseconds per seek. The throughput in transfers per second was also reported, as was the percentage of the time that the disk was busy, known as the utilization. The terms utilization, service time, wait time, throughput, and wait queue length have well-defined meanings in this scenario. A set of simple equations from queuing theory can be used to derive these values from the underlying measurements. The original version of iostat in SunOS 3.X and SunOS 4.X was basically the same as in BSD Unix.

Over time disk technology moved on. Nowadays the standard disk is SCSI-based and has an embedded controller. The disk drive contains a small microprocessor and about 1 MB of RAM. It can typically handle up to 64 outstanding requests via SCSI tagged command queuing. The system uses a SCSI host bus adaptor (HBA) to talk to the disk. In large systems there is another level of intelligence and buffering in a hardware RAID controller. The simple model of a disk assumed by iostat, and the terminology that goes with it, have become confused.

In the old days, once the device driver sent the disk a request, it knew that the disk would do nothing else until the request was complete. The time it took was the service time, and the average service time is a property of the disk itself. Disks that spin faster and seek faster have lower (better) service times. With today's systems, the device driver issues a request; that request is queued internally by the RAID controller and the disk drive, and several more requests can be sent before the first one comes back. The service time, as measured by the device driver, varies according to the load level and queue length, and is not directly comparable to the old-style service time of a simple disk drive.

The instrumentation provided in Solaris 2 takes account of this change by explicitly measuring a two-stage queue: one queue in the device driver, called the wait queue, and one queue in the device itself, called the active queue. A read or write command is issued to the device driver and sits in the wait queue until the SCSI bus and disk are both ready. When it is sent to the disk device, it moves to the active queue until the disk sends its response.

The problem with iostat is that it tries to report the new measurements using some of the original terminology. The "wait service time" that is mentioned in the question this month is actually the time spent in the "wait" queue. This is not the right definition of service time in any case, and the word "wait" is being used to mean two different things. To sort out what we really do have, we need to move on to the mathematics lesson.

It's all a matter of mathematics
Let's start with the actual measurements made by the kernel. For each disk drive (and each disk partition, tape drive, and NFS mount in Solaris 2.6) there is a small set of counters that are updated. Here is an annotated copy of the kstat-based data structure that the SE toolkit uses.

struct ks_disks {
    long      number$;    /* linear disk number */
    string    name$;      /* name of the device */

    ulonglong nread;      /* number of bytes read */
    ulonglong nwritten;   /* number of bytes written */
    ulong     reads;      /* number of reads */
    ulong     writes;     /* number of writes */
    longlong  wtime;      /* wait queue - time spent waiting */
    longlong  wlentime;   /* wait queue - sum of queue length multiplied
                             by time at that length */
    longlong  wlastupdate;/* wait queue - time of last update */
    longlong  rtime;      /* active/run queue - time spent active/running */
    longlong  rlentime;   /* active/run queue - sum of queue length
                             multiplied by time at that length */
    longlong  rlastupdate;/* active/run queue - time of last update */
    ulong     wcnt;       /* wait queue - current queue length */
    ulong     rcnt;       /* active/run queue - current queue length */
};
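
The same counters are available to a C program through the Solaris libkstat interface, where each disk's data appears as a kstat_io_t structure with the same field names as the SE struct above. As a minimal sketch, the program below takes one snapshot of a single disk; the module/instance/name triple "sd", 3, "sd3" is a made-up example, so substitute the names used on your own system. Link with -lkstat.

#include <stdio.h>
#include <kstat.h>

int
main(void)
{
    kstat_ctl_t *kc;    /* handle to the kstat library */
    kstat_t     *ksp;   /* one named kstat: our example disk */
    kstat_io_t   kio;   /* the I/O counters described above */

    if ((kc = kstat_open()) == NULL) {
        perror("kstat_open");
        return 1;
    }
    /* "sd"/3/"sd3" is a hypothetical disk; adjust for your system */
    if ((ksp = kstat_lookup(kc, "sd", 3, "sd3")) == NULL ||
        kstat_read(kc, ksp, &kio) == -1) {
        perror("kstat_lookup/kstat_read");
        return 1;
    }
    printf("reads %u writes %u nread %llu nwritten %llu\n",
        kio.reads, kio.writes,
        (unsigned long long)kio.nread,
        (unsigned long long)kio.nwritten);
    (void) kstat_close(kc);
    return 0;
}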

None of these values are printed out directly by iostat, so this is where the basic arithmetic starts. The first thing to realize is that the underlying metrics are cumulative counters or instantaneous values. The values printed by iostat are averages over a time interval. We need to take two copies of the above data structure together with high resolution timestamps for each and do some differencing (subtraction). We then get the average values between the start and end times. I'll write it out as plainly as possible, with pseudocode that assumes that there is an array of two values for each measure indexed by start and end.

Thires = hires elapsed time = end time - start time = timestamp[end] - timestamp[start]

Thires is in units of nanoseconds, so divide down to get seconds.

T = Thires / 1000000000

Bwait = hires busy time for wait queue = wtime[end] - wtime[start]

Brun = hires busy time for run queue = rtime[end] - rtime[start]

QBwait = wait queue length * time = wlentime[end] - wlentime[start]

QBrun = run queue length * time = rlentime[end] - rlentime[start]

Now we assume that all disk commands complete fairly quickly, so that in a steady-state average the arrival and completion rates are the same, and the throughput of both queues is the same. I'll use completions below, as it seems more intuitive in this case.

Cread = completed reads = reads[end] - reads[start]

Xread = read throughput = iostat rps = Cread / T

Cwrite = completed writes = writes[end] - writes[start]

Xwrite = write throughput = iostat wps = Cwrite / T

C = total commands completed = Cread + Cwrite

X = throughput in commands per second = iostat tps = C / T
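
For example (with made-up numbers), if reads[start] = 1000, reads[end] = 1500, writes[start] = 400, writes[end] = 600, and T = 10 seconds, then Cread = 500, Cwrite = 200, Xread = 50 rps, Xwrite = 20 wps, and X = 70 tps.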

A similar calculation gets us the data rate in kilobytes per second.

Kread = KB read in the interval = ( nread[end] - nread[start] ) / 1024

Kwrite = KB written in the interval = ( nwritten[end] - nwritten[start] ) / 1024

Xkread = iostat Kr/s = Kread / T

Xkwrite = iostat Kw/s = Kwrite / T

Xk = total data rate = iostat Kps = Xkread + Xkwrite

Next we can obtain the utilization or busy percentage.

Uwait = wait queue utilization = iostat %w = 100 * Bwait / Thires

Urun = active/run queue utilization = iostat %b = 100 * Brun / Thires

Now we get to something called service time, but it is NOT what iostat prints out and calls service time. This is the real thing!

Swait = average wait queue service time in nanoseconds = Bwait / C

Srun = average active/run queue service time in nanoseconds = Brun / C

Divide down to milliseconds or seconds if you like. The meaning of Srun is as close as you can get to the old-style disk service time. Remember, though, that the disk can run more than one command at a time and can return commands in a different order than they were issued, so this cannot be quite the same thing.

The data structure contains an instantaneous measure of queue length, but we want the average over the time interval. We get this from that strange "length time" product by dividing it by the busy time.

Qwait = average wait queue length = iostat wait = QBwait / Bwait

Qrun = average active/run queue length = iostat actv = QBrun / Brun

Finally we can get the thing that iostat calls service time, but which is really the response time or residence time, and which includes all queuing as well as service time. By Little's law (queue length = throughput * residence time) it is obtained by dividing the queue length by the throughput.

Rwait = wait queue residence time = iostat w_svct = Qwait / X

Rrun = active/run queue residence time = iostat a_svct = Qrun / X

R = total residence time = iostat svc_t = Rwait + Rrun
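
To tie the whole calculation together, here is a sketch in C that turns two kstat_io_t snapshots into the values discussed above, using the same names as the formulas. It assumes prev and curr were read as in the earlier sketch, with gethrtime() timestamps hr_prev and hr_curr taken alongside each snapshot; it illustrates the arithmetic and is not the actual iostat source code.

#include <stdio.h>
#include <sys/time.h>   /* hrtime_t, gethrtime() */
#include <kstat.h>      /* kstat_io_t */

void
disk_metrics(const kstat_io_t *prev, const kstat_io_t *curr,
    hrtime_t hr_prev, hrtime_t hr_curr)
{
    double Thires = (double)(hr_curr - hr_prev);    /* nanoseconds */
    double T      = Thires / 1e9;                   /* seconds */

    /* busy times and length*time products for each queue */
    double Bwait  = (double)(curr->wtime - prev->wtime);
    double Brun   = (double)(curr->rtime - prev->rtime);
    double QBwait = (double)(curr->wlentime - prev->wlentime);
    double QBrun  = (double)(curr->rlentime - prev->rlentime);

    /* completions and throughput */
    double C = (double)(curr->reads - prev->reads) +
               (double)(curr->writes - prev->writes);
    double X = C / T;                               /* iostat tps */

    /* utilization (busy percentage) for each queue */
    double Uwait = 100.0 * Bwait / Thires;          /* iostat %w */
    double Urun  = 100.0 * Brun / Thires;           /* iostat %b */

    /* average queue lengths over the interval */
    double Qwait = (Bwait > 0.0) ? QBwait / Bwait : 0.0;  /* iostat wait */
    double Qrun  = (Brun  > 0.0) ? QBrun  / Brun  : 0.0;  /* iostat actv */

    /* real service times, converted from nanoseconds to milliseconds */
    double Swait = (C > 0.0) ? Bwait / C / 1e6 : 0.0;
    double Srun  = (C > 0.0) ? Brun  / C / 1e6 : 0.0;

    /* residence times (what iostat calls service time), in milliseconds */
    double Rwait = (X > 0.0) ? Qwait / X * 1e3 : 0.0;     /* iostat w_svct */
    double Rrun  = (X > 0.0) ? Qrun  / X * 1e3 : 0.0;     /* iostat a_svct */

    printf("tps %5.1f  %%w %3.0f  %%b %3.0f  wait %5.2f  actv %5.2f  "
        "svc_t %6.2f ms\n", X, Uwait, Urun, Qwait, Qrun, Rwait + Rrun);
    printf("real service times: wait queue %.2f ms, active queue %.2f ms\n",
        Swait, Srun);
}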

The thing that iostat calls the service time is actually the residence or response time and includes all queuing effects. The real definition of service time is the time taken to process a command once it reaches the front of the queue, and it is not printed out by iostat. Thanks to the SE toolkit, this is easily fixed. A "corrected" version of iostat written in SE prints out the data using the format shown below. This was actually measured using a 2.5-inch IDE disk in my SPARCbook.

Correct disk service statistics  -----wait queue----- ----active queue----
disk      r/s  w/s   Kr/s   Kw/s  len   res   svc  %u  len   res   svc  %u
c0t3d0    8.4  0.0   35.4    0.0 0.00  0.01  0.01   0 0.20 24.33 24.33  20
c0t3d0   36.2  0.0  143.2    0.0 0.00  0.01  0.01   0 0.85 23.44 23.32  84
c0t3d0   32.6  4.2  157.2   27.8 0.08  2.20  1.84   7 1.57 42.57 26.14  96
c0t3d0   19.4 13.8  113.2  110.4 0.99 29.91 12.41  41 2.53 76.15 30.12 100
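
As a quick sanity check on R = Q / X, take the second line of output: an average active queue length of 0.85 at 36.2 reads per second gives 0.85 / 36.2 = 0.023 seconds, or about 23.5 milliseconds, which agrees with the printed residence time of 23.44 (the difference is display rounding). In the last line the disk is saturated: 2.53 / (19.4 + 13.8) = 76 milliseconds of residence time, far above the 30.12-millisecond service time, because commands now spend most of their time queued inside the disk.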

Wrap up
The Solaris 2 disk instrumentation is very complete and accurate. Now that it has been extended to tapes, partitions, and client NFS mount points there is a lot more that can be done with it. It's a pity that the naming conventions used by iostat are so confusing, and that sar -d mangles the data so much for display. We asked if sar could be fixed, but its output format and options are largely constrained by cross-platform Unix standards. We tried to get iostat fixed, but it was felt that the current naming convention was what users expected to see, so changing the header or data too much would confuse existing users. Hopefully this translation of existing practice into the correct terminology will help reduce the confusion somewhat.




Resources

Other Cockcroft columns at www.sun.com

About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at adrian.cockcroft@sunworld.com.

