Clarifying disk measurements and terminology
Last month I introduced some new ways to look at disk performance
in Solaris 2.6. Some of the terminology used by iostat
is not strictly correct, so this month we will start with the base
level measurements and work our way up using the proper terms.
(1,900 words)
The
metrics reported by iostat
include service time, and in Solaris
2.6 there is now something called wait-service time. What does this
mean? Wait time and service time are supposed to be two separate
things, aren't they?
That's true. The confusion arises because disks used to be very simple things, and when they became more complex the instrumentation was extended, but some of the names were left the same. I'll start with a history lesson to show how we got here, and follow it up with a mathematics lesson to show how it works today.
Disk measurements history lesson
In the old days disk controllers really did control the disks
directly. If you still remember the old Sun machines with SMD disks
and Xylogics controllers you may know what I mean. All the
intelligence was in the controller, which was a large VMEbus card
inside the system cabinet. The disk heads were directly connected to
the controller, and the device driver knew exactly which track the
disk was reading. As each bit was read from disk it was buffered in
the controller until a whole disk block was ready to be passed to
the device driver.
The device driver maintained a queue of waiting requests that were
serviced one at a time by the disk. From this the system could
report the service time directly as milliseconds-per-seek. The
throughput in transfers per second was also reported, and the
percentage of the time that the disk was busy, known as the
utilization. The terms utilization, service
time, wait time, throughput, and wait
queue length have well defined meanings in this scenario. A set
of simple equations from queuing theory can be used to derive these
values from underlying measurements. The original version of
iostat
in SunOS 3.X and SunOS 4.X was basically the
same as BSD Unix.
Over time disk technology moved on. Nowadays, the standard disk is
SCSI based, and has an embedded controller. The disk drive contains
a small microprocessor and about 1 MB of RAM. It can typically
handle up to 64 outstanding requests via SCSI-tagged command
queuing. The system uses a SCSI host bus adaptor (HBA) to talk to
the disk. In large systems there is another level of intelligence
and buffering in a hardware RAID controller. The simple model of a
disk used by iostat
and its terminology have become
confused.
In the old days, once the device driver sent the disk a request, it knew that the disk would do nothing else until the request was complete. The time it took was the service time, and the average service time is a property of the disk itself. Disks that spin faster and seek faster have lower (better) service times. With today's systems, the device driver issues a request; that request is queued internally by the RAID controller (and the disk drive), and several more requests can be sent before the first one comes back. The service time, as measured by the device driver, varies according to the load level and queue length, and is not directly comparable to the old-style service time of a simple disk drive.
The instrumentation provided in Solaris 2 takes account of this change by explicitly measuring a two stage queue -- one queue in the device driver called the wait queue and one queue in the device itself called the active queue. A read or write command is issued to the device driver and sits in the wait queue until the SCSI bus and disk are both ready. When it is sent to the disk device it moves to the active queue until the disk sends its response.
The problem with iostat
is that it tries to report the new
measurements using some of the original terminology. The "wait
service time" that is mentioned in the question this month is
actually the time spent in the "wait" queue. This is not the right
definition of service time in any case, and the word "wait" is being
used to mean two different things. To sort out what we really do
have, we need to move on to the mathematics lesson.
It's all a matter of mathematics
Let's start with the actual measurements made by the kernel. For each
disk drive (and each disk partition, tape drive, and NFS mount in
Solaris 2.6) there is a small set of counters that are updated. Here
is an annotated copy of the kstat
-based data structure that SE
uses.
struct ks_disks {
    long      number$;     /* linear disk number */
    string    name$;       /* name of the device */
    ulonglong nread;       /* number of bytes read */
    ulonglong nwritten;    /* number of bytes written */
    ulong     reads;       /* number of reads */
    ulong     writes;      /* number of writes */
    longlong  wtime;       /* wait queue - time spent waiting */
    longlong  wlentime;    /* wait queue - sum of queue length multiplied
                              by time at that length */
    longlong  wlastupdate; /* wait queue - time of last update */
    longlong  rtime;       /* active/run queue - time spent active/running */
    longlong  rlentime;    /* active/run queue - sum of queue length
                              multiplied by time at that length */
    longlong  rlastupdate; /* active/run queue - time of last update */
    ulong     wcnt;        /* wait queue - current queue length */
    ulong     rcnt;        /* active/run queue - current queue length */
};
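To make the queue counters concrete, here is a small Python sketch (my own illustration, not Solaris kernel code) of the usual time-weighted way such counters are maintained: whenever a queue's length changes, the busy time and the length-time product are brought up to date before the length itself changes.

```python
class QueueCounters:
    """Counters for one queue (wait or active), times in nanoseconds."""
    def __init__(self):
        self.time = 0        # total time queue was non-empty (wtime/rtime)
        self.lentime = 0     # sum of length * time (wlentime/rlentime)
        self.lastupdate = 0  # time of last update (wlastupdate/rlastupdate)
        self.cnt = 0         # current queue length (wcnt/rcnt)

    def update(self, now):
        delta = now - self.lastupdate
        if self.cnt > 0:
            self.time += delta              # queue was busy for delta ns
            self.lentime += self.cnt * delta
        self.lastupdate = now

    def enter(self, now):
        self.update(now)
        self.cnt += 1

    def exit(self, now):
        self.update(now)
        self.cnt -= 1

# One command sits alone in the queue from t=100ns to t=400ns:
q = QueueCounters()
q.enter(100)
q.exit(400)
print(q.time, q.lentime)   # 300 300
```

With two overlapping commands the length-time product grows faster than the busy time, which is exactly what lets the average queue length be recovered later.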
None of these values are printed out directly by iostat
, so this is
where the basic arithmetic starts. The first thing to realize is
that the underlying metrics are cumulative counters or instantaneous
values. The values printed by iostat
are averages over a time
interval. We need to take two copies of the above data structure
together with high resolution timestamps for each and do some
differencing (subtraction). We then get the average values between
the start and end times. I'll write it out as plainly as possible,
with pseudocode that assumes that there is an array of two values
for each measure indexed by start and end.
Thires = hires elapsed time = end time - start time = timestamp[end] - timestamp[start]
Thires is in units of nanoseconds, so divide down to get seconds.
T = Thires / 1000000000
Bwait = hires busy time for wait queue = wtime[end] - wtime[start]
Brun = hires busy time for run queue = rtime[end] - rtime[start]
QBwait = wait queue length * time = wlentime[end] - wlentime[start]
QBrun = run queue length * time = rlentime[end] - rlentime[start]
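The differencing step can be sketched in Python with two entirely hypothetical counter snapshots (field names follow the ks_disks structure above; timestamps are in nanoseconds):

```python
# Two hypothetical snapshots of the per-disk counters, 5 seconds apart.
start = {"timestamp": 1_000_000_000, "wtime": 100_000,
         "rtime": 5_000_000, "wlentime": 120_000, "rlentime": 9_000_000}
end   = {"timestamp": 6_000_000_000, "wtime": 600_000,
         "rtime": 2_505_000_000, "wlentime": 720_000,
         "rlentime": 4_509_000_000}

def delta(field):
    """Difference a cumulative counter between the two snapshots."""
    return end[field] - start[field]

Thires = delta("timestamp")   # elapsed time in nanoseconds
T = Thires / 1e9              # elapsed time in seconds
Bwait = delta("wtime")        # hires busy time for the wait queue
Brun = delta("rtime")         # hires busy time for the active/run queue
QBwait = delta("wlentime")    # wait queue length * time product
QBrun = delta("rlentime")     # run queue length * time product
print(T)   # 5.0
```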
Now we assume that all disk commands complete fairly quickly, so the arrival and completion rates are the same in a steady state average, and the throughput of both queues is the same. I'll use completions below as it seems more intuitive in this case.
Cread = completed reads = reads[end] - reads[start]
Xread = read throughput = iostat rps = Cread / T
Cwrite = completed writes = writes[end] - writes[start]
Xwrite = write throughput = iostat wps = Cwrite / T
C = total commands completed = Cread + Cwrite
X = throughput in commands per second = iostat tps = C / T
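Sketched in Python, with hypothetical read and write counters differenced over a five-second interval:

```python
T = 5.0                  # measurement interval in seconds
Cread = 1100 - 1000      # reads[end] - reads[start]
Cwrite = 550 - 500       # writes[end] - writes[start]

Xread = Cread / T        # iostat rps
Xwrite = Cwrite / T      # iostat wps
C = Cread + Cwrite       # total commands completed
X = C / T                # iostat tps
print(Xread, Xwrite, X)  # 20.0 10.0 30.0
```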
A similar calculation gets us the data rate in kilobytes per second.
Kread = KB read in the interval = ( nread[end] - nread[start] ) / 1024
Kwrite = KB written in the interval = ( nwritten[end] - nwritten[start] ) / 1024
Xkread = iostat Kr/s = Kread / T
Xkwrite = iostat Kw/s = Kwrite / T
Xk = total data rate = iostat Kps = Xkread + Xkwrite
Next we can obtain the utilization or busy percentage.
Uwait = wait queue utilization = iostat %w = 100 * Bwait / Thires
Urun = active/run queue utilization = iostat %b = 100 * Brun / Thires
Now we get to something called service time, but it is NOT what iostat
prints out and calls service time. This is the real thing!
Swait = average wait queue service time in nanoseconds = Bwait / C
Srun = average active/run queue service time in nanoseconds = Brun / C
Divide down to milliseconds or seconds if you like. The meaning of Srun is as close as you can get to the old-style disk service time. Remember, though, that the disk can run more than one command at a time and can return commands in a different order than they were issued, so it cannot be exactly the same thing.
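In Python, with hypothetical numbers for a disk that was busy half the time while completing 150 commands in a 5-second interval:

```python
Thires = 5_000_000_000        # 5 second measurement interval, ns
Bwait = 500_000               # wait queue busy time, ns
Brun = 2_500_000_000          # active/run queue busy time, ns
C = 150                       # commands completed in the interval

Uwait = 100 * Bwait / Thires  # iostat %w
Urun = 100 * Brun / Thires    # iostat %b
Swait = Bwait / C             # real wait queue service time, ns
Srun = Brun / C               # real active queue service time, ns
print(round(Urun), round(Srun / 1e6, 2))  # 50 16.67  (%b, then ms)
```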
The data structure contains an instantaneous measure of queue length, but we want the average over the time interval. We get this from that strange "length time" product by dividing it by the busy time.
Qwait = average wait queue length = iostat wait = QBwait / Bwait
Qrun = average active/run queue length = iostat actv = QBrun / Brun
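For example, with hypothetical values for the busy times and length-time products (all in nanoseconds):

```python
Bwait = 500_000           # wait queue busy time
Brun = 2_500_000_000      # active/run queue busy time
QBwait = 600_000          # wait queue length * time product
QBrun = 4_500_000_000     # run queue length * time product

Qwait = QBwait / Bwait    # iostat wait
Qrun = QBrun / Brun       # iostat actv
print(Qwait, Qrun)        # 1.2 1.8
```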
Finally we can get the thing that iostat calls service time, but
which is really the response time or residence time: it includes
all queuing as well as service time, and is defined as the queue
length divided by the throughput.
Rwait = wait queue residence time = iostat w_svct = Qwait / X
Rrun = active/run queue residence time = iostat a_svct = Qrun / X
R = total residence time = iostat svc_t = Rwait + Rrun
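A worked example in Python, with hypothetical queue lengths and throughput:

```python
X = 30.0                 # throughput, commands per second
Qwait = 1.2              # average wait queue length
Qrun = 1.8               # average active/run queue length

Rwait = Qwait / X        # iostat w_svct, in seconds
Rrun = Qrun / X          # iostat a_svct, in seconds
R = Rwait + Rrun         # iostat svc_t
print(Rwait * 1000, Rrun * 1000, R * 1000)  # roughly 40, 60, 100 ms
```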
The thing that iostat
calls the service time is
actually the residence or response time and includes all queuing
effects. The real definition of service time is the time taken for
the first command in line to be processed, and it is not printed out
by iostat
. Thanks to the SE toolkit this is easily
fixed. A "corrected" version of iostat
written in SE
prints out the data using the format shown below. This was actually
measured using a 2.5-inch IDE disk in my SPARCbook.
Correct disk service statistics
                          -----wait queue-----  ----active queue----
disk     r/s  w/s  Kr/s  Kw/s  len   res   svc %u  len   res   svc  %u
c0t3d0   8.4  0.0  35.4   0.0 0.00  0.01  0.01  0 0.20 24.33 24.33  20
c0t3d0  36.2  0.0 143.2   0.0 0.00  0.01  0.01  0 0.85 23.44 23.32  84
c0t3d0  32.6  4.2 157.2  27.8 0.08  2.20  1.84  7 1.57 42.57 26.14  96
c0t3d0  19.4 13.8 113.2 110.4 0.99 29.91 12.41 41 2.53 76.15 30.12 100
Wrap up
The Solaris 2 disk instrumentation is very complete and accurate.
Now that it has been extended to tapes, partitions, and client NFS
mount points there is a lot more that can be done with it. It's a
pity that the naming conventions used by iostat
are so
confusing, and that sar -d
mangles the data so much for
display. We asked if sar
could be fixed, but its output
format and options are largely constrained by cross-platform Unix
standards. We tried to get iostat
fixed, but it was
felt that the current naming convention was what users expected to
see, so changing the header or data too much would confuse existing
users. Hopefully this translation of existing practice into the
correct terminology will help reduce the confusion somewhat.
Resources
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a
performance specialist for the Server Division of SMCC. He wrote
Sun Performance and Tuning: SPARC and Solaris,
published by SunSoft Press
PTR Prentice Hall.
Reach Adrian at adrian.cockcroft@sunworld.com.
URL: http://www.sunworld.com/swol-09-1997/swol-09-perf.html