Performance Q & A by Adrian Cockcroft

Solving the iostat disk mystery

Our detective uses the new Solaris 2.5 trace system
to track a puzzling disk activity report

SunWorld
October  1996

Abstract
Disks can behave enigmatically. The only way to find out what is really happening is to delve beneath the averages shown by iostat. This month we'll use the Solaris 2.5 trace system to see each individual disk I/O and solve a riddle that has puzzled people for years. (3,000 words)


Q: I keep seeing disks that are lightly used, but have extremely large service times. Disks are supposed to have an average seek time of about 10-20 milliseconds, so why do they often report over 100 milliseconds when they can't possibly be overloaded? Why does this happen? Is it a sign of a problem?

                                 extended disk statistics 
disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b 
sd2       1.3  0.3   11.7    3.3  0.1  0.1  146.6   0   3 
sd3       0.0  0.1    0.1    0.7  0.0  0.0  131.0   0   0 

--Worried in Weyauwega

A: This is one of those recurring questions that everyone seems to ask at one time or another. The short answer is that it can safely be ignored, as the disks are so lightly used that they don't make a difference to the performance of the system.

It's a rather unsatisfying answer, however, as it doesn't explain why the large service times occur in the first place. There are several theories circulating:

Is it a bug?

This possibility can be dismissed immediately, since it's been seen for many years, and has even been reported as a bug and investigated. The calculations in iostat are well tested and correct.

Is it caused by rounding error at low-activity levels?

This is what I thought the problem was for many years. It was only when we used I/O tracing to look at near-idle disks that we found out what was really going on. Rounding errors cannot explain the high service times we see.

Is it caused by Energy Star and thermal recalibration?

Modern disks have minds of their own. If you stop using them for a while they power off their circuitry, and can even be programmed to spin down completely. Even when they are in use, they go through a recalibration sequence every now and again. This keeps the heads perfectly aligned even when temperature changes cause thermal expansion. While it's true that these activities will increase service times, they should be relatively infrequent. We might be able to find this kind of access in an I/O trace. It should appear as an isolated short distance seek that takes a long time.

Is it something to do with the filesystem?

If your system has disks that are used raw, by a database such as Sybase or Oracle, then you may notice that service times are much shorter overall. Even when activity levels are low, the service time stays low. (This, too, seems to rule out recalibration and Energy Star as the cause.) Perhaps the filesystem layout is a problem? Inodes, indirect blocks, and data blocks may be scattered all over the disk. The extra long-distance seeks required to find the inode, get the indirect block, and finally read or write the data probably explain why filesystem service times are generally higher than raw disk service times. At low usage levels, however, this still doesn't explain why we see such long service times.

Is it the filesystem flush process?

The fsflush process (pid 3) keeps the data on disk in sync with the data in memory at regular intervals. It normally runs every five seconds to flush data, and does additional work to flush inodes every 30 seconds. This is frequent enough to be a good candidate for the problem. It also explains why it is not seen on raw disks.

The only way to settle this is to trace each individual access, to see which process initiated the I/O, how long it took, and what else was happening at the same time. Solaris 2.5 introduced a new tracing capability that we can use, so I'll start by explaining how it works, then use it to capture some disk activity.


The Solaris 2.5 trace capability
Unix systems have had a kernel trace capability for many years. It was designed for development and debugging, not for end users. The production kernel is normally built without the trace capability for performance reasons. One of the first production kernels to include tracing was IBM's AIX kernel on the RS/6000 family. IBM left it turned on during early releases to assist in debugging, then decided that it was useful enough to pay its way, and since the overhead was quite low it is now a permanent feature. SunSoft also recognized the value of trace information, but decided to extend the trace capability to make it more generally useful, and to implement it alongside the existing kernel trace system. The result, introduced in Solaris 2.5, is the TNF (Trace Normal Form) facility: the prex command controls both kernel and user-level probes, the tnfxtract command snapshots the kernel trace buffer into a file, the tnfdump command converts the binary trace data into readable, time-ordered text, and the TNF_PROBE macros in libtnfprobe let you add probe points to your own programs.

There are a few things about kernel probes that are inconvenient. While user-level probes can write to trace files, the kernel probes write to a ring buffer. This is not a single global ring; there is a separate buffer per kernel thread, which avoids any need to lock the data structures, so there is no performance loss or contention on multiprocessor systems. You cannot easily tell how big the buffer needs to be, and one highly active probe point may loop right around its buffer while others have hardly started. If you are trying to capture every single probe, you need to make the buffer as big as you can. In general it is best to work with low event-rate probes or rely on sampling, and put up with missing probes. The tnfxtract command just takes a snapshot of the buffer, so a second snapshot will include anything left over from the first one. The tnfdump program does quite a lot of work to sort the probe events into time order.

Obtaining an I/O trace
The command sequence is quite simple. The commands must be run as root, and a directory is needed to hold the output. I find it easiest to have two windows open: one to run prex, and the other to go through a cycle of extracting, dumping, and viewing the data as required. The sequence for prex is to first allocate a buffer (the default is 384 kilobytes -- you can make it bigger), enable the "io" group of probes, make them trace accesses, then turn on the global flag that enables all kernel tracing.

# prex -k
Type "help" for help ...
prex> buffer alloc
Buffer of size 393216 bytes allocated
prex> enable io
prex> trace io
prex> ktrace on

Now wait a while or run the program you want to trace. In this case I ran iostat -x 10 in another window, didn't try to cause any activity, and waited for some slow service time to appear. After a minute or so I stopped collecting.

prex> ktrace off

In the other window I extracted the data and dumped it to take a look.

# mkdir /tmp/tnf
# cd /tmp/tnf
# tnfxtract io.tnf
# tnfdump io.tnf | more

The first section identifies the three probes that were enabled because they have the key "io." The "strategy" probe records an I/O being issued, and "biodone" records an I/O completing. The other mechanism that may cause I/O is the pageout scanner, and this is also being probed. The location of the probe in the Solaris source code is printed for kernel developers to refer to.

probe    tnf_name: "strategy" tnf_string: "keys io blockio;file ../../common/os/driver.c;line 358;"
probe    tnf_name: "pageout" tnf_string: "keys vm pageio io;file ../../common/vm/vm_pvn.c;line 511;"
probe    tnf_name: "biodone" tnf_string: "keys io blockio;file ../../common/os/bio.c;line 935;"

The next section is a table of probe events in time order. Where a user process was scheduling the I/O, its PID is recorded. Events caused by interrupts or kernel activity are recorded as PID 0. This trace was measured on a dual-CPU SPARCstation 10, so you can see a mixture of CPU 0 and CPU 2. I've listed the first few records then skipped through looking for something interesting.


Elapsed (ms)	Delta (ms)	PID	LWPID	   TID		CPU	Probe Name	Data/Description . . .

    0.000000	  0.000000	632	  1	0xf61a1d80	 0 	 strategy 	device: 8388632 block: 795344 size: 8192 buf: 0xf5afdb40 flags: 9	
  108.450000 	108.450000	  0	  0	0xfbf6aec0	 2	 biodone 	device: 8388632 block: 795344 buf: 0xf5afdb40 
  108.977000 	  0.527000	632       1	0xf61a1d80 	 0 	 strategy 	device: 8388632 block: 795086 size: 1024 buf: 0xf610a358 flags: 524569
  121.557000 	 12.580000	  0	  0 	0xfbf6aec0 	 2	 biodone 	device: 8388632 block: 795086 buf: 0xf610a358
  121.755000 	  0.179000 	632	  1 	0xf61a1d80 	 0	 pageout 	vnode: 0xf61d0a48 pages_pageout: 0 pages_freed: 0 pages_reclaimed: 0

The first strategy record is an eight-kilobyte write (flags bit 0x40 would be set for a read) to block 795344. To decode the device number: 8388632 = 0x800018, so the minor number is 0x18 = 24, and running ls -lL on /dev/dsk shows that minor device 24 is c0t3d0s0, which is mounted as /export/home. It's an old Sun 424-megabyte drive, so valid blocks run from 0 to 828719. This is a seek to a block right at the end of the disk, and it takes 108 milliseconds. It is followed by a one-kilobyte write to a nearby block, which takes only 12.58 milliseconds.
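
Decoding device numbers by hand gets tedious if you do it often, so here is a small C sketch that does the arithmetic. It assumes the expanded 32-bit device numbers Solaris 2.x uses and the usual sd convention of eight minor numbers (slices) per disk; the major() and minor() macros are taken from Solaris's <sys/mkdev.h>, and the program name is just for illustration.

/* devdecode.c -- print the major/minor split of a device number
 * seen in the trace, e.g. "devdecode 8388632".  The unit/slice
 * split assumes eight slices per sd disk, which is what ls -lL
 * on /dev/dsk confirms.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mkdev.h>		/* major() and minor() on Solaris 2.x */

int
main(int argc, char **argv)
{
	dev_t dev;

	if (argc != 2) {
		fprintf(stderr, "usage: devdecode <device-number>\n");
		return (1);
	}
	dev = (dev_t) strtoul(argv[1], NULL, 0);
	printf("device %lu: major %lu minor %lu (unit %lu slice %lu)\n",
	    (unsigned long) dev,
	    (unsigned long) major(dev), (unsigned long) minor(dev),
	    (unsigned long) (minor(dev) / 8),
	    (unsigned long) (minor(dev) % 8));
	return (0);
}

For 8388632 it prints minor 24, unit 3, slice 0 -- the c0t3d0s0 that ls -lL reported.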

The pageout scanner runs, but finds nothing to do. It reports no pages paged out, freed, or reclaimed this time.

After eight seconds everything goes quiet, then there is a very interesting burst of activity around the 22-second mark that is all initiated by PID 3, the fsflush process.

 8034.218000      0.019000       0     	  0 	0xfbf6aec0   	 2 	 biodone        device: 8388632 block: 796976 buf: 0xf610a538
21862.015500  13827.716000       3	  1 	0xf5d8cc80	 2 	 strategy       device: 8388632 block: 240 size: 2048 buf: 0xf5e71158 flags: 9
21897.756000     35.740500       0     	  0 	0xfbf6aec0   	 2 	 biodone        device: 8388632 block: 240 buf: 0xf5e71158
21897.944000      0.188000       3     	  1 	0xf5d8cc80   	 0   	 strategy       device: 8388632 block: 16 size: 2048 buf: 0xf5adadc0 flags: 524809
21907.130500      9.186500       0     	  0 	0xfbf6aec0   	 2       biodone        device: 8388632 block: 16 buf: 0xf5adadc0

We start by seeking from block 796976 to 240 with a 36-millisecond write, followed by block 16 with a nine-millisecond write. This is probably the filesystem superblock being updated. The following writes are issued extremely close together, 14 of them in about 1.5 milliseconds, and they refer to two different devices. It starts with device 8388624 mapping to c0t2d0s0, which is a Sun 1.05-gigabyte disk holding the root filesystem, then also accesses /export/home again during the sequence.

21934.506000     27.375500       3     	  1 	0xf5d8cc80   	 0 	 strategy       device: 8388624 block: 32544 size: 8192 buf: 0xf5afd780 flags: 1033
21934.732000      0.226000       3     	  1 	0xf5d8cc80   	 0 	 strategy       device: 8388624 block: 64896 size: 8192 buf: 0xf5e71a40 flags: 1033
21934.949000      0.217000       3        1 	0xf5d8cc80   	 0 	 strategy       device: 8388624 block: 887696 size: 8192 buf: 0xf5e71ef0 flags: 1033
21935.042000      0.093000       3        1 	0xf5d8cc80   	 0 	 strategy       device: 8388624 block: 855296 size: 8192 buf: 0xf5e70e10 flags: 1033
21935.129000      0.087000       3        1     0xf5d8cc80       0       strategy       device: 8388624 block: 855248 size: 8192 buf: 0xf5e70438 flags: 1033
21935.198500      0.069500       3        1 	0xf5d8cc80   	 0 	 strategy	device: 8388632 block: 796464 size: 8192 buf: 0xf5e707f8 flags: 1033
21935.326500      0.128000       3        1 	0xf5d8cc80   	 0 	 strategy	device: 8388624 block: 678272 size: 8192 buf: 0xf5afd000 flags: 1033
21935.411000      0.084500     	 3     	  1 	0xf5d8cc80   	 0 	 strategy	device: 8388632 block: 795328 size: 2048 buf: 0xf5afe9c8 flags: 1033
21935.493500      0.082500     	 3     	  1 	0xf5d8cc80   	 0 	 strategy	device: 8388624 block: 887664 size: 8192 buf: 0xf5aff760 flags: 1033
21935.580500      0.087000     	 3     	  1 	0xf5d8cc80   	 0 	 strategy	device: 8388624 block: 194560 size: 8192 buf: 0xf5afdc30 flags: 1033
21935.676000      0.095500     	 3     	  1 	0xf5d8cc80   	 0 	 strategy	device: 8388632 block: 795344 size: 8192 buf: 0xf5afdb40 flags: 1033
21935.753000      0.077000     	 3     	  1 	0xf5d8cc80   	 0 	 strategy	device: 8388624 block: 32480 size: 8192 buf: 0xf5afe4a0 flags: 1033
21935.833000      0.080000     	 3     	  1 	0xf5d8cc80   	 0 	 strategy 	device: 8388624 block: 887680 size: 8192 buf: 0xf5afe6f8 flags: 1033
21935.911500      0.078500     	 3     	  1 	0xf5d8cc80   	 0 	 strategy	device: 8388624 block: 1629440 size: 8192 buf: 0xf5afd078 flags: 1033

The requested seek order on root is blocks 32544, 64896, 887696, 855296, 855248, 678272, 887664, 194560, 32480, 887680, 1629440. On /export/home the order is 796464, 795328, 795344 (again). That's pretty random for root.

21964.151000	 28.239500       0     	  0 	0xfbe1cec0   	 0 	 biodone	device: 8388624 block: 32544 buf: 0xf5afd780
21971.758000      7.607000       0        0 	0xfbe1cec0   	 0 	 biodone       	device: 8388632 block: 796464 buf: 0xf5e707f8
21980.053500      8.295500       0        0 	0xfbf6aec0   	 2 	 biodone       	device: 8388624 block: 64896 buf: 0xf5e71a40
21987.978000      7.924500       0        0 	0xfbf6aec0   	 2 	 biodone       	device: 8388632 block: 795328 buf: 0xf5afe9c8
21994.076500      6.098500       0        0 	0xfbe1cec0   	 0 	 biodone        device: 8388624 block: 887696 buf: 0xf5e71ef0
21995.241500      1.165000       0        0  	0xfbe1cec0  	 0 	 biodone       	device: 8388632 block: 795344 buf: 0xf5afdb40
22009.219000     13.977500       0        0 	0xfbf6aec0  	 2 	 biodone       	device: 8388624 block: 855296 buf: 0xf5e70e10
22023.829500     14.610500       0        0 	0xfbe1cec0  	 0 	 biodone       	device: 8388624 block: 855248 buf: 0xf5e70438
22037.521500     13.692000       0        0 	0xfbf6aec0  	 2 	 biodone       	device: 8388624 block: 678272 buf: 0xf5afd000
22055.083500     17.562000       0        0 	0xfbe1cec0  	 0 	 biodone       	device: 8388624 block: 887664 buf: 0xf5aff760
22077.595000     22.511500       0        0 	0xfbf6aec0  	 2 	 biodone       	device: 8388624 block: 194560 buf: 0xf5afdc30
22099.481000     21.886000       0        0 	0xfbe1cec0  	 0 	 biodone       	device: 8388624 block: 32480 buf: 0xf5afe4a0
22125.714500     26.233500       0        0 	0xfbf6aec0  	 2 	 biodone       	device: 8388624 block: 887680 buf: 0xf5afe6f8
22146.705500     20.991000       0        0 	0xfbe1cec0   	 0 	 biodone       	device: 8388624 block: 1629440 buf: 0xf5afd078

The order of completion for each disk is the same as the order of issue. There are about 1000 blocks per cylinder on this disk, so we can approximate the seek distance in cylinders. Remember that disks have variable geometry nowadays, so it's probably more than 1000 blocks per cylinder near block 0 and progressively fewer at higher block numbers. The table below shows the seek distance and timing for each request on the root disk. What we see is that 11 writes of eight kilobytes were fired off in one go, randomly spread over the disk, and that they completed sequentially, taking on average 19 milliseconds each (not that bad, and accesses to the other disk probably slowed things down a bit). The later accesses were delayed by the earlier ones sufficiently that this burst (which took only 210 milliseconds to complete) would report an average service time of 111 milliseconds in iostat.

Measured over a 10-second interval, with no other disk activity, we would see something like this:

Seek sequence details for root disk

Block      Seek (cylinders)   Seek time   Service time
32544           1333           29.65 ms     29.65 ms
64896             32           15.67 ms     45.32 ms
887696           822           13.81 ms     59.13 ms
855296       back 32           15.05 ms     74.18 ms
855248             0           14.52 ms     88.70 ms
678272      back 177           13.50 ms    102.20 ms
887664           209           17.39 ms    119.59 ms
194560      back 693           22.42 ms    142.01 ms
32480       back 162           21.72 ms    163.73 ms
887680           855           26.15 ms    189.88 ms
1629440          742           20.91 ms    210.79 ms
Average                        19.16 ms    111.38 ms
% iostat -x 10
...
                                 extended disk statistics 
disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b 
sd2       0.0  1.1    0.0    8.8  0.0  0.0  111.4   0   2 

Now that looks very familiar. It seems that we have found the problem...
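
To make the arithmetic explicit, here is a small C sketch of the queueing effect behind the table above. It takes the per-request disk times from the table and treats all 11 writes as queued at the same instant (they were actually issued within about 1.5 milliseconds of each other), so each request's response time is its own disk time plus the disk time of everything queued ahead of it.

/* svcavg.c -- reproduce the averages in the seek sequence table.
 * The disk works through the queued writes one at a time, so the
 * response time seen by request i is the sum of disk times 0..i.
 * iostat's svc_t is the average of these response times, not of
 * the raw disk times.
 */
#include <stdio.h>

int
main(void)
{
	/* per-request disk times (ms) for the root disk, from the table */
	static const double disk_ms[] = {
		29.65, 15.67, 13.81, 15.05, 14.52, 13.50,
		17.39, 22.42, 21.72, 26.15, 20.91
	};
	const int n = sizeof (disk_ms) / sizeof (disk_ms[0]);
	double finish = 0.0, sum_disk = 0.0, sum_resp = 0.0;
	int i;

	for (i = 0; i < n; i++) {
		finish += disk_ms[i];	/* when request i completes */
		sum_disk += disk_ms[i];
		sum_resp += finish;	/* queue wait + disk time */
	}
	printf("average disk time     %6.2f ms\n", sum_disk / n);
	printf("average response time %6.2f ms\n", sum_resp / n);
	return (0);
}

It prints 19.16 ms and 111.38 ms -- the same averages as the table, and the same svc_t that iostat reported.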

I rest my case M'lud
I accuse Mr. Fsflush of killing the service time in the kernel with a burst! In most cases, applications generate access patterns that are either sequential bursts (large transfers) or separate random accesses spread over time. In this case a long, completely random sequence is generated in such a short time that a queue forms. It is all over very quickly, and only occurs every 30 seconds or so, but it is quite likely to happen on any disk that contains a filesystem. The trace data shown was captured at the first attempt, so it isn't hard to catch this happening.

Wrap up
I'd like to thank my colleagues Dave Fisk and Richard McDougall, who have used TNF much more extensively than I have, for their advice. Their ideas on TNF's strong points and shortcomings are much appreciated.

I think I just laid to rest another one of those frequently asked questions that has been bugging many people (myself included) for a long time. When I sat down to write this I knew the problem was related to fsflush, but was not sure exactly what the trace would show.

I hope you are also inspired to start using TNF. It can illuminate all kinds of situations for you, and at user level is a great way of embedding timing points into applications. A dormant probe point is extremely lightweight, so there is little penalty, and probes can be left in production code. I'm always surprised by each new tool I try. Every time I look at a familiar situation from a new angle something unexpected becomes obvious. It seems that diversity of attack is a useful technique for solving performance problems.
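
To show what a user-level timing point looks like, here is a minimal sketch using the TNF_PROBE macros from <tnf/probe.h>. The probe names, the "work" key, and the argument are purely illustrative; compile with -ltnfprobe and run the program under prex to enable the probes and collect a trace file that tnfdump can display.

/* work.c -- a pair of illustrative user-level timing probes.
 * Compile:  cc -o work work.c -ltnfprobe
 * Then run it under prex, enable and trace the probes, and look
 * at the resulting trace file with tnfdump.
 */
#include <unistd.h>
#include <tnf/probe.h>

static void
do_work(long n)
{
	/* mark the start of the interval, tagging it with the size */
	TNF_PROBE_1(work_start, "work", "",
	    tnf_long, bytes, n);

	sleep(1);		/* stand-in for the code being timed */

	/* mark the end of the interval */
	TNF_PROBE_0(work_end, "work", "");
}

int
main(void)
{
	do_work(8192);
	return (0);
}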



