Disks can behave enigmatically. The only way to find out what is really happening is to delve beneath the averages shown by
iostat. This month we'll use the Solaris 2.5 trace system to see each individual disk I/O and solve a riddle that has puzzled people for years. (3,000 words)
                      extended disk statistics
disk      r/s  w/s   Kr/s  Kw/s wait actv  svc_t  %w  %b
sd2       1.3  0.3   11.7   3.3  0.1  0.1  146.6   0   3
sd3       0.0  0.1    0.1   0.7  0.0  0.0  131.0   0   0
--Worried in Weyauwega
A: This is one of those recurring questions that everyone seems to ask at one time or another. The short answer is that it can safely be ignored, as the disks are so lightly used that they don't make a difference to the performance of the system.
It's a rather unsatisfying answer, however, as it doesn't explain why the large service times occur in the first place. There are several theories circulating. One is that the numbers are simply wrong, but the algorithms used by iostat are well tested and correct. A more likely candidate is the fsflush process (pid 3), which keeps the data on disk in sync with the data in memory at regular intervals. It normally runs every five seconds to flush data, and does additional work to flush inodes every 30 seconds. This is frequent enough to make it a good candidate for the problem, and it would also explain why the effect is not seen on raw disks.
The only way to settle this is to trace each individual access, to see which process initiated the I/O, how long it took, and what else is happening at the same time. Solaris 2.5 introduced a new tracing capability that we can use, so I'll start by explaining how it works, then use it to capture some I/O processes.
The Solaris 2.5 trace capability
Unix systems have had a kernel trace capability for many years. It was designed to be used for development and debugging purposes, not for end users. The production kernel is normally built without the trace capability for performance reasons. One of the first production kernels to include tracing was IBM's AIX kernel on the RS/6000 family. IBM left it turned on during early releases to assist in debugging, then decided that it was useful enough to pay its way, and the overhead was quite low, so it is now a permanent feature. SunSoft also recognized the value of trace information, but decided to extend the trace capability to make it more generally useful, and implement it alongside the existing kernel trace system. It was introduced in Solaris 2.5 and consists of the following features.
prex" that controls probe execution for both user and kernel traces.
tnfxtractthat reads out the kernel trace buffer, and one called
tnfdumpto display TNF data in human-readable ascii.
There are a few things about kernel probes that are inconvenient.
While user-level probes can write to trace files, the kernel probes
write to a ring buffer. This buffer is not a single global ring. It is
a buffer-per-kernel thread. This avoids any need to lock the data
structures, so there is no performance loss or contention on
multiprocessor systems. You cannot easily tell how big the buffer
needs to be, and one highly active probe point may loop right around its
buffer while others have hardly started. If you are trying to capture
every single probe, you need to make the buffer as big as you can. In
general it is best to work with low-event rate probes or rely on
sampling, and put up with missing probes. The tnfxtract
routine just takes a snapshot of the buffer, so a second snapshot will
include anything left over from the first one. The tnfdump
program does quite a lot of work to sort the probe events into time order.
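Purely as an illustration of why the buffer-per-thread design needs no locking (this is a sketch, not the kernel's actual code), recording an event into a ring that only its owning thread ever writes looks roughly like this in C:

#define SLOTS 1024                      /* illustrative buffer size */

struct event {
    unsigned long long timestamp;       /* when the probe fired */
    int                probe_id;        /* which probe fired */
    long               arg;             /* one probe argument */
};

struct ring {
    unsigned int next;                  /* only the owning thread touches this */
    struct event slot[SLOTS];
};

/* Plain stores, no lock, because each kernel thread has its own ring.
 * Once the ring wraps, the oldest events are silently overwritten, which
 * is why one busy probe can loop around while others have barely started. */
static void
record(struct ring *r, int probe_id, unsigned long long now, long arg)
{
    struct event *e = &r->slot[r->next % SLOTS];

    e->timestamp = now;
    e->probe_id  = probe_id;
    e->arg       = arg;
    r->next++;
}

Because only the owning thread ever writes its ring there is nothing to lock; the price is exactly the overwriting behavior described above.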
Obtaining an I/O trace
The command sequence is quite simple. The commands must be run as root, and a directory is needed to hold the output. I find it easiest to have two windows open. One to run
prex, and the other to go
through a cycle of extract, dump, and viewing the data as required. The
command sequence for
prex is to first allocate a buffer
(the default is 384 kilobytes -- you can make it bigger), enable the
"io" group of probes, make them trace accesses, then turn on the global
flag that enables all kernel tracing.
# prex -k
Type "help" for help ...
prex> buffer alloc
Buffer of size 393216 bytes allocated
prex> enable io
prex> trace io
prex> ktrace on
Now wait a while or run the program you want to trace. In this case
I ran iostat -x 10 in another window, didn't try to cause
any activity, and waited for some slow service time to appear. After a
minute or so I stopped collecting.
prex> ktrace off

In the other window I extracted the data and dumped it to take a look.

# mkdir /tmp/tnf
# cd /tmp/tnf
# tnfxtract io.tnf
# tnfdump io.tnf | more
The first section identifies the three probes that were enabled because they have the key "io." The "strategy" probe records an I/O being issued, and "biodone" records an I/O completing. The other mechanism that may cause I/O is the pageout scanner, and this is also being probed. The location of the probe in the Solaris source code is printed for kernel developers to refer to.
probe tnf_name: "strategy" tnf_string: "keys io blockio;file ../../common/os/driver.c;line 358;"
probe tnf_name: "pageout" tnf_string: "keys vm pageio io;file ../../common/vm/vm_pvn.c;line 511;"
probe tnf_name: "biodone" tnf_string: "keys io blockio;file ../../common/os/bio.c;line 935;"
The next section is a table of probe events in time order. Where a user process was scheduling the I/O, its PID is recorded. Events caused by interrupts or kernel activity are recorded as PID 0. This trace was measured on a dual-CPU SPARCstation 10, so you can see a mixture of CPU 0 and CPU 2. I've listed the first few records then skipped through looking for something interesting.
Elapsed (ms)  Delta (ms)  PID LWPID        TID CPU Probe Name  Data / Description
. . .
    0.000000     0.000000  632     1 0xf61a1d80   0 strategy   device: 8388632 block: 795344 size: 8192 buf: 0xf5afdb40 flags: 9
  108.450000   108.450000    0     0 0xfbf6aec0   2 biodone    device: 8388632 block: 795344 buf: 0xf5afdb40
  108.977000     0.527000  632     1 0xf61a1d80   0 strategy   device: 8388632 block: 795086 size: 1024 buf: 0xf610a358 flags: 524569
  121.557000    12.580000    0     0 0xfbf6aec0   2 biodone    device: 8388632 block: 795086 buf: 0xf610a358
  121.755000     0.179000  632     1 0xf61a1d80   0 pageout    vnode: 0xf61d0a48 pages_pageout: 0 pages_freed: 0 pages_reclaimed: 0
The first strategy routine is an eight-kilobyte write (flags bit 0x40 would be set if it were a read) to block 795344. To decode the device number, 8388632 = 0x800018, so the minor number is 0x18 = 24; running ls -lL on /dev/dsk shows that c0t3d0s0 is minor device 24, which is mounted as /export/home. It's an old Sun 424-megabyte drive, so valid blocks run from 0 to 828719. This is a seek right to the end of the disk, and it takes 108 milliseconds. It is followed by a one-kilobyte write to a nearby block, which takes only 12.58 milliseconds.
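If you find yourself decoding a lot of these device numbers, the arithmetic is easy to automate. Here is a small hand-rolled sketch in C, assuming the usual Solaris 2.x layout of an expanded dev_t with an 18-bit minor number (the constants are written out by hand here rather than taken from <sys/mkdev.h>):

/* devdecode.c -- decode the "device:" numbers printed in the trace.
 * Assumes the SVR4 expanded dev_t layout used by Solaris 2.x, in which
 * the low 18 bits are the minor number and the rest the major number.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBITSMINOR 18
#define MAXMIN     ((1UL << NBITSMINOR) - 1)

int
main(int argc, char *argv[])
{
    unsigned long dev;
    int i;

    for (i = 1; i < argc; i++) {
        dev = strtoul(argv[i], NULL, 0);    /* accepts decimal or hex */
        printf("%lu = major %lu, minor %lu\n",
               dev, dev >> NBITSMINOR, dev & MAXMIN);
    }
    return (0);
}

Feeding it 8388632 prints major 32, minor 24, which is the minor device 24 we just looked up with ls -lL.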
The pageout scanner runs, but finds nothing to do. It reports no pages paged out, freed, or reclaimed this time.
After eight seconds everything goes quiet, then there is a very
interesting burst of activity around the 22-second mark that is all
initiated by PID 3, the fsflush process.
 8034.218000     0.019000    0 0 0xfbf6aec0 2 biodone  device: 8388632 block: 796976 buf: 0xf610a538
21862.015500 13827.716000    3 1 0xf5d8cc80 2 strategy device: 8388632 block: 240 size: 2048 buf: 0xf5e71158 flags: 9
21897.756000    35.740500    0 0 0xfbf6aec0 2 biodone  device: 8388632 block: 240 buf: 0xf5e71158
21897.944000     0.188000    3 1 0xf5d8cc80 0 strategy device: 8388632 block: 16 size: 2048 buf: 0xf5adadc0 flags: 524809
21907.130500     9.186500    0 0 0xfbf6aec0 2 biodone  device: 8388632 block: 16 buf: 0xf5adadc0
We start by seeking from block 796976 to 240 with a 36-millisecond
write, followed by block 16 with a nine-millisecond write. This is
probably the filesystem superblock being updated. The following
writes are issued extremely close together, 14 of them in about 1.5
milliseconds, and they refer to two different devices. It starts with
device 8388624 mapping to c0t2d0s0, which is a Sun 1.05-gigabyte
disk holding the root filesystem, then also accesses
/export/home again during the sequence.
21934.506000    27.375500    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 32544 size: 8192 buf: 0xf5afd780 flags: 1033
21934.732000     0.226000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 64896 size: 8192 buf: 0xf5e71a40 flags: 1033
21934.949000     0.217000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 887696 size: 8192 buf: 0xf5e71ef0 flags: 1033
21935.042000     0.093000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 855296 size: 8192 buf: 0xf5e70e10 flags: 1033
21935.129000     0.087000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 855248 size: 8192 buf: 0xf5e70438 flags: 1033
21935.198500     0.069500    3 1 0xf5d8cc80 0 strategy device: 8388632 block: 796464 size: 8192 buf: 0xf5e707f8 flags: 1033
21935.326500     0.128000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 678272 size: 8192 buf: 0xf5afd000 flags: 1033
21935.411000     0.084500    3 1 0xf5d8cc80 0 strategy device: 8388632 block: 795328 size: 2048 buf: 0xf5afe9c8 flags: 1033
21935.493500     0.082500    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 887664 size: 8192 buf: 0xf5aff760 flags: 1033
21935.580500     0.087000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 194560 size: 8192 buf: 0xf5afdc30 flags: 1033
21935.676000     0.095500    3 1 0xf5d8cc80 0 strategy device: 8388632 block: 795344 size: 8192 buf: 0xf5afdb40 flags: 1033
21935.753000     0.077000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 32480 size: 8192 buf: 0xf5afe4a0 flags: 1033
21935.833000     0.080000    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 887680 size: 8192 buf: 0xf5afe6f8 flags: 1033
21935.911500     0.078500    3 1 0xf5d8cc80 0 strategy device: 8388624 block: 1629440 size: 8192 buf: 0xf5afd078 flags: 1033
The requested seek order on root is blocks 32544, 64896, 887696,
855296, 855248, 678272, 887664, 194560, 32480, 887680, 1629440. On
/export/home the order is 796464, 795328, 795344 (again).
That's pretty random for root.
21964.151000    28.239500    0 0 0xfbe1cec0 0 biodone  device: 8388624 block: 32544 buf: 0xf5afd780
21971.758000     7.607000    0 0 0xfbe1cec0 0 biodone  device: 8388632 block: 796464 buf: 0xf5e707f8
21980.053500     8.295500    0 0 0xfbf6aec0 2 biodone  device: 8388624 block: 64896 buf: 0xf5e71a40
21987.978000     7.924500    0 0 0xfbf6aec0 2 biodone  device: 8388632 block: 795328 buf: 0xf5afe9c8
21994.076500     6.098500    0 0 0xfbe1cec0 0 biodone  device: 8388624 block: 887696 buf: 0xf5e71ef0
21995.241500     1.165000    0 0 0xfbe1cec0 0 biodone  device: 8388632 block: 795344 buf: 0xf5afdb40
22009.219000    13.977500    0 0 0xfbf6aec0 2 biodone  device: 8388624 block: 855296 buf: 0xf5e70e10
22023.829500    14.610500    0 0 0xfbe1cec0 0 biodone  device: 8388624 block: 855248 buf: 0xf5e70438
22037.521500    13.692000    0 0 0xfbf6aec0 2 biodone  device: 8388624 block: 678272 buf: 0xf5afd000
22055.083500    17.562000    0 0 0xfbe1cec0 0 biodone  device: 8388624 block: 887664 buf: 0xf5aff760
22077.595000    22.511500    0 0 0xfbf6aec0 2 biodone  device: 8388624 block: 194560 buf: 0xf5afdc30
22099.481000    21.886000    0 0 0xfbe1cec0 0 biodone  device: 8388624 block: 32480 buf: 0xf5afe4a0
22125.714500    26.233500    0 0 0xfbf6aec0 2 biodone  device: 8388624 block: 887680 buf: 0xf5afe6f8
22146.705500    20.991000    0 0 0xfbe1cec0 0 biodone  device: 8388624 block: 1629440 buf: 0xf5afd078
The order of completion for each disk is the same as the order of
issue. There are about 1000 blocks per cylinder on this disk, so we can
approximate the number of cylinders moved per seek. Remember that disks are
variable geometry nowadays, so it's probably more than 1000 blocks per
cylinder near block 0 and progressively less at higher block numbers.
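For a rough feel of the distances involved, here is the sort of back-of-an-envelope calculation I mean, written as a throwaway C program. It is a sketch only: it assumes a flat 1000 blocks per cylinder, which, as just noted, is only approximately true on a zoned drive.

#include <stdio.h>
#include <stdlib.h>

/* Approximate cylinders moved per seek for the root-disk burst, using the
 * block numbers from the strategy records above. */
int
main(void)
{
    long blocks[] = { 32544, 64896, 887696, 855296, 855248, 678272,
                      887664, 194560, 32480, 887680, 1629440 };
    int n = sizeof (blocks) / sizeof (blocks[0]);
    int i;

    for (i = 1; i < n; i++)
        printf("seek %2d: about %ld cylinders\n",
               i, labs(blocks[i] - blocks[i - 1]) / 1000);
    return (0);
}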
For the root disk, only the delay and seek distances are shown in the
table. What we see is that 11 writes of eight kilobytes were fired off
in one go, randomly spread over the disk, and that they completed
sequentially, taking on average 19 milliseconds each (not that bad, and
accesses to the other disk probably slowed it down a bit). The later
accesses were delayed by the earlier ones sufficiently that this burst
(which only took 210 milliseconds to complete) would report an average
service time of 111 milliseconds in iostat.
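It is worth checking that arithmetic. Treating the burst as 11 writes queued at the same instant and served first-come, first-served at a flat 19 milliseconds each (a simplification -- the real per-write times vary with seek distance), a few lines of C reproduce the numbers:

#include <stdio.h>

int
main(void)
{
    int    n = 11;              /* writes issued in one burst */
    double service = 19.0;      /* rough milliseconds per write, from the trace */
    double total_response = 0.0;
    int    i;

    /* The i-th request has to wait for i full services before it completes,
     * so its response time is i * service even though the disk is only busy
     * for service milliseconds on its behalf. */
    for (i = 1; i <= n; i++)
        total_response += i * service;

    printf("burst completes in about %.0f ms\n", n * service);
    printf("mean response (what iostat calls svc_t) about %.0f ms\n",
           total_response / n);
    return (0);
}

It prints a burst of about 209 milliseconds and a mean response of about 114 milliseconds, in the same ballpark as the 210 milliseconds and 111 milliseconds seen in the trace and reported by iostat.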
[Table: Seek sequence details for root disk -- columns: Block, seek cylinders, seek time, service time]

Measured over a 10-second interval, and if there were no other disk activity, we would see something like this:
% iostat -x 10
...
                      extended disk statistics
disk      r/s  w/s   Kr/s  Kw/s wait actv  svc_t  %w  %b
sd2       0.0  1.1    0.0   8.8  0.0  0.0  111.4   0   2
Now that looks very familiar. It seems that we have found the problem...
I rest my case, M'lud
I accuse Mr. Fsflush of killing the service time, in the
kernel, with a burst of writes! In most cases, applications generate access
patterns that are either sequential bursts (large transfers) or
separate random accesses in a sequence over time. In this case a long,
completely random sequence is being generated in such a short time that
a queue forms. It is all over very quickly, and only occurs every
30 seconds or so, but is quite likely to happen on any disk that
contains a filesystem. The trace data shown was taken at the first
attempt, so it isn't hard to find this happening.
I'd like to thank my colleagues Dave Fisk and Richard McDougall, who have used TNF much more extensively than I have, for their advice. Their ideas on TNF's strong points and shortcomings are much appreciated.
I think I just laid to rest another one of those frequently asked
questions that has been bugging many people (myself included) for a
long time. When I sat down to write this I knew the problem was related
to fsflush, but was not sure exactly what the trace would show.
I hope you are also inspired to start using TNF. It can illuminate all kinds of situations for you, and at user level is a great way of embedding timing points into applications. A dormant probe point is extremely lightweight, so there is little penalty, and probes can be left in production code. I'm always surprised by each new tool I try. Every time I look at a familiar situation from a new angle something unexpected becomes obvious. It seems that diversity of attack is a useful technique for solving performance problems.
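To make the user-level side concrete, here is a minimal sketch of what embedding a probe point looks like, using the TNF_PROBE macros from <tnf/probe.h>. The program, function, and probe names are invented for illustration; check the TNF_PROBE manual page on your release for the precise macro arguments.

/* work.c -- hypothetical example of user-level TNF probe points. */
#include <tnf/probe.h>

static void
do_work(int iteration)
{
    /* Record the start of a unit of work, tagged with the key "work" and
     * carrying the iteration number as an argument. */
    TNF_PROBE_1(work_start, "work", "",
                tnf_long, iteration_num, iteration);

    /* ... the real work would happen here ... */

    /* Record the end, so tnfdump can show how long each iteration took. */
    TNF_PROBE_0(work_end, "work", "");
}

int
main(void)
{
    int i;

    for (i = 0; i < 10; i++)
        do_work(i);
    return (0);
}

Compile with cc -o work work.c -ltnfprobe, run it under prex to enable the "work" probes, then use tnfdump on the resulting trace file to see the per-iteration timings.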
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at firstname.lastname@example.org.