Performance Q & A by Adrian Cockcroft

What's the holdup?

How is wait time measured? Are the results exaggerated?

SunWorld
October  1998

Abstract
Overall CPU time is sometimes reported as user, system, wait, and idle. The concept of wait time is confusing, and in some circumstances an inflated value is reported. Adrian explains what it all means and describes a more accurate measurement method. (1,600 words)


Q: Some tools such as vmstat don't report wait time, but iostat and sar do. What does it mean and why does it get so high on some of my large multiprocessor servers?

A: This is one of the most common "frequently asked questions." I'll describe how wait time is measured, what it does and doesn't include, and the problems that occur in current implementations. We've changed the way this value is calculated in Solaris 7, which produces more accurate results.

As I mentioned in my June 1998 column (see the Resources section below), CPU time is measured using the clock interrupt, which samples the state of each CPU one hundred times a second. If the CPU is idle, an extra check is made to see whether any block-device disk I/O requests are outstanding. If there are, the time is counted as I/O wait rather than idle time. The only possible cause of wait time is normal disk I/O to a block device. Nothing else is counted: no network waits, and no semaphore or database lock waits. If your system uses asynchronous I/O to raw disks, that isn't counted as a cause of CPU wait either.
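To make the per-tick accounting concrete, here is a minimal sketch in C of the kind of check made at each clock tick. It is my own illustration rather than the actual Solaris kernel code, and the names (account_tick, nblocked_io, and so on) are hypothetical.

#include <stdio.h>

struct cpu_ticks {
    long user, system, wait, idle;
};

/* Called 100 times a second for a CPU: classify this tick. */
static void account_tick(struct cpu_ticks *t, int running_user,
                         int running_kernel, int nblocked_io)
{
    if (running_user)
        t->user++;                /* running a user process          */
    else if (running_kernel)
        t->system++;              /* running in the kernel           */
    else if (nblocked_io > 0)
        t->wait++;                /* idle, but block I/O outstanding */
    else
        t->idle++;                /* truly idle                      */
}

int main(void)
{
    struct cpu_ticks t = { 0, 0, 0, 0 };
    int i;

    /* Simulate one second: 60 user ticks, 20 system ticks,
       10 idle ticks with block I/O pending, 10 truly idle ticks. */
    for (i = 0; i < 60; i++) account_tick(&t, 1, 0, 0);
    for (i = 0; i < 20; i++) account_tick(&t, 0, 1, 0);
    for (i = 0; i < 10; i++) account_tick(&t, 0, 0, 1);
    for (i = 0; i < 10; i++) account_tick(&t, 0, 0, 0);

    printf("us %ld sy %ld wt %ld id %ld\n", t.user, t.system, t.wait, t.idle);
    return 0;
}

Over a measurement interval, tools such as vmstat and iostat simply turn these tick counts into the us, sy, wt, and id percentages they report.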

The block I/O request queue is reported by vmstat in its second column. To show what happens, three copies of find were run on the same disk. The output from both vmstat and iostat is shown.

Figure 1: Example vmstat and iostat output

% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s3 -- -- --   in   sy   cs us sy id
 0 0 0  12520 11204   0   3  7  4  5  0 12  2  0  0  0   75  615   84  6  3 91
 0 0 0 140328 17144   0  21 66  0  0  0  0 23  0  0  0  183  450  109 16  8 75
 0 1 0 140292 16508   0   0 131 0  0  0  0 46  0  0  0  342  718  150 17 14 68
 0 1 0 140136 15824   0  43 76  0  0  0  0 41  0  0  0  278 2390  181 39 31 29
 0 2 0 140016 15508   0   0 27  0  0  0  0 44  0  0  0  332  981  125 17 12 71
 0 2 0 140016 15332   0   0 42  0  0  0  0 40  0  0  0  202  550  143  8 10 82
 0 1 0 140016 15176   0   0 24  0  0  0  0 41  0  0  0  206  320   87  5  6 90

% iostat -xc 5
                               extended device statistics         cpu
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b  us sy wt id
sd3       1.4  0.8    7.3   10.3  0.7  0.1  364.2   1   4   6  3  3 88
                               extended device statistics         cpu
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b  us sy wt id
sd3       0.6  2.4    2.4   18.6  0.0  0.0   12.0   0   4   5  1  4 90
                               extended device statistics         cpu
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b  us sy wt id
sd3      12.6  1.2  100.6    5.9  0.0  0.4   29.5   0  33  19 14 28 39
                               extended device statistics         cpu
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b  us sy wt id
sd3      24.2 16.6  193.6  124.4  0.9  2.2   77.9  26 100  42 32 26  0
                               extended device statistics         cpu
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b  us sy wt id
sd3      21.0 18.8  168.0  150.4  0.7  2.4   78.4  33 100  14 18 69  0

You can see that after a few seconds one or two blocked processes appear in the "b" column of vmstat. This corresponds to idle time moving to wait time as reported by iostat. The two commands weren't synchronized, so the output doesn't match exactly. Also, the first line of output from each command should be ignored, as it's a long-term average since boot.

This output was collected on a uniprocessor system. When there's only one CPU involved, and it's idle, then the CPU must truly be waiting for the I/O to complete. If the I/O completed more quickly, then the CPU could do more work, so wait time is the proportion of the time that the system is I/O bound.

On a multiprocessor system the situation gets more complex. All measurements are made on a per-processor basis, and these values can be seen using mpstat. During the clock interrupt, the number of blocked I/O requests is added up across all the CPUs. If this total is nonzero, then, when the CPU state is sampled, every idle CPU is recorded as waiting rather than idle.
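The following sketch, again using hypothetical names and my own simplification of the logic, illustrates that calculation: the per-CPU blocked counts are summed system-wide before each idle CPU is classified.

#include <stdio.h>

#define NCPU 9

int main(void)
{
    int  blocked[NCPU]    = { 0 };  /* blocked block-I/O requests, per CPU */
    long wait_ticks[NCPU] = { 0 };
    long idle_ticks[NCPU] = { 0 };
    int  total_blocked    = 0;
    int  c;

    blocked[3] = 1;                 /* a single find has one request pending */

    /* One clock tick, with every CPU otherwise idle.  The per-CPU
       blocked counts are first summed across the whole system... */
    for (c = 0; c < NCPU; c++)
        total_blocked += blocked[c];

    /* ...and every idle CPU is then charged wait, not idle. */
    for (c = 0; c < NCPU; c++) {
        if (total_blocked > 0)
            wait_ticks[c]++;
        else
            idle_ticks[c]++;
    }

    for (c = 0; c < NCPU; c++)
        printf("cpu %d  wt %ld  idl %ld\n", c, wait_ticks[c], idle_ticks[c]);
    return 0;
}

All nine CPUs record a wait tick, although only one of them could possibly service the outstanding request.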

Even a single blocked I/O request can change every CPU from idle to "waiting." Because only one CPU can process that request, this is obviously an overestimate. However, this is the way most Unix systems calculate the value. The resulting mpstat output is shown below for a nine-CPU system running Solaris 2.6 that was idle apart from a single find command. You can see that there is some activity on CPU 9, but every CPU is now shown as waiting.



Figure 2: Example mpstat output on Solaris 2.6

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0    8    0    0    0    0     0    0   0  92   8
  1    0   0    0     1    1    6    0    0    0    0     0    0   0  92   8
  8    0   0    0   368  368   12    0    0    0    0     4    0   0  92   8
  9    0 103 2089   201    1  237    1    0    0    0  1395    4   4  92   0
 12    0   0    0     0    0    0    0    0    0    0     0    0   0  92   8
 13    0   0    0     0    0    6    0    0    0    0     0    0   0  92   8
 16    0   0    0     0    0    5    0    0    0    0     0    0   0  92   8
 20    0   0    0     0    0    9    0    0    0    0     0    0   0  92   8
 21    0   0    0     1    1    7    0    0    0    0    29    0   0  92   8

A work-around for this inflation is to divide the reported wait time by the number of CPUs and add the difference back into idle time. This prevents the gross exaggeration of wait time, but it still isn't accurate.
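As a rough illustration of the arithmetic, here is a sketch of that work-around applied to the per-CPU figures in Figure 2. It is post-processing of the reported percentages, not code from any Sun tool.

#include <stdio.h>

int main(void)
{
    double wt   = 92.0;   /* wait percentage reported per CPU in Figure 2 */
    double idl  = 8.0;    /* idle percentage reported per CPU             */
    int    ncpu = 9;

    double adj_wt  = wt / ncpu;            /* scale the wait time down     */
    double adj_idl = idl + (wt - adj_wt);  /* return the rest to idle time */

    printf("adjusted: wt %.1f%%  idl %.1f%%\n", adj_wt, adj_idl);
    return 0;
}

This reports roughly 10 percent wait and 90 percent idle, which is less misleading than 92 percent wait but is still only a guess at how many CPUs were really waiting.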

Some time ago, while looking at the algorithm, I noticed that the number of blocked I/O requests was already maintained on a per-CPU basis. This means that the per-CPU blocked value could be used in the per-CPU calculation of wait time. I filed a bug, and it was accepted as a change for Solaris 2.7, which will be released shortly.

With the same workload, the difference can be seen using mpstat. The I/O requests move from one CPU to another, so when examined at 5-second intervals the wait time can be spread over several CPUs. However, the overall level of wait time is far lower, because it now accurately represents the number of CPUs that were waiting at each measurement interval. In this case, the command stayed primarily on CPU 5, and only that CPU shows high wait time on this 18-processor system.
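A sketch of the corrected per-tick logic, with the same hypothetical names as the earlier sketch, shows the difference: each CPU consults only its own blocked count, so a single outstanding request marks at most one idle CPU as waiting.

#include <stdio.h>

#define NCPU 18

int main(void)
{
    int  blocked[NCPU]    = { 0 };  /* blocked block-I/O requests, per CPU */
    long wait_ticks[NCPU] = { 0 };
    long idle_ticks[NCPU] = { 0 };
    int  c;

    blocked[5] = 1;                 /* the find I/O is queued from CPU 5 */

    /* One clock tick, with every CPU otherwise idle: each CPU checks
       its own queue rather than the system-wide total. */
    for (c = 0; c < NCPU; c++) {
        if (blocked[c] > 0)
            wait_ticks[c]++;
        else
            idle_ticks[c]++;
    }

    for (c = 0; c < NCPU; c++)
        printf("cpu %d  wt %ld  idl %ld\n", c, wait_ticks[c], idle_ticks[c]);
    return 0;
}

Only one CPU records a wait tick and the rest stay idle, which is the behavior seen in the mpstat output below.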

Figure 3: Example mpstat output on Solaris 2.7

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0    7    0    0    0    0    15    0   0   0 100
  1    0   0    0    79    7    6    0    0    0    0     1    0   3   0  97
  4    0   0    0     1    1    2    0    0    0    0     7    0   0   0 100
  5    0  94 8063     2    0  201    1    0    0    0  1257    4   7  89   0
  8    0   0    0     1    1    3    0    0    0    0   306    0   0   0 100
  9    0   0  108     0    0    8    0    0    1    0     2    0   0   1  99
 12   20   0    6     0    0    8    0    0    0    0    28    0   0   0 100
 13    0   0    0     1    1   10    0    0    1    0     0    0   0   0 100
 16    0   0    0     0    0    8    0    0    0    0     1    0   0   0 100
 17    0   0    0     1    1   11    0    0    0    0     4    0   0   0 100
 20    0   0    0     0    0    7    0    0    0    0     0    0   0   0 100
 21    0   0    6   342  342    1    0    0    0    0    63    0   0   0 100
 24    0   0   68   205    5    0    0    0    0    0     0    0   0   0 100
 25    0   0    0     0    0    7    0    0    0    0     0    0   0   0 100
 26    0   0    0     0    0    5    0    0    0    0     1    0   0   0 100
 27    0   0    0     1    1    6    0    0    0    0     4    0   0   0 100
 28    0   0    0     0    0    5    0    0    0    0    17    0   0   0 100
 29    0   0    0     1    1    8    0    0    0    0     3    0   0   0 100

Wrap up
Even though I've gone over this issue many times, the wait time question keeps coming back. Finally, with a fix in the next release of Solaris, we should have a more useful and accurate measure of how much of the time a large multiprocessor system is I/O bound.




Resources

Other Cockcroft columns at www.sun.com

About the author
Adrian Cockcroft joined Sun Microsystems in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning -- Java and the Internet, both published by SunSoft Press PTR Prentice Hall. Reach Adrian at adrian.cockcroft@sunworld.com.

