Processor utilization and wait I/O -- a look inside

Have high percentages of wait I/O time caused you endless hours of looking for I/O bottlenecks, only to come up empty? How is it that a system can have a well-tuned I/O subsystem and still show high wait I/O times? Read on and discover the answers.
In the never-ending pursuit of optimal system performance, the iterative process of looking at system utilization and figuring out which knobs to turn is not always straightforward. In particular, any significant amount of time spent waiting for I/O (Input/Output) typically results in a search for I/O bottlenecks in the system. In this article, we will look at how wait I/O time is calculated internally in SunOS. This will provide a solid understanding of how processor utilization is measured, show how the numbers really get generated, and why they can sometimes be misleading. (3,300 words)
The goal of this column is to take a topic of general interest around the implementation and use of Solaris and provide detailed technical information (based on source code reading, as well as knowledge transfer from the engineering community and other subject matter experts) that will help clarify a particular point. When applicable, examples will be provided that serve to illustrate a particular issue.
Readers are, of course, encouraged to send suggestions on topics they would like to see covered, as well as comments and questions on anything discussed.
The tools of the trade
Any system performance analysis or load characterization effort should always begin by looking at the utilization of the system's hardware resources. At a macroscopic level, these resources can be grouped as processors, physical memory (RAM), the storage I/O subsystem (disk), and the network interfaces. Determining the general utilization of a system running SunOS begins with the running of some bundled utilities. Specifically, sar(1M), vmstat(1M), and mpstat(1M) are used as the starting point for understanding what the overall system load looks like. It may very well be that the initial effort is not undertaken specifically to locate a system bottleneck, but rather as a general information gathering excursion to characterize the system load for capacity planning or documentation purposes.
Whatever the reason, the aforementioned tools are the perfect place to begin, since they are bundled, documented, easy to use, and, in general, relatively simple to interpret. For example, processor utilization is determined by looking at a few select bits of data, depending on which utility you are using. SunOS reports that a processor will spend its time doing one of four things: running in user mode, running in system mode, running the idle thread, or waiting for I/O. The first three are pretty easy to understand, but waiting for I/O is often confusing to people analyzing system behavior, because intuitively one is compelled to think that any significant amount of time spent waiting for I/O indicates that an I/O bottleneck must exist. Also, it is not always clear whether the I/O is strictly disk I/O, or perhaps includes other device I/O, such as tape devices or network I/O.
Regardless of which utility is used, processor utilization is reported in percentages, relative to the sampling period:
# mpstat 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 10    0   0    8   209    9   13    0    0    0    0     2    0   0   0 100
 11    0   0  189     0    0    4    0    0    0    0     0    0   0   0 100
 14    0   0    0     0    0    9    0    0    0    0   121    0   0   0 100
 15    0   0   21     3    3   16    0    0    0    0    21    0   0   0 100
The mpstat(1M) command provides statistical information for every processor installed and online in the system. (Processors can be taken offline with psradm(1M). Offline processors are not reported by mpstat.) The above sample is one snapshot of a four-processor Enterprise 3000 system. The last four columns of mpstat are processor utilization, indicating the percent of time for the sampling period the processors spent in user mode, system mode, waiting for I/O, or idle. For each line, the sum of the last four columns should account for 100 percent of time spent.
# sar -u 5 5

SunOS allaire 5.5.1 Generic_103640-08 sun4u    06/16/97

21:40:07    %usr    %sys    %wio   %idle
21:40:12       0       0       1      98
21:40:17       0       0       1      98
The sar(1M) command with the "-u" flag provides system utilization information. Unlike mpstat(1M), sar(1M) does not break out utilization on a per-processor basis. Instead, sar(1M) averages the time spent for all processors. The columns are essentially identical to mpstat(1M), with "%wio" representing the time all the processors spent waiting for I/O.
We also have vmstat(1M) for gathering system load information:

# vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy  cs us sy id
 0 0 0 658464 431384  0   2  1  0  0  0  0  1  0  0  0  160  673 154  0  0  99
 0 0 0 657752 424264  0   0  0  0  0  0  0  1  0  0  0  118  585  78  0  0 100
The last three columns on the right, "us," "sy," and "id," provide the processor utilization information, representing time spent in user mode, system mode, or idle, respectively. vmstat(1M) does not provide a column indicating wait I/O time. Rather, wait I/O is added to idle time in the vmstat code.
Other bundled commands, such as iostat(1M), can also report processor utilization. Identifying the columns that provide processor utilization information is a straightforward task, and the man page can always be consulted if there is any uncertainty.
Processor utilization data collection
System statistics and load data are collected largely by maintaining counters installed in many strategic places in the kernel. There are several structures defined in the sys/sysinfo.h header file that contain many counters, which get incremented at the appropriate places as the operating system runs. For example, the cpu_sysinfo structure contains counters for system calls, read system calls, write system calls, forks, execs, etc. There is also a cpu_vminfo structure that maintains counters for statistical information related to virtual memory, such as page faults, page-ins, page-outs, etc.
When a SunOS system boots, during the initialization process, a linked list of cpu_stat structures is created, one for each processor on the system. The cpu_stat structure has embedded within it the data structures mentioned previously, cpu_sysinfo and cpu_vminfo, in addition to a cpu_syswait structure. The cpu_syswait structure is intended to maintain finer detailed information on processors waiting for I/O (e.g., is the processor waiting for I/O from a swap device, a physical I/O, or anything else?). Only one field in the cpu_syswait structure is currently used in SunOS -- the iowait field (which we will examine shortly). The other fields defined, swap and physio, are not implemented in any SunOS release to date, including the next major release, SunOS 5.6 (Solaris 2.6).
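Pulling that description together, the cpu_syswait structure amounts to the following sketch. (The field names come from the discussion above; the exact declaration in sys/sysinfo.h may differ slightly between releases.)

struct cpu_syswait {
        long    iowait;    /* processors waiting for block I/O -- the only field used */
        long    swap;      /* defined, but not implemented */
        long    physio;    /* defined, but not implemented */
};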
SunOS also implements a mechanism known as kstats for maintaining kernel statistical information. The kstat data is maintained on a linked list of kernel statistics (kstat) structures by the operating system and is made available programmatically via libkstat library routines. All of the bundled performance utilities use kstat calls to gather some of the information they provide. Details on the actual implementation and use of kstat will be the subject of a forthcoming piece. For now, take a look at the kstat man pages, beginning with kstat(3K). Note that the kstat interfaces are not standardized and are subject to change from one SunOS release to another. I offer this in case you decide to write some code using the kstat(3K) interfaces.
The information for processor utilization is maintained by an array in the cpu_sysinfo structure:
#define CPU_IDLE   0
#define CPU_USER   1
#define CPU_KERNEL 2
#define CPU_WAIT   3
#define CPU_STATES 4

struct cpu_sysinfo {
        ulong   cpu[CPU_STATES];    /* processor utilization information */
        .
        .
        .
As we said previously, a processor running SunOS will be in one of four possible states at any point in time. The code definition above declares an array of four unsigned long data types -- one array location for each of the four possible states a processor can be in. The values stored in these array locations are used by the bundled tools to determine how processors are being utilized.
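To make the connection to the tools concrete, here is a minimal sketch of how a utility might sample these counters through the kstat(3K) interfaces mentioned earlier and turn them into percentages. This is illustrative code, not the actual mpstat source; it assumes the per-processor kstats are named cpu_stat0, cpu_stat1, and so on, and it samples only the first processor. Compile with -lkstat.

#include <stdio.h>
#include <unistd.h>
#include <kstat.h>
#include <sys/sysinfo.h>

int
main(void)
{
        kstat_ctl_t *kc;
        kstat_t *ksp;
        cpu_stat_t cs1, cs2;
        unsigned long delta[CPU_STATES], total = 0;
        int i;

        if ((kc = kstat_open()) == NULL) {
                perror("kstat_open");
                return (1);
        }
        /* Find the cpu_stat structure for the first processor. */
        if ((ksp = kstat_lookup(kc, "cpu_stat", -1, "cpu_stat0")) == NULL) {
                fprintf(stderr, "cpu_stat0 kstat not found\n");
                return (1);
        }
        /* Take two snapshots of the tick counters, five seconds apart. */
        (void) kstat_read(kc, ksp, &cs1);
        (void) sleep(5);
        (void) kstat_read(kc, ksp, &cs2);

        /*
         * The cpu[] values are running tick counts, so the deltas over
         * the interval sum to the total clock ticks in the sampling period.
         */
        for (i = 0; i < CPU_STATES; i++) {
                delta[i] = cs2.cpu_sysinfo.cpu[i] - cs1.cpu_sysinfo.cpu[i];
                total += delta[i];
        }
        if (total == 0)         /* guard against a zero-length interval */
                total = 1;
        printf("usr %lu%%  sys %lu%%  wt %lu%%  idl %lu%%\n",
            delta[CPU_USER] * 100 / total,
            delta[CPU_KERNEL] * 100 / total,
            delta[CPU_WAIT] * 100 / total,
            delta[CPU_IDLE] * 100 / total);
        /*
         * Note: vmstat would instead fold the wait I/O ticks into idle,
         * i.e., idle = delta[CPU_IDLE] + delta[CPU_WAIT].
         */
        (void) kstat_close(kc);
        return (0);
}

Conceptually, mpstat(1M) repeats this arithmetic for every processor on the linked list, while sar(1M) averages the deltas across all processors.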
The next thing to look at is how the array locations get incremented. In other words, how, when, and where do we establish non-zero values representing time spent in user mode, system mode, waiting for I/O, or idle? The answer is relatively straightforward, as system utilization is updated in only one place in the kernel -- the clock routine. And there's just one location in the kernel where the wait I/O state is set. Let's take a closer look.
The waiting is the hardest part (with apologies to Tom Petty)
The area in SunOS that increments the iowait field is a kernel routine called biowait(). The bio routines in SunOS are the interfaces to the Unix "buffer cache," which traditionally is defined as an area of physical memory allocated to caching file system data blocks for performance purposes. That is the case for some other (and older) variations of Unix, but is misleading in the context of SunOS, even going back to SunOS 4.X.

In SunOS, the implementation of virtual memory is such that all physical RAM is one big "page cache" (more or less), and the objects cached in the page cache are driven by the load imposed on the system (i.e., a lot of file system I/O will result in a lot of physical RAM being used to cache file system pages). The caching of file system data is no longer bound to some statically sized chunk of RAM. The buffer cache in SunOS is used for the caching of UFS metadata, specifically superblock, inode, and cylinder group information -- components of the file system separate and distinct from the actual data blocks cached in the page cache. How this all plays together gets somewhat complicated and will be the subject of an Inside Solaris column in the not-too-distant future. I mention the bio interfaces in an effort to convey a better understanding and more detailed description of how and where iowait gets updated.
Inside the biowait() routine, the kernel uses a macro to increment the iowait value:
CPU_STAT_ADD_K(cpu_syswait.iowait, 1);
The expansion of the macro simply adds a value of 1 to the iowait member (a long data type) in the cpu_syswait structure. (In other words, it increments iowait.) Some number of instructions later, in the same routine, another call to the same macro decrements the value:
CPU_STAT_ADD_K(cpu_syswait.iowait, -1);
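To put the two macro calls in context, the overall shape of biowait() is roughly as follows. This is a paraphrase based on the description in this column, not the actual kernel source; in particular, the B_DONE test and the semaphore wait are assumptions about how the blocking is done.

int
biowait(struct buf *bp)
{
        CPU_STAT_ADD_K(cpu_syswait.iowait, 1);    /* mark this CPU as waiting for I/O */
        while (!(bp->b_flags & B_DONE))           /* assumed: block until the I/O completes */
                sema_p(&bp->b_io);
        CPU_STAT_ADD_K(cpu_syswait.iowait, -1);   /* no longer waiting */
        /* buffer error checking elided */
        return (0);
}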
The reason this is done is to provide a reasonably accurate snapshot of processor state. How? Well, as we'll see, the gathering of the utilization information is done in the clock interrupt handler. Clock interrupts occur 100 times a second (every 10 milliseconds) in SunOS. If we take a clock interrupt and enter the interrupt handler code after entering the biowait() routine, we will catch the value as a 1. If we hit the handler towards the bottom of the biowait() routine, the value will have been decremented back to zero and no wait I/O time will be charged:
Figure 1: Execution path of biowait()
Referencing Figure 1, the time line along the execution path of biowait(): if we take a clock interrupt between time T0 and time Tn, wait I/O time will get charged. If the interrupt handler kicks in after time Tn, no wait I/O time will get charged.
Now, just because this only happens in one place does not mean it doesn't happen very often. As a matter of fact, the biowait() routine is called from a bunch of places in the operating system, from various file system support routines (e.g., UFS, HSFS, ProcFS, PCFS, etc.), as well as lower-level device drivers, such as the SCSI disk (sd), SCSI tape (st), and SPARCstorage Array (pln/ssd) drivers.
The net-net of this is that iowait gets updated for disk I/Os in SunOS. Actually, I/O to CD-ROM and SCSI tapes will cause updates to iowait as well, but the crux of it comes from disk I/O, both through the file system and through raw devices. It is important and interesting to note that network I/O does not reflect itself in wait I/O. None of the network drivers or other kernel layers involved in network I/O (e.g., the TCP/IP stack) make calls to biowait().
Let's look now at the actual gathering of the utilization information in the SunOS clock interrupt handler. As we said, we enter this clock routine every 10 milliseconds. The handler is used to do some general OS housekeeping, such as handling the callout queue, nudging the scheduler, collecting some virtual memory statistics, and updating the processor utilization information.
The utilization data is collected using the following algorithm, shown in pseudo code. There are two main data collection loops involved. The first loop walks through the linked list of cpu_stat structures and saves the values of the cpu_syswait structure fields. Remember, we mentioned a cpu_syswait structure contains variables used to store finer grained information on the type of I/O the CPU may be waiting for (swap, physio, or anything else). As we said previously, only the iowait counter is implemented in SunOS.
for (every per-processor cpu_stat structure)
do
        total value for swap wait;
        total value for waitio;
        total number of runnable threads on each processor's dispatch queue
done
wait_value = total swap wait + total waitio;
(Note the last item, total runnable threads, is not related to the subject at hand, but I included it as a data point and to prevent a potential burst of e-mails from readers with source code access!)
Once the code falls out of the first loop, a second loop is entered to update user, system, and idle time.
for (every per-processor cpu_stat structure)
do
        if (current cpu is quiesced)
                increment idle;
        else if (current cpu is on the interrupt stack)
                increment kernel;
        else if (current cpu is running this clock handler code)
                increment kernel;
        else if (current cpu is running the idle thread)
                if (wait_value from first loop is non-zero)
                        increment waitio;
                else (wait_value from first loop was zero)
                        increment idle;
        else if (current cpu lwp state is USER)
                increment user;
done
This second loop is where the kernel figures out where each processor is spending its time. In the first loop, it checked for iowait. With that information, the second loop figures out which of the four possible states each processor should be charged with spending time in, based on the conditions described in the pseudo-code loop above.
Run that by me again, please
So, more concisely, here's what happens. As I/Os get issued on a SunOS system, various routines at different levels of the operating system end up calling the biowait() kernel routine, either directly or implicitly through another buffer cache routine (such as bread(), bwrite(), etc.). When we enter biowait(), the iowait field of the cpu_syswait structure gets incremented. On the way out of biowait(), the field gets decremented.
External to all this activity is the system clock, interrupting every 10 milliseconds. Only one processor on an MP system will handle the clock interrupt. In the handler, we "poll" the cpu_syswait.iowait field for each processor. Any processor sitting in the biowait() routine with the iowait field set to one will result in a non-zero wait I/O total in the clock handler. This is the interesting part. Even one processor in biowait() could potentially cause all the processors on an MP system to show up as waiting for I/O. Why? Because, if you re-examine the loops illustrated above, you'll see that if the current processor is idle, and the wait_value from the first loop is non-zero, which will be the case if even one processor was in biowait(), the current processor will get charged with being in a wait I/O state. (By current I mean the current cpu_stat structure -- remember, we're walking a linked list of these structures; one structure for every processor on the system.)
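The effect is easy to demonstrate with a toy user-level simulation that simply transcribes the pseudo-code loops above. (This is purely illustrative; the state names and array sizes are invented for the example.)

#include <stdio.h>

#define NCPU 4

enum cpu_state { USER_MODE, KERNEL_MODE, IDLE_THREAD };

int
main(void)
{
        /*
         * One CPU has a thread blocked in biowait() -- so its iowait
         * field is 1 and the CPU itself is running the idle thread.
         * The other three CPUs are simply idle.
         */
        enum cpu_state state[NCPU] =
            { IDLE_THREAD, IDLE_THREAD, IDLE_THREAD, IDLE_THREAD };
        int iowait[NCPU] = { 1, 0, 0, 0 };
        int wait_value = 0, i;

        /* First loop: total the per-CPU iowait fields. */
        for (i = 0; i < NCPU; i++)
                wait_value += iowait[i];

        /* Second loop: charge each CPU with a state for this clock tick. */
        for (i = 0; i < NCPU; i++) {
                if (state[i] == USER_MODE)
                        printf("cpu %d: usr\n", i);
                else if (state[i] == KERNEL_MODE)
                        printf("cpu %d: sys\n", i);
                else if (wait_value != 0)       /* idle thread + pending I/O */
                        printf("cpu %d: wt\n", i);
                else
                        printf("cpu %d: idl\n", i);
        }
        return (0);
}

Run it, and all four CPUs print "wt," even though only one of them has a thread waiting on an I/O.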
Thus, we get the situations seen so often on production systems; that is, MP systems showing very high I/O wait times when in fact they're not nearly so "I/O bound" as one might think by looking at the utilization data.
What about user time and system time?
While our primary objective here was to discuss wait I/O, it's a good time to say a few things about systems that are spending a significant amount of time in system (kernel) or user mode. Significant time spent in user mode puts you right at the application processes. (If you're only spending 10 percent of your time in sys, then how much return are you going to get by tuning the operating system? Very little, if any.) Examine the application processes running and use a code profiler when appropriate to figure out where the application code is spending its time. Also, Solaris ships with several commands in the /usr/proc/bin directory that are useful for gathering information about running processes. These commands use the procfs pseudo file system for gathering process information and providing some basic process control functions. See the proc(1) man page for detailed information on the bundled commands, and proc(4) for information on procfs.
System time means we're executing operating system (kernel) code. Look at what the system is doing and rule out things that can be tuned (e.g., lots of paging due to insufficient RAM, errors causing the system to spend time in error handlers, etc.). Remember, a SunOS system booted and idle will reflect as much with the performance utilities (i.e., near 100 percent idle time), so don't lose sight of the fact that the operating system pretty much does what it is asked to do based on the load being imposed on it. A high percentage of system time does not necessarily mean something bad is happening -- an application load that is system call intensive and does little user-land work will result in a system that has a lot of system time.
Previous versions of Solaris shipped with a kernel profiler (kgmon), which enabled you to get a snapshot of which kernel routines the operating system was spending most of its time in. There wasn't necessarily something that could be "tuned" as a result of knowing which kernel routines were executing most frequently, but it sometimes gave clues as to what the application load was asking the system to do, and perhaps with that information application changes could be made that would reduce time spent in the kernel and thus provide more CPU for other things.
Summary
Understanding how processors on a SunOS system are being utilized is
the first step to any load characterization or performance analysis
effort. The amount of time a system spends waiting for I/O can be
misleading, since on a multiprocessor system, one processor doing I/O
can result in the entire system showing up in a wait I/O state. We
saw how this happens by examining how the state information is
updated and collected.
The key points to remember are:

- Wait I/O time is driven by disk I/O (plus I/O to CD-ROM and tape devices); network I/O is never reflected in wait I/O time.

- On a multiprocessor system, even one processor in biowait() can cause every idle processor to be charged with wait I/O time.

- If you see a high percentage of wait I/O time, run iostat(1M) and carefully examine how effectively the disk subsystem is performing. In other words, a high percentage of wait I/O time may be the result of an I/O bottleneck in the disk subsystem, but that is not necessarily the case. A careful analysis of disk I/O performance with iostat(1M) (or other tools, like ssadm(1M) if SPARCstorage Arrays are being used, the Solstice Volume Manager performance facilities if applicable, etc.) is the only definitive way to determine if a hot spindle, RAID volume, etc., exists and tuning is required.
In future columns, we'll take a closer look at the execution paths of disk I/Os in SunOS, including expanded examinations of the buffer cache and memory page cache implementation. Stay tuned.
About the author
Jim Mauro is currently an Area Technology Manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting.
Reach Jim at jim.mauro@sunworld.com.
If you have technical problems with this magazine, contact webmaster@sunworld.com
URL: http://www.sunworld.com/swol-08-1997/swol-08-insidesolaris.html