Processor utilization and wait I/O -- a look inside

Have high percentages of wait I/O time caused you endless hours of looking for I/O bottlenecks, only to come up empty? How is it that a system can have a well-tuned I/O subsystem and still show high wait I/O times? Read on and discover the answers.
In the never-ending pursuit of optimal system performance, the iterative process of looking at system utilization and figuring out which knobs to turn is not always straightforward. In particular, any significant amount of time spent waiting for I/O (Input/Output) typically results in a search for I/O bottlenecks in the system. In this article, we will look at how wait I/O time is calculated internally in SunOS. This will provide a solid understanding of how processor utilization is measured, show how the numbers really get generated, and why they can sometimes be misleading. (3,300 words)
The goal of this column is to take a topic of general interest around the implementation and use of Solaris and provide detailed technical information (based on source code reading, as well as knowledge transfer from the engineering community and other subject matter experts) that will help clarify a particular point. When applicable, examples will be provided that serve to illustrate a particular issue.
Readers are, of course, encouraged to send suggestions on topics they would like to see covered, as well as comments and questions on anything discussed.
The tools of the trade
Any system performance analysis or load characterization effort should always begin by looking at the utilization of the system's hardware resources. At a macroscopic level, these resources can be grouped as processors, physical memory (RAM), the storage I/O subsystem (disk), and the network interfaces. Determining the general utilization of a system running SunOS begins with the running of some bundled utilities. Specifically, sar(1M), vmstat(1M), and mpstat(1M) are used as the starting point for understanding what the overall system load looks like. It may very well be that the initial effort is not undertaken specifically to locate a system bottleneck, but rather as a general information gathering excursion to characterize the system load for capacity planning or documentation purposes.
Whatever the reason, the aforementioned tools are the perfect place to begin, since they are bundled, documented, easy to use, and, in general, relatively simple to interpret. For example, processor utilization is determined by looking at a few select bits of data, depending on which utility you are using. SunOS reports that a processor will spend its time doing one of four things: running in user mode, running in system mode, running the idle thread, or waiting for I/O. The first three are pretty easy to understand, but waiting for I/O is often confusing to people analyzing system behavior, because intuitively one is compelled to think that any significant amount of time spent waiting for I/O indicates that an I/O bottleneck must exist. Also, it is not always clear whether the I/O is strictly disk I/O, or perhaps includes other device I/O, such as tape devices or network I/O.
Regardless of which utility is used, processor utilization is reported in percentages, relative to the sampling period:
# mpstat 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 10    0   0    8   209    9   13    0    0    0    0     2    0   0   0 100
 11    0   0  189     0    0    4    0    0    0    0     0    0   0   0 100
 14    0   0    0     0    0    9    0    0    0    0   121    0   0   0 100
 15    0   0   21     3    3   16    0    0    0    0    21    0   0   0 100
The mpstat(1M) command provides statistical information for every processor installed and online in the system. (Processors can be taken offline with psradm(1M). Offline processors are not reported by mpstat.) The above sample is one snapshot of a four-processor Enterprise 3000 system. The last four columns of mpstat are processor utilization, indicating the percent of time for the sampling period the processors spent in user mode, system mode, waiting for I/O, or idle. For each line, the sum of the last four columns should account for 100 percent of time spent.
# sar -u 5 5

SunOS allaire 5.5.1 Generic_103640-08 sun4u    06/16/97

21:40:07    %usr    %sys    %wio   %idle
21:40:12       0       0       1      98
21:40:17       0       0       1      98
The sar(1M) command with the "-u" flag provides system utilization information. Unlike mpstat(1M), sar(1M) does not break out utilization on a per-processor basis. Instead, sar(1M) averages the time spent for all processors. The columns are essentially identical to mpstat(1M), with "%wio" representing the time all the processors spent waiting for I/O.
We also have vmstat(1M) for gathering system load information:

# vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy  cs us sy id
 0 0 0 658464 431384  0   2  1  0  0  0  0  1  0  0  0  160  673 154  0  0  99
 0 0 0 657752 424264  0   0  0  0  0  0  0  1  0  0  0  118  585  78  0  0 100
The last three columns on the right, "us," "sy," and "id," provide the processor utilization information, representing time spent in user mode, system mode, or idle, respectively. vmstat(1M) does not provide a column indicating wait I/O time. Rather, wait I/O is added to idle time in the vmstat code.
Other bundled commands, such as iostat(1M), can also report processor utilization. Identifying the columns that provide processor utilization information is a straightforward task, and the man page can always be consulted if there is any uncertainty.
Processor utilization data collection
System statistics and load data are collected largely by maintaining counters installed in many strategic places in the kernel. There are several structures defined in the sys/sysinfo.h header file that contain many counters, which get incremented at the appropriate places as the operating system runs. For example, the cpu_sysinfo structure contains counters for system calls, read system calls, write system calls, forks, execs, etc. There is also a cpu_vminfo structure that maintains counters for statistical information related to virtual memory, such as page faults, page-ins, page-outs, etc.
When a SunOS system boots, during the initialization process, a linked list of cpu_stat structures is created, one for each processor on the system. The cpu_stat structure has embedded within it the data structures mentioned previously, cpu_sysinfo and cpu_vminfo, in addition to a cpu_syswait structure. The cpu_syswait structure is intended to maintain finer detailed information on processors waiting for I/O (e.g., is the processor waiting for I/O from a swap device, a physical I/O, or anything else?). Only one field in the cpu_syswait structure is currently used in SunOS -- the iowait field (which we will examine shortly). The other fields defined, swap and physio, are not implemented in any SunOS release to date, including the next major release, SunOS 5.6 (Solaris 2.6).
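Pulling that description together, the cpu_syswait structure amounts to the following sketch. (The field names come from the discussion above; the exact declaration in sys/sysinfo.h may differ slightly between releases.)

struct cpu_syswait {
        long    iowait;    /* processors waiting for block I/O -- the only field used */
        long    swap;      /* defined, but not implemented */
        long    physio;    /* defined, but not implemented */
};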
SunOS also implements a mechanism known as kstats for maintaining kernel statistical information. The kstat data is maintained on a linked list of kernel statistics (kstat) structures by the operating system and is made available programmatically via libkstat library routines. All of the bundled performance utilities use kstat calls to gather some of the information they provide. Details on the actual implementation and use of kstat will be the subject of a forthcoming piece. For now, take a look at the kstat man pages, beginning with kstat(3K). Note that the kstat interfaces are not standardized and are subject to change from one SunOS release to another. I offer this in case you decide to write some code using the kstat(3K) interfaces.
The information for processor utilization is maintained by an array in the cpu_sysinfo structure:
#define CPU_IDLE   0
#define CPU_USER   1
#define CPU_KERNEL 2
#define CPU_WAIT   3
#define CPU_STATES 4

struct cpu_sysinfo {
        ulong   cpu[CPU_STATES];    /* processor utilization information */
        .
        .
        .
As we said previously, a processor running SunOS will be in one of four possible states at any point in time. The code definition above declares an array of four unsigned long data types -- one array location for each of the four possible states a processor can be in. The values stored in these array locations are used by the bundled tools to determine how processors are being utilized.
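To make the connection to the tools concrete, here is a minimal sketch of how a utility might sample these counters through the kstat(3K) interfaces mentioned earlier and turn them into percentages. This is illustrative code, not the actual mpstat source; it assumes the per-processor kstats are named cpu_stat0, cpu_stat1, and so on, and it samples only the first processor. Compile with -lkstat.

#include <stdio.h>
#include <unistd.h>
#include <kstat.h>
#include <sys/sysinfo.h>

int
main(void)
{
        kstat_ctl_t *kc;
        kstat_t *ksp;
        cpu_stat_t cs1, cs2;
        unsigned long delta[CPU_STATES], total = 0;
        int i;

        if ((kc = kstat_open()) == NULL) {
                perror("kstat_open");
                return (1);
        }
        /* Find the cpu_stat structure for the first processor. */
        if ((ksp = kstat_lookup(kc, "cpu_stat", -1, "cpu_stat0")) == NULL) {
                fprintf(stderr, "cpu_stat0 kstat not found\n");
                return (1);
        }
        /* Take two snapshots of the tick counters, five seconds apart. */
        (void) kstat_read(kc, ksp, &cs1);
        (void) sleep(5);
        (void) kstat_read(kc, ksp, &cs2);

        /*
         * The cpu[] values are running tick counts, so the deltas over
         * the interval sum to the total clock ticks in the sampling period.
         */
        for (i = 0; i < CPU_STATES; i++) {
                delta[i] = cs2.cpu_sysinfo.cpu[i] - cs1.cpu_sysinfo.cpu[i];
                total += delta[i];
        }
        if (total == 0)         /* guard against a zero-length interval */
                total = 1;
        printf("usr %lu%%  sys %lu%%  wt %lu%%  idl %lu%%\n",
            delta[CPU_USER] * 100 / total,
            delta[CPU_KERNEL] * 100 / total,
            delta[CPU_WAIT] * 100 / total,
            delta[CPU_IDLE] * 100 / total);
        /*
         * Note: vmstat would instead fold the wait I/O ticks into idle,
         * i.e., idle = delta[CPU_IDLE] + delta[CPU_WAIT].
         */
        (void) kstat_close(kc);
        return (0);
}

Conceptually, mpstat(1M) repeats this arithmetic for every processor on the linked list, while sar(1M) averages the deltas across all processors.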
The next thing to look at is how the array locations get incremented. In other words, how, when, and where do we establish non-zero values representing time spent in user mode, system mode, waiting for I/O, or idle? The answer is relatively straightforward, as system utilization is updated in only one place in the kernel -- the clock routine. And there's just one location in the kernel where the wait I/O state is set. Let's take a closer look.
The waiting is the hardest part (with apologies to Tom Petty)
The area in SunOS that increments the iowait field is a kernel routine called biowait(). The bio routines in SunOS are the interfaces to the Unix "buffer cache," which traditionally is defined as an area of physical memory allocated to caching file system data blocks for performance purposes. That is the case for some other (and older) variations of Unix, but is misleading in the context of SunOS, even going back to SunOS 4.X.

In SunOS, the implementation of virtual memory is such that all physical RAM is one big "page cache" (more or less), and the objects cached in the page cache are driven by the load imposed on the system (i.e., a lot of file system I/O will result in a lot of physical RAM being used to cache file system pages). The caching of file system data is no longer bound to some statically sized chunk of RAM. The buffer cache in SunOS is used for the caching of UFS metadata, specifically superblock, inode, and cylinder group information -- components of the file system separate and distinct from the actual data blocks cached in the page cache. How this all plays together gets somewhat complicated and will be the subject of an Inside Solaris column in the not-too-distant future. I mention the bio interfaces in an effort to convey a better understanding and more detailed description of how and where iowait gets updated.
Inside the biowait() routine, the kernel uses a macro to increment the iowait value:
CPU_STAT_ADD_K(cpu_syswait.iowait, 1);
The expansion of the macro simply adds a value of 1 to the iowait member (a long data type) in the cpu_syswait structure. (In other words, it increments iowait.) Some number of instructions later, in the same routine, another call to the same macro decrements the value:
CPU_STAT_ADD_K(cpu_syswait.iowait, -1);
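To put the two macro calls in context, the overall shape of biowait() is roughly as follows. This is a paraphrase based on the description in this column, not the actual kernel source; in particular, the B_DONE test and the semaphore wait are assumptions about how the blocking is done.

int
biowait(struct buf *bp)
{
        CPU_STAT_ADD_K(cpu_syswait.iowait, 1);    /* mark this CPU as waiting for I/O */
        while (!(bp->b_flags & B_DONE))           /* assumed: block until the I/O completes */
                sema_p(&bp->b_io);
        CPU_STAT_ADD_K(cpu_syswait.iowait, -1);   /* no longer waiting */
        /* buffer error checking elided */
        return (0);
}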
The reason this is done is to provide a reasonably accurate snapshot of processor state. How? Well, as we'll see, the gathering of the utilization information is done in the clock interrupt handler. Clock interrupts occur 100 times a second (every 10 milliseconds) in SunOS. If we take a clock interrupt and enter the interrupt handler code after entering the biowait() routine, we will catch the value as a 1. If we hit the handler towards the bottom of the biowait() routine, the value will have been decremented back to zero and no wait I/O time will be charged:
Figure 1: Execution path of biowait()
Referencing Figure 1, the time line along the execution path of biowait(): if we take a clock interrupt between time T0 and time Tn, wait I/O time will get charged. If the interrupt handler kicks in after time Tn, no wait I/O time will get charged.
Now, just because this only happens in one place does not mean it doesn't happen very often. As a matter of fact, the biowait() routine is called from a bunch of places in the operating system, from various file system support routines (e.g., UFS, HSFS, ProcFS, PCFS, etc.), as well as lower-level device drivers, such as the SCSI disk (sd), SCSI tape (st), and SPARCstorage Array (pln/ssd) drivers.
The net-net of this is that iowait gets updated for disk I/Os in SunOS. Actually, I/O to CD-ROM and SCSI tapes will cause updates to iowait as well, but the crux of it comes from disk I/O, both through the file system and through raw devices. It is important and interesting to note that network I/O does not reflect itself in wait I/O. None of the network drivers or other kernel layers involved in network I/O (e.g., the TCP/IP stack) make calls to biowait().
Let's look now at the actual gathering of the utilization information in the SunOS clock interrupt handler. As we said, we enter this clock routine every 10 milliseconds. The handler is used to do some general OS housekeeping, such as handling the callout queue, nudging the scheduler, collecting some virtual memory statistics, and updating the processor utilization information.
The utilization data is collected using the following algorithm, shown in pseudo code. There are two main data collection loops involved. The first loop walks through the linked list of cpu_stat structures and saves the values of the cpu_syswait structure fields. Remember, we mentioned a cpu_syswait structure contains variables used to store finer grained information on the type of I/O the CPU may be waiting for (swap, physio, or anything else). As we said previously, only the iowait counter is implemented in SunOS.
for (every per-processor cpu_stat structure)
do
        total value for swap wait;
        total value for waitio;
        total number of runnable threads on each processor's dispatch queue
done
wait_value = total swap wait + total waitio;
(Note the last item, total runnable threads, is not related to the subject at hand, but I included it as a data point and to prevent a potential burst of e-mails from readers with source code access!)
Once the code falls out of the first loop, a second loop is entered to update user, system, and idle time.
for (every per-processor cpu_stat structure)
do
        if (current cpu is quiesced)
                increment idle;
        else if (current cpu is on the interrupt stack)
                increment kernel;
        else if (current cpu is running this clock handler code)
                increment kernel;
        else if (current cpu is running the idle thread)
                if (wait_value from first loop is non-zero)
                        increment waitio;
                else (wait_value from first loop was zero)
                        increment idle;
        else if (current cpu lwp state is USER)
                increment user;
done
This second loop is where the kernel figures out where each processor is spending its time. In the first loop, it checked for iowait. With that information, the second loop figures out which of the four possible states each processor should be charged with spending time in, based on the conditions described in the pseudo-code loop above.
Run that by me again, please
So, more concisely, here's what happens. As I/Os get issued on a SunOS system, various routines at different levels of the operating system end up calling the biowait() kernel routine, either directly or implicitly through another buffer cache routine (such as bread(), bwrite(), etc.). When we enter biowait(), the iowait field of the cpu_syswait structure gets incremented. On the way out of biowait(), the field gets decremented.
External to all this activity is the system clock, interrupting every 10 milliseconds. Only one processor on an MP system will handle the clock interrupt. In the handler, we "poll" the cpu_syswait.iowait field for each processor. Any processor sitting in the biowait() routine with the iowait field set to one will result in a non-zero wait I/O total in the clock handler. This is the interesting part. Even one processor in biowait() could potentially cause all the processors on an MP system to show up as waiting for I/O. Why? Because, if you re-examine the loops illustrated above, you'll see that if the current processor is idle, and the wait_value from the first loop is non-zero, which will be the case if even one processor was in biowait(), the current processor will get charged with being in a wait I/O state. (By current I mean the current cpu_stat structure -- remember, we're walking a linked list of these structures; one structure for every processor on the system.)
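The effect is easy to demonstrate with a toy user-level simulation that simply transcribes the pseudo-code loops above. (This is purely illustrative; the state names and array sizes are invented for the example.)

#include <stdio.h>

#define NCPU 4

enum cpu_state { USER_MODE, KERNEL_MODE, IDLE_THREAD };

int
main(void)
{
        /*
         * One CPU has a thread blocked in biowait() -- so its iowait
         * field is 1 and the CPU itself is running the idle thread.
         * The other three CPUs are simply idle.
         */
        enum cpu_state state[NCPU] =
            { IDLE_THREAD, IDLE_THREAD, IDLE_THREAD, IDLE_THREAD };
        int iowait[NCPU] = { 1, 0, 0, 0 };
        int wait_value = 0, i;

        /* First loop: total the per-CPU iowait fields. */
        for (i = 0; i < NCPU; i++)
                wait_value += iowait[i];

        /* Second loop: charge each CPU with a state for this clock tick. */
        for (i = 0; i < NCPU; i++) {
                if (state[i] == USER_MODE)
                        printf("cpu %d: usr\n", i);
                else if (state[i] == KERNEL_MODE)
                        printf("cpu %d: sys\n", i);
                else if (wait_value != 0)       /* idle thread + pending I/O */
                        printf("cpu %d: wt\n", i);
                else
                        printf("cpu %d: idl\n", i);
        }
        return (0);
}

Run it, and all four CPUs print "wt," even though only one of them has a thread waiting on an I/O.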
Thus, we get the situations seen so often on production systems; that is, MP systems showing very high I/O wait times when in fact they're not nearly so "I/O bound" as one might think by looking at the utilization data.
What about user time and system time?
While our primary objective here was to discuss wait I/O, it's a good time to say a few things about systems that are spending a significant amount of time in system (kernel) or user mode. Significant time spent in user mode puts you right at the application processes. (If you're only spending 10 percent of your time in sys, then how much return are you going to get by tuning the operating system? Very little, if any.) Examine the application processes running and use a code profiler when appropriate to figure out where the application code is spending its time. Also, Solaris ships with several commands in the /usr/proc/bin directory that are useful for gathering information about running processes. These commands use the procfs pseudo file system for gathering process information and providing some basic process control functions. See the proc(1) man page for detailed information on the bundled commands, and proc(4) for information on procfs.
System time means we're executing operating system (kernel) code. Look at what the system is doing and rule out things that can be tuned (e.g., lots of paging due to insufficient RAM, errors causing the system to spend time in error handlers, etc.). Remember, a SunOS system booted and idle will reflect as much with the performance utilities (i.e., near 100 percent idle time), so don't lose sight of the fact that the operating system pretty much does what it is asked to do based on the load being imposed on it. A high percentage of system time does not necessarily mean something bad is happening -- an application load that is system call intensive and does little user-land work will result in a system that has a lot of system time.
Previous versions of Solaris shipped with a kernel profiler (kgmon), which enabled you to get a snapshot of which kernel routines the operating system was spending most of its time in. There wasn't necessarily something that could be "tuned" as a result of knowing which kernel routines were executing most frequently, but it sometimes gave clues as to what the application load was asking the system to do, and perhaps with that information application changes could be made that would reduce time spent in the kernel and thus provide more CPU for other things.
Summary
Understanding how processors on a SunOS system are being utilized is
the first step to any load characterization or performance analysis
effort. The amount of time a system spends waiting for I/O can be
misleading, since on a multiprocessor system, one processor doing I/O
can result in the entire system showing up in a wait I/O state. We
saw how this happens by examining how the state information is
updated and collected.
The key points to remember are:

- Wait I/O time is driven by disk I/O (plus I/O to CD-ROM and tape devices); network I/O is never reflected in wait I/O time.

- On a multiprocessor system, even one processor in biowait() can cause every idle processor to be charged with wait I/O time.

- If you see a high percentage of wait I/O time, run iostat(1M) and carefully examine how effectively the disk subsystem is performing. In other words, a high percentage of wait I/O time may be the result of an I/O bottleneck in the disk subsystem, but that is not necessarily the case. A careful analysis of disk I/O performance with iostat(1M) (or other tools, like ssadm(1M) if SPARCstorage Arrays are being used, the Solstice Volume Manager performance facilities if applicable, etc.) is the only definitive way to determine if a hot spindle, RAID volume, etc., exists and tuning is required.
In future columns, we'll take a closer look at the execution paths of disk I/Os in SunOS, including expanded examinations of the buffer cache and memory page cache implementation. Stay tuned.
About the author
Jim Mauro is currently an Area Technology Manager for Sun Microsystems in the Northeast area, focusing on server systems, clusters, and high availability. He has a total of 18 years industry experience, working in service, educational services (he developed and delivered courses on Unix internals and administration), and software consulting.
Reach Jim at jim.mauro@sunworld.com.
If you have technical problems with this magazine, contact webmaster@sunworld.com
URL: http://www.sunworld.com/swol-08-1997/swol-08-insidesolaris.html