Performance Q & A by Adrian Cockcroft

Unveiling vmstat's charms

vmstat offers too much information to the uninitiated.
Here's how to find what you need

SunWorld
September 1996

Abstract
The vmstat command is one of the best known performance utilities. We'll explore the sources of its data, see what extra information is available, and look at a few related commands. (2,600 words)


Q: What do all those columns of data in vmstat mean? How do they relate to the data from mpstat, and where does it all come from?
--statting in Sturgeon Bay

We covered vmstat and the virtual memory system that it monitors in the October 1995 column, "Help! I've lost my memory!" We haven't looked behind the scenes, though, to see what other data is available. Following on from recent columns on the underlying disk data and last month's look at process data, we'll hunt around the header files provided with Solaris and see what we can find.

First let's remind ourselves what vmstat itself looks like.

% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 s0 s2 s3   in   sy   cs us sy id
 0 0 0  72724 25348   0   2  3  1  1  0  0  0  0  1  0   63  362   85  1  1 98
 0 0 0  64724 25184   0  24 56  0  0  0  0  0  0 19  0  311 1112  356  2  4 94
 0 0 0  64724 24796   0   5 38  0  0  0  0  0  0 15  0   92  325  212  0  1 99
 0 0 0  64680 24584   0  12 106 0  0  0  0  0  0 41  0  574 1094  340  2  5 93
 0 1 0  64632 23736   0   0 195 0  0  0  0  0  0 66  0  612  870  270  1  7 92
 0 0 0  64628 22796   0   0 144 0  0  0  0  0  0 59  0  398  764  222  1  8 91
 0 0 0  64620 22796   0   0 79  0  0  0  0  0  0 50  0  255 1383  136  2 18 80

The command prints the first line of data immediately, then a new line every five seconds giving the average rates over the five-second interval. The first line is also an average rate, but over the interval that started when the system was booted! The reason is that the numbers are stored by the system as counts of the number of times each event has happened. To get the average over a time interval, you measure the counters at the start and end and divide the difference by the time interval. For the very first measurement there is nothing to subtract, so you automatically get the count since boot divided by the time since boot. The absolute counters themselves can be seen using another option to vmstat, as shown below.

% vmstat -s
        0 swap ins
        0 swap outs
        0 pages swapped in
        0 pages swapped out
   208724 total address trans. faults taken
    45821 page ins
     3385 page outs
    61800 pages paged in
    27132 pages paged out
      712 total reclaims
      712 reclaims from free list
        0 micro (hat) faults
   208724 minor (as) faults
    44636 major faults
    34020 copy-on-write faults
    77883 zero fill page faults
     9098 pages examined by the clock daemon
        1 revolutions of the clock hand
    27748 pages freed by the clock daemon
     1333 forks
      187 vforks
     1589 execs
  6730851 cpu context switches
 12848989 device interrupts
   340014 traps
 28393796 system calls
   285638 total name lookups (cache hits 91%)
      108 toolong
   159288 user   cpu
   123409 system cpu
 15185004 idle   cpu
   192794 wait   cpu
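
You can check the arithmetic against the vmstat output above. The last four counters are CPU time in clock ticks accumulated since boot: 159288 + 123409 + 15185004 + 192794 = 15660495 ticks in total. Idle plus wait comes to 15377798 ticks, which is 98 percent of the total and matches the id value on vmstat's first line (vmstat has no wait column, so wait time shows up as idle); user and system each round to the 1 percent shown.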

The other closely related command is mpstat, which shows basically the same data but on a per-CPU basis. Here's some output from a dual-CPU system.

% mpstat 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0    4    82   17   43    0    5    1    0   182    1   1   1  97
  2    1   0    3    81   17   42    0    5    2    0   181    1   1   1  97
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0   39   156  106   42    0    5   21    0    30    0   2  61  37
  2    0   0    0   158  106  103    5    4    8    0  1704    3  36  61   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0  19   28   194  142   96    1    4   18    0   342    1   8  76  16
  2    0   6   11   193  141   62    4    4   10    0   683    5  15  74   6
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0  22   33   215  163   87    0    7    0    0   287    1   4  90   5
  2    0  22   29   214  164   88    2    8    1    0   304    2   5  89   4


Where does it all come from?
The data is read from the kernel statistics interface. Most of the data is maintained on a per-CPU basis by the kernel and is combined into the overall summaries by the commands themselves. The same kstat(3K) programming interface that is used to get at the per-disk statistics is used here. This is quite different to the per-process interface I described last month in "Probing Processes." It is also very lightweight; it takes only a few microseconds to retrieve each data structure. The data structures are based on those described in the file /usr/include/sys/sysinfo.h, the system information header file. Of course, all the raw kstat data is directly available to the SE toolkit, which contains a customized vmstat.se script and a hybrid mpvmstat.se.
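
If you want to see the interface in action, here is a minimal sketch in C of reading the global sysinfo structure discussed below. The kstat names (module unix, instance 0, name sysinfo) are the Solaris 2 ones, error handling is pared down to the minimum, and you link with -lkstat.

#include <stdio.h>
#include <kstat.h>
#include <sys/sysinfo.h>

int
main(void)
{
        kstat_ctl_t *kc;        /* handle on the kernel's kstat chain */
        kstat_t *ksp;
        sysinfo_t si;

        if ((kc = kstat_open()) == NULL) {
                perror("kstat_open");
                return (1);
        }
        /* look up the single global sysinfo kstat and copy it out */
        ksp = kstat_lookup(kc, "unix", 0, "sysinfo");
        if (ksp == NULL || kstat_read(kc, ksp, &si) == -1) {
                perror("sysinfo kstat");
                return (1);
        }
        printf("updates %lu  runque %lu\n", si.updates, si.runque);
        (void) kstat_close(kc);
        return (0);
}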

While the kstat programming interface is stable, the metrics obtained via that interface are not fixed. They vary from one type of hardware to another, from one OS release or patch to another, and are extremely dependent on the particular implementation in use. The metrics obtained by vmstat don't seem to vary much, but they could change without warning.

Since both performance and behavior are also implementation dependent, you just have to put up with this problem. Performance tools are amongst the least portable software products available.

Process queues
If we follow through the fields of vmstat, the first group is labelled procs and holds the r, b, and w columns. These are derived from the sysinfo data, for which there is a single global kstat.

typedef struct sysinfo {        /* (update freq) update action          */
        ulong   updates;        /* (1 sec) ++                           */
        ulong   runque;         /* (1 sec) += num runnable procs        */
        ulong   runocc;         /* (1 sec) ++ if num runnable procs > 0 */
        ulong   swpque;         /* (1 sec) += num swapped procs         */
        ulong   swpocc;         /* (1 sec) ++ if num swapped procs > 0  */
        ulong   waiting;        /* (1 sec) += jobs waiting for I/O      */
} sysinfo_t;

As the comments indicate, it is updated once per second. The extra data called runocc and swpocc are displayed by sar -q and are the occupancy of the queues. Solaris 2 counts the total number of swapped-out idle processes, so if you see any swapped jobs registered here there is no cause for alarm. sar -q is strange: if the number to be displayed is zero, it displays nothing at all, just white space. This makes it very hard to extract data to be plotted.
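
Because the kernel adds the instantaneous queue length into runque once per second, turning these accumulators into vmstat's r column is a matter of taking two snapshots and dividing the change in runque by the change in updates. A sketch (the function name is mine, not vmstat's):

/* Mean run queue length between two sysinfo snapshots, which is
 * roughly what vmstat prints as "r".  updates advances once per
 * second, so its delta is the interval length in seconds. */
ulong_t
runqueue_avg(sysinfo_t *start, sysinfo_t *end)
{
        ulong_t seconds = end->updates - start->updates;

        return (seconds ? (end->runque - start->runque) / seconds : 0);
}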

Virtual memory counters
The next part of vmstat lists the free swap space and memory. This is obtained as a kstat from a single global vminfo structure.

typedef struct vminfo {         /* (update freq) update action          */
        longlong_t freemem;     /* (1 sec) += freemem in pages          */
        longlong_t swap_resv;   /* (1 sec) += reserved swap in pages    */
        longlong_t swap_alloc;  /* (1 sec) += allocated swap in pages   */
        longlong_t swap_avail;  /* (1 sec) += unreserved swap in pages  */
        longlong_t swap_free;   /* (1 sec) += unallocated swap in pages */
} vminfo_t;

The only swap number shown by vmstat is swap_avail, which is the most important one. If it ever gets to zero, your system will hang and be unable to start more processes! For some strange reason sar -r reports swap_free instead and converts the data into stupid units of 512-byte blocks. The bizarre state of the sar command is one of the reasons we were motivated to create the SE toolkit in the first place!
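
These counters also accumulate once per second, so averaging over an interval is the same snapshot-and-divide exercise, followed by a conversion from pages to the Kbytes that vmstat prints. A sketch (the function name is mine, and I'm assuming the interval length in seconds is taken from the sysinfo updates counter):

#include <unistd.h>     /* sysconf() */

/* Mean unreserved swap in Kbytes over an interval, from two vminfo
 * snapshots -- roughly how vmstat's "swap" column is derived. */
longlong_t
swap_avail_kb(vminfo_t *start, vminfo_t *end, ulong_t seconds)
{
        longlong_t pages = (end->swap_avail - start->swap_avail) / seconds;

        return (pages * (sysconf(_SC_PAGESIZE) / 1024));
}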

Paging counters
This one is per-CPU; it's also clear that vmstat and sar don't show all of the available information. The states and state transitions being counted were described in detail in my first column on memory issues.

typedef struct cpu_vminfo {
        ulong   pgrec;          /* page reclaims (includes pageout)     */
        ulong   pgfrec;         /* page reclaims from free list         */
        ulong   pgin;           /* pageins                              */
        ulong   pgpgin;         /* pages paged in                       */
        ulong   pgout;          /* pageouts                             */
        ulong   pgpgout;        /* pages paged out                      */
        ulong   swapin;         /* swapins                              */
        ulong   pgswapin;       /* pages swapped in                     */
        ulong   swapout;        /* swapouts                             */
        ulong   pgswapout;      /* pages swapped out                    */
        ulong   zfod;           /* pages zero filled on demand          */
        ulong   dfree;          /* pages freed by daemon or auto        */
        ulong   scan;           /* pages examined by pageout daemon     */
        ulong   rev;            /* revolutions of the page daemon hand  */
        ulong   hat_fault;      /* minor page faults via hat_fault()    */
        ulong   as_fault;       /* minor page faults via as_fault()     */
        ulong   maj_fault;      /* major page faults                    */
        ulong   cow_fault;      /* copy-on-write faults                 */
        ulong   prot_fault;     /* protection faults                    */
        ulong   softlock;       /* faults due to software locking req   */
        ulong   kernel_asflt;   /* as_fault()s in kernel addr space     */
        ulong   pgrrun;         /* times pager scheduled                */
} cpu_vminfo_t;

A few of these might need some extra explanation. Protection faults occur when a program tries to access memory it shouldn't, gets a segmentation violation signal, and dumps a core file. hat_faults only occur on systems that have a software-managed memory management unit (sun4c and sun4u). Trivia alert: hat stands for hardware address translation.
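
Since there is one of these structures per CPU, a tool that wants system-wide totals has to walk the kstat chain and sum them itself. A sketch, assuming the per-CPU data is published as the cpu_stat kstats, each carrying a cpu_stat_t that embeds a cpu_vminfo_t:

#include <string.h>
#include <kstat.h>
#include <sys/sysinfo.h>

/* Sum pages paged in across all CPUs by walking the kstat chain
 * for the per-CPU "cpu_stat" kstats; errors are silently skipped. */
ulong_t
total_pgpgin(kstat_ctl_t *kc)
{
        kstat_t *ksp;
        cpu_stat_t cs;
        ulong_t total = 0;

        for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
                if (strcmp(ksp->ks_module, "cpu_stat") == 0 &&
                    kstat_read(kc, ksp, &cs) != -1)
                        total += cs.cpu_vminfo.pgpgin;
        }
        return (total);
}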

I'll skip the disk counters printed by vmstat, as vmstat just reads the same kstat data as iostat and provides a crude count of the number of operations.

CPU usage and event counters
Let's remind ourselves what vmstat looks like.

% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 s0 s2 s3   in   sy   cs us sy id
 0 0 0  72724 25348   0   2  3  1  1  0  0  0  0  1  0   63  362   85  1  1 98

The last six columns show the interrupt rate, system call rate, context switch rate, and CPU user, system, and idle time. The per-CPU kstat that these are derived from is the biggest kstat yet, with about sixty values. Some of them are summarized by sar, but there is a lot of interesting information here that is being carefully recorded by the kernel, read by vmstat, and then just thrown away. Look down the comments, and at the end I'll point out some non-obvious and interesting values. The cpu and wait entries are arrays holding the four CPU states, usr/sys/idle/wait, and the three wait states, io/swap/pio. Only the io wait state is implemented, so the breakdown is actually redundant. I made the mpvmstat.se script display the wait states before I realized that they were always zero :-(.

typedef struct cpu_sysinfo {
        ulong   cpu[CPU_STATES]; /* CPU utilization                     */
        ulong   wait[W_STATES]; /* CPU wait time breakdown              */
        ulong   bread;          /* physical block reads                 */
        ulong   bwrite;         /* physical block writes (sync+async)   */
        ulong   lread;          /* logical block reads                  */
        ulong   lwrite;         /* logical block writes                 */
        ulong   phread;         /* raw I/O reads                        */
        ulong   phwrite;        /* raw I/O writes                       */
        ulong   pswitch;        /* context switches                     */
        ulong   trap;           /* traps                                */
        ulong   intr;           /* device interrupts                    */
        ulong   syscall;        /* system calls                         */
        ulong   sysread;        /* read() + readv() system calls        */
        ulong   syswrite;       /* write() + writev() system calls      */
        ulong   sysfork;        /* forks                                */
        ulong   sysvfork;       /* vforks                               */
        ulong   sysexec;        /* execs                                */
        ulong   readch;         /* bytes read by rdwr()                 */
        ulong   writech;        /* bytes written by rdwr()              */
        ulong   rcvint;         /* XXX: UNUSED                          */
        ulong   xmtint;         /* XXX: UNUSED                          */
        ulong   mdmint;         /* XXX: UNUSED                          */
        ulong   rawch;          /* terminal input characters            */
        ulong   canch;          /* chars handled in canonical mode      */
        ulong   outch;          /* terminal output characters           */
        ulong   msg;            /* msg count (msgrcv()+msgsnd() calls)  */
        ulong   sema;           /* semaphore ops count (semop() calls)  */
        ulong   namei;          /* pathname lookups                     */
        ulong   ufsiget;        /* ufs_iget() calls                     */
        ulong   ufsdirblk;      /* directory blocks read                */
        ulong   ufsipage;       /* inodes taken with attached pages     */
        ulong   ufsinopage;     /* inodes taken with no attached pages  */
        ulong   inodeovf;       /* inode table overflows                */
        ulong   fileovf;        /* file table overflows                 */
        ulong   procovf;        /* proc table overflows                 */
        ulong   intrthread;     /* interrupts as threads (below clock)  */
        ulong   intrblk;        /* intrs blkd/preempted/released (swtch) */
        ulong   idlethread;     /* times idle thread scheduled          */
        ulong   inv_swtch;      /* involuntary context switches         */
        ulong   nthreads;       /* thread_create()s                     */
        ulong   cpumigrate;     /* cpu migrations by threads            */
        ulong   xcalls;         /* xcalls to other cpus                 */
        ulong   mutex_adenters; /* failed mutex enters (adaptive)       */
        ulong   rw_rdfails;     /* rw reader failures                   */
        ulong   rw_wrfails;     /* rw writer failures                   */
        ulong   modload;        /* times loadable module loaded         */
        ulong   modunload;      /* times loadable module unloaded       */
        ulong   bawrite;        /* physical block writes (async)        */
/* Following are gathered only under #ifdef STATISTICS in source        */
        ulong   rw_enters;      /* tries to acquire rw lock             */
        ulong   win_uo_cnt;     /* reg window user overflows            */
        ulong   win_uu_cnt;     /* reg window user underflows           */
        ulong   win_so_cnt;     /* reg window system overflows          */
        ulong   win_su_cnt;     /* reg window system underflows         */
        ulong   win_suo_cnt;    /* reg window system user overflows     */
} cpu_sysinfo_t;

Some of the numbers printed by mpstat are visible here. The smtx value used to watch for kernel contention is mutex_adenters. The srw value is the sum of the failures to obtain a readers/writer lock. The term xcalls is shorthand for cross-calls. A cross-call occurs when one CPU passes work to another CPU by interrupting it.
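
To see how the raw ticks become mpstat's percentage columns, here is a sketch that derives the usr figure from two snapshots of one CPU's cpu[] array (CPU_USER and CPU_STATES are the indexes defined in sys/sysinfo.h; the same arithmetic with CPU_KERNEL, CPU_WAIT, and CPU_IDLE gives the sys, wt, and idl columns):

/* Percentage of an interval one CPU spent in user mode, from two
 * cpu_sysinfo snapshots.  The four cpu[] counters advance in clock
 * ticks, so their combined delta is the length of the interval. */
ulong_t
user_pct(cpu_sysinfo_t *start, cpu_sysinfo_t *end)
{
        ulong_t total = 0;
        int i;

        for (i = 0; i < CPU_STATES; i++)
                total += end->cpu[i] - start->cpu[i];
        if (total == 0)
                return (0);
        return (100 * (end->cpu[CPU_USER] - start->cpu[CPU_USER]) / total);
}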

Could you do better?
vmstat displays 22 columns of numbers, summarizing more than 100 underlying measures (even more on a multiprocessor). It's good to have a lot of different things summarized together, but the layout of vmstat (and sar) is as much a result of their long history as of deliberate design.

I'm afraid I'm going to end up plugging the SE toolkit again. It's just so easy to get at this data and do things with it. All the kstats can be read by any user, with no need to be setuid root (this is a key advantage of Solaris 2; other Unix implementations read the kernel directly, so their performance tools have to run with root permissions).

If you want to customize your very own vmstat, you could either write one from scratch in C using the kstat library, or load the SE toolkit and spend a few seconds hacking at a trivial script. Either way, if you come up with something that you think is an improvement, while staying with the basic concept of one line of output that fits in 80 columns, send it to me. The best ones will be highlighted in a future column, and we'll add them to the next version of SE.

If you are curious about some of these numbers, but can't be bothered to write your own SE scripts, you should try out the GUI front end to the raw kstat data that is provided in the SE toolkit as /opt/RICHPse/examples/infotool.se.


Resources