Click on our Sponsors to help Support SunWorld

Prying into processes and workloads

Presenting a new way to look at per-process information, techniques for summarizing processes into workloads, and analysis of how (in)accurate CPU usage measures really are

April 1998

Abstract

Need to move beyond the standard tools used for monitoring and measuring how your systems are functioning? Performance Q&A columnist Adrian Cockcroft unveils his extension to the SE toolkit, a process class designed to help you plunge further into system processes. Look here for several new ways to collect and display data. (5,700 words)

Mail this
article to
a friend

How can I tell which processes are causing problems and which ones are stuck in a bottleneck?

A significant amount of data is available that is not shown by the ps command. In addition, there are more clever ways to process and display data than top or proctool use. A new extension to the SE toolkit implements some of my ideas in this area. Along the way it becomes clear that the CPU usage measurements everyone relies on are somewhat inaccurate.

Process data sources
I described process data sources in my August 1996 Performance Q&A column, but this time I'll go a step further with the data. These data structures are described in full in the proc(4) manual page. They are also available in the SE toolkit, so if you want to obtain the data and play around with it, you should look at the code for ps-ax.se and msacct.se.

The interface to /proc involves sending ioctl commands or opening special pseudo-files and reading them (a new feature of Solaris 2.6). The data that ps uses is called PIOCPSINFO. Here's what you get back from ioctl (you get slightly different data if you read it from the pseudo-file):

proc(4)                   File Formats                    proc(4)
  PIOCPSINFO
     This returns miscellaneous process information such as that
     reported by ps(1). p is a pointer to a prpsinfo structure
     containing at least the following fields:
     typedef struct prpsinfo {
       char        pr_state;    /* numeric process state (see pr_sname) */
       char        pr_sname;    /* printable character representing pr_state */
       char        pr_zomb;     /* !=0: process terminated but not waited for */
       char        pr_nice;     /* nice for cpu usage */
       u_long      pr_flag;     /* process flags */
       int         pr_wstat;    /* if zombie, the wait() status */
       uid_t       pr_uid;      /* real user id */
       uid_t       pr_euid;     /* effective user id */
       gid_t       pr_gid;      /* real group id */
       gid_t       pr_egid;     /* effective group id */
       pid_t       pr_pid;      /* process id */
       pid_t       pr_ppid;     /* process id of parent */
       pid_t       pr_pgrp;     /* pid of process group leader */
       pid_t       pr_sid;      /* session id */
       caddr_t     pr_addr;     /* physical address of process */
       long        pr_size;     /* size of process image in pages */
       long        pr_rssize;   /* resident set size in pages */
       u_long      pr_bysize;   /* size of process image in bytes */
       u_long      pr_byrssize; /* resident set size in bytes */
       caddr_t     pr_wchan;    /* wait addr for sleeping process */
       short      pr_syscall;  /* system call number (if in syscall) */
       id_t        pr_aslwpid;  /* lwp id of the aslwp; zero if no aslwp */
       timestruc_t pr_start;    /* process start time, sec+nsec since epoch */
       timestruc_t pr_time;     /* usr+sys cpu time for this process */
       timestruc_t pr_ctime;    /* usr+sys cpu time for reaped children */
       long        pr_pri;      /* priority, high value is high priority */
       char        pr_oldpri;   /* pre-SVR4, low value is high priority */
       char        pr_cpu;      /* pre-SVR4, cpu usage for scheduling */
       u_short     pr_pctcpu;   /* % of recent cpu time, one or all lwps */
       u_short     pr_pctmem;   /* % of system memory used by the process */
       dev_t       pr_ttydev;   /* controlling tty device (PRNODEV if none) */
       char        pr_clname[PRCLSZ]; /* scheduling class name */
       char        pr_fname[PRFNSZ]; /* last component of exec()ed pathname */
       char        pr_psargs[PRARGSZ];/* initial characters of arg list */
       int         pr_argc;     /* initial argument count */
       char        **pr_argv;   /* initial argument vector */
       char        **pr_envp;   /* initial environment vector */
     } prpsinfo_t;

You can get the data for each lightweight process of a multithreaded process separately. While there's a lot of useful-looking information there, there's no sign of the high-resolution microstate accounting that /usr/proc/bin/ptime (and msacct.se) display. They use a separate ioctl, PIOCUSAGE:

proc(4)                   File Formats                    proc(4)
  PIOCUSAGE
     When applied to the process file descriptor, PIOCUSAGE
     returns the process usage information; when applied to an
     lwp file descriptor, it returns usage information for the
     specific lwp.   p points to a prusage structure which is
     filled by the operation. The prusage structure contains at
     least the following fields:
     typedef struct prusage {
          id_t           pr_lwpid;    /* lwp id.  0: process or defunct */
          u_long         pr_count;    /* number of contributing lwps */
          timestruc_t    pr_tstamp;   /* current time stamp */
          timestruc_t    pr_create;   /* process/lwp creation time stamp */
          timestruc_t    pr_term;     /* process/lwp termination timestamp */
          timestruc_t    pr_rtime;    /* total lwp real (elapsed) time */
          timestruc_t    pr_utime;    /* user level CPU time */
          timestruc_t    pr_stime;    /* system call CPU time */
          timestruc_t    pr_ttime;    /* other system trap CPU time */
          timestruc_t    pr_tftime;   /* text page fault sleep time */
          timestruc_t    pr_dftime;   /* data page fault sleep time */
          timestruc_t    pr_kftime;   /* kernel page fault sleep time */
          timestruc_t    pr_ltime;    /* user lock wait sleep time *
          timestruc_t    pr_slptime;  /* all other sleep time */
          timestruc_t    pr_wtime;    /* wait-cpu (latency) time */
          timestruc_t    pr_stoptime; /* stopped time */
          u_long         pr_minf;     /* minor page faults */
          u_long         pr_majf;     /* major page faults */
          u_long         pr_nswap;    /* swaps */
          u_long         pr_inblk;    /* input blocks */
          u_long         pr_oublk;    /* output blocks */
          u_long         pr_msnd;     /* messages sent */
          u_long         pr_mrcv;     /* messages received */
          u_long         pr_sigs;     /* signals received */
          u_long         pr_vctx;     /* voluntary context switches */
          u_long         pr_ictx;     /* involuntary context switches */
          u_long         pr_sysc;     /* system calls */
          u_long         pr_ioch;     /* chars read and written */
     } prusage_t;
     PIOCUSAGE can be applied to a zombie   process   (see
     PIOCPSINFO).
     Applying PIOCUSAGE to a process that does not have micro-
     state accounting enabled will enable microstate accounting
     and return an estimate of times spent in the various states
     up to this point.   Further invocations of PIOCUSAGE will
     yield accurate microstate time accounting from this point.
     To disable microstate accounting, use PIOCRESET with the
     PR_MSACCT flag.

You'll find a lot of useful data here. The time spent waiting for various events is a key measure. I summarize it in msacct.se as follows:

Elapsed time         3:20:50.049  Current time Fri Jul 26 12:49:28 1996
User CPU time           2:11.723  System call time        1:54.890
System trap time           0.006  Text pfault sleep          0.000
Data pfault sleep          0.023  Kernel pfault sleep        0.000
User lock sleep            0.000  Other sleep time     3:16:43.022
Wait for CPU time          0.382  Stopped time               0.000

Advertisements

Microstate accounting
Microstate accounting is not turned on by default. It slows the system down very slightly. Although it was a default up to Solaris 2.3, from Solaris 2.4 on it is enabled the first time you read the data. CPU time is normally measured by sampling, 100 times per second, the state of all the CPUs from the clock interrupt. Microstate accounting, on the other hand, takes a high-resolution timestamp on every state change, every system call, every page fault, and every scheduler change. Microstate accounting doesn't miss anything, and the results are much more accurate than those from sampled measurements. The normal measures of CPU user and system time made by sampling can be off by 20 percent or more because the sample is biased, not random. Process scheduling employs the same clock interrupt used to measure CPU usage, and this approach leads to systematic errors in the sampled data. The microstate-measured CPU-usage data does not suffer from such errors.

For example, consider a performance monitor that wakes up every 10 seconds, reads some data from the kernel, then prints the results and sleeps. On a fast system, the total CPU time consumed per wake-up might be a few milliseconds. On exit from the clock interrupt, the scheduler wakes up processes and kernel threads that have been sleeping. Processes that sleep consume less than their allotted CPU time-quanta and always run at the highest timeshare priority.

On a lightly loaded system there is no queue for access to the CPU, so immediately after the clock interrupt, it's likely that the performance monitor will be scheduled. If it runs for less than 10 milliseconds it will have completed its task and be sleeping again by the time the next clock interrupt comes along. Now, given that CPU time is allocated based on what is running when the clock interrupt occurs, you can see that the performance monitor could be sneaking a bite of CPU time whenever the clock interrupt isn't looking. This is an artifact of the dual functions of the clock interrupt -- if two independent unsynchronized interrupts were used, one for scheduling and one for performance measurement, the errors would be averaged away over time.

Another approach to the problem is to sample more frequently by running the clock interrupt more often. This does not remove the bias, but it makes it harder to hide small bites of the CPU. The overhead of splitting the interrupts up is not worth implementing. And, while it's possible to increase the CPU clock rate for the sake of more accurate measurements, this method creates a higher overhead than using direct microstate measurement. In any case, microstate measurement is far more useful and accurate, as it measures more interesting state transitions. When there is a significant amount of queuing for CPU time, the performance monitor will be delayed by a random amount of time, so it will be seen by the clock interrupt some of the time.

As a simple experiment I ran vmstat with a one-second interval and output redirected to /dev/null so that it would not be delayed by display or filesystem operations.

The ptime command uses microstate accounting to accurately measure the CPU time used. I left this running for a long time on a pair of fairly idle systems. I modified the ps-p.se script to show CPU time used down to 100-hertz accuracy of the underlying measurements. After a few minutes the process had accumulated only 15 ticks of CPU time on an 85-megahertz microSPARC, and only one tick of CPU time on a dual 300-megahertz UltraSPARC. After an hour the number of ticks had not increased at all! (The error increases on a quieter system with a faster CPU, as it is easier to sneak a bite of the CPU time without the clock noticing.)

Using microstate accounting to measure the same processes, however, it turned out that about 4.8 seconds of CPU time had been used on the 85-megahertz microSPARC and 1.2 seconds on the 300-megahertz UltraSPARC. This is an extreme case, but the fact remains that the actual CPU usage is far more than is being reported by the normal mechanism. Since the number of ticks is not increasing linearly, the actual error is infinite. The longer I let this run, the larger it gets:

micro85% /usr/proc/bin/ptime vmstat 1 >/dev/null

^C
real  1:03:26.115
user        2.913
sys         1.891

ultra300% /usr/proc/bin/ptime vmstat 1 >/dev/null

^C
real  1:02:01.626
user        0.621
sys         0.555

Just before stopping vmstat, I ran my modified ps-p.se on it on the two systems:

micro85% se ps-p.se 6513
   PID TT       S  TIME    COMMAND
  6513 pts/1    S  0:00.16 vmstat 1

ultra300% se ps-p.se 21560
   PID TT       S  TIME    COMMAND
 21560 pts/3    S  0:00.01 vmstat 1

What is needed is a way to monitor process CPU usage more accurately and using more convenient commands than ptime and msacct.se. I decided to extend the SE toolkit to include a process class, process_class.se, that could be reused by several commands and would provide the information that I really want about each process on a system. I've tried to get performance-tool vendors interested in microstate data without any success. Hopefully, offering an example of how to get and use this information will generate increased user demand for this kind of tool.

The process class
The basic requirement for the process class was that it should collect both the psinfo and usage data for every process on the system. For consistency, all data should be collected at once, and as quickly as possible, then offered for display one process at a time. This avoids the problem inherent in the ps command, where the data for the last process displayed is measured after all the other processes have been measured and displayed, so the data is not associated with a consistent timestamp.

The psinfo data contains a measure of recent average CPU usage, but I really want all the data measured over the time interval since the last reading. This gets complex as new processes arrive and old ones die. Matching up all the data is not as trivial as measuring the performance deltas for the CPUs or disks in the system. There also can be up to 32000 processes to keep track of.

The resulting code is quite complex, but it does the job, and all the complexity is hidden in the class code in /opt/RICHPse/include/process_class.se. Note that this is not part of SE3.0, as I wrote it after that release. It is provided as a tar file that can be loaded over the top of an SE3.0 installation, and an improved version will be included in the next release of SE.

Much of the data is left as a difference over the interval. To calculate rates, it can be divided by the interval that is provided as part of the data and that is accurately measured as the difference in time for each process separately. If the collection process is delayed on a busy system by other processing, the measurements are still accurate. I'll discuss the data in detail next.

	
			Control Entries
/* codes for action$ */
#define PROC_ACTION_INIT       0    /* starting point -> next index */
#define PROC_ACTION_PID        1    /* get the specified pid */
#define PROC_ACTION_NEXT_INDEX 2    /* index order is based on /proc  */
#define PROC_ACTION_NEXT_PID   3    /* search for pid and return data */

class proc_class_t {
	/* input controls */
	int	index$;	             /* always contains current index or -1 */
	int	pid$;	             /* always contains current pid */
	int	action$;

It is a convention in SE that control variables for a class have a $ sign attached to them. When process data is read it is returned in the order that /proc provides entries, not in order of process ID. The index$ entry counts through the data in this order. When all processes have been read, it returns -1. This is a sign to the calling program that it should sleep a while before reading any more data. On the next read, all the process data is captured, then data for the first process is returned. By default, subsequent reads return data in index order. The pid$ entry is always updated to contain the process ID. The action$ entry controls the automatic behavior of the class. It starts of by initializing the class and changes itself to PROC_ACTION_NEXT_INDEX. If you would rather get data for a particular pid, you can set action$ to PROC_ACTION_PID and set pid$ to specify which one. If you want data to be returned in order of increasing pid, you set action$ to PROC_ACTION_NEXT_PID. This mode is less efficient in this implementation.

			Summary Data
	/* summary totals */
	double  lasttime;       /* timestamp for the end of the last update */
	int	nproc;          /* current number of processes */
	int	newproc;        /* number of new processes this time */
	int	deadproc;       /* number of processes that died */

The timestamp indicates that all process data was collected before that time. The current number of processes and the number of new ones are easy to understand; the handling of dead processes is a bit odd. Rather than being ignored, dead processes are provided after all current processes are reported. This allows a last chance to see what the process did before it died, as the data for that process is erased once it is reported for the last time. It would be nice to report the process accounting record, but that does not include the pid. Also processes may have come and gone completely between samples. These will show up as child CPU activity below.

			Per-Process Data
	/* output data for specified process */
	double interval;        /* measured time interval for averages */
	double timestamp;       /* last time process was measured */
	double creation;        /* process start time */
	double termination;     /* process termination time stamp */
	double elapsed;         /* elapsed time for all lwps in process */
	double total_user;      /* current totals in seconds */
	double total_system;
	double total_child;     /* child processes that have exited */
	double user_time;       /* user time in this interval */
	double system_time;     /* system call time in this interval */
	double trap_time;       /* system trap time in interval */
	double child_time;      /* child CPU in this interval */
	double text_pf_time;    /* text page fault wait in interval */
	double data_pf_time;    /* data page fault wait in interval */
	double kernel_pf_time;  /* kernel page fault wait in interval */
	double user_lock_time;  /* user lock wait in interval */
	double sleep_time;      /* all other sleep time */
	double cpu_wait_time;   /* time on runqueue waiting for CPU */
	double stoptime;        /* time stopped from ^Z */
	ulong syscalls;         /* syscall/interval for this process */
	ulong inblocks;         /* input blocks/interval - metadata only - not interesting */
	ulong outblocks;        /* output blocks/interval - metadata only - not interesting */
	ulong vmem_size;        /* size in KB */
	ulong rmem_size;        /* RSS in KB */
#ifdef XMAP                     /* XMAP not yet implemented */
	ulong pmem_size;        /* private mem in KB */
	ulong smem_size;        /* shared mem in KB */
#endif
	ulong maj_faults;       /* majf/interval */
	ulong min_faults;       /* minf/interval - always zero - bug? */
	ulong total_swaps;      /* swapout count */
	long  priority;         /* current sched priority */
	long  niceness;         /* current nice value */
	char  sched_class[PRCLSZ];	/* name of class */
	ulong messages;         /* msgin+msgout/interval */
	ulong signals;          /* signals/interval */
	ulong vcontexts;        /* voluntary context switches/interval */
	ulong icontexts;        /* involuntary context switches/interval */
	ulong charios;          /* characters in and out/interval */
	ulong lwp_count;        /* number of lwps for the process */
	int   uid;              /* current uid */
	long  ppid;             /* parent pid */
	char  fname[PRFNSZ];    /* last component of exec'd pathname */
	char  args[PRARGSZ];    /* initial part of command name and arg list */
	proc$() {
		/* lots of complex code and data hides in here */
}

A future version of this class will also include the extended memory information described in last month's column (see Resources below) and shown above as #ifdef XMAP. Most of the above data is self-explanatory. All times are in seconds in double precision with microsecond accuracy. The minor fault counter seems to be broken as it always reports zero. The inblock and outblock counters are uninteresting as they only refer to filesystem metadata for the old-style buffer cache. The charios counter includes all read and write data for all file descriptors so you can see the file I/O rate. The lwp_count is not the current number of lwps; it is a count of how many lwps the process has ever had. If the number is more than one the process is multithreaded. It's possible to access each lwp in turn and read its psinfo and usage data. The process data is the sum of these.

Child data is accumulated when a child process exits. The CPU used by the child is added into the data for the parent. This can be used to find processes that are forking lots of little short-lived commands.

Data access permissions
To get at process data you must have access permissions for entries in /proc or run as a setuid root command. In Solaris 2.5.1, using the ioctl access method for /proc, you can only access processes that you own, unless you login as root. In Solaris 2.6, although you cannot access the /proc/pid entry for every process, you can read /proc/pid/psinfo and /proc/pid/usage for every process. This means that the full functionality of ps and the process class can be employed by any user. The code for process_class.se conditionally uses the new Solaris 2.6 access method and the slightly changed definition of the psinfo data structure.

% ls -l /proc/3209
total 2217
-rw-------   1 adrianc  9506     1118208 Mar  5 22:39 as
-r--------   1 adrianc  9506         152 Mar  5 22:39 auxv
-r--------   1 adrianc  9506          36 Mar  5 22:39 cred
--w-------   1 adrianc  9506           0 Mar  5 22:39 ctl
lr-x------   1 adrianc  9506           0 Mar  5 22:39 cwd -> /
dr-x------   2 adrianc  9506         416 Mar  5 22:39 fd/
-r--r--r--   1 adrianc  9506         120 Mar  5 22:39 lpsinfo
-r--------   1 adrianc  9506         912 Mar  5 22:39 lstatus
-r--r--r--   1 adrianc  9506         536 Mar  5 22:39 lusage
dr-xr-xr-x   3 adrianc  9506          48 Mar  5 22:39 lwp/
-r--------   1 adrianc  9506        1440 Mar  5 22:39 map
dr-x------   2 adrianc  9506         288 Mar  5 22:39 object/
-r--------   1 adrianc  9506        1808 Mar  5 22:39 pagedata
-r--r--r--   1 adrianc  9506         336 Mar  5 22:39 psinfo
-r--------   1 adrianc  9506        1440 Mar  5 22:39 rmap
lr-x------   1 adrianc  9506           0 Mar  5 22:39 root -> /
-r--------   1 adrianc  9506        1440 Mar  5 22:39 sigact
-r--------   1 adrianc  9506        1232 Mar  5 22:39 status
-r--r--r--   1 adrianc  9506         256 Mar  5 22:39 usage
-r--------   1 adrianc  9506           0 Mar  5 22:39 watch
-r--------   1 adrianc  9506        2280 Mar  5 22:39 xmap

The pea.se script

Usage: se [-DWIDE] pea.se [interval]

The pea.se script is an extended process monitor that acts as a test program for process_class.se and displays very useful information that is not extracted by any standard tool. It is based on the microstate accounting information described above. The script runs continuously and reports on the average data for each active process in the measured interval. This reporting is very different than tools such as top or ps, which print the current data only. There are two display modes: By default pea.se fits into an 80-column format, but the wide mode has much more information. The initial data display includes all processes and shows their average data since the process was created. Any new processes that appear are also treated this way. When a process is measured a second time its averages for the measured interval are displayed if it has consumed any CPU time. Idle processes are ignored. The output is generated every 10 seconds by default. It can report only on processes that it has permission to access, so it must be run as root to see everything on Solaris 2.5.1. And as described above, it sees everything on Solaris 2.6 without needing root permissions.

% se pea.se
09:34:06 name  lwp   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf
olwm             1   322   299  9506   0.01   0.01  0.03  0.00   2328  1032  0.0
maker5X.exe      1 21508     1  9506   0.55   0.33  0.04  0.00  29696 19000  0.0
perfmeter        1   348     1  9506   0.04   0.02  0.00  0.00   3776  1040  0.0
cmdtool          1   351     1  9506   0.01   0.00  0.03  0.00   3616   960  0.0
cmdtool          1 22815   322  9506   0.08   0.03  2.28  0.00   3616  1552  2.2
xterm            1 22011  9180  9506   0.04   0.03  0.30  0.00   2840  1000  0.0
se.sparc.5.5.1   1 23089 22818  9506   1.92   0.07  0.00  0.00   1744  1608  0.0
fa.htmllite      1 21559     1  9506   0.00   0.00  0.00  0.00   1832    88  0.0
fa.tooltalk      1 21574     1  9506   0.00   0.00  0.00  0.00   2904  1208  0.0
nproc 31  newproc 0  deadproc 0

% se -DWIDE pea.se
09:34:51 name  lwp   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf  inblk outblk chario   sysc   vctx   ictx   msps
maker5X.exe      1 21508     1  9506   0.86   0.36  0.10  0.00  29696 19088  0.0   0.00   0.00   5811    380  60.03   0.30   0.20
perfmeter        1   348     1  9506   0.03   0.02  0.00  0.00   3776  1040  0.0   0.00   0.00    263     12   1.39   0.20   0.29
cmdtool          1 22815   322  9506   0.04   0.00  0.04  0.00   3624  1928  0.0   0.00   0.00    229      2   0.20   0.30   0.96
se.sparc.5.5.1   1  3792   341  9506   0.12   0.01  0.00  0.00   9832  3376  0.0   0.00   0.00      2      9   0.20   0.10   4.55
se.sparc.5.5.1   1 23097 22818  9506   0.75   0.06  0.00  0.00   1752  1616  0.0   0.00   0.00    119     19   0.10   0.30  20.45
fa.htmllite      1 21559     1  9506   0.00   0.00  0.00  0.00   1832    88  0.0   0.00   0.00      0      0   0.10   0.00   0.06
nproc 31  newproc 0  deadproc 0

The pea.se script is 90 lines of code, a few simple printfs in a loop. The real work is done in process_class.se (over 500 lines of code) and can be used by any other script. The default data shown by pea.se consists of:

Current time and process name
Number of lwps for the process, so you can see which are multithreaded
Process ID, parent process ID, and user ID
User and system CPU percentage measured accurately by microstate accounting
Process time percentage spent waiting in the run queue or for page faults to complete
CPU percentage accumulated from child processes that have exited
Virtual address space size and resident set size, in kilobytes
Page fault per second rate for the process over a given interval

When the command is run in wide mode, the following data is added:

Input and output blocks per second --I'm still not sure what this actually counts
Characters transferred by read and write calls
System call per second rate over this interval
Voluntary context switches, where the process slept
Involuntary context switches where the process was interrupted by higher priority work or exceeded its time slice
Milliseconds per slice -- the calculated average amount of CPU time consumed between each context switch

Process class implementation overhead
It's quite hard to handle large amounts of dynamic data in SE. In the end I used a very crude approach based on an array of pointers indexed by process ID (i.e. 128 kilobytes of memory) with malloced data structures to hold the information. A problem with this is that after collecting all the data, the class does a sweep through the array looking for dead processes. This adds some CPU load, but it's not that bad and doesn't increase as you add more processes. On my 85-megahertz microSPARC with Solaris 2.6 pea.se uses 15 percent of the CPU at a 10-second interval (i.e. 1.5 seconds per invocation). On the 300-megahertz UltraSPARC with Solaris 2.5.1 pea.se uses three percent (i.e. 0.3 seconds per invocation). In both cases about 80 processes were being monitored. Since then, Richard Pettit and I have decided that SE needs better ways to handle dynamic data and pointer handling, so Richard is working on extensions to the language. I'm going to rewrite process_class.se to be far smaller and more efficient. The code will be more like standard C as well.

A read of the usage data itself turns on microstate accounting for that process. This increases the overhead for each system call. To measure the overhead, I put a C-shell into a while loop and watched the systemwide system call rate. I then ran the shell with microstate accounting enabled for that process. The call rate reduced from 110,000 system calls per second to 98,000 system calls per second. Both these rates are far higher than normal, and are measured using a single 300-megahertz UltraSPARC with Solaris 2.5.1. That puts the worst-case overhead at about 10 percent for system call intensive processes. Another way of looking at it is that it adds about one microsecond to each system call. In normal use I doubt that the overhead is measurable.

Workload-based summarization
When you have a lot of processes, you want to group them together to make it more manageable. If you group by user name and command you can form workloads, which are a very powerful way to view the system. I have also built a workload class that sits on top of the process class. It pattern matches on user name, command, and arguments. It can work on a first-fit basis, where each process is included only in the first workload that matches. It can also work on a summary basis, where each process is included in every workload that matches. The code is quite simple, 160 lines or so, and by default it allows up to 10 workloads to be specified. SE includes a neat regular expression pattern match comparison operator "string =~ expression", but this could be translated to C using the regexp library routines. The workload_class.se file is provided in the tar bundle along with the process_class.se file.

Test program for workload class -- pw.se
The challenge is how to specify workloads. It would be nice to have a GUI, but to get me started I resorted to my old favorite of using environment variables. The first variable is PW_COUNT, the number of workloads. This is then followed by PW_CMD_n, PW_ARGS_n, and PW_USER_n, where n is from 0 to PW_COUNT -1. If no pattern is provided, it automatically matches anything. Running pw.se with nothing specified gives you all processes accumulated into a single catch-all workload. The size value is accumulated as it is related to the total swap space usage for the workload. The rss value is not, as too much memory is shared for the result to have any meaning.

12:46:54 nproc 31  newproc 0  deadproc 0
wk  command    args    user procs   usr%   sys% wait% chld%   size   pf
 0                             31    2.2    0.7   0.2   0.0 112176    0
 1                              0    0.0    0.0   0.0   0.0      0    0
 2                              0    0.0    0.0   0.0   0.0      0    0
 3                              0    0.0    0.0   0.0   0.0      0    0
 4                              0    0.0    0.0   0.0   0.0      0    0
 5                              0    0.0    0.0   0.0   0.0      0    0
 6                              0    0.0    0.0   0.0   0.0      0    0
 7                              0    0.0    0.0   0.0   0.0      0    0
 8                              0    0.0    0.0   0.0   0.0      0    0
 9                              0    0.0    0.0   0.0   0.0      0    0

To make life easier, I built a small script that sets up a workload suitable for monitoring a desktop workstation that is also running a Netscape Web server:

% more pw.sh
#!/bin/csh

setenv PW_CMD_0 ns-httpd
setenv PW_CMD_1 'se.sparc'
setenv PW_CMD_2 'dtmail'
setenv PW_CMD_3 'dt'
setenv PW_CMD_4 'roam'
setenv PW_CMD_5 'netscape'
setenv PW_CMD_6 'X'
setenv PW_USER_7 'adrianc'
setenv PW_USER_8 'root'
setenv PW_COUNT 10
exec /opt/RICHPse/bin/se -DWIDE pw.se 60

This runs with a one-minute update rate and uses the wide mode by default. It's useful to use this information to note that a particular workload that has a high wait% is either being starved of memory (waiting for page faults) or of CPU power. A high number of page faults for a workload would indicate that it's either starting many new processes, doing a lot of filesystem I/O, or short of memory.

12:53:06 nproc 85  newproc 2  deadproc 0
wk  command    args    user count   usr%   sys% wait% chld%   size   pf  inblk outblk chario   sysc   vctx   ictx   msps
 0 ns-httpd                     2    0.0    0.0   0.0   0.0  17736    0      0      0      6      1      0      0   0.00
 1 se.sparc                     1    0.6    0.0   0.0   0.0   2120    0      0      0     44     10      1      0   6.42
 2   dtmail                     0    0.0    0.0   0.0   0.0      0    0      0      0      0      0      0      0   0.00
 3       dt                     6    0.0    0.0   0.0   0.0  20656    0      0      0     95      3      0      0   0.00
 4     roam                     0    0.0    0.0   0.0   0.0      0    0      0      0      0      0      0      0   0.00
 5 netscape                     0    0.0    0.0   0.0   0.0      0    0      0      0      0      0      0      0   0.00
 6        X                     2    0.4    0.3   0.0   0.0 151032    0      0      0   2071    166     14      0   0.49
 7                  adrianc    27    0.1    0.0   0.0   0.0  83840    0      0      0    652     59      3      0   0.42
 8                     root    41    0.6    0.1   0.1   0.5  70640    0      0      0   3583     66      4      0   1.85
 9                              4    1.2    0.0   0.3   0.0   4216    0      0      0    138   3016      1      0  11.94

Wrap up
After whining about the lack of use that microstate accounting data was getting for several years, I finally spent just a few days writing this code. It's not yet as efficient as I'd like, and it's probably a bit buggy, but it seems to open up another very useful window on what is going on inside a system. You can download a tar file from the regular SE3.0 download page that contains workload_class.se and process_class.se, pea.se and pw.se, a new version of the proc.se header file and the pw.sh script. When you untar it as root, it automatically puts the SE files in the /opt/RICHPse directory, and it puts pw.sh in your current directory.

New book update
You should be able to get my new book in the shops this month. The title is Sun Performance and Tuning -- Java and the Internet, by Adrian Cockcroft and Richard Pettit, Sun Press/PTR Prentice Hall, ISBN 0-13-095249-4. At the time of writing the book, I had written the process class, so pea.se is described, but I had not written the workload class, so pw.se is not in the book.

Click on our Sponsors to help Support SunWorld

Resources

The SE3.0 Toolkit page http://www.sun.com/sun-on-net/performance/se3
SE Toolkit FAQ, Adrian's January 1998 SunWorld Performance Q&A column http://www.sun.com/sunworldonline/swol-01-1998/swol-01-perf.html
"Sizing up memory in Solaris," March 1998 SunWorld Performance Q&A column http://www.sun.com/sunworldonline/swol-03-1998/swol-03-perf.html
"What's the best way to probe processes?" August 1996 SunWorld Performance Q&A column http://www.sun.com/sunworldonline/swol-08-1996/swol-08-perf.html
See Adrian Cockcroft's frequently asked questions for answers to three dozen performance-related questions. Subjects covered include performance monitoring commands, tuning variables, logins and processes, how to interpret the output of performance measurements, and how to optimize Web servers and news servers. http://www.sun.com/sunworldonline/common/cockcroft.letters.html
virtual_adrian.se rule http://www.sun.com/951001/columns/adrian/column2.html
Interested in Web server performance? Go to SunWorld's Site Index http://www.sun.com/sunworldonline/common/swol-siteindex.html#webperf
If you want to build performance tools and utilities, get a copy of the SE Performance Toolkit Version 2.5.0.2 http://www.sun.com/960601/columns/adrian/se2.5.html
Adrian Cockcroft's profile (complete with low- and high-bandwidth bios) http://www.sun.com/950901/columns/adrian/adrian.html
A full listing of Adrian Cockcroft's other Performance Q&A columns in SunWorld http://www.sun.com/sunworldonline/common/swol-backissues-columns.html#perf

Other Cockcroft columns at www.sun.com

"New Release of the SE Performance Toolkit" http://www.sun.com/960301/columns/adrian/column7.html
"Solaris 2.5 Performance Update" http://www.sun.com/960201/columns/adrian/
"Confessions of an Ultra 1 User" http://www.sun.com/951107/columns/adrian/column3.html
"Advanced Monitoring and Tuning" http://www.sun.com/951001/columns/adrian/column2.html
"System Performance Monitoring" http://www.sun.com/950901/columns/adrian/column1.html

About the author
Adrian Cockcroft joined Sun Microsystems in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at adrian.cockcroft@sunworld.com.

If you have technical problems with this magazine, contact webmaster@sunworld.com

URL: http://www.sunworld.com/swol-04-1998/swol-04-perf.html
Last modified:

Comments:
Name:
Email:
Company Name: