Prying into processes and workloads
Presenting a new way to look at per-process information, techniques for summarizing processes into workloads, and analysis of how (in)accurate CPU usage measures really are
Need to move beyond the standard tools used for monitoring and measuring how your systems are functioning? Performance Q&A columnist Adrian Cockcroft unveils his extension to the SE toolkit, a process class designed to help you plunge further into system processes. Look here for several new ways to collect and display data. (5,700 words)
How can I tell which processes are causing problems and which ones are stuck in a bottleneck?
A significant amount of data is available that is not shown by the ps command. In addition, there are more clever ways to process and display data than top or proctool use. A new extension to the SE toolkit implements some of my ideas in this area. Along the way it becomes clear that the CPU usage measurements everyone relies on are somewhat inaccurate.
Process data sources
I described process data sources in my August 1996 Performance Q&A column, but this time I'll go a step further with the data. These data structures are described in full in the proc(4) manual page. They are also available in the SE toolkit, so if you want to obtain the data and play around with it, you should look at the code for ps-ax.se and msacct.se.
The interface to /proc involves sending ioctl commands or opening special pseudo-files and reading them (a new feature of Solaris 2.6). The data that ps uses is called PIOCPSINFO. Here's what you get back from ioctl (you get slightly different data if you read it from the pseudo-file):
proc(4)                     File Formats                     proc(4)

PIOCPSINFO
     This returns miscellaneous process information such as that
     reported by ps(1).  p is a pointer to a prpsinfo structure
     containing at least the following fields:

     typedef struct prpsinfo {
       char        pr_state;           /* numeric process state (see pr_sname) */
       char        pr_sname;           /* printable character representing pr_state */
       char        pr_zomb;            /* !=0: process terminated but not waited for */
       char        pr_nice;            /* nice for cpu usage */
       u_long      pr_flag;            /* process flags */
       int         pr_wstat;           /* if zombie, the wait() status */
       uid_t       pr_uid;             /* real user id */
       uid_t       pr_euid;            /* effective user id */
       gid_t       pr_gid;             /* real group id */
       gid_t       pr_egid;            /* effective group id */
       pid_t       pr_pid;             /* process id */
       pid_t       pr_ppid;            /* process id of parent */
       pid_t       pr_pgrp;            /* pid of process group leader */
       pid_t       pr_sid;             /* session id */
       caddr_t     pr_addr;            /* physical address of process */
       long        pr_size;            /* size of process image in pages */
       long        pr_rssize;          /* resident set size in pages */
       u_long      pr_bysize;          /* size of process image in bytes */
       u_long      pr_byrssize;        /* resident set size in bytes */
       caddr_t     pr_wchan;           /* wait addr for sleeping process */
       short       pr_syscall;         /* system call number (if in syscall) */
       id_t        pr_aslwpid;         /* lwp id of the aslwp; zero if no aslwp */
       timestruc_t pr_start;           /* process start time, sec+nsec since epoch */
       timestruc_t pr_time;            /* usr+sys cpu time for this process */
       timestruc_t pr_ctime;           /* usr+sys cpu time for reaped children */
       long        pr_pri;             /* priority, high value is high priority */
       char        pr_oldpri;          /* pre-SVR4, low value is high priority */
       char        pr_cpu;             /* pre-SVR4, cpu usage for scheduling */
       u_short     pr_pctcpu;          /* % of recent cpu time, one or all lwps */
       u_short     pr_pctmem;          /* % of system memory used by the process */
       dev_t       pr_ttydev;          /* controlling tty device (PRNODEV if none) */
       char        pr_clname[PRCLSZ];  /* scheduling class name */
       char        pr_fname[PRFNSZ];   /* last component of exec()ed pathname */
       char        pr_psargs[PRARGSZ]; /* initial characters of arg list */
       int         pr_argc;            /* initial argument count */
       char        **pr_argv;          /* initial argument vector */
       char        **pr_envp;          /* initial environment vector */
     } prpsinfo_t;
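For illustration, reading this structure yourself takes only a few lines of C. The fragment below is a rough sketch of the pre-Solaris 2.6 ioctl method -- open /proc/<pid> and ask for the ps-style data. It is not part of the SE toolkit, and error handling is cut to the bone.

/* psinfo.c -- print a one-line ps-style summary for one pid (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <stropts.h>        /* ioctl() */
#include <sys/types.h>
#include <sys/procfs.h>     /* prpsinfo_t and the PIOC ioctl codes */

int main(int argc, char **argv)
{
    char path[64];
    int fd;
    prpsinfo_t psinfo;

    if (argc != 2) {
        fprintf(stderr, "usage: psinfo pid\n");
        exit(1);
    }
    sprintf(path, "/proc/%s", argv[1]);
    if ((fd = open(path, O_RDONLY)) == -1) {
        perror(path);
        exit(1);
    }
    if (ioctl(fd, PIOCPSINFO, &psinfo) == -1) {
        perror("PIOCPSINFO");
        exit(1);
    }
    /* pid, command name, resident set size in kilobytes, argument list */
    printf("%5ld %-8.8s %6luK %s\n", (long)psinfo.pr_pid, psinfo.pr_fname,
        psinfo.pr_byrssize / 1024, psinfo.pr_psargs);
    close(fd);
    return 0;
}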
You can get the data for each lightweight process of a multithreaded process separately. While there's a lot of useful-looking information there, there's no sign of the high-resolution microstate accounting that /usr/proc/bin/ptime (and msacct.se) display. They use a separate ioctl, PIOCUSAGE:
proc(4)                     File Formats                     proc(4)

PIOCUSAGE
     When applied to the process file descriptor, PIOCUSAGE returns
     the process usage information; when applied to an lwp file
     descriptor, it returns usage information for the specific lwp.
     p points to a prusage structure which is filled by the
     operation.  The prusage structure contains at least the
     following fields:

     typedef struct prusage {
       id_t        pr_lwpid;           /* lwp id.  0: process or defunct */
       u_long      pr_count;           /* number of contributing lwps */
       timestruc_t pr_tstamp;          /* current time stamp */
       timestruc_t pr_create;          /* process/lwp creation time stamp */
       timestruc_t pr_term;            /* process/lwp termination time stamp */
       timestruc_t pr_rtime;           /* total lwp real (elapsed) time */
       timestruc_t pr_utime;           /* user level CPU time */
       timestruc_t pr_stime;           /* system call CPU time */
       timestruc_t pr_ttime;           /* other system trap CPU time */
       timestruc_t pr_tftime;          /* text page fault sleep time */
       timestruc_t pr_dftime;          /* data page fault sleep time */
       timestruc_t pr_kftime;          /* kernel page fault sleep time */
       timestruc_t pr_ltime;           /* user lock wait sleep time */
       timestruc_t pr_slptime;         /* all other sleep time */
       timestruc_t pr_wtime;           /* wait-cpu (latency) time */
       timestruc_t pr_stoptime;        /* stopped time */
       u_long      pr_minf;            /* minor page faults */
       u_long      pr_majf;            /* major page faults */
       u_long      pr_nswap;           /* swaps */
       u_long      pr_inblk;           /* input blocks */
       u_long      pr_oublk;           /* output blocks */
       u_long      pr_msnd;            /* messages sent */
       u_long      pr_mrcv;            /* messages received */
       u_long      pr_sigs;            /* signals received */
       u_long      pr_vctx;            /* voluntary context switches */
       u_long      pr_ictx;            /* involuntary context switches */
       u_long      pr_sysc;            /* system calls */
       u_long      pr_ioch;            /* chars read and written */
     } prusage_t;

     PIOCUSAGE can be applied to a zombie process (see PIOCPSINFO).

     Applying PIOCUSAGE to a process that does not have microstate
     accounting enabled will enable microstate accounting and return
     an estimate of times spent in the various states up to this
     point.  Further invocations of PIOCUSAGE will yield accurate
     microstate time accounting from this point.  To disable
     microstate accounting, use PIOCRESET with the PR_MSACCT flag.
You'll find a lot of useful data here. The time spent waiting for various events is a key measure. I summarize it in msacct.se as follows:
Elapsed time         3:20:50.049   Current time  Fri Jul 26 12:49:28 1996
User CPU time           2:11.723   System call time        1:54.890
System trap time           0.006   Text pfault sleep          0.000
Data pfault sleep          0.023   Kernel pfault sleep        0.000
User lock sleep            0.000   Other sleep time     3:16:43.022
Wait for CPU time          0.382   Stopped time               0.000
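Getting at the raw data behind this summary is also straightforward. The following sketch (again not toolkit code, just an illustration) issues PIOCUSAGE against an open /proc file descriptor and converts a few of the timestruc_t fields to seconds. As the manual page notes, the first such read also switches microstate accounting on for the target process.

/* usage.c -- print a few microstate times for one pid (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <stropts.h>        /* ioctl() */
#include <sys/types.h>
#include <sys/procfs.h>     /* prusage_t, PIOCUSAGE */

static double secs(timestruc_t t)   /* timestruc_t -> floating point seconds */
{
    return t.tv_sec + t.tv_nsec / 1000000000.0;
}

int main(int argc, char **argv)
{
    char path[64];
    int fd;
    prusage_t u;

    if (argc != 2) {
        fprintf(stderr, "usage: usage pid\n");
        exit(1);
    }
    sprintf(path, "/proc/%s", argv[1]);
    if ((fd = open(path, O_RDONLY)) == -1 || ioctl(fd, PIOCUSAGE, &u) == -1) {
        perror(path);
        exit(1);
    }
    printf("user %.3fs  system %.3fs  other sleep %.3fs  wait for CPU %.3fs\n",
        secs(u.pr_utime), secs(u.pr_stime), secs(u.pr_slptime), secs(u.pr_wtime));
    close(fd);
    return 0;
}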
Microstate accounting
Microstate accounting is not turned on by default because it slows the system down very slightly. It was on by default up to Solaris 2.3; from Solaris 2.4 onward it is enabled the first time you read the data. CPU time is normally measured by sampling: 100 times per second, the clock interrupt records the state of every CPU. Microstate accounting, on the other hand, takes a high-resolution timestamp on every state change, every system call, every page fault, and every scheduler change. It doesn't miss anything, and the results are much more accurate than those from sampled measurements. The normal sampled measures of CPU user and system time can be off by 20 percent or more because the sample is biased, not random. Process scheduling uses the same clock interrupt that measures CPU usage, and that coupling leads to systematic errors in the sampled data. The microstate-measured CPU usage does not suffer from such errors.
For example, consider a performance monitor that wakes up every 10 seconds, reads some data from the kernel, then prints the results and sleeps. On a fast system, the total CPU time consumed per wake-up might be a few milliseconds. On exit from the clock interrupt, the scheduler wakes up processes and kernel threads that have been sleeping. Processes that sleep consume less than their allotted CPU time-quanta and always run at the highest timeshare priority.
On a lightly loaded system there is no queue for access to the CPU, so immediately after the clock interrupt, it's likely that the performance monitor will be scheduled. If it runs for less than 10 milliseconds it will have completed its task and be sleeping again by the time the next clock interrupt comes along. Now, given that CPU time is allocated based on what is running when the clock interrupt occurs, you can see that the performance monitor could be sneaking a bite of CPU time whenever the clock interrupt isn't looking. This is an artifact of the dual functions of the clock interrupt -- if two independent unsynchronized interrupts were used, one for scheduling and one for performance measurement, the errors would be averaged away over time.
Another approach to the problem is to sample more frequently by running the clock interrupt more often. This does not remove the bias, but it makes it harder to hide small bites of the CPU. Splitting scheduling and measurement into two separate, unsynchronized interrupts is not worth the overhead, and while it is possible to raise the clock interrupt rate for the sake of more accurate measurements, that too costs more than direct microstate measurement. In any case, microstate measurement is far more useful and accurate, as it measures more interesting state transitions. When there is a significant amount of queuing for CPU time, the performance monitor is delayed by a random amount, so the clock interrupt does see it some of the time.
As a simple experiment, I ran vmstat with a one-second interval and redirected its output to /dev/null so that it would not be delayed by display or filesystem operations. The ptime command uses microstate accounting to measure the CPU time used accurately. I left this running for a long time on a pair of fairly idle systems, and I modified the ps-p.se script to show CPU time down to the 100-hertz resolution of the underlying measurements. After a few minutes the process had accumulated only 15 ticks of CPU time on an 85-megahertz microSPARC, and only one tick of CPU time on a dual 300-megahertz UltraSPARC. After an hour the number of ticks had not increased at all! (The error is worse on a quieter system with a faster CPU, as it is easier to sneak a bite of CPU time without the clock noticing.)
Using microstate accounting to measure the same processes, however, showed that about 4.8 seconds of CPU time had actually been used on the 85-megahertz microSPARC and 1.2 seconds on the 300-megahertz UltraSPARC. This is an extreme case, but the fact remains that the actual CPU usage is far more than is being reported by the normal mechanism. Since the tick count stops increasing while the real usage keeps growing, the relative error is unbounded; the longer I let this run, the larger it gets:
micro85% /usr/proc/bin/ptime vmstat 1 >/dev/null
^C
real     1:03:26.115
user           2.913
sys            1.891

ultra300% /usr/proc/bin/ptime vmstat 1 >/dev/null
^C
real     1:02:01.626
user           0.621
sys            0.555
Just before stopping vmstat, I ran my modified ps-p.se on it on the two systems:
micro85% se ps-p.se 6513
  PID TT      S     TIME COMMAND
 6513 pts/1   S  0:00.16 vmstat 1

ultra300% se ps-p.se 21560
  PID TT      S     TIME COMMAND
21560 pts/3   S  0:00.01 vmstat 1
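The effect is easy to reproduce deliberately. The fragment below sketches a "tick dodger" -- assuming a 100-hertz clock and a tick-granularity usleep -- that burns a few milliseconds of CPU right after each wakeup and is asleep again before the next clock interrupt samples it, so the sampled accounting stays near zero while ptime reports the real usage.

/* dodger.c -- consume CPU between clock ticks (illustrative sketch) */
#include <unistd.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;

    gettimeofday(&tv, (void *)0);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(void)
{
    for (;;) {
        double start = now();

        while (now() - start < 0.003)
            ;               /* spin for about 3 milliseconds */
        usleep(1);          /* sleep; wake just after the next 10 ms tick */
    }
    /* NOTREACHED -- compare ps against /usr/proc/bin/ptime on this pid */
}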
What is needed is a way to monitor process CPU usage more accurately, with more convenient commands than ptime and msacct.se. I decided to extend the SE toolkit with a process class, process_class.se, that could be reused by several commands and would provide the information I really want about each process on a system. I've tried to get performance-tool vendors interested in microstate data without any success. Hopefully, offering an example of how to get and use this information will generate increased user demand for this kind of tool.
The process class
The basic requirement for the process class was that it should collect both the psinfo and usage data for every process on the system. For consistency, all data should be collected at once, and as quickly as possible, then offered for display one process at a time. This avoids the problem inherent in the ps command, where the data for the last process displayed is measured after all the other processes have been measured and displayed, so the data is not associated with a consistent timestamp.
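In plain C the snapshot approach looks roughly like the sketch below: gather the data for every process first, as quickly as possible, and only then spend time formatting it. (The real class is written in SE and also collects the usage data and computes deltas; this is just the skeleton of the idea.)

/* snapshot.c -- two-pass process scan (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <dirent.h>
#include <stropts.h>        /* ioctl() */
#include <sys/types.h>
#include <sys/procfs.h>

#define MAXSNAP 4096

int main(void)
{
    static prpsinfo_t snap[MAXSNAP];
    DIR *dp = opendir("/proc");
    struct dirent *de;
    int n = 0, i;

    /* pass 1: read psinfo for every process we can open, as fast as possible */
    while (n < MAXSNAP && (de = readdir(dp)) != (struct dirent *)0) {
        char path[64];
        int fd;

        if (de->d_name[0] == '.')
            continue;                       /* skip "." and ".." */
        sprintf(path, "/proc/%s", de->d_name);
        if ((fd = open(path, O_RDONLY)) == -1)
            continue;                       /* no permission, or it just exited */
        if (ioctl(fd, PIOCPSINFO, &snap[n]) != -1)
            n++;
        close(fd);
    }
    closedir(dp);

    /* pass 2: display; every entry now refers to (nearly) the same instant */
    for (i = 0; i < n; i++)
        printf("%5ld %s\n", (long)snap[i].pr_pid, snap[i].pr_psargs);
    return 0;
}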
The psinfo data contains a measure of recent average CPU usage, but I really want all the data measured over the time interval since the last reading. This gets complex as new processes arrive and old ones die. Matching up all the data is not as trivial as measuring the performance deltas for the CPUs or disks in the system. There can also be up to 32000 processes to keep track of.
The resulting code is quite complex, but it does the job, and all the complexity is hidden in the class code in /opt/RICHPse/include/process_class.se. Note that this is not part of SE3.0, as I wrote it after that release. It is provided as a tar file that can be loaded over the top of an SE3.0 installation, and an improved version will be included in the next release of SE.
Much of the data is left as a difference over the interval. To calculate rates, it can be divided by the interval that is provided as part of the data and that is accurately measured as the difference in time for each process separately. If the collection process is delayed on a busy system by other processing, the measurements are still accurate. I'll discuss the data in detail next.
Control Entries

/* codes for action$ */
#define PROC_ACTION_INIT        0   /* starting point -> next index */
#define PROC_ACTION_PID         1   /* get the specified pid */
#define PROC_ACTION_NEXT_INDEX  2   /* index order is based on /proc */
#define PROC_ACTION_NEXT_PID    3   /* search for pid and return data */

class proc_class_t {
  /* input controls */
  int index$;    /* always contains current index or -1 */
  int pid$;      /* always contains current pid */
  int action$;
It is a convention in SE that control variables for a class have a $ sign attached to them. When process data is read, it is returned in the order that /proc provides entries, not in order of process ID. The index$ entry counts through the data in this order. When all processes have been read, it returns -1. This is a sign to the calling program that it should sleep a while before reading any more data. On the next read, all the process data is captured, then data for the first process is returned. By default, subsequent reads return data in index order. The pid$ entry is always updated to contain the process ID.
The action$ entry controls the automatic behavior of the class. It starts off by initializing the class and changes itself to PROC_ACTION_NEXT_INDEX. If you would rather get data for a particular pid, you can set action$ to PROC_ACTION_PID and set pid$ to specify which one. If you want data to be returned in order of increasing pid, you set action$ to PROC_ACTION_NEXT_PID. This mode is less efficient in this implementation.
Summary Data

  /* summary totals */
  double lasttime;   /* timestamp for the end of the last update */
  int    nproc;      /* current number of processes */
  int    newproc;    /* number of new processes this time */
  int    deadproc;   /* number of processes that died */
The timestamp indicates that all process data was collected before that time. The current number of processes and the number of new ones are easy to understand; the handling of dead processes is a bit odd. Rather than being ignored, dead processes are provided after all current processes are reported. This allows a last chance to see what the process did before it died, as the data for that process is erased once it is reported for the last time. It would be nice to report the process accounting record, but that does not include the pid. Also processes may have come and gone completely between samples. These will show up as child CPU activity below.
Per-Process Data

  /* output data for specified process */
  double interval;        /* measured time interval for averages */
  double timestamp;       /* last time process was measured */
  double creation;        /* process start time */
  double termination;     /* process termination time stamp */
  double elapsed;         /* elapsed time for all lwps in process */
  double total_user;      /* current totals in seconds */
  double total_system;
  double total_child;     /* child processes that have exited */
  double user_time;       /* user time in this interval */
  double system_time;     /* system call time in this interval */
  double trap_time;       /* system trap time in interval */
  double child_time;      /* child CPU in this interval */
  double text_pf_time;    /* text page fault wait in interval */
  double data_pf_time;    /* data page fault wait in interval */
  double kernel_pf_time;  /* kernel page fault wait in interval */
  double user_lock_time;  /* user lock wait in interval */
  double sleep_time;      /* all other sleep time */
  double cpu_wait_time;   /* time on runqueue waiting for CPU */
  double stoptime;        /* time stopped from ^Z */
  ulong  syscalls;        /* syscall/interval for this process */
  ulong  inblocks;        /* input blocks/interval - metadata only - not interesting */
  ulong  outblocks;       /* output blocks/interval - metadata only - not interesting */
  ulong  vmem_size;       /* size in KB */
  ulong  rmem_size;       /* RSS in KB */
#ifdef XMAP               /* XMAP not yet implemented */
  ulong  pmem_size;       /* private mem in KB */
  ulong  smem_size;       /* shared mem in KB */
#endif
  ulong  maj_faults;      /* majf/interval */
  ulong  min_faults;      /* minf/interval - always zero - bug? */
  ulong  total_swaps;     /* swapout count */
  long   priority;        /* current sched priority */
  long   niceness;        /* current nice value */
  char   sched_class[PRCLSZ];  /* name of class */
  ulong  messages;        /* msgin+msgout/interval */
  ulong  signals;         /* signals/interval */
  ulong  vcontexts;       /* voluntary context switches/interval */
  ulong  icontexts;       /* involuntary context switches/interval */
  ulong  charios;         /* characters in and out/interval */
  ulong  lwp_count;       /* number of lwps for the process */
  int    uid;             /* current uid */
  long   ppid;            /* parent pid */
  char   fname[PRFNSZ];   /* last component of exec'd pathname */
  char   args[PRARGSZ];   /* initial part of command name and arg list */

  proc$() {
    /* lots of complex code and data hides in here */
  }
A future version of this class will also include the extended memory information described in last month's column (see Resources below) and shown above as #ifdef XMAP. Most of the above data is self-explanatory. All times are in seconds, in double precision, with microsecond accuracy. The minor fault counter seems to be broken, as it always reports zero. The inblock and outblock counters are uninteresting, as they only refer to filesystem metadata for the old-style buffer cache. The charios counter includes all read and write data for all file descriptors, so you can see the file I/O rate. The lwp_count is not the current number of lwps; it is a count of how many lwps the process has ever had. If the number is more than one, the process is multithreaded. It's possible to access each lwp in turn and read its psinfo and usage data. The process data is the sum of these.
Child data is accumulated when a child process exits. The CPU used by the child is added into the data for the parent. This can be used to find processes that are forking lots of little short-lived commands.
Data access permissions
To get at process data you must have access permissions for the entries in /proc, or the command must run setuid to root. In Solaris 2.5.1, using the ioctl access method for /proc, you can only access processes that you own unless you are logged in as root. In Solaris 2.6, although you cannot access the /proc/pid entry for every process, you can read /proc/pid/psinfo and /proc/pid/usage for every process. This means that the full functionality of ps and the process class can be employed by any user. The code for process_class.se conditionally uses the new Solaris 2.6 access method and the slightly changed definition of the psinfo data structure.
% ls -l /proc/3209
total 2217
-rw-------   1 adrianc  9506     1118208 Mar  5 22:39 as
-r--------   1 adrianc  9506         152 Mar  5 22:39 auxv
-r--------   1 adrianc  9506          36 Mar  5 22:39 cred
--w-------   1 adrianc  9506           0 Mar  5 22:39 ctl
lr-x------   1 adrianc  9506           0 Mar  5 22:39 cwd -> /
dr-x------   2 adrianc  9506         416 Mar  5 22:39 fd/
-r--r--r--   1 adrianc  9506         120 Mar  5 22:39 lpsinfo
-r--------   1 adrianc  9506         912 Mar  5 22:39 lstatus
-r--r--r--   1 adrianc  9506         536 Mar  5 22:39 lusage
dr-xr-xr-x   3 adrianc  9506          48 Mar  5 22:39 lwp/
-r--------   1 adrianc  9506        1440 Mar  5 22:39 map
dr-x------   2 adrianc  9506         288 Mar  5 22:39 object/
-r--------   1 adrianc  9506        1808 Mar  5 22:39 pagedata
-r--r--r--   1 adrianc  9506         336 Mar  5 22:39 psinfo
-r--------   1 adrianc  9506        1440 Mar  5 22:39 rmap
lr-x------   1 adrianc  9506           0 Mar  5 22:39 root -> /
-r--------   1 adrianc  9506        1440 Mar  5 22:39 sigact
-r--------   1 adrianc  9506        1232 Mar  5 22:39 status
-r--r--r--   1 adrianc  9506         256 Mar  5 22:39 usage
-r--------   1 adrianc  9506           0 Mar  5 22:39 watch
-r--------   1 adrianc  9506        2280 Mar  5 22:39 xmap
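With the Solaris 2.6 pseudo-files, collecting the same data needs nothing more than open and read, so any user can do it. A rough sketch, using the 2.6 psinfo_t and prusage_t structures from <procfs.h> (my illustration, not toolkit code):

/* readproc.c -- Solaris 2.6 pseudo-file access (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <procfs.h>         /* Solaris 2.6 psinfo_t and prusage_t */

int main(int argc, char **argv)
{
    char path[80];
    int fd;
    psinfo_t ps;
    prusage_t u;

    if (argc != 2) {
        fprintf(stderr, "usage: readproc pid\n");
        exit(1);
    }

    /* psinfo and usage are world-readable, unlike the rest of /proc/<pid> */
    sprintf(path, "/proc/%s/psinfo", argv[1]);
    if ((fd = open(path, O_RDONLY)) == -1 ||
        read(fd, &ps, sizeof(ps)) != sizeof(ps)) {
        perror(path);
        exit(1);
    }
    close(fd);

    sprintf(path, "/proc/%s/usage", argv[1]);
    if ((fd = open(path, O_RDONLY)) == -1 ||
        read(fd, &u, sizeof(u)) != sizeof(u)) {
        perror(path);
        exit(1);
    }
    close(fd);

    printf("%s: %lu lwps, %ld.%03ld seconds of user CPU\n", ps.pr_fname,
        (unsigned long)u.pr_count, (long)u.pr_utime.tv_sec,
        (long)u.pr_utime.tv_nsec / 1000000);
    return 0;
}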
The pea.se script
Usage: se [-DWIDE] pea.se [interval]
The pea.se script is an extended process monitor that acts as a test program for process_class.se and displays very useful information that is not extracted by any standard tool. It is based on the microstate accounting information described above. The script runs continuously and reports the average data for each active process over the measured interval. This is very different from tools such as top or ps, which print only the current data. There are two display modes: by default pea.se fits into an 80-column format, but the wide mode has much more information. The initial display includes all processes and shows their average data since each process was created. Any new processes that appear are also treated this way. When a process is measured a second time, its averages for the measured interval are displayed if it has consumed any CPU time. Idle processes are ignored.
The output is generated every 10 seconds by default. The script can report only on processes that it has permission to access, so it must be run as root to see everything on Solaris 2.5.1. As described above, it sees everything on Solaris 2.6 without needing root permissions.
% se pea.se
09:34:06 name   lwp   pid  ppid   uid  usr%  sys% wait% chld%  size   rss   pf
olwm             1    322   299  9506  0.01  0.01  0.03  0.00  2328  1032  0.0
maker5X.exe      1  21508     1  9506  0.55  0.33  0.04  0.00 29696 19000  0.0
perfmeter        1    348     1  9506  0.04  0.02  0.00  0.00  3776  1040  0.0
cmdtool          1    351     1  9506  0.01  0.00  0.03  0.00  3616   960  0.0
cmdtool          1  22815   322  9506  0.08  0.03  2.28  0.00  3616  1552  2.2
xterm            1  22011  9180  9506  0.04  0.03  0.30  0.00  2840  1000  0.0
se.sparc.5.5.1   1  23089 22818  9506  1.92  0.07  0.00  0.00  1744  1608  0.0
fa.htmllite      1  21559     1  9506  0.00  0.00  0.00  0.00  1832    88  0.0
fa.tooltalk      1  21574     1  9506  0.00  0.00  0.00  0.00  2904  1208  0.0
nproc 31 newproc 0 deadproc 0
% se -DWIDE pea.se
09:34:51 name   lwp   pid  ppid   uid  usr%  sys% wait% chld%  size   rss  pf inblk outblk chario sysc  vctx  ictx  msps
maker5X.exe      1  21508     1  9506  0.86  0.36  0.10  0.00 29696 19088 0.0  0.00   0.00   5811  380 60.03  0.30  0.20
perfmeter        1    348     1  9506  0.03  0.02  0.00  0.00  3776  1040 0.0  0.00   0.00    263   12  1.39  0.20  0.29
cmdtool          1  22815   322  9506  0.04  0.00  0.04  0.00  3624  1928 0.0  0.00   0.00    229    2  0.20  0.30  0.96
se.sparc.5.5.1   1   3792   341  9506  0.12  0.01  0.00  0.00  9832  3376 0.0  0.00   0.00      2    9  0.20  0.10  4.55
se.sparc.5.5.1   1  23097 22818  9506  0.75  0.06  0.00  0.00  1752  1616 0.0  0.00   0.00    119   19  0.10  0.30 20.45
fa.htmllite      1  21559     1  9506  0.00  0.00  0.00  0.00  1832    88 0.0  0.00   0.00      0    0  0.10  0.00  0.06
nproc 31 newproc 0 deadproc 0
The pea.se script is 90 lines of code, a few simple printfs in a loop. The real work is done in process_class.se (over 500 lines of code), which can be used by any other script. The default data shown by pea.se consists of the process name, the lwp count, the process and parent process IDs, the user ID, the percentages of the interval spent in user CPU time, system CPU time, waiting (for page faults, locks, or a CPU), and in child processes that exited, the virtual and resident memory sizes in kilobytes, and the page fault rate.
When the command is run in wide mode, the following data is added: the input and output block counts, the characters of I/O, the system call rate, the voluntary and involuntary context switch rates, and msps, the average number of milliseconds of CPU time consumed each time the process was scheduled.
Process class implementation overhead
It's quite hard to handle large amounts of dynamic data in SE. In the end I used a very crude approach based on an array of pointers indexed by process ID (i.e. 128 kilobytes of memory), with malloced data structures to hold the information. A problem with this is that after collecting all the data, the class does a sweep through the array looking for dead processes. This adds some CPU load, but it's not that bad and doesn't increase as you add more processes. On my 85-megahertz microSPARC with Solaris 2.6, pea.se uses 15 percent of the CPU at a 10-second interval (i.e. 1.5 seconds per invocation). On the 300-megahertz UltraSPARC with Solaris 2.5.1, pea.se uses three percent (i.e. 0.3 seconds per invocation). In both cases about 80 processes were being monitored. Since then, Richard Pettit and I have decided that SE needs better ways to handle dynamic data and pointer handling, so Richard is working on extensions to the language. I'm going to rewrite process_class.se to be far smaller and more efficient. The code will be more like standard C as well.
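For the curious, the bookkeeping boils down to something like the following C sketch. It is a simplification -- the real code is in SE and tracks all of the usage fields, not just user and system time -- but it shows the 32768-entry pointer array, the per-process delta, and the sweep for dead processes.

/* ptab.c -- pid-indexed delta tracking (illustrative sketch) */
#include <stdlib.h>

#define MAXPID 32768

typedef struct pstate {
    double user, system;    /* previous totals, in seconds */
    int    seen;            /* touched during the current sweep? */
} pstate_t;

static pstate_t *ptab[MAXPID];   /* 32768 pointers = 128 KB on a 32-bit system */

/* record a new measurement; return the user CPU consumed since the last one */
double update(int pid, double user, double system)
{
    pstate_t *p = ptab[pid];
    double delta;

    if (p == (pstate_t *)0) {            /* first time this pid has been seen */
        p = ptab[pid] = calloc(1, sizeof(*p));
        p->user = user;
        p->system = system;
    }
    delta = user - p->user;
    p->user = user;
    p->system = system;
    p->seen = 1;
    return delta;
}

/* after every process has been read, sweep the array for the ones that died */
void sweep(void)
{
    int pid;

    for (pid = 0; pid < MAXPID; pid++) {
        if (ptab[pid] == (pstate_t *)0)
            continue;
        if (ptab[pid]->seen) {
            ptab[pid]->seen = 0;         /* reset for the next interval */
        } else {
            free(ptab[pid]);             /* dead: report it once, then forget it */
            ptab[pid] = (pstate_t *)0;
        }
    }
}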
A read of the usage data itself turns on microstate accounting for that process, which increases the overhead of each system call. To measure the overhead, I put a C shell into a while loop and watched the systemwide system call rate. I then ran the shell with microstate accounting enabled for that process. The call rate dropped from 110,000 system calls per second to 98,000 system calls per second. Both these rates are far higher than normal, and were measured on a single 300-megahertz UltraSPARC running Solaris 2.5.1. That puts the worst-case overhead at about 10 percent for system-call-intensive processes. Another way of looking at it is that microstate accounting adds about one microsecond to each system call. In normal use I doubt that the overhead is measurable.
Workload-based summarization
When you have a lot of processes, you want to group them together to make them more manageable. If you group by user name and command you can form workloads, which are a very powerful way to view the system. I have also built a workload class that sits on top of the process class. It pattern matches on user name, command, and arguments. It can work on a first-fit basis, where each process is included only in the first workload that matches. It can also work on a summary basis, where each process is included in every workload that matches. The code is quite simple, 160 lines or so, and by default it allows up to 10 workloads to be specified. SE includes a neat regular-expression pattern match comparison operator, "string =~ expression", but this could be translated to C using the regexp library routines. The workload_class.se file is provided in the tar bundle along with the process_class.se file.
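To give a flavor of what that C translation would look like, here is a sketch of the first-fit matching using the POSIX regcomp and regexec routines. The workload_t fields and function names here are invented for the example; the real matching lives in workload_class.se.

/* workload.c -- first-fit workload matching (illustrative sketch) */
#include <regex.h>

#define NWORK 10                 /* default number of workloads */

typedef struct workload {        /* hypothetical, cut-down workload record */
    const char *cmd_pat;         /* pattern for the command name, or NULL */
    const char *user_pat;        /* pattern for the user name, or NULL */
    double usr, sys;             /* accumulated CPU percentages */
    int procs;                   /* number of processes matched */
} workload_t;

/* does the string match the pattern?  a missing pattern matches anything;
 * a real implementation would compile each pattern just once */
static int match(const char *pattern, const char *s)
{
    regex_t re;
    int rc;

    if (pattern == (const char *)0)
        return 1;
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, s, 0, (regmatch_t *)0, 0);
    regfree(&re);
    return rc == 0;
}

/* first-fit: charge the process to the first workload whose patterns match */
void classify(workload_t w[], int nw, const char *fname, const char *user,
    double usr, double sys)
{
    int i;

    for (i = 0; i < nw; i++) {
        if (match(w[i].cmd_pat, fname) && match(w[i].user_pat, user)) {
            w[i].procs++;
            w[i].usr += usr;
            w[i].sys += sys;
            return;
        }
    }
}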
Test program for workload class -- pw.se
The challenge is how to specify workloads. It would be nice to have a GUI, but to get me started I resorted to my old favorite of using environment variables. The first variable is PW_COUNT, the number of workloads. This is followed by PW_CMD_n, PW_ARGS_n, and PW_USER_n, where n runs from 0 to PW_COUNT - 1. If no pattern is provided, it automatically matches anything. Running pw.se with nothing specified gives you all processes accumulated into a single catch-all workload. The size value is accumulated, as it is related to the total swap space usage for the workload. The rss value is not, as too much memory is shared for the result to have any meaning.
12:46:54 nproc 31 newproc 0 deadproc 0
wk command   args      user     procs  usr%  sys% wait% chld%   size  pf
 0                                  31   2.2   0.7   0.2   0.0 112176   0
 1                                   0   0.0   0.0   0.0   0.0      0   0
 2                                   0   0.0   0.0   0.0   0.0      0   0
 3                                   0   0.0   0.0   0.0   0.0      0   0
 4                                   0   0.0   0.0   0.0   0.0      0   0
 5                                   0   0.0   0.0   0.0   0.0      0   0
 6                                   0   0.0   0.0   0.0   0.0      0   0
 7                                   0   0.0   0.0   0.0   0.0      0   0
 8                                   0   0.0   0.0   0.0   0.0      0   0
 9                                   0   0.0   0.0   0.0   0.0      0   0
To make life easier, I built a small script that sets up a workload suitable for monitoring a desktop workstation that is also running a Netscape Web server:
% more pw.sh
#!/bin/csh
setenv PW_CMD_0  ns-httpd
setenv PW_CMD_1  'se.sparc'
setenv PW_CMD_2  'dtmail'
setenv PW_CMD_3  'dt'
setenv PW_CMD_4  'roam'
setenv PW_CMD_5  'netscape'
setenv PW_CMD_6  'X'
setenv PW_USER_7 'adrianc'
setenv PW_USER_8 'root'
setenv PW_COUNT  10
exec /opt/RICHPse/bin/se -DWIDE pw.se 60
This runs with a one-minute update rate and uses the wide mode by default. One useful way to read the output: a workload with a high wait% is being starved either of memory (it is waiting for page faults) or of CPU power. A high page fault rate for a workload indicates that it is starting many new processes, doing a lot of filesystem I/O, or short of memory.
12:53:06 nproc 85 newproc 2 deadproc 0
wk command   args  user      count  usr%  sys% wait% chld%   size  pf inblk outblk chario  sysc  vctx  ictx   msps
 0 ns-httpd                      2   0.0   0.0   0.0   0.0  17736   0     0      0      6     1     0     0   0.00
 1 se.sparc                      1   0.6   0.0   0.0   0.0   2120   0     0      0     44    10     1     0   6.42
 2 dtmail                        0   0.0   0.0   0.0   0.0      0   0     0      0      0     0     0     0   0.00
 3 dt                            6   0.0   0.0   0.0   0.0  20656   0     0      0     95     3     0     0   0.00
 4 roam                          0   0.0   0.0   0.0   0.0      0   0     0      0      0     0     0     0   0.00
 5 netscape                      0   0.0   0.0   0.0   0.0      0   0     0      0      0     0     0     0   0.00
 6 X                             2   0.4   0.3   0.0   0.0 151032   0     0      0   2071   166    14     0   0.49
 7                 adrianc      27   0.1   0.0   0.0   0.0  83840   0     0      0    652    59     3     0   0.42
 8                 root         41   0.6   0.1   0.1   0.5  70640   0     0      0   3583    66     4     0   1.85
 9                               4   1.2   0.0   0.3   0.0   4216   0     0      0    138  3016     1     0  11.94
Wrap up
After whining for several years about how little use microstate accounting data was getting, I finally spent just a few days writing this code. It's not yet as efficient as I'd like, and it's probably a bit buggy, but it seems to open up another very useful window on what is going on inside a system. You can download a tar file from the regular SE3.0 download page that contains workload_class.se, process_class.se, pea.se, pw.se, a new version of the proc.se header file, and the pw.sh script. When you untar it as root, it automatically puts the SE files in the /opt/RICHPse directory and puts pw.sh in your current directory.
New book update
You should be able to get my new book in the shops this month. The title is Sun Performance and Tuning -- Java and the Internet, by Adrian Cockcroft and Richard Pettit, Sun Press/PTR Prentice Hall, ISBN 0-13-095249-4. At the time of writing the book, I had written the process class, so pea.se is described, but I had not written the workload class, so pw.se is not in the book.
Resources
About the author
Adrian Cockcroft joined Sun Microsystems in 1988,
and currently works as a performance specialist for the Server Division
of SMCC. He wrote
Sun Performance and Tuning: SPARC and Solaris,
published by SunSoft Press
PTR Prentice Hall.
Reach Adrian at adrian.cockcroft@sunworld.com.
URL: http://www.sunworld.com/swol-04-1998/swol-04-perf.html