Click on our Sponsors to help Support SunWorld
Performance Q & A by Adrian Cockcroft

What's the best way to probe processes?

How to measure the resource consumption and process activity on a Unix computer.

SunWorld
August  1996
[Next story]
[Table of Contents]
[Search]
Subscribe to SunWorld, it's free!

Abstract
There are several ways to examine individual processes in Solaris, there is actually more information available than you may realize. This month we look at the commands and programming interfaces provided in Solaris 2.5. (3,500 words)


Mail this
article to
a friend

Q: How can I measure the resource consumption and activity of the processes on my Unix computer?
--Probing in Peshtigo

A: Most people know about the ps command. It gives a high level summary of the status of processes on your system. It has many options, but it is a relatively crude way to figure out what your processes are doing.

Of all the available options, the best performance related summary comes from the BSD version /usr/ucb/ps uax, which collects all the process data in one go, sorts the output by CPU usage, then displays the result. The unsorted versions of ps loop through the processes, printing as they go. This spreads out the time at which the processes are measured, and the last line process measured is significantly later than the first. By collecting in one go, the sorted versions give a more consistent view of the system. A quick look at the most active processes can be obtained easily using this command:

% /usr/ucb/ps uax | head
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
adrianc    333  1.0  8.2 8372 5120 console  S 09:28:38  0:29 /usr/openwin/bin/X
root       483  0.4  1.4 1016  872 pts/1    O 09:56:36  0:00 /usr/ucb/ps uax
adrianc    433  0.3 15.812492 9832 ?        S 09:31:47  0:26 /export/framemaker
root       240  0.3  5.3 3688 3260 ?        S 09:27:22  0:07 //opt/RICHPse/bin/
adrianc    367  0.2  4.2 3472 2620 ??       S 09:28:56  0:00 cmdtool -Wp 603 49
adrianc    484  0.1  0.9  724  540 pts/1    S 09:56:36  0:00 head
root         3  0.1  0.0    0    0 ?        S 09:25:17  0:02 fsflush
adrianc    370  0.1  1.4  980  824 pts/1    S 09:28:57  0:00 /bin/csh
adrianc    358  0.1  2.6 2088 1616 console  S 09:28:54  0:00 olwm -syncpid 357

This summary immediately tells you who is running the most active processes. The %CPU measure is a time-decayed average of recent CPU usage. %MEM tells you the proportion of the total RAM in your system is in use by each process (it won't add up to 100 percent as some RAM is shared by several processes). SZ is the size of the process address space. It's a good indicator of how much swap space the process needs. In some cases it includes memory mapped devices, so don't be surprised if the X process appears to be huge on an Ultra 1 with Creator framebuffer. RSS is the basis for %MEM, its the amount of RAM in use by the process. TT shows you which "teletype" the user is logged in on. S shows the status. "S" means sleeping, "O" means on-cpu or running, "R" means runnable and waiting for a CPU to become free. START is the time the process started up, and TIME is the total amount of CPU time it has used so far. COMMAND The same basic information is displayed by the well known freeware utilities top and proctool. Solstice Symon and most commercial performance tools also display this data.


Advertisements

Where does the data come from?
I'm sure you have noticed the strange entry that pops up in df when you are checking how much disk space is left.

% df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t2d0s0     963662  782001   85301    91%    /
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
/dev/dsk/c0t3d0s0     406854  290066   96448    76%    /export/home
It seems that there is a filesystem called /proc, if you look at it, you find a list of numbers that correspond to the processes on your system.

% ls /proc
00000   00140   00197   00237   00309   00333   00358   00379   00586
00001   00149   00207   00239   00312   00334   00359   00382
00002   00152   00216   00240   00313   00342   00367   00385
00003   00154   00217   00258   00318   00349   00370   00388
Using ls -l you see the owner and size.
% /usr/ucb/ps uax | head
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
adrianc    333  0.6  8.0 8380 4984 console  S 09:28:38  1:16 /usr/openwin/bin/X
% ls -l /proc/333
-rw-------   1 adrianc  staff    8581120 Jul 26 09:28 /proc/333
The units in ps are KB, and 8581120/1024 = 8380 as expected. The owner and permissions are used to control access to the processes. The ps command has to be setuid root so be able to show anyone status of all processes, You can only debug or trace processes that you have permissions for.

As you might expect there is a manual page for /proc, if you are running Solaris 2.5 you will discover another manual page. There are a whole bunch of commands that use /proc, documented under the proc(1) manual page. The /proc programming interface is described in proc(4). Let's take a look at these commands, which live in /usr/proc/bin. Here's an excerpt from the manual page, to save you looking it up.

SunOS 5.5            Last change: 9 Nov 1994                    1

proc(1)                   User Commands                   proc(1)

NAME
     proc, pflags, pcred, pmap, pldd, psig, pstack, pfiles, pwdx,
     pstop, prun, pwait, ptree, ptime - proc tools

DESCRIPTION
     The proc tools are  utilities  which  exercise  features  of
     /proc  (see  proc(4)).  Most of them take a list of process-
     ids  (pid);  those  that  do  also  accept  /proc/nnn  as  a
     process-id,  so  the  shell expansion /proc/* can be used to
     specify all processes in the system.

     pflags              print the /proc tracing flags, the pend-
                         ing  and  held  signals, and other /proc
                         status information for each lwp in  each
                         process.

     pcred               print the credentials  (effective,  real
                         and  saved UID's and GID's) of each pro-
                         cess.

     pmap                print the address space map of each pro-
                         cess.

     pldd                list the dynamic libraries  linked  into
                         each  process,  including shared objects
                         explicitly  attached  using  dlopen(3X).
                         (See also ldd(1).)

     psig                list the signal actions of each  process
                         (See signal(5).)

     pstack              print a  hex+symbolic  stack  trace  for
                         each lwp in each process.

     pfiles              report fstat(2) and fcntl(2) information
                         for all open files in each process.

     pwdx                print the current working  directory  of
                         each process.

     pstop               stop each process (PR_REQUESTED stop).

     prun                set each  process  running  (inverse  of
                         pstop).

     pwait               wait for all of the specified  processes
                         to terminate.

     ptree               print the process trees  containing  the
                         specified  pid's  or  users,  with child
                         processes indented from their respective
                         parent  processes.   An  argument of all
                         digits is taken to be a process-id, oth-
                         erwise  it is assumed to be a user login
                         name.  Default is all processes.

     ptime               time a command, such as the time(1) com-
                         mand,  but  using  microstate accounting
                         for reproducible precision.

That's already opened up a lot more possibilities. The /proc interface is designed to support the process debugging and analyzing tools in Sun's Workshop development tools. There are a few tantalizing hints here. Look at the description of pflags, it mentions tracing. And ptime, what is microstate accounting? A way to get higher precision measurements? We need to dig further into the programming interface to find out, but first we'll take a look at a bundled tool that uses /proc to trace the system calls made by a process.

Tracing in Solaris 2
The /usr/bin/truss command has many useful features not found in the SunOS 4 trace command. It can trace child processes, and it can count and time system calls and signals. Other options allow named system calls to be excluded or focused on, and data structures can be printed out in full. Here is an excerpt showing a fragment of truss output with the -v option to set verbose mode for data structures, and an example of truss -c showing the system call counts.

% truss -v all cp NewDocument Tuning
execve("/usr/bin/cp", 0xEFFFFB28, 0xEFFFFB38)  argc = 3
open("/usr/lib/libintl.so.1", O_RDONLY, 035737561304) = 3
mmap(0x00000000, 4096, PROT_READ, MAP_SHARED, 3, 0) = 0xEF7B0000
fstat(3, 0xEFFFF768)				= 0
    d=0x0080001E i=29585 m=0100755 l=1  u=2     g=2     sz=14512
	at = Apr 27 11:30:14 PDT 1993  [ 735935414 ]
	mt = Mar 12 18:35:36 PST 1993  [ 731990136 ]
	ct = Mar 29 11:49:11 PST 1993  [ 733434551 ]
    bsz=8192  blks=30    fs=ufs
....

% truss -c cp NewDocument Tuning
syscall      seconds   calls  errors
_exit            .00       1
write            .00       1
open             .00      10      4
close            .01       7
creat            .01       1
chmod            .01       1
stat             .02       2      1
lseek            .00       1
fstat            .00       4
execve           .00       1
mmap             .01      18
munmap           .00       9
memcntl          .01       1
                ----     ---    ---
sys totals:      .07      57      5
usr time:        .02
elapsed:         .43

I use truss a great deal to find out what a process is doing, which files are being read and written, and with truss -c you can see how long system calls take to execute on average, and where your system CPU time is coming from.

Who, what, when, how much?
Many processes live very short lives. You cannot see them with ps, but they may be so frequent that they dominate the load on your system. The only way to catch them is to ask the system to keep a record of every process that has run, who ran it, what was it, when did it start and end, and how much resource did it use. This is done by the system accounting subsystem. For some reason many administrators seem to have hang-ups about accounting. Perhaps it has connotations of "big brother is watching you," or they fear additional overhead. In truth, if Fred complains that his system is too slow, and the accounting records show that he spends all his time playing Doom, you should not be too sympathetic! The overhead of collecting accounting data is always present. When you turn on accounting, you are just enabling storage of a few bytes of useful data when a process exits.

Accounting data is most useful when measured over a long period of time. This can be useful on a network of workstations as well as on a single time-shared server. From this you can identify how often programs run, how much CPU time, I/O, and memory each program uses, and what work patterns throughout the week look like. To enable accounting to start immediately, enter the three commands shown below. Check out the section "Administering Security, Performance, and Accounting in Solaris 2" in the Solaris System Administration Answerbook and see the acctcom command. Some crontab entries must also be added to summarize and checkpoint the accounting logs. Collecting and checkpointing the accounting data itself puts a negligible additional load onto the system, but the summary scripts that run once a day or once a week can have a noticeable effect, so they should be scheduled to run out of hours.


# ln /etc/init.d/acct /etc/rc0.d/K22acct
# ln /etc/init.d/acct /etc/rc2.d/S22acct
# /etc/init.d/acct start
Starting process accounting

This is what your crontab file for the adm user should contain.


# crontab -l adm
#ident  "@(#)adm        1.5     92/07/14 SMI"   /* SVr4.0 1.2   */
#min    hour    day     month   weekday
0       *       *       *       *       /usr/lib/acct/ckpacct
30      2       *       *       *       /usr/lib/acct/runacct 2&#gt; /var/adm/acct/nite/fd2log
30      9       *       *       5       /usr/lib/acct/monacct
You get a daily accounting summary, but the one I like to keep track of is the monthly one stored in /var/adm/acct/fiscal. Here is an excerpt from fiscrpt07 on my home system.
Jul 26 09:30 1996  TOTAL COMMAND SUMMARY FOR FISCAL 07 Page 1

                                     TOTAL COMMAND SUMMARY
COMMAND   NUMBER      TOTAL       TOTAL       TOTAL   MEAN    MEAN     HOG         CHARS     BLOCKS
NAME        CMDS    KCOREMIN     CPU-MIN   REAL-MIN  SIZE-K  CPU-MIN  FACTOR      TRNSFD      READ

TOTALS     26488 16062007.75   3960.11    494612.41 4055.95   0.15  0.01  17427899648      39944

mae           36  7142887.25   1501.73      2128.50 4756.45  41.71  0.71   2059814144       1653
sundgado      16  3668645.19    964.83      1074.34 3802.36  60.30  0.90    139549181         76
Xsun          29  1342108.55    251.32      9991.62 5340.18   8.67  0.03   2784769024       1295
xlock         32  1027099.38    726.87      4253.34 1413.04  22.71  0.17   4009349888         15
fountain       2   803036.25    165.11       333.65 4863.71  82.55  0.49       378388          1
netscape      22   489512.97     72.39      3647.61 6762.19   3.29  0.02    887353080       2649
maker4X.      10   426182.31     43.77      5004.30 9736.27   4.38  0.01    803267592       3434
wabiprog      53   355574.99     44.32       972.44 8022.87   0.84  0.05    355871360        570
imagetoo      21   257617.08     15.65       688.46 16456.60   0.75  0.02     64291840        387
java         235   203963.64     37.96       346.35 5373.76   0.16  0.11    155950720        240
aviator        2   101012.82     22.93        29.26 4406.20  11.46  0.78      2335744         40
se.sparc      18    46793.09     19.30      6535.43 2424.47   1.07  0.00    631756294         20
xv             3    40930.98      5.58        46.37 7337.93   1.86  0.12    109690880         28
It looks as if my kids have been using it to play games during the day! The commands reported are sorted by KCOREMIN, which is the product of the amount of CPU time used and the amount of RAM used while the command was active. CPU-MIN is the number of minutes of CPU time. REAL_MIN is the elapsed time for the commands. SIZE-K is an average value for the RSS over the active lifetime of the process. It does not include times when the process was not actually running. In Solaris 2.4 and earlier releases a bug causes this measure to be garbage. HOG FACTOR is the ratio of CPU-MIN to REAL-MIN. A high factor means that this command hogs the CPU whenever it is running. CHARS TRNSFD counts the number of characters read and written. BLOCKS READ counts data read from block devices (basically local disk filesystem reads and writes). The underlying data that is collected can be seen in the acct(4) manual page. The data structure looks like this, its very compact, around 40 bytes.
DESCRIPTION
     Files produced as a result of calling acct(2)  have  records
     in the form defined by &#lt;sys/acct.h&#gt;, whose contents are:

     typedef ushort  comp_t;      /* pseudo "floating point" representation */
                                  /* 3 bit base-8 exponent in the high */
                                  /* order bits, and a 13-bit fraction */
                                  /* in the low order bits. */

     struct  acct
     {
             char    ac_flag;     /* Accounting flag */
             char    ac_stat;     /* Exit status */
             uid_t   ac_uid;      /* Accounting user ID */
             gid_t   ac_gid;      /* Accounting group ID */
             dev_t   ac_tty;      /* control tty */
             time_t  ac_btime;    /* Beginning time */
             comp_t  ac_utime;    /* accounting user time in clock */
                                  /* ticks */
             comp_t  ac_stime;    /* accounting system time in clock */
                                  /* ticks */
             comp_t  ac_etime;    /* accounting total elapsed time in clock */
                                  /* ticks */
             comp_t  ac_mem;      /* memory usage in clicks (pages) */
             comp_t  ac_io;       /* chars transferred by read/write */
             comp_t  ac_rw;       /* number of block reads/writes */
             char    ac_comm[8];  /* command name */
     };

Process data structures
There isn't a great deal of data in the accounting record. Let's see what is available from a process that is still running. Actually, a neat trick is to open an entry in /proc and wait until the process has exited. The fact that the /proc entry is still open means that you can still get the data described below. When you close the /proc "file" the zombie process will disappear.

These data structures are described in full in the proc(4) manual page. They are also available in the SE toolkit, so if you want to obtain the data and play around with it you should look at the code for ps-ax.se and msacct.se. The interface to /proc involves sending ioctl commands. The one that ps uses is called PIOCPSINFO and this is what you get back.

SunOS 5.5           Last change: 28 Mar 1995                    2

proc(4)                   File Formats                    proc(4)

  PIOCPSINFO
     This returns miscellaneous process information such as  that
     reported  by  ps(1).  p is a pointer to a prpsinfo structure
     containing at least the following fields:

     typedef struct prpsinfo {
       char        pr_state;    /* numeric process state (see pr_sname) */
       char        pr_sname;    /* printable character representing pr_state */
       char        pr_zomb;     /* !=0: process terminated but not waited for */
       char        pr_nice;     /* nice for cpu usage */
       u_long      pr_flag;     /* process flags */
       int         pr_wstat;    /* if zombie, the wait() status */
       uid_t       pr_uid;      /* real user id */
       uid_t       pr_euid;     /* effective user id */
       gid_t       pr_gid;      /* real group id */
       gid_t       pr_egid;     /* effective group id */
       pid_t       pr_pid;      /* process id */
       pid_t       pr_ppid;     /* process id of parent */
       pid_t       pr_pgrp;     /* pid of process group leader */
       pid_t       pr_sid;      /* session id */
       caddr_t     pr_addr;     /* physical address of process */
       long        pr_size;     /* size of process image in pages */
       long        pr_rssize;   /* resident set size in pages */
       u_long      pr_bysize;   /* size of process image in bytes */
       u_long      pr_byrssize; /* resident set size in bytes */
       caddr_t     pr_wchan;    /* wait addr for sleeping process */
       short       pr_syscall;  /* system call number (if in syscall) */       id_t        pr_aslwpid;  /* lwp id of the aslwp; zero if no aslwp */
       timestruc_t pr_start;    /* process start time, sec+nsec since epoch */
       timestruc_t pr_time;     /* usr+sys cpu time for this process */
       timestruc_t pr_ctime;    /* usr+sys cpu time for reaped children */
       long        pr_pri;      /* priority, high value is high priority */
       char        pr_oldpri;   /* pre-SVR4, low value is high priority */
       char        pr_cpu;      /* pre-SVR4, cpu usage for scheduling */
       u_short     pr_pctcpu;   /* % of recent cpu time, one or all lwps */
       u_short     pr_pctmem;   /* % of system memory used by the process */
       dev_t       pr_ttydev;   /* controlling tty device (PRNODEV if none) */
       char        pr_clname[PRCLSZ]; /* scheduling class name */
       char        pr_fname[PRFNSZ];  /* last component of exec()ed pathname */
       char        pr_psargs[PRARGSZ];/* initial characters of arg list */
       int         pr_argc;     /* initial argument count */
       char        **pr_argv;   /* initial argument vector */
       char        **pr_envp;   /* initial environment vector */
     } prpsinfo_t;
For a multithreaded process it is possible to get the data for each lightweight process separately. There's a lot more useful looking information there, but no sign of the high resolution microstate accounting that /usr/proc/bin/ptime (and msacct.se) display. They use a separate ioctl, PIOCUSAGE.
SunOS 5.5           Last change: 28 Mar 1995                   18

proc(4)                   File Formats                    proc(4)

  PIOCUSAGE
     When applied  to  the  process  file  descriptor,  PIOCUSAGE
     returns  the  process  usage information; when applied to an
     lwp file descriptor, it returns usage  information  for  the
     specific  lwp.   p  points  to  a prusage structure which is
     filled by the operation.  The prusage structure contains  at
     least the following fields:

     typedef struct prusage {
          id_t           pr_lwpid;     /* lwp id.  0: process or defunct */
          u_long         pr_count;     /* number of contributing lwps */
          timestruc_t    pr_tstamp;    /* current time stamp */
          timestruc_t    pr_create;    /* process/lwp creation time stamp */
          timestruc_t    pr_term;      /* process/lwp termination time stamp */
          timestruc_t    pr_rtime;     /* total lwp real (elapsed) time */
          timestruc_t    pr_utime;     /* user level CPU time */
          timestruc_t    pr_stime;     /* system call CPU time */
          timestruc_t    pr_ttime;     /* other system trap CPU time */
          timestruc_t    pr_tftime;    /* text page fault sleep time */
          timestruc_t    pr_dftime;    /* data page fault sleep time */
          timestruc_t    pr_kftime;    /* kernel page fault sleep time */
          timestruc_t    pr_ltime;     /* user lock wait sleep time */
          timestruc_t    pr_slptime;   /* all other sleep time */
          timestruc_t    pr_wtime;     /* wait-cpu (latency) time */
          timestruc_t    pr_stoptime;  /* stopped time */
          u_long         pr_minf;      /* minor page faults */
          u_long         pr_majf;      /* major page faults */
          u_long         pr_nswap;     /* swaps */
          u_long         pr_inblk;     /* input blocks */
          u_long         pr_oublk;     /* output blocks */
          u_long         pr_msnd;      /* messages sent */
          u_long         pr_mrcv;      /* messages received */
          u_long         pr_sigs;      /* signals received */
          u_long         pr_vctx;      /* voluntary context switches */
          u_long         pr_ictx;      /* involuntary context switches */
          u_long         pr_sysc;      /* system calls */
          u_long         pr_ioch;      /* chars read and written */
     } prusage_t;

     PIOCUSAGE  can  be  applied  to  a   zombie   process   (see
     PIOCPSINFO).

     Applying PIOCUSAGE to a process that does  not  have  micro-
     state  accounting  enabled will enable microstate accounting
     and return an estimate of times spent in the various  states
     up  to  this  point.   Further invocations of PIOCUSAGE will
     yield accurate microstate time accounting from  this  point.
     To  disable  microstate  accounting,  use PIOCRESET with the
     PR_MSACCT flag.
There is a lot of useful data here. The time spent waiting for various events is a key measure. I summarize it in msacct.se like this:
Elapsed time         3:20:50.049  Current time Fri Jul 26 12:49:28 1996
User CPU time           2:11.723  System call time        1:54.890
System trap time           0.006  Text pfault sleep          0.000
Data pfault sleep          0.023  Kernel pfault sleep        0.000
User lock sleep            0.000  Other sleep time     3:16:43.022
Wait for CPU time          0.382  Stopped time               0.000
The other thing to notice is that microstate accounting is not turned on by default. It slows the system down slightly, and while it was on by default up to Solaris 2.3, from Solaris 2.4 onwards it is only collected if you use it. The way CPU time is normally measured is to sample the state of all the CPUs from the clock interrupt 100 times a second. The way microstate accounting works, a high resolution timestamp is taken on every state change, every system call, every page fault, every scheduler change. It doesn't miss anything.

Wrap up
There are a lot more things that can be done with /proc, but I hope I have opened your eyes to some of the most useful performance information available in Solaris 2.


Click on our Sponsors to help Support SunWorld


Resources