Performance Q & A by Adrian Cockcroft

Performance perplexities: Help! Where do I start?

Got a problem? Here are 12 simple questions to ask yourself before you ask others

November  1997

Where should you start with a performance problem, and what information should you collect if you are going to ask for help? This month Adrian presents his cold-start guide to get you going. (1,900 words)

Q I have a performance problem. What do you need to know to help me?

A I see a lot of questions from users or administrators who have decided that they have a performance problem but don't know where to start or what information to provide when they ask for help. I have seen e-mail from people who just say "my system is slow" and give no additional information at all. I have also seen 10-megabyte e-mail messages with 20 attachments containing days of vmstat, sar, and iostat reports, but with no indication of what application the machine is supposed to be running. In this section, I'll lead you through the initial questions that need to be answered. This may be enough to get you on the right track to solving the problem yourself, and it will make it easier to ask for help effectively.

1. What is the business function of the system?

What is the system used for? What is its primary application? It could be a file server, database server, end-user CAD workstation, Internet server, or embedded control system.

2. Who and where are the users?

How many users are there? How do they use the system, and what kind of work patterns do they have? They might be a classroom full of students, people browsing the Internet from home, data entry clerks, or development engineers, or the load might come from real-time data feeds or batch jobs. Are the end users directly connected? From what kind of device?

3. Who says there is a performance problem? What is slow?

Are the end users complaining, or do you have some objective business measure like batch jobs not completing quickly enough? If there are no complaints, then you should be measuring business-oriented throughput and response times, together with system utilization levels. Don't waste time worrying about obscure kernel measurements. If you have established a baseline of utilization, business throughput, and response times, then it is obvious when there is a problem because the response time will have increased, and that is what drives user perceptions of performance. It is useful to have real measures of response times or a way to derive them. You may get only subjective measures -- "it feels sluggish today" -- or have to use a stopwatch to time things.


4. What is the system configuration?

How many machines are involved; what is the CPU, memory, network, and disk setup; what version of Solaris is running; what relevant patches are loaded? A good description of a system might be something like this: an Ultra2/2200, with 512 MB, one 100-Mbit switched duplex Ethernet, two internal 2-GB disks with six external 4-GB disks on their own controller, running Solaris 2.5.1 with the latest kernel, network device, and TCP patches.

5. What application software is in use?

If the system is just running Solaris services, which ones are most significant? If it is an NFS server, is it running NFS V2 or NFS V3? (This depends mostly upon the NFS clients.) If it is a Web server, is it running Sun's SWS, Netscape, or Apache (and which version)? If it is a database server, which database is it, and are the database tables running on raw disk or in filesystem tables? Has a database vendor specialist checked that the database is configured for good performance and indexed correctly?

6. What are the busy processes on the system doing?

A system becomes busy by running application processes; the most important thing to look at is which processes are busy, who started them, how much CPU they are using, how much memory they are using, and how long they have been running. If you have a lot of short-lived processes, the only way to catch their usage is to use system accounting. For long-lived processes, you can use the ps command or a tool such as top, proctool, or symon. A simple and effective summary is to use the old Berkeley version of ps to get a top 10 listing, as shown below. On a large system, there may be a lot more than 10 busy processes, so get all that are using significant amounts of CPU so that you have captured 90 percent or more of the CPU consumption by processes.

Figure 1-0 Example listing the busiest processes on a system

% /usr/ucb/ps uaxw | head
adrianc   2431 17.9 22.63857628568 ?        S   Oct 13  7:38 /usr/dist/pkgs/framemaker,v4.0/bin/sunxm.s5.sparc/maker
adrianc    666  3.0 14.913073618848 console  R   Oct 02 12:28 /usr/openwin/bin/X :0 -dev /dev/fbs/ffb0 defclass TrueColor defdepth 24
root      6268  0.2  0.9 1120 1072 pts/4    O 17:00:29  0:00 /usr/ucb/ps uaxw
adrianc   2936  0.1  1.8 3672 2248 ??       S   Oct 14  0:04 /usr/openwin/bin/cmdtool -C
root         3  0.1  0.0    0    0 ?        S   Oct 02  2:17 fsflush
root         0  0.0  0.0    0    0 ?        T   Oct 02  0:00 sched
root         1  0.0  0.1 1664  136 ?        S   Oct 02  0:00 /etc/init -
root         2  0.0  0.0    0    0 ?        S   Oct 02  0:00 pageout
root        93  0.0  0.2 1392  216 ?        S   Oct 02  0:00 /usr/sbin/in.routed -q

Unfortunately, some of the numbers above run together: the %MEM field shows the RSS as a percentage of total memory. SZ shows the size of the process virtual address space; for X servers this size includes a memory-mapped frame buffer, and in this case, for a Creator3D the frame buffer address space adds over 100 megabytes to the total. For normal processes, a large SZ indicates a large swap space usage. The RSS column shows the amount of RAM mapped to that process, including RAM shared with other processes. In this case, PID 2431 has an SZ of 38576 KB and RSS of 28568 KB, 22.6 percent of the available memory on this 128-MB Ultra. The X server has an SZ of 130736 KB and an RSS of 18848 KB.
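
The "90 percent or more of the CPU consumption" rule above can be sketched as a small filter. This is only an illustration, not part of the original column: the function name busiest_procs is made up, and it assumes the input is already in descending %CPU order, as in the listing above. It reads only the USER, PID, and %CPU fields, because the %MEM, SZ, and RSS fields can run together.

```shell
# busiest_procs PCT (hypothetical helper): read "/usr/ucb/ps uaxw"-style
# lines on stdin and print USER, PID and %CPU for the busiest processes
# until PCT percent of the CPU used by all listed processes is covered.
# Assumes the input is already sorted by descending %CPU.
busiest_procs() {
    awk -v pct="${1:-90}" '
        $3 ~ /^[0-9.]+$/ {             # skip the header and malformed lines
            n++
            user[n] = $1; pid[n] = $2; cpu[n] = $3 + 0
            total += cpu[n]
        }
        END {
            for (i = 1; i <= n && total > 0; i++) {
                printf "%s %s %.1f\n", user[i], pid[i], cpu[i]
                seen += cpu[i]
                if (seen * 100 >= pct * total) break
            }
        }'
}
```

It would be used as "/usr/ucb/ps uaxw | busiest_procs 90". On the listing above, that stops after the first two processes (PID 2431 and PID 666), since together they account for well over 90 percent of the CPU time shown.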

7. What are the CPU and disk utilization levels?

How busy is the CPU overall, and what is the proportion of user and system CPU time? How busy are the disks, and which ones have the highest load? All this information can be seen with iostat -xc (iostat -xPnce in Solaris 2.6 -- think of "expense" to remember the new options). Don't collect more than 100 samples, strip out all the idle disks, and set your recording interval to match the time span you need to instrument. For a 24-hour day, 15-minute intervals are fine. For a 10-minute period when the system is busy, 10-second intervals are fine. The shorter the time interval, the more "noisy" the data will be because the peaks are not smoothed out over time. Gathering both a long-term and a short-term peak view helps highlight the problem areas. One way to collect this data is to use the SE toolkit -- a script I wrote, called (see Resources below), writes out to a text-based log whenever it sees part of the system (a disk or whatever) that seems to be slow or overloaded.
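
The arithmetic behind the 100-sample limit is easy to check. The helper below is a sketch with a made-up name, sample_interval. It prints the smallest whole-second interval that covers a given time span without exceeding a maximum number of samples (100 by default).

```shell
# sample_interval SECONDS [MAX_SAMPLES] (hypothetical helper): print the
# smallest whole-second interval that covers SECONDS with at most
# MAX_SAMPLES samples (default 100).
sample_interval() {
    _span=$1
    _max=${2:-100}
    # integer ceiling of _span / _max
    echo $(( (_span + _max - 1) / _max ))
}
```

For a 24-hour day, "sample_interval 86400" gives 864 seconds, so the 15-minute figure above (96 samples) is a convenient round number just under the limit. Likewise, 10-second samples over 10 minutes yield only 60 samples. Since iostat takes an interval and a count as its last two arguments, the 24-hour case could be collected with something like "iostat -xc 900 96".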

8. What is making the disks busy?

If the whole disk subsystem is idle, then you can skip this question. The per-process data does not tell you which disks the processes are accessing. Use the df command to list mounted file systems, and use showmount to show which ones are exported from an NFS server; then, figure out how the applications are installed to work out which disks are being hit and where raw database tables are located. The swap -l command lists swap file locations; watch these carefully in the iostat data because they all become very busy with paging activity when there is a memory shortage.

9. What is the network name service configuration?

If the machine is responding slowly but does not seem to be at all busy, it may be waiting for some other system to respond to a request. A surprising number of problems can be caused by badly configured name services. Check /etc/nsswitch.conf and /etc/resolv.conf to see if DNS, NIS, or NIS+ is in use. Make sure the name servers are all running and responding quickly. Also check that the system is properly routing over the network.
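
As a quick way to see which name services are consulted, and in what order, the sketch below pulls the source list for one database out of nsswitch.conf-style text. The function name ns_sources is made up for illustration, and it assumes the usual layout of a database name followed by a colon and a whitespace-separated list of sources, with any bracketed criteria written without embedded spaces.

```shell
# ns_sources DB (hypothetical helper): read nsswitch.conf-style text on
# stdin and print the lookup sources configured for database DB
# (for example "hosts"), one per line, in lookup order.
ns_sources() {
    awk -v db="$1:" '
        { sub(/#.*/, "") }             # drop comments
        $1 == db {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^\[/)       # skip criteria such as [NOTFOUND=return]
                    print $i
        }'
}
```

Running "ns_sources hosts < /etc/nsswitch.conf" lists the services that every hostname lookup may have to wait for. Each one listed is a server worth checking for availability and response time.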

10. How much network activity is there?

You need to look at the packet rate on each interface, the NFS client and server operation rates, and the TCP connection rate, throughput, and retransmission rate. One way is to run this twice, separated by a defined time interval.

% netstat -i; nfsstat; netstat -s

Another way is to use the SE toolkit's script that monitors the interfaces and TCP data along the lines of iostat -x.

% se 10
Current tcp RtoMin is 200, interval 10, start Thu Oct 16 16:52:33 1997
Name  Ipkt/s Opkt/s  Err/s  Coll% NoCP/s Defr/s  tcpIn  tcpOut Conn/s %Retran
hme0   212.0  426.9   0.00   0.00   0.00   0.00     65  593435   0.00    0.00
hme0   176.1  352.6   0.00   0.00   0.00   0.00     53  490379   0.00    0.00
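
For the first approach, the counters printed by netstat and nfsstat are cumulative, so a rate comes from the difference between two snapshots divided by the time between them. The one-line helper below is a sketch of that arithmetic. The name rate is made up, and you would copy the two counter values by hand from the two runs.

```shell
# rate BEFORE AFTER SECONDS (hypothetical helper): per-second rate of a
# cumulative counter that was sampled twice, SECONDS apart.
rate() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN {
        if (t <= 0) exit 1             # refuse a zero or negative interval
        printf "%.1f\n", (b - a) / t
    }'
}
```

For example, if an interface's input packet count went from 1000 to 3120 over a 10-second interval (made-up numbers), "rate 1000 3120 10" reports 212.0 packets per second.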

11. Is there enough memory?

When an application starts up or grows or reads files, it takes memory from the free list. When the free list gets down to a few megabytes, the kernel decides which files and processes to steal memory from, to replenish the free list. It decides by scanning pages, looking for ones that haven't been used recently and paging out their contents so that the memory can be put on the free list. If there is no scanning, then you definitely have enough memory. If there is a lot of scanning and the swap disks are busy at the same time, you need more memory. If the swap disks are more than 50 percent busy, you should make swap files or partitions on other disks to spread the load and improve performance while waiting for more RAM to be delivered. You can use vmstat or sar -g to look at the paging system, or will watch it for you.
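
The rules in this section can be written down as a small decision function. This is a sketch only: the name memory_verdict is made up, and the cutoffs for "a lot of scanning" (200 pages per second here) and for a "busy" swap disk (5 percent here) are placeholder assumptions, not figures from this column. Only the 50 percent figure comes from the text above.

```shell
# memory_verdict SCAN_RATE SWAP_BUSY_PCT [HEAVY_SCAN] [BUSY_MIN]
# (hypothetical helper)
#   SCAN_RATE      pages scanned per second (the sr column in vmstat)
#   SWAP_BUSY_PCT  percent busy of the busiest swap disk, from iostat
#   HEAVY_SCAN     assumed cutoff for "a lot of scanning" (placeholder: 200)
#   BUSY_MIN       assumed cutoff for a "busy" swap disk (placeholder: 5)
memory_verdict() {
    awk -v sr="$1" -v busy="$2" -v heavy="${3:-200}" -v busy_min="${4:-5}" 'BEGIN {
        if (sr == 0)
            print "enough memory"
        else if (sr >= heavy && busy > 50)
            print "need more memory; spread swap across other disks"
        else if (sr >= heavy && busy >= busy_min)
            print "need more memory"
        else
            print "inconclusive"
    }'
}
```

For example, "memory_verdict 0 80" prints "enough memory" no matter how busy the disks are, because no scanning means there is no memory shortage. Anything between no scanning and heavy scanning is reported as inconclusive, since the rules above do not cover that case.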

12. What changed recently and what is on the way?

It is always useful to know what was changed. You might have added a lot more users, or some event might have caused higher user activity than usual. You might have upgraded an application to add features or installed a newer version. Other systems may have been added to the network. Configuration changes or hardware "upgrades" can sometimes impact performance if they are not configured properly. You might have added a hardware RAID controller but forgotten to enable its nonvolatile RAM for fast write capability. It is also useful to know what might happen in the future. How much extra capacity might be needed for the next bunch of additional users or new applications?

Wrap up
I hope that helps you get started. You may even solve the problem yourself while collecting data to send to someone else! Feel free to send problems to me, but please don't send urgent ones that need a prompt answer, as I may not be able to help you in time. That's what SunService gets paid for!

We are getting close to a new version of SE. It's a major rewrite with lots of new features and includes Solaris 2.5, 2.5.1, and 2.6 support for SPARC and x86 (no older releases any more). It has taken a long time to build and test, and it will be the subject of next month's column. It's not available before December 1st, so please don't keep asking for it! Any queries about SE should go to the alias only, which reaches both myself and Rich Pettit and is logged for posterity.

About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at

[(c) Copyright  Web Publishing Inc., and IDG Communication company]
