Performance perplexities: Help! Where do I start?
Got a problem? Here are 12 simple questions to ask yourself before you ask others
Where should you start with a performance problem, and what information should you collect if you are going to ask for help? This month Adrian presents his cold-start guide to get you going. (1,900 words)
I have a performance problem. What do you need to know to help me?
I see a lot of questions from users or administrators who have decided that they have a performance problem but don't know where to start or what information to provide when they ask for help. I have seen e-mail from people who just say "my system is slow" and give no additional information at all. I have also seen 10-megabyte e-mail messages with 20 attachments containing days of iostat reports, but with no indication of what application the machine is supposed to be running. In this section, I'll lead you through the initial questions that need to be answered. This may be enough to get you on the right track to solving the problem yourself, and it will make it easier to ask for help effectively.
1. What is the business function of the system?
What is the system used for? What is its primary application? It could be a file server, database server, end-user CAD workstation, Internet server, or embedded control system.
2. Who and where are the users?
How many users are there? How do they use the system, and what kind of work patterns do they have? They might be a classroom full of students, people browsing the Internet from home, data entry clerks, development engineers, real-time data feeds, batch jobs. Are the end users directly connected? From what kind of device?
3. Who says there is a performance problem? What is slow?
Are the end users complaining, or do you have some objective business measure like batch jobs not completing quickly enough? If there are no complaints, then you should be measuring business-oriented throughput and response times, together with system utilization levels. Don't waste time worrying about obscure kernel measurements. If you have established a baseline of utilization, business throughput, and response times, then it is obvious when there is a problem because the response time will have increased, and that is what drives user perceptions of performance. It is useful to have real measures of response times or a way to derive them. You may get only subjective measures -- "it feels sluggish today" -- or have to use a stopwatch to time things.
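If no response-time measurements exist yet, even a software stopwatch is a reasonable start. The Python sketch below times a repeatable operation several times and compares the median against a recorded baseline. The function names, the use of the median, and the 20 percent tolerance are illustrative assumptions, not part of any Solaris tool.

```python
import statistics
import time


def measure_response_times(operation, repeats=5):
    """Run operation() several times and return each elapsed time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return samples


def is_slower_than_baseline(samples, baseline_seconds, tolerance=0.20):
    """Return True if the median sample exceeds the baseline by more than tolerance.

    tolerance=0.20 (20 percent) is an arbitrary illustrative threshold.
    """
    return statistics.median(samples) > baseline_seconds * (1.0 + tolerance)
```

In practice, operation would wrap something the users actually do, such as a representative query or page fetch, and the baseline would come from measurements taken when nobody was complaining.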
4. What is the system configuration?
How many machines are involved; what is the CPU, memory, network, and disk setup; what version of Solaris is running; what relevant patches are loaded? A good description of a system might be something like this: an Ultra2/2200, with 512 MB, one 100-Mbit switched duplex Ethernet, two internal 2-GB disks with six external 4-GB disks on their own controller, running Solaris 2.5.1 with the latest kernel, network device, and TCP patches.
5. What application software is in use?
If the system is just running Solaris services, which ones are most significant? If it is an NFS server, is it running NFS V2 or NFS V3 (this depends mostly upon the NFS clients). If it is a Web server, is it running Sun's SWS, Netscape, or Apache (and which version)? If it is a database server, which database is it, and are the database tables running on raw disk or in filesystem tables? Has a database vendor specialist checked that the database is configured for good performance and indexed correctly?
6. What are the busy processes on the system doing?
A system becomes busy by running application processes; the most important thing to look at is which processes are busy, who started them, how much CPU they are using, how much memory they are using, and how long they have been running. If you have a lot of short-lived processes, the only way to catch their usage is to use system accounting. For long-lived processes, you can use the ps command or a tool such as symon. A simple and effective summary is to use the old Berkeley version of ps to get a top 10 listing, as shown below. On a large system, there may be a lot more than 10 busy processes, so get all that are using significant amounts of CPU so that you have captured 90 percent or more of the CPU consumption.
Figure 1-0 Example listing the busiest processes on a system
% /usr/ucb/ps uaxw | head
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
adrianc   2431 17.9 22.63857628568 ?        S   Oct 13  7:38 /usr/dist/pkgs/framemaker,v4.0/bin/sunxm.s5.sparc/maker
adrianc    666  3.0 14.913073618848 console R   Oct 02 12:28 /usr/openwin/bin/X :0 -dev /dev/fbs/ffb0 defclass TrueColor defdepth 24
root      6268  0.2  0.9 1120 1072 pts/4    O 17:00:29  0:00 /usr/ucb/ps uaxw
adrianc   2936  0.1  1.8 3672 2248 ??       S   Oct 14  0:04 /usr/openwin/bin/cmdtool -C
root         3  0.1  0.0    0    0 ?        S   Oct 02  2:17 fsflush
root         0  0.0  0.0    0    0 ?        T   Oct 02  0:00 sched
root         1  0.0  0.1 1664  136 ?        S   Oct 02  0:00 /etc/init -
root         2  0.0  0.0    0    0 ?        S   Oct 02  0:00 pageout
root        93  0.0  0.2 1392  216 ?        S   Oct 02  0:00 /usr/sbin/in.routed -q
Unfortunately, some of the numbers above run together, so the columns need explaining. The %MEM field shows the RSS as a percentage of total memory. SZ shows the size of the process virtual address space; for X servers this size includes a memory-mapped frame buffer, and in this case, for a Creator3D, the frame buffer address space adds over 100 megabytes to the total. For normal processes, a large SZ indicates a large swap space usage. The RSS column shows the amount of RAM mapped to that process, including RAM shared with other processes. In this case, PID 2431 has an SZ of 38576 KB and an RSS of 28568 KB, which is 22.6 percent of the available memory on this 128-MB Ultra. The X server has an SZ of 130736 KB and an RSS of 18848 KB.
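The advice to capture 90 percent or more of the CPU consumption can be sketched in Python as follows. The function name and the dictionary input are hypothetical, and the 90 percent is interpreted here as a share of the CPU used by the listed processes, which is an assumption about how to apply the rule.

```python
def processes_covering(cpu_by_pid, coverage=90.0):
    """Return the busiest PIDs, in descending %CPU order, whose combined
    share first reaches `coverage` percent of the CPU used by all the
    processes in cpu_by_pid."""
    total = sum(cpu_by_pid.values())
    if total <= 0:
        return []
    selected = []
    running = 0.0
    ranked = sorted(cpu_by_pid.items(), key=lambda item: item[1], reverse=True)
    for pid, cpu in ranked:
        selected.append(pid)
        running += cpu
        if running / total * 100.0 >= coverage:
            break
    return selected


# %CPU values copied from the listing above; processes at 0.0 are omitted.
sample = {2431: 17.9, 666: 3.0, 6268: 0.2, 2936: 0.1, 3: 0.1}
busiest = processes_covering(sample)  # PIDs worth reporting when asking for help
```

On a desktop like this one, two processes account for nearly everything; on a large server the list would be much longer than the default head output.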
7. What are the CPU and disk utilization levels?
How busy is the CPU overall, and what is the proportion of user and system CPU time? How busy are the disks, and which ones have the highest load? All this information can be seen with iostat -xc (-xPnce in Solaris 2.6; think of "expense" to remember the new options). Don't collect more than 100 samples, strip out all the idle disks, and set your recording interval to match the time span you need to instrument. For a 24-hour day, 15-minute intervals are fine. For a 10-minute period when the system is busy, 10-second intervals are fine. The shorter the time interval, the more "noisy" the data will be because the peaks are not smoothed out over time. Gathering both a long-term and a short-term peak view helps highlight the problem areas. One way to collect this data is to use the SE toolkit: a script I wrote, virtual_adrian.se (see Resources below), writes out to a text-based log whenever it sees part of the system (a disk or whatever) that seems to be slow or overloaded.
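The sampling guidance (no more than 100 samples, with the interval matched to the time span) reduces to simple arithmetic. This Python sketch picks the shortest convenient interval that stays within the sample limit; the list of "round" intervals and the function name are assumptions made for illustration.

```python
# Candidate intervals in seconds; this particular list is an assumption.
ROUND_INTERVALS = [1, 5, 10, 15, 30, 60, 300, 600, 900, 1800, 3600]


def sampling_interval(span_seconds, max_samples=100):
    """Pick the shortest round interval that keeps the number of samples
    at or below max_samples for the given time span.

    Falls back to the longest candidate if even that gives too many samples.
    """
    for interval in ROUND_INTERVALS:
        if span_seconds / interval <= max_samples:
            return interval
    return ROUND_INTERVALS[-1]


day_interval = sampling_interval(24 * 3600)   # 900 seconds, 96 samples
busy_interval = sampling_interval(10 * 60)    # 10 seconds, 60 samples
```

Both results agree with the figures in the text: 15-minute intervals for a 24-hour day and 10-second intervals for a 10-minute busy period.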
8. What is making the disks busy?
If the whole disk subsystem is idle, then you can skip this question. The per-process data does not tell you which disks the processes are accessing. Use the df command to list mounted file systems, and use showmount to show which ones are exported from an NFS server; then, figure out how the applications are installed to work out which disks are being hit and where raw database tables are located. The swap -l command lists swap file locations; watch these carefully in the iostat data because they all become very busy with paging activity when there is a memory shortage.
9. What is the network name service configuration?
If the machine is responding slowly but does not seem to be at all busy, it may be waiting for some other system to respond to a request. A surprising number of problems can be caused by badly configured name services. Check /etc/nsswitch.conf and /etc/resolv.conf to see if DNS, NIS, or NIS+ is in use. Make sure the name servers are all running and responding quickly. Also check that the system is properly routing over the network.
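One rough way to check that name lookups respond quickly is to time them from the affected machine. The Python sketch below does this through whatever name service the system is configured to use; the function name and the injectable resolver parameter are illustrative assumptions.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.gethostbyname):
    """Return (address, elapsed_seconds) for one name lookup.

    A failed lookup returns (None, elapsed_seconds), so slow failures such
    as timeouts against an unresponsive name server are still measured.
    """
    start = time.perf_counter()
    try:
        address = resolver(hostname)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        address = None
    return address, time.perf_counter() - start
```

Lookups that take seconds rather than milliseconds, or failures that take a long time to return, point toward the name service rather than the local machine.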
10. How much network activity is there?
You need to look at the packet rate on each interface, the NFS client and server operation rates, and the TCP connection rate, throughput, and retransmission rate. One way is to run the following command line twice, separated by a defined time interval, and compare the counters.
% netstat -i; nfsstat; netstat -s
Another way is to use the SE toolkit's nx.se script, which monitors the interfaces and TCP data and produces output along these lines:
% se nx.se 10
Current tcp RtoMin is 200, interval 10, start Thu Oct 16 16:52:33 1997
Name  Ipkt/s Opkt/s Err/s Coll% NoCP/s Defr/s tcpIn tcpOut Conn/s %Retran
hme0   212.0  426.9  0.00  0.00   0.00   0.00    65 593435   0.00    0.00
hme0   176.1  352.6  0.00  0.00   0.00   0.00    53 490379   0.00    0.00
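When the counters are collected by hand in two runs, the rates have to be derived from the differences. The Python sketch below shows the arithmetic on two snapshots of cumulative counters; the dictionary keys are hypothetical labels, not the names printed by netstat, and the sample values are invented to illustrate the calculation.

```python
def per_second_rates(first, second, interval_seconds):
    """Convert two snapshots of cumulative counters into per-second rates."""
    return {
        name: (second[name] - first[name]) / interval_seconds
        for name in first
        if name in second
    }


def retransmit_percent(first, second):
    """Retransmitted TCP segments as a percentage of segments sent
    between the two snapshots."""
    sent = second["out_segments"] - first["out_segments"]
    resent = second["retransmitted_segments"] - first["retransmitted_segments"]
    return 100.0 * resent / sent if sent > 0 else 0.0


# Invented counter values, taken 10 seconds apart.
before = {"input_packets": 1000, "output_packets": 2000,
          "out_segments": 5000, "retransmitted_segments": 10}
after = {"input_packets": 3120, "output_packets": 6269,
         "out_segments": 9000, "retransmitted_segments": 50}

rates = per_second_rates(before, after, 10)
retrans = retransmit_percent(before, after)  # 40 resent out of 4000 sent
```

Using a fixed, known interval between the two runs matters more than the interval's length, since every rate is divided by it.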
11. Is there enough memory?
When an application starts up or grows or reads files, it takes memory from the free list. When the free list gets down to a few megabytes, the kernel decides which files and processes to steal memory from, to replenish the free list. It decides by scanning pages, looking for ones that haven't been used recently and paging out their contents so that the memory can be put on the free list. If there is no scanning, then you definitely have enough memory. If there is a lot of scanning and the swap disks are busy at the same time, you need more memory. If the swap disks are more than 50 percent busy, you should make swap files or partitions on other disks to spread the load and improve performance while waiting for more RAM to be delivered. You can use sar -g to look at the paging activity, or virtual_adrian.se will watch it for you.
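These rules of thumb can be written down as a small decision function. In the Python sketch below, only the 50 percent figure comes from the text; the cutoffs for "a lot of scanning" and for "busy" swap disks are assumed values that would need tuning, and the function name is hypothetical.

```python
def memory_advice(scan_rate, swap_busy_percent,
                  heavy_scan_rate=200.0, busy_threshold=10.0):
    """Apply the memory rules of thumb and return a list of findings.

    scan_rate: pages scanned per second, as reported by a paging monitor.
    swap_busy_percent: utilization of the busiest swap disk.
    heavy_scan_rate, busy_threshold: ASSUMED cutoffs; the text gives no
    numbers for "a lot of scanning" or for "busy" swap disks.
    """
    advice = []
    if scan_rate == 0:
        advice.append("enough memory")
    elif scan_rate >= heavy_scan_rate and swap_busy_percent >= busy_threshold:
        advice.append("add memory")
    if swap_busy_percent > 50.0:
        advice.append("spread swap over more disks")
    return advice
```

Moderate scanning with idle swap disks deliberately produces no finding, since the text draws no conclusion for that case.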
12. What changed recently and what is on the way?
It is always useful to know what was changed. You might have added a lot more users, or some event might have caused higher user activity than usual. You might have upgraded an application to add features or installed a newer version. Other systems may have been added to the network. Configuration changes or hardware "upgrades" can sometimes impact performance if they are not configured properly. You might have added a hardware RAID controller but forgotten to enable its nonvolatile RAM for fast write capability. It is also useful to know what might happen in the future. How much extra capacity might be needed for the next bunch of additional users or new applications?
I hope that helps you get started. You may even solve the problem yourself while collecting data to send to someone else! Feel free to send problems to me, but please don't send urgent ones that need a prompt answer, as I may not be able to help you in time. That's what SunService gets paid for!
We are getting close to a new version of SE. It's a major rewrite with lots of new features and includes Solaris 2.5, 2.5.1, and 2.6 support for SPARC and x86 (no older releases any more). It has taken a long time to build and test, and it will be the subject of next month's column. It's not available before December 1st, so please don't keep asking for it! Any queries about SE should go to the email@example.com alias only, which reaches both myself and Rich Pettit and is logged for posterity.
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at firstname.lastname@example.org.