Watching your Web server
How do I monitor my Web server's performance
Worried about Web server performance? Wonder how your server compares to others out there? The growing clamor for answers to these questions has prompted development of some Web sever specific benchmarks and Java applets for monitoring performance. This month we take a look at these benchmarks, the sources of sever performance data, and ways to fix common problems. (4,000 words)
Q:I'm getting complaints about my Web server. Users say it's too slow. Where do I start looking? What should I fix first? -- Webless in Wauwatosa
These are becoming common questions in our Internet-crazed world. Luckily, a variety of resources exist to help you answer them. Let's start at the top and look at some new benchmarks that help you compare your Web server's performance to that of others out there. Then we'll examine some tools that monitor your server's usage, and discuss ways to improve its speed.
There are two noteworthy Web server performance benchmarks. Webstones is available now, but is uncontrolled, so benchmarkers can change the code, the options, and the workload. The results are interesting, but need careful analysis of what lies behind the headlines. The proposed SPEC Webperf benchmark is available to SPEC members only, but when it is released it will be subject to strict run rules and full configuration disclosure.
The situation is a bit like the early, Wild West days of database benchmarking, when vendors would quote the TP1 benchmark and get as many transactions per second as they could by any means. The Transaction Performance Council cleared up this mess, and now the strictly controlled TPC-A, B, C and D benchmarks are used to benchmark database performance. Sun has decided to wait until the SPEC Webperf benchmark is ready before publishing any official numbers.
There are a few Webstone benchmark results online that use the SPARCstation 20 Model 71 as the test system. They report results that vary from 20 to 300 operations per second depending upon the server software in use. They also vary from about 36 to 100 operations per second using the same software, but increasing the RAM from 32 to 64 megabytes.
Chuck Musciano went over some of the basics in his Webmaster column last month. He referred to many Web servers, and for simplicity I'd like to group them into four generations of products that have similar performance characteristics. As they say "Your mileage (access sizes and types) will vary." The two benchmarks use a different workload mix. Webstone has an average transfer size of about 7K, Webperf has a different distribution and a larger average transfer size.
The Web server performance bottleneck moves from being the server code itself in the first generation to the TCP/IP protocol overhead in the second generation. It then ends up being the network bandwidth and disk subsystem in the third and fourth generations.
The numbers I use above are based on Webstone results measured by Spyglass. Spyglass modified Webstone to implement keep-alives, but used the "usual" Webstone workload mix. The company lists results for a "Brand X" Web server that looks like Netscape 1.1 to me. Some Webstone measurements made by Silicon Graphics Inc. (Webstone was written by SGI) show that when tested with 32 megabytes, the SPARCstation 20 Model 71 offers 36 operations per second. Essentially the same test run by Spyglass with 64 megabytes yielded about 100 operations per second.
Why the big difference?
I think you get the picture! There are too many caveats and detailed issues to make comparison between Webstone results easy. We will have to wait a few more months for SPEC Webperf to help clarify Web benchmarking.
What to monitor and what to fix
Let's start with the assumption that you are using one of the good second-generation Web servers like Netscape's Communications Server that comes with Sun's Netra Internet servers. If you are still using a first-generation server there's not much point worrying about tuning anything else.
It is common to combine several services onto one server, although separate machines can be configured for large sites. A search engine may create a larger load than the http service itself. Complex "cgi-bin" scripts, Usenet news, e-mail, or FTP traffic can easily use up all the CPU or the network bandwidth to your site.
System, network and Web service performance collection
I'll start by briefly revisiting the generic system performance rules I describe in my book. They are implemented by the SE toolkit. There are two additional sources of information that are relevant to Web servers. One is the TCP status data. The other is the access logfile.
I'll explain how I decided on what data to collect and how I implemented a lightweight script in SE to automate the process. You can download the script or look at some live data being collected on http://www.sun.com. There is a simple Java graphing applet that can be used to view the data more interactively. Finally, some common problems are explained and solved.
Storing long-term performance measures
Consider this sage advice from an experienced performance analyst:
"To predict the future, you must know the past." - Pat Artis
Don't wait for problems to be reported. Start collecting and analyzing data as soon as your service is up and running. If you record baseline data when the system is working well, you can compare it to data when it begins acting up. Often the problem will be obvious from the comparison, but difficult to see without the baseline. You will also notice potential problems before your users complain.
I write a new logfile each day -- a regular text file in table format. A header line identifies each space delimited column, and a timestamp begins each line of measurements. This format is easy to read into spreadsheets, statistics, or plotting packages. It is readable by human eyes and can be processed by a Java applet if the logfiles are stored in a place addressable by a URL. Here is an example, showing a Unix time (seconds since Jan 1st 1970 GMT) and local time (hours, minutes, seconds, US/Pacific), updated every five minutes.
% more /var/adm/weblogs/percol-1996-02-05 timestamp locltime ... 823562592 15:23:12 ... 823562892 15:28:12 ... 823563193 15:33:13 ...
The timestamp is useful as the X axis for graphs over time, and local time is useful for humans to read (labelling the X axis perhaps). It also correlates well with changes in the load level through the day.
One oddity in the behavior is that the output file is not opened until the first line of data is ready to write. This means that there is no output data, not even an empty file with a header, for the first five minutes. The code that generates a filename based on the date must work like this in order to get midnight change-over of the output file working correctly.
System performance rules
This month I'm introducing a new version of the SE toolkit (release 2.5) in my other performance column "New Release of the SE Performance Toolkit". The main new feature is the performance collator for Web servers that I'm describing here. If you already have the SE Performance Toolkit Version 2.5 or have read Appendix A at the end of my book, you should be familiar with the ten basic rules that indicate the state of parts of the system and the action required to improve bad states. The toolkit makes it easy to include the rules, and they provide a high-level indication of the system's health. The rules, along with a character code in brackets are:
The overall states that I report for each of these are:
If everything is white and green with the occasional amber all is well. If there are persistent red states, you may need to use more detailed diagnostics like the virtual_adrian script if the problem is not obvious. The most common black state results when swap space runs out. With no swap space, programs give errors or crash when they try to start. If you are lucky, you might be able to kill the culprit or quickly add some more swap space. If you are unlucky, you might have to reboot.
I write a single, ten-character word to encode these states in the output logfile. The header indicates the order of the rules using their character codes, and the word contains a character for each rule indicating its state. I put the "bad" states in upper case. The system has no problems in the example here:
% more /var/adm/weblogs/percol-1996-02-05 timestamp locltime DNnsrkcmdi ... 823562592 15:23:12 wgwwwgwggw ... 823562892 15:28:12 wgwwwgwggw ... 823563193 15:33:13 wgwwwgwggw ...
System utilization levels
We also need to keep track of spare capacity in the system. The basic measures I find useful are the user and system CPU times, the disk utilization, and network load.
The CPU measures are easy to get. I don't bother with waiting for I/O. It's too misleading on multiprocessor systems, as I have said in a September column; and, idle time is redundant since we know that usr + sys + idle = 100. Rule status will indicate amber or red if there is too much load on the CPU and response times are affected.
Condensing a variety of disk configurations into a few measures is tricky. I've settled on reporting the utilization of the busiest disk in the system, as it is likely to be some kind of bottleneck. I also report the utilization averaged across all the disks. You have a well-balanced disk configuration if the two numbers are similar. You are unbalanced if the peak is much higher than the average. You can rearrange the load by reallocating data or striping. I label these two columns peak disk busy (pdb) and mean disk busy (mdb). The disk status rule looks at each disk individually and reports amber or red if it sees slow service times. This is all the detail we need.
There is no direct way to measure a busy network. However, I tally the total number of input and output packets per second along with an average collision rate across all network interfaces. The network status rule checks each interface individually and reports amber or red if it sees a high packet count and collision rate.
Memory utilization is not easily measured, so I rely on the status to tell me more RAM is needed.
% more /var/adm/weblogs/percol-1996-02-05 timestamp locltime DNnsrkcmdi usr sys pdb mdb Ipkt/s Opkt/s Coll% ... 823562592 15:23:12 wgwwwgwggw 8 6 3 2 82.0 100.0 1.0 ... 823562892 15:28:12 wgwwwgwggw 3 3 2 2 89.0 100.0 2.0 ... 823563193 15:33:13 wgwwwgwggw 11 4 3 3 102.0 113.0 2.7 ...
TCP network measures
These statistics are reported as part of the netstat -s command output. They are available directly in SE as the mib$tcp class. Be warned, however. We recently found that some of the values are misleading or incorrect and are working to fix them. In particular there is some over counting in tcpRetransSegs. The netstat-s command works by extracting data from the TCP streams subsystem. It locks the state of TCP to obtain a consistent snapshot of all the measurements. This doesn't prevent data transmission, but it does prevent TCP from opening or closing connections. On most systems this is so quick there is no problem.
If you have a highly loaded Web server (100 http operations per
second or more), you will notice netstat -s may take several
seconds to run, and it stalls the server for that time period. This is
undesirable! It is possible, when running as root, to read the
measures of interest directly from the kernel, via
kvm$tcp_status in SE. A few measures are not fully updated in
this data structure, but it is good enough for my purposes and creates
little overhead. Either way this is the basic data returned. The
only side effect you may notice is a warning about
/dev/kvm access that is printed out if you do not run the
command as root. You can ignore this warning.
This example output is not taken on a Web server, the data wraps round quickly and the display gets ragged with large values. In my script I'm careful to handle wrap rounds of the 32-bit counters.
TCP tcpRtoAlgorithm = 4 tcpRtoMin = 200 tcpRtoMax = 60000 tcpMaxConn = -1 tcpActiveOpens = 9 tcpPassiveOpens = 9 tcpAttemptFails = 0 tcpEstabResets = 0 tcpCurrEstab = 2 tcpOutSegs = 1124 tcpOutDataSegs = 1052 tcpOutDataBytes =109928 tcpRetransSegs = 9 tcpRetransBytes = 0 tcpOutAck = 72 tcpOutAckDelayed = 36 tcpOutUrg = 0 tcpOutWinUpdate = 0 tcpOutWinProbe = 0 tcpOutControl = 34 tcpOutRsts = 0 tcpOutFastRetrans = 0 tcpInSegs = 1176 tcpInAckSegs = 1066 tcpInAckilobytes =109944 tcpInDupAck = 22 tcpInAckUnsent = 0 tcpInInorderSegs = 1052 tcpInInorderBytes =109928 tcpInUnorderSegs = 0 tcpInUnorderBytes = 0 tcpInDupSegs = 0 tcpInDupBytes = 0 tcpInPartDupSegs = 0 tcpInPartDupBytes = 0 tcpInPastWinSegs = 0 tcpInPastWinBytes = 0 tcpInWinProbe = 0 tcpInWinUpdate = 0 tcpInClosed = 0 tcpRttNoUpdate = 0 tcpRttUpdate = 132 tcpTimRetrans = 0 tcpTimRetransDrop = 0 tcpTimKeepalive = 6 tcpTimKeepaliveProbe= 0 tcpTimKeepaliveDrop = 0
The data I record is the number of bytes input and output per second and the retransmit percentage. These values are totals, so the difference over the measurement interval is used.
These values are 32-bit quantities. One reason for picking a 300-second measurement interval is that even on a 100-megabit interface running at full capacity these measures will not wrap round all the way in 300 seconds.
The table is getting wide so I've omitted the utilization section from the example below.
% more /var/adm/weblogs/percol-1996-02-04 timestamp locltime DNnsrkcmdi ... tcpIn/s tcpOut/s Retran% ... 823562592 15:23:12 wgwwwgwggw ... 1658 42308 16.28 ... 823562892 15:28:12 wgwwwgwggw ... 1714 47826 14.47 ... 823563193 15:33:13 wgwwwgwggw ... 2010 54589 15.00 ...
If you are running TCP over a low latency network, you are unlikely to see any problems. You may notice high retransmit rates over a wide area network or the Internet. In older releases, a TCP bug caused some problems. This was fixed in Solaris 2.3 patch 101318-74, Solaris 2.4 patch 101945-34 and Solaris 2.5FCS.
Another problem results from the very short nature of many http connections. TCP maintains an estimate of the round trip time that is used to decide when to retransmit a segment. For the first few packets of each connection, we tend to underestimate the time and are more likely to restransmit them. With many short http transfers, this raises the retransmit rate. It doesn't really affect large http or ftp transfers.
SunSoft is working on improvements to TCP that will make the initial
estimate better and make it converge more quickly (bugid 1233827).
Until the fix is available you should manually increase the minimum
retransmit time. If set too high, TCP will wait too long to retransmit
packets that really have been lost. This should be limited to one or
two seconds. If you see more than 15 percent retransmissions reported,
run the following command and add it to
/usr/sbin/ndd -set /dev/tcp tcp_rexmit_interval_min 1500
Tracking the access log
The Web server software normally writes a log that tracks who has accessed which URL, when it was accessed, and how much data was transferred. This is often used for accounting and market research purposes. Each line in the file is a separate http operation. The access log file gets large quickly so it is normally archived daily, moving the old file and starting a new file.
There is a fairly standard format for this file. Both NCSA and Netscape Communications Server 1.1 use the same format, and I parse the data for them. Some other servers (e.g., Proxy servers, OpenMarket) store extra data, and my script may need to be modified to pick out the right fields. Netscape provides a command called analyze, which processes the entire log file. This puts a large load on the system for a few minutes and is useful for summarizing what happened yesterday. I'm interested in what happened in the last few minutes, so I have developed a way to track the log file as it grows.
The way I track the access log is fairly sophisticated and robust.
The data extracted from the access log is accumulated into the following measurements averaged over the five minute measurement interval:
% more /var/adm/weblogs/percol-1996-02-04 httpop/s http/p5s get/s post/s cgi/s httpb/s to4KB/s to16KB/s to64KB/s to256KB/s ov256KB/s 192.9. 5.21 10.00 5.15 0.05 0.34 43945 3.40 1.27 0.30 0.24 0.00 0.14 5.99 11.20 5.96 0.02 0.33 57823 4.15 1.27 0.26 0.29 0.01 0.07 7.02 12.20 6.98 0.03 0.41 60258 4.69 1.72 0.34 0.25 0.01 0.18
That completes the data in my logfile. Note that the data collector is a script written in an interpreted dialect of C percollator.se. It's simple to modify it yourself and collect more or different data.
Running the performance collator script
Some of you may be worried about the overhead of running this script. I have good news and bad. The good news is the script uses a tiny amount of CPU power -- less than a second of CPU time per hour is typical. The bad news is that it uses about 2 megabytes of RAM -- a mixture of interpreter overhead and multiple copies of large kernel statistics data structures. Since the code is so close to C, some of this overhead could be avoided by rewriting the collector in compiled C. Personally, I like to edit and run, and I like the fact that the same script can be used on any version of Solaris regardless of release or CPU.
For example, below is an accounting summary for www.sun.com. This is a summary of commands that completed that day, and I restarted percollator.se after it had been running for about ten days so it shows about 0.5 seconds per hour of run time.
TOTAL COMMAND SUMMARY COMMAND NUMBER TOTAL TOTAL TOTAL MEAN MEAN HOG CHARS BLOCKS NAME CMDS KCOREMIN CPU-MIN REAL-MIN SIZE-K CPU-MIN FACTOR TRNSFD READ TOTALS 91217 258640.46 144.81 134970.40 1786.12 0.00 0.00 10227979008 156839 architex 1645 131656.29 42.95 82.85 3065.15 0.03 0.52 2201467840 33746 ns-httpd 1363 44174.87 39.17 115275.62 1127.74 0.03 0.00 3620051712 22417 search.c 1758 17548.83 8.65 99.29 2028.26 0.00 0.09 158836544 69 percolla 1 11507.03 4.89 14281.39 2351.73 4.89 0.00 421527552 610I have a startup shell script called percol that I put in
/etc/rc3.d/S99percolto start the script automatically on reboot. The startup shell script must be edited to set up environment variables that are used to set rule thresholds and the location of the access logfile. The output directory defaults to
/var/adm/weblogs, but it can be set to a directory that is accessible over the Web. You can look at the current numbers for www.sun.com. (The Sun webmaster published a static specifications page for www.sun.com's UltraServer 1 Model 140.)
As you can see, this computer (which is capable of more than 200 http operations per second) is lightly loaded most of the time. As I write this in late February, it tends to run at about 5 to 10 percent of capacity. The occasional peaks in CPU and disk activity are caused most often by the Architext search engine.
Brewing some Java
There is a reason that I called my script the "performance collator." Coffee puns are mandatory with Java, so I have built a percollator to help me brew some Java! The definition of collate also seems approriate.
Collate assemble and arrange systematically. Oxford English Dictionary.
Browsing the data with Java applets
For those of you who don't have a Java-enabled browser, I've taken a fixed snapshot of a graph.
If you use Netscape 2.0 you can view the data interactively and graph it with the applet I wrote or a much more sophisticated one produced by Rinaldo DiGiorgio and described in his March JavaWorld column, which we expect to make available March 15.
The data is there. The concepts are simple. The combination of SE and Java makes this easy and extensible. If you play with it and create something new and interesting, send me your URL and I'll publish a link to it. The Java source code is linked to the bottom of my graph page. It is my first program in Java, so be kind! This is definitely in the fix-it-yourself category if you have any problems with it.
Did I answer the question?
My correspondent this month complained of a sluggish Web server. By examining a server's state codes, you can see if the problem is system overload. For the vast majority of Internet servers, the problem may be lack of bandwith on the Internet connection because of competition with Usenet News and e-mail traffic. You should record the number of bytes of throughput you are getting from your Web server. Remember to add about 15 percent to this to allow for the non-data protocol overhead and compare it with your expected network bandwidth. (That's how my correspondent and I narrowed the problem to his server's overloaded T1 Internet connection.)
I also recommend you turn on system accounting and keep daily and monthly accounting summaries. For www.sun.com, this made it clear our Architext search engine is the primary load. I can easily see the average CPU time, RAM, and blocks of disk I/O per search from the accounting summaries. Then I can predict how many searches per second the system can handle.
Some of the busiest sites on the Internet run on Suns. One site recently ran its full load on a single UltraServer 1 and clocked about 5 million accesses in one day. Another site, Infoseek, uses several Suns, and according to its home page, processes more than 5 million search retrievals a day. Third-generation Web server software is imminent: There are protocol improvements, ever faster CPUs, and some scope for further tuning in Solaris. I'm sure that there is plenty of headroom in the servers. Two limitations may be the 45-megabit T3 lines used by Sun and other large sites, and the 10-megabit infrastructure used in many intranets. The next step up could be 155-megabit and 622-megabit ATM lines. I'm not sure when ATM will become common, but the Ultra machines were designed with ATM speeds in mind.
About the author
Adrian Cockcroft joined Sun in 1988, and currently works as a performance specialist for the Server Division of SMCC. He is the author of Sun Performance and Tuning: SPARC and Solaris, published by SunSoft Press PTR Prentice Hall. Reach Adrian at firstname.lastname@example.org.
If you have technical problems with this magazine, contact email@example.com