Benchmarking the Web
Are accurate and unbiased Web server benchmarks possible?
Being armed with benchmarking numbers doesn't count for much if you don't know what they mean. We look at what's involved in benchmarking Web servers, what the results can tell us, and what to be wary of in relying on the four most popular benchmarks. We also include two sidebars with extensive links to additional benchmarking information. One sidebar is an enlightening interview with Dan Connolly of the World Wide Web Consortium. (5,000 words including two sidebars)
Tweaking performance usually requires locating the bottlenecks, and then reducing their effects as much as possible. The interesting thing about Web performance, however, is that users supply so many of these bottlenecks.
Because so many of them use modems to attach to the Internet, for instance, the nature of their attachment (28.8 Kbps at best, for most of them) completely dominates their perception of performance.
Frankly, though, neither Webmasters nor users really care about what factors govern Web performance. They just want to be sure that they're obtaining the fastest possible performance from their own systems and software. In that spirit, we'll investigate the significance of Web benchmarks and try to describe what such data can -- and can't -- tell its readers.
What do Web benchmarks really mean?
There are three primary factors that determine Web performance, each with its own underlying contributing factors: the circumstances and configurations of the end users, the latency of the networks that carry their traffic, and the capabilities of the Web server itself.
Because Webmasters can control neither end-user circumstances and configurations, nor the latency of the networks to which their servers attach, all Web benchmarks available today are really Web server benchmarks.
In other words, all the data you see about Web performance is really only a measure of how good a job a Web server can do in responding to some mix of user requests.
The most typical Web server characteristics that benchmarks measure include how many simultaneous users a server can handle, how much raw data it can deliver per second, how many HTTP operations it can field per second, and how quickly it responds to individual requests.
Most Web server benchmarks are measured over local area networks, where high-speed connections are the norm, and the typical delays associated with slower modem-based connections are completely irrelevant -- as are the delays that can sometimes be imposed by the networking infrastructure.
Thus, what a Web benchmark basically tells its readers is "how fast is the Web server?" and "how many users can it handle at the same time?" While these statistics can be measured (and presented) in various ways, they basically boil down to some measure of raw throughput, and some rating of handling capacity. For heavily loaded servers, these are vital statistics worth knowing about.
But what a Web benchmark can't tell its readers is as interesting as the information it can convey. Benchmarks don't deal with issues of perceived user performance (or what users with slow connections observe at their end of the browser-server link), with errors or irregular behaviors (or what kinds of problems or hiccups users might encounter), or with a rating of how gracefully a Web server degrades under increasing loads. Nor do current benchmarks address well-known Web server performance sinks like the impact of using CGI programs or other Web extensions to provide enhanced interactivity on a Web site. All of these things are important to users, and should therefore be important to Webmasters as well. At the moment, however, no formal tools exist to measure these potential performance trouble spots.
Ultimately, you must understand that no Web benchmark can tell you everything you need to know about a particular Web server or the platform on which it runs. Benchmarks do provide the most objective way to compare different Web server hardware and software, provided you make sure you're not changing more than one variable at a time. (For example, to compare the performance of two Web server software packages, be sure that both benchmarks were run on the same hardware configurations; likewise, to compare hardware platforms, be sure the benchmarks used the same software).
What's involved in benchmarking a Web server?
Fundamentally, benchmarking depends on two things: a realistic model of a typical workload, and software that can impose that workload on the system under test while measuring the results.
When it comes to benchmarking a Web server, the foundation of the software that will exercise its capabilities has to come from some kind of "typical workload." It's entirely reasonable to question the validity of anybody's definition of typical, especially since what's typical for one Web server may be completely atypical for another, depending on the levels of usage, the number of Web sites housed on a server, and the appeal of the content to its audience. That's why discussions of benchmarks typically detail how the imposed workload was characterized, the kinds and numbers of sites used to characterize a workload, and the types of analysis performed on the raw data used to create the workload.
Lest you be tempted to dismiss all benchmarks as irrelevant, you can take comfort in the nature of statistics. The Law of Large Numbers states that a sufficiently large sample will be adequately able to model typical behaviors for the population from which it is drawn. In English, this means that the more the workloads used to create a benchmark resemble your own server's workload, the better that model applies to you. Nevertheless, a certain amount of caution when evaluating benchmarks -- especially those built by vendors to compare their servers to those from other vendors -- is always warranted.
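The statistical point is easy to demonstrate. The sketch below uses a hypothetical skewed service-time distribution (not real server data) to show a sample mean settling toward the true mean as the sample grows -- the same reason a benchmark built from many server logs models typical behavior better than one built from a few:

```python
import random

def mean_response_time(n_samples, seed=0):
    """Simulate n_samples request service times (in ms) drawn from a
    skewed distribution and return their sample mean.  The distribution
    here is purely illustrative; its true mean is 40 ms."""
    rng = random.Random(seed)
    return sum(rng.expovariate(1 / 40.0) for _ in range(n_samples)) / n_samples

# Small samples wander; large samples converge on the true mean of 40 ms.
small_sample = mean_response_time(10)
large_sample = mean_response_time(100_000)
```

With 100,000 samples, the mean lands within a fraction of a millisecond of 40; with only 10, it can miss by a wide margin -- which is why the size and representativeness of a benchmark's underlying workload data matter so much.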
What's in a Web server benchmark?
Since the HTTP protocol upon which the Web rests is inherently a transaction-oriented protocol, it's a worthwhile subject for benchmarking. In HTTP, clients make short, simple requests that can take only a limited number of forms. Servers provide a broad range of replies that can vary from short error or information messages to arbitrarily long collections of text, graphics, and other information.
Thus, Web server benchmarks tend to concentrate on three or four measurements, all produced by subjecting the server to its modeled workload and varying one or more of the variables being measured. These usually include at least two or three of the following: the number of simultaneous connections the server can sustain, the raw data throughput it delivers, the number of HTTP operations it completes per second, and the latency of its responses to individual requests.
For the first three measurements, more is clearly better, since it indicates a server that can handle more users, process more data, or field more HTTP requests over time. For the final measurement, less is better, because it indicates a server that responds more quickly to a request for service.
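A toy driver makes these measurements concrete. Everything below is a sketch, not any of the real benchmarks discussed in this article: the handler stands in for one server round trip, and the three figures computed correspond to HTTP operations per second, raw throughput, and average latency:

```python
import time

def run_benchmark(handler, n_requests):
    """Drive `handler` (a stand-in for one HTTP GET round trip)
    n_requests times and compute the basic benchmark figures."""
    latencies, bytes_moved = [], 0
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        body = handler()                        # one simulated request/response
        latencies.append(time.perf_counter() - t0)
        bytes_moved += len(body)
    elapsed = time.perf_counter() - start
    return {
        "ops_per_sec": n_requests / elapsed,                  # more is better
        "throughput_bytes_per_sec": bytes_moved / elapsed,    # more is better
        "avg_latency_sec": sum(latencies) / len(latencies),   # less is better
    }

# Hypothetical handler standing in for a server returning a 1 KB page.
stats = run_benchmark(lambda: b"x" * 1024, 1000)
```

Real benchmarks add concurrency, realistic request mixes, and careful timing methodology, but the quantities they report reduce to the same three families of numbers.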
What benchmarks are available today?
There are numerous Web server benchmarks available right now on the Web (where else would they be?). We found the following handful -- NetPerf (from HP), WebStone (from SGI), SPECweb96 (from SPEC), and GStone (from the Web66 project) -- to be of the greatest interest and potential utility to practicing Webmasters and other Web aficionados (See the sidebar, But wait, there's more, for further benchmarking information and online resources).
What's problematic with these benchmarks?
None of these are perfect, and none offer the transaction tracking and measurement provided by more comprehensive database benchmarks. Both the HP and SGI benchmarks (NetPerf and WebStone) are suspect because they're vendor-designed and maintained. Only the SPECweb96 benchmark appears to be based on a mathematically sophisticated, complete attempt to model document requests and sizes in recreating server document delivery behavior from a "real" Web environment.
Although three of the four benchmarks make their source code available (free for NetPerf and WebStone, for a fee from SPEC for SPECweb96), GStone is like a black box in that the Web66 team does not make its code available. But SPEC is the only organization that imposes rigid rules for how (and why) code can be altered in the porting process. It also insists on a rigorous review of any ported versions before permitting licensees to publish results based on any of its benchmarks.
Finally, only SPECweb96 represents a concerted effort from a sizable group of interested vendors to construct an objective and completely replicable benchmark (See the sidebar, But wait, there's more, for a list of the major players involved in this project). Although HP and SGI are forthcoming with source code and results, they exercise no control over their benchmarks' use, nor do they require the kind of reporting on alterations and code replacements that any porting effort will typically require.
Which benchmark rules the roost?
Today, the NetPerf and GStone benchmarks enjoy the broadest use and support the largest pools of published results. But we don't think this situation will continue for too much longer.
Because of SPEC's attention to detail, its rigorous modeling approach, and its cross-industry recruitment of designers, programmers, and testers; and because SPEC's sole focus and expertise lies in benchmarking, we think SPECweb96 stands out as the most reasonable benchmark in this group. This is true, not only because of the analysis of actual traffic and vendor experience that went into modeling its transaction mix, but also because the underlying client technology used to create the server load is based on research and implementation efforts that have been underway at SPEC since the early 1990s.
This technology (known as SPEC SFS or LADDIS) has been tested and proven in other client/server applications, and it offers a strong foundation for the SPECweb96 benchmark.
Likewise, SPECweb96 is likely to be the most accurate benchmark of all, because it uses the most realistic workloads and most rigorously constructed transaction mixes of any of the benchmarks we've discussed. This is partly due to the broad involvement from the contributing vendors, but also due to SPEC's detailed statistical analyses of thousands of actual Web server logs collected under a broad range of conditions and usage scenarios.
SPECweb96 was only introduced in July 1996, so results that derive from its use are still scarce. As of late August, only DEC and HP had published SPECweb96 results (not surprisingly, the blazingly fast 300-MHz Alpha dual-processor system looks very good). We learned that most of the other vendors involved in in-house Web server benchmarking are deep in the throes of testing right now, and virtually all intend to publish results some time in September or October. Apparently, there are still some kinks being worked out in the test itself. As soon as these last-minute elements are resolved, vendors will begin to publish results, and the code for the test will be made public, according to Walter Bays, the SPEC representative from Sun Microsystems.
"As time goes on, other vendors will join the race reporting on existing products. You'll also probably see SPECweb96 quoted in new product announcements -- the more SPECweb96 is quoted, the more customers will ask for it, and the usage will snowball," Bays said.
By September, it will also be possible to order a CD containing the SPECweb96 benchmark, with versions available both in C and Perl (cost for not-for-profit organizations: $400; all others: $800. The benchmark is available by request on a case-by-case basis at no charge for creating test results for publication). Visit the SPEC Web site at http://www.specbench.org for details on membership, dues, and participation in SPEC benchmark design and testing.
According to Kaivalya Dixit, president of SPEC, what makes SPECweb96 special is easy to explain:
"Unlike other Web server benchmarks, SPECweb96 is analyzed, debated and agreed by leading vendors, system integrators, and universities. It is a standardized test -- which means that published run/reporting rules, repeatability of results, and full disclosure are necessary for System Level Benchmarks. Most importantly, [SPECweb96's] strengths and weaknesses have been openly discussed and reported. The salient difference of this benchmark has been in the development, where SPEC leveraged skills from many disciplines (server vendors, software vendors and research groups) to make it as fair as humanly possible."
Let's examine a set of SPECweb96 results, and explain what they indicate, and how they might be interpreted:
(reproduced from http://www.specbench.org/osg/web96/results/)
Sponsor   System Name               SPECweb96   Full Disclosures
-------   ------------------------  ---------   --------------------
Digital   Alphaserver 1000A 4/266      252      Text Html PostScript
Digital   Alphaserver 2100 5/300       565      Text Html PostScript
Digital   Alphaserver 2100 5/300       809      Text Html PostScript
Hewlett   HP 9000 Model D210           216      Text Html PostScript
Hewlett   HP 9000 Model K400           500      Text Html PostScript

First, it's important to understand the column headings: Sponsor identifies the vendor that submitted the result, System Name the hardware configuration tested, SPECweb96 the benchmark score itself, and Full Disclosures the formats (text, HTML, and PostScript) in which the complete test report is available.
The bulk of this information is pretty straightforward, so let's jump straight into the SPECweb96 metric. To begin with, bigger is better (thus, in the list above, the DEC Alphaserver 2100 5/300 outperforms all of the other systems by a pretty wide margin). But what does this number really measure? By measuring the response time for each server GET request and the overall throughput over time, the benchmark calculates a score that increases as latency decreases (shorter response times score higher) and as throughput increases (more server output means a higher score). The most important thing about these numbers is that they're designed to be compared: That is, because the results reported for the DEC Alphaserver 1000A (252) are about 17 percent higher than those for the HP 9000 Model D210 (216), the DEC platform can be said to be about 17 percent faster than the HP platform.
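The comparison arithmetic is just a ratio of scores. A quick sketch, using the 252 and 216 from the table above (252/216 works out to about 16.7 percent):

```python
def percent_faster(score_a, score_b):
    """SPECweb96 scores are throughput-style figures, so the relative
    performance of two systems is simply the ratio of their scores,
    expressed here as a percentage difference."""
    return (score_a / score_b - 1) * 100

# Alphaserver 1000A (252) versus HP 9000 Model D210 (216)
difference = percent_faster(252, 216)   # about 16.7 percent
```

The same calculation works for any pair of scores in the table, which is exactly what "designed to be compared" means.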
But while you're pondering the numbers, recall that the existence of "Full Disclosures" for these benchmarks is a very good thing. In fact, most of the intellectual and programmatic rigor that surrounds SPECweb96 (and all SPEC benchmarks) is because they're designed to be easily ported to multiple platforms. "Full disclosure" means what it says -- namely, that any and all changes to the SPECweb96 source code must be described completely, and any potential impacts on results must be reported in profuse detail.
This means two very important things:
Although it's been well-observed that vendors do indeed tweak systems and software to do better on SPEC benchmarks, these benchmarks are designed so that tweaking will usually translate into genuinely better server performance as well.
There are many excellent sources for further benchmarking information. We've included URLs for general overviews as well as reference sites for specific benchmarks mentioned in this story.
A plethora of WebStone-related information can easily be produced by jumping into SGI's home page at http://www.sgi.com and using its search engine with the string "WebStone."
The original white paper describing version 1.0, which includes a great deal of technical design detail, is available at: http://www.sgi.com/Products/WebFORCE/WebStone/
WebStone is an open benchmark: its source code can be downloaded, examined, and used by anyone who wishes to employ this performance measurement tool. Although it's currently built for Unix environments, the SGI site exhorts readers to "feel free to port this code to other platforms."
WebStone measures the raw throughput of a standard mix of HTTP messages, built to model a normal server workload. Two main measures of performance, latency and number of connections per second, are employed. It's also possible to use WebStone to collect other data, including throughput measured in bits per second. WebStone is driven by a set of client configuration files. These determine the server a client will request data from, how long the connection is maintained, and the URLs to be requested.
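WebStone's actual configuration-file syntax isn't reproduced in this article, so the snippet below merely illustrates, in Python form, the kind of parameters those client files carry -- target server, connection lifetime, and the URL list. Every name and value here is hypothetical:

```python
# Hypothetical stand-in for a WebStone-style client configuration --
# not WebStone's real file syntax, just the parameters described above.
client_config = {
    "server": "www.example.com",   # which server the client requests data from
    "port": 80,
    "duration_sec": 600,           # how long the connection/run is maintained
    "urls": [                      # the URLs the simulated browsers request
        "/index.html",
        "/images/logo.gif",
        "/docs/whitepaper.html",
    ],
}

def describe(cfg):
    """Summarize a client configuration in one line."""
    return (f"{len(cfg['urls'])} URLs against "
            f"{cfg['server']}:{cfg['port']} for {cfg['duration_sec']}s")
```

A benchmark run then consists of many such clients driving the server at once while latency and connection counts are recorded.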
A SPEC SFS working group has adapted SPEC's SFS to Web server benchmarking. Participants in this effort included the following companies: CommerceNet, OpenMarket, Digital Equipment Corp., Netscape, Fujitsu (HAL Computer Systems), Siemens Nixdorf, Hewlett-Packard, Silicon Graphics, Inc., IBM, Spyglass, Intel, Sun Microsystems, Inc.
For its first release, made available July 22, 1996, this group focused the workload on HTTP GETs. They plan to test POSTs and other aspects of Web serving in an as-yet-unannounced second release.
The workload for SPECweb is built on a model of a "typical" Web provider. This provider offers individuals space where each can place Web pages to be accessed. Thus, each individual has a separate Web space on the server, with multiple pages within each space. Each benchmark client simulates a number of Web browsers by requesting random pages throughout the Web server's entire page space. We expect this benchmark to become an industry standard for Web server testing, especially when future versions are released.
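The page-space model is simple to sketch. The layout below (user-directory naming, page names, and counts) is purely illustrative rather than SPECweb96's real file set, but it shows the essential behavior: each simulated browser request lands on a random page anywhere in the server's whole page space:

```python
import random

def pick_request(n_users, pages_per_user, rng):
    """Choose a random page across the server's entire page space, the
    way the benchmark's simulated browsers do: every individual has a
    separate Web space on the server holding multiple pages."""
    user = rng.randrange(n_users)
    page = rng.randrange(pages_per_user)
    return f"/user{user}/page{page}.html"   # hypothetical path layout

rng = random.Random(1)
requests = [pick_request(100, 10, rng) for _ in range(5)]
```

Spreading requests uniformly over the whole page space keeps any single file from sitting permanently in the server's cache, which makes the measured workload harder -- and more honest -- than repeatedly fetching one page.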
The GStone effort attempts to define a Web server benchmark methodology and test suite. While the group is more than willing to share a detailed discussion of the test suite, its approach, and its results (available at http://web66.coled.umn.edu/GStone/), it does not offer the GStone test suite itself for download.
GStone measures the time it takes to start a connection (how long it takes for the first element of data to arrive at the client) and the rate at which data is transmitted (how quickly the data moves from the server to the client).
GStone results are expressed as Gv[L,r], where the subscript identifies the version of the test, L is the time required to open the connection in milliseconds, and r is the number of milliseconds required to transfer each kilobyte of data. In essence, a GStone is a unit of time. Thus, lower GStone values signify faster performance, while higher values signify slower performance.
For example, Gp[1704, 99] means that the average connection took 1.704 seconds to open, and required 99 milliseconds to transfer each kilobyte of data. At the time we visited their site, GStone measurements were available for the following Web servers:
We expect the table of results to expand with time, when the group has more than preliminary results to share. This URL might be worth visiting regularly for a while, to see how its efforts are progressing.
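The arithmetic behind a GStone pair such as the Gp[1704, 99] example above is easy to sketch -- total fetch time is the connection latency plus the per-kilobyte transfer cost times the page size (the 10 KB page below is an illustrative figure, not a GStone result):

```python
def total_transfer_time(latency_ms, rate_ms_per_kb, size_kb):
    """Given a GStone-style pair [L, r] -- connection latency in
    milliseconds and transfer cost in milliseconds per kilobyte --
    estimate the wall-clock time in seconds to fetch a page."""
    return (latency_ms + rate_ms_per_kb * size_kb) / 1000.0

# Gp[1704, 99]: 1.704 s to open the connection, 99 ms per KB transferred.
# Fetching a hypothetical 10 KB page:
fetch_time = total_transfer_time(1704, 99, 10)   # 1.704 + 0.990 = 2.694 s
```

Note how, for small pages, the connection-open latency dominates the total -- one reason connection setup costs matter so much on the Web.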
Dan Connolly, a Web luminary, conversed with author Ed Tittel via e-mail. Connolly was one of the chief architects of HTML 2.0 and is currently employed as a Research Technical Staff member at the World Wide Web Consortium in the Laboratory for Computer Science at MIT in Cambridge, MA. Here are the questions Tittel asked along with Connolly's replies.
Q: Are you convinced that Web benchmarking has value?
Yes. :-) See: http://www.w3.org/pub/WWW/Test/
The Web is a system with a large human component. According to HCI (Human-Computer Interaction) studies, the difference between .1 seconds and .2 seconds to follow a link can have radical impact on the productivity of a Web user -- it has to do with attention span and other things about the way humans are built. Another major threshold is one second. Beyond that, you're really wasting the user's time.
I really should know the seminal papers on this sort of thing, but I don't. A little surfing yields this random paper that I haven't read:
Green, A. J. K. (1994) "An Exploration of the Relationships between Task Representations, Knowledge Structures and Performance." Interacting with Computers, 6, 61-85 http://www.mrc-apu.cam.ac.uk/amodeus/abstracts/jp/jp05.html
The guy who knows about HCI is Keith Instone.
Q: What is your opinion of "counting hits?"
It's evil, in that it encourages folks who don't know any better to do things that are at odds with the health of the Net. The state-of-the-art hacks for gathering demographics are really bad, and they interfere with scalability techniques like caching and replication.
In the abstract, "counting hits" and other techniques for gathering demographics are reasonable and necessary things as the Web matures as a medium. But as I said, the state of the art techniques are globally destructive.
See: http://www.w3.org/pub/WWW/Demographics/Activity and http://www.w3.org/pub/WWW/Demographics/
Q: How can WebMasters and ISPs plan to deliver reasonable service from their sites?
I don't know the state-of-the-art here. Keep in mind that WebMasters are not generally W3C's customers, but rather customers of W3C customers.
If I were a Webmaster, I'd read the FAQ: "How fast does my net connection need to be?" and hang out on comp.infosystems.www.servers.* to find out the conventional wisdom. Check out stuff like http://www.webcompare.com, including http://www.webcompare.com/bench.html
Q: What's your perspective on causes for Web delays and other bottlenecks? What roles do backbone traffic, server load, and protocol design play?
I hear often, and from reliable sources, that backbone traffic is not a problem. The telcos have all the bandwidth folks are willing to pay for...
Server load is the bottleneck in most cases these days, I'd say either the Net connection to the server is saturated or the server machine is overloaded. Traditional Web server implementations had many serious performance bugs. Get a modern server. And of course, be aware that CGI and server-side includes can be costly.
There are also problems in the protocols. This is where W3C comes in. Some fixes are long overdue. In the near term, HTTP 1.1 with persistent connections and clearer rules about caching should help somewhat. Some improvements will roll out more gradually, but we expect they will have a more dramatic effect.
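Connolly's point about persistent connections is easy to see with a back-of-the-envelope model. The numbers below are purely illustrative, not measurements: a hypothetical page with 10 inline objects, 200 ms of TCP connection setup, and 80 ms to transfer each object:

```python
def page_fetch_time(n_objects, setup_ms, transfer_ms, persistent):
    """Rough model of fetching a page with inline objects.  Without
    persistent connections (old-style HTTP), every object pays the TCP
    setup cost again; with HTTP 1.1 keep-alive, setup is paid once."""
    setups = 1 if persistent else n_objects
    return setups * setup_ms + n_objects * transfer_ms

# Hypothetical numbers: 10 inline images, 200 ms setup, 80 ms per object.
without_keepalive = page_fetch_time(10, 200, 80, persistent=False)  # 2800 ms
with_keepalive = page_fetch_time(10, 200, 80, persistent=True)      # 1000 ms
```

Even this crude model shows the page loading nearly three times faster once connection setup stops being paid per object, which is why persistent connections "should help somewhat" even before deeper protocol improvements arrive.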
Q: Where do you see Web behavior and performance headed?
There is a LOT of room for improvement. Engelbart's stuff back in the '60s demonstrated sub-second link traversal time. If you haven't seen the videos of this stuff, do it.
Bootstrap Institute - Publications, Articles, Videos http://www.bootstrap.org/library.html
"The Augmented Knowledge Workshop: Engelbart on the history of NLS." Includes design rationale, plus historic photos and movie footage. (82 min.) - 1986
So the first thing to do is to catch up to the state-of-the-art back in the '60s!
In the long term, the telcos and cable companies are poised to compete to address the bandwidth and infrastructure issues. I gather that there is an avalanche of business models, ventures, and approaches that the communications reform bill will set in motion.
But it will be several years -- in Web-years, several generations -- before the sunset of POTS as the infrastructure for the consumer portion of the Internet.
In that time, you'll see a lot of innovation in Web software to optimize for particular patterns of use. For example, there are products that periodically download whole web sites for off-line viewing that simulate much of the newspaper-style pattern of use.
Q: What are your thoughts on the "gridlock" or "train wreck" scenarios so popular among some pundits?
My favorite analogy is "tragedy of the commons." [Ed Tittel's note: This refers to the overgrazing of public lands available for general use in 18th century England, where this had a harsh effect on small independent farmers who relied on the commons for an adjunct to expensive, private pasturage. The tragedy was that the more they relied on them, the more certain their ruination became. Here, Connolly seems to imply that the clogging of the public Internet could be a function of companies and users who are grabbing as much bandwidth as they can, to preempt others' abilities to use that bandwidth themselves. The real tragedy, of course, is that everybody loses.]
It seems to me that the most damage is done by folks who don't know any better. There are a few malicious cretins out there, but they disappear in the noise, really. It's important that the designers of this thing realize that people will do the easiest thing that works, by and large. So we have to make it so that it's hard to waste a lot of bandwidth. Trouble is, the designers also tend to do the easiest thing that works, which often as not ends up encouraging end-users to do antisocial things.
If we can keep a little discipline among the development community, I'll stay optimistic.
About the author
Ed Tittel is a principal at LANWrights, Inc., an Austin, TX-based consultancy. He is the author of numerous magazine articles and over 25 computer-related books, most notably "HTML for Dummies," 2nd Ed., "The 60-Minute Guide to Java," 2nd Ed., and "Web Programming Secrets" (all titles IDG Books Worldwide, 1996). Check out the LANWrights site at http://www.lanw.com