Click on our Sponsors to help Support SunWorld

Benchmarking the Web

Are accurate and unbiased Web
performance benchmarks possible?

By Ed Tittel

September 1996

Being armed with benchmarking numbers doesn't count for much if you don't know what they mean. We look at what's involved in benchmarking Web servers, what the results can tell us, and what to be wary of in relying on the four most popular benchmarks. We also include two sidebars with extensive links to additional benchmarking information. One sidebar is an enlightening interview with Dan Connolly of the World Wide Web Consortium. (5,000 words including two sidebars)

As Dan Connolly, one of the chief architects of HTML and HTTP at the World Wide Web Consortium has said, "Performance is a black art. The whole Web is a black art at this point -- there's very little science. At a time when tools are so primitive, benchmarking is one of the few available techniques to find performance bugs." (See the sidebar, Connolly speaks out, for more commentary).

Tweaking performance usually requires locating the bottlenecks, and then reducing their effects as much as possible. The interesting thing about Web performance, however, is that users supply so many of these bottlenecks.

Because so many of them use modems to attach to the Internet, for instance, the nature of their attachment (28.8 Kbps at best, for most of them) completely dominates their perception of performance.

Frankly, though, neither Webmasters nor users really care about what factors govern Web performance. They just want to be sure that they're obtaining the fastest possible performance from their own systems and software. In that spirit, we'll investigate the significance of Web benchmarks and try to describe what such data can -- and can't -- tell its readers.

What do Web benchmarks really mean?
There are three primary factors that determine Web performance, each with its own underlying contributing factors:

  1. The end-user whose Web browser issues a request for Web data
  2. The network infrastructure that ferries requests and replies between clients and servers
  3. The Web server where the requested information and programs reside

Because Webmasters can control neither end-user circumstances and configurations, nor the latency of the networks to which their servers attach, all Web benchmarks available today are really Web server benchmarks.

In other words, all the data you see about Web performance is really only a measure of how good a job a Web server can do in responding to some mix of user requests.

The Web server characteristics that benchmarks most typically measure are raw throughput, the number of connections and requests handled over time, and response latency.

Most Web server benchmarks are measured over local area networks, where high-speed connections are the norm, and the typical delays associated with slower modem-based connections are completely irrelevant -- as are the delays that can sometimes be imposed by the networking infrastructure.

Thus, what a Web benchmark basically tells its readers is "how fast is the Web server?" and "how many users can it handle at the same time?" While these statistics can be measured (and presented) in various ways, they basically boil down to some measure of raw throughput, and some rating of handling capacity. For heavily loaded servers, these are vital statistics worth knowing about.

But what a Web benchmark can't tell its readers is as interesting as the information it can convey. Benchmarks don't deal with issues of perceived user performance (what users with slow connections observe at their end of the browser-server link), with errors or irregular behaviors (what kinds of problems or hiccups users might encounter), or with a rating of how gracefully a Web server degrades under increasing loads. Nor do current benchmarks address well-known Web server performance sinks like the impact of using CGI programs or other Web extensions to provide enhanced interactivity on a Web site. All of these things are important to users, and should therefore be important to Webmasters as well. At the moment, however, no formal tools exist to measure these potential sources of performance problems.

Ultimately, you must understand that no Web benchmark can tell you everything you need to know about a particular Web server or the platform on which it runs. Benchmarks do provide the most objective way to compare different Web server hardware and software, provided you make sure you're not changing more than one variable at a time. (For example, to compare the performance of two Web server software packages, be sure that both benchmarks were run on the same hardware configurations; likewise, to compare hardware platforms, be sure the benchmarks used the same software).


What's involved in benchmarking a Web server?
Fundamentally, benchmarking depends on two things:

  1. Modeling a typical set of computer behaviors, usually in the form of one kind of transaction or another.
  2. Applying increasing levels of load to a system to find out where it runs out of gas, or ceases to perform as expected.
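The two steps above can be sketched in miniature. The Python sketch below is purely illustrative -- it is not drawn from any of the benchmarks discussed in this article -- and its `serve_request` stub merely simulates a 1-millisecond HTTP transaction rather than talking to a real server:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def serve_request(payload_bytes=1024):
    """Stand-in for one HTTP transaction; a real harness would issue a GET."""
    time.sleep(0.001)  # pretend the server needs about 1 ms per request
    return payload_bytes

def measure_throughput(concurrency, requests=50):
    """Issue `requests` transactions at the given concurrency level and
    return completed requests per second."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(lambda _: serve_request(), range(requests)))
    return requests / (time.time() - start)

# Ramp the load: the knee in this curve shows where a server "runs out of gas."
for level in (1, 2, 4, 8):
    print(f"concurrency {level:2d}: {measure_throughput(level):7.1f} requests/sec")
```

Real benchmark harnesses work the same way at much larger scale: fix a workload, raise the offered load step by step, and watch for the point where throughput stops improving.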

When it comes to benchmarking a Web server, the foundation of the software that will exercise its capabilities has to come from some kind of "typical workload." It's entirely reasonable to question the validity of anybody's definition of typical, especially since what's typical for one Web server may be completely atypical for another, depending on the levels of usage, the number of Web sites housed on a server, and the appeal of the content to its audience. That's why discussions of benchmarks typically detail how the imposed workload was characterized, the kinds and numbers of sites used to characterize a workload, and the types of analysis performed on the raw data used to create the workload.

Lest you be tempted to dismiss all benchmarks as irrelevant, you can take comfort in the nature of statistics. The Law of Large Numbers states that a sufficiently large sample will be adequately able to model typical behaviors for the population from which it is drawn. In English, this means that the more the workloads used to create a benchmark resemble your own server's workload, the better that model applies to you. Nevertheless, a certain amount of caution when evaluating benchmarks -- especially those built by vendors to compare their servers to those from other vendors -- is always warranted.

What's in a Web server benchmark?
Since the HTTP protocol upon which the Web rests is inherently a transaction-oriented protocol, it's a worthwhile subject for benchmarking. In HTTP, clients make short, simple requests that can take only a limited number of forms. Servers provide a broad range of replies that can vary from short error or information messages to arbitrarily long collections of text, graphics, and other information.

Thus, Web server benchmarks tend to concentrate on three or four measurements, all produced by subjecting the server to its modeled workload and varying one or more of the variables being measured. These variables usually include at least two or three of the following measurements:

  1. The number of simultaneous connections (or users) the server can sustain
  2. The volume of data the server can deliver over time
  3. The number of HTTP requests the server can field over time
  4. The average response time for a request

For the first three measurements, more is clearly better, since it indicates a server that can handle more users, process more data, or field more HTTP requests over time. For the final measurement, less is better, because it indicates a server that responds more quickly to a request for service.
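Given a log of per-request records, all of these statistics fall out of simple arithmetic. The records below are invented for illustration; they are not output from any benchmark named in this article:

```python
# Each record: (start_time_s, end_time_s, bytes_returned) for one HTTP request.
# These numbers are invented for illustration.
records = [
    (0.0, 0.4, 10_000),
    (0.1, 0.3,  2_000),
    (0.2, 0.9, 50_000),
    (1.0, 1.2,  4_000),
]

wall_clock = max(end for _, end, _ in records) - min(s for s, _, _ in records)

requests_per_sec = len(records) / wall_clock
bytes_per_sec = sum(size for _, _, size in records) / wall_clock
mean_latency = sum(end - start for start, end, _ in records) / len(records)

# Peak simultaneous connections: sweep the open/close events in time order.
events = [(s, +1) for s, _, _ in records] + [(e, -1) for _, e, _ in records]
active = peak = 0
for _, delta in sorted(events):
    active += delta
    peak = max(peak, active)

print(f"{requests_per_sec:.2f} req/s, {bytes_per_sec:.0f} B/s, "
      f"{mean_latency * 1000:.0f} ms mean latency, {peak} peak connections")
```

The published benchmarks differ mainly in how they generate the load and summarize these same numbers, not in the numbers themselves.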

What benchmarks are available today?
There are numerous Web server benchmarks available right now on the Web (where else would they be?). We found the following handful to be of the greatest interest -- and potential utility -- to practicing Webmasters and other Web aficionados (See the sidebar, But wait, there's more, for further benchmarking information and online resources).

Implemented and maintained by Hewlett-Packard, NetPerf provides a general indication of network performance for Unix and Windows NT. It measures response latency for generic transactions on an IP network. Because it's not specifically aimed at HTTP, it's not as accurate as the rest of the benchmarks we cover, nor does it address maximum carrying capacity or connections handled over time. But NetPerf is widely available and broadly used to benchmark IP servers of all kinds.

Implemented and maintained by Silicon Graphics, Inc., WebStone is designed from the ground up to measure Web server performance. SGI provides source code for its benchmark for easy porting to other platforms. WebStone measures raw throughput for a standard mix of HTTP requests and replies to model a "normal" server workload. It provides stats on response latency and throughput, including the number of connections and the number of requests and replies handled over time. While SGI touts this benchmark, other vendors have been less enthusiastic about its results.

Implemented and maintained by Web66, a multi-university project in Minneapolis, MN aimed at facilitating the use of Internet technology in K-12 schools, GStone represents that group's efforts to define a Web server benchmark methodology and test suite. It measures typical performance for typical service under a typical load, as determined by statistical analyses across numerous Web sites, public and private. GStone measures response latency and data throughput for a Web server. Unfortunately, the group does not make its source code available for re-use by others, but does perform ongoing analyses and regularly publishes updated results.

Implemented and maintained by the Standard Performance Evaluation Corporation (SPEC), SPECweb96 represents this industry group's first foray into Web benchmarking. Because SPEC's charter is to define and supply objective benchmarks, and because this group has been careful to recruit representatives from most of the major players in the Internet and intranet markets, SPECweb96 represents the best hope for an objective, universal Web benchmark to which all parties can agree. The SPEC group performed extensive analysis of HTTP requests and replies across numerous sites, then solicited feedback from companies like Netscape and Spyglass in constructing its model workload. HTML document sizes are varied according to a mix across a range of file sizes, then randomly accessed to simulate typical access behavior. This benchmark concentrates completely on HTTP GET requests and measures average response time for requests, plus a metric based on overall throughput, measured in maximum benchmark operations per second.
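The randomized, size-mixed access pattern just described can be sketched as weighted sampling over file-size classes. The class boundaries and weights below are invented to show the general shape of such a workload (many small files requested often, a few large ones requested rarely); they are NOT SPEC's published workload parameters:

```python
import random

random.seed(1996)  # deterministic for the demonstration

# Illustrative file-size classes (in bytes) with relative access weights.
# These exact figures are invented, not SPEC's published numbers.
size_classes = [
    ((100, 1_000), 35),
    ((1_000, 10_000), 50),
    ((10_000, 100_000), 14),
    ((100_000, 1_000_000), 1),
]

def random_get_size():
    """Pick a size class by weight, then a uniform size within it, giving
    the document size fetched by one randomized HTTP GET."""
    low, high = random.choices(
        [bounds for bounds, _ in size_classes],
        weights=[weight for _, weight in size_classes],
    )[0]
    return random.randint(low, high - 1)

sizes = [random_get_size() for _ in range(10_000)]
print(f"mean requested document size: {sum(sizes) / len(sizes):,.0f} bytes")
```

Randomizing requests across such a mix is what lets a benchmark defeat naive caching tricks and exercise the server the way a varied user population would.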

What's problematic with these benchmarks?
None of these are perfect, and none offer the transaction tracking and measurement provided by more comprehensive database benchmarks. Both the HP and SGI benchmarks (NetPerf and WebStone) are suspect because they're vendor-designed and maintained. Only the SPECweb96 benchmark appears to be based on a mathematically sophisticated, complete attempt to model document requests and sizes in recreating server document delivery behavior from a "real" Web environment.

Although three of the four benchmarks make their source code available (free for NetPerf and WebStone, for a fee from SPEC for SPECweb96), GStone is like a black box in that the Web66 team does not make its code available. But SPEC is the only organization that imposes rigid rules for how (and why) code can be altered in the porting process. It also insists on a rigorous review of any ported versions before permitting licensees to publish results based on any of its benchmarks.

Finally, only SPECweb96 represents a concerted effort from a sizable group of interested vendors to construct an objective and completely replicable benchmark (See the sidebar, But wait, there's more, for a list of the major players involved in this project). Although HP and SGI are forthcoming with source code and results, they exercise no control over their benchmarks' use, nor do they require the kind of reporting on alterations and code replacements that any porting effort will typically require.

Which benchmark rules the roost?
Today, the NetPerf and GStone benchmarks enjoy the broadest use and support the largest pools of published results. But we don't think this situation will continue for too much longer.

Because of SPEC's attention to detail, its rigorous modeling approach, and its cross-industry recruitment of designers, programmers, and testers; and because SPEC's sole focus and expertise lies in benchmarking, we think SPECweb96 stands out as the most reasonable benchmark in this group. This is true, not only because of the analysis of actual traffic and vendor experience that went into modeling its transaction mix, but also because the underlying client technology used to create the server load is based on research and implementation efforts that have been underway at SPEC since the early 1990s.

This technology (known as SPEC SFS or LADDIS) has been tested and proven in other client/server applications, and it offers a strong foundation for the SPECweb96 benchmark.

Likewise, SPECweb96 is likely to be the most accurate benchmark of all, because it uses the most realistic workloads and most rigorously constructed transaction mixes of any of the benchmarks we've discussed. This is partly due to the broad involvement from the contributing vendors, but also due to SPEC's detailed statistical analyses of thousands of actual Web server logs collected under a broad range of conditions and usage scenarios.

SPECweb96 was introduced only in July 1996, so results that derive from its use are still scarce. As of late August, only DEC and HP had published SPECweb96 results (not surprisingly, the blazingly fast 300-MHz Alpha dual-processor system looks very good). We learned that most of the other vendors involved in in-house Web server benchmarking are deep in the throes of testing right now, and virtually all intend to publish results some time in September or October. Apparently, there are still some kinks being worked out in the test itself. As soon as these last-minute elements are resolved, vendors will begin to publish results, and the code for the test will be made public, according to Walter Bays, the SPEC representative from Sun Microsystems.

"As time goes on, other vendors will join the race reporting on existing products. You'll also probably see SPECweb96 quoted in new product announcements -- the more SPECweb96 is quoted, the more customers will ask for it, and the usage will snowball," Bays said.

By September, it will also be possible to order a CD containing the SPECweb96 benchmark, with versions available both in C and Perl (cost for not-for-profit organizations: $400; all others: $800. The benchmark is available by request on a case-by-case basis at no charge for creating test results for publication). Visit the SPEC Web site for details on membership, dues, and participation in SPEC benchmark design and testing.

Understanding SPECweb96
According to Kaivalya Dixit, president of SPEC, what makes SPECweb96 special is easy to explain:

"Unlike other Web server benchmarks, SPECweb96 is analyzed, debated and agreed by leading vendors, system integrators, and universities. It is a standardized test -- which means that published run/reporting rules, repeatability of results, and full disclosure are necessary for System Level Benchmarks. Most importantly, [SPECweb96's] strengths and weaknesses have been openly discussed and reported. The salient difference of this benchmark has been in the development, where SPEC leveraged skills from many disciplines (server vendors, software vendors and research groups) to make it as fair as humanly possible."

Let's examine a set of SPECweb96 results, and explain what they indicate, and how they might be interpreted:

SPECweb96 results
(reproduced from

Sponsor  System Name               SPECweb96  Full Disclosures
-------  ------------------------  ---------  --------------------
Digital  Alphaserver 1000A 4/266      252     Text Html PostScript
Digital  Alphaserver 2100 5/300       565     Text Html PostScript
Digital  Alphaserver 2100 5/300       809     Text Html PostScript
Hewlett  HP 9000 Model D210           216     Text Html PostScript
Hewlett  HP 9000 Model K400           500     Text Html PostScript

First, it's important to understand the column headings:

Sponsor
The name of the company or organization that performed the benchmark.

System Name
The make, model, and system speed of the test system.

SPECweb96
The numerical results of the SPECweb96 benchmark.

Full Disclosures
The formats available for complete reports on results (full disclosures are required by SPEC benchmarking rules, and include information about any changes or modifications made to the original source code). Since we copied this table straight from the Web, each term was originally linked to a text, HTML, or PostScript version of the "Full Disclosures" report for that system.

The bulk of this information is pretty straightforward, so let's jump straight into the SPECweb96 metric. To begin with, bigger is better (thus, in the list above, the DEC Alphaserver 2100 5/300 outperforms all of the other systems by a pretty wide margin). But what does this number really measure? By measuring the response time for each server GET request and the overall throughput over time, the benchmark calculates a score that increases as latency decreases (shorter response times score higher) and as throughput increases (more server output means a higher score). The most important thing about these numbers is that they're designed to be compared: That is, because the results reported for the DEC Alphaserver 1000A are 16 percent higher than for the HP 9000 Model D210, this means the DEC platform is 16 percent faster than the HP platform.
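The comparison in that last sentence is ordinary arithmetic over the scores in the results table; the quoted 16 percent is this ratio (about 16.7 percent), rounded down:

```python
# SPECweb96 scores taken from the results table above.
dec_1000a = 252   # Digital Alphaserver 1000A 4/266
hp_d210 = 216     # Hewlett-Packard HP 9000 Model D210

# Relative advantage of the higher score over the lower one, in percent.
advantage = (dec_1000a - hp_d210) / hp_d210 * 100
print(f"DEC advantage over HP: {advantage:.1f} percent")
```

Because the metric is designed to be linear in this sense, the ratio of two scores is the whole point of publishing them side by side.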

But while you're pondering the numbers, recall that the existence of "Full Disclosures" for these benchmarks is a very good thing. In fact, much of the intellectual and programmatic rigor that surrounds SPECweb96 (and all SPEC benchmarks) exists because they're designed to be easily ported to multiple platforms. "Full disclosure" means what it says -- namely, that any and all changes to the SPECweb96 source code must be described completely, and any potential impacts on results must be reported in profuse detail.

This means two very important things:

  1. For better test results, vendors must tweak their own code or their hardware, NOT the benchmark itself.
  2. It helps maintain the comparability of benchmarks made on one set of hardware and software to those made on another (which is what this benchmark was designed to accomplish in the first place).

Although it's been well-observed that vendors do indeed tweak systems and software to do better on SPEC benchmarks, these benchmarks are designed so that tweaking will usually translate into genuinely better server performance as well.




[(c) Copyright Web Publishing Inc., an IDG Communications company]



But wait, there's more

There are many excellent sources for further benchmarking information. We've included URLs for general overviews as well as reference sites for specific benchmarks mentioned in this story.

Overview information:

Benchmark information:

NetPerf
Current version: 2.1
Implemented & maintained by: Hewlett-Packard
Home page:
Training/documentation info available from:
Anonymous FTP:

WebStone
Current version: 2.0
Implemented & maintained by: Silicon Graphics, Inc. (SGI)
Home page:
Mailing list:
A highly informative FAQ on the topic can be found at:
Download from:
The download section occurs early on in the home page. Look for the phrase "Download WebStone Benchmark" (gzipped tar file).

A plethora of WebStone-related information can easily be produced by visiting SGI's home page and using its search engine with the string "WebStone."

The original white paper describing version 1.0, which includes a great deal of technical design detail, is available at:

An open benchmark, WebStone's source code can be downloaded, examined, and used by anyone who wishes to employ this performance measurement tool. Although it's currently built for Unix environments, the SGI site exhorts readers to "feel free to port this code to other platforms."

WebStone measures the raw throughput of a standard mix of HTTP messages, built to model a normal server workload. Two main measures of performance, latency and number of connections per second, are employed. It's also possible to use WebStone to collect other data, including throughput measured in bits per second. WebStone is driven by a set of client configuration files. These determine the server a client will request data from, how long the connection is maintained, and the URLs to be requested.
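A client configuration of the sort just described might be processed along these lines. The key=value file format shown here is invented for illustration, and is NOT WebStone's actual configuration syntax; the server name is likewise a placeholder:

```python
# A toy reader for a client configuration of the kind the article describes:
# which server to hit, how long to hold the connection, which URLs to request.
# The format and server name below are invented for illustration.
SAMPLE_CONFIG = """\
server=www.example.com:80
duration=30
url=/index.html
url=/images/logo.gif
url=/cgi-bin/search
"""

def parse_client_config(text):
    """Return a dict with scalar settings plus a list of URLs to request."""
    config = {"urls": []}
    for line in text.splitlines():
        if not line:
            continue
        key, _, value = line.partition("=")
        if key == "url":
            config["urls"].append(value)
        else:
            config[key] = value
    return config

cfg = parse_client_config(SAMPLE_CONFIG)
print(cfg["server"], len(cfg["urls"]), "URLs")
```

Driving the load generator from declarative files like this is what makes a benchmark run repeatable: the same configuration replayed against two servers yields directly comparable numbers.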

SPECweb96
Current version: 1.0
Implemented & maintained by: Standard Performance Evaluation Corporation (SPEC)
Home page:
SPECWeb Information Page:
SPECWeb Download Page:

A SPEC SFS working group has adapted SPEC's SFS to Web server benchmarking. Participants in this effort included the following companies: CommerceNet, OpenMarket, Digital Equipment Corp., Netscape, Fujitsu (HAL Computer Systems), Siemens Nixdorf, Hewlett-Packard, Silicon Graphics, Inc., IBM, Spyglass, Intel, Sun Microsystems, Inc.

For its first release, made available July 22, 1996, this group focused the workload on HTTP GETs. They plan to test POSTs and other aspects of Web serving in an as-yet-unannounced second release.

The workload for SPECweb is built on a model of a "typical" Web provider. This provider offers individuals space where each can place Web pages to be accessed. Thus, each individual has a separate Web space on the server, with multiple pages within each space. Each benchmark client simulates a number of Web browsers by requesting random pages throughout the Web server's entire page space. We expect this benchmark to become an industry standard for Web server testing, especially when future versions are released.
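The page-space model described above is easy to sketch. The user and page counts below are arbitrary illustrations, not SPEC's actual workload parameters, and the path naming scheme is invented:

```python
import random

random.seed(42)  # deterministic for the demonstration

# Model of the "typical Web provider" described above: each individual gets
# a separate space on the server holding several pages.  The counts and the
# path scheme are illustrative inventions.
USERS, PAGES_PER_USER = 100, 8

page_space = [
    f"/user{u}/page{p}.html"
    for u in range(USERS)
    for p in range(PAGES_PER_USER)
]

def simulated_browser_request():
    """One simulated browser GET: a random page anywhere in the space."""
    return random.choice(page_space)

sample = [simulated_browser_request() for _ in range(5)]
print(sample)
```

Spreading requests uniformly across the whole page space keeps the server from satisfying the benchmark out of a tiny hot cache, which is exactly what a varied population of real browsers would prevent.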

The University of Minnesota's Web66/GStone
Current version 1.0
Implemented & maintained by: The Web66 Project
Home page:
GStone Information Page:
GStone Download Page: not available

The GStone effort attempts to define a Web server benchmark methodology and test suite. While the group is more than willing to share a detailed discussion of the test suite, its approach, and its results, it does not offer the GStone test suite for download.

GStone measures the time it takes to start a connection (how long it takes for the first element of data to arrive at the client) and the rate at which data is transmitted (how quickly the data moves from the server to the client).

GStone results are expressed as Gv[L,r]

v: Version
The GStone version number

L: Latency
The time, in milliseconds, that it takes to open a connection. This measures the interval between when the browser initiates a connection until the server begins transmitting data, including all negotiations to open the connection and request the URL.

r: Rate
The rate at which the server transmits data, measured in milliseconds per kilobyte of data.

In essence, a GStone is a unit of time. Thus, lower GStone values signify faster performance, while higher values signify slower performance.

For example, Gp[1704, 99] means that the average connection took 1.704 seconds to open, and required 99 milliseconds to transfer each kilobyte of data. At the time we visited their site, GStone measurements were available for several Web servers.
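Given three timestamps and a byte count, the two components of a GStone result can be computed directly. This sketch is our own illustration, not Web66's code, and its input numbers are chosen to reproduce the 1.704-second, 99-ms-per-kilobyte example above:

```python
def gstone(connect_start, first_byte, last_byte, kilobytes):
    """Return (L, r): connection latency in milliseconds, and transfer
    rate in milliseconds per kilobyte, as GStone results express them."""
    latency_ms = (first_byte - connect_start) * 1000
    rate_ms_per_kb = (last_byte - first_byte) * 1000 / kilobytes
    return round(latency_ms), round(rate_ms_per_kb)

# The connection opens after 1.704 seconds, then moves 100 KB in 9.9
# seconds (99 ms per kilobyte).
L, r = gstone(connect_start=0.0, first_byte=1.704, last_byte=11.604,
              kilobytes=100)
print(f"G[{L}, {r}]")  # prints G[1704, 99]
```

Separating startup latency from transfer rate this way is useful because the two respond to different tuning: latency reflects connection handling, while the rate reflects raw data-moving capacity.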

We expect the table of results to expand with time, when the group has more than preliminary results to share. This URL might be worth visiting regularly for a while, to see how its efforts are progressing.



Connolly speaks out

Dan Connolly, a Web luminary, conversed with author Ed Tittel via e-mail. Connolly was one of the chief architects of HTML 2.0 and is currently employed as a Research Technical Staff member at the World Wide Web Consortium in the Laboratory for Computer Science at MIT in Cambridge, MA. Here are the questions Tittel asked along with Connolly's replies.

Q: Are you convinced that Web benchmarking has any significance?

Yes. :-) See:

The Web is a system with a large human component. According to HCI (Human-Computer Interaction) studies, the difference between .1 seconds and .2 seconds to follow a link can have radical impact on the productivity of a Web user -- it has to do with attention span and other things about the way humans are built. Another major threshold is one second. Beyond that, you're really wasting the user's time.

I really should know the seminal papers on this sort of thing, but I don't. A little surfing yields this random paper that I haven't read:

Green, A. J. K. (1994) "An Exploration of the Relationships between Task Representations, Knowledge Structures and Performance." Interacting with Computers, 6, 61-85

The guy who knows about HCI is Keith Instone:

Q: What is your opinion of "counting hits?"

It's evil, in that it encourages folks who don't know any better to do things that are at odds with the health of the Net. The state-of-the-art hacks for gathering demographics are really bad, and they interfere with scalability techniques like caching and replication.

In the abstract, "counting hits" and other techniques for gathering demographics are reasonable and necessary things as the Web matures as a medium. But as I said, the state of the art techniques are globally destructive.

See: and

Q: How can WebMasters and ISPs plan to deliver reasonable service from their sites?

I don't know the state-of-the-art here. Keep in mind that WebMasters are not generally W3C's customers, but rather customers of W3C customers.

If I were a Webmaster, I'd read the FAQ: "How fast does my net connection need to be?" and hang out on comp.infosystems.www.servers.* to find out the conventional wisdom. Check out stuff like, including

Q: What's your perspective on causes for Web delays and other bottlenecks? What roles do backbone traffic, server load, and protocols play?

I hear often, and from reliable sources, that backbone traffic is not a problem. The telcos have all the bandwidth folks are willing to pay for...

Server load is the bottleneck in most cases these days, I'd say either the Net connection to the server is saturated or the server machine is overloaded. Traditional Web server implementations had many serious performance bugs. Get a modern server. And of course, be aware that CGI and server-side includes can be costly.

There are also problems in the protocols. This is where W3C comes in. Some fixes are long overdue. In the near term, HTTP 1.1 with persistent connections and clearer rules about caching should help somewhat. Some improvements will roll out more gradually, but we expect they will have a more dramatic effect.


Q: Where do you see Web behavior and performance headed?

There is a LOT of room for improvement. Engelbart's stuff back in the '60s demonstrated sub-second link traversal time. If you haven't seen the videos of this stuff, do it.

Bootstrap Institute - Publications, Articles, Videos

"The Augmented Knowledge Workshop: Engelbart on the history of NLS." Includes design rationale, plus historic photos and movie footage. (82 min.) - 1986

So the first thing to do is to catch up to the state-of-the-art back in the '60s!

In the long term, the telcos and cable companies are poised to compete to address the bandwidth and infrastructure issues. I gather that there is an avalanche of business models, ventures, and approaches that the communications reform bill will set in motion.

But it will be several years -- in Web-years, several generations -- before the sunset of POTS as the infrastructure for the consumer portion of the Internet.

In that time, you'll see a lot of innovation in Web software to optimize for particular patterns of use. For example, there are products that periodically download whole web sites for off-line viewing that simulate much of the newspaper-style pattern of use.

Q: What are your thoughts on the "gridlock" or "train wreck" scenarios so popular among some pundits?

My favorite analogy is "tragedy of the commons." [Ed Tittel's note: This refers to the overgrazing of public lands available for general use in 18th century England, where this had a harsh effect on small independent farmers who relied on the commons for an adjunct to expensive, private pasturage. The tragedy was that the more they relied on them, the more certain their ruination became. Here, Connolly seems to imply that the clogging of the public Internet could be a function of companies and users who are grabbing as much bandwidth as they can, to preempt others' abilities to use that bandwidth themselves. The real tragedy, of course, is that everybody loses.]

It seems to me that the most damage is done by folks who don't know any better. There are a few malicious cretins out there, but they disappear in the noise, really. It's important that the designers of this thing realize that people will do the easiest thing that works, by and large. So we have to make it so that it's hard to waste a lot of bandwidth. Trouble is, the designers also tend to do the easiest thing that works, which often as not ends up encouraging end-users to do antisocial things.

If we can keep a little discipline among the development community, I'll stay optimistic.


About the author
Ed Tittel is a principal at LANWrights, Inc., an Austin, TX-based consultancy. He is the author of numerous magazine articles and over 25 computer-related books, most notably "HTML for Dummies," 2nd Ed., "The 60-Minute Guide to Java," 2nd Ed., and "Web Programming Secrets" (all titles IDG Books Worldwide, 1996). Check out the LANWrights site for more information.