Collecting and using server statistics
How to interpret server statistics and when they can (and can't!) be used.
If you run a server, you need to know who is accessing your pages. Take the time to learn how to read an access log and find a tool that can process the data and generate useful reports for you and your content managers. (2,300 words)
For many webmasters, the success of their sites is measured by the size of their access log: more visitors signal popularity, and popularity is taken as success. The real measure of a site's success, however, is not only the quantity but the quality of the visitors.
This month, I take a look at access statistics. We start with the basics, including how statistics are generated and what the log files look like. From there we learn how to examine a log to get a real indication of your site's popularity. Next is an overview of how you can and can't use statistics, and we conclude by checking out a great tool I use to compile access reports from a raw statistics log.
The nuts and bolts of statistics
All NCSA-derived servers, including NCSA's httpd and the Apache server, write a one-line entry to a log file each time someone attempts to access the Web server. Usually, these entries are kept in a file named access_log in a directory under the installation directory for your server. This default location can vary, of course, based on the value of the corresponding directive in your httpd.conf file.
When a client machine connects to your Web server, the server writes a line to the file that looks something like this:
ip061.sky.net - - [01/Dec/1995:00:00:58 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356
The various fields are:
"GET /transparent_images.html HTTP/1.0"
The request sent by the client. The first field is the command, in this case GET to retrieve a complete document. This command might also be HEAD to retrieve just the header portion of a document, or POST to invoke a POST-style application from a form.
Regardless of the command, the second field is the path component of the URL being accessed. This is the path of the requested document relative to the document root directory on your server. On my server, the document root is /usr/local/http/docs, so the file being referenced here is found under that directory.
The final field is the name and version of the protocol used to send and receive the data with the client. In this case, the client was using version 1.0 of the HTTP protocol.
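The last two fields of each entry are the return code and the number of bytes sent to the client. The 200 in the entry above indicates success. By far, the most common error codes you'll see are 404 (document not found) and 403 (access denied, assuming you have access controls enabled).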
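The field layout described above can be sketched in a few lines of Python. This is only an illustration, not part of any server distribution: the pattern assumes the seven-field layout of the sample entry, and the names `LOG_PATTERN`, `parse_entry`, and the field labels are hypothetical choices made for this example.

```python
import re

# Assumed layout of one entry:
# host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_entry(line):
    """Split one log line into named fields; return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The quoted request holds the command, the path, and the protocol.
    parts = entry["request"].split()
    entry["method"] = parts[0] if len(parts) > 0 else ""
    entry["path"] = parts[1] if len(parts) > 1 else ""
    entry["protocol"] = parts[2] if len(parts) > 2 else ""
    entry["status"] = int(entry["status"])
    # A "-" in the size field is treated as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('ip061.sky.net - - [01/Dec/1995:00:00:58 -0500] '
          '"GET /transparent_images.html HTTP/1.0" 200 10356')
entry = parse_entry(sample)
print(entry["host"], entry["method"], entry["path"], entry["status"], entry["size"])
# prints: ip061.sky.net GET /transparent_images.html 200 10356
```

Returning None for lines that do not match lets a caller skip damaged entries instead of stopping on them.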
While these log entries may appear obscure, they are actually easy to browse through and read. If you haven't already, take a moment to browse your access log and familiarize yourself with your clients.
How to lie with statistics
As a simple gauge of your server's popularity, you can just count the number of lines in your log file:
wc -l access_log
As this count increases, your popularity is obviously going up, too. Right? Wrong!
If your counts are not increasing as dramatically as you might like, just add a few images to your home page. You've just multiplied the growth rate of your access counts, even though the same number of people are visiting your site. The reason is simple: every access to your site is logged, including each embedded image. If you add four images to a page that had none, each access to that page will add five entries to your log: one for the page and four for the images. If you quote raw access counts for that page, you'll be off by a factor of five.
There are only a few sites that truly enjoy millions of accesses each day. The rest of us are happy to muddle along with a few thousand visitors a day. Instead of boosting your ego with inflated (and erroneous) access counts, use your statistics file to extract valid data that can help you tune and improve your server's content.
Remove all image references from your log files right off the bat. The easiest way to do this is to organize your site so that all images are kept in a common directory (I use /images). That way, I can easily remove images with a single command and count the remainder:
egrep -v '/images/' access_log | wc -l
On my server for the month of December 1995, I had 418,000 raw accesses to my machine. After stripping image references, I had 93,000 document retrievals. I consider those 93,000 accesses a true indicator of my site's activity for the month; the 418,000 figure is more an indicator of aggregate bandwidth than anything else.
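The same raw-versus-document comparison can be sketched in Python, which makes it easy to report both numbers in one pass. The function name `count_accesses` and the two image entries in the sample are hypothetical, and the sketch assumes, as above, that every image lives under /images/.

```python
def count_accesses(lines, image_prefix="/images/"):
    """Return (raw_count, document_count) for an iterable of log lines."""
    raw = 0
    documents = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines (wc -l would count them)
        raw += 1
        # Same test as egrep -v '/images/': drop any line mentioning the prefix.
        if image_prefix not in line:
            documents += 1
    return raw, documents

# One real entry from the log above plus two made-up image requests.
sample_log = [
    'ip061.sky.net - - [01/Dec/1995:00:00:58 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    'ip061.sky.net - - [01/Dec/1995:00:00:59 -0500] "GET /images/logo.gif HTTP/1.0" 200 582',
    'ip061.sky.net - - [01/Dec/1995:00:01:00 -0500] "GET /images/bullet.gif HTTP/1.0" 200 268',
]
raw, documents = count_accesses(sample_log)
print(raw, documents)  # prints: 3 1
```

Applied to the December figures, 418,000 raw accesses against 93,000 document retrievals works out to roughly 4.5 log entries per document.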
If you are further inclined to prune your logs, consider removing duplicate references from the same site within a small window of time, say five minutes or so. These references are most likely caused by a single person loading and reloading the same set of pages as they wander through your site. In reality, all those accesses count as one visit from one person. This kind of pruning requires some custom programming but may be worthwhile if you want an accurate visitor count.
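The custom programming involved need not be elaborate. The following Python sketch shows one possible approach, under several assumptions: the log is in chronological order, a "duplicate" means the same host requesting the same path, the month names in the log match the program's locale, and every request (kept or not) restarts the five-minute clock. The names `ENTRY` and `prune_duplicates` and the host in the sample are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Captures the host, the bracketed timestamp, and the requested path.
ENTRY = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" \d{3} \S+')

def prune_duplicates(lines, window=timedelta(minutes=5)):
    """Keep an entry only if the same host has not requested the same
    path within the preceding window. Assumes chronological input."""
    last_seen = {}  # (host, path) -> time of the most recent request
    kept = []
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            continue  # ignore lines that are not in the expected format
        host, stamp, path = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        previous = last_seen.get((host, path))
        if previous is None or when - previous > window:
            kept.append(line)
        # Sliding window: every request, kept or not, restarts the clock.
        last_seen[(host, path)] = when
    return kept

# Made-up entries: one reload after two minutes, one return after 20 minutes.
sample_log = [
    'host.example.com - - [01/Dec/1995:16:00:00 -0500] "GET /index.html HTTP/1.0" 200 3515',
    'host.example.com - - [01/Dec/1995:16:02:00 -0500] "GET /index.html HTTP/1.0" 200 3515',
    'host.example.com - - [01/Dec/1995:16:03:00 -0500] "GET /other.html HTTP/1.0" 200 3620',
    'host.example.com - - [01/Dec/1995:16:20:00 -0500] "GET /index.html HTTP/1.0" 200 3515',
]
print(len(prune_duplicates(sample_log)))  # prints: 3
```

Keying on the host alone, rather than on host and path together, would instead collapse an entire browsing session into a single visit; which definition is appropriate depends on what you want to count.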
What you can and can't do
If you are running a server for a marketing group, you will find that they are completely unhappy with the kind of demographic data you can supply based on your access logs. Marketing types need to know information about individuals: age, sex, income, occupation, etc. When they discover that you might be able to determine the machine your visitors are using and little else, they will be appalled.
I've had marketing folks ask for a list of e-mail addresses of every person who has accessed a particular site so they can send a message to each of them! I've been asked for the names of visitors, their employers, and their country of origin. I've even been asked to determine the speed of their connection to the Internet!
The reality is that very little personal information is available to a Web server. If you really want that data, you're better off creating a form that collects the information from your visitors, perhaps as a prelude to accessing certain pages of interest.
You can, however, get an indication of the documents on your server which are attracting the most interest. By counting the aggregate demand for each document on your server, you'll begin to see why people are visiting your site and which pages they find interesting.
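Counting the demand for each document is a short exercise. The Python sketch below is one way to do it, assuming that only successful (200) GET and POST requests should count and that images live under /images/. The names `REQUEST` and `top_documents`, the hosts, and the /missing.html and /images/logo.gif entries are hypothetical.

```python
import re
from collections import Counter

# Captures the requested path and the return code that follows the request.
REQUEST = re.compile(r'"(?:GET|POST) (\S+)[^"]*" (\d{3}) ')

def top_documents(lines, limit=10, image_prefix="/images/"):
    """Return the most requested documents as (path, count) pairs."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue  # HEAD requests and malformed lines are not counted
        path, status = match.groups()
        # Count only successful retrievals of non-image documents.
        if status == "200" and not path.startswith(image_prefix):
            counts[path] += 1
    return counts.most_common(limit)

sample_log = [
    'a.example.com - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    'b.example.com - - [01/Dec/1995:16:02:10 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    'a.example.com - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 3515',
    'c.example.com - - [01/Dec/1995:16:09:00 -0500] "GET /missing.html HTTP/1.0" 404 0',
    'a.example.com - - [01/Dec/1995:16:09:14 -0500] "GET /images/logo.gif HTTP/1.0" 200 582',
]
print(top_documents(sample_log))
# prints: [('/transparent_images.html', 2), ('/', 1)]
```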
Since your access logs are written chronologically, you can trace a single visitor's path through your site. This is really useful information: you can see which pages are serving as entry points to your server (your top-level page is rarely the main entry page for your site) and how people are navigating through your pages.
Consider this visitation sequence to my site:
126.96.36.199 - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356
188.8.131.52 - - [01/Dec/1995:16:02:26 -0500] "GET /files/giftrans.exe HTTP/1.0" 200 35997
184.108.40.206 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 3515
220.127.116.11 - - [01/Dec/1995:16:09:00 -0500] "GET /local_stuff.html HTTP/1.0" 200 3620
18.104.22.168 - - [01/Dec/1995:16:09:14 -0500] "GET /images.html HTTP/1.0" 200 4441
22.214.171.124 - - [01/Dec/1995:16:11:03 -0500] "GET /files/local_open.gif HTTP/1.0" 200 582
126.96.36.199 - - [01/Dec/1995:16:11:14 -0500] "GET /files/local_closed.gif HTTP/1.0" 200 548
188.8.131.52 - - [01/Dec/1995:16:11:21 -0500] "GET /files/bullet_red.gif HTTP/1.0" 200 268
184.108.40.206 - - [01/Dec/1995:16:11:34 -0500] "GET /files/bullet_blue.gif HTTP/1.0" 200 268
220.127.116.11 - - [01/Dec/1995:16:11:39 -0500] "GET /files/bullet_yellow.gif HTTP/1.0" 200 268
18.104.22.168 - - [01/Dec/1995:16:11:45 -0500] "GET /files/bullet_green.gif HTTP/1.0" 200 268
22.214.171.124 - - [01/Dec/1995:16:18:23 -0500] "GET /files/fun_closed.gif HTTP/1.0" 200 530
126.96.36.199 - - [01/Dec/1995:16:18:36 -0500] "GET /files/fun_closed.gif HTTP/1.0" 200 530
188.8.131.52 - - [01/Dec/1995:16:18:49 -0500] "GET /files/melmac_closed.gif HTTP/1.0" 200 514
This visitor enters my site through my transparent images document, downloading a copy of the giftrans tool after about a minute. Six minutes later he jumps up to my top-level page and then down into my "local stuff" collection. He finds my collection of sharable images and quickly downloads a number of images.
From this sequence, I can conclude that it's fairly easy to reach my home page from a subsidiary page and that my general navigation tools seem to be working. This sequence of 14 accesses, along with the 50-some embedded images that I've stripped from the log, constitutes a single visit from one person lasting approximately 17 minutes.
While this kind of analysis can be tedious to conduct regularly, you should occasionally perform this type of "thread analysis" on your logs to be sure people are working their way through your site properly.
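Part of the tedium can be automated. The Python sketch below groups requests by host, in log order, and tallies which page each host requested first. It makes simplifying assumptions: one host equals one visitor, and separate visits by the same host are not split apart. The names `ENTRY`, `trace_visitors`, and `entry_pages` and the two hosts in the sample are hypothetical.

```python
import re
from collections import Counter

# Captures the host and the requested path of one entry.
ENTRY = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*"')

def trace_visitors(lines):
    """Map each host to the ordered list of paths it requested."""
    threads = {}
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            continue  # ignore lines that are not in the expected format
        host, path = match.groups()
        threads.setdefault(host, []).append(path)
    return threads

def entry_pages(threads):
    """Count how often each path was the first one a host requested."""
    return Counter(paths[0] for paths in threads.values())

# Made-up hosts requesting pages named in the sequence above.
sample_log = [
    'visitor-a.example.com - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    'visitor-b.example.com - - [01/Dec/1995:16:02:00 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    'visitor-a.example.com - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 3515',
    'visitor-a.example.com - - [01/Dec/1995:16:09:00 -0500] "GET /local_stuff.html HTTP/1.0" 200 3620',
]
threads = trace_visitors(sample_log)
print(threads["visitor-a.example.com"])
# prints: ['/transparent_images.html', '/', '/local_stuff.html']
print(entry_pages(threads).most_common(1))
# prints: [('/transparent_images.html', 2)]
```

Because the log is chronological, appending paths as they are read preserves each visitor's order without any sorting.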
A log analysis tool
Few people want to browse a raw log file, but many people like to see a summary report of their sites' access statistics. While a number of tools exist to help you parse and summarize a log file (as a start, check out the collection of log tool references on Yahoo), my personal favorite is analog, written by Stephen Turner of the University of Cambridge Statistical Laboratory.
While the reports generated by analog are not markedly different from those created by other log analysis tools, analog is wonderfully easy to use, offers a wealth of configuration options, and above all else, is fast. I was astounded when analog parsed my complete server log (dating back to February 1994) in less than seven minutes. My previous log tool took around 75 minutes to perform the same task!
Analog lets you compile in your favorite report options and further configure the tool using a vast number of directives you place in a separate configuration file. All of these settings can be overridden on the command line if needed, making analog one of the most versatile log tools I've used.
I like to use analog to generate access reports each morning for my site. You will find it useful for this task and a lot more. If you haven't already settled on a log tool (and even if you have), take a moment to learn more about analog.
Access log analysis is important, but it isn't the only tool in your demographic arsenal. Next month, we'll look at referer and agent logs for more clues about your users.
About the author
Chuck Musciano has been running Melmac for close to two years, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta-tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics. His book, HTML: The Definitive Guide, is currently available from O'Reilly and Associates.