Click on our Sponsors to help Support SunWorld

Collecting and using server statistics

How to interpret server statistics and when they can (and can't!) be used.

By Chuck Musciano

March 1996

Abstract

If you run a server, you need to know who is accessing your pages. Take the time to learn how to read an access log and find a tool that can process the data and generate useful reports for you and your content managers. (2,300 words)

One of the first things you'll do after you get your server up and running is check your access log -- the file your server creates to track every access, good and bad, to your site. After all, what's the point of running a server if no one visits?

For many webmasters, the success of a site is measured by the size of its access log: more entries mean more visitors, and more visitors mean a more popular site. The real measure of a site's success, however, is not only the quantity but the quality of the visitors.

This month, I take a look at access statistics. We start with the basics, including how statistics are generated and what the log files look like. From there we learn how to examine a log to get a real indication of your site's popularity. Next is an overview of how you can and can't use statistics, and we conclude by checking out a great tool I use to compile access reports from a raw statistics log.

The nuts and bolts of statistics
All NCSA-derived servers, including NCSA's httpd and the Apache server, write a one-line entry to a log file each time someone attempts to access the Web server. Usually, these entries are kept in a file named access_log within a directory named logs in the installation directory for your server. This default location can vary, of course, based on the value of the TransferLog directive in your httpd.conf file in the conf directory.

When a client machine connects to your Web server, the server writes a line to the file that looks something like this:

ip061.sky.net - - [01/Dec/1995:00:00:58 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356

The various fields are:

ip061.sky.net

The domain name of the machine making this request. The server determines this name by performing a reverse DNS lookup on the machine's IP address. If the machine's reverse address is not correctly managed by a domain name server, you'll find the raw IP address in this field instead. The number of addresses that fail to reverse correctly is alarmingly high. On my server, they constitute 16% of all requests! If you run a domain name server, make sure it reverses your addresses correctly. This makes life easier for everyone on the Internet.

- -

These dashes are placeholders for two fields. The first is intended to contain the identity of the remote user, as reported by the client machine through the RFC 931 authentication protocol. The second holds the user name supplied when the requested document is protected by HTTP authentication. In reality, very few sites run an RFC 931 authentication server, even fewer Web servers are configured to query one, and most documents are unprotected. So these fields are almost always empty.

[01/Dec/1995:00:00:58 -0500]

This is the date and time of the access, along with your offset from Greenwich Mean Time. Since I'm on the East Coast of the United States, I'm five hours behind GMT, hence the -0500.

"GET /transparent_images.html HTTP/1.0"

This is the request made by the client browser. The first word is the HTTP command, usually a GET to retrieve a complete document. This command might also be HEAD to retrieve just the header portion of a document, or POST to invoke a POST-style application from a form.

Regardless of the command, the second field is the path component of the URL being accessed. This is the path of the requested document relative to the document root directory on your server. On my server, the document root is /usr/local/http/docs, so the actual file being referenced here is /usr/local/http/docs/transparent_images.html.

The final field is the name and version of the protocol used to exchange data with the client. In this case, the client was using version 1.0 of HTTP.

200

This is the server response code. A successful request generates a response code of 200. Other response codes include

302 URL has been redirected to another document
400 Bad request was made by the client
401 Authorization is required for this document
403 Access to this document is forbidden
404 Document not found
500 Server internal error
501 Application method (either GET or POST) is not implemented
503 Server is out of resources

By far, the most common error codes you'll see are 404 (document not found) and 403 (access denied, assuming you have access controls enabled).

10356

This is the number of bytes transferred to the client. Since every request has some sort of response, even erroneous requests will have a non-zero value for this field.

While these log entries may appear obscure, they are actually easy to browse through and read. If you haven't already, take a moment to browse your access log and familiarize yourself with your clients.
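If you would rather let a program do the reading, the fields described above can be pulled apart with a regular expression. What follows is only a minimal sketch in Python, not part of any server distribution: the names LOG_PATTERN and parse_entry are invented for illustration, and accepting a dash in the byte-count field is an assumption about how some servers log responses with no body.

```python
import re
from datetime import datetime

# One named group per field described above: host, the two identity
# placeholders, the timestamp, the quoted request, the status code,
# and the byte count.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The request is "COMMAND path protocol"; tolerate missing pieces.
    parts = entry["request"].split()
    entry["method"] = parts[0] if parts else ""
    entry["path"] = parts[1] if len(parts) > 1 else ""
    entry["protocol"] = parts[2] if len(parts) > 2 else ""
    entry["status"] = int(entry["status"])
    # Assumption: a dash in the byte-count field is treated as zero bytes.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # %z reads the GMT offset (-0500); %b expects English month names,
    # which is what the default C locale provides.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

sample = ('ip061.sky.net - - [01/Dec/1995:00:00:58 -0500] '
          '"GET /transparent_images.html HTTP/1.0" 200 10356')
entry = parse_entry(sample)
print(entry["host"], entry["path"], entry["status"], entry["bytes"])
# prints: ip061.sky.net /transparent_images.html 200 10356
```

Lines that do not fit the pattern come back as None rather than raising an error, so a damaged entry will not stop a run over a large log.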

How to lie with statistics
As a simple gauge of your server's popularity, you can just count the number of lines in your log file:

   wc -l access_log

As this count increases, your popularity is obviously going up, too. Right? Wrong!

If your counts are not increasing as dramatically as you might like, just add four images to your home page. You've just quintupled the growth rate of your access counts, even though the same number of people are visiting your site. The reason is simple: every access to your site is logged, including each embedded image. If you add four images to a page, each access to that page will add five entries to your log: one for the page and four for the images. If you quote raw access counts for your server, you'll be off by a factor of five.

There are only a few sites that truly enjoy millions of accesses each day. The rest of us are happy to muddle along with a few thousand visitors a day. Instead of boosting your ego with inflated (and erroneous) access counts, use your statistics file to extract valid data that can help you tune and improve your server's content.

Remove all image references from your log files right off the bat. The easiest way to do this is to organize your site so that all images are kept in a common directory (I use /images). That way, I can easily remove images with a single command and count the remainder:

   egrep -v '/images/' access_log | wc -l

On my server for the month of December 1995, I had 418,000 raw accesses to my machine. After stripping image references, I had 93,000 document retrievals. I consider those 93,000 accesses a true indicator of my site's activity for the month; the 418,000 figure is more an indicator of aggregate bandwidth than anything else.
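The same two counts can be produced in one pass by a short program. This is a rough sketch, assuming images live under /images/ as described above; the function name count_accesses and the sample entries are hypothetical. Unlike the egrep command, which drops any line containing /images/ anywhere, this version looks only at the start of the requested path.

```python
def count_accesses(lines, image_prefix="/images/"):
    """Return (raw_count, document_count) for an iterable of log lines."""
    raw = 0
    documents = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        raw += 1
        try:
            # The request is the first double-quoted field: "GET /path HTTP/1.0"
            request = line.split('"')[1]
            path = request.split()[1]
        except IndexError:
            continue  # malformed entry: counted as raw, not as a document
        if not path.startswith(image_prefix):
            documents += 1
    return raw, documents

# Hypothetical entries, for illustration only.
sample = [
    'host.example.com - - [01/Dec/1995:00:00:58 -0500] "GET /index.html HTTP/1.0" 200 3515',
    'host.example.com - - [01/Dec/1995:00:00:59 -0500] "GET /images/logo.gif HTTP/1.0" 200 582',
    'host.example.com - - [01/Dec/1995:00:01:00 -0500] "GET /images/bullet.gif HTTP/1.0" 200 268',
]
raw, documents = count_accesses(sample)
print(raw, documents)
# prints: 3 1
```

To run it against a real log, pass an open file instead of the sample list, for example count_accesses(open("access_log")).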

If you are further inclined to prune your logs, consider removing duplicate references from the same site within a small window of time, say five minutes or so. These references are most likely caused by a single person loading and reloading the same set of pages as they wander through your site. In reality, all those accesses count as one visit from one person. This kind of pruning requires some custom programming but may be worthwhile if you want an accurate visitor count.
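The custom programming need not be elaborate. Here is one possible sketch, under the assumption that a "duplicate" means the same host asking for the same document again within the window; other definitions are equally reasonable. The function name prune_duplicates is invented, the first three sample entries are taken from my log, and the fourth is hypothetical.

```python
from datetime import datetime, timedelta

def prune_duplicates(entries, window=timedelta(minutes=5)):
    """Drop a request when the same host fetched the same path within `window`.

    `entries` is an iterable of (host, time, path) tuples in log order.
    """
    last_seen = {}  # (host, path) -> time of the most recent request
    kept = []
    for host, when, path in entries:
        key = (host, path)
        previous = last_seen.get(key)
        if previous is None or when - previous > window:
            kept.append((host, when, path))
        # Update on every request, so steady reloading stays collapsed.
        last_seen[key] = when
    return kept

host = "128.103.120.11"
entries = [
    (host, datetime(1995, 12, 1, 16, 18, 23), "/files/fun_closed.gif"),
    (host, datetime(1995, 12, 1, 16, 18, 36), "/files/fun_closed.gif"),
    (host, datetime(1995, 12, 1, 16, 18, 49), "/files/melmac_closed.gif"),
    # Hypothetical later request, outside the five-minute window.
    (host, datetime(1995, 12, 1, 16, 30, 0), "/files/fun_closed.gif"),
]
kept = prune_duplicates(entries)
print(len(entries), len(kept))
# prints: 4 3
```

The window is measured from the most recent request rather than the first, which is a design choice: someone who reloads a page every minute for an hour is counted once, not twelve times.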

What you can and can't do
If you are running a server for a marketing group, you will find that they are completely unhappy with the kind of demographic data you can supply based on your access logs. Marketing types need to know information about individuals: age, sex, income, occupation, etc. When they discover that you might be able to determine the machine your visitors are running and little else, they will be appalled.

I've had marketing folks ask for a list of e-mail addresses of every person who has accessed a particular site so they can send a message to each of them! I've been asked for the names of visitors, their employers, and their country of origin. I've even been asked to determine the speed of their connection to the Internet!

The reality is that very little personal information is available to a Web server. If you really want that data, you're better off creating a form that collects the information from your visitors, perhaps as a prelude to accessing certain pages of interest.

You can, however, get an indication of the documents on your server which are attracting the most interest. By counting the aggregate demand for each document on your server, you'll begin to see why people are visiting your site and which pages they find interesting.
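Counting the demand for each document takes only a few lines. The following is a sketch rather than a finished tool: the name document_demand is invented, counting only successful (200) responses and skipping /images/ are my assumptions, and the last sample entry is hypothetical.

```python
from collections import Counter

def document_demand(lines, image_prefix="/images/"):
    """Count successful, non-image requests for each document path."""
    counts = Counter()
    for line in lines:
        fields = line.split('"')
        if len(fields) < 3:
            continue  # no quoted request field
        request = fields[1].split()  # e.g. ['GET', '/path', 'HTTP/1.0']
        trailer = fields[2].split()  # e.g. ['200', '10356']
        if len(request) < 2 or not trailer:
            continue
        path = request[1]
        # Assumption: only responses with code 200 count as demand.
        if trailer[0] == "200" and not path.startswith(image_prefix):
            counts[path] += 1
    return counts

sample = [
    'ip061.sky.net - - [01/Dec/1995:00:00:58 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    '128.103.120.11 - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    '128.103.120.11 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 3515',
    # Hypothetical failed request, which should not be counted.
    'host.example.com - - [01/Dec/1995:17:00:00 -0500] "GET /missing.html HTTP/1.0" 404 170',
]
demand = document_demand(sample)
print(demand.most_common())
# prints: [('/transparent_images.html', 2), ('/', 1)]
```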

Since your access logs are written chronologically, you can trace the access path through your site by a single visitor. This is really useful information: you can see which pages are serving as entry points to your server (your top-level page is rarely the main entry page for your site) and how people are navigating through your pages.

Consider this visitation sequence to my site:

128.103.120.11 - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356
128.103.120.11 - - [01/Dec/1995:16:02:26 -0500] "GET /files/giftrans.exe HTTP/1.0" 200 35997
128.103.120.11 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 3515
128.103.120.11 - - [01/Dec/1995:16:09:00 -0500] "GET /local_stuff.html HTTP/1.0" 200 3620
128.103.120.11 - - [01/Dec/1995:16:09:14 -0500] "GET /images.html HTTP/1.0" 200 4441
128.103.120.11 - - [01/Dec/1995:16:11:03 -0500] "GET /files/local_open.gif HTTP/1.0" 200 582
128.103.120.11 - - [01/Dec/1995:16:11:14 -0500] "GET /files/local_closed.gif HTTP/1.0" 200 548
128.103.120.11 - - [01/Dec/1995:16:11:21 -0500] "GET /files/bullet_red.gif HTTP/1.0" 200 268
128.103.120.11 - - [01/Dec/1995:16:11:34 -0500] "GET /files/bullet_blue.gif HTTP/1.0" 200 268
128.103.120.11 - - [01/Dec/1995:16:11:39 -0500] "GET /files/bullet_yellow.gif HTTP/1.0" 200 268
128.103.120.11 - - [01/Dec/1995:16:11:45 -0500] "GET /files/bullet_green.gif HTTP/1.0" 200 268
128.103.120.11 - - [01/Dec/1995:16:18:23 -0500] "GET /files/fun_closed.gif HTTP/1.0" 200 530
128.103.120.11 - - [01/Dec/1995:16:18:36 -0500] "GET /files/fun_closed.gif HTTP/1.0" 200 530
128.103.120.11 - - [01/Dec/1995:16:18:49 -0500] "GET /files/melmac_closed.gif HTTP/1.0" 200 514

This visitor enters my site through my transparent images document, downloading a copy of the giftrans tool after about a minute. Six minutes later he jumps up to my top-level page and then down into my "local stuff" collection. He finds my collection of sharable images and quickly downloads a number of images.

From this sequence, I can conclude that it's fairly easy to reach my home page from a subsidiary page and that my general navigation tools seem to be working. This sequence of 14 accesses, along with the 50-some embedded images that I've stripped from the log, constitutes a single visit from one person lasting approximately 17 minutes.

While this kind of analysis can be tedious to conduct regularly, you should occasionally perform this type of "thread analysis" on your logs to be sure people are working their way through your site properly.
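Some of the tedium can be handed to a program. The sketch below groups log lines by requesting host and reports each host's entry page, number of requests, and elapsed time. The names threads_by_host and summarize are invented for illustration, and treating everything from one host as one person is an assumption that fails when several people share a machine. The three sample lines are taken from the sequence above.

```python
from collections import defaultdict
from datetime import datetime

def threads_by_host(lines):
    """Group (time, path) pairs by requesting host, preserving log order."""
    threads = defaultdict(list)
    for line in lines:
        try:
            host = line.split()[0]
            stamp = line.split("[", 1)[1].split("]", 1)[0]
            path = line.split('"')[1].split()[1]
            when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        except (IndexError, ValueError):
            continue  # skip entries that do not look like the format above
        threads[host].append((when, path))
    return threads

def summarize(thread):
    """Return (entry_page, pages_in_order, elapsed_time) for one host's thread."""
    paths = [path for _, path in thread]
    elapsed = thread[-1][0] - thread[0][0]
    return paths[0], paths, elapsed

log = [
    '128.103.120.11 - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 10356',
    '128.103.120.11 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 3515',
    '128.103.120.11 - - [01/Dec/1995:16:18:49 -0500] "GET /files/melmac_closed.gif HTTP/1.0" 200 514',
]
for host, thread in threads_by_host(log).items():
    entry_page, paths, elapsed = summarize(thread)
    print(host, entry_page, len(paths), elapsed)
# prints: 128.103.120.11 /transparent_images.html 3 0:17:11
```

Tallying the entry pages across all hosts shows which documents actually serve as the front door to your site.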

A log analysis tool
Few people want to browse a raw log file, but many people like to see a summary report of their sites' access statistics. A number of tools exist to help you parse and summarize a log file (as a start, check out the collection of log tool references on Yahoo), but my personal favorite is analog, written by Stephen Turner of the University of Cambridge Statistical Laboratory.

While the reports generated by analog are not markedly different from those created by other log analysis tools, analog is wonderfully easy to use, offers a wealth of configuration options, and above all else, is fast. I was astounded when analog parsed my complete server log (dating back to February 1994) in less than seven minutes. My previous log tool took around 75 minutes to perform the same task!

Analog lets you compile your favorite report options and further configure the tool using a vast number of directives you place in a separate configuration file. All of these settings can be overridden on the command line if needed, making analog one of the most versatile log tools I've used.

I like to use analog to generate access reports each morning for my site. You will find it useful for this task and a lot more. If you haven't already settled on a log tool (and even if you have), take a moment to learn more about analog.

Next month...
Access log analysis is important, but it isn't the only tool in your demographic arsenal. Next month, we'll look at referer and agent logs for more clues about your users.

Resources

Home of httpd
http://hoohoo.ncsa.uiuc.edu/
Apache server's home
http://www.apache.org/
RFC 931
http://andrew2.andrew.cmu.edu/rfc/rfc912.html
Yahoo's log tool references
http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/
analog
http://www.statslab.cam.ac.uk/~sret1/analog/
NCSA httpd documentation
http://www.tesre.bo.cnr.it/docs/Overview.html
"Watching your Web server"
/sunworldonline/swol-03-1996/swol-03-perf.html
Melmac
http://melmac.corp.harris.com/
Other Webmaster
/sunworldonline/common/swol-backissues-columns.html#webmaster

About the author
Chuck Musciano has been running Melmac for close to two years, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta-tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics. His book, HTML: The Definitive Guide, is currently available from O'Reilly and Associates.

If you have technical problems with this magazine, contact webmaster@sunworld.com

URL: http://www.sunworld.com/swol-03-1996/swol-03-webmaster.html