Analyzing your referrer log
Who's linking to your pages? Your server's referrer log has all the clues
Server popularity is based largely on other sites linking to your own. Find out who's linking to you with your referrer log. (2,100 words)
Editor's note: We're taking a stand on proper spelling! We acknowledge that "referer" is the spelling in wide use, but it is in fact a misspelling. Therefore, we will use "referer" in this article only when it is part of a technical notation.
The importance of being linked
This may sound like a bit of a no-brainer, but the more sites that link to your site, the more people are likely to visit your site. To create a successful site, one of your primary goals is to get others to link to you. This increases the chances that some random net surfer will stumble across a link to your site, visit, and be so enthralled by your content that he'll come back again and again.
Before the Web exploded, you could actually make a decent estimate of how many links referred to your site by surfing around and counting them. Some of the more industrious folks created spiders that rummaged around finding links back to their site. Once the Web began its exponential growth, this kind of exploratory accounting became impossible.
Around that same time, the most recent release of the NCSA httpd server began creating a referrer log, using data passed to it from the browser that was connecting to your site. This data included the URL of the page currently displayed by the browser when it connected to your site. This URL, known as the referring page, gets written to the referrer log, along with the document requested from your site.
This is really useful data that receives practically no attention. With a little bit of analysis, you can determine exactly which sites are most often being viewed when a person suddenly links to your site. Clearly, if this happens a lot, the referring site probably has a link to your site.
Sifting the log
Not surprisingly, the referrer log can be found in the logs directory beneath the top-level installation directory of your Web server, in the file named referer_log. Each line in the file represents one reference to your server and looks like this:
http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html
http://webcrawler.com/cgi-bin/WebQuery -> /images.html
file:///Hard%20Drive/System%20Folder/Preferences/Netscape/Bookmarks.html
file://localhost/usr/users/la/lav6/public_html/.lynx_bookmarks.html -> /transparent_images.html
file:///I|/HTML/Referenc.htm -> /about_html.html
The URL to the left is the referring page; the path to the right is the document requested while that page was being viewed.
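If you want to work with the two halves of each entry separately, a small shell function can split them apart. This is only a sketch: it assumes every entry uses the "referrer -> document" layout shown above, and the function name split_referer_log is made up for this example rather than taken from any server distribution.

```shell
# Split each referer_log entry into "referring URL <TAB> requested document".
# Assumes the "referrer -> document" layout; lines that lack the " -> "
# separator are skipped rather than guessed at.
split_referer_log() {
    awk -F' -> ' 'NF == 2 { printf "%s\t%s\n", $1, $2 }'
}

# Example (assumes referer_log is in the current directory):
#   split_referer_log < referer_log | cut -f2 | sort | uniq -c | sort -rn
# lists the requested documents, most frequently requested first.
```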
The first of these five samples is the most common entry in the log: Some random document on the Web was being viewed when someone jumped to a document on your site. This kind of entry can represent one of three kinds of accesses to your site:
The second line is a bit trickier: you'll note that it doesn't represent a document. Instead, it represents a CGI script (the clue? the cgi-bin element in the URL). Since the site name is webcrawler.com, it's a good guess that my site was reached by someone querying WebCrawler and finding my site as a result. This is good news: my site has made it into a link index and is being found by interested parties. Hopefully, you can generate many of these kinds of crosslinks.
The last three lines represent different flavors of the same kind of link. For each of these, the visitor was viewing a document stored locally on his machine and then jumped to your page. While there is the possibility that the user suddenly typed in your URL, more than likely a link to your site was contained in that local document. Since the vast majority of local documents are actually personal hotlists, these links mean you have made it into someone's hotlist or personal document collection, the highest compliment your site can be paid.
Picking apart the referring URL can teach you a bit about the visitor. The first file URL, above, is from a Macintosh user running Netscape. How to tell? Very few other systems have blanks (%20) embedded in path names, and even fewer name their disk drives "Hard Drive."
The second file URL is from a real computer running Unix. Unix local file references are actually regular URLs that use the generic server name of "localhost" to reach the file. (Macs and PCs omit the server name for local references. Notice that their URLs begin with three slashes.) Moreover, the familiar mount point /usr is usually only found on Unix systems. In this particular case, we can also see that the user was running Lynx as his browser.
In the last file example, the user is sitting on a PC somewhere. The giveaway is the single capital letter designating the drive from which the document was retrieved, followed by a "|" which is used in PC file URLs to replace the colon (":") normally used after the drive letter. If you've ever used a PC, you also know that very few of them actually have nine disk drives attached, so that I drive is most likely a network drive being mounted on the PC over the local area network. Based upon this URL, it looks like a collection of HTML pages is being stored on a LAN file server, shared by several people, and a link to your site exists somewhere in that collection.
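The same detective work can be roughed out as a shell function. Treat this as a sketch of the heuristics just described, not as reliable platform detection: the function name guess_platform and its labels are invented for illustration, and a Unix path containing a blank, for instance, would be misfiled as a Macintosh.

```shell
# Guess a visitor's platform from a referring URL, using the rough
# heuristics described above. The labels are illustrative only.
guess_platform() {
    case "$1" in
        file://localhost/*)      echo "unix" ;;          # explicit localhost server name
        file:///[A-Za-z]"|"/*)   echo "pc" ;;            # drive letter followed by "|"
        file:///*%20*)           echo "macintosh" ;;     # blanks embedded in the path
        file:*)                  echo "unknown-local" ;; # some other local document
        *)                       echo "remote" ;;        # not a local file at all
    esac
}

# Example:
#   guess_platform 'file:///I|/HTML/Referenc.htm'
```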
Linking to yourself
Many of the entries in your referrer log will list a document on your site as the referring page. This is useful, too, since it lets you determine how people move among documents on your site. If you have few references to yourself, it means that people are hitting a single page on your site and not browsing other pages before moving to other places.
By counting the jumps between your pages, you can find the most popular path through your site. You may be surprised to discover that the way you want people to visit your site is not at all how they actually move between your pages. Too many people believe that visitors always arrive through your top-level page and work their way down your page hierarchy. In fact, a lot of people bookmark low-level pages at your site, jump directly to them, and never see your top-level page.
Your referrer log can help you find dead ends in your pages -- places where people visit but from which they never go elsewhere. You can also find the most popular paths through your pages. If certain paths are very popular, you might want to examine those pages to see why. Did you crosslink in a different way? Are the links more visible and accessible? Did you use graphics instead of text (or vice versa)? Inducing a visitor to browse more of your site is a subtle process, and slight variations in page design can drastically affect a person's desire to move on to other pages on your site.
By changing your pages and revisiting your referrer log periodically, you can see if your changes are improving your site's accessibility.
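Counting internal jumps and spotting dead ends might look something like the following. Both functions are hypothetical helpers written for this example, and they assume you pass in your own site's URL prefix (http://www.example.com stands in for it here). Keep in mind that the log only records arrivals at your pages, so a page whose visitors leave for another site looks exactly like one where they simply stop.

```shell
# Count page-to-page jumps within your own site, most frequent first.
# $1 is your site's URL prefix (hypothetical example: http://www.example.com).
# Reads referer_log entries ("referrer -> document") on standard input.
internal_jumps() {
    awk -F' -> ' -v site="$1" '
        NF == 2 && index($1, site) == 1 {
            from = substr($1, length(site) + 1)   # local path of the referring page
            print from " -> " $2
        }' | sort | uniq -c | sort -rn
}

# List requested documents that never appear as the referring page for
# another document on the same site: candidate dead ends.
dead_ends() {
    awk -F' -> ' -v site="$1" '
        NF == 2 { visited[$2] = 1 }
        NF == 2 && index($1, site) == 1 { left[substr($1, length(site) + 1)] = 1 }
        END { for (page in visited) if (!(page in left)) print page }
    ' | sort
}

# Example (assumes referer_log is in the current directory):
#   internal_jumps http://www.example.com < referer_log | head
#   dead_ends http://www.example.com < referer_log
```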
Manually processing your referrer log
You can glean a lot from your referrer log with judicious use of the egrep and wc commands, extracting specific lines from the log and counting them. For example, I know that links to my site exist from within SunWorld Online, so I counted them with
egrep sunworld referer_log | wc -l
I found 425 visits to my site that originated from pages within SunWorld Online. Since the URLs of my columns are date-encoded, I can use sed and cut to count the visits by month:
egrep sunworld referer_log | sed -e 's#^.*/sunworldonline/##' | cut -c1-12 | sort | uniq -c
This results in a quick breakdown:
 70 swol-01-1996
110 swol-02-1996
 57 swol-03-1996
  2 swol-09-1995
 20 swol-10-1995
166 swol-12-1995
Clearly, visits increase the longer a column is available, and for some reason, my January column didn't send as many readers to my site as the December or February columns did.
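The same extract-and-count approach can be pointed at every referring site at once rather than one you already know about. The function below is a sketch under the assumption that entries follow the "referrer -> document" layout; the name referring_hosts and the "(local file)" label are invented for this example.

```shell
# Tally referer_log entries by referring host, busiest first.
# Reads "referrer -> document" entries on standard input. Local file:
# referrers have no meaningful host and are grouped under "(local file)".
referring_hosts() {
    awk -F' -> ' '
        NF == 2 {
            if ($1 ~ /^file:/) { print "(local file)"; next }
            host = $1
            sub(/^[a-z]+:\/\//, "", host)   # drop the scheme, e.g. http://
            sub(/\/.*$/, "", host)          # drop everything after the host
            print host
        }' | sort | uniq -c | sort -rn
}

# Example (assumes referer_log is in the current directory):
#   referring_hosts < referer_log | head -20
```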
Automated log processing
This kind of quick and dirty processing is great when you can easily extract the desired entries, organize, and count them. The more general problem of summing all the accesses to your site from all the visitors is tedious. Fortunately, a handy tool exists to make life easier.
RefStats 1.1.1 is a Perl script written by Jerry Franz that creates a nice listing (in HTML, no less) of your referrer log entries. The entries are sorted by the documents on your site, so that all references to a single document are summarized. For example, here is a brief bit of the results for my site:
My "About HTML" page has been visited 8,440 times from folks who had just been looking at http://www.ncsa.uiuc.edu/demoweb/html-primer.html. This is a great reference, since it means my site is being referenced right from the Web documentation at the NCSA!
Since RefStats generates an HTML document, you can easily schedule it to run periodically and create a new referrer log analysis that is linked into your pages. You might want to see how things change weekly, especially if you are changing your pages and modifying your internal linking structure. The RefStats output is also handy to test the usefulness of crosslink agreements you may have with other sites. If their link isn't generating traffic to your site, you might want to renegotiate the agreement.
What now, Sherlock?
Clues like this abound in the referrer log. If you're feeling a bit like Sherlock Holmes, dive into your log and see what you can discern about your visitors. Taking the time to understand the dynamics of the links to your pages, both internal and external to your server, can pay big dividends if you use the information to fine-tune your document structure and internal links.
It's also useful to note that there is a one-for-one correspondence between entries in the access log, agent log, and the referrer log. Theoretically, you could create a report that lists the referring URL, the document visited, the browser used, and the site from which the reference came. Correlating this by hand is nearly impossible, and I've not yet found a tool on the Web that does the trick. Do you know of such a tool? Drop me a line, and I'll share it here.
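As a first approximation, the one-for-one correspondence means the three logs can simply be glued together line by line. The sketch below does only that; it is not a reporting tool. It assumes all three files cover exactly the same requests, with no entries lost or rotated away in one log but not the others, and the function name correlate_logs is hypothetical.

```shell
# Combine the three logs line by line, relying on the one-for-one
# correspondence between their entries. Output is one line per request,
# with the three original entries separated by tabs.
#   $1 = access log, $2 = agent log, $3 = referrer log
correlate_logs() {
    paste "$1" "$2" "$3"
}

# Example (the file names are assumptions; use your server's own):
#   correlate_logs access_log agent_log referer_log | less
```

If the line counts of the three files differ, the rows no longer line up, so it is worth comparing them with wc -l before trusting the result.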
I'm still accepting browser usage statistics, as detailed in last month's column. If you take the time to generate those statistics, send them to me, and I'll publish the aggregate results.
About the author
Chuck Musciano has been running Melmac and the HTML Guru Home Page for two years, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta-tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics. His book, HTML: The Definitive Guide, is currently available from O'Reilly and Associates.