
Analyzing your referrer log

Who's linking to your pages? Your server's referrer log has all the clues

May 1996

Abstract

Server popularity is based largely on other sites linking to your own. Find out who's linking to you with your referrer log. (2,100 words)
Editor's note: We're taking a stand on proper spelling! We acknowledge that the term "referer" log is widely spelled as such, but it is in fact misspelled. Therefore, we will only use "referer" in this article when it is part of a technical notation.


In my previous two columns, we've picked apart your server's access log to figure out who has been visiting your site, and your agent log to figure out the browser being used when they visited. This month, we'll complete this "trilogy of traces" by examining your server's referrer log.

The importance of being linked
This may sound like a bit of a no-brainer, but the more sites that link to your site, the more people are likely to visit your site. To create a successful site, one of your primary goals is to get others to link to you. This increases the chances that some random net surfer will stumble across a link to your site, visit, and be so enthralled by your content that he'll come back again and again.

Before the Web exploded, you could actually make a decent estimate of how many links referred to your site by surfing around and counting them. Some of the more industrious folks created spiders that rummaged around finding links back to their site. Once the Web began its exponential growth, this kind of exploratory accounting became impossible.

Around that same time, the most recent release of the NCSA httpd server began creating a referrer log, using data passed to it from the browser that was connecting to your site. This data included the URL of the page currently displayed by the browser when it connected to your site. This URL, known as the referring page, gets written to the referrer log, along with the document requested from your site.

This is really useful data that receives practically no attention. With a little bit of analysis, you can determine exactly which sites are most often being viewed when a person suddenly links to your site. Clearly, if this happens a lot, the referring site probably has a link to your site.


Sifting the log
Not surprisingly, the referrer log can be found in the logs directory beneath the top-level installation directory of your Web server, in the file named referer_log. Each line in the file represents one reference to your server and looks like this:

http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html
http://webcrawler.com/cgi-bin/WebQuery -> /images.html
file:///Hard%20Drive/System%20Folder/Preferences/Netscape/Bookmarks.html
file://localhost/usr/users/la/lav6/public_html/.lynx_bookmarks.html -> /transparent_images.html
file:///I|/HTML/Referenc.htm -> /about_html.html

The URL to the left is the referring page; the path to the right is the document requested while that page was being viewed.
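
As a minimal sketch of how such a line might be taken apart in a script, the following Python assumes the "referring URL -> document" layout shown above. The helper name parse_referer_line is hypothetical and not part of any server distribution.

```python
from typing import Optional, Tuple


def parse_referer_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one referer_log line into (referring URL, requested document).

    Returns None when the line lacks the ' -> ' separator or either half
    is empty, for example a line that was truncated or wrapped.
    """
    referrer, sep, document = line.strip().partition(" -> ")
    if not sep or not referrer or not document:
        return None
    return referrer, document


sample = "http://webcrawler.com/cgi-bin/WebQuery -> /images.html"
print(parse_referer_line(sample))
# ('http://webcrawler.com/cgi-bin/WebQuery', '/images.html')
```

Returning None for malformed lines, rather than raising an error, lets a longer script skip damaged entries and keep counting.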

The first of these five samples is the most common entry in the log: Some random document on the Web was being viewed when someone jumped to a document on your site. This kind of entry can represent one of three kinds of accesses to your site:

If the same URL occurs many times in your log, it probably means that a link to your site exists on that page. "Many times" is a difficult number to settle on. I've discovered that as few as five entries in my referrer log reliably indicate that a link does, indeed, exist on that page.
If the URL occurs just a few times, there is probably not a link to your site from that page. How, then, did the visitor get to your site while viewing that page? Two possibilities:
- The user has a link to your site on his hotlist and was suddenly compelled to visit your site while viewing this page, selecting your site from his hotlist, or
- The user actually typed in the URL of your site while viewing this page. If your site is publicized via e-mail, Usenet news announcements, or other non-Web media, including legacy media like magazine, newspaper, and television ads, this is a likely scenario. Advertising literature like business cards, brochures, and pamphlets also leads to this kind of access.
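
The rule of thumb above can be sketched in Python as a simple tally. This is only an illustration: the threshold of five comes from the text, while the function name likely_linking_pages and the example.org and example.net URLs are made up for the demonstration.

```python
from collections import Counter

LINK_THRESHOLD = 5  # rule of thumb from the text; tune it for your own traffic


def likely_linking_pages(lines, threshold=LINK_THRESHOLD):
    """Return {referring URL: count} for URLs seen at least `threshold` times."""
    counts = Counter()
    for line in lines:
        referrer, sep, _document = line.strip().partition(" -> ")
        if sep and referrer:
            counts[referrer] += 1
    return {url: n for url, n in counts.items() if n >= threshold}


# Hypothetical log: one page refers to us six times, another only twice.
log = (
    ["http://example.org/links.html -> /about_html.html"] * 6
    + ["http://example.net/other.html -> /images.html"] * 2
)
print(likely_linking_pages(log))
# {'http://example.org/links.html': 6}
```

Referrers that fall below the threshold are the candidates for hotlist visits or hand-typed URLs.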

The second line is a bit trickier: you'll note that it doesn't represent a document. Instead, it represents a CGI script (the clue? the cgi-bin element in the URL). Since the site name is webcrawler.com, it's a good guess that my site was reached by someone querying WebCrawler and finding my site as a result. This is good news: my site has made it into a link index and is being found by interested parties. Hopefully, you can generate many of these kinds of crosslinks.
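
To pick such entries out of a large log, one possible approach is to test each referring URL for the cgi-bin path element and report the host it came from. The sketch below uses Python's standard urllib.parse module; the function name describe_referrer is hypothetical, and the cgi-bin test is only the heuristic described above, not a reliable way to identify every search engine.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def describe_referrer(url: str) -> Tuple[Optional[str], bool]:
    """Return (host name, looks_like_cgi) for a referring URL.

    The host is None for URLs without one, such as file:/// bookmarks.
    """
    parts = urlsplit(url)
    return parts.hostname, "/cgi-bin/" in parts.path


print(describe_referrer("http://webcrawler.com/cgi-bin/WebQuery"))
# ('webcrawler.com', True)
```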

The last three lines represent different flavors of the same kind of link. For each of these, the visitor was viewing a document stored locally on his machine and then jumped to your page. While there is the possibility that the user suddenly typed in your URL, more than likely a link to your site was contained in that local document. Since the vast majority of local documents are actually personal hotlists, these links mean you have made it into someone's hotlist or personal document collection, the highest compliment your site can be paid.

Picking apart the referring URL can teach you a bit about the visitor. The first file URL, above, is from a Macintosh user running Netscape. How to tell? Very few other systems have blanks (%20) embedded in path names, and even fewer name their disk drives "Hard Drive."

The second file URL is from a real computer running Unix. Unix local file references are actually regular URLs that use the generic server name of "localhost" to reach the file. (Macs and PCs omit the server name for local references. Notice that their URLs begin with three slashes.) Moreover, the familiar mount point of /usr is usually only found on Unix systems. In this particular case, we can also see that the user was running Lynx as his browser.

In the last file example, the user is sitting on a PC somewhere. The giveaway is the single capital letter designating the drive from which the document was retrieved, followed by a "|", which is used in PC file URLs to replace the colon (":") normally used after the drive letter. If you've ever used a PC, you also know that very few of them actually have nine disk drives attached, so that I drive is most likely a network drive being mounted on the PC over the local area network. Based upon this URL, it looks like a collection of HTML pages is being stored on a LAN file server, shared by several people, and a link to your site exists somewhere in that collection.
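
The three giveaways just described (the localhost server name, the drive letter followed by "|", and embedded %20 blanks) can be turned into a rough classifier. This Python sketch is a guess-maker under those assumptions only; the function name guess_platform is hypothetical, and plenty of local URLs will match none of the rules.

```python
import re


def guess_platform(url: str) -> str:
    """Guess the visitor's platform from a file: referring URL."""
    if not url.startswith("file:"):
        return "not a local file"
    if url.startswith("file://localhost/"):
        return "Unix"  # explicit localhost server name
    if re.match(r"file:///[A-Za-z]\|/", url):
        return "PC"  # drive letter with '|' standing in for ':'
    if "%20" in url:
        return "Macintosh"  # blanks embedded in path names
    return "unknown"


for url in (
    "file:///Hard%20Drive/System%20Folder/Preferences/Netscape/Bookmarks.html",
    "file://localhost/usr/users/la/lav6/public_html/.lynx_bookmarks.html",
    "file:///I|/HTML/Referenc.htm",
):
    print(guess_platform(url))
# Macintosh
# Unix
# PC
```

The order of the tests matters: the drive-letter check runs before the %20 check so that a PC path containing blanks is still reported as a PC.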

Linking to yourself
Many of the entries in your referrer log will list a document on your site as the referring page. This is useful, too, since it lets you determine how people move among documents on your site. If you have few references to yourself, it means that people are hitting a single page on your site and not browsing other pages before moving to other places.

By counting the jumps between your pages, you can find the most popular path through your site. You may be surprised to discover that the way you want people to visit your site is not at all how they actually move between your pages. Too many people believe that visitors always arrive through your top-level page and work their way down your page hierarchy. In fact, a lot of people bookmark low-level pages at your site, jump directly to them, and never see your top-level page.

Your referrer log can help you find dead ends in your pages -- places where people visit but from which they never go elsewhere. You can also find the most popular paths through your pages. If certain paths are very popular, you might want to examine those pages to see why. Did you crosslink in a different way? Are the links more visible and accessible? Did you use graphics instead of text (or vice versa)? Inducing a visitor to browse more of your site is a subtle process, and slight variations in page design can drastically affect a person's desire to move on to other pages on your site.
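
Counting jumps and spotting dead ends might be sketched as follows. The base URL http://www.example.com is a placeholder for your own server, and the function names internal_jumps and dead_ends are hypothetical. The sketch assumes that internal referrers appear in the log as full URLs beginning with that base.

```python
from collections import Counter

SITE = "http://www.example.com"  # placeholder: replace with your server's base URL


def internal_jumps(lines, site=SITE):
    """Count (from_page, to_page) pairs where the referrer is on our own site."""
    jumps = Counter()
    for line in lines:
        referrer, sep, document = line.strip().partition(" -> ")
        if sep and referrer.startswith(site + "/"):
            jumps[(referrer[len(site):], document)] += 1
    return jumps


def dead_ends(lines, site=SITE):
    """Return pages that were requested but never appear as an internal referrer."""
    requested, left_from = set(), set()
    for line in lines:
        referrer, sep, document = line.strip().partition(" -> ")
        if not sep:
            continue
        requested.add(document)
        if referrer.startswith(site + "/"):
            left_from.add(referrer[len(site):])
    return requested - left_from


log = [
    "http://www.example.com/index.html -> /about_html.html",
    "http://www.example.com/index.html -> /about_html.html",
    "http://www.example.com/about_html.html -> /images.html",
    "http://webcrawler.com/cgi-bin/WebQuery -> /images.html",
]
print(internal_jumps(log).most_common(1))
# [(('/index.html', '/about_html.html'), 2)]
print(dead_ends(log))
# {'/images.html'}
```

Keep in mind that browsers also report the enclosing page as the referrer when they fetch inline images, so you may want to restrict both functions to .html documents before drawing conclusions.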

By changing your pages and revisiting your referrer log periodically, you can see if your changes are improving your site's accessibility.

Manually processing your referrer log
You can glean a lot from your referrer log with judicious use of the sed, grep, and cut commands, extracting specific lines from the log and counting them. For example, I know that links to my site exist from within SunWorld Online, so I counted them with

     egrep sunworld referer_log | wc -l

I found 425 visits to my site that originated from pages within SunWorld Online. Since the URLs of my columns are date-encoded, I can use sed and cut to count the visits by month:

     egrep sunworld referer_log | sed -e 's#^.*/sunworldonline/##' | cut -c1-12 | sort | uniq -c

This results in a quick breakdown:

       70 swol-01-1996
      110 swol-02-1996
       57 swol-03-1996
        2 swol-09-1995
       20 swol-10-1995
      166 swol-12-1995

Clearly, visits increase the longer a column is available, and for some reason, my January column didn't attract as many readers likely to jump to my site as December or February.

Automated log processing
This kind of quick and dirty processing is great when you can easily extract the desired entries, organize, and count them. The more general problem of summing all the accesses to your site from all the visitors is tedious. Fortunately, a handy tool exists to make life easier.

RefStats 1.1.1 is a Perl script written by Jerry Franz that creates a nice listing (in HTML, no less) of your referrer log entries. The entries are sorted by the documents on your site, so that all references to a single document are summarized. For example, here is a brief bit of the results for my site:

/about_html.html:
- http://www.ncsa.uiuc.edu/demoweb/html-primer.html (8,440 references)
- http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html (7,331 references)
- http://www-slis.lib.indiana.edu/Internet/programmer-page.html (912 references)
- http://www.isisnet.com/mlindsay/kitrex3.html (384 references)
- http://webcrawler.com/cgi-bin/WebQuery (353 references)
- http://www.vol.it/IT/IT/SCIENZA/INFORMATICA/guide.htm (341 references)
- http://info.med.yale.edu/caim/M_Resources.HTML (329 references)
- http://melmac.corp.harris.com/local_stuff.html (294 references)
- http://www.oswego.edu/personal.hp/create.pers.html (259 references)

My "About HTML" page has been visited 8,440 times by folks who had just been looking at http://www.ncsa.uiuc.edu/demoweb/html-primer.html. This is a great reference, since it means my site is being referenced right from the Web documentation at the NCSA!

Since RefStats generates an HTML document, you can easily schedule it to run periodically and create a new referrer log analysis that is linked into your pages. You might want to see how things change weekly, especially if you are changing your pages and modifying your internal linking structure. The RefStats output is also handy to test the usefulness of crosslink agreements you may have with other sites. If their link isn't generating traffic to your site, you might want to renegotiate the agreement.
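
If you want to experiment with this kind of per-document summary yourself, the grouping can be approximated in a few lines of Python. To be clear, this is not RefStats or a port of it, only a sketch of the same idea; the function name referrers_by_document and the a.example and b.example hosts are invented for the illustration.

```python
from collections import Counter, defaultdict


def referrers_by_document(lines):
    """Map each requested document to its referrers, most frequent first."""
    table = defaultdict(Counter)
    for line in lines:
        referrer, sep, document = line.strip().partition(" -> ")
        if sep and referrer and document:
            table[document][referrer] += 1
    # Sort documents by name; most_common() orders referrers by count.
    return {doc: counts.most_common() for doc, counts in sorted(table.items())}


log = (
    ["http://a.example/x.html -> /about_html.html"] * 3
    + ["http://b.example/y.html -> /about_html.html"]
    + ["http://a.example/x.html -> /images.html"]
)
for document, referrers in referrers_by_document(log).items():
    print(document)
    for url, count in referrers:
        print(f"  {url} ({count} references)")
```

Writing the result out as HTML instead of plain text is a small step from here, and the script can be scheduled the same way.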

What now, Sherlock?
Clues like this abound in the referrer log. If you're feeling a bit like Sherlock Holmes, dive into your log and see what you can discern about your visitors. Taking the time to understand the dynamics of the links to your pages, both internal and external to your server, can pay big dividends if you use the information to fine-tune your document structure and internal links.

It's also useful to note that there is a one-for-one correspondence between entries in the access log, agent log, and the referrer log. Theoretically, you could create a report that lists the referring URL, the document visited, the browser used, and the site from which the reference came. Correlating this by hand is nearly impossible, and I've not yet found a tool on the Web that does the trick. Do you know of such a tool? Drop me a line, and I'll share it here.
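
A starting point for such a report might look like the Python sketch below, which simply pairs the Nth line of each log. It leans entirely on the one-for-one correspondence mentioned above, and it assumes that the access log begins each line with the visiting host and that the agent log holds one browser string per line. The function name correlate and the sample lines are hypothetical.

```python
def correlate(access_lines, agent_lines, referer_lines):
    """Yield one dict per request by pairing the Nth line of each log.

    zip() stops at the shortest input, so trailing unmatched lines are ignored.
    """
    for access, agent, referer in zip(access_lines, agent_lines, referer_lines):
        referrer, _sep, document = referer.strip().partition(" -> ")
        yield {
            "host": access.split(" ", 1)[0],  # first field of the access log line
            "referrer": referrer,
            "document": document,
            "browser": agent.strip(),
        }


# Made-up sample lines, one request in each log.
access = ['pc1.example.com - - [01/May/1996:10:00:00 -0400] "GET /images.html HTTP/1.0" 200 1234']
agent = ["Mozilla/2.0 (Macintosh; I; PPC)"]
referer = ["http://webcrawler.com/cgi-bin/WebQuery -> /images.html"]

for row in correlate(access, agent, referer):
    print(row)
```

If the three logs ever drift out of step, every row after that point will be wrong, so it is worth checking that the requested document in the access log line matches the one in the referrer log line before trusting the output.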

I'm still accepting browser usage statistics, as detailed in last month's column. If you take the time to generate those statistics, send them to me, and I'll publish the aggregate results.


Resources

"Collecting and using server statistics"
/sunworldonline/swol-03-1996/swol-03-webmaster.html
"What browser are you designing for?"
/sunworldonline/swol-04-1996/swol-04-webmaster.html
Apache
http://www.apache.org/
Netscape
http://www.netscape.com/
Chuck Musciano's sed script
http://members.aol.com/htmlguru/agent_log.html
melmac, columnist Chuck Musciano's server
http://melmac.corp.harris.com/
HTML Guru Home Page, columnist Chuck Musciano's other server
http://members.aol.com/htmlguru/
Yahoo's log tool references
http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/
analog
http://www.statslab.cam.ac.uk/~sret1/analog/
NCSA httpd documentation
http://www.tesre.bo.cnr.it/docs/Overview.html
"Watching your Web server"
/sunworldonline/swol-03-1996/swol-03-perf.html
HTML: The Definitive Guide
http://www.ora.com/www/item/html.html
Other Webmaster
/sunworldonline/common/swol-backissues-columns.html#webmaster

About the author
Chuck Musciano has been running Melmac and the HTML Guru Home Page for two years, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta-tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics. His book, HTML: The Definitive Guide, is currently available from O'Reilly and Associates.

If you have technical problems with this magazine, contact webmaster@sunworld.com

URL: http://www.sunworld.com/swol-05-1996/swol-05-webmaster.html