Controlling Web Robots
Ever had a spider bombard your Web site with page requests? Want to stop it? Want to keep parts of your site from being indexed by search engine robots? Read on...
Web robots -- automated programs that search the Web for data -- are not always a Webmaster's best friend. In fact, they can sometimes be a pain. They can hog bandwidth and poke about in places they shouldn't. Luckily, with a little knowledge of how robots operate and a basic understanding of the Robot Exclusion Protocol you too can keep robots from running amok. (1,900 words)
"Klaatu barada nikto."
With those three words, Helen Benson (played by Patricia Neal) unleashed the power of Gort, the robot accompanying the extraterrestrial Klaatu (played by Michael Rennie) in the 1951 movie classic The Day The Earth Stood Still. Things were easier back then; Gort was capable of melting tanks with a beam emanating from his helmet visor, yet his owner needed just a few verbal commands to control him completely.
If only it were so easy today. We are surrounded by robots on the Web, some useful, some bothersome, but all much less readily controlled than Gort. On a positive note, the worst any Web robot can do is to flood your server with extra requests; tank-melting hasn't even been considered for inclusion in the robot control protocol currently proposed by the Internet Engineering Task Force.
Still, every Webmaster should know how robots work, what they do, and how to control their access to your site. This month, we'll cover all that so that you'll be able to handle most any robot (with the possible exception of Gort) that may come your way.
As soon as the Web got too big to keep track of manually, automated programs, or robots, were developed to walk the Web and catalog everything they found. For obvious reasons, such robots are often known as "spiders." Spiders move from site to site, collecting data of various forms and cataloguing everything they find in a database. Once collected, the data might be used to drive a search engine, develop statistics, or simply be used to populate a private repository of document links.
Most people think of Web robots as builders of search engine databases. While this is an important function, many other robots are roaming the Web, often on a much smaller scale. Some robots search only for specific documents, either to create focused collections on behalf of some user, or to detect page theft and copyright violations. Other robots simply test every URL they discover, ensuring that a site is not suffering from Web-rot. Smart Webmasters use these robots to keep their sites in perfect condition. Finally, some robots conduct performance and availability tests to see if sites are online and measure how long it takes to download their pages.
Early robot developers learned several lessons in short order. The most important one concerned bandwidth and the lack thereof. Left unchecked, robots can saturate a site with requests, however innocent. Real visitors can't get in while a robot is consuming all the available bandwidth to a site. These unintentional denial of service attacks caused outrage among the affected Webmasters. In response, robot authors forced their robots to wait, often several minutes or more, between requests to a site.
The next problem concerned access statistics. Many Webmasters were elated to see their site's access count growing, only to discover that the vast majority of visitors were robots. To keep things in perspective, robot authors began including identifying strings as part of the HTTP header they sent to the server. The Webmaster could then ignore robot accesses when tabulating access counts for their site.
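A polite robot announces itself by setting an identifying User-Agent header on every request. As a sketch only (the robot name and URLs are invented for illustration), here is how such a request might be constructed in Python:

```python
# Sketch of a polite robot identifying itself via the HTTP
# User-Agent header. The robot name and URLs are hypothetical.
import urllib.request

req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "charlotte/1.0 (+http://www.example.com/bot)"},
)

# The header travels with every request, so the Webmaster can
# recognize -- and, if desired, discount -- the robot's visits.
print(req.get_header("User-agent"))
```

A Webmaster tabulating access counts can then filter log entries whose agent string matches the robot's advertised name.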
Finally, Webmasters began seeing parts of their sites in search engine databases that they didn't want indexed: CGI scripts, images, prototype pages, and other private data. Having your site indexed is a good thing, but having all of your site made public is a bit too much. In response, the Robot Exclusion Protocol was developed, allowing authors to specify exactly who is allowed to index what on their site.
The Robot Exclusion Protocol
The Robot Exclusion Protocol lets a Webmaster specify which robots, based upon their identifying strings, can access their site. Of those robots allowed access, the Webmaster can then specify which parts of the site are available for access.
As a Webmaster, you need only create a single file on your site to
control every robot that happens by. This file is named
robots.txt and must be placed in the top-level
document directory on your site. Well-behaved robots will read this
file before visiting your site further, and will only access your
site if the file grants them access. Within this file, you can
place three kinds of lines: comments, user agent specifications, and
resource access controls.
Comment lines are easy: any line beginning with a "#" character is a comment and is ignored by the robot. It makes sense to comment your entries in the file, if only for your later perusal; keep in mind that this is one file on your site that is rarely, if ever, read by humans.
The meat of the file consists of the sets of user agent and resource control lines that you add. These lines appear in a specific order: first one or more user agent lines, followed by one or more resource control lines. Blank lines separate these sets of lines; blank lines may not appear within the groups.
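To illustrate the layout, a file with two groups (the directory names are invented) might look like this, with a blank line separating the groups:

```
# Keep Charlotte out of the scripts directory,
# and keep every robot out of the drafts area.
User-agent: charlotte
Disallow: /cgi-bin

User-agent: *
Disallow: /drafts
```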
A quick example should clarify all of this. Suppose you've
discovered that a spider named "Charlotte" has been visiting your
CGI script directory. You could restrict further access with this robots.txt file:

# Sample robots.txt file
User-agent: charlotte
Disallow: /cgi-bin
A few nits right off the bat: the name in the
User-agent line is case-insensitive and need not
completely match the robot's name. Thus, a name of "CHARLOTTE"
would work just as well, as would "charl."
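If you'd like to test rules like these programmatically, modern scripting languages include robots.txt parsers; this sketch uses Python's standard urllib.robotparser module (the robot name comes from our example above):

```python
# Check whether a given robot may fetch a path, using Python's
# standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
# Sample robots.txt file
User-agent: charlotte
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Charlotte is barred from the CGI directory...
print(parser.can_fetch("charlotte", "/cgi-bin/search.pl"))  # False
# ...but may still fetch ordinary pages.
print(parser.can_fetch("charlotte", "/index.html"))         # True
```

Conveniently, this parser also treats the agent name as a case-insensitive partial match, mirroring the behavior described above.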
The pathname specified in the
Disallow line represents
a portion of a virtual path on your server. Any path that begins
with the specified string will be ignored by the robot. In this
case, we're assuming that
/cgi-bin is a directory;
everything within it will be ignored by the robot. Since exclusion
is based on simple string matches, a file named
/cgi-bin.html would be ignored as well. If you want to
get tricky, consider this disallowed resource:

Disallow: /~
Since any user-level pages on a site are usually referenced relative to
that user's home directory using a tilde ("~"), this
Disallow line prevents any user-level pages from being
visited by the robot.
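Because exclusion is nothing more than a prefix comparison, the behavior described above takes only a few lines to verify. This is a sketch of the matching rule itself, not any particular robot's implementation:

```python
def is_disallowed(path, disallow_prefix):
    """Simple prefix match used by the Robot Exclusion Protocol:
    a path is blocked if it begins with the Disallow value."""
    return path.startswith(disallow_prefix)

# "/cgi-bin" blocks the directory and, as a side effect,
# any path that merely starts with the same string.
print(is_disallowed("/cgi-bin/search.pl", "/cgi-bin"))  # True
print(is_disallowed("/cgi-bin.html", "/cgi-bin"))       # True

# A tilde prefix blocks all user-level pages.
print(is_disallowed("/~chuck/home.html", "/~"))         # True
print(is_disallowed("/index.html", "/~"))               # False
```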
You can use an asterisk ("*") as a wild card in the
User-agent line. This is not a Unix-style
regular expression, but a special value that means "match all
agents." All robots must obey any resource controls associated with
this special user agent name. Thus, to prevent any robot from
seeing that CGI directory, you might use:
User-agent: *
Disallow: /cgi-bin
You can create a similar wild card for the Disallow directive
with the "/" character. Since the match is based on simple string
matching, and since every URL on your site must begin with a "/",
this path matches every path on your site. If you want to keep
Charlotte away completely, use:
User-agent: charlotte
Disallow: /
If you are truly opposed to any robot visitation, you can combine both wild cards:
User-agent: *
Disallow: /
We've focused on the
Disallow directive up to this
point, but keep in mind that the formal Robot Exclusion Protocol
specification also provides for an
Allow directive. It
works the same as the
Disallow directive except that it
is used to grant access to a specific resource. While the spec
provides for this directive, it is not in general use, and I wouldn't
count on too many robots honoring it.
When a robot reads the
robots.txt file, it finds the
User-agent line that matches its name. It then
compares each Disallow (or Allow) directive to the
desired resource path until it finds the first match. If the match
is disallowed, the robot will not request it from the server. If
the match is allowed, or does not match any directive, the robot is
free to access that resource on the server.
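The lookup just described can be sketched in a few lines of Python. This is a minimal illustration of the first-match rule, not any real robot's code:

```python
# How a well-behaved robot applies a parsed robots.txt record:
# scan the directives in file order and honor the first prefix
# match; if nothing matches, access is permitted.
def robot_may_fetch(directives, path):
    """directives: list of (verb, prefix) pairs in file order,
    where verb is "Allow" or "Disallow"."""
    for verb, prefix in directives:
        if path.startswith(prefix):
            return verb == "Allow"
    return True  # no directive matched: access permitted

rules = [("Disallow", "/cgi-bin"), ("Disallow", "/private")]
print(robot_may_fetch(rules, "/cgi-bin/search.pl"))  # False
print(robot_may_fetch(rules, "/index.html"))         # True
```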
The Robot <meta> tag
The Robot Exclusion Protocol can only be managed through the
robots.txt file located at the top-level of your site.
If you cannot create or modify this file on your site due to access
restrictions, how can you control robots attempting to see your pages?
There is no solution to this problem that controls access to
non-HTML resources on your site, but within your HTML pages there is
a <meta> tag you can use to control how a
robot indexes each page.
Within this <meta> tag, you supply a
name attribute of "robots". The
content attribute can contain any of the values
"index", "noindex", "follow", or "nofollow", separated by commas.
For example, to prevent a page from being included in an index, you would place
<meta name="robots" content="noindex">
in your document's
<head>. While this keeps the
page from being added to the index, it does not prevent the robot
from parsing the page, extracting any URLs, and visiting those
pages. To prevent the robot from traveling beyond this page, use
the nofollow value for the content
attribute. You can prevent both indexing and follow-on by combining
both values:
<meta name="robots" content="noindex, nofollow">
The only disadvantage to using this tag is that few robots currently honor it.
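A robot that does honor the tag must parse each page's <head> before deciding what to do with it. As a sketch of that bookkeeping (the sample page text is invented), using Python's standard HTML parser:

```python
# Sketch of how an indexing robot might honor the robots <meta>
# tag. The sample page below is an invented example.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.index = True    # may the page be indexed?
        self.follow = True   # may its links be followed?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            values = [v.strip().lower()
                      for v in attrs.get("content", "").split(",")]
            if "noindex" in values:
                self.index = False
            if "nofollow" in values:
                self.follow = False

page = ('<html><head>'
        '<meta name="robots" content="noindex, nofollow">'
        '</head><body>Private page</body></html>')
p = RobotsMetaParser()
p.feed(page)
print(p.index, p.follow)  # False False
```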
Other control methods
The premise of the Robot Exclusion Protocol is that robots are well-behaved, look for
robots.txt files and robot
<meta> tags, and honor what they find. Robots
that are poorly written or that intentionally ignore the protocol
will visit your site no matter what you put in the
robots.txt file or robot <meta> tags.
There is little you can do to stop these robots, short of determining their originating server address and preventing that server from visiting your site. All servers have ways to restrict access to specific machines; consult your server documentation for more details.
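As one example, an Apache-style server can deny requests from a specific machine in its configuration. The exact syntax varies by server and version, and the host and directory names here are invented; consult your own server's documentation:

```
# Apache-style host-based access control (directives from
# mod_access): deny requests from one misbehaving host.
<Directory "/usr/local/www/docs">
    Order allow,deny
    Allow from all
    Deny from rogue-spider.example.com
</Directory>
```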
For more information
There are several resources on the Web dealing with robots and the Robot Exclusion Protocol. The protocol draft specification (http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html) is a bit dry but worth a read. You'll probably find the Web Robots Page (http://info.webcrawler.com/mak/projects/robots/robots.html) easier to read and much more helpful. It includes links to several useful robot-related documents, including the Web Robots FAQ (http://info.webcrawler.com/mak/projects/robots/faq.html) and the Robots Mailing List (http://info.webcrawler.com/mailing-lists/robots/info.html).
Next month, we'll look at another voluntary access control standard, the Platform for Internet Content Selection (PICS).
About the author
Chuck Musciano has been running various Web sites, including the HTML Guru Home Page, since early 1994, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta-tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics.