Controlling Web Robots
Ever had a spider bombard your Web site with page requests? Want to stop it? Want to keep parts of your site from being indexed by search engine robots? Read on...
Web robots -- automated programs that search the Web for data -- are not always a Webmaster's best friend. In fact, they can sometimes be a pain. They can hog bandwidth and poke about in places they shouldn't. Luckily, with a little knowledge of how robots operate and a basic understanding of the Robot Exclusion Protocol you too can keep robots from running amok. (1,900 words)
"Klaatu barada nikto."
With those three words, Helen Benson (played by Patricia Neal) unleashed the power of Gort, the robot accompanying the extraterrestrial Klaatu (played by Michael Rennie) in the 1951 movie classic The Day The Earth Stood Still. Things were easier back then; Gort was capable of melting tanks with a beam emanating from his helmet visor, yet his owner needed just a few verbal commands to control him completely.
If only it were so easy today. We are surrounded by robots on the Web, some useful, some bothersome, but all much less readily controlled than Gort. On a positive note, the worst any Web robot can do is to flood your server with extra requests; tank-melting hasn't even been considered for inclusion in the robot control protocol currently proposed by the Internet Engineering Task Force.
Still, every Webmaster should know how robots work, what they do, and how to control their access to your site. This month, we'll cover all that so that you'll be able to handle most any robot (with the possible exception of Gort) that may come your way.
As soon as the Web got too big to keep track of manually, automated programs, or robots, were developed to walk the Web and catalog everything they found. For obvious reasons, such robots are often known as "spiders." Spiders move from site to site, collecting data of various forms and cataloguing everything they find in a database. Once collected, the data might be used to drive a search engine, develop statistics, or simply be used to populate a private repository of document links.
Most people think of Web robots as builders of search engine databases. While this is an important function, many other robots are roaming the Web, often on a much smaller scale. Some robots search only for specific documents, either to create focused collections on behalf of some user, or to detect page theft and copyright violations. Other robots simply test every URL they discover, ensuring that a site is not suffering from Web-rot. Smart Webmasters use these robots to keep their sites in perfect condition. Finally, some robots conduct performance and availability tests to see if sites are online and measure how long it takes to download their pages.
Early robot developers learned several lessons in short order. The most important one concerned bandwidth and the lack thereof. Left unchecked, robots can saturate a site with requests, however innocent. Real visitors can't get in while a robot is consuming all the available bandwidth to a site. These unintentional denial of service attacks caused outrage among the affected Webmasters. In response, robot authors forced their robots to wait, often several minutes or more, between requests to a site.
The next problem concerned access statistics. Many Webmasters were elated to see their site's access count growing, only to discover that the vast majority of visitors were robots. To keep things in perspective, robot authors began including identifying strings as part of the HTTP header they sent to the server. The Webmaster could then ignore robot accesses when tabulating access counts for their site.
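A polite robot announces itself by setting an identifying User-Agent header on every request. As a sketch only (the robot name and URLs are invented for illustration), here is how such a request might be constructed in Python:

```python
# Sketch of a polite robot identifying itself via the HTTP
# User-Agent header. The robot name and URLs are hypothetical.
import urllib.request

req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "charlotte/1.0 (+http://www.example.com/bot)"},
)

# The header travels with every request, so the Webmaster can
# recognize -- and, if desired, discount -- the robot's visits.
print(req.get_header("User-agent"))
```

A Webmaster tabulating access counts can then filter log entries whose agent string matches the robot's advertised name.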
Finally, Webmasters began seeing parts of their sites in search engine databases that they didn't want indexed: CGI scripts, images, prototype pages, and other private data. Having your site indexed is a good thing, but having all of your site made public is a bit too much. In response, the Robot Exclusion Protocol was developed, allowing authors to specify exactly who is allowed to index what on their site.
The Robot Exclusion Protocol
The Robot Exclusion Protocol lets a Webmaster specify which robots, based upon their identifying strings, can access their site. Of those robots allowed access, the Webmaster can then specify which parts of the site are available for access.
As a Webmaster, you need only create a single file on your site to
control every robot that happens by. This file is named
robots.txt and must be placed in the top-level
document directory on your site. Well-behaved robots will read this
file before visiting your site further, and will only access your
site if the file grants them access. Within this file, you can
place three kinds of lines: comments, user agent specifications, and
resource access controls.
Comment lines are easy: any line beginning with a "#" character is a comment and is ignored by the robot. It makes sense to comment your entries in the file, if only for your later perusal; keep in mind that this is one file on your site that is rarely, if ever, read by humans.
The meat of the file consists of the sets of user agent and resource control lines that you add. These lines appear in a specific order: first one or more user agent lines, followed by one or more resource control lines. Blank lines separate these sets of lines; blank lines may not appear within the groups.
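To illustrate the layout, a file with two groups (the directory names are invented) might look like this, with a blank line separating the groups:

```
# Keep Charlotte out of the scripts directory,
# and keep every robot out of the drafts area.
User-agent: charlotte
Disallow: /cgi-bin

User-agent: *
Disallow: /drafts
```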
A quick example should clarify all of this. Suppose you've
discovered that a spider named "Charlotte" has been visiting your
CGI script directory. You could restrict further access with this robots.txt file:

# Sample robots.txt file
User-agent: charlotte
Disallow: /cgi-bin
A few nits right off the bat: the name in the
User-agent line is case-insensitive and need not
completely match the robot's name. Thus, a name of "CHARLOTTE"
would work just as well, as would "charl."
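If you'd like to test rules like these programmatically, modern scripting languages include robots.txt parsers; this sketch uses Python's standard urllib.robotparser module (the robot name comes from our example above):

```python
# Check whether a given robot may fetch a path, using Python's
# standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
# Sample robots.txt file
User-agent: charlotte
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Charlotte is barred from the CGI directory...
print(parser.can_fetch("charlotte", "/cgi-bin/search.pl"))  # False
# ...but may still fetch ordinary pages.
print(parser.can_fetch("charlotte", "/index.html"))         # True
```

Conveniently, this parser also treats the agent name as a case-insensitive partial match, mirroring the behavior described above.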
The pathname specified in the
Disallow line represents
a portion of a virtual path on your server. Any path that begins
with the specified string will be ignored by the robot. In this
case, we're assuming that
/cgi-bin is a directory;
everything within it will be ignored by the robot. Since exclusion
is based on simple string matches, a file named
/cgi-bin.html would be ignored as well. If you want to
get tricky, consider this disallowed resource:

Disallow: /~
Since any user-level pages on a site are usually referenced relative to
that user's home directory using a tilde ("~"), this
Disallow line prevents any user-level pages from being
visited by the robot.
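Because exclusion is nothing more than a prefix comparison, the behavior described above takes only a few lines to verify. This is a sketch of the matching rule itself, not any particular robot's implementation:

```python
def is_disallowed(path, disallow_prefix):
    """Simple prefix match used by the Robot Exclusion Protocol:
    a path is blocked if it begins with the Disallow value."""
    return path.startswith(disallow_prefix)

# "/cgi-bin" blocks the directory and, as a side effect,
# any path that merely starts with the same string.
print(is_disallowed("/cgi-bin/search.pl", "/cgi-bin"))  # True
print(is_disallowed("/cgi-bin.html", "/cgi-bin"))       # True

# A tilde prefix blocks all user-level pages.
print(is_disallowed("/~chuck/home.html", "/~"))         # True
print(is_disallowed("/index.html", "/~"))               # False
```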
You can use an asterisk ("*") as a wild card in the
User-agent line. This is not a Unix-style
regular expression, but a special value that means "match all
agents." All robots must obey any resource controls associated with
this special user agent name. Thus, to prevent any robot from
seeing that CGI directory, you might use:
User-agent: *
Disallow: /cgi-bin
You can create a similar wild card for the Disallow directive
with the "/" character. Since the match is based on simple string
matching, and since every URL on your site must begin with a "/",
this path matches every path on your site. If you want to keep
Charlotte away completely, use:
User-agent: charlotte
Disallow: /
If you are truly opposed to any robot visitation, you can combine both wild cards:
User-agent: *
Disallow: /
We've focused on the
Disallow directive up to this
point, but keep in mind that the formal Robot Exclusion Protocol
specification also provides for an
Allow directive. It
works the same as the
Disallow directive except that it
is used to grant access to a specific resource. While the spec
provides for this directive, it is not in general use, and I wouldn't
count on too many robots honoring it.
When a robot reads the
robots.txt file, it finds the
User-agent line that matches its name. It then
compares each Disallow (or Allow) directive to the
desired resource path until it finds the first match. If the match
is disallowed, the robot will not request it from the server. If
the match is allowed, or does not match any directive, the robot is
free to access that resource on the server.
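The lookup just described can be sketched in a few lines of Python. This is a minimal illustration of the first-match rule, not any real robot's code:

```python
# How a well-behaved robot applies a parsed robots.txt record:
# scan the directives in file order and honor the first prefix
# match; if nothing matches, access is permitted.
def robot_may_fetch(directives, path):
    """directives: list of (verb, prefix) pairs in file order,
    where verb is "Allow" or "Disallow"."""
    for verb, prefix in directives:
        if path.startswith(prefix):
            return verb == "Allow"
    return True  # no directive matched: access permitted

rules = [("Disallow", "/cgi-bin"), ("Disallow", "/private")]
print(robot_may_fetch(rules, "/cgi-bin/search.pl"))  # False
print(robot_may_fetch(rules, "/index.html"))         # True
```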
The Robot <meta> tag
The Robot Exclusion Protocol can only be managed through the
robots.txt file located at the top-level of your site.
If you cannot create or modify this file on your site due to access
restrictions, how can you control robots attempting to see your pages?
There is no solution to this problem that controls access to
non-HTML resources on your site, but within your HTML pages there is
a <meta> tag you can use to control how a
robot indexes each page.
Within this <meta> tag, you supply a
name attribute of "robots". The
content attribute can contain any of the values
"index", "noindex", "follow", or "nofollow", separated by commas.
For example, to prevent a page from being included in an index, you would place
<meta name="robots" content="noindex">
in your document's
<head>. While this keeps the
page from being added to the index, it does not prevent the robot
from parsing the page, extracting any URLs, and visiting those
pages. To prevent the robot from traveling beyond this page, use
the nofollow value for the content
attribute. You can prevent both indexing and follow-on by combining
both values:
<meta name="robots" content="noindex, nofollow">
The only disadvantage to using this tag is that few robots currently honor it.
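A robot that does honor the tag must parse each page's <head> before deciding what to do with it. As a sketch of that bookkeeping (the sample page text is invented), using Python's standard HTML parser:

```python
# Sketch of how an indexing robot might honor the robots <meta>
# tag. The sample page below is an invented example.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.index = True    # may the page be indexed?
        self.follow = True   # may its links be followed?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            values = [v.strip().lower()
                      for v in attrs.get("content", "").split(",")]
            if "noindex" in values:
                self.index = False
            if "nofollow" in values:
                self.follow = False

page = ('<html><head>'
        '<meta name="robots" content="noindex, nofollow">'
        '</head><body>Private page</body></html>')
p = RobotsMetaParser()
p.feed(page)
print(p.index, p.follow)  # False False
```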
Other control methods
The premise of the Robot Exclusion Protocol is that robots are well-behaved, look for
robots.txt files and robot
<meta> tags, and honor what they find. Robots
that are poorly written or that intentionally ignore the protocol
will visit your site no matter what you put in the
robots.txt file or robot <meta> tags.
There is little you can do to stop these robots, short of determining their originating server address and preventing that server from visiting your site. All servers have ways to restrict access to specific machines; consult your server documentation for more details.
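As one example, an Apache-style server can deny requests from a specific machine in its configuration. The exact syntax varies by server and version, and the host and directory names here are invented; consult your own server's documentation:

```
# Apache-style host-based access control (directives from
# mod_access): deny requests from one misbehaving host.
<Directory "/usr/local/www/docs">
    Order allow,deny
    Allow from all
    Deny from rogue-spider.example.com
</Directory>
```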
For more information
There are several resources on the Web dealing with robots and the Robot Exclusion Protocol. The protocol draft specification (http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html) is a bit dry but worth a read. You'll probably find the Web Robots Page (http://info.webcrawler.com/mak/projects/robots/robots.html) easier to read and much more helpful. It includes links to several useful robot-related documents, including the Web Robots FAQ (http://info.webcrawler.com/mak/projects/robots/faq.html) and the Robots Mailing List (http://info.webcrawler.com/mailing-lists/robots/info.html).
Next month, we'll look at another voluntary access control standard, the Platform for Internet Content Selection (PICS).
About the author
Chuck Musciano has been running various Web sites, including the HTML Guru Home Page, since early 1994, serving up HTML tips and tricks to hundreds of thousands of visitors each month. He's been a beta-tester and contributor to the NCSA httpd project and speaks regularly on the Internet, World Wide Web, and related topics.