Internet archives: Who's doing it? And can you protect your privacy?
While Internet libraries are certainly valuable, the potential for abuse is troubling. Find out what three archive companies are indexing and how they're accomplishing it.
Internet archiving technologies promise to provide a historical record of the Internet for posterity and a way to find documents and newsgroup postings that have since been removed. Archives can be very useful for researching technical details that were once hot topics but are now almost forgotten. At the same time, they can threaten our privacy by allowing anyone to find things we may have thought had been hidden or deleted. Still, there are techniques that make it harder for what you say on the Internet to be used against you. (3,000 words, including a sidebar)
One day, John tries to remember a great recipe he saw posted to a newsgroup. Unfortunately, he doesn't remember which newsgroup, much less when it was posted. All he can remember is "Hello Dolly Brownie Bars." John tries various search engines, but to no avail. In his exhaustive searches, John comes across a link to DejaNews, a newsgroup archiving service. He pops in his query, and a few minutes later he gets his recipe. John is so impressed, he decides to look up his old college buddy. He puts in his name, and out come half a dozen posts from his friend with his new e-mail address.
But John is not the only one using DejaNews to track down information. On the other side of town, Lucifer, a down-and-out private investigator, has found the ultimate tool for electronic blackmail. He finds an explicit posting from something like the alt.sex.kinky newsgroup, saves the message, and prepares the letter for his victim: "Dear Victim: Despite your high moral pretensions, I know the truth about your checkered past and the kinky sex games you play. Please send $5,000 in cybercash to the following numbered account in Switzerland, or I will send the following evidence, with proof that you wrote it, to your wife, your employer, and
Lucifer isn't alone. The local insurance underwriter has discovered one more tool for pre-screening clients. Before the initial interview, he asks the potential client for his e-mail address. Then the underwriter does a Usenet search looking for postings to such groups as alt.support.cancer. If a post is found asking for advice on some chronic, costly ailment, the underwriter calls the potential client back and apologetically declines to offer insurance.
Like any tool, search engines can be used for good and ill. When Usenet newsgroups first started, people thought they could be used to enhance communication between people. As it turned out, some of those communities share information on building bombs, while others exchange self-help information. If anything, Internet archives are on the rise. They're growing to capture every word posted to Usenet and every Web page that can be reached by a pointer from a public site.
So you thought your postings were somewhat "private," huh? Howard Rheingold, a noted philosopher and founder of The Well, says, "People who think posting to Usenet is like having a conversation need to understand that although it is informal, like a conversation, it is also publishing. Words you write today can be retrieved tomorrow, or years from now, and used against you."
At the moment, there are three main organizations archiving the Internet. DejaNews Inc. in Austin, TX, is probably the most widely known as it has been offering access to text archived from newsgroups for a while. InReference Inc. in Sunnyvale, CA, is a start-up company that is offering Usenet archives, as well as mailing lists, which can be searched simultaneously from a single form. The Internet Archive in San Francisco is the most ambitious, as it plans to archive the entire Web -- graphics, sound, and video included.
Usenet deja vu
Usenet newsgroups were first started in 1979 as a way for people to exchange ideas or problems and solutions. DejaNews was formed when programmer Steve Madere was looking for a killer application for a new text indexing system he had developed. He settled on indexing the newsgroups and began archiving in March 1995. He went public on the Web with it two months later.
Initially it focused on key newsgroups such as the hierarchy used to exchange technical information. In January 1996, it began archiving the significantly bulkier talk hierarchies. The archive currently indexes 15,000 newsgroups and has close to 80 million messages online that require 120 gigabytes of disk storage. According to NetPartners' Usenet storage space calculator, 774 megabytes per day of Usenet postings are added, but that includes binary files that are not archived by DejaNews due to storage and policy reasons.
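As a rough check on those figures, the average archived message works out to about 1.5 kilobytes. The sketch below assumes decimal units (one gigabyte is 10^9 bytes) and treats "close to 80 million" as exactly 80 million; both are simplifying assumptions rather than DejaNews figures, and the 120 gigabytes may well include index overhead.

```python
# Back-of-the-envelope check of the DejaNews figures quoted above.
# Assumptions: decimal units, and exactly 80 million messages.
ARCHIVE_BYTES = 120 * 10**9   # 120 gigabytes of disk storage
MESSAGE_COUNT = 80 * 10**6    # "close to 80 million" messages


def average_message_bytes(total_bytes: int, messages: int) -> float:
    """Return the mean number of bytes stored per message."""
    return total_bytes / messages


avg_bytes = average_message_bytes(ARCHIVE_BYTES, MESSAGE_COUNT)
print(f"about {avg_bytes:.0f} bytes per message")
```

At that size, a text-only archive grows far more slowly than the 774 megabytes per day of raw Usenet traffic would suggest, which is consistent with the decision to leave binaries out.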
DejaNews currently runs on two sets of machines, one performing database operations, the other providing the WWW interface. The database and Web machines are dual 133-MHz Pentium systems running Linux SMP with up to 256 megabytes of RAM, as well as multiple one- to four-gigabyte hard disk drives.
DejaNews gives the searcher a variety of ways of looking for information. The basic search is based on keywords and word fragments. Searches can be narrowed to certain authors, newsgroups, time periods, or subjects using query filters. Topic thread searches can retrieve the entire thread of articles on a particular topic, which helps put an article in context. Author profiles provide statistics on articles from a single e-mail address. DejaNews is supported by advertising, and ad prices depend on how the advertising is targeted.
Privacy and copyright protection
As a matter of policy, DejaNews will remove an author's articles if he or she requests it, but it can be a time-consuming process for the author. Humphrey Marr, director of business development at DejaNews, explains, "We will delete things for people as a courtesy as long as we can manage it. We try and educate people because they may not realize what they are doing when they post to Usenet. We can understand someone not wanting their name up, or getting confused and thinking it was a private e-mail setup."
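To see why someone might not want their name up, consider how little work an author profile takes once postings are archived. The following is a minimal sketch, not DejaNews's actual implementation: the `posts` records, their field names, and the `author_profile` helper are all hypothetical, invented here for illustration.

```python
from collections import Counter

# Hypothetical archived headers; the field names are illustrative only.
posts = [
    {"author": "pat@example.com", "newsgroup": "rec.food.recipes", "subject": "Brownie bars"},
    {"author": "pat@example.com", "newsgroup": "alt.support.cancer", "subject": "Advice?"},
    {"author": "lee@example.com", "newsgroup": "rec.food.recipes", "subject": "Re: Brownie bars"},
    {"author": "pat@example.com", "newsgroup": "rec.food.recipes", "subject": "Re: Brownie bars"},
]


def author_profile(archive, address):
    """Count how many articles one e-mail address posted to each newsgroup."""
    return Counter(p["newsgroup"] for p in archive if p["author"] == address)


profile = author_profile(posts, "pat@example.com")
for group, count in profile.most_common():
    print(f"{group}: {count}")
```

A single e-mail address is the only key needed, which is exactly the piece of information the underwriter in the opening scenario asks for.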
There are more efficient ways to keep confidential postings private (see sidebar, "Protecting your privacy from archives").
Since DejaNews does not archive any binary files, it can ignore some of the legal issues associated with sexually explicit pictures on the Internet (besides saving a lot of space). In addition, this helps protect it from having to deal with pirated software being distributed over Usenet.
DejaNews is also willing to respect all copyrights. As it turns out, the group that asks DejaNews to remove copyrighted materials most often is not some publisher trying to protect itself from lost profits. It is the Church of Scientology attempting to protect itself from outsiders, and occasionally a church member, who post portions of its copyrighted materials.
Marr says, "With the Church of Scientology, it is a never-ending cycle because people are always posting their material."
InReference's mail sorting
While DejaNews is focusing exclusively on Usenet, InReference is providing a single reference point for all types of information by archiving Usenet and mailing lists. The company started archiving mailing lists in January 1996 and began archiving Usenet in March.
InReference currently archives approximately 16,000 newsgroups. It has an index of 100,000 mailing lists, but only has permission from list owners to archive about 1,000. The relatively small number of archived lists in relation to the index is not due to InReference's own storage limitations. Rather, it reflects how many owners have responded to e-mail requests from InReference. The company is constantly looking to archive new lists as a free service, and list owners are invited to authorize archiving of their lists so that they can be included in the archive. As Jurgen Botz, one of the founders of InReference, says, "We intend to archive everything we get permission to archive."
Like DejaNews, InReference honors the no-archive header, as well as requests to remove postings of copyrighted material. It also filters out all binary files. Botz explains, "The volume of binaries on Usenet is gigantic and the return smaller because the audience is relatively smaller. Plus you cannot search the binaries."
InReference's advertising works similarly to DejaNews's. Since the service is still undergoing beta tests, the company has not established firm rates, but it has set up ads for sponsors and non-profit organizations to keep those ad banners filled.
The basic database engine runs on an UltraSPARC E4000 Enterprise Server with eight processors, two gigabytes of RAM, and 350 gigabytes of storage in a RAID configuration. A regular news server runs on a SPARCserver 20 and feeds Usenet articles and mailing list messages to the database. A separate SPARCserver 20 is used as the Web front end and feeds the queries into the database engine.
Botz says that the architecture is distributed so they can add more capacity if required, but for now the UltraSPARC does a decent job. "It is not likely we will need more servers, but we may need to add more RAM. That's okay since the UltraSPARC can hold up to 17 gigabytes of RAM."
Doing it all: Internet Archive takes on the Web
Perhaps the most ambitious attempt to archive the Internet is being made by the Internet Archive, which has the goal of archiving as much as it can on the Internet. It is not an index and does not yet allow individuals to search through its archive. Instead, it allows companies bulk access to its archives in the form of copies of the tapes. It is then the companies' responsibility to dissect the information and index it as needed.
The Internet crawling is done by a 166-MHz Pentium system running BSDI. It collects the pages onto a Quantum DLT4500 tape library, which can store 200 gigabytes of compressed data. It uses the Forefront Group's Web Whacker software to collect the pages. Storage of the archive will be done on a Sun UltraSPARC 2 (it is temporarily on a SPARC 20). It is connected to two tape robots that can store an aggregate of 5.8 terabytes of data.
The Internet Archive is a not-for-profit organization that would receive copies of all bits downloaded for long-term maintenance. However, it has a commercial side that is developing technology for gathering, storing, and managing terabytes of data.
In addition to archiving the Web, it is also making an archive of Usenet. It is not limiting itself to text only either. It seeks to capture the entire world of multimedia on the Internet with graphics, sound, and video. It is attempting to provide as much coverage as possible by collecting postings from other sources such as CD-ROM, and has already collected Usenet news from as far back as 1992.
Brewster Kahle, founder and president of the Internet Archive, says, "We are trying to be comprehensive and aggressive about filling in the present and opportunistic with the past. We are trying to build toward a digital library, as opposed to a card catalog. Next year we will offer bulk access to the data, but it won't be a search service. It will be like an archive as opposed to a library. In an archive, they hand you a box and say, go at it. That is more our style."
Presumably, the archive could provide a legal basis for proving when information was posted to the Internet. Kahle notes that one lawyer has already called them to determine whether certain information had been publicly posted on a certain date, but the archive was not comprehensive enough yet to be useful. In the future, it could provide a public record of when information was available that could have some legal validity. The dates now used to show when something was modified or added can be changed by any Webmaster.
The robot crawler only gathers data for the archive from Web pages that someone could get to by clicking on a link. It cannot get behind CGI scripts, forms, databases, restricted accesses, or any site for which a name or password is required.
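The limits described above follow from how a link-following robot works: it only sees URLs that appear in anchor tags, so anything reachable only by submitting a form never enters its queue. The sketch below illustrates the idea with Python's standard library. It is not the Internet Archive's crawler; the `LinkCollector` class, the `is_crawlable` heuristic (skipping `/cgi-bin/` paths and query strings), and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchors are followed; <form> targets are never queued.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def is_crawlable(url):
    """Assumed heuristic: skip CGI scripts and query-driven pages."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    return "/cgi-bin/" not in parts.path and not parts.query


page = """
<a href="/about.html">About</a>
<a href="/cgi-bin/search?q=recipes">Search</a>
<a href="papers/index.html">Papers</a>
<form action="/login"><input name="user"></form>
"""

collector = LinkCollector("http://www.example.com/")
collector.feed(page)
to_visit = [url for url in collector.links if is_crawlable(url)]
print(to_visit)
```

Password-protected pages are excluded for a different reason: the robot can request them, but the server refuses to hand them over.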
To protect the privacy of individuals, the Internet Archive is working on a way of pulling individuals' Web sites out of the archive on request, as DejaNews and InReference now do with postings. But Kahle points out, "Technically it is not hard to do. It is a matter of what is right and wrong. Who has the right to pull what we see? The Internet is not just published materials, it is often personal. To that extent, we don't want to change the nature of the Internet because it is being archived. We want to improve the infrastructure of the 'Net by supporting persistent pointers to old data that might not be online anymore."
Internet archives might change the way the Internet is viewed. No longer can people live and work in cyberspace under the assumption that anything they do will disappear after a while. The archives described in this article are only the most prominent ones at the moment. What happens to all of that data that the National Security Agency is rumored to be collecting from all of the international Internet links to the U.S.? Is that information being archived as well, in the interests of national security?
The Well founder Rheingold says, "Computers are good at compiling little bits of things into larger collections. We all leave electronic trails of our purchases, our subscriptions, our travels, our communications. Most people generate hundreds and thousands of tiny pieces of information about themselves. There's nothing really new about that. What's new is the emergence of search engines and databases that can find and compile these fragments into dossiers. And it isn't just Big Brother we have to worry about now -- law enforcement agencies already have access to information-gathering tools. It's everyone! We need a national debate and educational campaign about the privacy implications of new technologies."
Protecting your privacy from archives
If you do not want something that you write on a newsgroup to get archived, include the following line in the X-header in any posting you make:
Although this standard is currently respected by the three Internet archives, there is no guarantee that someone else archiving Usenet (like the government or large insurance companies) will acknowledge it. Consequently, DejaNews recommends three alternative ways of posting. First, you could use Internet Relay Chat or other media that have more privacy built in. Second, you can use an anonymous account or remailer so that your postings cannot be linked to your true identity. Third, do not include identifying information in your postings, like your name, phone number, or anything else that can be used to identify you.
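For readers who compose postings with a script, here is a minimal sketch of attaching the no-archive header. It assumes the widely used form `X-No-Archive: yes`; the `build_post` helper, the addresses, and the newsgroup are hypothetical, and the sketch only builds the message text without sending it to a news server.

```python
from email.message import EmailMessage


def build_post(sender, newsgroup, subject, body, archive=True):
    """Build a Usenet-style article, optionally asking archives to skip it."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Newsgroups"] = newsgroup
    msg["Subject"] = subject
    if not archive:
        # Assumed convention: archives that honor the header drop the post.
        msg["X-No-Archive"] = "yes"
    msg.set_content(body)
    return msg


post = build_post(
    "pat@example.com",
    "rec.food.recipes",
    "Brownie bars",
    "Recipe follows.",
    archive=False,
)
print(post.as_string())
```

The header is only a request. It works because the archives choose to honor it, which is why the alternatives above still matter.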
If you do not want your Web page archived, you can put password protection on it. You could also make a point of not registering with any search engines and not allowing direct links from any public indexes or Web sites that are indexed. If you own your own domain name, you can create a file in the root directory called robots.txt which says:
# Prevent all robots from visiting this site:
User-agent: *
Disallow: /
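To confirm how a well-behaved robot would read those rules, you can feed them to the robots.txt parser in Python's standard library. This is a sketch of the checking side only; the robot name `ExampleArchiveBot` and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same three lines shown above, as they would appear in robots.txt.
rules = [
    "# Prevent all robots from visiting this site:",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# A robot that honors the standard asks before fetching each URL.
allowed = parser.can_fetch("ExampleArchiveBot", "http://www.example.com/index.html")
print(allowed)
```

The check is voluntary: the file keeps out only those robots that bother to ask.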
If you do not own the domain, or do not want the entire site excluded from robots, you can include specific meta tags in your HTML page.
<meta name="robots" content="noindex"> will stop all robots (that honor the standard) from indexing your page entirely.
<meta name="robots" content="nofollow"> will allow robots to index your page, but not follow any links.
<meta name="robots" content="noarchive"> will allow search engine robots to index your page and follow links on the page, but not archive the page.
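From the robot's side, honoring these tags means reading the page's meta elements before deciding what to do with it. The following sketch shows one way that could look; the `RobotsMetaParser` class and `robot_permissions` function are hypothetical names, and the assumption that several directives may be combined in one comma-separated `content` value is a common convention rather than something stated above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gather the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Assumed: directives may be comma-separated in one tag.
            self.directives.update(
                token.strip().lower()
                for token in content.split(",")
                if token.strip()
            )


def robot_permissions(html):
    """Return what a robot that honors the tags may do with the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "index": "noindex" not in found,
        "follow": "nofollow" not in found,
        "archive": "noarchive" not in found,
    }


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robot_permissions(page))
```

A page with no robots meta tag comes back with every permission granted, which is the default the article warns about.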
About the author
George Lawton (email@example.com) is a computer and telecommunications consultant based in Brisbane, CA. You can visit his home page at http://www.best.com/~glawton/glawton/. Reach George at firstname.lastname@example.org.