Mining nuggets from Usenet
Scripts and methods for sifting through tons of newsgroups to find a speck of wisdom
Usenet is a huge, often informal series of bulletin boards chronicling a staggering array of social and technical topics. Unfortunately, mining useful information from Usenet is confounded by:
- The proliferation of newsgroups
- Different groups related to the same topic
- Identical articles appearing in more than one group
- Vague group titles that make it unclear which groups are relevant
This article discusses several useful utilities that help locate items across the vast Usenet space. Command-line utilities for NN are included in the attached sidebar. (3,800 words)
In 1979, a group of North Carolinians wanted to share information between remote computers using inexpensive modems and dial-up telephone lines. The first Usenet (USEr NETwork) consisted of three Unix minicomputers (yes, three). Today, Usenet consists of approximately 15,000 on-line discussion forums and special interest groups spanning millions of computers around the world.
The World Wide Web is, of course, today's brightest star in the Internet constellation. Yet the Web's dazzling multimedia abilities do not make it the perfect medium: the Web is not a dynamic, two-way communications channel. Usenet, on the other hand, is very interactive. Usenet is like Citizens Band radio, whereas the Web is more like multimedia magazine publishing.
These two technologies serve different needs, as we see here:
Message 6 of 6973
Subject: An Interesting Program in Washington, DC
From: John Q Public <SIWP15.JQ.Public@us.ab.edu>
Date: September 11, 1996
Message-Id: <36181@sci.med.aids>
Newsgroups: sci.med.aids

The Smithsonian has a special program this weekend. The Names Project Quilt is in Washington, DC. Please forward this information. We hope you can attend!
Web interaction tends to be either static in both directions (display a document, read a document), static in one direction and semi-dynamic in the other (display a document, send a static mail message), or semi-dynamic in both directions (display a combination box of options and receive a set of discrete responses). There is no real personal interaction (today, anyway) and everyone sees the same information.
Usenet is specifically designed for personal interactions. Usenet's special interest groups are akin to New England's town meetings.
The Web won't replace Usenet, nor is the Web a fad. (Well, maybe.) Each fills a niche. The goal is to recognize some issues of finding information in Usenet so that you can use this tool more efficiently. There are dozens of newsreaders, and the operation of many is cumbersome. This awkwardness can be streamlined by special utilities -- the topic of this article.
Usenet is big. Very big. Some servers may retain 10,000 newsgroups, each of which may have hundreds of associated articles, each of which may be dozens of lines long. It is not uncommon for servers to store a gigabyte of Usenet news at any given time. According to the Usenet Storage Space Calculator, approximately 500 megabytes of articles are posted to Usenet every day. This is equivalent to publishing daily 500 copies of a 400-page novel. Usenet volume has doubled every year.
Usenet is one of the few Internet resources that follow a hierarchical structure. We can exploit this to expedite information searches. Let's see how we can mine for data using Netscape Navigator's graphical, browser-based, mouse-driven reader, and NN, which is a text-based, command-driven reader found in Unix. The latter has die-hard users, but the former is rapidly becoming the tool of choice.
Typical applications
The core problems
Three problems stand between you and the information you want
on Usenet:
The first two issues seem simple enough. In reality, however, many newsgroups may be relevant to a given topic. Cross-posting can spread the same article across many appropriate newsgroups. Anti-social people broadcast articles across many inappropriate groups, an act known as "spamming." It may not be clear which newsgroup is relevant to a given topic. For example, where might one find a discussion of overhead paging? Or, overhead garage doors? One topic may span several boundaries.
First things first
Usenet has a hierarchical structure in which newsgroup names have the
form xxx.yyy.zzz. Topic xxx is the primary category,
yyy is the secondary topic, and zzz is the tertiary
topic (a name may have two, three, or more components). As an example,
rec.music.folk deals with recreational topics in
general, music in particular, and folk music
specifically. sci.med deals with the medical aspects
of science.
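The naming scheme is easy to take apart with standard Unix tools. The following is a minimal sketch in portable shell; the function names split_group and primary_category are made up for this illustration and are not part of any newsreader.

```shell
# Hypothetical helper: print the components of a newsgroup name, one per line.
split_group() {
    echo "$1" | tr '.' '\n'
}

# Hypothetical helper: print only the primary category
# (the text before the first dot).
primary_category() {
    echo "$1" | cut -d. -f1
}

split_group rec.music.folk      # prints rec, music, folk on separate lines
primary_category sci.med        # prints sci
```

The primary category is what the later searches use to narrow the field, so it is worth having a one-line way to extract it.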
Assuming your organization's network maintains a Usenet news server (called having a news feed), the list of active newsgroups is typically stored in a system file (such as /netnews/nn/news/active.0). Ask your friendly local system administrator or your Internet Service Provider's help desk for the name and location of this file. Macintosh and PC users may want to download it to their own platforms for quick reference. This file can be copied to a new file (referred to here as newsdir) in which one can append a description to each group. The active newsgroup file has a simple structure and typically looks something like this (albeit thousands of lines long!):
alt.bbs
alt.bbs.internet
alt.bbs.lists
sci.med
sci.med.aids
sci.med.paramedic
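One way to start the newsdir file is to copy each group name and leave room for a description. The sketch below is an assumption about layout, not a prescribed format: it uses a tab between name and description (which matches the cut -f1 call in the sidebar scripts), works on a small sample file in a scratch directory rather than the real active file, and inserts a placeholder description to be edited by hand.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
active="$workdir/active.sample"   # stand-in for the real active file
newsdir="$workdir/newsdir"

# Sample group names taken from the listing above.
printf '%s\n' alt.bbs alt.bbs.internet sci.med sci.med.aids > "$active"

# Copy the first field of each line (the group name) and append a tab
# plus a placeholder description. Real descriptions are typed in later.
awk '{ printf "%s\t%s\n", $1, "(description to be added)" }' "$active" > "$newsdir"

cat "$newsdir"
```

Taking only the first field means the same command still works if the local active file carries extra columns after the group name.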
Since you are reading this article in SunWorld Online we
assume you employ a Unix server. Place the newsdir file in a
globally readable directory and make the file writable only by its owner. (For
Unix, set access permissions using the command chmod u=rw,go=r newsdir.)
One last word of advice. Usenet has developed its own culture and set of courtesies called "netiquette." Be sure to read the standard netiquette guides, including a humorous parody of the nationally syndicated Emily Post manners column known as Emily Postnews.
Let's see how we resolved the scenarios described earlier. In the attached sidebar, we included Unix scripts to illustrate ways to solve a problem. The scripts are intended to be simple, not slick. They can be adapted as necessary. Navigator incorporates special features by click buttons or other mouse-driven interfaces. Not all features are available on all Web-based newsreaders.
Finding the right newsgroups
Some Web browsers load the names of all active newsgroups and display
the list in a scrollable window.
Purists point out the inherent disadvantage of keyword searching is that the concept of recall is promoted over precision. This means that all relevant items may be retrieved but not all retrieved items may be relevant. For example, indiscriminate searching on the key phrase MEDI will find MEDIcine but also multi-MEDIa. This is the classic all-or-only problem where one wants to retrieve all relevant items, only relevant items, and no irrelevant items. This is not always possible. Use common sense when using keywords. On-line searching is more an art than a science.
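The recall-versus-precision trade-off can be seen with grep on a tiny descriptor file. This is only a sketch: the group descriptions are invented sample text, and the second pattern is one possible way to tighten a search (it requires the stub to start a word), not the only one.

```shell
# Sample descriptor file: name, tab, description (descriptions are invented).
newsdir=$(mktemp)
printf '%s\t%s\n' \
    sci.med         'Medicine and related topics' \
    sci.med.aids    'AIDS: treatment and prevention' \
    comp.multimedia 'Interactive multimedia technologies' \
    rec.music.folk  'Folk music' > "$newsdir"

# Broad stub: high recall, lower precision (multimedia sneaks in).
broad=$(grep -i 'medi' "$newsdir")

# Tighter pattern: "medi" must begin the line or follow a non-letter.
narrow=$(grep -iE '(^|[^[:alpha:]])medi' "$newsdir")

echo "Broad search:";  echo "$broad"  | cut -f1   # sci.med and comp.multimedia
echo "Narrow search:"; echo "$narrow" | cut -f1   # sci.med only
```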
Dealing with matches
The above process might find zero, one, or many matches. In the case of
no matches the user can be presented with a message suggesting the use
of synonyms or word stubs to increase the likelihood of a match. In the
case of too many hits (say, exceeding a given threshold defined by the
user or the system administrator) the user receives a warning message.
Homing in on relevant articles
Finding the right newsgroups is not always enough. There can be several
potentially relevant newsgroups and many articles within those groups.
It is neither efficient nor effective to scan the groups one at a time.
One solution is to form a meta-group consisting of all articles on a
given topic assembled in one place for your examination regardless of
the newsgroup in which it originally appears.
Many Web search engines examine only documents or URLs and can miss timely article discussions and associated threads. AltaVista and Excite can search Usenet newsgroups. Deja News specializes in Usenet searches; it has archived Usenet since March 1995 and stores most newsgroups.
This type of massive information retrieval is supported by a series of capabilities that includes keyword, wild card, and Boolean searching (the use of AND, OR, and NOT). The latter permits formulation of complex queries. Simple and power searches are available by mouse click.
Watch out for the gotchas! A simple Boolean expression like Venetian AND Blind can return information on Venetian Blinds and Blind Venetians. Web news browsers permit users to pre-select the number of news articles being retrieved at one time. Be sure the number of articles being fetched is enough to find potential matches.
A disadvantage of the above techniques is the intrinsic absence of output ranking. Users generally want the most relevant items first in their hit list. The better search engines and Deja News try to rank hits using scoring algorithms. In addition, queries can be filtered on the basis of newsgroup, news author, date, subject, and threading (successive logically connected related discussions) by appropriate command button selection. Searching is a post-modern art form!
One variation on this search mode is a two-step approach: a newsgroup locator followed by an article locator. Step One consists of identifying a set of potentially relevant newsgroups by a suitable topic keyword. The first component of the names returned provides the group category. Step Two consists of using that specific group string in the article locator. This filtering process in conjunction with a particular search term can dramatically accelerate finding Usenet information.
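The two steps can be sketched in a few lines of portable shell. The group list is sample data (alt.med.example is a made-up name), the search term is hypothetical, and Step Two only prints the NN command line used in the sidebar instead of running it, so the sketch works on a machine without a newsreader.

```shell
# Sample list of group names; alt.med.example is invented for illustration.
newsdir=$(mktemp)
printf '%s\n' alt.bbs alt.med.example sci.med sci.med.aids sci.med.paramedic > "$newsdir"

topic='med'     # Step One keyword
term='quilt'    # Step Two search term (hypothetical)

# Step One: find matching groups and keep the first component of each
# name, which is the group category.
categories=$(grep -i "$topic" "$newsdir" | cut -d. -f1 | sort -u)

# Step Two: one article search per category (printed, not executed).
for category in $categories
do
    echo "nn -xms \"$term\" $category."
done
```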
Posting an article
Now that you have found what you want how do you respond to it? How can
you broadcast your query to the world? The normal posting process is
straightforward for some newsreaders. In Navigator, pull down a news
menu, click on a command button, enter newsgroup name, type message,
cut/paste as desired, click send button, done -- although this presumes
that you already know the name of the relevant newsgroup!
In some environments, however, the system prompts or graphical forms can be downright cryptic or obscure. An uncluttered user menu or window interface streamlines the process and enhances its effectiveness as a communication tool.
Usenet person-to-person communication is dynamic and almost entirely text-based. Therefore it is important that the posting process -- be it Netscape Navigator or the NN newsreader -- allow for titles, attachments, editing, addresses, signatures, replies, broadcasting, and other common message features. It is also important that the user have a clear opportunity to abort the transmission before it is sent (see: Just A Two Bit Genie, Wilson Library Bulletin, April 1995 by this author for an example of unexpected side-effects when a desperate System Manager encounters a computer-literate Genie. WLB is available at most public libraries.)
The Usenet newsgroup structure exploits its hierarchy. One site receives a "feed" from another, which may pass it on. Metaphors such as traveling upstream and downstream are useful in understanding the propagation of Usenet articles. A posting moves in this manner along backbone sites gradually being propagated across the Usenet universe. The speed of transmission and its penetration are very much a function of the networks through which it passes. It may take time for your wisdom to circle the globe.
Usenet can, of course, be abused. Mass broadcasts not only waste
bandwidth but are just plain discourteous to your fellow readers.
Highly charged emotional responses (known as FLAMING!!!!) are
objectionable regardless of whether one uses a sophisticated graphical
or command-driven Usenet reader. Keep in mind that forms-based Web
posting services have permitted users to hijack the e-mail address of
others and thus plant forged messages to newsgroups in the name of
someone else. Caveat emptor: If you see a Usenet message from
bill.clinton@whitehouse.gov announcing he's decided to
switch parties or forsake vintage Ford Mustangs in favor of Camaros,
it's probably a forged message.
Customizing the presentation sequence
A newsreader presents newsgroups in alphabetical order, or in no order
at all. This can be frustrating for users who want to see just a few
groups, or wish to view their favorites in a particular order.
Fortunately, users can define the viewing order.
A customized presentation sequence is usually controlled by an initialization file that resides in a special directory.
The excerpts of Unix scripts supplied here were intentionally written to be simple and straightforward, not slick. The idea is that the logical flow be clear, not cryptic. The code was originally developed as a Korn shell script on an HP-UX system and then ported to a Sun workstation running Solaris 2. Some programs called by the code, such as the system editor, pager, and newsreader, are host-specific and should be modified for different platforms. This caveat also applies to the location of the active newsgroup file and the newsgroup initialization file. All of the above are referred to in the code through variables and so need only be changed where they are first defined. Some features are already included in contemporary Web-Usenet interfaces. Others could be included as pull-down options.
URL: http://www.sunworld.com/swol-10-1996/swol-10-usenet.html
Usenet access should present the user with a series of options. In a browser, this usually consists of multiple windows (newsgroup names, associated article titles, corresponding contents, and so on), command buttons, clickable icons, and scroll bars.
In the case of a text-based newsreader, it might be a menu corresponding to the contents of the first newsgroup invoked, or a suite of options. The latter should be in a loop so that a user may execute any option in any order repeatedly.
while true
do
    cat <<EOF
A USENET SUITE
1. Invoke a newsreader
2. Find relevant articles
3. Find relevant newsgroups
4. Post a message to a newsgroup
5. Customize newsgroup presentation
6. Net Etiquette (and Emily Postnews)
EOF
    # Korn shell prompt syntax: read var?"prompt" (no space before the ?)
    read -r action?"Action number [<ENTER>=quit]: "
    case "$action" in
        '' ) break ;;
        * )  : ;;    # <take action> for the chosen number goes here
    esac
done
Finding the right newsgroups
In the case of command-driven text newsreaders a keyword search can be performed on the descriptor file and the hits are displayed.
while true
do
    read -r topic?"Topical keyword? [<ENTER>=quit] "
    case "$topic" in
        '' ) exit ;;
        * )  # search the descriptor file and count the hits
             grep -i "$topic" "$newsdir" > temp
             n=`wc -l < temp`
             if [ $n -eq 0 ]
             then
                 echo 'Error: no matches on this term.'
             else
                 pg -20 -nsp "Page %d <q=quit, h=help>" temp
             fi ;;
    esac
done
Possible matches are shown on the screen using a pager so that the user can view them at their own pace, quitting at any time. A flexible pager is used so that
Note that these capabilities might not be available if the Unix search
function (grep) were used alone because the output would
be little more than an uncontrolled and possibly unwieldy list.
The Web interface usually shows the entire active newsgroup list or the subscribed newsgroups list. It would be possible to construct a keyword search window which could pass the presumably shorter hit list to the news server when it was launched. This can expedite the look-up process.
The example above shows implementation of a single keyword. Multiple
keywords could be accommodated by piping a grep with one
keyword into a grep with another. This would emulate a Boolean
AND operator. Independent greps with
separate keywords redirected to a single output file would emulate a
Boolean OR operator. Many variations on this theme are
possible by judicious use of redirection.
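The AND and OR constructions just described can be sketched as follows, using the sample group names from the main article. The NOT variation with grep -v is an addition for illustration, and the temporary file names come from mktemp rather than from the original scripts.

```shell
newsdir=$(mktemp)
printf '%s\n' alt.bbs alt.bbs.internet alt.bbs.lists \
              sci.med sci.med.aids sci.med.paramedic > "$newsdir"

# AND: pipe one grep into another.
both=$(grep -i 'sci' "$newsdir" | grep -i 'aids')

# OR: independent greps redirected to a single output file.
# sort -u drops any line that matched both keywords.
hits=$(mktemp)
grep -i 'lists' "$newsdir"  > "$hits"
grep -i 'aids'  "$newsdir" >> "$hits"
either=$(sort -u "$hits")

# NOT: grep -v removes lines matching the second keyword.
without=$(grep -i 'med' "$newsdir" | grep -iv 'aids')

echo "AND: $both"                 # sci.med.aids
echo "OR:  $(echo $either)"       # alt.bbs.lists sci.med.aids
echo "NOT: $(echo $without)"      # sci.med sci.med.paramedic
```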
The [i] flag allows case-insensitive searching (SAS is the
same as SaS). If the flag is removed then the matching becomes case
sensitive (AIDS is not the same as aids). Case-insensitive searching
is more general than case-sensitive searching, although the latter is
more focused (viral AIDS can be distinguished from visual aids). A
distinction should also be made between exact and partial matching (e.g.,
searching for the software package SAS by a simple keyword could lead
to diSASter).
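The difference between case handling and exact versus partial matching shows up clearly when the hits are counted. The three sample lines below are invented for this sketch. The whole-word flag -w is an assumption about the local grep: GNU and BSD versions accept it, but POSIX does not require it.

```shell
sample=$(mktemp)
printf '%s\n' 'SAS statistics package' \
              'natural diSASter reports' \
              'sas user group' > "$sample"

partial=$(grep -c 'SAS' "$sample")    # case-sensitive, partial match
nocase=$(grep -ci 'sas' "$sample")    # case-insensitive, partial match
whole=$(grep -cw 'SAS' "$sample")     # case-sensitive, whole word only

echo "partial=$partial nocase=$nocase whole=$whole"
# prints: partial=2 nocase=3 whole=1
```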
The user can try different keywords or various character strings in the absence of knowing the exact word. The keyword MEDI will pick up newsgroup names or descriptors containing medical, medicinal, medic, and so on. The significance here is that the task of identifying a relevant newsgroup from an enormous roster can be reduced to that of scanning a short list.
You can identify all newsgroups in a general category by using a
keyword corresponding to that category. For example, find all New
Jersey groups by using nj, which is the first term in the
newsgroup name. This can be useful as a first cut. Also, since every
newsgroup has at least one dot character in its name it is possible to
display all available newsgroups by using [.] as the
keyword search character.
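A short sketch of the category search follows. The group list is sample data (alt.ninja.example is a made-up name chosen because it happens to contain the letters nj). Anchoring the keyword to the start of the name is a refinement added here, and grep -F is used so the dot is treated as a literal character instead of a pattern that matches anything.

```shell
newsdir=$(mktemp)
printf '%s\n' nj.general nj.forsale alt.ninja.example sci.med > "$newsdir"

loose=$(grep -c 'nj' "$newsdir")        # 3: also catches alt.ninja.example
anchored=$(grep -c '^nj\.' "$newsdir")  # 2: only names whose first term is nj
all=$(grep -cF '.' "$newsdir")          # 4: every name contains a dot

echo "loose=$loose anchored=$anchored all=$all"
```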
Dealing with matches
The above process might find zero, one, or many matches. In the case of no matches the user can be presented with a message suggesting the use of synonyms or word stubs to increase the likelihood of a match. In the case of too many hits (say, exceeding a given threshold defined by the user or the system administrator) the user receives a warning message.
With a small modification of the script it is possible to display index numbers alongside each line of the hit list. If this is done then the user can select the index number from the list, in which case the corresponding newsgroup name can be determined by parsing the line. The newsreader can be invoked and automatically aimed to that specific newsgroup.
n_matches=`wc -l < temp`
# no matches
if [ "$n_matches" -eq 0 ]
then
    echo 'Error: No matches. Try a synonym or short word stub'
# one match
elif [ "$n_matches" -eq 1 ]
then
    cat temp
# too many matches
elif [ "$n_matches" -gt "$threshold" ]
then
    echo 'Error: Threshold reached. Try a more restrictive term'
    pg -20 -nsp "Page %d <q=quit, h=help>" temp
# some matches
else
    pg -20 -nsp "Page %d <q=quit, h=help>" temp
fi
read -r continue?"Press <ENTER> to continue "
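The index-number variation mentioned above might look like the following sketch. It is written in portable shell instead of Korn-specific syntax, the hit list is sample data with placeholder descriptions, and the user's selection is hard-coded where a real script would read it from the keyboard. The final line prints the newsreader command instead of running it.

```shell
# Sample hit list: name, tab, description (descriptions are placeholders).
temp=$(mktemp)
printf '%s\t%s\n' sci.med           'sample description' \
                  sci.med.aids      'sample description' \
                  sci.med.paramedic 'sample description' > "$temp"

# Display an index number alongside each line of the hit list.
awk '{ printf "%3d  %s\n", NR, $0 }' "$temp"

choice=2    # a real script would read this from the user

# Parse the chosen line: the newsgroup name is the first tab-separated field.
group=$(sed -n "${choice}p" "$temp" | cut -f1)

echo "nn $group"    # prints: nn sci.med.aids
```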
Homing in on relevant articles
Most text-based newsreaders that support article search capability do so on the basis of keywords in the article title as opposed to full text. One extraction technique is an automatic general search over all newsgroups selecting any or all relevant articles as they are encountered. This is comprehensive but slower. Alternatively, one may specify a general newsgroup category name (such as nj, sci, soc, rec, and alt) in which case the search is considerably faster, albeit more focused.
(Note: This process can be remarkably quick even for command-based systems. As a point of comparison, an HP 9000/G60 computer running HP-UX Version 9.x can scan 15,828 article titles in the comp [=computer] newsgroup category in 21 seconds. A SPARCstation 5 running Solaris 2.4 can scan 91,205 article titles in 52 seconds. Your mileage may vary.)
read -r term?"Select term keyword: "
read -r type?"Type of search: g)eneral or s)pecific? [s] "
case "$type" in
    'g' | 'G' )
        nn -xms "$term"
        # <display results>
        ;;
    's' | 'S' | '' )
        read -r group?"Group name? "
        # the trailing dot follows the category name, as in the original script
        nn -xms "$term" "$group".
        # <display results>
        ;;
    * )
        echo 'Error: Invalid entry!'
        ;;
esac
This example uses the [x, m, s] flags of the newsreader, where
s enables keyword Subject searching on titles,
x enables eXtracting the matches, and
m enables Metagroup formation of the matches.
Some newsreaders allow searching not only by article title but also
by article author. This is helpful if you want to extract all
contributions made by an individual to a particular discussion, or if you
know WHO wrote an article but not WHERE it was located or WHAT its
title was. In this case change the [s] flag in the command
line above to [n] (mnemonic: s for subject,
n for name). In the Web context, select the appropriate radio
button to designate author-based searching.
Customizing the presentation sequence
In the case of the NN newsreader the file is called init and resides in the .nn subdirectory of the home directory. The order of newsgroups in this file determines the presentation order. One line containing the word sequence must precede the list of newsgroups.
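As a rough illustration of that layout, the sketch below writes a small ordering to a scratch file instead of the real init file, so it can be tried without disturbing an existing setup. The choice and order of groups are arbitrary examples.

```shell
initfile=$(mktemp)    # stand-in for the init file in the .nn directory

# The word "sequence" comes first, then the groups in viewing order.
{
    echo 'sequence'
    printf '%s\n' sci.med.aids sci.med rec.music.folk
} > "$initfile"

# Appending a group to the end of the existing ordering.
echo 'alt.bbs.lists' >> "$initfile"

cat "$initfile"
```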
Useful features include the ability to append additional newsgroups to the end of a preexisting ordering, edit an old ordering, display the current ordering, create a totally new ordering or purge the old ordering. Much of this can be accomplished in a browser-based newsreader by judicious cutting and pasting.
As an added option the user could be passed into an abbreviated version of the relevant newsgroup locator, which presents newsgroups of interest determined by subject keywords from the user. Potentially relevant newsgroups are presented one by one for possible inclusion in the ordering. The process is looped over newsgroups matching a given keyword and over different keywords.
read -r action?"a)ppend c)hange v)iew r)eplace [<ENTER>=quit] "
case "$action" in
    # quit
    '' ) exit ;;
    # or append
    'a' )
        read -r group?"Add newsgroup: "
        echo "$group" >> "$initfile" ;;
    # or change
    'c' ) "$editor" "$initfile" ;;
    # or view
    'v' ) "$pager" "$initfile" ;;
    # or replace
    'r' )
        rm -f "$initfile"
        echo 'sequence' >> "$initfile"
        read -r topic?"Find newsgroups on topic: "
        # extract names from descriptor file
        grep -i "$topic" "$newsdir" | cut -f1 > temp
        n_hits=`wc -l < temp`
        echo 'There are' $n_hits 'matches'
        # loop over choices
        index=1
        while [ "$index" -le $n_hits ]
        do
            group=`head -$index temp | tail -1`
            echo "$group"
            read -r include?"INCLUDE? [y, n] "
            # put newsgroup in file
            case "$include" in
                'y' ) echo "$group" >> "$initfile" ;;
                * )   echo 'No action taken' ;;
            esac
            index=`expr $index + 1`
        done ;;
    * ) echo 'Invalid choice!' ;;
esac
About the author
Lee Ratzan is employed by the Information Systems and Technology Division of the University of Medicine and Dentistry of New Jersey (UMDNJ) and is a Ph.D candidate at the School of Communication, Information and Library Studies of Rutgers University. He has developed and published numerous Internet applications on a variety of computer platforms and operating systems.
Reach Lee at lee.ratzan@sunworld.com.