
Mining nuggets from Usenet

Scripts and methods for sifting through tons of newsgroups to find a speck of wisdom

By Lee Ratzan

SunWorld
October  1996

Abstract
Usenet is a huge, often informal series of bulletin boards chronicling a staggering array of social and technical topics. Unfortunately, mining useful information from Usenet is confounded by its sheer volume and by the difficulty of finding the right newsgroups and articles. This article discusses several useful utilities that help locate items across the vast Usenet space. Command-line utilities for NN are included in the attached sidebar. (3,800 words)



In 1979, a group of North Carolinians wanted to share information between remote computers using inexpensive modems and dial-up telephone lines. The first Usenet (USEr NETwork) consisted of three Unix minicomputers (yes, three). Today, Usenet consists of approximately 15,000 on-line discussion forums and special interest groups spanning millions of computers around the world.

The World Wide Web is, of course, today's brightest star in the Internet constellation. The Web's dazzling multimedia abilities do not, however, make it the perfect medium. The Web is not a dynamic, two-way communications channel. Usenet, on the other hand, is very interactive. Usenet is like Citizens Band radio, whereas the Web is more like multimedia magazine publishing.

These two technologies serve different needs, as we see here:

Message 6 of 6973
Subject:	An Interesting Program in Washington, DC
From:		John Q Public <SIWP15.JQ.Public@us.ab.edu>
Date:		September 11, 1996
Message-Id:	<36181@sci.med.aids>
Newsgroups:	sci.med.aids
The Smithsonian has a special program this weekend. The Names Project
Quilt is in Washington, DC. Please forward this information. We hope
you can attend!

Web interaction tends to be either static in both directions (display a document, read a document), static in one direction and semi-dynamic in the other (display a document, send a static mail message), or semi-dynamic in both directions (display a combination box of options and receive a set of discrete responses). There is no real personal interaction (today, anyway), and everyone sees the same information.

Usenet is specifically designed for personal interactions. Usenet's special interest groups are akin to New England's town meetings.

The Web won't replace Usenet, nor is the Web a fad. (Well, maybe.) Each fills a niche. The goal is to recognize some issues of finding information in Usenet so that you can use this tool more efficiently. There are dozens of newsreaders, and the operation of many is cumbersome. This awkwardness can be streamlined by special utilities -- the topic of this article.

Usenet is big. Very big. Some servers may retain 10,000 newsgroups, each of which may have hundreds of associated articles, each of which may be dozens of lines long. It is not uncommon for servers to store a gigabyte of Usenet news at any given time. According to the Usenet Storage Space Calculator, approximately 500 megabytes of articles are posted to Usenet every day. This is equivalent to publishing daily 500 copies of a 400-page novel. Usenet volume has doubled every year.

Usenet is one of few Internet resources that follows a hierarchical structure. We can exploit this to expedite information searches. Let's see how we can mine for data using Netscape Navigator's graphical, browser-based, mouse-driven reader, and NN, which is a text-based, command-driven reader found in Unix. The latter has die-hard users, but the former is rapidly becoming the tool of choice.

Typical applications

  1. A government agency issues a ruling vital to your work. The ruling is read by thousands of your peers, prompting lively on-line discussions. Your boss asks for a summary of the views. There are many potentially relevant newsgroups under the same primary category. You initiate an article search over the newsgroups and then identify and review several substantive discussions in minutes.

  2. Months ago, you stumbled across a Frequently Asked Questions (FAQ) document in a newsgroup. Today, you can't quite remember the document's details, but are certain it will be useful. You launch an automated search on several major categories. The search reveals all recently posted FAQs, and quickly identifies the prodigal doc.

  3. You have a specific question about a particular topic. No one nearby can help. Documentation is unavailable. You post a query to a relevant special interest newsgroup. An authoritative answer arrives by e-mail in an hour.



The core problems
Three problems stand between you and the information you want on Usenet:

  1. Finding the right newsgroups
  2. Homing in on relevant articles
  3. Posting an article to an appropriate newsgroup

The first two issues seem simple enough. In reality, however, many newsgroups may be relevant to a given topic. Cross-posting can spread the same article across many appropriate newsgroups. Anti-social people broadcast articles across many inappropriate groups, an act known as "spamming." It may not be clear which newsgroup is relevant to a given topic. For example, where might one find a discussion of overhead paging? Or, overhead garage doors? One topic may span several boundaries.

First things first
Usenet has a hierarchical structure in which newsgroup names have the form xxx.yyy.zzz. Topic xxx is the primary category, yyy is the secondary topic, and zzz is the tertiary topic (a name may have two, three, or more components). As an example, rec.music.folk deals with recreational topics in general, music in particular, and folk music specifically. sci.med deals with the medical aspects of science.
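
Because the components are separated by dots, a newsgroup name can be taken apart with standard Unix tools. The following is a minimal sketch in plain Bourne-style shell; the helper name group_part is made up for illustration and is not part of any newsreader.

```shell
#!/bin/sh
# group_part NAME N: print the Nth dot-separated component of a newsgroup name.
# (group_part is an illustrative helper, not part of any newsreader.)
group_part() {
	printf '%s\n' "$1" | cut -d. -f"$2"
}

group_part rec.music.folk 1	# primary category
group_part rec.music.folk 2	# secondary topic
group_part rec.music.folk 3	# tertiary topic
```

The first component is the one that matters most for searching, since it names the general category to which a search can be restricted.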

Assuming your organization's network maintains a Usenet news server (called having a news feed), the list of active newsgroups is typically stored in a system file (such as /netnews/nn/news/active.0). Ask your friendly local system administrator or your Internet Service Provider's help desk for the name and location of this file. Macintosh and PC users may want to download it to their own platforms for quick reference. This file can be copied to a new file (referred to here as newsdir) in which one can append a description to each group. The active newsgroup file has a simple structure and typically looks something like this (albeit thousands of lines long!):

alt.bbs
alt.bbs.internet
alt.bbs.lists
sci.med
sci.med.aids
sci.med.paramedic

Since you are reading this article in SunWorld Online, we assume you employ a Unix server. Place the newsdir file in a globally readable directory with only owner-writable permissions. (For Unix, set access permissions using the command chmod u=rw,go=r newsdir)
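
As a rough sketch of how such a descriptor file might be assembled, the fragment below copies names from a stand-in active list, appends a description to each, and sets the permissions. It makes several assumptions that are not in the article: the scratch paths, the helper name describe, the sample descriptions, and the choice of a tab between the group name and its description.

```shell
#!/bin/sh
# Sketch: build an annotated newsdir file from a copy of the active list.
# All paths are scratch locations for illustration; the descriptions are
# examples typed in by hand, not data taken from a real news server.
demo_dir="${TMPDIR:-/tmp}/newsdir_demo.$$"
mkdir -p "$demo_dir"
active="$demo_dir/active.0"
newsdir="$demo_dir/newsdir"

# Stand-in for the server's active file (one group name per line).
cat > "$active" <<EOF
alt.bbs
sci.med
sci.med.aids
EOF

# describe GROUP TEXT: append "group<TAB>text" to the descriptor file,
# refusing names that are not in the active list.
describe() {
	grep -Fx "$1" "$active" > /dev/null || {
		echo "unknown group: $1" >&2
		return 1
	}
	printf '%s\t%s\n' "$1" "$2" >> "$newsdir"
}

: > "$newsdir"
describe alt.bbs "Bulletin board systems"
describe sci.med "Medicine and related topics"
describe sci.med.aids "AIDS: treatment and research"

# Readable by everyone, writable only by the owner.
chmod u=rw,go=r "$newsdir"
ls -l "$newsdir"
```

Using = rather than + in the chmod expression sets the permissions outright, so any group or world write permission the file already had is removed as well.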

One last word of advice. Usenet has developed its own culture and set of courtesies called "netiquette." Be sure to read the standard netiquette guides, including a humorous parody of the nationally syndicated Emily Post manners column known as Emily Postnews.

Let's see how we resolved the scenarios described earlier. In the attached sidebar, we included Unix scripts to illustrate ways to solve a problem. The scripts are intended to be simple, not slick. They can be adapted as necessary. Navigator incorporates special features by click buttons or other mouse-driven interfaces. Not all features are available on all Web-based newsreaders.

Finding the right newsgroups
Some Web browsers load the names of all active newsgroups and display the list in a scrollable window.

[Screenshot of Netscape Navigator Usenet reader tool]

Purists point out the inherent disadvantage of keyword searching is that the concept of recall is promoted over precision. This means that all relevant items may be retrieved but not all retrieved items may be relevant. For example, indiscriminate searching on the key phrase MEDI will find MEDIcine but also multi-MEDIa. This is the classic all-or-only problem where one wants to retrieve all relevant items, only relevant items, and no irrelevant items. This is not always possible. Use common sense when using keywords. On-line searching is more an art than a science.
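
The trade-off can be put in numbers. The sketch below runs the MEDI keyword against a four-line sample list and counts how many wanted groups were found (recall) and how many found groups were wanted (precision). The sample lines, and the decision to treat the sci.med hierarchy as the "relevant" set, are assumptions made purely for illustration.

```shell
#!/bin/sh
# Sketch: measure recall and precision of a keyword search on a tiny
# descriptor list. The list and the notion of "relevant" (groups in the
# sci.med hierarchy) are assumptions chosen for illustration.
list="${TMPDIR:-/tmp}/recall_demo.$$"
cat > "$list" <<EOF
sci.med	Medicine and related topics
sci.med.aids	AIDS: treatment and research
comp.multimedia	Interactive multimedia technology
rec.music.folk	Folk music
EOF

# items we wanted
relevant=`grep -c '^sci\.med' "$list"`
# items the keyword found
retrieved=`grep -ci 'medi' "$list"`
# items that were both wanted and found
both=`grep -i 'medi' "$list" | grep -c '^sci\.med'`

echo "recall:    $both of $relevant relevant groups retrieved"
echo "precision: $both of $retrieved retrieved groups relevant"
rm -f "$list"
```

On this sample the keyword misses sci.med.aids (neither its name nor its description contains "medi") and picks up comp.multimedia, so both figures come out at one in two.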

Dealing with matches
The above process might find zero, one, or many matches. In the case of no matches the user can be presented with a message suggesting the use of synonyms or word stubs to increase the likelihood of a match. In the case of too many hits (say, exceeding a given threshold defined by the user or the system administrator) the user receives a warning message.

Homing in on relevant articles
Finding the right newsgroups is not always enough. There can be several potentially relevant newsgroups and many articles within those groups. It is neither efficient nor effective to scan the groups one at a time. One solution is to form a meta-group consisting of all articles on a given topic assembled in one place for your examination regardless of the newsgroup in which it originally appears.

Many Web search engines examine only documents or URLs and can miss timely article discussions and associated threads. AltaVista and Excite can search Usenet newsgroups. Deja News specializes in Usenet searches; it has archived Usenet since March 1995 and stores most newsgroups.

This type of massive information retrieval is supported by a series of capabilities that includes keyword, wild card, and Boolean searching (the use of AND, OR, and NOT). The latter permits formulation of complex queries. Simple and power searches are available by mouse click.

Watch out for the gotchas! A simple Boolean expression like Venetian AND Blind can return information on Venetian Blinds and Blind Venetians. Web news browsers permit users to pre-select the number of news articles being retrieved at one time. Be sure the number of articles being fetched is enough to find potential matches.

A disadvantage of the above techniques is the intrinsic absence of output ranking. Users generally want the most relevant items first in their hit list. The better search engines and Deja News try to rank hits using scoring algorithms. In addition, queries can be filtered on the basis of newsgroup, news author, date, subject, and threading (successive logically connected related discussions) by appropriate command button selection. Searching is a post-modern art form!

One variation on this searching information mode is to use a two-step approach of newsgroup locator followed by article locator. Step One consists of identifying a set of potentially relevant newsgroups by a suitable topic keyword. The first component of the names returned provides the group category. Step Two consists of using that specific group string in the article locator. This filtering process in conjunction with a particular search term can dramatically accelerate finding Usenet information.
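
The two-step approach might be sketched as follows. Step One reduces the matching newsgroup names to their distinct first components; Step Two would hand each component to the article locator. The function name categories_for and the sample list are assumptions, and the nn command line (modeled on the sidebar's usage) is only printed, not executed.

```shell
#!/bin/sh
# Step One: list the primary categories of all groups matching a topic.
# categories_for TOPIC FILE prints each distinct first component once.
categories_for() {
	grep -i "$1" "$2" |
		awk '{ split($1, part, "."); print part[1] }' |
		sort -u
}

# Sample descriptor list (invented for illustration).
newsdir="${TMPDIR:-/tmp}/twostep_demo.$$"
cat > "$newsdir" <<EOF
sci.med	Medicine and related topics
sci.med.aids	AIDS: treatment and research
misc.health.aids	AIDS issues and support
rec.music.folk	Folk music
EOF

for category in `categories_for aids "$newsdir"`
do
	# Step Two: restrict the article search to that category.
	# The command follows the sidebar's form and is only printed here.
	echo "nn -xms aids $category."
done
rm -f "$newsdir"
```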

Posting an article
Now that you have found what you want, how do you respond to it? How can you broadcast your query to the world? The normal posting process is straightforward for some newsreaders. In Navigator, pull down a news menu, click on a command button, enter newsgroup name, type message, cut/paste as desired, click send button, done -- although this presumes that you already know the name of the relevant newsgroup!

In some environments, however, the system prompts or graphical forms can be downright cryptic or obscure. An uncluttered user menu or window interface streamlines the process and enhances its effectiveness as a communication tool.

Usenet person-to-person communication is dynamic and almost entirely text-based. It is therefore important that the posting process, whether in Netscape Navigator or the NN newsreader, allow for titles, attachments, editing, addresses, signatures, replies, broadcasting, and other common message features. It is also important that the user have a clear opportunity to abort the transmission before it is sent. (See "Just A Two Bit Genie," Wilson Library Bulletin, April 1995, by this author, for an example of unexpected side effects when a desperate System Manager encounters a computer-literate Genie. WLB is available at most public libraries.)

The Usenet newsgroup structure exploits its hierarchy. One site receives a "feed" from another, which may pass it on. Metaphors such as traveling upstream and downstream are useful in understanding the propagation of Usenet articles. A posting moves in this manner along backbone sites gradually being propagated across the Usenet universe. The speed of transmission and its penetration are very much a function of the networks through which it passes. It may take time for your wisdom to circle the globe.

Usenet can, of course, be abused. Mass broadcasts not only waste bandwidth but are just plain discourteous to your fellow readers. Highly charged emotional responses (known as FLAMING!!!!) are objectionable regardless of whether one uses a sophisticated graphical or command-driven Usenet reader. Keep in mind that forms-based Web posting services have permitted users to hijack the e-mail address of others and thus plant forged messages to newsgroups in the name of someone else. Caveat emptor: If you see a Usenet message from bill.clinton@whitehouse.gov announcing he's decided to switch parties or forsake vintage Ford Mustangs in favor of Camaros, it's probably a forged message.

Customizing the presentation sequence
A newsreader presents newsgroups in alphabetical order, or in no order at all. This can be frustrating for users who want to see just a few groups, or wish to view their favorites in a particular order. Fortunately, users can define the viewing order.

A customized presentation sequence is usually controlled by an initialization file that resides in a special directory.

[Screenshot of Netscape Navigator Usenet reader tool]

The excerpts of Unix scripts supplied in the sidebar were intentionally written to be simple and straightforward, not slick. The idea is that the logical flow be clear, not cryptic. The code was originally developed as a Korn shell script on HP-UX and then ported to a Sun workstation running Solaris 2. Some programs called by the code, such as the system editor, pager, and newsreader, are host-specific and should be changed for different platforms. This caveat also applies to the location of the active newsgroup file and the newsgroup initialization file. All of these are referred to in the code through shell variables, so each needs to be changed only where it is first defined. Some features are already included in contemporary Web-Usenet interfaces. Others could be included as pull-down options.



[(c) Copyright  Web Publishing Inc., and IDG Communication company]


URL: http://www.sunworld.com/swol-10-1996/swol-10-usenet.html

Sidebar

Command-line utilities for handling Usenet

Usenet access should present the user with a series of options. In a browser, this usually consists of multiple windows (newsgroup names, associated article titles, corresponding contents, and so on), command buttons, clickable icons, and scroll bars.

In the case of a text-based newsreader, it might be a menu corresponding to the contents of the first newsgroup invoked, or a suite of options. The latter should be in a loop so that a user may execute any option in any order, repeatedly.

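while true
do
cat <<EOF
 A USENET SUITE
	1.		Invoke a newsreader
	2.		Find relevant articles
	3.		Find relevant newsgroups
	4.		Post a message to a newsgroup
	5.		Customize newsgroup presentation
	6.		Net Etiquette (and Emily Postnews)
EOF
	# Korn shell syntax: the text after "?" is the prompt, so no space may follow the "?"
	read -r action?"Action number [<ENTER>=quit]: "
	case "$action" in
		'' )		break				# empty reply leaves the loop
				;;
		[1-6] )		echo "Option $action selected"	# call the matching utility here
				;;
		* )		echo 'Error: Invalid entry!'
				;;
	esac
done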

Finding the right newsgroups

In the case of command-driven text newsreaders a keyword search can be performed on the descriptor file and the hits are displayed.

	while true
	do
		read -r topic?"Topical keyword? [<ENTER>=quit] "
		case "$topic" in
			'' )	exit
				;;
			* )	# search the descriptor file, ignoring case
				grep -i "$topic" "$newsdir" > temp
				n=`wc -l <temp`
				if [ $n -eq 0 ]
				then
					echo 'Error: no matches on this term.'
				else
					# page through the hits, 20 lines at a time
					pg -20 -nsp "Page %d <q=quit, h=help>" temp
				fi
				;;
		esac
	done

Possible matches are shown on the screen using a pager so that the user can view them at their own pace, quitting at any time. A flexible pager is used so that:

  1. The list can be viewed screen by screen
  2. The screen can be scrolled backwards or forwards as desired
  3. The pager displays a page number to give the user a point of reference
  4. The user can search for important character strings
  5. The user can jump to the beginning or end of the file or to a particular page number

Note that these capabilities might not be available if the Unix search function (grep) were used alone because the output would be little more than an uncontrolled and possibly unwieldy list.

The Web interface usually shows the entire active newsgroup list or the subscribed newsgroups list. It would be possible to construct a keyword search window which could pass the presumably shorter hit list to the news server when it was launched. This can expedite the look-up process.

The example above shows implementation of a single keyword. Multiple keywords could be accommodated by piping a grep with one keyword into a grep with another. This would emulate a Boolean AND operator. Independent greps with separate keywords redirected to a single output file would emulate a Boolean OR operator. Many variations on this theme are possible by judicious use of redirection.
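
The Boolean combinations described above could be sketched as three small functions. This is an illustration under stated assumptions: the function names and sample list are invented, and the OR case pipes the two greps through sort -u rather than redirecting them to a shared output file, so that a line matched by both keywords appears only once. The NOT case, which the sidebar text does not cover, uses grep's standard -v option.

```shell
#!/bin/sh
# Sketch of Boolean searches built from grep. File and keywords are sample data.
newsdir="${TMPDIR:-/tmp}/bool_demo.$$"
cat > "$newsdir" <<EOF
sci.med	Medicine and related topics
sci.med.aids	AIDS: treatment and research
comp.multimedia	Interactive multimedia technology
rec.music.folk	Folk music
EOF

# AND: both keywords must appear on the line.
search_and() {
	grep -i "$1" "$newsdir" | grep -i "$2"
}

# OR: either keyword; sort -u drops lines found by both greps.
search_or() {
	{ grep -i "$1" "$newsdir"; grep -i "$2" "$newsdir"; } | sort -u
}

# NOT: first keyword present, second absent (grep -v inverts the match).
search_not() {
	grep -i "$1" "$newsdir" | grep -iv "$2"
}

search_and med aids
search_or folk aids
search_not med multi
```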

The [i] flag allows case-insensitive searching (SAS is the same as SaS). If the flag is removed, the matching becomes case-sensitive (AIDS is not the same as aids). Case-insensitive searching is more general than case-sensitive searching, although the latter is more focused (viral AIDS can be distinguished from visual aids). A distinction should also be made between exact and partial matching (e.g., searching for the software package SAS by a simple keyword could lead to diSASter).
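
The difference between these matching modes shows up clearly on a small sample. The sketch below counts hits for each mode; the sample lines are invented, and the whole-word option -w, although widely available, is an assumption about the local grep rather than something every Unix guarantees.

```shell
#!/bin/sh
# Sample descriptor lines (invented for illustration).
list="${TMPDIR:-/tmp}/match_demo.$$"
cat > "$list" <<EOF
comp.soft-sys.sas	The SAS statistics package
misc.disaster	Disaster planning
sci.med.aids	AIDS: treatment and research
misc.education	Teaching with visual aids
EOF

# partial match, any case: also counts "disaster"
partial=`grep -ci 'sas' "$list"`
# whole-word match, any case (-w is assumed to be supported)
whole=`grep -ciw 'sas' "$list"`
# any case: counts both AIDS and "visual aids"
anycase=`grep -ci 'aids' "$list"`
# case sensitive: skips "visual aids"
exact=`grep -c 'AIDS' "$list"`

echo "sas:  partial=$partial whole-word=$whole"
echo "aids: any-case=$anycase case-sensitive=$exact"
rm -f "$list"
```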

The user can try different keywords or various character strings in the absence of knowing the exact word. The keyword MEDI will pick up newsgroup names or descriptors containing medical, medicinal, medic, and so on. The significance here is that the task of identifying a relevant newsgroup from an enormous roster can be reduced to that of scanning a short list.

You can identify all newsgroups in a general category by using a keyword corresponding to that category. For example, find all New Jersey groups by using nj, which is the first term in the newsgroup name. This can be useful as a first cut. Also, since every newsgroup has at least one dot character in its name it is possible to display all available newsgroups by using [.] as the keyword search character.
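
One caution: a bare keyword such as nj matches anywhere on the line, not just in the first term. A hedged sketch of a stricter category search follows; the function name groups_in and the sample lines are assumptions for illustration.

```shell
#!/bin/sh
# groups_in CATEGORY FILE: list groups whose name begins with CATEGORY.
# "^" anchors the match to the start of the line and "\." requires the dot,
# so "nj" cannot match inside another word.
groups_in() {
	grep -i "^$1\." "$2"
}

# Sample descriptor lines (invented for illustration).
list="${TMPDIR:-/tmp}/category_demo.$$"
cat > "$list" <<EOF
nj.general	General New Jersey topics
nj.forsale	Items for sale in New Jersey
alt.banjo	Banjo playing
EOF

# a bare keyword counts alt.banjo as well
loose=`grep -ci 'nj' "$list"`
# the anchored search counts only the nj hierarchy
anchored=`groups_in nj "$list" | wc -l | tr -d ' '`

echo "keyword nj: $loose hits; category nj: $anchored hits"
rm -f "$list"
```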

Dealing with matches

The above process might find zero, one, or many matches. In the case of no matches the user can be presented with a message suggesting the use of synonyms or word stubs to increase the likelihood of a match. In the case of too many hits (say, exceeding a given threshold defined by the user or the system administrator) the user receives a warning message.

With a small modification of the script it is possible to display index numbers alongside each line of the hit list. If this is done then the user can select the index number from the list, in which case the corresponding newsgroup name can be determined by parsing the line. The newsreader can be invoked and automatically aimed to that specific newsgroup.

	n_matches=`wc -l < temp`
# no matches
	if [ "$n_matches" -eq 0 ]
	then
		echo 'Error: No matches. Try a synonym or short word stub'
# one match
	elif [ "$n_matches" -eq 1 ]
	then
		cat temp
# too many matches
	elif [ "$n_matches" -gt "$threshold" ]
	then
		echo 'Error: Threshold reached. Try a more restrictive term'
		pg -20 -nsp "Page %d <q=quit, h=help>" temp
# some matches
	else
		pg -20 -nsp "Page %d <q=quit, h=help>" temp
	fi
# the reply itself is ignored; the read only pauses the display
	read -r reply?"Press <ENTER> to Continue "
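
The index-number variation mentioned before the listing might look like the sketch below. The function names, the sample hit list, and the fixed choice of 2 are all assumptions; in the real script the choice would come from a read prompt, and the selected name would be passed to the newsreader.

```shell
#!/bin/sh
# show_numbered FILE: print the hit list with an index number on each line.
show_numbered() {
	awk '{ printf "%3d  %s\n", NR, $0 }' "$1"
}

# pick_group FILE N: print the newsgroup name (first field) found on line N.
pick_group() {
	awk -v n="$2" 'NR == n { print $1 }' "$1"
}

# Sample hit list (invented for illustration).
hits="${TMPDIR:-/tmp}/hits_demo.$$"
cat > "$hits" <<EOF
sci.med	Medicine and related topics
sci.med.aids	AIDS: treatment and research
sci.med.paramedic	Emergency medical services
EOF

show_numbered "$hits"
choice=2	# stands in for the number the user would type
group=`pick_group "$hits" "$choice"`
echo "Selected newsgroup: $group"
rm -f "$hits"
```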

Homing in on relevant articles

Most text-based newsreaders that support article search capability do so on the basis of keywords in the article title as opposed to full text. One extraction technique is an automatic general search over all newsgroups selecting any or all relevant articles as they are encountered. This is comprehensive but slower. Alternatively, one may specify a general newsgroup category name (such as nj, sci, soc, rec, and alt) in which case the search is considerably faster, albeit more focused.

(Note: This process can be remarkably quick even for command based systems. As point of comparison, a HP 9000/G60 computer running HP-UX Version 9.x can scan 15,828 article titles in the comp [=computer] newsgroup category in 21 seconds. A SPARCstation 5 running Solaris 2.4 can scan 91,205 article titles in 52 seconds. Your mileage may vary.)

		read -r term?"Select term keyword: "
		read -r type?"Type of search: g)eneral or s)pecific? [s] "
		case "$type" in
			'g' | 'G' )	# general: search every newsgroup
					nn -xms "$term"
					# display results
					;;
			's' | 'S' | '' )	# specific: restrict the search to one category (note the trailing dot)
					read -r group?"Group name? "
					nn -xms "$term" "$group".
					# display results
					;;
			* )
					echo 'Error: Invalid entry!'
					;;
		esac

This example uses the [x, m, s] flags of the newsreader where s enables keyword Subject searching on titles, x enables eXtracting the matches, and m enables Metagroup formation of the matches.

Some newsreaders allow searching not only by article title but also by article author. This is helpful if one wants to extract all contributions made by an individual to a particular discussion or you know WHO wrote an article but not WHERE it was located or WHAT was its title. In this case change the [s] flag in the command line above to be [n] (mnemonic: s for subject, n for name). In the Web context, select the appropriate radio button to designate author-based searching.

Customizing the presentation sequence

In the case of the NN newsreader, the file is called init and resides in the .nn subdirectory of the home directory. The order of newsgroups in this file determines the presentation order. One line containing the word sequence must precede the list of newsgroups.
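
A minimal sketch of writing such a file follows. It uses a scratch directory in place of the real .nn directory so that nothing is overwritten; the directory name and the choice of groups are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: write a presentation sequence in the layout described above.
# A scratch directory stands in for $HOME/.nn so nothing real is overwritten.
nn_dir="${TMPDIR:-/tmp}/nn_demo.$$"
mkdir -p "$nn_dir"
initfile="$nn_dir/init"

# The word "sequence" comes first, then the groups in viewing order.
cat > "$initfile" <<EOF
sequence
sci.med.aids
sci.med
rec.music.folk
EOF

# Show the groups in the order they would be presented (skip the first line).
sed '1d' "$initfile"
```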

Useful features include the ability to append additional newsgroups to the end of a preexisting ordering, edit an old ordering, display the current ordering, create a totally new ordering or purge the old ordering. Much of this can be accomplished in a browser-based newsreader by judicious cutting and pasting.

As an added option, the user could be passed into an abbreviated version of the relevant newsgroup locator, which presents newsgroups of interest determined by subject keywords from the user. Potentially relevant newsgroups are presented one by one for possible inclusion into the ordering. The process is looped over newsgroups matching a given keyword and over different keywords.

read -r action?"a)ppend  c)hange  v)iew  r)eplace [<ENTER>=quit] "

	case "$action" in
# quit
		'' )
			exit
			;;
# or append
		'a' )
			read -r group?"Add newsgroup: "
			echo "$group" >>"$initfile"
			;;
# or change
		'c' )
			"$editor" "$initfile"
			;;
# or view
		'v' )
			"$pager" "$initfile"
			;;
# or replace
		'r' )
			rm -f "$initfile"
			echo 'sequence' >>"$initfile"
			read -r topic?"Find newsgroups on topic: "
# extract names from descriptor file
			grep -i "$topic" "$newsdir" | cut -f1 >temp
			n_hits=`wc -l <temp`
			echo 'There are' $n_hits 'matches'
# loop over choices
			index=1
			while [ $index -le $n_hits ]
			do
				# fetch line number $index of the hit list
				group=`head -$index temp | tail -1`
				echo "$group"
				read -r include?"INCLUDE? [y, n] "
# put newsgroup in file
				case "$include" in
					'y' )	echo "$group" >>"$initfile"
						;;
					* )	echo 'No action taken'
						;;
				esac
				index=`expr $index + 1`
			done
			;;
		* )	echo 'Invalid choice!'
			;;
	esac


About the author
Lee Ratzan is employed by the Information Systems and Technology Division of the University of Medicine and Dentistry of New Jersey (UMDNJ) and is a Ph.D. candidate at the School of Communication, Information and Library Studies of Rutgers University. He has developed and published numerous Internet applications on a variety of computer platforms and operating systems. Reach Lee at lee.ratzan@sunworld.com.