
Why text search matters

Why client/server text search will become a hot topic

By Bill Rosenblatt

SunWorld
September  1995

Abstract
Conventional client/server means access to databases -- rows and columns of numbers and strings. So why the sudden interest in text-search engines? Because as content becomes more complex in form, users demand access not just to data lists, but to documents. That means searching text for words -- and for meaning and context. No simple problem, but its solution will be important to you. Read on to find out why. (2,700 words)



The heart of client/server until now has been the relational database, which specializes in simple structured data: storing, managing, indexing, manipulating, and retrieving it.

We do that so well that success is breeding a whole new set of problems. Client/server technologies are sucking up an unfathomable amount of information and making it more widely available. There is a mad rush to put content online -- in today's ultra-competitive business environment, to make more and better information more readily available to your customers and your workers.

But increasingly, valuable content isn't easily manipulated by traditional databases. In last month's column we looked at the requirements for managing documents as complex data, and at the various technologies, at different stages of maturity, necessary to do so.

The most important of these new technologies is text search -- which we'll examine this month. Text search may seem trivial -- doesn't your word processor offer text search? Isn't grep a standard part of Unix? But as we'll see, text-search technology is becoming a major arrow in client/server architects' quivers. If you're involved in client/server, you'll need to pay attention to recent developments.

Content vs. information
Conventional client/server technology has focused on simple structured data -- what I call information, not content. Yet the technology -- in particular, SQL and relational-database indexing schemes -- has certain important characteristics that point the way towards desirable qualities for content searching.



As I said in last month's column, searching by content is the most important way to find complex information, such as large pieces of text, images, audio, and video, in large databases. Content-based search must be as intuitive as possible, so that non-technical users can find things without having to learn specialized languages -- just as they should not have to learn SQL to search for structured data.

It must also be possible to set up a database for content-based search with little or no manual intervention. This point is less obvious than the above, but it's just as important. In library information retrieval systems, articles can be found by searching on keywords. Keywords for each article (and other descriptive information) are usually supplied separately -- and manually -- by librarians who have special skills in selecting them. But today's companies that put content into databases do not want to spend the time or invest the skills necessary to enter lots of descriptive information, yet without it, users won't be able to find anything.

So the solution must be to automate the generation of keywords and other descriptive info -- indexing. Traditional database technology, though fine for storing and protecting the content, is clearly not up to this task. Major additional technologies must be added; let's look at some of them.

Worth a thousand words
The most difficult types of content to auto-index are images and video. The required technology is image understanding. With image understanding, a machine can scan an image and conclude something like, "This is an office building on fire with firefighters rescuing people out of the windows." This requires some knowledge of the real world (or at least of a particular area of it), the ability to relate complex shapes, colors, and textures to things in that knowledge base (building, fire, window, firefighter), and -- ideally -- to deduce relationships among those things (firefighters rescuing people because the building is on fire). The more you think about this, the more you realize that this is a very hard problem. Viable solutions are many years away, meaning that images and video will need to be indexed manually for the foreseeable future.

Audio data, analogously, requires speech recognition, a technology that is considerably further along. Speech recognition is easier because the system only has to translate the raw audio data into text; this is a linear, one-to-one mapping. It does not have to "understand" the text the way an image-understanding system has to understand an image: a speech recognizer passes its results along to a text search engine, as we'll see, for automatic indexing.

The difficulty of speech recognition is a function of the size of the vocabulary that a system must recognize, and of whether it must recognize words, phrases, or larger grammatical structures. You may already be familiar with the simplest possible speech recognition problem: you call directory assistance and the phone company asks you if you want it to dial the number and you answer "yes" or "no"; this is a controlled vocabulary of two single words that are easy to distinguish.

The best currently available speech recognition systems can work with vocabularies of perhaps a thousand words and phrases. For example, Bell Atlantic's directory assistance system can distinguish all of the cities in a given area code, but currently it has to pass you to a human operator for the name of the person or business you want. Similarly, speech recognition systems that come with PC sound cards can -- with proper training -- discern names of commands that you want to run. But recognition of sentences or larger grammatical structures, and of more general vocabularies, is not quite there yet: fully automated directory assistance, for example, would require the ability to search for "John Smith on South St." in the Philadelphia residential directory (which contains hundreds of thousands of entries) in a few seconds. It will be a couple of years until that's possible.


Conventional client/server database technology
is not up to the task of searching content.

Then there's text. Text-search technology has been around for years. Many text-search engines are now widely available, robust, efficient, and relatively cheap. Major vendors include Verity, Fulcrum, Personal Library Software (PLS), Dataware, Conquest, and Architext. Some of these have search engines that they license to software makers and online services. For example, Verity licenses its Topic engine to Lotus for use in Lotus Notes and to Adobe for use in Acrobat; PLS's search engine is used on The WELL and America Online; and Fulcrum's engine is a component of the Microsoft Network.

Pretty Good Text Search
The most widely available technology, which I call Pretty Good Text Search (PGTS, in honor of Pretty Good Privacy in the computer security field), is now stuck at the point of diminishing returns -- a point at which going beyond "pretty good" will require quantum breakthroughs, of which we're just starting to get a glimpse. PGTS basically works like this:

  1. A parser reads text from a set of documents, breaks it into words, removes punctuation and other extraneous matter, and removes stopwords like a, the, and, etc.

  2. The indexer inserts the resulting stream of words into a data structure, such as a hash table or B-tree, that allows quick access. The entry for each word in the index structure stores information about each occurrence of the word, such as which documents it appears in and at what positions.

  3. When the user types in a keyword query, the search engine looks at the entries for the keywords and selects the documents that have the most occurrences of them. It presents a hit list of matching documents (by name and other descriptive information) to the user, ranked in order of relevance (number of occurrences of the user's keywords).

If the user's query has more than one keyword, the search engine has to combine occurrence counts in order to do relevance ranking. It typically does this by computing an N-dimensional "distance" for each document, where N is the number of keywords in the query, and the component for each keyword is its number of occurrences in a given document. The greater the distance, the higher the relevance.
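The three steps above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation; the document names, stopword list, and sample text are invented.

```python
# Minimal sketch of the PGTS pipeline: parse, index, query.
from collections import defaultdict
import re

STOPWORDS = {"a", "the", "and", "of", "in", "is"}

def parse(text):
    """Step 1: break text into lowercase words, dropping punctuation and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def build_index(documents):
    """Step 2: inverted index mapping each word to {document: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for pos, word in enumerate(parse(text)):
            index[word][name].append(pos)
    return index

def search(index, query):
    """Step 3: rank documents by total occurrences of the query keywords."""
    scores = defaultdict(int)
    for keyword in parse(query):
        for doc, positions in index.get(keyword, {}).items():
            scores[doc] += len(positions)
    return sorted(scores.items(), key=lambda item: -item[1])

docs = {
    "report1": "The fire spread quickly and the firefighters fought the fire.",
    "report2": "The office fire drill is scheduled for Monday.",
}
index = build_index(docs)
print(search(index, "fire"))  # report1 ranks first: two occurrences vs. one
```

Note that storing word positions, not just counts, is what lets real engines support phrase and proximity queries later without re-reading the documents.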

Commercial PGTS systems have various bells and whistles beyond the above scheme. Most common are query interfaces that allow Boolean combinations of keywords, whether by explicit Boolean language or by parsing natural language queries. Another nicety is word stemming, or removal of common prefixes and suffixes before indexing (so that "searching," "searches," and "searched" are all treated as the same word).
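Word stemming can be approximated with naive suffix stripping; real engines use more careful rule sets (Porter's algorithm, for instance), and the suffix list below is purely illustrative.

```python
# Naive suffix-stripping stemmer: strip the first matching common suffix,
# but keep a stem of at least three letters. Suffix list is illustrative.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("searching"), stem("searches"), stem("searched"))  # search search search
```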

PGTS systems can be quite useful, though like many programs they take some practice and a bit of black art to be able to use them effectively. The limitation with PGTS is that it relies on exact (or semi-exact, with word stemming and other such features) matches with literal words in document content. Thus, if your query contains "fire," it won't match documents that contain "conflagration" or "arson," and not "fire," yet it will match documents about people being terminated from their jobs and about impassioned artistic performances. In other words, PGTS uses your keywords literally, for their syntax, not their meaning.

Therefore it seems obvious that improving text-search performance beyond PGTS means adding the capability to search for meaning (or semantics) instead of for literal words. Unfortunately, this is hard. In fact, it's similar to the component of image understanding, as explained above, that requires an internal knowledge base.

Short of that, the text search field has come up with an incremental improvement over PGTS -- query refinement. Once you have received the initial hit list from the search engine, query refinement lets you select the hit-list items that most closely match your interest. This gives the search engine a much richer source of information about what you want. The search engine could then conceivably use the documents you selected as if they were lists of query keywords and match them against the rest of the database. (However, since this can be quite inefficient, in practice, abstracts are often used instead of the full text of documents.) Query refinement can be iterative, allowing you to keep selecting hit-list items and refining the query until you're satisfied.
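One round of this loop can be sketched as follows. Feeding back the most frequent words of user-selected hits as extra keywords is a simplification of what commercial engines do, and the sample text is invented.

```python
# Sketch of one round of query refinement (relevance feedback):
# expand the query with the most frequent words from hits the user selected.
from collections import Counter
import re

def parse(text):
    return re.findall(r"[a-z]+", text.lower())

def refine_query(keywords, selected_docs, top_n=3):
    counts = Counter()
    for text in selected_docs:
        counts.update(parse(text))
    expanded = [word for word, _ in counts.most_common(top_n)]
    return list(dict.fromkeys(keywords + expanded))  # dedupe, preserve order

selected = ["warehouse fire arson suspected arson investigators on scene"]
print(refine_query(["fire"], selected))  # → ['fire', 'arson', 'warehouse']
```

Running the expanded query against the database then surfaces documents about arson that the literal keyword "fire" alone would have missed.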

Sequel to SQL
From a client/server perspective, query refinement takes text search beyond that which can be closely modeled with standard database query methods. If you think about it, you'll realize that client/server PGTS can be emulated by adding a few straightforward features to SQL. Two vendors have such extended query languages as part of their text-related systems: Documentum's DQL, part of its Documentum Enterprise Document Management System, and TRW's TEQL, a component of its InfoWeb distributed information-retrieval architecture. But a new kind of protocol is necessary for modeling text search with query refinement. One such protocol, gaining in popularity, is the ANSI Z39.50 standard, whose implementations include TRW Business Intelligence Systems' Search Access and the WAISserver from WAIS, Inc. (now part of America Online). We'll look at this important protocol next month.
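To illustrate the claim that client/server PGTS maps closely onto relational queries, here is a hypothetical word-occurrence table queried with plain SQL (via Python's sqlite3). The schema and data are invented for illustration and do not reflect DQL, TEQL, or any vendor's design.

```python
# Basic PGTS relevance ranking expressed relationally: one row per word
# occurrence, ranking as a GROUP BY over occurrence counts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (word TEXT, doc TEXT, pos INTEGER)")
rows = [("fire", "report1", 1), ("fire", "report1", 7),
        ("fire", "report2", 2), ("smoke", "report2", 5)]
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)", rows)

hits = conn.execute(
    "SELECT doc, COUNT(*) AS score FROM occurrence "
    "WHERE word = ? GROUP BY doc ORDER BY score DESC", ("fire",)
).fetchall()
print(hits)  # [('report1', 2), ('report2', 1)]
```

What this style cannot express is the stateful back-and-forth of query refinement, where each round depends on the user's reaction to the previous hit list -- hence the need for a protocol like Z39.50.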

Query refinement -- and hence the need for specialized client/server text-search protocols -- will be around for a while, because significant improvement in text search will not come easily. The next level, as mentioned above, involves tracking semantics of natural language. Semantic (aka concept-based) search requires a model of word meanings that includes various kinds of relationships between words, such as synonyms (fire and conflagration), related terms (fire and arson), and so on.

One way to set up such a model is to build in a thesaurus for the language in question and represent the relationships among words in a semantic network or other such knowledge representation data structure. The Conquest, PLS, and Architext search engines do this; Oracle's ConText text-processing tool uses a similar scheme to create summaries of text documents. The various types of relationships among words are represented as links in the network; they have different "affinity" values showing the strength of the relationship, where synonyms have the highest values. Each document in a database maps to a set of nodes in the network, showing which words are present, augmented by occurrence counts. The sidebar shows an example of how queries can be processed under this scheme.
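A toy version of such a network, with invented words and affinity strengths, might be represented and "activated" like this:

```python
# Toy semantic network: links carry "affinity" strengths; a query activates
# the keyword's node plus neighbors above a user-settable threshold.
# Words and strengths here are invented for illustration.
NETWORK = {
    "fire": {"conflagration": 3, "blaze": 3, "arson": 2, "firefighter": 1},
}

def activate(keyword, threshold=2):
    """Return the query word plus related words with sufficient affinity."""
    related = NETWORK.get(keyword, {})
    return {keyword} | {w for w, s in related.items() if s >= threshold}

print(sorted(activate("fire")))  # ['arson', 'blaze', 'conflagration', 'fire']
```

With the threshold at 2, "firefighter" (strength 1) stays dormant, so documents about firefighters don't clutter a query about fires themselves.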

Reasonable semantic search solutions are beginning to appear. Conquest, for example, works very well but suffers in efficiency when compared to straightforward indexing schemes like PGTS. Yet even semantic search isn't perfect; therefore semantic search is likely to be augmented by query refinement for the foreseeable future. Hence the need for specialized protocols, like Z39.50, that go beyond simple database queries.

Text search, and therefore client/server text-search protocols, have an importance that cannot be overestimated. For one thing, notice that it's possible to reduce all other content search tasks to text search. For sound, if you can convert sound to text, then you can index the sound object by indexing its text. Similarly, if an image understanding system can at least generate names of things that it finds in an image, then a text engine can index those names and get a fair representation of an image's meaning. In other words, text search allows all content types to be indexed for Pretty Good access in the large multimedia databases of the present and future.

Therefore, I claim text search will soon become a requirement for client/server architectures. It's extremely likely that server operating system vendors will want to incorporate it into their systems, just as they have incorporated network transport-layer capability. As storage becomes less and less expensive, and as companies rush to put content onto servers, there will be more and more of a crying need to find things on those servers. Conventional operating system and database technology is inadequate for this task. Text search will let us do it.








[(c) Copyright Web Publishing Inc., an IDG Communications company]

If you have technical problems with this magazine, contact webmaster@sunworld.com

URL: http://www.sunworld.com/swol-09-1995/swol-09-cs.html

Sidebar

Semantic search

A semantic search engine uses a model of words and their interrelationships to process queries based on meaning, not literal words. Here is a simple way of modeling such relationships.

A text query is processed against a semantic network:

[Diagram of the query process]

[Diagram: two portions of a semantic network showing two meanings of the word "fire"]

When the user types in "fire" as a query, first the system would prompt for which of the two (or more) meanings of "fire" is intended. Then, the search engine would activate the appropriate node in the semantic network, as well as those connected to it with relationships of sufficient strength -- where "sufficient" is a parameter the user can set. The diagram shows that the system will activate nodes with relationships of strength 2 or more.

Once the nodes have been activated, the system traces links from them back to documents in the database and calculates their relevance score as the sum of the strengths of all links to each document. The left-hand article has a total score of 6 because it contains two synonyms of "fire" in the correct sense. Notice that even though the article includes the word "firefighter," the system ignores this term because of its weak association with "fire." Also notice that the system isn't sophisticated enough to detect "suspicious blaze" as a phrase related to "arson." The right-hand article gets a score of zero because its occurrence of "fire" is not in the same sense as the query keyword.
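The scoring rule just described can be sketched directly. The link strengths and article word sets below are invented to reproduce the scores in the sidebar, since the diagram itself isn't available.

```python
# Sidebar scoring rule: a document's relevance is the sum of the link
# strengths of the activated words it contains. Strengths are invented.
STRENGTH = {"fire": 3, "conflagration": 3, "blaze": 3, "arson": 2}

def score(activated, doc_words):
    return sum(STRENGTH[w] for w in activated if w in doc_words)

activated = {"fire", "conflagration", "blaze", "arson"}
left_article = {"conflagration", "blaze", "firefighter", "suspicious"}
right_article = {"terminated", "employee"}  # "fire" here only in the job sense
print(score(activated, left_article), score(activated, right_article))  # 6 0
```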
