Why client/server text search will become a hot topic
Conventional client/server means access to databases -- rows and columns of numbers and strings. So why the sudden interest in text-search engines? Because as content becomes more complex in form, users demand access not just to data lists, but to documents. That means searching text for words -- and for meaning and context. No simple problem, but its solution will be important to you. Read on to find out why. (2,700 words)
The heart of client/server until now has been the relational database, which specializes in simple structured data: storing, managing, indexing, manipulating, and retrieving it.
We do that so well that success is breeding a whole new set of problems. Client/server technologies are sucking up an unfathomable amount of information and making it more widely available. There is a mad rush to put content online -- in today's ultra-competitive business environment, to make more and better information more readily available to your customers and your workers.
But increasingly, valuable content isn't easily manipulated by traditional databases. In last month's column we looked at the requirements for managing documents as complex data, and at the various technologies, at different stages of maturity, necessary to do so.
The most important of these new technologies is text search -- which we'll examine this month. Text search may seem trivial -- doesn't your word processor offer text search? Isn't grep a standard part of Unix? But as we'll see, text-search technology is becoming a major arrow in client/server architects' quivers. If you're involved in client/server, you'll need to pay attention to recent developments.
Content vs. information
Conventional client/server technology has focused on simple structured data -- what I call information, not content. Yet the technology -- in particular, SQL and relational-database indexing schemes -- has certain important characteristics that point the way toward desirable qualities for content searching.
As I said in last month's column, searching by content is the most important way to find complex information, such as large pieces of text, images, audio, and video, in large databases. Content-based search must be as intuitive as possible, so that non-technical users can find things without having to learn specialized languages -- just as they should not have to learn SQL to search for structured data.
It must also be possible to set up a database for content-based search with little or no manual intervention. This point is less obvious than the above, but it's just as important. In library information retrieval systems, articles can be found by searching on keywords. Keywords for each article (and other descriptive information) are usually supplied separately -- and manually -- by librarians who have special skills in selecting them. But today's companies that put content into databases do not want to spend the time or invest the skills necessary to enter lots of descriptive information, yet without it, users won't be able to find anything.
So the solution must be to automate the generation of keywords and other descriptive info -- indexing. Traditional database technology, though fine for storing and protecting the content, is clearly not up to this task. Major additional technologies must be added; let's look at some of them.
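The automated indexing this implies can be sketched in a few lines. The following Python sketch (the function names and tokenizing rules are my own illustration, not any vendor's) builds an inverted index that maps each word to the documents containing it, with occurrence counts -- keywords generated with no librarian in the loop:

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and pull out runs of letters; a real engine would also
    # drop stopwords ("the", "and", ...) before indexing.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # docs maps a document id to its full text. The result maps each
    # word to {doc_id: occurrence count} -- descriptive information
    # generated entirely automatically from the content itself.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][doc_id] = count
    return dict(index)
```

Everything downstream -- keyword queries, relevance ranking, even the semantic schemes discussed later -- works from an index like this.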
Worth a thousand words
The most difficult types of content to auto-index are images and video.
The required technology is image understanding. With image
understanding, a machine can scan an image and conclude something like,
"This is an office building on fire with firefighters rescuing people
out of the windows." This requires some knowledge of the real world (or
at least of a particular area of it), the ability to relate complex
shapes, colors, and textures to things in that knowledge base
(building, fire, window, firefighter), and -- ideally -- to deduce
relationships among those things (firefighters rescuing people because
the building is on fire). The more you think about this, the more you
realize that this is a very hard problem. Viable solutions are
many years away, meaning that images and video will need to be indexed
manually for the foreseeable future.
Audio data, analogously, requires speech recognition, a technology that is considerably further along. Speech recognition is easier because the system only has to translate the raw audio data into text; this is a linear, one-to-one mapping. It does not have to "understand" the text the way an image-understanding system has to understand an image: a speech recognizer passes its results along to a text search engine, as we'll see, for automatic indexing.
The difficulty of speech recognition is a function of the size of the vocabulary that a system must recognize, and of whether it must recognize words, phrases, or larger grammatical structures. You may already be familiar with the simplest possible speech recognition problem: you call directory assistance and the phone company asks you if you want it to dial the number and you answer "yes" or "no"; this is a controlled vocabulary of two single words that are easy to distinguish.
The best currently available speech recognition systems can work with vocabularies of perhaps a thousand words and phrases. For example, Bell Atlantic's directory assistance system can distinguish all of the cities in a given area code, but currently it has to pass you to a human operator for the name of the person or business you want. Similarly, speech recognition systems that come with PC sound cards can -- with proper training -- discern names of commands that you want to run. But recognition of sentences or larger grammatical structures, and of more general vocabularies, is not quite there yet: fully automated directory assistance, for example, would require the ability to search for "John Smith on South St." in the Philadelphia residential directory (which contains hundreds of thousands of entries) in a few seconds. It will be a couple of years until that's possible.
Conventional client/server database technology is not up to the task of searching content.
Then there's text. Text-search technology has been around for years. Many text-search engines are now widely available, robust, efficient, and relatively cheap. Major vendors include Verity, Fulcrum, Personal Library Software (PLS), Dataware, Conquest, and Architext. Some of these vendors license their search engines to software makers and online services. For example, Verity licenses its Topic engine to Lotus for use in Lotus Notes and to Adobe for use in Acrobat; PLS's search engine is used on The WELL and America Online; and Fulcrum's engine is a component of the Microsoft Network.
Pretty Good Text Search
Commercial Pretty Good Text Search (PGTS) systems have various bells and whistles beyond the basic indexing scheme. Most common are query interfaces that allow Boolean combinations of keywords, whether through an explicit Boolean language or by parsing natural-language queries. Another nicety is word stemming, or removal of common prefixes and suffixes before indexing (so that "searching," "searches," and "searched" are all treated as the same word).
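A minimal sketch of these two features in Python -- the suffix list is intentionally crude and of my own choosing (production engines use far more careful stemming rules, such as the Porter algorithm), and the query handler supports only a single Boolean operator:

```python
def stem(word):
    # Crude illustrative stemmer: strip one common suffix so that
    # "searching", "searches", and "searched" all reduce to "search".
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def boolean_search(index, query):
    # Evaluate a "term AND term" or "term OR term" query against an
    # inverted index of {stemmed word: set of document ids}.
    left, op, right = query.lower().split()
    hits_left = index.get(stem(left), set())
    hits_right = index.get(stem(right), set())
    return hits_left & hits_right if op == "and" else hits_left | hits_right
```

Stemming widens recall (the query "searches" finds documents that say "searching"), while Boolean operators narrow or broaden the hit set explicitly.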
PGTS systems can be quite useful, though like many programs they take some practice and a bit of black art to use effectively. The limitation of PGTS is that it relies on exact (or semi-exact, allowing for word stemming and similar features) matches with literal words in document content. Thus, if your query contains "fire," it won't match documents that contain "conflagration" or "arson" but not "fire," yet it will match documents about people being terminated from their jobs and about impassioned artistic performances. In other words, PGTS uses your keywords literally -- for their syntax, not their meaning.
It seems obvious, therefore, that improving text-search performance beyond PGTS means adding the capability to search for meaning (or semantics) instead of for literal words. Unfortunately, this is hard. In fact, it resembles the component of image understanding, described above, that requires an internal knowledge base.
Short of that, the text-search field has come up with an incremental improvement over PGTS -- query refinement. Once you have received the initial hit list from the search engine, query refinement lets you select the hit-list items that most closely match your interest. This gives the search engine a much richer source of information about what you want. The search engine can then use the documents you selected as if they were lists of query keywords and match them against the rest of the database. (Because this can be quite inefficient, in practice abstracts are often used instead of the full text of documents.) Query refinement can be iterative, allowing you to keep selecting hit-list items and refining the query until you're satisfied.
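As a sketch, the refinement step might look like the following. The scoring here is a simple keyword-overlap count I've made up for illustration -- commercial engines use more sophisticated relevance formulas -- and, per the efficiency point above, it works from abstracts rather than full text:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into words.
    return re.findall(r"[a-z]+", text.lower())

def refine(selected_abstracts, remaining_docs, top_n=3):
    # Treat the abstracts of the hit-list items the user selected as one
    # big list of query keywords, then re-rank the rest of the database
    # by how strongly each document overlaps that keyword profile.
    profile = Counter()
    for abstract in selected_abstracts:
        profile.update(tokenize(abstract))
    scored = []
    for doc_id, text in remaining_docs.items():
        words = set(tokenize(text))
        score = sum(n for word, n in profile.items() if word in words)
        if score:
            scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]
```

Iterating is just a matter of feeding each round's selections back in as `selected_abstracts`.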
Sequel to SQL
Query refinement -- and hence the need for specialized
client/server text-search protocols -- will be around for a while,
because significant improvement in text search will not come easily.
The next level, as mentioned above, involves tracking semantics of
natural language. Semantic (aka concept-based) search
requires a model of word meanings that includes various kinds of
relationships between words, such as synonyms (fire and conflagration),
related terms (fire and arson), and so on.
One way to set up such a model is to build in a thesaurus for the
language in question and represent the relationships among words in a
semantic network or other such knowledge representation data
structure. The Conquest,
PLS, and Architext search engines do this; Oracle's ConText text-processing tool
uses a similar scheme to create summaries of text documents. The
various types of relationships among words are represented as links in
the network; they have different "affinity" values showing the strength
of the relationship, where synonyms have the highest values. Each
document in a database maps to a set of nodes in the network, showing
which words are present, augmented by occurrence counts. The sidebar shows an example of how
queries can be processed under this scheme.
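One way the network and its activation step might be modeled in Python -- the node names and affinity values (3 for synonyms, 2 for related terms, 1 for weak associations) are my own illustrative assumptions, not values from any shipping product:

```python
# Word senses are distinct nodes, so "fire" the blaze and "fire" the
# dismissal don't get confused. Link values are assumed affinities:
# 3 = synonym, 2 = related term, 1 = weak association.
affinities = {
    "fire(combustion)": {"conflagration": 3, "blaze": 3, "arson": 2,
                         "firefighter": 1},
    "fire(dismissal)": {"layoff": 3, "terminate": 2},
}

def activate(node, min_strength):
    # Activate the query node plus every neighbor whose link strength
    # meets the user-settable threshold.
    return {node} | {neighbor
                     for neighbor, strength in affinities.get(node, {}).items()
                     if strength >= min_strength}
```

With a threshold of 2, a query on "fire" in the combustion sense also activates "conflagration," "blaze," and "arson," but not the weakly linked "firefighter."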
Reasonable semantic search solutions are beginning to appear.
Conquest, for example, works very well but suffers in efficiency when
compared to straightforward indexing schemes like PGTS. Yet even
semantic search isn't perfect; therefore semantic search is likely to
be augmented by query refinement for the foreseeable future. Hence the
need for specialized protocols, like Z39.50, that go beyond simple
database queries.
Text search, and therefore client/server text-search protocols, has an importance that cannot be overstated. For one thing, notice
that it's possible to reduce all other content search tasks to text
search. For sound, if you can convert sound to text, then you can index
the sound object by indexing its text. Similarly, if an image
understanding system can at least generate names of things that it
finds in an image, then a text engine can index those names and get a
fair representation of an image's meaning. In other words, text search
allows all content types to be indexed for Pretty Good access in the
large multimedia databases of the present and future.
Therefore, I claim text search will soon become a requirement for client/server architectures. Server operating system vendors will very likely want to incorporate it into their systems, just as they have been incorporating network transport-layer capability in recent years. As storage becomes less and less expensive, and as companies rush to put content onto servers, there will be an ever more crying need to find things on those servers. Conventional operating system and database technology is inadequate for this task. Text search will let us do it.
Resources
URL: http://www.sunworld.com/swol-09-1995/swol-09-cs.html
Sidebar: Semantic search
A semantic search engine uses a model of words and their interrelationships to process queries based on meaning, not literal words. Here is a simple way of modeling such relationships.
A text query is processed against a semantic network:
When the user types in "fire" as a query, first the system would
prompt for which of the two (or more) meanings of "fire" is intended.
Then, the search engine would activate the appropriate node in the
semantic network, as well as those connected to it with relationships
of sufficient strength -- where "sufficient" is a parameter the user
can set. The diagram shows that the system will activate nodes with
relationships of strength 2 or more.
Once the nodes have been activated, the system traces links from them back to documents in the database and calculates each document's relevance score as the sum of the strengths of all links to it. The left-hand article has a total score of 6 because it contains two synonyms of "fire" in the correct sense. Notice that even though the article includes the word "firefighter," that term contributes nothing because of its weak association with "fire." Also notice that the system isn't sophisticated enough to detect "suspicious blaze" as a phrase related to "arson." The right-hand article gets a score of zero because its occurrence of "fire" is not in the same sense as the query keyword.
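The sidebar's arithmetic can be reproduced in a small Python sketch. The link strengths here are assumed values of my own (3 = synonym, 2 = related, 1 = weak), chosen so that two synonym hits sum to 6 as in the example:

```python
# Assumed affinity links from the query node "fire" in its combustion sense.
links = {"fire(combustion)": {"conflagration": 3, "blaze": 3, "arson": 2,
                              "firefighter": 1}}

# Each document maps to the network nodes it contains, with occurrence counts.
docs = {
    "warehouse-story": {"conflagration": 1, "blaze": 1, "firefighter": 1},
    "layoff-story": {"fire(dismissal)": 2},
}

def relevance(query_node, doc_nodes, min_strength=2):
    # Sum the strengths of links from the query node to every node the
    # document contains; links below the threshold contribute nothing.
    total = 0
    for node, count in doc_nodes.items():
        strength = links.get(query_node, {}).get(node, 0)
        if strength >= min_strength:
            total += strength * count
    return total
```

The first document scores 3 + 3 = 6 from its two synonyms, with "firefighter" excluded by the threshold; the second scores zero because its "fire" is the dismissal sense, a different node entirely.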
The most widely available technology, which I call Pretty Good Text Search (PGTS, in honor of Pretty Good Privacy in the computer security field), is now stuck at the point of diminishing returns -- a point at which going beyond "pretty good" will require quantum breakthroughs, of which we're just starting to get a glimpse. PGTS basically works like this: the engine automatically indexes every word in every document, recording how many times each word occurs in each document; a single-keyword query then matches any document containing that word, ranked by occurrence count.
If the user's query has more than one keyword, the search engine has to combine occurrence numbers in order to do relevance ranking. It typically does this by computing an N-dimensional "distance" for each document, where N is the number of keywords in the query and the coordinate for each keyword is the number of occurrences in a given document. The greater the distance, the higher the relevance.
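A sketch of that ranking computation, taking the "distance" as ordinary Euclidean vector length (one reasonable reading of the scheme; engines differ in the exact formula they use):

```python
import math

def relevance(query_keywords, occurrence_counts):
    # Treat the occurrence count of each query keyword in a document as
    # one coordinate of an N-dimensional vector (N = number of keywords);
    # the vector's length from the origin is the relevance "distance".
    return math.sqrt(sum(occurrence_counts.get(word, 0) ** 2
                         for word in query_keywords))
```

A document with three occurrences of "fire" and four of "arson" thus ranks above one mentioning either keyword alone.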
From a client/server perspective, query refinement takes text
search beyond that which can be closely modeled with standard database
query methods. If you think about it, you'll realize that client/server
PGTS can be emulated by adding a few straightforward features to SQL.
Two vendors have such extended query languages as part of their
text-related systems: Documentum's DQL, part of its Documentum
Enterprise Document Management System, and TRW's TEQL, a component of
its InfoWeb distributed information-retrieval architecture. But a new
kind of protocol is necessary for modeling text search with query
refinement. One such protocol, gaining in popularity, is the ANSI
Z39.50 standard, whose implementations include TRW Business Intelligence Systems'
Search
Access and the WAISserver
from WAIS, Inc. (now part of America
Online). We'll look at this important protocol next month.
Bell Atlantic: http://www.bell-atl.com/
Verity: http://www.verity.com/
Fulcrum: http://www.fultech.com/
Dataware: http://www.dataware.com/
Personal Library Software (PLS): http://www.pls.com/
Architext: http://www.atext.com/
Adobe: http://www.adobe.com/
Adobe Acrobat: http://www.adobe.com/Acrobat/Acrobat0.html
Verity's Topic: http://www.verity.com/family.html
The WELL: http://www.well.com/
The Microsoft Network: http://www.msn.com/
Z39.50: http://vinca.cnidr.org/protocols/z3950/z3950.html
TRW Business Intelligence Systems: http://www.bis.trw.com/
WAISserver: http://www.wais.com/newhomepages/product.html
WAIS, Inc.: http://www.wais.com/
Oracle: http://www.oracle.com/