Enterprise Document Management

From Data to Documents

By Bill Rosenblatt

August 1995

Abstract

Companies and technology vendors have been trying to deliver on the promise of digital document management for years. But the technology hasn't been cost-effective enough, leaving early-adopter companies disappointed after large investments. That's changing now. Technologies and their costs are converging with business processes and benefits to create a vital document-management industry that should soon be as important to enterprise data architects as basic structured data is now.

Documents. You probably deal with them every day. Many organizations have roomfuls of documents that constitute the heart of their most important business processes. You've come to expect them to be mountains of paper that are managed manually and therefore inefficiently.

Yet why shouldn't the management of business-critical documents benefit from the same technological advantages as your company's simple structured data? Put another way: shouldn't document management be as important as traditional data management? Once you accept the fact that document management is a client-server data-management problem, the answer is a very obvious yes.

Companies and technology vendors have been trying to deliver on the promise of digital document management for several years. The benefits of storing documents online are obvious, starting with the savings in space and the ability to access them from virtually anywhere. But the technology hasn't been cost-effective enough, leaving early adopters disappointed after large investments.

However, that's changing. Technologies and their costs are converging with business processes and benefits to create a vital document-management industry that should soon be as important to enterprise data architects as basic structured data is now.

In this month's column, we'll examine the need for document management systems and their requirements. In subsequent columns, we'll go into depth on some of the issues and explore some products.

In the beginning, there was the image...
Document management, as a product category at least, has its origin in the imaging-systems market. Imaging technology centered on expensive turnkey systems with dedicated display devices, expensive hardware and media, proprietary storage formats, and proprietary viewing software. The promise of such systems was essentially that you could store umpteen-thousand pages on one optical disk -- i.e., space savings, nothing more.

More recent versions of these high-end imaging systems have taken advantage of standard LANs and desktop operating systems -- client/server technology, that is -- allowing anyone to view images from their desktops. Standard messaging protocols like VIM and MAPI have enabled imaging systems to add workflow functionality -- the ability to route documents among users in prescribed ways. Imaging vendors like FileNet and ViewStar added such features to their systems and addressed them to classic "paper-flow" processes like insurance-claim processing.

Yet such systems aren't quite document managers -- mainly because the "documents" they manage are rather trivial (if large) images of paper documents, augmented with some simple data attributes. They have little to do with the rest of the information that an organization handles, whether on desktop computers, database servers, or mainframes. Real document management involves all such information.

Docu-centric view
The document is turning out to be the most powerful metaphor for organizing all types of information so that people can work with it -- create, view, change, or move it. Anyone who has used a modern word processor should understand this. It seemed like the most natural thing in the world for users of Microsoft Word, WordPerfect, or Lotus Ami Pro to embed spreadsheets and graphics in their files.

Then it was a short step to adding "click-to-play" multimedia data like sound, video, and animation. Then hypertext links, data fields, queries to external databases, and other kinds of active hot links. Now it seems that a word-processor file is like a two-dimensional canvas onto which any type of information can go -- it has become a compound document, potentially composed of all kinds of objects. Text elements like words, fonts, and margins practically constitute a "default" object type.

Microsoft's Object Linking and Embedding (OLE) was the first commercially successful technology to support compound documents. OLE developed from the need to add features to a word processor, but now it's being pressed into service as an architecture for specifying complex documents. Like so much else from Microsoft, it's not ideal for that purpose, but it serves because of the company's reach in the marketplace.

Yet other technologies have appeared that are more explicitly designed as compound document architectures. The one with the most potential is OpenDoc, which is being developed by Component Integration Labs, Inc., an "anti-Microsoft" consortium founded by WordPerfect, IBM, and Apple.

Although components of OpenDoc have been released, it hasn't taken hold yet -- mainly because, like the proverbial blind men and the elephant, no one is exactly sure what it is or does. Compound-document architectures that are currently successful are more modest in scope: Adobe's Acrobat, with its Portable Document Format (PDF), is a cross-platform document-viewing technology that is a blockbuster waiting to happen. And the other smaller-scale compound-document architecture is already a smash success: the World Wide Web's Hypertext Markup Language (HTML).

Too hard with a hard drive
Technologies like these will become completely pervasive, and therefore everything that an information-technology user sees will look like a document. And all of those documents will need to be managed. Right now, documents sit on users' hard drives, or at best, on file servers. This is no longer sufficient, for three reasons:

Very soon -- if not already -- your organization will have enough important documents online that it will be too difficult to find what you're looking for by poking around desktop hard drives and file servers. These give you access methods that only work for small numbers of documents: filenames and perhaps remote machine names or logical drive letters. This is not a good set of tools for finding "that project plan written some time last year where that guy who used to be the project manager allocated three months for system design." The kinds of tools you will need imply complex data management, as we'll see below.

The number of links among your documents -- whether hypertext, "hot" data queries, embedded objects, or whatever else -- will explode over the next few years. Given the amount of change that typical enterprise computing infrastructures undergo, these links will become harder and harder to track with, once again, tools as barbaric as file and machine names for specifying them.
Furthermore, as object-oriented operating systems and databases become more prevalent, compound documents will naturally map to hierarchies of linked objects, some of which could appear in multiple documents or exist on multiple machines. In other words, the one-to-one mapping between documents and files -- already not strictly true anymore -- will completely disappear.
See the sidebar for an example of distributed compound documents.

More and more documents will be accessed and changed by multiple people. As workgroup technology (as epitomized by Lotus Notes) becomes more widespread, the idea of a document kept to oneself will become more and more obsolete: when you finish a draft, you will need to put or send your document where the right people can get at it. Eventually, even authoring software will gracefully support multiple simultaneous authors.
Another instance of multiple access to a document is the aircraft maintenance scenario (see sidebar), where document components can have more than one link to them. Similarly, documents with hot queries to structured data imply the need for traditional database management.

All of these reasons for going beyond file and machine names strongly imply the need for database-management-like capabilities: indexing, querying, keys, linking, referential integrity, concurrency control.
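
As a rough illustration of what such capabilities could look like, the sketch below uses Python's standard sqlite3 module to identify documents by stable keys rather than file and machine names, and to record links between them with referential integrity enforced. The table names, column names, sample rows, and helper functions are all hypothetical, chosen only for this example; they do not describe any particular product.

```python
import sqlite3

def build_store():
    """Create an in-memory store of documents and the links between them."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE documents ("
        " doc_id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " author TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE links ("
        " source_id INTEGER NOT NULL REFERENCES documents(doc_id),"
        " target_id INTEGER NOT NULL REFERENCES documents(doc_id),"
        " kind TEXT NOT NULL)"
    )
    # Index so that "what links to this document?" is a keyed lookup.
    conn.execute("CREATE INDEX links_by_target ON links(target_id)")
    return conn

def incoming_links(conn, doc_id):
    """List (source_id, kind) for every link that points at doc_id."""
    rows = conn.execute(
        "SELECT source_id, kind FROM links WHERE target_id = ? ORDER BY source_id",
        (doc_id,),
    )
    return rows.fetchall()

def try_delete(conn, doc_id):
    """Delete a document; return False if a link still refers to it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM documents WHERE doc_id = ?", (doc_id,))
        return True
    except sqlite3.IntegrityError:
        return False

conn = build_store()
conn.executemany(
    "INSERT INTO documents (doc_id, title, author) VALUES (?, ?, ?)",
    [
        (1, "Project plan", "Pat"),
        (2, "Budget spreadsheet", "Lee"),
        (3, "Old memo", "Pat"),
    ],
)
# The project plan embeds the budget spreadsheet.
conn.execute("INSERT INTO links VALUES (1, 2, 'embedded object')")
conn.commit()
```

In this sketch, try_delete(conn, 2) is refused because the project plan still embeds the spreadsheet, while the unlinked memo can be removed. A file server offers no comparable guarantee when a file is moved or deleted.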

Conclusion: document management is really database management. Yet, obviously, the features that document databases must support go well beyond those of databases that handle simple structured data like numbers, small character strings, and even Binary Large OBjects (BLOBs -- any data field that contains undifferentiated digitized information).

So what data-management features are necessary for document management? We'll answer that question by initially addressing the first of the three reasons above -- finding things. You will want to access your documents in four basic ways:

By querying on descriptive attributes (metadata) associated with documents or document elements. Some of these could be the name of the author, date of last modification, document number, or format. This implies the need for simple data storage and querying, which any SQL implementation can do well.

Through logical organizational structures, such as directory hierarchies and compound document architectures like OLE and OpenDoc. Any modern operating system includes these features.

By content, as in the project-plan example above. This may actually be the most important access method of them all. The only reasonable way to search by content nowadays is through full-text querying; in a few years, speech and image recognition should be possible too. To enable full-text search, you need a text-search engine that can filter documents in various formats and index them properly. Such text engines are available from vendors like Verity, Fulcrum, and Conquest.

Through ad-hoc relationships among documents, such as sections or chapters of a bigger document, different formats of a document, and so on. Object-oriented operating systems or databases support this functionality, though it can be added fairly easily onto a traditional file-based operating system.

Here are the requirements for document management systems: a traditional file-based or object-oriented operating system, a database that supports SQL, a text-search engine, and some capacity for defining ad-hoc relationships among files. In other words, a file server makes a lousy document-management system. Adding a relational or object-oriented database helps, but it doesn't let you search on content; you really need a text-search engine. Finally, you need a user interface that ties it all together -- including, most significantly, integration of the SQL data-query language with full-text querying.
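
To make the idea of integrating SQL with full-text querying concrete, here is a minimal Python sketch that answers a request like the project-plan search described earlier. It keeps metadata in SQLite (via the standard sqlite3 module) and content in a toy inverted index, then intersects the two result sets. The sample documents, the schema, and the function names are assumptions made up for this illustration; a real text-search engine would also filter document formats, rank results, and handle phrases, none of which is attempted here.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical sample documents: (doc_id, author, year, body text).
SAMPLE_DOCS = [
    (1, "Pat", 1994, "Project plan: three months allocated for system design."),
    (2, "Pat", 1995, "Status report on system design and testing."),
    (3, "Lee", 1994, "Project plan: six weeks allocated for user training."),
]

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_catalog(docs):
    """Load metadata into SQLite and build an inverted index over body text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " doc_id INTEGER PRIMARY KEY, author TEXT, year INTEGER)"
    )
    index = defaultdict(set)  # word -> set of doc_ids containing that word
    for doc_id, author, year, body in docs:
        conn.execute(
            "INSERT INTO documents VALUES (?, ?, ?)", (doc_id, author, year)
        )
        for word in tokenize(body):
            index[word].add(doc_id)
    conn.commit()
    return conn, index

def search(conn, index, author, year, query_text):
    """Return sorted doc_ids that match the metadata AND contain every query word."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE author = ? AND year = ?",
        (author, year),
    )
    matches = {row[0] for row in rows}
    # Narrow the metadata matches by each word of the content query.
    for word in tokenize(query_text):
        matches &= index.get(word, set())
    return sorted(matches)

conn, index = build_catalog(SAMPLE_DOCS)
result = search(conn, index, "Pat", 1994, "three months system design")
```

Neither half is sufficient alone in this sketch: the metadata query cannot see the words "three months," and the text index knows nothing about author or year.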

As you can see, the requirements for enterprise document management constitute a superset of traditional data-management requirements. This makes perfect sense, and given the increasing importance of online business-critical documents, it suggests that database management will become part of the field of document management. Database professionals who feel that their world begins with table definitions and ends with joins should think again: you will soon need to manage documents.

What products are available for enterprise document management, and what standardization efforts are under way? Stay tuned: these questions and others will be answered in future issues of SunWorld Online.

Resources

Object Linking and Embedding
http://www.microsoft.com/TechNet/ole.htm
OpenDoc
http://www.austin.ibm.com/developer/objects/od1.html
Component Integration Labs, Inc.
http://www.cilabs.org
Acrobat
http://www.adobe.com/Acrobat/Acrobat0.html
Lotus Notes
http://www.lotus.com/home/notes.htm
Verity
http://www.verity.com
Fulcrum
http://www.fultech.com
A list of other Client/Server
/sunworldonline/common/swol-backissues-columns.html#cs

URL: http://www.sunworld.com/swol-08-1995/swol-08-cs.html
Distributed maintenance manual

As an example of the power of distributed compound documents, consider maintenance manuals for commercial aircraft. Few commercial airliners are exactly alike; they differ in engine parts, seating configuration, options, and so on. It would be natural to represent the online manual for a given airplane as a hierarchy of objects, each of which is the maintenance document for a single component.

[Maintenance Manual Graphic]

Assume that Trans-Hemisphere Airlines (THA) has a fleet of Douglas-McDonald DM-11s. There may be five different types of DM-11 fuselages, each of which has a maintenance manual in electronic form at the Douglas-McDonald factory. The online manuals for each THA DM-11 would contain links to the correct fuselage manuals at Douglas-McDonald. Now assume that some of the DM-11s have engines made by Ratt & Pitney, while others have engines by General Mechanics (GM). The manual for the GM engine could reside on a computer at GM, and likewise for the Ratt & Pitney. Every THA DM-11 that has a GM engine would have a manual containing a link to that document at GM, while the manuals for planes with R&P engines have links to the analogous document at R&P.

The result of this scheme is that each THA plane can have a maintenance manual that looks seamless and correct, while only one copy of each component is actually stored, and it's stored at the factory where the part was actually created. A sensible arrangement.
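
The arrangement can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each manufacturer's computer is modeled as an in-memory dictionary, links are (site, component) pairs, and the plane identifiers, component names, and manual text are invented for the example.

```python
# Each site holds the single stored copy of the manuals for its own parts.
# Site names follow the sidebar; component names and text are hypothetical.
SITES = {
    "Douglas-McDonald": {"fuselage-type-3": "DM-11 type 3 fuselage manual"},
    "GM": {"engine": "GM engine manual"},
    "R&P": {"engine": "Ratt & Pitney engine manual"},
}

# A plane's manual stores only links, as (site, component) pairs, and no text.
PLANE_MANUALS = {
    "THA-001": [("Douglas-McDonald", "fuselage-type-3"), ("GM", "engine")],
    "THA-002": [("Douglas-McDonald", "fuselage-type-3"), ("R&P", "engine")],
}

def resolve(link, sites):
    """Follow one link to the component manual stored at its home site."""
    site, component = link
    try:
        return sites[site][component]
    except KeyError:
        raise LookupError(f"broken link: {component} at {site}") from None

def assemble_manual(plane_id, manuals, sites):
    """Build the seamless-looking manual for one plane by resolving its links."""
    return [resolve(link, sites) for link in manuals[plane_id]]

manual_001 = assemble_manual("THA-001", PLANE_MANUALS, SITES)
manual_002 = assemble_manual("THA-002", PLANE_MANUALS, SITES)
```

Because both planes link to the same fuselage entry, a revision made once at the factory shows up in every assembled manual the next time it is resolved. The LookupError branch hints at the cost: someone has to keep those links valid, which is the referential-integrity problem discussed in the main article.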
