Enterprise Document Management
From Data to Documents
Companies and technology vendors have been trying to deliver on the promise of digital document management for years. But the technology hasn't been cost-effective enough, leaving early-adopter companies disappointed after large investments. That's changing now. Technologies and their costs are converging with business processes and benefits to create a vital document-management industry that should soon be as important to enterprise data architects as basic structured data is now.
Documents. You probably deal with them every day. Many organizations have roomfuls of documents that constitute the heart of their most important business processes. You've come to expect them to be mountains of paper that are managed manually and therefore inefficiently.
Yet why shouldn't the management of business-critical documents benefit from the same technological advantages as your company's simple structured data? Put another way: shouldn't document management be as important as traditional data management? Once you accept that document management is a client/server data-management problem, the answer is an obvious yes.
Companies and technology vendors have been trying to deliver on the promise of digital document management for several years. The benefits of storing documents online are obvious, starting with the savings in space and the ability to access them from virtually anywhere. But the technology hasn't been cost-effective enough, leaving early adopters disappointed after large investments.
However, that's changing. Technologies and their costs are converging with business processes and benefits to create a vital document-management industry that should soon be as important to enterprise data architects as basic structured data is now.
In this month's column, we'll examine the need for document management systems and their requirements. In subsequent columns, we'll go into depth on some of the issues and explore some products.
In the beginning, there was the image...
Document management, as a product category at least, has its origin in the imaging-systems market. Imaging technology centered on expensive turnkey systems with dedicated display devices, expensive hardware and media, proprietary storage formats, and proprietary viewing software. The promise of such systems was essentially that you could store umpteen-thousand pages on one optical disk -- i.e., space savings, nothing more.
More recent versions of these high-end imaging systems have taken advantage of standard LANs and desktop operating systems -- client/server technology, that is -- allowing anyone to view images from their desktops. Standard messaging protocols like VIM and MAPI have enabled imaging systems to add workflow functionality -- the ability to route documents among users in prescribed ways. Imaging vendors like FileNet and ViewStar added such features to their systems and addressed them to classic "paper-flow" processes like insurance-claim processing.
Yet such systems aren't quite document managers -- mainly because the "documents" they manage are rather trivial (if large) images of paper documents, augmented with some simple data attributes. They have little to do with the rest of the information that an organization handles, whether on desktop computers, database servers, or mainframes. Real document management involves all such information.
The document is turning out to be the most overwhelmingly powerful metaphor for organizing all types of information so that people can work with it -- create, view, change, or move it. Anyone who has used a modern word processor should understand this. It seemed like the most natural thing in the world for users of Microsoft Word, WordPerfect, or Lotus Ami Pro to embed spreadsheets and graphics in their files.
Then it was a short step to adding "click-to-play" multimedia data like sound, video, and animation. Then hypertext links, data fields, queries to external databases, and other kinds of active hot links. Now it seems that a word-processor file is like a two-dimensional canvas onto which any type of information can go -- it has become a compound document, potentially composed of all kinds of objects. Text elements like words, fonts, and margins practically constitute a "default" object type.
Microsoft's Object Linking and Embedding (OLE) was the first commercially successful technology to support compound documents. OLE developed from the need to add features to a word processor, but now it's being pressed into service as an architecture for specifying complex documents. Like so much else from Microsoft, it's not ideal for that purpose, but it serves because of the company's reach in the marketplace.
Yet other technologies have appeared that are more explicitly designed as compound document architectures. The one with the most potential is OpenDoc, which is being developed by Component Integration Labs, Inc., an "anti-Microsoft" consortium founded by WordPerfect, IBM, and Apple.
Although components of OpenDoc have been released, it hasn't taken hold yet -- mainly because, like the proverbial blind men and the elephant, no one is exactly sure what it is or does. Compound-document architectures that are currently successful are more modest in scope: Adobe's Acrobat, with its Portable Document Format (PDF), is a cross-platform document-viewing technology that is a blockbuster waiting to happen. And the other smaller-scale compound-document architecture is already a smash success: the World Wide Web's Hypertext Markup Language (HTML).
Too hard with a hard drive
Technologies like these will become completely pervasive, and therefore, everything that an information-technology user sees will look like a document. And all of those documents will need to be managed. Right now, documents sit on users' hard drives, or at best, on file servers. This is no longer sufficient, for several reasons:
Furthermore, as object-oriented operating systems and databases become more prevalent, compound documents will naturally map to hierarchies of linked objects, some of which could appear in multiple documents or exist on multiple machines. In other words, the one-to-one mapping between documents and files -- already not strictly true anymore -- will completely disappear.
See the sidebar for an example of distributed compound documents.
Another instance of multiple access to a document is the aircraft maintenance scenario (see sidebar), where document components can have more than one link to them. Similarly, documents with hot queries to structured data imply the need for traditional database management.
All of these reasons for going beyond file and machine names strongly imply the need for database-management-like capabilities: indexing, querying, keys, linking, referential integrity, concurrency control.
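To make a few of those capabilities concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample identifiers (`report-a`, `chart-1`, and so on) are all hypothetical and chosen only for illustration; no particular document-management product is assumed to work this way. It shows keys, an index, a query that finds every document linked to a component, and referential integrity refusing to delete a component that documents still link to.

```python
import sqlite3


def build_catalog():
    """Create an in-memory catalog of documents, components, and links."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE component (
            component_id TEXT PRIMARY KEY,
            title        TEXT NOT NULL
        );
        CREATE TABLE document (
            document_id TEXT PRIMARY KEY,
            title       TEXT NOT NULL
        );
        -- One row per link; a component may be linked from many documents.
        CREATE TABLE link (
            document_id  TEXT NOT NULL REFERENCES document(document_id),
            component_id TEXT NOT NULL REFERENCES component(component_id),
            PRIMARY KEY (document_id, component_id)
        );
        CREATE INDEX link_by_component ON link(component_id);
    """)
    return conn


def load_example(conn):
    """Insert two hypothetical reports that share one embedded chart."""
    with conn:  # commits on success
        conn.execute("INSERT INTO component VALUES ('chart-1', 'Quarterly chart')")
        conn.execute("INSERT INTO document VALUES ('report-a', 'Report A')")
        conn.execute("INSERT INTO document VALUES ('report-b', 'Report B')")
        conn.execute("INSERT INTO link VALUES ('report-a', 'chart-1')")
        conn.execute("INSERT INTO link VALUES ('report-b', 'chart-1')")


def documents_using(conn, component_id):
    """Query: which documents link to this component?"""
    rows = conn.execute(
        "SELECT document_id FROM link WHERE component_id = ? ORDER BY document_id",
        (component_id,),
    )
    return [row[0] for row in rows]


def try_delete_component(conn, component_id):
    """Return True if the component was deleted, False if links still point to it."""
    try:
        with conn:  # rolls back if the foreign-key check fails
            conn.execute("DELETE FROM component WHERE component_id = ?", (component_id,))
        return True
    except sqlite3.IntegrityError:
        return False
```

With the sample data loaded, `documents_using(conn, "chart-1")` lists both reports, and `try_delete_component(conn, "chart-1")` returns False because the links would otherwise dangle. A plain file server offers neither guarantee.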
Conclusion: document management is really database management. Yet, obviously, the features that document databases must support go well beyond those of databases that handle simple structured data like numbers, small character strings, and even Binary Large OBjects (BLOBs -- any data field that contains undifferentiated digitized information).
So what data-management features are necessary for document management? We'll answer that question by initially addressing the first of the three reasons above -- finding things. You will want to access your documents in four basic ways:
Here are the requirements for document management systems: a traditional file-based or object-oriented operating system, a database that supports SQL, a text-search engine, and some capacity for defining ad-hoc relationships among files. In other words, a file server makes a lousy document-management system. Adding a relational or object-oriented database helps, but it doesn't let you search on content; you really need a text-search engine. Finally, you need a user interface that ties it all together -- including, most significantly, integration of the SQL data-query language with full-text querying.
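As a rough illustration of what integrating SQL with full-text querying might mean, the sketch below pairs a small SQL table of document attributes with a toy in-memory word index, and answers a query by intersecting the two result sets. Everything here is an assumption made for the example: the `DocumentStore` class, the `author` and `year` attributes, and the crude word index, which stands in for a real text-search engine.

```python
import re
import sqlite3
from collections import defaultdict


class DocumentStore:
    """Toy store: SQL for attributes, an inverted index for content."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, author TEXT, year INTEGER)"
        )
        self.index = defaultdict(set)  # word -> set of doc_ids containing it

    def add(self, author, year, text):
        """Store the attributes in SQL and index every distinct word of the text."""
        with self.db:
            cursor = self.db.execute(
                "INSERT INTO doc (author, year) VALUES (?, ?)", (author, year)
            )
        doc_id = cursor.lastrowid
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            self.index[word].add(doc_id)
        return doc_id

    def search(self, where_sql, params, words):
        """Return ids of documents matching the SQL condition AND containing all words.

        where_sql is treated as trusted text in this sketch; values go in params.
        """
        rows = self.db.execute(f"SELECT doc_id FROM doc WHERE {where_sql}", params)
        matches = {row[0] for row in rows}
        for word in words:
            # .get avoids creating empty index entries for unknown words
            matches &= self.index.get(word.lower(), set())
        return sorted(matches)


store = DocumentStore()
store.add("lee", 1995, "Engine maintenance schedule")
store.add("lee", 1993, "Engine maintenance history")
store.add("kim", 1995, "Engine maintenance checklist")
store.add("lee", 1995, "Seating configuration")

print(store.search("author = ? AND year >= ?", ("lee", 1994), ["engine", "maintenance"]))
```

Only the first document satisfies both halves of the query. The point of the sketch is that neither the SQL condition nor the word lookup could answer the question alone, which is why the user interface has to tie the two together.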
As you can see, the requirements for enterprise document management constitute a superset of traditional data-management requirements. This makes perfect sense, and given the increasing importance of online business-critical documents, it suggests that database management will become part of the field of document management. Database professionals who feel that their world begins with table definitions and ends with joins should think again: you will soon need to manage documents.
What products are available for enterprise document management, and what standardization efforts are under way? Stay tuned: these questions and others will be answered in future issues of SunWorld Online.
As an example of the power of distributed compound documents, consider maintenance manuals for commercial aircraft. Few commercial airliners are exactly alike; they differ in engine parts, seating configuration, options, and so on. It would be natural to represent the online manual for a given airplane as a hierarchy of objects, each of which is the maintenance document for a single component.
Assume that Trans-Hemisphere Airlines (THA) has a fleet of Douglas-McDonald DM-11s. There may be five different types of DM-11 fuselages, each of which has a maintenance manual in electronic form at the Douglas-McDonald factory. The online manuals for each THA DM-11 would contain links to the correct fuselage manuals at Douglas-McDonald. Now assume that some of the DM-11s have engines made by Ratt & Pitney, while others have engines by General Mechanics (GM). The manual for the GM engine could reside on a computer at GM, and likewise for the Ratt & Pitney. Every THA DM-11 that has a GM engine would have a manual containing a link to that document at GM, while the manuals for planes with R&P engines have links to the analogous document at R&P.
The result of this scheme is that each THA plane can have a maintenance manual that looks seamless and correct, while only one copy of each component is actually stored, and it's stored at the factory where the part was actually created. A sensible arrangement.
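The scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the remote computers are simulated by in-memory dictionaries, and the `Link` and `Manual` classes, the site keys, the component identifiers such as `fuselage-a`, and the manual text are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """A pointer to one component document held at one site."""
    site: str
    component_id: str


@dataclass
class Manual:
    """A per-airplane manual: a title plus links, with no copied content."""
    title: str
    links: list = field(default_factory=list)


def make_sites():
    """Simulated component stores, one per manufacturer (hypothetical contents)."""
    return {
        "douglas-mcdonald": {"fuselage-a": "DM-11 fuselage (variant A): inspection procedure"},
        "gm": {"engine": "GM engine: overhaul procedure"},
        "ratt-pitney": {"engine": "R&P engine: overhaul procedure"},
    }


def render(manual, sites):
    """Assemble the manual by following each link to its single stored copy."""
    sections = [manual.title]
    for link in manual.links:
        sections.append(sites[link.site][link.component_id])
    return "\n".join(sections)


sites = make_sites()
fuselage = Link("douglas-mcdonald", "fuselage-a")
plane_1 = Manual("THA DM-11 #1", [fuselage, Link("gm", "engine")])
plane_2 = Manual("THA DM-11 #2", [fuselage, Link("ratt-pitney", "engine")])
plane_3 = Manual("THA DM-11 #3", [fuselage, Link("gm", "engine")])

# A revision made once at GM shows up in every manual that links to it.
sites["gm"]["engine"] = "GM engine: revised overhaul procedure"
print(render(plane_1, sites))
print(render(plane_3, sites))
```

Because the manuals hold links rather than copies, the revision at GM appears in the manuals for planes 1 and 3 without touching either of them, while the manual for plane 2 is unaffected.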