Smart data

Add intelligence to your data files

August 1999

Abstract

By adding contextual information and/or executable code to your data files, you can make them more flexible, powerful, and portable. (1,000 words)

In "A lazy afternoon" (Silicon Carny, March 1999), I described the use of Perl as a tool for encoding dbm (Unix database management) data, in order to facilitate transfer between machines:

	$divisions{AP} = 'Applications';
	...

This format is easy to generate. Parsing it, of course, is trivial: we just let Perl do the work! And, because the data is represented as ASCII strings, byte-ordering is not an issue.

As long as we take care to escape (disable special interpretation for) any magic characters (e.g., quote marks and backslashes), Perl is quite happy to ingest the file, loading specified locations with specified values. Because Perl is a full programming language, the assignment statements can be as tricky as desired:

	$a = 'to be';
	$b = "$a or not $a";
	foreach (1 .. 9) { $c[$_] = $_ * $_ }
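
The escaping step is the only part of generating such a file that takes any care. Here is a rough sketch of a generator, written in Python rather than Perl. The helper names (perl_quote and emit_assignments) are hypothetical, and the sketch assumes that the hash name is a valid Perl identifier. It writes each hash key as a quoted string, which Perl treats the same way as the bare AP shown earlier.

```python
def perl_quote(value):
    """Return value as a single-quoted Perl string literal."""
    # Inside Perl single quotes, only the backslash and the quote
    # mark itself need escaping. Escape backslashes first.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def emit_assignments(hash_name, table):
    """Return Perl source that loads every key/value pair into %hash_name."""
    lines = [
        f"${hash_name}{{{perl_quote(key)}}} = {perl_quote(value)};"
        for key, value in sorted(table.items())
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(emit_assignments("divisions", {"AP": "Applications"}), end="")
```

Sorting the keys is a design choice, not a requirement; it simply makes two dumps of the same data easy to compare.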

This general technique (encoding data in a programming language) sees more regular use than you might think. Every PostScript file, for example, is a program. The PostScript program tells the printer how to render and print the desired image, as in the following example:

	%!
	100 100 moveto
	300 300 lineto
	stroke showpage
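
To make the "file as program" idea concrete, here is a small sketch (in Python, with a hypothetical function name) that turns a list of points into a PostScript program like the one above. It is only an illustration of generating a program as data, and it assumes the coordinates are already in PostScript's default units.

```python
def polyline_to_postscript(points):
    """Return a PostScript program that strokes a path through points."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    lines = ["%!"]
    for index, (x, y) in enumerate(points):
        # The first point starts the path; later points extend it.
        operator = "moveto" if index == 0 else "lineto"
        lines.append(f"{x} {y} {operator}")
    lines.append("stroke showpage")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(polyline_to_postscript([(100, 100), (300, 300)]), end="")
```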

Several attempts have been made to use this technique in windowing environments. Sun's ill-fated NeWS environment used raw PostScript both for imaging and user interface programming. Although this was an elegant and powerful system, industry backlash against Sun's dominance and licensing fees allowed X11 to win the Unix desktop battle.

The general idea has not, however, gone away. Display PostScript, used in conjunction with X11, is still used by some programs to provide imaging support. Apple's Mac OS X system will use Adobe Systems' PDF (Portable Document Format) as a basis for imaging. PDF is more formally defined than raw PostScript, allowing documents to be merged a bit more easily.

Getting away from PostScript, there are a number of other systems that generate programs and ship them around as data. Sun's Java and Microsoft's macros are revealing examples, showing both the power and dangers inherent in this technique.

Importing programs into your computer environment is always an iffy proposition. If you don't trust both the code's originator and the transmission channel, you may be subjecting your computer to unknown dangers.

Fortunately, many of the benefits of smart data can be realized without incurring these sorts of dangers. For instance, it is quite possible to use declarative data languages, such as BoulderIO and XML, that support variable definitions but not arbitrary programming constructs.

BoulderIO
Lincoln Stein's BoulderIO is a very simple system, supporting only two data types: strings and records. Strings are terminated by a newline character, and records are enclosed by a pair of braces:

	a=aaaaa...
	b={
	  ba=aaaaa...
	  bb=bbbbb...
	  bc=ccccc...
	}

To make these structures easy to use, Stein has created a library of Perl modules that parse and/or emit BoulderIO files. Of course, the syntax is so simple that BoulderIO files are absolutely trivial for scripts to generate.
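
To show just how simple the syntax is, here is a minimal sketch of a parser and an emitter, written in Python. It is not Stein's library and does not claim to cover the full BoulderIO format; it handles only the simplified tag=value and tag={ ... } layout shown above, and the function names are hypothetical.

```python
def parse_records(lines):
    """Parse 'tag=value' lines, where 'tag={' opens a nested record."""
    root = {}
    stack = [root]  # The innermost open record is always stack[-1].
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace")
            stack.pop()
            continue
        tag, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected tag=value, got {line!r}")
        if value == "{":
            child = {}
            stack[-1][tag] = child
            stack.append(child)
        else:
            stack[-1][tag] = value
    if len(stack) != 1:
        raise ValueError("unclosed record")
    return root


def emit_records(record, indent=0):
    """Return the lines that represent record, indenting nested records."""
    pad = "  " * indent
    out = []
    for tag, value in record.items():
        if isinstance(value, dict):
            out.append(f"{pad}{tag}={{")
            out.extend(emit_records(value, indent + 1))
            out.append(f"{pad}}}")
        else:
            out.append(f"{pad}{tag}={value}")
    return out


if __name__ == "__main__":
    text = "a=aaaaa\nb={\n  ba=aaaaa\n  bb=bbbbb\n}\n"
    print("\n".join(emit_records(parse_records(text.splitlines()))))
```

Because a string ends at the newline, this sketch assumes that values never contain one.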

BoulderIO has its roots in genetic sequence manipulation, so it has some specialized features (e.g., a Blast interface) that would be of interest only to geneticists. But BoulderIO's data format (and supporting code) are convenient, general, and surprisingly powerful, making it useful for the rest of us as well.

Although BoulderIO fits well into the Unix filter and pipe metaphor, it does so in a somewhat peculiar manner. BoulderIO programs can filter an incoming data stream, using and/or modifying only those elements that are relevant. Any items that are not specifically processed are passed to the following programs in the pipeline. This allows the entire body of data to increase in size and complexity, adding annotations, intermediate results, etc. The resulting file can thus be a complete record of the pipeline's activities, allowing easy and productive analysis.
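
The pass-through behavior is the heart of the matter, so here is a rough sketch of it in Python. Records are modeled as plain dictionaries, and the stages are ordinary functions rather than separate Unix processes; the field names (sequence, length, gc_count) and function names are hypothetical. Each stage touches only the fields it knows about and hands everything else along unchanged.

```python
def add_length(record):
    """Annotate a record with the length of its sequence, if it has one."""
    out = dict(record)  # Copy, so every other field passes through as is.
    if "sequence" in out:
        out["length"] = str(len(out["sequence"]))
    return out


def add_gc_count(record):
    """Annotate a record with the number of G and C bases in its sequence."""
    out = dict(record)
    if "sequence" in out:
        seq = out["sequence"].upper()
        out["gc_count"] = str(seq.count("G") + seq.count("C"))
    return out


def run_pipeline(records, stages):
    """Feed each record through every stage in order, yielding the results."""
    for record in records:
        for stage in stages:
            record = stage(record)
        yield record


if __name__ == "__main__":
    source = [{"id": "x1", "sequence": "GATTACA", "note": "untouched"}]
    for result in run_pipeline(source, [add_length, add_gc_count]):
        print(result)
```

Neither stage knows about the note field, yet it survives to the end of the pipeline alongside the new annotations.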

I have found myself using BoulderIO for a variety of tasks, some of which are probably rather different from anything Stein had in mind when he designed the system. For instance, I find it to be a very useful format for writing log files (e.g., for CGI scripts).

Log entries can be of arbitrary size, but they are always tied together as records. New elements are trivial to add, since there are no concerns about "breaking" the data format. Finally, the files are easily readable by both humans and computers.
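
As an illustration, a log writer along these lines might look like the following Python sketch. The function name, the entry tag, and the field names are all hypothetical, and the layout is the simplified one shown earlier. Since a string ends at the newline, the sketch folds any embedded newlines into spaces.

```python
def format_entry(fields, tag="entry"):
    """Return one log entry as a brace-delimited record."""
    lines = [f"{tag}={{"]
    for key, value in fields.items():
        # Strings end at a newline, so fold embedded newlines into spaces.
        text = " ".join(str(value).splitlines())
        lines.append(f"  {key}={text}")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_entry({"script": "search.cgi", "status": 200}), end="")
```

Adding a new field is just a matter of adding a key to the dictionary; older entries in the same file remain readable.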

XML
I haven't studied XML (Extensible Markup Language) in depth, but it appears to have many of BoulderIO's useful attributes, along with several of its own. Like BoulderIO, XML can encode both strings and records. It is supported in this, however, by strong tools for enforcing standardization.

Using XML DTDs (document type definitions), it is quite possible to publish standards for given sorts of information (e.g., bibliographic entries). Because each corresponding XML document references the DTD, the receiving program can ensure that all of the elements in the document are well-defined and complete.
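
The following Python sketch gives a feel for the kind of completeness check involved. Python's standard xml.etree.ElementTree parser does not validate against DTDs, so a hand-written table of required child elements stands in for the DTD here; it checks only that each required child is present, not order or repetition. The element names and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule. A DTD would express a stricter version of it as:
#   <!ELEMENT book (author+, title, year)>
REQUIRED_CHILDREN = {"book": ["author", "title", "year"]}


def missing_elements(xml_text):
    """Return a list of complaints about absent required child elements."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        for child in REQUIRED_CHILDREN.get(elem.tag, []):
            if elem.find(child) is None:
                problems.append(f"{elem.tag} is missing {child}")
    return problems


if __name__ == "__main__":
    doc = (
        "<bibliography><book>"
        "<author>A. Author</author><title>Example Title</title>"
        "</book></bibliography>"
    )
    print(missing_elements(doc))
```

A validating parser does this work from the published DTD itself, so the receiving program does not have to carry its own copy of the rules.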

If, at some later time, it becomes necessary to add elements, a revised form of the DTD can be published. As long as the old DTD is still available, documents that use it can still be interpreted unambiguously. This gives us a chance to write robust systems of programs that will be able to exchange information for years to come.

XML and its related standards are language-independent; support software is thus available for Java, Perl, Python, Tcl, etc. The programming models for the support software vary wildly, including procedural, event-oriented, and object-oriented interfaces. Nonetheless, it is clear that a great deal of support software is on the way, and much of it will be very good indeed.
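
To give a feel for how different these models are, here is a sketch that reads the same small document twice using Python's standard library: once with the event-oriented xml.sax interface and once with the tree-oriented xml.etree.ElementTree interface. The document layout, the SY entry, and the function and class names are made up for the example.

```python
import xml.sax
import xml.etree.ElementTree as ET

DOC = (
    b"<divisions>"
    b"<division code='AP'>Applications</division>"
    b"<division code='SY'>Systems</division>"
    b"</divisions>"
)


class DivisionHandler(xml.sax.ContentHandler):
    """Collect division names as the parser reports start/end events."""

    def __init__(self):
        super().__init__()
        self.divisions = {}
        self._code = None
        self._text = []

    def startElement(self, name, attrs):
        if name == "division":
            self._code = attrs["code"]
            self._text = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._code is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "division":
            self.divisions[self._code] = "".join(self._text)
            self._code = None


def divisions_by_events(data):
    """Event-oriented: the parser calls back into our handler."""
    handler = DivisionHandler()
    xml.sax.parseString(data, handler)
    return handler.divisions


def divisions_by_tree(data):
    """Tree-oriented: build the whole document, then walk it."""
    root = ET.fromstring(data)
    return {d.get("code"): d.text for d in root.iter("division")}


if __name__ == "__main__":
    print(divisions_by_events(DOC))
    print(divisions_by_tree(DOC))
```

The event version needs more bookkeeping but never holds the whole document in memory; the tree version is shorter but loads everything before it starts.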

The principal impact of HTML and the Internet has been to facilitate "many to many" publishing. Unlike telephones or traditional mass media, the Internet allows Joe Six-pack to reach a mass audience (happily refuting A.J. Liebling's comment that "freedom of the press is guaranteed only to those who own one").

Unfortunately for our purposes, HTML pages are designed (in theory) to look nice and convey information to human viewers. In addition, their format changes whenever the Webmaster gets a new idea. Consequently, most Web pages are not well-suited for computer programs to parse.

The impact of XML will thus be the facilitation of "many to many" data exchange among computers. As organizations define DTDs and publish XML documents, more and more information will be cleanly available for use by virtually any computer program. I can't predict how this will all turn out, but it should be pretty interesting!

Resources

"A lazy afternoon," Rich Morin (SunWorld, March 1999):
http://www.sunworld.com/swol-03-1999/swol-03-silicon.html
Adobe Systems:
http://www.adobe.com
BoulderIO:
http://stein.cshl.org/software/boulder/
Java:
http://java.sun.com
Mac OS X:
http://www.apple.com/macosx
Perl:
http://www.perl.com
Sun:
http://www.sun.com
XML:
http://www.xml.com
"XML and the IT Architect," Jonathan Rich (IT Architect, SunWorld, June 1999):
http://www.sunworld.com/swol-06-1999/swol-06-itarchitect.html
"XML: The future of EDI," Uche Ogbuji (SunWorld, February 1999):
http://www.sunworld.com/sunworldonline/swol-02-1999/swol-02-xmledi.html

About the author
Rich Morin operates Prime Time Freeware (www.ptf.com), a publisher of books about open source software. He lives in San Bruno, CA, on the San Francisco peninsula.

URL: http://www.sunworld.com/swol-08-1999/swol-08-silicon.html