Open Text Processing
I recently talked about my frustration with Microsoft Office file formats when used as a standard document type. Today I'd like to talk about alternative document format types that do not discriminate between platforms. By using these ``open'' types of document formats, people in your organization do not have to use MS Office, but some may if they wish. In other words, your choice of a platform does not depend on your choice of office software.
SGML, XML, TeX, and LaTeX. What a collection of consonants and vowels. Let me try and (over)simplify this stuff for you, because the details will take some study, and the tools are changing rapidly.
SGML (Standard Generalized Markup Language) is an international standard (the standard, actually) for describing the structure of documents of all kinds. XML (Extensible Markup Language) is a subset of the full SGML standard designed for fast prototyping of SGML implementations. It provides an easier transition for experienced HTML developers than developing a full SGML implementation. XML is SGML without many of the features that allow developers to build highly complex implementations of SGML. You could call XML a sort of SGML Lite.
TeX (pronounced ``tek'') is a document formatting system. LaTeX (pronounced ``lah-tek'' or ``lay-tek'') is a set of macros for TeX that makes document preparation easier for TeX composers. Instead of setting up numerous TeX primitives, you simply call LaTeX macros that do the dirty work for you.
Formatting Versus Structuring. This is a very important distinction. Format deals with the physical appearance of document output. Structure deals with the electronic archival, storage, and processing of documents capable of many different output formats. Please be clear on this distinction.
Recent implementations of HTML have bastardized the notion of structure being separate from format. Originally, HTML was supposed to preserve the structure of a document for transmission over the Internet. But various browser developers created proprietary and non-standard elements that worked only in their own browsers. Netscape and Microsoft introduced various format elements and attributes that were designed to make web pages ``look better'' at the expense of document compatibility. HTML was never originally intended to be a formatting language. It was intended to be a very simplified markup language for preserving document structure. That's all.
If you tie a document to a single type of formatted output, say to printed paper, then you don't have a transportable document anymore. But if you keep two separate descriptions of a document, one for its structure and another for its format, both of which are open and transportable, then your document is fully usable by anyone on any machine. Several methods of achieving this ``best of both worlds'' breakthrough exist.
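To make the distinction concrete, here is a minimal sketch in Python. It is only an illustration, not one of the tools discussed in this article: the dictionary layout and the function names to_html and to_text are made up for the example. The structure is stored once, and each output format comes from its own small formatter.

```python
# Hypothetical illustration: one structured document, two formatters.
from html import escape

# Structure only: what the parts are, not how they should look.
document = {
    "title": "Installing the Software",
    "paragraphs": [
        "Insert the CD-ROM.",
        "Run the setup program & follow the prompts.",
    ],
}


def to_html(doc):
    """Format the structure as HTML."""
    lines = ["<h1>" + escape(doc["title"]) + "</h1>"]
    for para in doc["paragraphs"]:
        lines.append("<p>" + escape(para) + "</p>")
    return "\n".join(lines)


def to_text(doc):
    """Format the same structure as plain ASCII text."""
    title = doc["title"]
    lines = [title, "=" * len(title), ""]
    for para in doc["paragraphs"]:
        lines.append(para)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(to_html(document))
    print(to_text(document))
```

Adding a third output format means writing a third formatter. The stored document itself never changes.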
Open Document Formatting Mechanisms. I've mentioned TeX and LaTeX formatting systems. There are others. Perhaps the most important mechanism is the DSSSL (Document Style Semantics and Specification Language). This is a higher level language designed as an open and transportable, easy-to-implement formatting language. Using it, you can define formatting style semantics for a document that can be output to, for example, RTF, HTML, PostScript, TeX, ASCII text, Adobe PDF, and any other format you'd care to build an implementation for.
XSL (Extensible Style Language) is a style language that was designed especially for use with XML. It's a flexible language, and I think there are converters to translate DSSSLized documents into XSL.
Jade is James Clark's implementation of DSSSL (www.jclark.com). Using Jade, you can output a DSSSLized document into various output formats. Jade is freely available software that produces TeX, HTML, RTF, PostScript, and Text formats.
A FOSI (Formatting Output Specification Instance) is another (older) way of specifying output format. FOSIs are still used in various government projects, but these will mostly be giving way to the newer DSSSL and XSL/XML implementations.
There are many other formatting mechanisms, but these are some of the most widely implemented. Tools for each of these mechanisms are widely available from various sources. Usually, these tools are either commercial and quite expensive, or they are publicly available from your friendly neighborhood ftp site and are released under something like the GNU public license.
An Example. Let's start with a common document type: software manuals. Your main job is to represent your document structure accurately with a DTD (document type definition). You might choose the DocBook DTD (http://www.ora.com/davenport/), which was designed with software manuals in mind, or you might wish to choose another DTD, or you may even wish to create your own DTD. I would recommend choosing a DTD that already has plenty of support available for it. You'll be thankful that you did when it comes time to be productive.
Next, you'll need to decide what output formats you want to make available. Presumably, you'll want to store and archive your electronic software manuals as DocBook instances. So if you want to print them, burn them to CD-ROM, convert them to MS Word documents, or publish them as webpages, you'll need stylesheets and conversion mechanisms for each of these.
You'll need a DTD (we'll use DocBook), a DSSSL stylesheet for each output format (use RTF for now), and a DSSSL processor, such as James Clark's Jade (http://www.jclark.com).
Once your software manuals are marked up as DocBook document instances, and you have created your db-rtf.dsl stylesheet (or you can use an existing one--see http://nwalsh.com for ideas), then you basically have all you need to start generating RTF documents. You can write another stylesheet for HTML or LaTeX or PostScript (or use existing stylesheets) so you can output to these formats. Meanwhile, you haven't had to modify your DTD or your numerous software manuals' markup. (You can certainly, and probably should, update the data content in those manuals to reflect the appropriate software version, however.)
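As a rough sketch of that step, the Python below assembles a Jade command line for one manual. The file name manual.sgm and the helper jade_command are hypothetical. The sketch also assumes that Jade is installed and that your SGML catalog is set up so Jade can find the DocBook DTD. The options used are Jade's -t (output backend), -d (DSSSL stylesheet), and -o (output file).

```python
# Sketch: assemble (and optionally run) a Jade command line.
# File names here are placeholders, not files shipped with any package.
import os
import shutil
import subprocess


def jade_command(source, stylesheet, backend="rtf", output=None):
    """Return the argument list for one Jade run.

    -t selects the output backend, -d names the DSSSL stylesheet,
    and -o names the output file.
    """
    if output is None:
        # Default: same base name as the source, extension from the backend.
        output = os.path.splitext(source)[0] + "." + backend
    return ["jade", "-t", backend, "-d", stylesheet, "-o", output, source]


if __name__ == "__main__":
    cmd = jade_command("manual.sgm", "db-rtf.dsl")
    print(" ".join(cmd))
    # Only run Jade if it is installed and the source file really exists.
    if shutil.which("jade") and os.path.exists("manual.sgm"):
        subprocess.run(cmd, check=True)
```

Switching to another output format only changes the backend and stylesheet arguments. The source file stays the same.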
Now, you can generate your documents on the fly from the SGML source, if you wish. Let's say Jose in Costa Rica needs a chapter from one of your software manuals. He can pick his preferred output format, say PostScript or RTF, by selecting a box on a webpage, then clicking ``GO'' on a form. Suppose he picks RTF. Behind the scenes on your webserver, Jade is busy generating an RTF document that he can transfer to his machine in Costa Rica. In a flash it's done, and he can begin the transfer.
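Behind such a form, the server has to turn the reader's choice into a conversion job. The sketch below shows one way that might look in Python. Everything named in it is an assumption for illustration: the table OUTPUT_FORMATS, the function plan_conversion, the db-tex.dsl stylesheet, and the chapter file naming. It does not run Jade. It only checks the request against a fixed list and returns the command a server could run.

```python
# Sketch: turn a web form's choices into a planned Jade run.
# All names below are hypothetical and exist only for this example.
import re

OUTPUT_FORMATS = {
    # form value: (Jade backend, DSSSL stylesheet, MIME type for the reply)
    "rtf": ("rtf", "db-rtf.dsl", "application/rtf"),
    "tex": ("tex", "db-tex.dsl", "application/x-tex"),
}


def plan_conversion(choice, chapter):
    """Validate the form input and describe the conversion to run."""
    if choice not in OUTPUT_FORMATS:
        raise ValueError("unsupported output format: " + repr(choice))
    # Accept only plain chapter names so a request cannot name other files.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", chapter):
        raise ValueError("bad chapter name: " + repr(chapter))
    backend, stylesheet, mime = OUTPUT_FORMATS[choice]
    output = chapter + "." + backend
    command = ["jade", "-t", backend, "-d", stylesheet,
               "-o", output, chapter + ".sgm"]
    return {"command": command, "output": output, "mime": mime}


if __name__ == "__main__":
    plan = plan_conversion("rtf", "chapter3")
    print(" ".join(plan["command"]))
    print(plan["mime"])
```

Keeping the permitted formats in one table means that supporting a new output format is a one-line change once its stylesheet exists.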
When he reads it, he sees an error. He makes a change to his version of the document, saves it as a DocBook SGML document instance, and sends the chapter back to the publications department with a note to correct the error. (SGML documents are easy to email, since they're straight text.) That afternoon, the software manual's maintainer, Anne, looks at the correction in the SGML document instance, converts it into RTF with Jade, views it in Word (or whatever), and sees that Jose's correction is OK, but decides to clarify (wordsmith) Jose's correction. Anne saves the changes as a DocBook SGML document instance, updates the website, and emails a copy of the updated SGML source for the chapter to Jose.
Jose just happens to have his own copy of Jade, the DocBook DTD, and the db-rtf.dsl stylesheet. He creates his own RTF output version of the chapter with the correction.
Epilogue. Over time, these procedures become easy and commonplace. You don't think twice about them, and usually, you don't have to. That's because Microsoft doesn't control your basis for document formats (just RTF, maybe--and even that might still be compatible with new versions of RTF). So while the world around you changes constantly, you at least have a stable document format to archive your documents in. If you want to change or modify your DTD later, you can at least transform the document instances dynamically, as they are called from your website, and update the instances as required.
These are the benefits of Open Text Processing. You'll never find this sort of flexibility with a Microsoft-only solution. You can still use Microsoft products if you want, but you won't be at the mercy of their ever-changing marketing choices.
Some Resources. Here are some resources for you to find more information about DSSSL, SGML, XML, and DocBook: