by lrosenth

Created

October 22, 2007

Archiving is a rather loaded word, since it can mean widely varying activities. In many situations, archiving PDF files is a very good solution. In fact, it was so attractive to some US Government agencies that they encouraged their personnel to work on an ISO committee/working group to define a special subset of PDF called PDF/A that meets their needs better than plain old PDF might.

Generally, PDF/A restricts the use of PDF features that might make the visual presentation unreliable, such as content that changes according to the day on which it is viewed. (Yes, you can do things like that in unconstrained PDF!) PDF/A also requires the use of certain features in PDF, such as including structural information that allows the content to be read more effectively by people who are blind and to be extracted for reuse. The point of this blog is not to teach you the technology of PDF/A; you can learn about PDF/A here and here.

I wanted to talk more about a couple of alternatives that come up, namely ODF and OOXML. I sometimes wonder if I am losing it, because I have such a hard time understanding some other people’s (wrong) viewpoints. Both ODF and OOXML are editable document formats; in fact, they are defined as companions to editing programs of the kind that used to be called word processing software. If you want to change the document, you can read one of these files into one of these editors and change away. This is a great virtue, if that is what you want to do. But would Silver Bow County, Montana want me to be able to edit my birth certificate? I think not. And yes, that is my real birth place, Butte, Montana (in Silver Bow County). Do we want our court records to be editable? I think not. So why would I want to choose one of the editable document formats for such records? And this is not a straw man argument, because people are repeatedly saying that ODF and OOXML are ideal archiving formats for such material.

Now to be fair, they believe that these will be great archiving formats primarily because they are “XML.” I have already posted a blog about XML Documents where I point out that although these document formats do contain XML subfiles, they are in truth ZIP Archives. Nonetheless, much of the information you might want to reuse or edit in these archives is in an XML markup language of one kind or another. I also have a previous blog about why it isn’t very helpful to claim something is XML. It is much more informative to say it is XML for [something]. And of course, if some XML is for [something] it might not be suitable for [something else].
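To make the "ZIP Archive with XML subfiles" point concrete, here is a minimal Python sketch. It builds a toy package in memory rather than opening a real ODF or OOXML file, and the member names (`content.xml`, `media/image1.bin`) and helper functions are made up for illustration; real packages have their own, more elaborate layouts.

```python
import io
import zipfile


def build_toy_package() -> bytes:
    """Build a tiny ZIP package in memory (hypothetical layout, illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # One XML subfile holding the text...
        archive.writestr("content.xml", "<doc><p>Hello archive</p></doc>")
        # ...and one binary subfile standing in for an image or font.
        archive.writestr("media/image1.bin", bytes(range(16)))
    return buffer.getvalue()


def list_members(package: bytes) -> list[str]:
    """Return the names of the subfiles stored in the package."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.namelist()


def read_member(package: bytes, name: str) -> bytes:
    """Return the decompressed bytes of one subfile."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.read(name)


package = build_toy_package()
print(list_members(package))
print(read_member(package, "content.xml").decode("utf-8"))
```

The file on disk is the ZIP container; the XML only appears once a ZIP reader has unpacked the right member.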

Of course, I believe that PDF/A is the current best choice for archiving “documents of record.”  PDF was originally designed to capture print output from any application in such a way that it could be viewed and navigated efficiently as well as easily printed to paper. Adobe has added features to support ways of “editing” PDF files and those features are highly desired by many, but PDF/A does a good job of preserving that original PDF design for capturing documents that might otherwise be printed. Here is a diagram that I made to emphasize the differences between PDF and the other two document formats.

The upper left represents those files using OOXML and the upper right those using ODF. These two formats, as well as this discussion, cover three authoring paradigms: documents, spreadsheets and projector presentations. PDF can uniformly capture any of these types of documents. ODF and OOXML are what Adobe has traditionally called “native file formats” for their respective editing applications. If I want to go to lunch in the middle of an authoring activity, I save my work in a native file format. The above diagram also shows the difference between editing and reuse. The dotted line going from the PDF file back to the editing applications indicates that the content of PDF files can be reused. When considering our archiving needs we also need to distinguish between reuse and editing; reuse might be important and editing not. PDF files do support content reuse.

Back to archiving. As I said at the start, archiving documents can mean many things. I often create documents that I might want to recreate and edit many years after I did the original work and it is great to be able to find the old files to update. Sometimes I cannot find the files and sometimes I no longer have the correct editor or the correct version of the editor to read them. But the claim is that these new “XML-friendly” formats will make that better. I am not sure exactly why this belief is held strongly by some, but I guess if you make the formats ISO standards, the software developers may be more careful to read old files with new programs.

Adobe has gone to great lengths to make sure that PDF is upward compatible and, with few exceptions, any proper PDF file that exists can be viewed with Adobe Reader. We plan to continue to do this.

So there might be archiving objectives that call for using PDF and there might be archiving objectives that call for using ODF or OOXML. Somehow I cannot help but believe that the number of archiving needs for non-editable formats far exceeds that for editable forms. After all, archiving means preserving things, and usually you want to preserve the total look of a document. PDF/A does that.

Before I conclude, I want to say just a bit more about XML. There is what I think is a rather technically shallow belief that XML files are easier to work with and will survive the passing of time, even great periods of time, better than other formats. The text held within XML files can usually be viewed with any generic text editor and I guess that gives people a warm feeling that it will therefore also be easier to retrieve with a program. Fair enough. But what is glossed over way too much is that that text is enveloped within XML for [something]. (See my earlier blog entry.) The envelopes (schemas) offered by ODF and OOXML are different. Different enough that the same simple program cannot extract just the raw text from both. And is that all I really want from a document in the future, the raw text? Because when you get to the layout and the images and the color space definitions and the fonts, these things do not lend themselves well to XML and are often stored within the ZIP archives as binary data. So tell me again where the advantage to XML is for this purpose?
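Here is a small Python sketch of what "different envelopes" means in practice. The two namespace URIs are the ones used for ODF text and for OOXML word processing markup; the fragments themselves are toy examples I made up, far simpler than real document content, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for ODF text elements and OOXML WordprocessingML elements.
ODF_TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OOXML_W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Toy fragments (assumed for illustration; real files carry much more markup).
ODF_FRAGMENT = (
    f'<body xmlns:text="{ODF_TEXT_NS}"><text:p>Hello archive</text:p></body>'
)
OOXML_FRAGMENT = (
    f'<w:body xmlns:w="{OOXML_W_NS}">'
    "<w:p><w:r><w:t>Hello archive</w:t></w:r></w:p>"
    "</w:body>"
)


def odf_paragraphs(xml_text: str) -> list[str]:
    """Collect the text of every ODF <text:p> paragraph."""
    root = ET.fromstring(xml_text)
    return ["".join(p.itertext()) for p in root.iter(f"{{{ODF_TEXT_NS}}}p")]


def ooxml_paragraphs(xml_text: str) -> list[str]:
    """Collect the text of every OOXML <w:p> paragraph from its <w:t> runs."""
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in p.iter(f"{{{OOXML_W_NS}}}t"))
        for p in root.iter(f"{{{OOXML_W_NS}}}p")
    ]


print(odf_paragraphs(ODF_FRAGMENT))
print(ooxml_paragraphs(OOXML_FRAGMENT))
print(odf_paragraphs(OOXML_FRAGMENT))
```

Each extractor has to know the element names and namespace of its own format, so the ODF extractor finds nothing in the OOXML fragment and vice versa. And this only recovers raw text; it says nothing about layout, images or fonts.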

Now I will give two more hits against the myth that ODF and OOXML will be easier to process at some distant date because much of the content is XML. First is the fact that the XML is Flate compressed in both formats, making it impossible to read in a generic editor until after it has been de-Flated. Second, do we really believe that 50 or 100 years from now our software and computer technology will still be in a state where anyone will give a hoot whether something is XML or not? Will there not be something far superior invented by then?
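The Flate point can be sketched in a few lines of Python using the standard `zlib` module. The XML string is a made-up sample, and the helper names `deflate_raw` and `inflate_raw` are mine; the negative window-bits value asks zlib for a raw Deflate stream with no header, which is the form ZIP archives store.

```python
import zlib


def deflate_raw(data: bytes) -> bytes:
    """Compress to a raw Deflate stream (wbits=-15 means no zlib header)."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate_raw(data: bytes) -> bytes:
    """Reverse deflate_raw and return the original bytes."""
    return zlib.decompress(data, -15)


# Made-up sample content standing in for an XML subfile.
xml_text = ("<doc>" + "<p>Hello archive</p>" * 50 + "</doc>").encode("utf-8")

stored = deflate_raw(xml_text)
print(len(xml_text), len(stored))  # sizes before and after compression
print(stored[:16])                 # the bytes a generic editor would be shown
print(inflate_raw(stored)[:25])    # readable again only after inflating
```

What sits in the archive is the `stored` bytes, not the angle brackets, so the "just open it in a text editor" comfort only applies after a decompression step.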

By the time you worry about these issues you will find that processing PDF files will be as easy as, or maybe even easier than, processing those XML-friendly formats.

Contact me at jking@adobe.com