by lrosenth

Created

October 22, 2007

Archiving is a rather loaded word, since it can mean widely varying activities. In many situations, archiving PDF files is a very good solution. In fact, it was so attractive to some US Government agencies that they encouraged their personnel to work on an ISO committee/working group to define a special subset of PDF, called PDF/A, that meets their needs better than plain old PDF might.

Generally, PDF/A restricts the use of PDF features that might make the visual presentation unreliable, such as content that changes according to the day on which it is viewed. (Yes, you can do things like that in unconstrained PDF!) PDF/A also requires the use of certain PDF features, such as structural information that allows the content to be read aloud more effectively to people who are blind and to be extracted for reuse. The point of this blog is not to teach you the technology of PDF/A; you can learn about PDF/A here and here.

I wanted to talk more about a couple of alternatives that come up, namely ODF and OOXML. I sometimes wonder if I am losing it, because I have such a hard time understanding some other people’s (wrong) viewpoints. Both ODF and OOXML are editable document formats; in fact, they are defined as companions to editing programs of the kind that used to be called word processing software. If you want to change the document, you can read one of these files into one of these editors and change away. This is a great virtue, if that is what you want to do. But would Silver Bow County, Montana want me to be able to edit my birth certificate? I think not. And yes, that is my real birth place, Butte, Montana (in Silver Bow County). Do we want our court records to be editable? I think not. So why would I want to choose one of the editable document formats for such records? And this is not a straw man argument, because people are repeatedly saying that ODF and OOXML are ideal archiving formats for such material.

Now to be fair, they base their belief that these will be great archiving formats primarily on the fact that they are “XML.” I have already posted a blog about XML Documents where I point out that although these document formats do contain XML subfiles, they are in truth ZIP archives. Nonetheless, to be fair, much of the information you might want to reuse or edit in these archives is in an XML markup language of one kind or another. I also have a previous blog about why it isn’t very helpful to claim something is XML. It is much more informative to say it is XML for [something]. And of course, if some XML is for [something] it might not be suitable for [something else].
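To make the "ZIP archive with XML subfiles" point concrete, here is a minimal Python sketch using only the standard library. It builds a tiny stand-in package in memory and lists what is inside. The stand-in is not a valid ODF document: the member names `mimetype`, `content.xml` and `Pictures/` follow the ODF package layout, but the contents are placeholders, and the helper names `build_stand_in_package` and `list_package_members` are hypothetical.

```python
import io
import zipfile


def build_stand_in_package() -> bytes:
    """Build a tiny ZIP that mimics the layout of an ODF package.

    This is NOT a valid ODF document; the member contents are placeholders.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", "<doc><p>Birth certificate</p></doc>")
        # Images and similar resources travel as binary members, not as XML.
        package.writestr("Pictures/seal.png", b"\x89PNG placeholder bytes")
    return buffer.getvalue()


def list_package_members(data: bytes) -> list[str]:
    """Return the member names of a ZIP-based document package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()


package_bytes = build_stand_in_package()
# The container is recognised as a ZIP archive, not as an XML file.
print(zipfile.is_zipfile(io.BytesIO(package_bytes)))
print(list_package_members(package_bytes))
```

Pointing `list_package_members` at the bytes of a real ODF or OOXML file should give a similar picture: a handful of XML members alongside binary resources, all inside one ZIP container.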

Of course, I believe that PDF/A is the current best choice for archiving “documents of record.”  PDF was originally designed to capture print output from any application in such a way that it could be viewed and navigated efficiently as well as easily printed to paper. Adobe has added features to support ways of “editing” PDF files and those features are highly desired by many, but PDF/A does a good job of preserving that original PDF design for capturing documents that might otherwise be printed. Here is a diagram that I made to emphasize the differences between PDF and the other two document formats.

The upper left represents those files using OOXML and the upper right those using ODF. These two formats, as well as this discussion, cover three authoring paradigms: documents, spreadsheets and projector presentations. PDF can uniformly capture any of these types of documents. ODF and OOXML are what Adobe has traditionally called “native file formats” for their respective editing applications. If I want to go to lunch in the middle of an authoring activity, I save my work in a native file format. The above diagram also shows the difference between editing and reuse. The dotted line going from the PDF file back to the editing applications indicates that the content of PDF files can be reused. When considering our archiving needs we also need to distinguish between reuse and editing; reuse might be important and editing not. PDF files do support content reuse.

Back to archiving. As I said at the start, archiving documents can mean many things. I often create documents that I might want to recreate and edit many years after I did the original work and it is great to be able to find the old files to update. Sometimes I cannot find the files and sometimes I no longer have the correct editor or the correct version of the editor to read them. But the claim is that these new “XML-friendly” formats will make that better. I am not sure exactly why this belief is held strongly by some, but I guess if you make the formats ISO standards, the software developers may be more careful to read old files with new programs.

Adobe has gone to great lengths to make sure that PDF is upward compatible and with few exceptions any proper PDF file that exists can be viewed with Adobe Reader. We plan to continue to do this.

So there might be archiving objectives that call for using PDF and there might be archiving objectives that call for using ODF or OOXML. Somehow I cannot help but believe that the number of archiving needs for non-editable formats far exceeds that for editable forms. After all, archiving means preserving things, and usually you want to preserve the total look of a document. PDF/A does that.

Before I conclude, I want to say just a bit more about XML. There is what I think is a rather technically shallow belief that XML files are easier to work with and will survive the passing of time, even great periods of time, better than other formats. The text held within XML files can usually be viewed with any generic text editor, and I guess that gives people a warm feeling that it will therefore also be easier to retrieve with a program. Fair enough. But what is glossed over way too much is that the text is enveloped within XML for [something]. (See my earlier blog entry.) The envelopes (schemas) offered by ODF and OOXML are different. Different enough that a simple program cannot extract just the raw text from either. And is that all I really want from a document in the future, the raw text? Because when you get to the layout and the images and the color space definitions and the fonts, these things do not lend themselves well to XML and are often stored within the ZIP archives as binary data. So tell me again where the advantage to XML is for this purpose?

Now I will give two more hits against the myth that ODF and OOXML will be easier to process at some distant date because much of the content is XML. First is the fact that the XML is Flate compressed in both formats, making it impossible to read in a generic editor until after it has been de-Flated. Second, do we really believe that 50 or 100 years from now our software and computer technology will still be in a state where anyone will give a hoot whether something is XML or not? Will there not be something far superior invented by then?
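The Flate point can be illustrated with a small sketch, again in Python with only the standard library. It writes a placeholder XML member into a ZIP using deflate compression, then checks whether a phrase from that XML is visible in the raw archive bytes (what a generic editor would show) and in the de-Flated member. The XML here is made up for illustration and is not real ODF or OOXML markup; `make_deflated_package` and `read_member` are hypothetical helper names.

```python
import io
import zipfile

# Placeholder XML, repeated so that there is something to compress.
XML_TEXT = "<doc>" + "<p>Born in Silver Bow County</p>" * 20 + "</doc>"
MARKER = b"Silver Bow County"


def make_deflated_package(xml_text: str) -> bytes:
    """Store xml_text as a deflate-compressed member of an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        package.writestr("content.xml", xml_text)
    return buffer.getvalue()


def read_member(data: bytes, name: str) -> bytes:
    """De-Flate one member of the package and return its bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.read(name)


raw = make_deflated_package(XML_TEXT)
# Is the phrase visible in the raw bytes, as a generic editor would see them?
print(MARKER in raw)
# Is it visible once the member has been de-Flated?
print(MARKER in read_member(raw, "content.xml"))
```

The sketch only shows that the readable XML sits behind a decompression step; it says nothing about whether tools for that step will still be around in 50 or 100 years.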

By the time you worry about these issues, you will find that processing PDF files will be as easy as, or maybe even easier than, processing those XML-friendly versions.

Contact me at jking@adobe.com


COMMENTS

  • By Mike - 3:31 PM on October 22, 2007  

    Hi – keep on trucking, this series makes interesting reading! Thanks! [Thank you. J.King]

  • By Gary McGath - 7:06 AM on November 26, 2007  

    I came upon this post after my earlier comment mentioning PDF/A. Obviously we aren’t in complete disagreement about the value of profiles after all. Guessing what formats will be most useful in 50 years or more is a tricky job. One issue with the XML formats is that ZIP compression has never itself been defined as a formal standard. On the other hand, if you were stuck on a desert island with a decompressed ODF file and a computer, that might be easier to figure out from scratch than a PDF/A file.

  • By Chris Rusbridge - 7:02 AM on November 27, 2007  

    Hi, I am confused by the missing words in the last sentences of para 4: “It is much more informative to say it is XML for . And of course, if some XML is for it might not be suitable for .”

    [Chris: The blank areas are supposed to be ellipses (…). Sorry if my explanation was not thorough. If you look at the slide presentations at http://home.comcast.net/~jk05/presentations that is about XML and PDF you will get the whole story. But for the short answer, we should always talk about what particular schema or use the XML satisfies. For example, XML for business cards, or XML for vector graphics, or XML for news stories. XML by itself isn’t saying much. – Jim King]

  • By Raj - 1:12 AM on April 2, 2008  

    Archiving – Summary

    Archive means a collection of historical records especially for Publication, Banking, Insurance, Legal, Education and Government Organization. It also refers to the location where these records are kept in papers, letters, files, diaries or any other documentary materials created by the Individual.

    Archives were developed by ancient Chinese, Romans and Greek and the French possess the largest archival collection in the World with records of A.D. 625.

    With the tremendous exponential growth in the records led to the creation of paper print, micro film and computer files and for the data preservation process Digital storage in CD Rom was started.

    Digital storage has longer life spans and the access technologies that are changing at an ever increasing pace raised the awareness of the Digital Archiving and Preservation among Publishers, librarians and Corporation.

    Digital Electronic Archiving (DEA) – its purpose and scope

    Digital Archiving is defined as the long term storage for preservation and access to information created in electronic form which includes technical information, data, text, images, audio, video and multimedia.

    Several years ago publishers began to realize that if the archival version of their efforts were stored in such a way that it could be reused and get additional benefits & revenue from the repository. So many of the Large Publishers started creating their own archives based on the provision of their product in electronic form.

    The speed with which electronic Publishing has grown worldwide Publishers and Librarians have become increasingly concerned about the archiving and preservation of digital information.

    It gives access rights for the subscribers to down load any history information because of the availability of physical copies in the archived.

    It provides a back – up and may be used to spread across multiple geographic locations.

    The Archive is not even used for online searching service but it is archived to ensure that its content is not altered.

    Costs/Resources – It was the most difficult aspect as in most of the respondents indicated that they just didn’t know how much the archive was costing or would cost in the future.

    For Publishers and producers the cost of archiving is still tied up there is no indication of hardware/software costs now or in the future.

    Source – Pressmart Media Limited – offers digitization of historic archives stored on print, microfilm or microfiche media enables reliable preservation and cost-effective storage and can also be electronically repurposed for internal research or online distribution.

    [Another source of great information are the proceedings of the IS&T Archiving Conferences found at http://www.imaging.org/store/physpub.cfm?seriesid=28&pubtypeid=2 I will be giving a paper there in June 2008. – Jim King]

  • By Bruce - 8:30 PM on April 25, 2011  

    I have boxes of files that take up huge amounts of space in my tiny condo. I had the bright idea to convert them into digital files for archiving to free up space in my condo. My friend, who owns a printing service said she could do it for me and that it would be saved as pdf files. I thought I’d do some research before having her do the work and came across your post.

    Thank you for writing this. It confirms that I’m making the right decision and that pdf is the right format for me.