PDF/A and the “openness” of PDF

John Carroll of Microsoft has written an interesting blog item on PDF vs. Office XML that deserves careful reading. In it he criticizes the government of Massachusetts for approving the use of PDF 1.5 but not Office XML as a document format. I posted a brief reply that addresses one issue I’d like to discuss at greater length.

A lot of discussions of “openness” get lost in issues that border on the metaphysical. Here I’m going to try to bring the discussion back to earth by returning to the question: what problem are “openness” requirements trying to solve?

PDF was originally invented to solve one problem, the problem of sharing documents electronically. If my application for a mortgage needs to be reviewed by twelve people before approval, how can we make sure that all twelve people see the same document, and thus make their decisions based on the same information? The old way to do it was by using paper with ink on it.

That this solution works seems obvious at first glance, but how do you know it works? Ink fades over time, and paper gets discolored. Things can look different under different lighting conditions. How do you know I didn’t write an extra zero on some financial figure in a rapidly fading ink, so that one reviewer saw it and the next did not?

The questions get more complicated when photocopiers are involved. In practice most people say that if you review a paper document and I review a photocopy of it, we have reviewed the same document. But how do you know that? What if the document has some colors that the copier can’t pick up? What if the copier is a digital one with software that randomly scrambles words?

By now you probably think I’m being silly, but this is an important point: standardization is all about risk management. Any document workflow has risks of error and uncertainty being introduced – even paper workflows. For most workflows the use of ordinary paper, pens, printers, and photocopiers such as you find in the average office are good enough that no one thinks about the kinds of risks I’ve discussed. But professional archivists do worry about the risks of preserving paper documents. Last year I had the privilege of touring the British National Archives and seeing documents that date back to the 11th century. The archivists there, and at other governmental organizations like NARA, worry a lot about the stability of different kinds of paper and ink under different conditions of humidity, temperature, and so on.

The point here is that the kinds of risks you need to manage vary with your workflow requirements, but no workflow has zero risk.

Which brings me (finally) to digital documents. Since the 1980s, office workers have been sharing digital documents. What are the risks that exist with digital documents, how serious are they to various workflows, and what steps can be taken to reduce them?

The first types of digital documents people tried to share were native word processor files. They quickly discovered a few major risks. One was that a colleague wouldn’t be able to open a document at all because of incompatible software. Another, perhaps more dangerous, was that the colleague would be able to open the doc, but subtle changes would occur that would break the workflow – for example, text would be reflowed, with the result that a comment about “the first line on page 23” would have an unpredictable reference. And a final one was that one reader of a document would edit it – inadvertently or maliciously – before passing it on to the next person.

Adobe invented PDF and Acrobat primarily to reduce these risks to acceptable levels for office and government work. And it worked! In the 12+ years since Acrobat was introduced, PDF has become a de facto standard for documents because experience has shown that the risks of PDF documents becoming unreadable, or reading unpredictably, are negligible in practice. And I repeat “negligible in practice” – as a technical expert in PDF I am well aware that the risks are not zero. But the big point is, we wouldn’t even be talking about PDF as a government standard if we didn’t have many years of experience showing that it works. And this, I think, is why Office XML didn’t get out of the starting gate with certain governments: at this time it lacks the starting point of a wide enough installed base and widespread user experience to merit consideration.

So why PDF/A? PDF/A exists because for certain applications, practical experience that the format works is not sufficient (though it is necessary). Prime examples are legal applications and archival applications. If a document is going to be used as evidence in a court proceeding, or if there is a statutory requirement that it be preserved for posterity, then a stricter level of scrutiny is required. These applications require that independent technical experts examine the risks associated with the technology and make sure that appropriate steps have been taken to minimize those risks.

One of the risks that particularly concerns archivists is intellectual property risk. This is the risk that the technology required to reproduce the document will not be available because relevant information cannot be found or cannot be used for legal reasons. It is this risk that is addressed by concerns for “openness”, and claims about openness need to be addressed in this context.

How do you assess the risk of a technology becoming unavailable or unreliable because information becomes unavailable or because of intellectual property lawsuits? You can’t eliminate that risk completely, but you can make informed, prudent judgements. A book that is in widespread publication and present in many libraries is much less likely to vanish from the world than an HTML page on the website of a corporation. A technology that has been implemented by numerous independent vendors over a period of many years is much more likely to be free of intellectual property encumbrances than one just released by a single vendor.

The approval of PDF/A by ISO, and its adoption as a standard by government agencies, reflects a collective judgement by a large community of experts that the use of this technology is prudent and responsible risk management. In particular it reflects a judgement that PDF/A is at least as reliable as paper.

We live in the real world. Disk drives can fail, buildings full of paper can burn down, companies can go out of business, and widely used technologies can suddenly become the target of patent lawsuits.

The due diligence of the expert evaluating technology does not consist in a rigorous proof that the technology strictly conforms to some ideal, but in the intelligent weighing of a lot of relevant considerations. PDF is widely implemented, has proved itself trustworthy for over a decade, and has been found to be licensed on “reasonable and non-discriminatory terms” by standards bodies that specialize in making those evaluations. That is why agencies with archival requirements are adopting it.