PDF/A metadata – namespace URIs and prefixes

PDF/A developers and users should be aware of two important corrections that will be made to the metadata specifications for PDF/A.

The PDF/A specification requires that certain metadata be written into conforming PDF/A files. This metadata is in XML, and more specifically uses the Adobe XMP specification. Unfortunately, the published version of PDF/A had some errors and ambiguities in the specification of that metadata. As a result, some tools may produce PDF/A that is not strictly compliant (either with the spec as originally published or as corrected), and some tools may declare files non-compliant when they are in fact compliant.


In XML, a namespace provides “a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references” (W3C). The PDF/A spec refers to several namespaces that are defined by other standards (such as the xmp namespace) and defines a few of its own.

A namespace is defined by a URI, such as http://ns.adobe.com/xap/1.0/. The PDF/A spec identifies these URIs, but in several cases omits the last character, usually a / or # character. But the XMP specification, building on the RDF specification, requires that all namespaces defined within XMP metadata must use one of these termination characters. So what PDF/A specified to be used in XMP is in fact invalid XMP. This will be fixed when the PDF/A spec is corrected, but for now developers who follow the published spec will be writing metadata that is not strictly compliant with PDF/A as it was intended. And PDF/A files produced by software written to the corrected spec could get warnings or errors from validators written to the published spec.

Another thorny area of the spec is namespace prefixes. An XML document that uses a namespace generally uses an abbreviation for the full namespace URI. The abbreviation is called a Namespace Prefix. Here’s an example from the W3C:

Here the namespace URI is http://www.w3.org/TR/REC-html40 and the namespace prefix is html.

In several places where the PDF/A spec discusses the use of a namespace, it specifies the prefix that should be used. For example, the spec defines a PDF/A extension schema and says:

The required schema namespace prefix is pdfaSchema.

So it is clear that a PDF/A document MUST use “pdfaSchema” as its prefix when it uses this namespace. But in many other places the language is less clear. For example:

The XML namespace URI for the dc prefix is <http://purl.org/dc/elements/1.1/>.

In this example, is it required that the document use “dc” as its prefix when it uses this namespace?

I had the pleasure of exchanging a lot of email with some very knowledgeable developers on this subject. Some thought the answer was “yes” and some thought it was “no”. Finally our editor, Stephen Abrams, was able to settle the issue with the recollection that the developers of the spec had intended to introduce required prefixes only for new namespaces introduced for PDF/A and not for existing namespaces incorporated from other standards. This will be made clear in the corrigendum. We may still have disagreements about what the spec should have said, but at least we will all agree on what it does say.

I hope users of PDF/A will not be overly alarmed by these developments. They do not have any bearing on the reliability of PDF/A documents. But they do add to the expense and complexity of software for creating and validating PDF/A. If there is one thing I hope people take away from this adventure, it’s that when we software developers say that this or that requirement will make a standard more difficult to implement, it’s serious business.
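To make the termination-character issue described earlier a little more concrete, here is a minimal Python sketch of the kind of check a tool might run on the namespace URIs it is about to write. The helper name has_xmp_terminator and the list of URIs are my own illustration, not anything taken from the PDF/A spec or its corrigendum; the unterminated entry is simply a real URI with its final character dropped.

```python
# Hypothetical helper: the name and the sample list are illustrative only.
XMP_TERMINATORS = ("/", "#")


def has_xmp_terminator(namespace_uri: str) -> bool:
    """Return True if the URI ends with one of the termination characters."""
    return namespace_uri.endswith(XMP_TERMINATORS)


candidates = [
    "http://ns.adobe.com/xap/1.0/",                 # terminated with "/"
    "http://ns.adobe.com/xap/1.0",                  # same URI, last character omitted
    "http://purl.org/dc/elements/1.1/",             # terminated with "/"
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",  # terminated with "#"
]

# Collect the URIs that would need attention before being written as XMP.
unterminated = [uri for uri in candidates if not has_xmp_terminator(uri)]
print(unterminated)
```

A check like this only catches the missing character. It cannot tell you which termination character a given namespace is supposed to use, so the corrected spec remains the authority on the exact URIs.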

By Macduff Hughes

Created

June 30, 2006
<?xml version="1.0"?>
<!-- all elements here are explicitly in the HTML namespace -->
<html:html xmlns:html='http://www.w3.org/TR/REC-html40'>
  <html:head><html:title>Frobnostication</html:title></html:head>
  <html:body><html:p>Moved to 
    <html:a href='http://frob.com'>here.</html:a></html:p></html:body>
</html:html>
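
The W3C example above binds the prefix html to the URI http://www.w3.org/TR/REC-html40. As a rough illustration of why the prefix question matters, the Python sketch below parses a shortened version of that example twice, once with the html prefix and once with an arbitrary prefix h, using the standard library's xml.etree.ElementTree. The shortened documents and the function name qualified_names are my own assumptions for the purpose of the demonstration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the W3C example, using the "html" prefix.
with_html_prefix = (
    "<html:html xmlns:html='http://www.w3.org/TR/REC-html40'>"
    "<html:head><html:title>Frobnostication</html:title></html:head>"
    "</html:html>"
)

# The same document with a different, arbitrarily chosen prefix "h".
with_other_prefix = (
    "<h:html xmlns:h='http://www.w3.org/TR/REC-html40'>"
    "<h:head><h:title>Frobnostication</h:title></h:head>"
    "</h:html>"
)


def qualified_names(xml_text: str) -> list[str]:
    """Return every element name in document order, in {uri}local form."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


names_html = qualified_names(with_html_prefix)
names_other = qualified_names(with_other_prefix)

# A namespace-aware parser resolves the prefix to the URI, so both match.
print(names_html == names_other)
print(names_html[0])
```

Because the parser discards the prefix once it has resolved the URI, a validator that wants to enforce a required prefix such as pdfaSchema has to inspect the prefix by some other means. This sketch does not attempt that.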