Archive for October, 2009

October 30, 2009

PDFs and Their Content – Part 2

This is Part 2 of the previous blog.

I think PDF forms represent a very powerful and significant tool. Increasingly we want both humans and computers to read documents. However, the requirements for easy reading are considerably different for each.

For a long time the primary use of computers has been "data processing". Business data processing has worked with structured data consisting primarily of numbers and text strings whose meaning and properties are well defined and known a priori. Much of the data semantics is built into the data processing software. In the last decade or two, distribution and sharing of information among humans has moved in to share the primary spot.

Humans, when given numbers and strings, also need a context in which to understand their meaning and significance.

When creating a PDF form, a designer makes a very explicit decision about which information is needed by humans and which is needed by our data processing software. People do document processing and, for the most part, computers do data processing.

The places where fillable blanks occur in a form define the data that is being collected or displayed. The background or "artwork" of the form turns the raw data into a document that provides the context in which the human can understand the data.

Here is a diagram showing how an artwork presentation layer and a data layer come together to make a document from which both the human and the computer can obtain exactly what they need.

PDF forms maintain this separation of layers, and the data layer can be imported into or exported out of the form artwork layer. Humans see the composed document, and the computer can process the data alone with traditional data processing software. Either the computer or the human can supply the contents of the data layer, whether for presenting or for gathering data.
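To make the layering concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how Acrobat or the PDF format actually stores forms: the class name FormSketch, its methods, and the invoice text are all hypothetical, and the "artwork" is just a text template with named blanks.

```python
from dataclasses import dataclass, field


@dataclass
class FormSketch:
    """Hypothetical model: fixed artwork plus a replaceable data layer."""

    artwork: str                              # human-facing text with {field} blanks
    data: dict = field(default_factory=dict)  # machine-facing field values

    def import_data(self, values: dict) -> None:
        """Load a data layer supplied by a person or by a program."""
        self.data.update(values)

    def export_data(self) -> dict:
        """Hand back only the data, with no artwork attached."""
        return dict(self.data)

    def compose(self) -> str:
        """Merge both layers into the document a human reads."""
        return self.artwork.format(**self.data)


form = FormSketch(artwork="Invoice for {customer}\nAmount due: {amount} USD")
form.import_data({"customer": "A. Reader", "amount": "120.00"})

document = form.compose()      # what the human sees
exported = form.export_data()  # what the data processing software sees
```

The human reads `document`, while the software only ever touches `exported`; neither needs to know how the other layer is represented.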

So, I think forms offer a very clever way for computers and humans each to see the part of a document that is most suitable and necessary for them to process.

Jim King (mailto:jking@adobe.com)

 

1:21 PM

PDFs and Their Content – Part 1

Sound bites (or is it bytes)

I think most technical people share a problem that I have: we have extreme difficulty in expressing ourselves in one simple sentence.  I have this problem when responding to questions/issues about PDF. For example, I have a hard time responding to this inaccurate statement in a short sound bite:

"PDF is great because it is not editable and freezes the content."

Technically that statement is totally inaccurate but there are related statements that are true.  For example:

"PDF is great because it not only captures my content but also allows me to choose and lock down the look and feel of my content."   or

"PDF is great because I can apply a document signature to the file after I create it and then people can detect if it has been tampered with between me and them."

And here is one that I encountered from my financial advisor: "I had always sent my customers paper spreadsheets in the mail because I didn’t want them to have my spreadsheet electronic files that have my intellectual content as far as the calculations and macros. Once I could make PDF files from my spreadsheets I can send them electronically and not worry."  Am I to conclude that his primary value to me was in his spreadsheets?

Editability and Reuse

But to get at the issue of reuse and editability versus frozen content, I have to use quite a few sentences: in fact, the remainder of this blog and the following blog.

The first issue we have to get straight is whether something is a function of the PDF file format or of the software that processes it. If people are concerned about the PDF file format, then they need to join the ISO committee that is now managing PDF as the ISO 32000 standard. Many of my previous blogs record the process of moving the ownership of PDF from Adobe to ISO, which was completed in January 2008.

But most of the reuse issues are a property of the software, not of the PDF file format. So if someone doesn’t like the behavior of their current software, they might consider looking for other software and/or convincing someone to provide software with the needed function.

But just for example, there are degrees of reuse that have been incorporated into Adobe’s Acrobat viewer, including the following:

•  Copy/Paste.  If the author permits it, I can copy content from a PDF and paste it into other files. Adobe has spent a great deal of time and effort to make this work as well as it does, especially given the complexity of dealing with text.  Please see my previous blog entry about text in PDF.

•  Export. Acrobat supports exporting PDF content into various formats including .rtf, .doc, .html, .eps, .png, .jpeg, .xml, .jpeg2000, .tiff, .xls, .ps, .txt.  I was almost alarmed when I opened Acrobat to obtain an accurate list and found so many formats supported.  And there are choices and settings for many of these. I assure you that this represents a great investment by Adobe to provide this support for reuse of PDF content. Many of these export functions are imperfect but do provide a strong basic ability to reuse content.

•  Hybrid Files. One can make a "hybrid" PDF document that includes the author’s original source file as an attachment. This is supported as an automated feature by Open Office tools as well as the Acrobat tools that create PDF files from Microsoft Office products. This provides a final form PDF document with the editability of the original source that the author used to create the PDF in the first place.

•  Forms. A more sophisticated kind of hybrid file is supported by PDF fill-in forms. This is so cool that I am going to make the discussion about it a separate Part 2 to this entry. (I wonder if the reason I think this is so cool is that I defined the properties for the Acrobat forms prototype in 1993. Nah!  It is just cool!)

If an author wants to inhibit the reuse of the content in their documents, they can set properties within the PDF file to prohibit it. For some authors the content of their documents represents their intellectual property and they want to protect it.
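For readers who want to see what such a property looks like, here is a small Python sketch of decoding a permissions value. It assumes the properties in question are the user access permission flags that the standard security handler stores as a signed 32-bit integer in an encrypted file, and the bit positions below are as I recall them from the ISO 32000-1 permissions table, so check them against the standard before relying on them. The function name and constants are my own, hypothetical names.

```python
# Assumed bit positions (bit 1 is the low-order bit), to be verified
# against the user access permissions table in ISO 32000-1.
PRINT = 1 << 2       # bit 3: print the document
MODIFY = 1 << 3      # bit 4: modify the contents
COPY = 1 << 4        # bit 5: copy or extract text and graphics
FILL_FORMS = 1 << 8  # bit 9: fill in existing form fields


def is_allowed(permissions: int, flag: int) -> bool:
    """Return True if the given flag bit is set in the permissions value.

    The value is stored as a signed 32-bit integer, so it is masked to
    32 bits before testing the bit.
    """
    return ((permissions & 0xFFFFFFFF) & flag) != 0


# Hypothetical value: every permission granted except copy/extract.
no_copy = -1 & ~COPY

can_copy = is_allowed(no_copy, COPY)
can_print = is_allowed(no_copy, PRINT)
can_fill = is_allowed(no_copy, FILL_FORMS)
```

Note that these flags are a request from the author that well-behaved software honors, which fits the point above: whether copying works is a decision by the author and the software, not a limit of the file format.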

So, if things don’t work to our liking, it may be an author’s decision or the software designer’s decision, but seldom should we hold this as a PDF deficiency. PDF is a cool tool.

Jim King (mailto:jking@adobe.com)

 

 

1:19 PM