Archive for October, 2007

October 31, 2007

PDF by Design

I named this blog “Inside PDF” because I anticipated telling you a lot about PDF technology – what is inside of a PDF file and why.  I have spent most of the time so far talking about PDF and standards. So, I thought it was about time to do an entry about PDF itself. I believe that PDF has been so successful because of the caldron out of which it was brewed. By 1990 Adobe was quite successful with PostScript. By then we had helped over 60 other companies make printers, image setters, and other imaging devices that used PostScript. We had also shipped Display PostScript and the Steve Jobs NeXT machine was a computer whose operating system’s imaging model was Display PostScript. So Adobe had had considerable experience in displaying documents on a screen.

Here are the problems we found with using PostScript as an electronic document representation:

  1. In those days, it was a real challenge to render PostScript pages fast enough for interactive applications.
  2. PostScript is a stream oriented language so it is generally impossible to display any given page without having processed all pages preceding it.
  3. PostScript Level 1.0 had no compression technology so images and fonts made the files bigger than they needed to be (PostScript Level 2.0 which did have compression was released in early 1991).
  4. PostScript files were very difficult if not impossible to reprocess (edit). Since PostScript is a programming language one had to face the issue of understanding what a computer program does, and in its most general terms that activity has been proven to be impossible.
  5. The correct fonts had to be available wherever the file was rendered.
  6. A PostScript interpreter/renderer had a relatively large footprint, and the Macs and PCs of those days were not very suitable targets for one.

Adobe actually made several false starts during 1989 – 1991 at defining an electronic document format before the PDF project really took hold. There was one called Interactive PostScript and another very related to today’s ODF. Another fact gets factored into this history as well: Adobe Illustrator was built to support the PostScript imaging model and a relatively lightweight imager based upon PostScript code was inside Illustrator. That code formed the starting place for Acrobat. Adobe Illustrator also had its own stylized way to use the PostScript language so that it could re-read and edit the files, and those ideas made their way into PDF as well.

Unlike many projects, the Acrobat/PDF project had the benefit of this previous work and the project was founded to design a language that got around the inherent limitations of PostScript for desktop display. So here are the design objectives set out for PDF from the start:

  1. Faster processing with software that would be suitable for PC and Mac viewers.
  2. Portable. This meant using printable ASCII characters throughout the file representation. Interestingly, this is similar to, and based upon, the same motivation as today’s XML using Unicode for its file representation.
  3. Small file size. (use compression technology)
  4. Font independence. Embedding fonts was not considered to be the best solution because of their size.
  5. Single pass generation.
  6. Random access to contents, especially individual pages.
  7. Incremental updating of the file without a wholesale rewrite.
  8. Extensibility. The need to be able to add features without making either older files or older viewers obsolete.

Most of these were actually spelled out at the front of the first PDF manual.

It was decided to throw out the programming language aspects of PostScript in favor of keeping the graphic primitives but having them driven by constant parameters rather than variables. There is another interesting story around this aspect and the Acrobat Distiller I will relate at another time.

Perhaps the most serious issue with PostScript was its basic stream nature. The files were intended to be read and processed in one pass from starting byte to ending byte. This allowed for documents of unlimited size to be processed by PostScript printers. But it also meant that a font that might have been downloaded on page 1 may be used on page 485 so you couldn’t just take that section of the file pertaining to page 485 and process it without having processed all preceding pages. Consequently, in PDF each page is specified and processed totally independently.

Certain features that took advantage of being an electronic document were foreseen, like hyperlinks off of the table of contents or linking to the proper page within document references (e.g., see figure x on page y).  To support this feature with responsiveness, not only do the pages have to be represented independently, but it must be possible to read only that portion of the file that contains a given page and its needed resources (fonts, imaging, etc.). This is the random access item number 6. 

This requirement (6) led to introducing the idea of a PDF “object” which is a rather low level building block that can be read from the file randomly. The whole file is made only of objects and of these object types: numbers (995, 3.65), strings ( (a string) ), names (/dog, /Jim), true, false, arrays (heterogeneous) ( [99 /dog 9.4 [1 2 3] true] ), dictionaries (key value pairs) ( << /key value >> ), and streams (dictionaries followed by a bunch of bytes, arbitrary both in content and length). Space is the primary delimiter (not comma as in many languages) and strings are put into parentheses. This follows the PostScript representation for the same primitive objects, but PostScript did not separate them out as addressable objects in the file. In PDF each addressable object (called an indirect object) gets a number. At the very end of the PDF file is a cross reference table that maps object numbers into locations within the file. So PDF readers start by reading the cross reference table from the end of the file and then read the objects randomly by number. There is also a starting root object number held in a special structure at the end of the file. I have an annotated tutorial presentation that goes into more detail about this.
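
To make the cross reference idea concrete, here is a minimal Python sketch. It is an illustration only, not how Acrobat or any real reader is implemented: the function names (build_tiny_pdf, read_xref, read_object, find_root) are invented for this example, and it assumes the simplest case of a single classic cross reference table with no streams and no later updates appended to the file. It builds a three-object file in memory, then reads it back the way described above: start at the end, find the table, and jump straight to an object by its number.

```python
import re


def build_tiny_pdf(bodies):
    """Assemble a minimal PDF-like file from a list of object bodies (bytes)."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))                 # byte offset where object starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    size = len(bodies) + 1                       # +1 for the reserved entry 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"              # entry 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off         # fixed-width, 20 bytes per entry
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_xref(data):
    """Starting from the end of the file, return {object number: byte offset}."""
    tail = data.rfind(b"startxref")
    xref_pos = int(data[tail + len(b"startxref"):].split()[0])
    lines = data[xref_pos:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("no classic cross reference table at that offset")
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":                         # "n" = in use, "f" = free
            table[first + i] = int(offset)
    return table


def read_object(data, table, num):
    """Jump directly to object `num` and return its body, skipping everything else."""
    start = table[num]
    end = data.index(b"endobj", start)
    _header, _, body = data[start:end].partition(b"obj")
    return body.strip()


def find_root(data):
    """Read the root object number from the trailer at the end of the file."""
    trailer = data[data.rfind(b"trailer"):]
    return int(re.search(rb"/Root\s+(\d+)\s+\d+\s+R", trailer).group(1))


pdf = build_tiny_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R >>",
])
table = read_xref(pdf)
print(find_root(pdf), read_object(pdf, table, 3))
```

Notice that reading object 3 never touches objects 1 and 2. That is the whole point of requirement 6: the cost of showing a page depends on that page and its resources, not on how much of the file comes before it.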

Well this is rather rambling on and getting too long. But I have made a couple of the most important points: the PDF design was driven by some well established design principles learned from previous attempts, and the random access feature of the design sets it apart from many if not most other document formats. I will come back to this in the future and fill in some more information on how the problems with PostScript led to really good design decisions for PDF.

Contact me at jking@adobe.com


11:50 AM Comments (4) Permalink
October 22, 2007

Archiving Documents

Archiving is a rather loaded word since doing it can be a widely varying activity. In many situations, archiving PDF files is a very good solution. In fact it was so attractive to some US Government agencies that they encouraged their personnel to work on an ISO committee/working group to define a special subset of PDF called PDF/A that meets their needs better than plain old PDF might. 

Generally PDF/A restricts the use of features of PDF that might make the visual presentation unreliable, such as content that might change according to the day on which it is viewed. (Yes, you can do things like that in unconstrained PDF!) PDF/A also requires the use of certain features in PDF, such as structural information that allows the content to be read more reliably to people who are blind and to be extracted for reuse. The point of this blog is not to teach you the technology of PDF/A; you can learn about PDF/A here and here.

I wanted to talk more about a couple of alternatives that come up, namely ODF and OOXML. I sometimes wonder if I am losing it, because I have such a hard time understanding some other people’s (wrong) viewpoints. Both ODF and OOXML are editable document formats. In fact, they are defined as companions to editing programs, which used to be called word processing software. If you want to change the document you can read one of these files into one of these editors and change away. This is a great virtue, if that is what you want to do. But would Silver Bow County, Montana want me to be able to edit my birth certificate? I think not. And yes, that is my real birth place, Butte, Montana (in Silver Bow County). Do we want our court records to be editable? I think not. So why would I want to choose one of the editable document formats for such records? And this is not a straw man argument because people are repeatedly saying that ODF and OOXML are ideal archiving formats for such material.

Now to be fair, they believe that these will be great archiving formats primarily because they are “XML.” I have already posted a blog about XML Documents where I point out that although these document formats do contain XML subfiles, they are in truth ZIP Archives. Nonetheless, much of the information you might want to reuse or edit in these archives is in an XML markup language of one kind or another. I also have a previous blog about why it isn’t very helpful to claim something is XML. It is much more informative to say it is XML for [something]. And of course, if some XML is for [something] it might not be suitable for [something else].

Of course, I believe that PDF/A is the current best choice for archiving “documents of record.”  PDF was originally designed to capture print output from any application in such a way that it could be viewed and navigated efficiently as well as easily printed to paper. Adobe has added features to support ways of “editing” PDF files and those features are highly desired by many, but PDF/A does a good job of preserving that original PDF design for capturing documents that might otherwise be printed. Here is a diagram that I made to emphasize the differences between PDF and the other two document formats.

The upper left represents those files using OOXML and the upper right those using ODF.  These two formats, as well as this discussion, cover the three authoring paradigms for: documents, spreadsheets and projector presentations.  PDF can uniformly capture any of these types of documents. ODF and OOXML are what Adobe has traditionally called “native file formats” for their respective editing applications. If I want to go to lunch in the middle of an authoring activity, I save my work in a native file format.  The above diagram also shows the difference between editing and reuse. The dotted line going from the PDF file back to the editing applications indicates that the content of PDF files can be reused. When considering our archiving needs we also need to distinguish between reuse and editing; reuse might be important and editing not. PDF files do support content reuse.

Back to archiving. As I said at the start, archiving documents can mean many things. I often create documents that I might want to recreate and edit many years after I did the original work and it is great to be able to find the old files to update. Sometimes I cannot find the files and sometimes I no longer have the correct editor or the correct version of the editor to read them. But the claim is that these new “XML-friendly” formats will make that better. I am not sure exactly why this belief is held strongly by some, but I guess if you make the formats ISO standards, the software developers may be more careful to read old files with new programs.

Adobe has gone to great lengths to make sure that PDF is upward compatible and with few exceptions any proper PDF file that exists can be viewed with Adobe Reader. We plan to continue to do this.

So there might be archiving reasons to call for using PDF and there might be archiving objectives that call for using ODF or OOXML. Somehow I cannot help but believe that the number of archiving needs for non-editable formats far exceeds that for editable forms. After all, archiving means preserving things, and usually you want to preserve the total look of a document. PDF/A does that.

Before I conclude, I want to say just a bit more about XML. There is what I think is a rather technically shallow belief that XML files are easier to work with and will survive the passing of time, even great periods of time, better than other formats. The text held within XML files can usually be viewed with any generic text editor and I guess that gives people a warm feeling that it will therefore also be easier to retrieve with a program. Fair enough. But what is glossed over way too much is that the text is enveloped within XML for [something]. (See my earlier blog entry.) The envelopes (schemas) offered by ODF and OOXML are different. Different enough that a simple program cannot extract just the raw text from either. And is that all I really want from a document in the future, the raw text? Because when you get to the layout and the images and the color space definitions and the fonts, these things do not lend themselves well to XML and are often stored within the ZIP archives as binary data. So tell me again where the advantage to XML is for this purpose?

Now I will give two more hits against the myth that ODF and OOXML will be easier to process at some distant date because much of the content is XML. First is the fact that the XML is Flate compressed in both formats, making it impossible to read in a generic editor until after it has been de-Flated. Second, do we really believe that 50 or 100 years from now our software and computer technology will still be in a state where anyone will give a hoot whether something is XML or not? Will there not be something far superior invented by then?
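
The first point is easy to check for yourself. Here is a small Python sketch using only the standard zipfile module. It is a toy, not a real ODF or OOXML package: the helper names (make_package, extract_member) and the XML content are made up for the example, and the member name content.xml is borrowed from ODF's convention purely for flavor. It packs a bit of XML into a Flate-compressed ZIP archive and then looks for the text in the raw bytes, which is what a generic text editor would be showing you.

```python
import io
import zipfile


def make_package(xml_bytes, name="content.xml"):
    """Pack one XML member into an in-memory ZIP archive using Flate (deflate)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, xml_bytes)
    return buf.getvalue()


def extract_member(package_bytes, name="content.xml"):
    """De-Flate one member of the archive and return its original bytes."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return zf.read(name)


xml = b'<?xml version="1.0"?><doc><p>' + b"archive me " * 50 + b"</p></doc>"
package = make_package(xml)

# What a generic text editor sees: the raw archive bytes.
print(b"archive me" in package)
# What a ZIP-aware program sees after decompression.
print(b"archive me" in extract_member(package))
```

The member name survives in the clear because ZIP stores file names uncompressed, but the text itself only reappears once a program that understands both ZIP and Flate has done its work. So the "just open it in a text editor" comfort does not actually apply to these packages.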

By the time you worry about these issues you will find that processing PDF files will be as easy or maybe even easier than those XML friendly versions.

Contact me at jking@adobe.com


11:18 AM Comments (5) Permalink
October 17, 2007

GosCon07

I just completed a day and a half attending the Government Open Source Conference (GOSCON) in Portland, Oregon. This was the third such conference, but the first I have attended. I attended for one very PDF related reason. I found these sentences on the web page publicizing the conference:

"The maintenance and exchange of the most important records such as certificates of birth, marriage and death, taxes, licenses, deeds, laws, regulations, codes and rules — as well as their role in business processes — are the responsibility of IT officers around the globe." (see the description of GOSCON‘s Executive Panel)

To me this sounded like an exact match for PDF/A, the archiving subset of PDF. Yet the only document formats mentioned in the rest of the material were OOXML and ODF. The speaker list seemed to invite yet another back and forth between Microsoft and ODF supporters. Adobe wants PDF (or PDF/A, or PDF/E, or PDF/X) to be the file format of choice when most appropriate. There are more PDF files on the Web than ODF, OOXML, and .doc all together. There are billions of PDF files in existence. PDF was announced in June 1993 and with few exceptions every one of those files is still viewable and printable with today’s software from many vendors including many open source projects. It doesn’t attest to what will be true 50 or 100 years from now, but PDF has done better than any other document format for maintaining its archiving ability for over 14 years. Yet PDF was completely unmentioned on a panel about Open Document Formats, and unmentioned in any of the other conference material. So one of my colleagues sent off a note to the organizers asking if Adobe could get someone on the panel to bring PDF into the discussion. For whatever reason, we had only heard about this conference at the last moment. But to our surprise the organizers agreed that I could be part of that panel. So off to Portland for Monday and Tuesday morning!

I had some trepidation since I am certainly not an open source advocate and even though Adobe has some significant open source work (Tamarin) underway, I thought I might be attacked by the vociferous crowd that worships at the open source altar. Might I be provoked to defend myself in an unprofessional manner? It isn’t that I don’t think people should do open source, it is just that I basically don’t get it. I am a true capitalist at heart and I was just a bit old to be one of the back to basics persons of the 60s and in order for me to understand open source I need to know how the programmers get paid for all this "free" work.

Well I did learn a great deal by attending this conference, and am definitely more tolerant and understanding of the idea, but not quite yet a true believer. This conference had just over 150 attendees with about half from Oregon, mostly city and state government employees, and the rest of the attendees were from all over the world. The Malaysian Government had a delegation of 3 or 4 since they have been quite successful in using open source software. The Conference was, how shall I say, definitely Linux supportive, although people did talk about lots of other things. I would rate the average speaker as "B" or "C+" with a few outstanding and a few not so good. The technical level was somewhat low with too much evangelical promotion.

I guess my key take-away was as follows. Big government can afford to hire vendors to write custom software for them to accomplish their goals. Smaller government groups have many, if not all, of the same needs and yet they cannot afford to get custom software created or to create it themselves. And I guess the vendor community is not doing a thorough job of building turn-key software that gets a large enough sales coverage to be cheap enough for this market. Something like that. And this conference had a lot of attendees from smaller government groups. It ranged from people from IT departments under correctional institutions to people representing 5-person police departments to city IT personnel. What makes a great deal of sense for these government agencies is to band together to define the software that they require and get it created. Open source methods fit this pretty well. Consortia might do just as well. The agencies have money but it would go a lot further if they could share costs 10 ways, or 50 ways or 1000 ways.

There were three keynotes, which were pretty good (head of Linux Foundation, director of Eclipse Ecosystem, VP from Gartner) and then 4-way parallel sessions. One speaker backed out at the last minute and so they offered Doug Johnson from Sun and me the opportunity to organize a session on Open Source versus Open Standards. Neither of us is shy about speaking our mind so for a last minute session, we did OK; I would give us a "B-".

Overall I’m glad I attended. I forced some attention to PDF, the primary objective, and I learned more about open source and perhaps more importantly how open source advocates reason. And I did attend one session where a Microsoft guy and an IBM guy squared off on OOXML versus ODF. Surprisingly, there were fewer than 15 people attending. Maybe we are getting tired of this argument. But I would have been disappointed if there had not been such a session. I think the IBM guy won by a small bit. Contact me at jking@adobe.com.

5:19 AM Comments (0) Permalink
October 5, 2007

High Drama! — curtain call

Microsoft recently received an award that I am sure they will just love. The FFII (Foundation for a Free Information Infrastructure) has awarded Microsoft the "Best Campaigner Against OOXML" distinction. Apparently this is part of a larger campaign on the part of FFII against the ratification of OOXML as an ISO/IEC standard. I guess it is an attempt to put some humor, at Microsoft’s expense, into a rather serious issue.

This is a serious issue because it is all about competition in a market that produces critical revenue for Microsoft in the tens of billions of dollars. How would you like to be the Microsoft executive held responsible for allowing competition to take 10% or 20% of that revenue away? The irony is that the people that are threatening to do that are open source advocates; supposedly most of the money not going to Microsoft would be going to the customers.

But all of this makes me nervous. As I posted earlier, I am the technical person and the technical lead in handing the PDF standard off to ISO. I do not want Adobe to get the next FFII award. I wrote a previous blog about this and especially about fast tracking standards, which I think is where the big difference lies between what Microsoft has done and what Adobe is doing. Check it out.

Selling the Adobe products that depend upon PDF is also big business for Adobe.  But I am also very confident that the character of Adobe employees and especially of our executives is strong enough to make the right choices between big money pressures and doing the right thing.  Moving PDF down the ISO fast track is the right thing.  – Jim King

2:50 PM Comments (1) Permalink
October 2, 2007

Submission of PDF to ISO

I have written around this subject before but I have never given it a complete run through. I am going to do that now. On January 29, 2007 Adobe Systems announced that it would work with AIIM and ISO’s Technical Committee (TC) 171 to make PDF 1.7 an ISO standard.

Following this announcement, on July 2, 2007, the PDF 1.7 Reference, converted to follow ISO template and styling rules, was sent out as a Draft International Standard (DIS) ballot to the national standards organizations that are members of ISO/TC 171 using the ISO “Fast Track” procedure. This process is managed by AIIM on behalf of ISO.

The ballot closes on December 2, 2007. Following ISO rules, if the ballot passes and all comments can be resolved, then the DIS will become an ISO Standard: “ISO 32000 — Document management — Portable document format — PDF 1.7”. If the ballot passes but one or more ballots contain unresolved comments then, after addressing the comments, a second Final Draft International Standard (FDIS) ballot that lasts 2 months will be conducted. If that passes (no comments allowed) then it will become an ISO Standard.

Since after a standard has been approved it takes ISO several months to officially publish it, we can expect the ISO 32000 PDF 1.7 standard document to be published by ISO sometime in 2008.

Additional Notes:

  • There are currently 14 countries that are voting P-members of ISO TC 171.
  • Adobe converted the PDF 1.7 Reference into an ISO Draft by 1) using the ISO template and styling directives, 2) removing all Adobe and Acrobat specific information making the document truly vendor neutral, 3) removing obsolete material and 4) adding some material required by ISO including terms and definitions and required introductory material. The draft was also required to use International English spelling and not American English spelling.
  • Once PDF becomes an ISO standard, Adobe will be just one of many implementors of that standard. Adobe expects to participate on the ISO committees determining changes that should be included in a revised standard.
  • Adobe has agreed to abide by ISO intellectual property rules governing patent licensing.
  • Adobe will continue to innovate with respect to Acrobat, LiveCycle and PDF, and will work with the ISO committees on subsequent versions of the ISO 32000 standard.
  • AIIM, with Adobe’s support, submitted PDF 1.7 to ISO TC 171 for fast track processing as an existing de facto standard. A rather different approach could have been to have some other organization, like ECMA, standardize PDF and then submit it to ISO as their existing standard. It would also have been possible to submit to the Joint Technical Committee (JTC1) which is a cooperation between ISO and the International Electrotechnical Commission (IEC). The same path through AIIM to ISO TC 171 has been followed in establishing PDF/A (ISO 19005) and PDF/E (ISO 24517). PDF/X (ISO 15930) took yet a different path.

Further Information:

  • AIIM PDF Standard: http://www.aiim.org/standards.asp?ID=33223
  • ISO Process: http://www.iso.org/iso/en/stdsdevelopment/whowhenhow/proc/proc.html
  • PDF 1.7 Reference: http://www.adobe.com/devnet/pdf/pdf_reference.html
  • Q&A on Adobe website: http://www.adobe.com/pdf/release_pdf_faq.html
  • Adobe Contact: Nora Calvillo <calvillo@adobe.com>


10:00 AM Comments (2) Permalink