I named this blog “Inside PDF” because I anticipated telling you a lot about PDF technology – what is inside of a PDF file and why. I have spent most of the time so far talking about PDF and standards. So, I thought it was about time to do an entry about PDF itself. I believe that PDF has been so successful because of the caldron out of which it was brewed. By 1990 Adobe was quite successful with PostScript. By then we had helped over 60 other companies make printers, image setters, and other imaging devices that used PostScript. We had also shipped Display PostScript and the Steve Jobs NeXT machine was a computer whose operating system’s imaging model was Display PostScript. So Adobe had had considerable experience in displaying documents on a screen.
Here are the problems we found with using PostScript as an electronic document representation:
- In those days, it was a real challenge to render PostScript pages fast enough for interactive applications.
- PostScript is a stream-oriented language, so it is generally impossible to display any given page without having processed all pages preceding it.
- PostScript Level 1.0 had no compression technology, so images and fonts made the files bigger than they needed to be (PostScript Level 2.0, which did have compression, was released in early 1991).
- PostScript files were very difficult, if not impossible, to reprocess (edit). Since PostScript is a programming language, one had to face the issue of understanding what a computer program does, and in its most general terms that activity has been proven to be impossible.
- The correct fonts had to be available when rendering the file, and often they were not.
- A PostScript interpreter/renderer had a relatively large footprint, and the Macs and PCs of those days were not very suitable targets for a PostScript interpreter.
Adobe actually made several false starts between 1989 and 1991 at defining an electronic document format before the PDF project really took hold. There was one called Interactive PostScript and another closely related to today’s ODF. Another fact gets factored into this history as well: Adobe Illustrator was built to support the PostScript imaging model, and a relatively lightweight imager based upon PostScript code was inside Illustrator. That code formed the starting place for Acrobat. Adobe Illustrator also had its own stylized way of using the PostScript language so that it could re-read and edit the files, and those ideas made their way into PDF as well.
Unlike many projects, the Acrobat/PDF project had the benefit of this previous work and the project was founded to design a language that got around the inherent limitations of PostScript for desktop display. So here are the design objectives set out for PDF from the start:
- Faster processing with software that would be suitable for PC and Mac viewers.
- Portable. This meant using printable ASCII characters throughout the file representation. Interestingly, this is similar to, and based upon, the same motivation as today’s XML using Unicode for its file representation.
- Small file size (using compression technology).
- Font independence. Embedding fonts was not considered to be the best solution because of their size.
- Single pass generation.
- Random access to contents, especially individual pages.
- Incremental updating of the file without a wholesale rewrite.
- Extensibility. The need to be able to add features without making either older files or older viewers obsolete.
Most of these were actually spelled out at the front of the first PDF manual.
It was decided to throw out the programming language aspects of PostScript in favor of keeping the graphic primitives but having them driven by constant parameters rather than variables. There is another interesting story around this aspect and the Acrobat Distiller that I will relate at another time.
Perhaps the most serious issue with PostScript was its basic stream nature. The files were intended to be read and processed in one pass from starting byte to ending byte. This allowed for documents of unlimited size to be processed by PostScript printers. But it also meant that a font that might have been downloaded on page 1 may be used on page 485 so you couldn’t just take that section of the file pertaining to page 485 and process it without having processed all preceding pages. Consequently, in PDF each page is specified and processed totally independently.
Certain features that took advantage of being an electronic document were foreseen, like hyperlinks from the table of contents or linking to the proper page from references within the document (e.g., see figure x on page y). To support these features responsively, not only do the pages have to be represented independently, but it must be possible to read only that portion of the file that contains a given page and its needed resources (fonts, images, etc.). This is the random access objective, the sixth item in the list above.
This requirement (6) led to introducing the idea of a PDF “object”, which is a rather low-level building block that can be read from the file randomly. The whole file is made only of objects, and of these object types: numbers (995, 3.65), strings ( (a string) ), names (/dog, /Jim), true, false, arrays (heterogeneous) ( [99 /dog 9.4 [1 2 3] true] ), dictionaries (key value pairs enclosed in << and >>), and streams (dictionaries followed by a bunch of bytes, arbitrary both in content and length). Space is the primary delimiter (not comma as in many languages) and strings are put into parentheses. This follows the PostScript representation for the same primitive objects, but PostScript did not separate them out as addressable objects in the file. In PDF each addressable object (called an indirect object) gets a number. At the very end of the PDF file is a cross reference table that maps object numbers into locations within the file. So PDF readers start by reading the cross reference table from the end of the file and then read the objects randomly by number. There is also a starting root object number held in a special structure at the end of the file. I have an annotated tutorial presentation that goes into more detail about this.
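To make the lookup concrete, here is a rough Python sketch of the idea, not how any real viewer is written. It builds a tiny three-object file in memory, then reads it the way described above: find the cross reference table from the end of the file, then fetch one object by number without touching the others. The helper names (build_tiny_pdf, read_xref, read_object) are made up for illustration, and the sketch assumes a single classic cross reference section with no incremental updates and no compressed streams.

```python
import re


def build_tiny_pdf() -> bytes:
    """Assemble a tiny PDF-like file, recording each object's byte offset."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # where "N 0 obj" begins
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always a free entry
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_xref(data: bytes):
    """Return ({object number: byte offset}, root object number)."""
    # Start at the end: 'startxref' is followed by the table's byte offset.
    tail = data[data.rfind(b"startxref"):]
    xref_pos = int(tail.split()[1])
    lines = data[xref_pos:].split(b"\n")
    assert lines[0].strip() == b"xref"
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        offset, _gen, kind = lines[2 + i].split()
        if kind == b"n":  # 'n' marks an object in use, 'f' a free entry
            table[first + i] = int(offset)
    # The trailer dictionary names the root object of the document.
    trailer = data[data.rfind(b"trailer"):]
    root = int(re.search(rb"/Root\s+(\d+)\s+\d+\s+R", trailer).group(1))
    return table, root


def read_object(data: bytes, table: dict, num: int) -> bytes:
    """Jump straight to one object by number and return its raw body."""
    start = table[num]
    end = data.index(b"endobj", start)
    header, _, body = data[start:end].partition(b"obj")
    assert int(header.split()[0]) == num
    return body.strip()


pdf = build_tiny_pdf()
table, root = read_xref(pdf)
print("root object:", root)
print("object 3:", read_object(pdf, table, 3).decode("ascii"))
```

Notice that read_object never scans objects 1 and 2 to get to object 3; the byte offset from the table is enough. That is the whole point of the random access objective.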
Well, this is rambling on and getting too long. But I have made a couple of the most important points: the PDF design was driven by some well-established design principles learned from previous attempts, and the random access feature of the design sets it apart from many if not most other document formats. I will come back to this in the future and fill in some more information on how the problems with PostScript led to really good design decisions for PDF.