Archive for November, 2007

November 29, 2007

CDF

I have heard suggestions that CDF (Compound Document Formats) is a good alternative to ODF or OOXML and, I suppose by extension, to PDF. I want to explain a bit about CDF and connect the discussion to my last blog on Communication via Documents. CDF is a set of candidate W3C recommendations (their term for standard; candidate meaning not quite baked). One of the things I want to get to in this article is the "s" on the end of Compound Document Formats. But let me start with what CDF is or isn’t.

In my opinion the name CDF, in general, is a little off because these recommendations (aka, standards) are aimed at DOM processing (Document Object Model) and defining how heterogeneous web languages, when processed into a common DOM, can communicate more effectively and have more impact upon each other. To me it doesn’t address "document formats" very directly. The W3C has developed a modest set of XML markup languages for various specialized forms of document content (e.g., XHTML, XSL, XForms, SVG, SMIL, MathML and VoiceXML [see here for the recommendations for any of these]). The basic CDF idea is to specify how to make a unified document out of a bunch of these content types. So the first steps of defining a "format" have already been done for the components of a "W3C" compound document.

The proposed recommendation talks about a document root in a "host" language (the only example given is XHTML) that includes subordinate content in some "child" languages either by "reference" or by "inclusion". Inclusion by reference is the approach most commonly used today. Web pages are put together with the host HTML page using URIs to reference images, SWF (Flash), SVG, etc.

CDF uses <object> in XHTML to reference children. Since they mostly talk about XML markup languages, inclusion is just a matter of sucking up the child into the host XML, using namespaces to distinguish which is which. Another possibility, but not discussed, is to do as OOXML, ODF, and Mars have done and use ZIP archives to collect all the document components into a single file (package). Like it or not, the CDF folks will have to face the idea that a fully self-contained compound document will need: JPEG images (not in an XML markup language), OpenType fonts (not in an XML markup language), ICC color profiles (not in an XML markup language) and much more. So inclusion by "sucking up" into one XML file is not a very universal solution. Of course, they may not be interested in self-contained compound documents.
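
To make inclusion by namespace a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the tiny document and the helper name elements_by_namespace are made up for this example and are not taken from the CDF or WICD documents. The two namespace URIs are the standard ones for XHTML and SVG. The sketch parses a host XHTML document with an SVG child pulled in directly, then uses the namespaces to tell which element belongs to which language.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative compound document: an XHTML host with an inline SVG child.
COMPOUND_DOC = f"""\
<html xmlns="{XHTML_NS}">
  <body>
    <p>A host paragraph in XHTML.</p>
    <svg xmlns="{SVG_NS}" width="100" height="100">
      <circle cx="50" cy="50" r="40"/>
    </svg>
  </body>
</html>
"""


def elements_by_namespace(xml_text):
    """Count the elements in each namespace of a parsed document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # ElementTree spells qualified names as "{namespace-uri}localname".
        tag = element.tag
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        counts[namespace] += 1
    return counts


counts = elements_by_namespace(COMPOUND_DOC)
for namespace, count in counts.items():
    print(namespace, count)
```

Note that nothing like this helps with a JPEG, a font or a color profile, which is exactly the limitation of inclusion described above.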

The real technical content of the CDF recommendations is in details of how to glue these various (XML markup) languages together once they all have been processed into their respective DOMs. It establishes conventions for how a script might reach across DOM boundaries, how events might get propagated across DOM boundaries and stuff like that. I won’t go into this any deeper because you will get more accurate information by just reading the W3C documents. But the main idea of CDF is to bring these variously defined content types into a uniform "framework" so that scripts can operate more at a document level instead of being confined to their own document child.

But one thing that is really interesting to me and that I disagree with is that they set up a "framework" for an arbitrary set of recommendations (remember that translates loosely to "standards") one for each different compound document "profile." You will find the following candidate recommendations on the website: Compound Document by Reference Framework 1.0, WICD Core 1.0, WICD Mobile 1.0 Profile, WICD Full 1.0 Profile (WI is for Web Integration). So we do not get one compound document format but an arbitrary set of formats that all follow similar rules on how things are glued together. On the surface that sounds like a nice modular approach. But what I worry about is the proliferation of the profiles in what I believe is a "Communication by Documents" scenario. Given that this is about Web/Internet documents, they are by nature communication documents. Please read my previous blog where I describe this notion and claim that that was the design point for PDF. It seems strange to me that others haven’t picked up on this notion and been more conservative about defining language profiles.

So what is wrong with language profiles? Well, in a communications application, where you are trying to send document information from one person in the world to another person in the world, you really want a very limited set of standards that have to be followed on both ends to make the communications work. In my previous blog I talk about having red telephones and blue telephones and blues can only connect to blues and reds to reds. If you let the colors (profiles) proliferate you get a totally useless mess.

CDF seems to be on a path to institutionalize such a mess. Most of their work so far has been on the framework and that is great and is a long time coming for the web XML markup languages. But to leave the door open to multiple profiles for compound documents, and it seems like they are thinking many, is where things go astray. 

It is still salvageable if they pick one set of XML markup languages and edict that all conforming processors must process exactly this set, probably their Full profile. This has been another missing element of the web as we often get documents that do not come out correctly because we are missing some browser plug-in or other.

And of course, PDF has gotten this right!

Contact me at: jking@adobe.com


8:07 AM
November 16, 2007

Communication via Documents

PDF documents are all about information communication. The communication aspect is often not completely appreciated by other people inventing document formats. And I am thinking here of the CDF, ODF, OOXML folks.

Let’s look at a more familiar example of communication, the telephone. Within some limits of reason, we absolutely expect to be able to pick up any phone anywhere in the world and successfully talk to any other person in the world who also has a phone. That is amazing! But how is that accomplished?  Answer: by using common standards to which all phones and all components of the phone system adhere. Now this blog is less of a plea on my part for standards than it is a plea to understand the very special properties of those standards that must be followed to allow a communication system to be built upon those standards. There are document standards and there are communication document standards and they can be quite different.

Here is the kind of thing you want to avoid at all costs: three kinds of phones, red, blue and black and a phone of a given color can only connect to a phone of the same color. Wouldn’t that be a mess! There actually is a tiny element of this among the cell phone vendors already when it comes to charging for calls. Some vendors offer free calling to any phone from the same vendor (of the same color). But there would be great alarm if those were the only calls that could be made. And how additionally confusing would it be if red phones could call blue but not black, and blue could only call blue, but black could call black and red but not blue. Something that complex would reduce the usage of phones by an order of magnitude.

Well these same ideas apply to communicating information via documents. In this case the sender and receiver are software tools for authoring and displaying documents. In many cases the same software plays both roles of sender and receiver. If I put a document onto a website, I can easily anticipate that 100,000 people or more might read it. (I am waiting for those kinds of numbers for this blog.  Tell your friends!)  I want all of those people to be able to read it no matter what their computer or favorite software. So we have to have standards that are strictly adhered to as to the form those document files can take and also the software has to be written to strictly obey the standards.

Let’s look at a different situation, that of application specific save files, or what we sometimes call “native file formats.”  The primary function for these files is to save any work I am doing so that I can shut down the application and come back later to pick up where I left off. I am thinking of the .psd files of Adobe Photoshop, or the .doc files of Microsoft Word or any other file format closely aligned with a favorite application. Saving is not the only use but that is a key use.

Typically these file formats get revised when a revised version of the product gets introduced. Typically the new application will read older versions of the files and the user can then save them in the new format. But if your files go back too far, your modern software may not be able to read them. Adobe’s products are quite good at release after release being able to read all older files. Other companies are not so good about this. I think it could have something to do with putting pressure on all users of that software to keep updating to the newer versions.

Certainly, most new format files will not successfully be read by the old software. They are new formats because they have new information in them unknown to the old programs.

There is another aspect of these native file formats that is often overlooked. Since they are intended to serve as a snapshot of the work being done with that product, the files generally contain much more than the final appearance of the document. For example, Adobe Illustrator files can contain sets of patterns and color swatches that the user has defined to use with this particular file. They are saved and restored to/from the native .ai files. So these native files may be larger than really needed for communication purposes.

And native file formats are not very good to use as archive files unless they are “turned over” every few years, updating the files to the new revised formats. I wrote about archiving and file formats earlier.

So are OOXML, ODF and CDF native file formats or communication formats? Hard compromises have to be made for a format to be both and, so far, I cannot think of a single example where this has been done truly successfully. PDF is not a native file format and has been designed and managed as a communications format (only).

One more aspect of communications formats is the toughest one to avoid, that of special versions or “profiles” of the format. Usually the formats are designed to be more ambitious than what can be easily handled by limited-function devices. So there is great pressure to define a subset of the original format that is limited in some ways to be supported by limited devices. This then fractures the standard and we get into the situation analogous to red and blue phones where red can call anyone but blue can only call blue. (Subset files can be read by anyone but full format files cannot be read by limited devices.)  Not good!
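
The asymmetry is easy to see in a small Python sketch. This is a toy model under my own assumptions: the profile names and feature names below are hypothetical and do not come from any real specification. Each profile is modeled as the set of features a conforming reader must support, and a file is assumed to use every feature its writer's profile allows.

```python
# Hypothetical profiles, each modeled as the set of features it supports.
PROFILES = {
    "full": frozenset({"text", "images", "vector_graphics", "forms", "video"}),
    "mobile": frozenset({"text", "images"}),
}


def can_read(reader_features, file_features):
    """A reader succeeds only if it supports every feature the file uses."""
    return file_features <= reader_features


# Worst case: a file uses everything its writer's profile permits.
matrix = {
    (writer, reader): can_read(PROFILES[reader], PROFILES[writer])
    for writer in PROFILES
    for reader in PROFILES
}

for (writer, reader), ok in sorted(matrix.items()):
    print(f"{writer} file on {reader} reader: {'ok' if ok else 'FAILS'}")
```

Every profile added to the dictionary adds another row and column to that matrix, and with it more pairs that can fail.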

You might have noticed that I mentioned CDF (Compound Document Formats) for the first time in any of these blogs. That is because of a recent announcement by the Open Document Foundation, Inc., previously a strong supporter of ODF, that it will now be helping to promote CDF. I will follow this blog with one about CDF.

Anyway, the bottom line for the PDF architect: PDF is about the only format that has been specifically designed to be a communications format and it shows.

Contact me at: jking@adobe.com

 


6:09 PM
November 3, 2007

The Distiller Story

This just might be apocryphal but to the best of my knowledge this explains the heritage of the Adobe Acrobat Distiller. The story also motivates one of the key ways that PDF deviates from PostScript.

When you make a printer, you need to have some pages to print that really show off the great features of the printer. At least this was true in the very early ’80s when small desktop copiers were being experimented with as printers. Adobe had invented PostScript as a device- and resolution-independent way to get sophisticated text, image and graphics output on a laser printer.

Dr. John Warnock, one of the founders of Adobe, had decided to use the classic US 1040 tax form as a great example of a complex and graphically rich document to print. It involves a lot of text, lines and shaded areas, the type of material that a laser printer can do a great job on but that typical printers of the day could not handle. Now remember that PostScript is a programming language and you can write subroutines to do various imaging jobs. So John made heavy use of subroutines as he hand-crafted the US 1040 tax form. This was a lot of work but by carefully building up a nice hierarchical line drawing, box drawing and shading subroutine library he was able to get a very close 1040 facsimile to print on an experimental PostScript printer. Trouble was it took many minutes to print, many minutes. Not great for the intended purpose of demos.

John was pretty sure that most of the processing time was due to the extensive subroutine use and not due to the actual rendering of the lines, shades and text. Also, being a typical programmer who would rather have a computer do the work, he set out to see if he couldn’t automatically convert his 1040 PostScript program into one without subroutines at all. One of the features of PostScript is that you can redefine all of the operators in the language. So John wrote a PostScript program that, when loaded just before printing the 1040 form, redefined all the graphic rendering primitives to write out a text string that represented the PostScript for what that operator was being asked to do. Did I forget to mention that PostScript can also write out strings that get sent back to the computer to which the printer is attached? Well, it can.

This may be a little subtle so let me try to make it more specific and understandable. He wrote a PostScript program to run on a PostScript printer whose purpose was to transform PostScript print jobs and send the transformed versions back to the computer to which the printer was attached. Suppose that one of the uses of the graphics primitives was buried deep in a subroutine stack and was written like this: “X1 Y2 moveto X2 Y2 lineto X3 Y3 lineto stroke”, where the Xs and Ys were variables computed in some way, probably within a loop in the subroutines. What would get sent back to the computer in one pass of the loop might look like: “200 200 moveto 250 200 lineto 250 250 lineto stroke”. In the second pass of the loop the same PostScript might send back: “300 300 moveto 350 300 lineto 350 350 lineto stroke”. If the subroutine loop was executed 15 times then 15 strings like those would get sent back to the computer, but each would have different numbers to draw lines in different spots on the 1040.

The interesting thing was that the operators in this new PostScript had no variables and the numbers they contained were the result of executing the original 1040 PostScript program. So if we do this for everything on the page, we end up with a rather unraveled and perhaps voluminous set of basic graphic utterances. If those are subsequently sent down to the printer they will produce the exact same 1040 form but much faster, since there are no more programming constructs to interpret.
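
For readers who think better in code, here is a rough Python analogy of the distilling idea described above. It is not PostScript and it is not how the original program was written: the names draw_corner, page_program and distill are hypothetical, and where the real thing redefined the PostScript operators themselves, this sketch simply hands the "page program" a recording function to call in place of the drawing primitives. The point it shows is the same one: run the program, and write down each primitive with the numbers it actually received.

```python
def draw_corner(emit, x, y, size):
    """A 'subroutine' that draws one corner shape using variables."""
    emit("moveto", x, y)
    emit("lineto", x + size, y)
    emit("lineto", x + size, y + size)
    emit("stroke")


def page_program(emit):
    """The original program: a loop calling a subroutine with computed values."""
    for i in range(3):
        origin = 200 + 100 * i
        draw_corner(emit, origin, origin, 50)


def distill(program):
    """Run the program with recording primitives; return the flattened output."""
    lines = []
    current = []

    def recording_emit(operator, *operands):
        # PostScript is postfix: operands first, then the operator name.
        current.extend(str(value) for value in operands)
        current.append(operator)
        if operator == "stroke":
            lines.append(" ".join(current))
            current.clear()

    program(recording_emit)
    if current:  # flush anything not terminated by a stroke
        lines.append(" ".join(current))
    return lines


distilled = distill(page_program)
for line in distilled:
    print(line)
```

The output has no loop, no subroutine and no variables, only literal numbers, and it grows with the number of passes through the loop, which is the size-for-speed trade described in the story.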

The result was a 1040 form that could be used very effectively to demo the capabilities of PostScript and laser printers. But a side benefit of the work was this PostScript program, which John called a “distiller”, that you could load in front of any print job and it would convert it into a “distilled version” and send it back to the computer. The derived or distilled PostScript would produce the same results but didn’t use any of PostScript’s (slower) programming constructs or variables.

Glenn Reid picked up John’s program and fleshed it out to be more complete and to make it more widely available for people to experiment with. This exercise proved two things: that you could automatically “distill” PostScript programs into a more primitive form that didn’t use variables or subroutines, and that the resulting simple PostScript was significantly faster to process and print. The distilled output might be larger, but that was a less serious problem, one that could be solved by communicating with the printer more quickly.

If you are following this at all, by now you should realize that this distiller made by John and Glenn went on to become the Adobe Acrobat Distiller and the ideas of simplifying PostScript to remove variables and subroutines led to those features being a mainstay of PDF.  Doug Brotz next wrote the actual first version of what is now Acrobat Distiller by taking the Adobe PostScript interpreter which was written in C and modifying it to do the distiller function. Having that, you didn’t need a PostScript printer to run the PostScript version of the distiller, as John and Glenn had done. Interestingly, Doug did his work on a NeXT machine. That machine’s imaging model was based upon Display PostScript. It all swirls around!

Contact me at: jking@adobe.com

1:31 PM