" /> Inside PDF: December 2007 Archives

December 15, 2007

ZIP Archives and Portable Directories

This is a topic that is dear to my heart and I would love to spur some interest in creating an open source project or something like that. Since about 1999 I have been talking to my colleagues about a concept that I call "portable directories."  It is a simple idea once you "get it." 

File systems, organized around the notion of directories or folders in which to collect files and other directories, have been the staple for how we save computer material on our hard drives, data CDs and DVDs, etc. I suppose the idea grew out of an analogy with a file cabinet, but on the computer we can nest folders inside folders to any depth, something hard to do with real physical folders.

We have also used another concept that is almost the opposite: we have used a single file to hold all of the varied content needed, for example, to hold a presentation document like PDF, DOC, PPT or AI. Each software application goes to some trouble to figure out how to stuff a wide variety of material into a single file so that users can have self-contained documents. Some of these file formats that are "native" to particular applications are sophisticated databases and support random access within the file so the whole file does not have to be digested at once.

I have long believed that a single file would not be the first choice of a product designer, but we are forced to that because of the extreme ease with which our product's users can then deal with documents. I have long believed that developers would have naturally used a directory structure to save the varied parts of a complex application file were it not for that.

So the simple notion of a "portable" directory is to have our cake and eat it too. What if I were to give you a software library that implemented a complete file system using a single file as its storage area instead of the underlying hard disk? Just one file. Then the developer could use that library and work in a normal directory structure and yet the user would only see one file. That's it!

Now this is not a new idea and it isn't so original with me. The NeXT machine accomplished the same idea but rather the other way around. It had a binary bit that you could set in a directory that basically said not to expand the contents when displaying this to the user but treat it as a simple file. This is a great way to get the portable directory idea, except you have to convince each operating system to support it. So I favor the simpler approach of having single files, which all OS's support, and which are easy to transport and manage.  We just invent a directory structure inside that single file using a portable directory software library.

If you consider the operations that you need from a basic file system, they are naming files, maintaining a file/directory tree structure, creating files and directories, removing or erasing files or directories, and managing the available storage space so that it is reused when it is freed and no two things ever occupy the same space. There are some other things like saving dates, times and size values. And designing reliable and efficient storage management is non-trivial, but it is something the industry has been doing for about 40 years so there is a lot known about what works well and why.
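
To make the storage management part concrete, here is a minimal sketch of a first-fit allocator that hands out byte ranges inside one container file, reuses freed ranges, and never lets two ranges overlap. The class name `ExtentAllocator` and its methods are hypothetical, invented for this illustration; they do not come from any existing portable directory library, and a real design would also have to persist the free list and cope with fragmentation.

```python
class ExtentAllocator:
    """Hypothetical first-fit allocator for byte ranges in one container file."""

    def __init__(self):
        self.end = 0     # logical end of the container file
        self.free = []   # sorted list of (offset, length) holes left by erased files

    def allocate(self, length):
        """Return the offset of a range of `length` bytes, reusing a hole if one fits."""
        if length <= 0:
            raise ValueError("length must be positive")
        for i, (offset, size) in enumerate(self.free):
            if size >= length:
                if size == length:
                    del self.free[i]                                # hole used up exactly
                else:
                    self.free[i] = (offset + length, size - length)  # shrink the hole
                return offset
        offset = self.end        # no hole is big enough: grow the file
        self.end += length
        return offset

    def release(self, offset, length):
        """Give a range back and merge it with any adjacent holes."""
        self.free.append((offset, length))
        self.free.sort()
        merged = []
        for off, size in self.free:
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + size)
            else:
                merged.append((off, size))
        self.free = merged


allocator = ExtentAllocator()
first = allocator.allocate(100)      # a 100-byte sub-file
second = allocator.allocate(50)      # a 50-byte sub-file right after it
allocator.release(second, 50)        # erase the second sub-file
reused = allocator.allocate(30)      # a new sub-file lands in the freed space
print(first, second, reused, allocator.free)
```

First-fit is chosen here only because it is short; the point is that freed space goes back into circulation, which is exactly what a plain ZIP archive does not do.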

Microsoft has had several file format definitions that satisfy the portable directory metaphor like OLE Structured Storage and CAB files. But for some reason they have not caught on as I would have imagined they should have. In fact, there is a company that has been selling a portable directory library for years. Of course there are as many ways to implement a portable directory scheme as there are ways to implement a file system so if this idea is ever to be used widely it will have to be a standard, an open standard.

Now what about ZIP archives? Several new file format designs have, in fact, used ZIP archives as a poor man's portable directory. ZIP implements a rudimentary file system, one where storage management is about as primitive as you can imagine. It just lays each sub-file out in one contiguous hunk within the containing single file and it lays each sub-file in succession one after the other. If you want to replace a sub-file you can overwrite the existing one provided the new sub-file is no bigger than the original or you can just add to the end of the ZIP file and rename the old file to some garbage name (or better, just remove it from the ZIP directory). You can erase a sub-file by removing its name from the directory. Yes, you then need to write a new directory but that, too, can be just added to the end of the file.
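
The sequential layout is easy to observe with Python's standard `zipfile` module. The sketch below builds a small uncompressed archive in memory and lists where each sub-file's local header starts. The helper names `build_archive` and `member_layout` and the sample member names are made up for this illustration.

```python
import io
import zipfile


def build_archive(members):
    """Return the bytes of an uncompressed ZIP holding the given {name: data} members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def member_layout(archive_bytes):
    """List (name, header_offset, stored_size) for each sub-file, in directory order."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [
            (info.filename, info.header_offset, info.compress_size)
            for info in archive.infolist()
        ]


sample = build_archive({
    "content/page1.xml": b"<page>one</page>",
    "content/page2.xml": b"<page>two</page>",
    "meta/title.txt": b"Portable directories",
})
for name, offset, size in member_layout(sample):
    # Each sub-file starts after the previous one ends; nothing is interleaved.
    print(f"{name:20} header at byte {offset:4}, {size} bytes of data")
```

Each offset is larger than the one before it, and the directory that names the sub-files sits after the last of them.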

So you can see that a ZIP archive makes a not so bad read-only file system but it is dead poor for a read/write system.

So why are OOXML, ODF, Mars, AIR and many others using ZIP archive files when they could be using some superior portable directory implementation? As far as Adobe is concerned it all comes down to the standards question. ZIP archives are a well established and rather stable de facto standard that is widely supported and has been around for ages. No risk of trying to invent something new and to get it standardized. But it is a dreadfully poor design to use for a compound document architecture where you may want to update some small percentage of the content and do it repeatedly without the need to rewrite the whole ZIP archive each time.
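
The cost of repeated small updates can be shown with a short experiment, again using Python's standard `zipfile` module. It is only an approximation of the append scheme described earlier: in append mode `zipfile` leaves the old directory entry in place (and warns about the duplicate name) instead of removing it, and on reading it resolves the name to the most recently written copy. The helper `update_member` and the member names are hypothetical.

```python
import io
import warnings
import zipfile


def update_member(buffer, name, data):
    """Append a new copy of `name`; the bytes of earlier copies stay in the file."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)  # zipfile warns on duplicate names
        with zipfile.ZipFile(buffer, "a", compression=zipfile.ZIP_STORED) as archive:
            archive.writestr(name, data)


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("content/page1.xml", b"x" * 1000)   # large, never changes
    archive.writestr("meta/title.txt", b"Draft 0")       # small, changes often

sizes = [len(buffer.getvalue())]
for revision in range(1, 4):
    update_member(buffer, "meta/title.txt", f"Draft {revision}".encode())
    sizes.append(len(buffer.getvalue()))

with zipfile.ZipFile(buffer) as archive:
    latest = archive.read("meta/title.txt")
    copies = sum(1 for info in archive.infolist() if info.filename == "meta/title.txt")

print("archive size after each update:", sizes)
print("current title:", latest, "| copies stored in the file:", copies)
```

Every update makes the file longer and leaves dead bytes behind. Reclaiming that space means copying the live sub-files into a fresh archive, which is the whole-file rewrite a compound document format would like to avoid.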

So you might ask why Jim King is just writing about this when he has been talking to his colleagues about it for over 8 years? Why hasn't he made it happen? I have no excuse. I have had many more compelling things to champion, and the portable directory evangelism has never quite risen to the top of my list. But I did hire a summer intern, Deepa Tuteja, in 2003 and she did create a prototype portable directory library. But after developing the basic prototype we made one strategic mistake. Just as an exercise to be able to compare ZIP archives with our work we laid our portable directory structure on top of a ZIP archive, making use of temporary files to hold sub-files until the master file is closed. It is an optimization challenge to figure out the most efficient way to support the portable directory interface with the primitive ZIP underpinnings. But when it came time to sell the ideas and the prototype to some of our product people they chose the less risky (standards wise) ZIP implementation. I have yet to get them off of that.

Now when I get PDF handed off to ISO, I might just begin to ... . Well then I have to work on doing the next ISO 32000 version but when that work quiets down I just might begin to ... .

I hope you all are not like me.

Contact me at: jking@adobe.com

 


December 10, 2007

Comments on Comments

After my last blog announcing the positive vote for PDF to complete the Draft International Standard (DIS) phase of becoming an ISO standard, I have gotten a lot of comments. Quite a few were comments about the ballot comments so I thought I should comment.

The 5 month ballot that just completed allowed for yes, no, or abstain votes to be accompanied by comments or editing suggestions to make the specification better or to fix errors. A positive vote implies that the standard is fine but may also include comments about minor edits that do not substantially change the specification.  Any negative ballot must be accompanied by at least one significant technical objection and those have to be addressed one way or another before the DIS can be published as the final ISO 32000-1 standard.

Since the announcement, a large number of comments to my blog asked one or both of these questions:  why did the French vote negatively, and can you point me to a place where I can find all the 205 comments that were in the ballots.  (See below.)

The answer to the second question is, sorry but the comments will be restricted to committee members since they are considered part of the committee discussions. If you have a burning desire to read them you will have to join your country's standards organization that is responsible for ISO TC 171 SC2.

As to the first question about the negative vote by France, I have read their comments and can only give you some general information.

I can tell you that the French committee has members that know a great deal about digital signatures and spent considerable time reading the section of the ISO 32000 DIS document that covers PDF digital signatures. And I confess that the section of the ISO 32000 DIS on digital signatures is one of the hardest to read. And I think the more you know about digital signatures the more you may miss some of the peculiarities of how PDF does it and object to them.

This fast track process to move PDF 1.7 from Adobe ownership to that of ISO is not the normal hammering out of the technical details for a new standard that many committee members are accustomed to. The current objective is to move PDF 1.7 to ISO management unchanged. Some experts feel that some of the choices that have been made in PDF are wrong and should be changed or improved, especially when it comes to digital signatures. OK. But we cannot change it as we move it or we run the risk of making it no longer apply to those billions of PDF files currently in existence. After ISO owns it, we can carefully work out backward compatible changes to fix things people feel need fixing. That will be ISO 32000-2, the second ISO release.

I have nothing but warm feelings toward the French, including the people on this committee who voted no. I am delighted that they spent that much time looking over the specification and got interested enough to make constructive criticisms. They are going to make great partners on the ISO committees which will make PDF better and better over time.  The single negative vote is not a big picture item and no one should carry any negative feelings toward this committee for doing the job as they saw it. This is how this process is supposed to work.

Here is a sampling of the comments I got about these two subjects:

  • Is there a link for the comments on the ballot? What key objections or concerns were raised by France that prevented an unanimous ballot? – Andrew Mossberg
  • I would like to read the negative comments, that would be more interesting than the news itself.. =(  – bob
  • Hey, is the list of comments somewhere on the net? Would love to have a look at it. – Evgeni
  • What are the main reasons why france voted no? Would be interesting to know your analysis on that. thanks. – Sid
  • Why did France object?  – Al
  • Why France has voted negative? I'm french, and I don't agree with the french vote.  – Frenchy
  • Does anyone know which issues made the French vote "no" ???  – Paganel
  • Are those comments public? If yes, where can they be found?  – Alex
  • Is there someplace where we can read all the comments?  –  Lemi4 aka. fERDI:)
  • Congrats on the achievement, and PDF's formalization as an ISO standard. I'd be fascinated to read the comments made, so are these archived anywhere?  I'd be even more interested to hear the rationale behind France's "no" vote.  –  Ian Farquhar
  • Is there anywhere where we can see the comments?  I'm curious to know what France objected to!  –  Hydrargyrum
  • Do you know why France voted against it?  –  Nick

Notice that they are all nicely worded, factual and professional.  No crap. Thanks go to my readers.

My announcement blog also made it to Slashdot where it accumulated over 300 comments. (This is the first time my tech-heavy blog has hit something like Slashdot. I think my colleague Duane Nickull is responsible.)

Contact me at: jking@adobe.com

December 04, 2007

ISO Ballot for PDF 1.7 Passed!

Adobe has received word that the Ballot for approval of PDF 1.7 to become the ISO 32000 Standard (DIS) has passed by a vote of 13:1.

Countries voting positive with no comments:  Australia, Bulgaria, China, Japan, Poland, South Africa, Spain, Sweden, Ukraine. (9)
Countries voting positive with comments: UK (13), USA (125), Germany (11), Switzerland (19).  (4)
Countries voting negative with comments: France (37). (1)   
Countries abstaining: Russia (1)
Italy sent comments but is not a voting (P) member.

Total votes 14.  

13 Positive is 93% (must be > 66.6%)  1 Negative is 7% (must be < 25%).  Clear winner!
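
For anyone who wants to check the arithmetic, the tally above can be reproduced in a few lines. This sketch takes the thresholds exactly as quoted in this post, assuming the "66.6%" means two thirds, and follows the totals above in leaving the abstention out of the votes cast. The variable names are made up for the illustration.

```python
# Comment counts per voting country, copied from the lists above.
positive = {
    "Australia": 0, "Bulgaria": 0, "China": 0, "Japan": 0, "Poland": 0,
    "South Africa": 0, "Spain": 0, "Sweden": 0, "Ukraine": 0,
    "UK": 13, "USA": 125, "Germany": 11, "Switzerland": 19,
}
negative = {"France": 37}
abstained = ["Russia"]  # not counted among the votes cast

votes_cast = len(positive) + len(negative)
positive_share = len(positive) / votes_cast
negative_share = len(negative) / votes_cast
total_comments = sum(positive.values()) + sum(negative.values())

# Thresholds as stated in this post: more than two thirds positive,
# fewer than one quarter negative.
passed = positive_share > 2 / 3 and negative_share < 0.25

print(f"votes cast: {votes_cast}")
print(f"positive: {positive_share:.1%}, negative: {negative_share:.1%}")
print(f"comments to resolve: {total_comments}, passed: {passed}")
```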

Total comments (205).

Five countries added comments to their ballots for a total of 205 that will have to be resolved.

I have been nominated by the US Committee to be the technical editor so for the meeting of the International working group on ISO 32000 on January 21-23, 2008 I will come prepared with responses to all of the 205 comments. If the group can address all the comments to the satisfaction of all countries, especially the ones voting negatively, it is possible to finish at that meeting and publish the revised document. If the resolution is more complicated then we will enter a 2 month FDIS vote. The FDIS votes are not accompanied by comments so if we get no more negative votes at that time the revised document will be the one published as ISO 32000.

It may seem strange that the sponsoring country (US) is the one with the most comments (125) but I think that is a reflection of two things: the US committee contains a lot of knowledgeable people including several from Adobe, and we honestly found some mistakes that we felt must be corrected. To me this reflects the honesty with which this group has approached this whole effort. We could have held back to reduce the number but that is not the way this whole effort has been conducted and we are not about to start with any trickery.

The challenging part will be to get people to participate in the next release of the standard. Lots of people want standards but it takes a measurable resource commitment to participate.

Contact me at: jking@adobe.com

Bits and Pieces

I think I have left a few loose ends in some of my previous blogs so this one is just to rehash and reemphasize some points I might not have covered or not covered well.

Archiving Documents

In my previous blog on this subject I may not have made it clear that I do not believe that PDF/A or PDF files are the total answer to archiving documents. There is a wide spectrum of needs when archiving that depends upon the anticipated uses for the documents in the distant future. What I did say and do believe, is that a vast number of archiving needs are of the ePaper variety, where I just need to save a birth certificate or a wedding license or a record of a business transaction. I do not see any need in those cases for saving an editable form. In fact, I want to save those documents in a form where I can lock down the appearance and be assured that in 50 years a person will be able to see exactly what I see today. PDF/A is great for that.

CDF

I wrote about CDF because in the first few weeks of November it started popping up in the news, primarily because the Open Document Foundation had announced frustration with the direction that ODF was taking and was switching their support to CDF. I found that rather puzzling since CDF and ODF are such different things. At least I think so. So I decided to make clear in my CDF blog what CDF is and what it isn't. I personally do not think it is a very good horse to bet upon for an ODF replacement unless it takes a rather different direction.

I am a little slow on this gossipy news but in the first few weeks of November a rather strange sequence of events was reported. First Sam Hiser (I guess with Gary Edwards and Marbux, the reported leaders of the Open Document Foundation) announced their switch of allegiance to CDF after having formed the organization to promote ODF. Then the most startling thing was that shortly after that announcement the Open Document Foundation closed down. There are some blogs about this here and here.

I am still puzzled over how shallow some people who write with great authority can be, and amazed at how they can collectively make a widespread and continuing story out of so little. They just fill in between the lines with great imagination. There were forecasts of the demise of ODF just because Hiser claimed their vital support for ODF was moving to CDF. I guess the switch in letters of the alphabet from "O" to "C" had press appeal. In fact, it seems that the ODF Foundation's vital support for ODF has diminished greatly over the years and at this point is rather immaterial.

Document Formats

Another thing that I have blogged about previously and still bugs the @#$% out of me is, again, the technical shallowness of some of the people that write about these things. In fact, I have found a lot of this same technical shallowness in some of the feedback we have gotten about PDF and standards. Come on!  You have to be a little responsible and learn a little bit about the technology, don't ya!

I have been repeatedly asked if ODF will wipe out PDF. I guess it is a reasonable question, but I suppose I know too much for it to sound reasonable to me. Using some old-fashioned terminology, ODF is a word processor format used to save a document while you go to lunch. PDF is a communications format used to send information between people. Yes I know that my characterization of ODF is rather shallow itself, but really folks, that is where these formats like .doc, ODF and OOXML have come from. They may have tried to leave the old neighborhood, just like PDF has been trying to leave the old neighborhood of a captured version of what you would print, but those roots show through. Note in the diagram below that PDF and the other document formats live in different places.

I guess this all stems from people's desire to read/write something interesting, something dicey. The controversy surrounding Microsoft and OOXML versus ODF is just such a thing. I might dismiss it as being a rather stupid argument except that it is at the root of a potentially huge financial loss for Microsoft. So it is big business.

Anyway, from my technical viewpoint (ODF and OOXML), PDF and CDF are three different things. I do lump OOXML and ODF together although there are important technical differences between them as well. But they both do belong at the top of my diagram and neither is very well suited to be a communications document. Frankly, I haven't figured out where CDF fits in the diagram.

Document Format Profiles

I have written that document format subsets or profiles are a bad thing. Yet I also think that the PDF subset PDF/A is a good thing. How to reconcile these apparently conflicting positions? Well it took some thought, but I think I got it. Let's try to use my old colored telephone analogy. That was a what-if: suppose we had red phones and blue phones, and red phones could only connect to red phones and blue to blue. That would be terrible. You can make it worse, even, by introducing black and blue phones and then make up a bunch of nonsymmetrical rules about which color of phone can talk to which other color of phone. Communications document format subsets are like colored phones. And PDF is one of those communications document formats.

But PDF/A is OK because it is not driven by the capabilities of devices and it doesn't restrict the key attribute of PDF, namely the reliable display of the document's content. What it does do is restrict the file format for functional reasons. The files should be such that the colors, the content and the ... cannot age or change from device to device or over time.

Let's see if we can fit it into my phone analogy. What if I said that for certain conversations you were not allowed to use foul language. That would not be the same as blue phones versus red phones.  That is the kind of restriction that PDF/A imposes. It isn't a restriction that inhibits the communication of information, it is a restriction on the kind of information that can be exchanged.  It is OK in my communications metaphor. I can still call and talk to someone with a different color phone, I just might have to watch my language.

I hope you got this because it is really important that we have support for PDF/A and understand its role in the bigger PDF picture.

Contact me at: jking@adobe.com