ZIP Archives and Portable Directories

This is a topic that is dear to my heart and I would love to spur some interest in creating an open source project or something like that. Since about 1999 I have been talking to my colleagues about a concept that I call "portable directories."  It is a simple idea once you "get it." 

File systems, organized around the notion of directories or folders in which to collect files and other directories, have been the staple for how we save computer material on our hard drives, data CDs and DVDs, etc. I suppose the idea was invented by analogy with a file cabinet, but on the computer we can nest folders inside folders to any depth, something hard to do with real physical folders.

We have also used another concept that is almost the opposite: a single file that holds all of the varied content needed, for example, for a presentation document like PDF, DOC, PPT or AI. Each software application goes to some trouble to figure out how to stuff a wide variety of material into a single file so that users can have self-contained documents. Some of these file formats that are "native" to particular applications are sophisticated databases and support random access within the file so the whole file does not have to be digested at once.

I have long believed that a single file would not be the first choice of a product designer, but we are forced to that because of the extreme ease with which our product's users can then deal with documents. I have long believed that developers would have naturally used a directory structure to save the varied parts of a complex application file were it not for that.

So the simple notion of a "portable" directory is to have our cake and eat it too. What if I were to give you a software library that implemented a complete file system using a single file as its storage area instead of the underlying hard disk?  Just one file. Then the developer could use that library and work in a normal directory structure and yet the user would only see one file.  That's it!

Now this is not a new idea, and it is not original with me. The NeXT machine accomplished the same thing, but the other way around. It had a bit that you could set on a directory that basically said not to expand the contents when displaying it to the user but to treat it as a simple file. This is a great way to get the portable directory idea, except you have to convince each operating system to support it. So I favor the simpler approach of having single files, which all OSes support, and which are easy to transport and manage.  We just invent a directory structure inside that single file using a portable directory software library.

If you consider the operations that you need from a basic file system, they are: naming files, maintaining a file/directory tree structure, creating files and directories, removing or erasing files or directories, and managing the available storage space so that it is reused when it is freed and no two things ever occupy the same space. There are some other things like saving dates, times and size values.  Designing reliable and efficient storage management is non-trivial, but it is something the industry has been doing for about 40 years, so there is a lot known about what works well and why.
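
As a rough illustration of the storage-management piece only, here is a minimal Python sketch of a first-fit allocator that hands out byte ranges inside a single container file, reuses freed ranges, and never gives the same bytes to two sub-files. The class name `ExtentAllocator` and its methods are hypothetical, invented for this sketch; it is not taken from any existing portable directory library, and a real design would also need to persist this bookkeeping inside the file.

```python
class ExtentAllocator:
    """Hypothetical first-fit allocator over the bytes of one container file."""

    def __init__(self):
        self.free = []  # sorted list of (offset, length) holes available for reuse
        self.end = 0    # current logical end of the container file

    def allocate(self, length):
        """Return the offset of a region of `length` bytes, reusing a hole if one fits."""
        for i, (offset, size) in enumerate(self.free):
            if size >= length:
                if size == length:
                    del self.free[i]  # hole is used up exactly
                else:
                    # Keep the unused tail of the hole on the free list.
                    self.free[i] = (offset + length, size - length)
                return offset
        # No hole is big enough: grow the container file instead.
        offset = self.end
        self.end += length
        return offset

    def release(self, offset, length):
        """Give a region back and merge it with any adjacent holes."""
        self.free.append((offset, length))
        self.free.sort()
        merged = []
        for off, size in self.free:
            if merged and merged[-1][0] + merged[-1][1] == off:
                # This hole starts where the previous one ends: coalesce them.
                merged[-1] = (merged[-1][0], merged[-1][1] + size)
            else:
                merged.append((off, size))
        self.free = merged
```

With this sketch, allocating three regions, releasing the middle one, and then allocating something smaller places the new region in the freed hole rather than at the end of the file, which is exactly the reuse that a plain ZIP archive does not attempt.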

Microsoft has had several file format definitions that satisfy the portable directory metaphor like OLE Structured Storage and CAB files. But for some reason they have not caught on as I would have imagined they should have. In fact, there is a company that has been selling a portable directory library for years. Of course there are as many ways to implement a portable directory scheme as there are ways to implement a file system so if this idea is ever to be used widely it will have to be a standard, an open standard.

Now what about ZIP archives? Several new file format designs have, in fact, used ZIP archives as a poor man's portable directory. ZIP implements a rudimentary file system, one where storage management is about as primitive as you can imagine. It just lays each sub-file out in one contiguous hunk within the containing single file, and it lays the sub-files in succession one after the other. If you want to replace a sub-file you can overwrite the existing one, provided the new sub-file is no bigger than the original, or you can just add to the end of the ZIP file and rename the old file to some garbage name (or better, just remove it from the ZIP directory).  You can erase a sub-file by removing its name from the directory. Yes, you then need to write a new directory but that, too, can be just added to the end of the file.
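
The layout described above can be observed with Python's standard `zipfile` module. The following is a small sketch, not a statement about how any particular product writes its files: the helper names (`build_archive`, `member_layout`, `append_member`) and the member names are made up for illustration. It records where each sub-file's header starts, appends one more sub-file, and records the layout again.

```python
import io
import zipfile


def build_archive() -> io.BytesIO:
    """Create a small in-memory ZIP archive with two stored (uncompressed) members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("content.xml", b"<doc/>")
        zf.writestr("images/logo.bin", b"\x00" * 32)
    return buf


def member_layout(buf: io.BytesIO):
    """Return (name, offset of local header, uncompressed size) for each member."""
    with zipfile.ZipFile(buf) as zf:
        return [(i.filename, i.header_offset, i.file_size) for i in zf.infolist()]


def append_member(buf: io.BytesIO, name: str, data: bytes) -> None:
    """Add one member in append mode; existing members are left where they are."""
    with zipfile.ZipFile(buf, "a", zipfile.ZIP_STORED) as zf:
        zf.writestr(name, data)


buf = build_archive()
before = member_layout(buf)
append_member(buf, "meta.txt", b"title=example")
after = member_layout(buf)

for name, offset, size in after:
    print(f"{name}: header at byte {offset}, {size} bytes of data")
```

The existing members keep their offsets, and the new member lands after them, followed by a freshly written directory at the end of the file.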

So you can see that a ZIP archive makes a not so bad read-only file system but it is dead poor for a read/write system.

So why are OOXML, ODF, Mars, AIR and many others using ZIP archive files when they could be using some superior portable directory implementation?  As far as Adobe is concerned it all comes down to the standards question. ZIP archives are a well established and rather stable de facto standard that is widely supported and has been around for ages. There is no risk of trying to invent something new and then get it standardized. But it is a dreadfully poor design to use for a compound document architecture where you may want to update some small percentage of the content, and do it repeatedly, without the need to rewrite the whole ZIP archive each time.
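
To make the update cost concrete, here is a hedged Python sketch of the usual workaround when using the standard `zipfile` module, which, as far as I know, offers no call to replace or delete a member in place: every member is copied into a fresh archive, with one of them substituted. The function names (`make_archive`, `read_all`, `replace_member`) are hypothetical helpers written for this example.

```python
import io
import zipfile


def make_archive(files: dict[str, bytes]) -> bytes:
    """Build a ZIP archive in memory from a name -> content mapping."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return out.getvalue()


def read_all(archive: bytes) -> dict[str, bytes]:
    """Read every member of an in-memory ZIP archive back into a mapping."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def replace_member(archive: bytes, name: str, new_data: bytes) -> bytes:
    """Return a new archive in which `name` holds `new_data`.

    Note the cost: every other member is read and written again,
    even though only one sub-file changed.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            data = new_data if info.filename == name else src.read(info.filename)
            dst.writestr(info.filename, data)
    return out.getvalue()


original = make_archive({"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"})
updated = replace_member(original, "b.txt", b"a much longer replacement")
```

For a three-member archive this is harmless, but for a document with many large sub-files the work is proportional to the size of the whole archive rather than to the size of the change.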

So you might ask why Jim King is just writing about this when he has been talking to his colleagues about it for over 8 years? Why hasn't he made it happen? I have no excuse. I have had many more compelling things to champion, and the portable directory evangelism has never quite risen to the top of my list. But I did hire a summer intern, Deepa Tuteja, in 2003 and she did create a prototype portable directory library. After developing the basic prototype, though, we made one strategic mistake. Just as an exercise, to be able to compare ZIP archives with our work, we laid our portable directory structure on top of a ZIP archive, making use of temporary files to hold sub-files until the master file is closed. It is an optimization challenge to figure out the most efficient way to support the portable directory interface with the primitive ZIP underpinnings. But when it came time to sell the ideas and the prototype to some of our product people they chose the less risky (standards-wise) ZIP implementation. I have yet to get them off of that.

Now when I get PDF handed off to ISO, I might just begin to ... . Well then I have to work on doing the next ISO 32000 version but when that work quiets down I just might begin to ... .

I hope you all are not like me.

Contact me at: jking@adobe.com

 


5 Comments

You still see the NeXT directory masquerading as a file in Mac OS X. Just look into any *.app directory, or at iWork files, etc. The downside is that it is difficult to send iWork files or apps through email because one has to zip them up first, and most users get upset, do not know they have to do it, and just despair.

Not really a surprising fact, since Jobs founded NeXT after he was booted out of Apple Computer. No surprise there.

This idea made its appearance in 1992-1993 as Apple's "Bento" Specification, by Jed Harris and Ira Ruben. It predates the WWW, so you don't find many traces of it by a simple web search, but a Google Scholar search for "Apple Bento" turns up useful information, including Bento Specification, Revision 1.0d5, July 15, 1993.

From the Bento spec's introduction: 'Right now, applications typically have no way to exchange multiple content elements, unless they have a “private contract” about the format they will use. Furthermore, one application typically has no way to find the content elements in another application’s document, so typically it cannot let a user copy the these content elements and reuse them. Finally, every application developer who wants to store multiple content elements in a document typically has to invent her own object storage mechanism. These are the problems that Bento addresses. Bento provides a mechanism for storing content elements as objects. It defines a standard format for storing multiple different types of objects, and an API for writing out objects and reading them back in.'

I wasn't close to the Bento story, and I don't know what became of it. I do know that we're not all routinely using Bento storage mechanisms now, so it seems like it failed to get traction somehow.

[Jim: Bento was going way beyond what I talk about in this article. They were trying to find a universal format for documents that all different applications could work out of. My comments are more mechanical, just talking about how to store multiple files in a single file in a useful way beyond just ZIP archives. Bento had a product name which I just cannot bring to mind. Microsoft's Object Linking and Embedding (OLE) has had much more success and is very similar. -- Jim King]

I remember Microsoft had something like this in 1993. It was called an "OLE 2.0 structured file". MSDN still has an article about it: Herman Rodent, "OLE for Idiots: B Is for Beginning", June 22, 1993

Hi Jim, it's good to see you're still fighting the good fight. When id Software released the source to Quake 3 I thought it would make a good college course. So my students and I studied it as an example of a large body of real live code (we could also have used Firefox, Apache, etc. but when your point changes in class make missiles bounce off walls, well, it's hard to compete). id bundles all of their resources in "pack" files, extension pk3. Levels, sounds, textures, models, etc. We were surprised (and I was pleased) to learn that they're just ZIP files, and if you change the extension you can browse them with any ZIP program. This led to an excellent discussion about distributing resources, standards, open specs and libraries, and about conserving your engineering cycles for what you're really trying to make.

ZIP is great for (mostly) read-only distribution. I'll be first in line to use your portable directories.
[Joe Holt! Wow! I haven't talked to you in a very long time. Hope everything is well. Thanks for the comment. -- Jim King]

Snap..

I've been struggling to find a way to store data from my software, and I came to exactly the same conclusion as you.

Files from my software can be anything from a few KB to many gigabytes. The problem is, it's easily possible to have flat files larger than a PC's memory can handle. Therefore loading the entire file into memory (required in the current implementation for playback of the file's contents) can sometimes be impossible.

Also, using a flat file, I absolutely need to know how many items of information the file will contain before it is written.

This is a nasty situation, and required the writing of a tool to repair files which didn't have the right amount of information (as happens when the user quits the program too soon).

I came to the conclusion this morning that what I need is a file format that contains a directory structure. In my case one that can hold time series results from an nBody model, and allow for additional information to be stored, according to the needs of a particular experiment.

My current solution is to initially write the file to a directory structure, then read this directory structure into a flat file. That solves one problem, needing to know file size in advance, but not the loading back into memory problem.

It's a tricky issue, one that stretches my meager knowledge. I shall keep an eye out and see if you get anywhere.

[PDF files are also random access containers and have quite a lot in common with ZIP and portable directories. So Adobe has solved the problems you note for documents using PDF. A PDF file does not have to be read all at once. In fact, you can read page 329 without reading any of the preceding pages. With care, PDF files can be written in one pass since the index is stored at the end of the file. -- Jim King]

About this Entry

This page contains a single entry by James C. King published on December 15, 2007 8:05 PM.
