July 29, 2008

Text Content in PDF Files

This is a post I promised to write months ago, and I have finally bitten the bullet. Extracting text from PDF documents is a difficult and highly technical task, and I hope to explain here why that is the case.

The Role of PDF

First, I want to position PDF properly for this discussion.  A PDF file represents an ordered sequence of fixed pages. The planned appearance of everything that each page contains is completely specified, down to the smallest detail. All the graphics, images, and text are specified to appear at precise spots on the page, in a particular color, of a given and fixed size, etc.

In this way, the creator of the PDF file gets to determine the exact appearance that the viewer will see. Renderers of PDF (like Adobe Reader or Adobe PDF Print Engine) are required to display the page contents exactly as specified by the creator of the PDF file. The Renderer is a slave to the authoring tool. In fact, I like to think of the PDF specification as a contract between the software that creates the PDF file and the renderer that displays it. They must agree on what the positioning, size, color, etc. noted in the PDF file precisely means.

Text in PDF Page Contents

So now let's turn our attention to text content of PDF pages. Text is rendered for display or printing by using specific graphic designs captured in fonts and each individual graphic is called a glyph. The lower case letter "a" in the Latin alphabet may have hundreds of different glyph appearances such as those shown here:


These all serve the purpose of being a lower case Latin "a" but have distinctly different glyph appearances. Notice that they also have different widths and need different spacing to display properly.

PDF specifies text content of pages as glyphs not characters. That is, one of the appearances for an "a" is chosen by the creator of the PDF file by choosing a font from which the "a" glyph can be taken. PDF page contents do not specify characters such as just the Latin letter "a".  

The rub comes when we want to work with characters not glyphs. Unicode is widely used because it is a character encoding technology not a glyph encoding one. In fact, for many purposes, such as searching for text strings, we do not want to search by appearance but we want to search by the Latin letters (or commonly by the Unicode encoding of characters).

This issue of how PDF pages represent text is further complicated by three more facts.

1) The representation of PDF page contents is tightly specified so that the creator of the PDF file can determine the exact position on the page of every glyph chosen. But this is usually not done by actually specifying a position for each glyph individually. It is expected and required that the creator of the PDF file have access to the same font as the renderer that processes the PDF file, including the width and spacing information for each glyph supplied in the font file. For a string of text, if we agree to strictly space the glyphs, one from the other, using the spacing information given in the font then all we need to specify in the PDF file is the starting position for the first glyph in the string and the positions of all the rest are determined by that. This saves space and is a common and easy convention to work with.

There is no strict control over font files and so to make sure that the creator of the PDF file and the renderer of the PDF file use the same information, it is wise to embed the font that the creator used, with its important spacing information, in the PDF file itself. When that cannot be done, the width and spacing information for the font glyphs used is placed in the PDF file so that the spacing that the creator desired can be followed even when using a substituted font or one that might be slightly different.

2) Since PDF page contents primarily specify the appearance of what is to be rendered for that page, things like line breaks, line justification and hyphenation are all done by the creator and not within a PDF renderer. As noted earlier, a PDF renderer is a total slave to the representation within the PDF file that the creator put there. This means that if the word "substantial" is broken across two lines so that "sub-" appears at the end of one line and "stantial" at the beginning of the next, then you will never find the string of glyphs for "substantial" in the file but you will find a string for "sub-" and another for "stantial".

3)  Ligatures are glyphs specially designed to enhance the appearance of certain groups of characters when they follow in sequence. For example the two "f"'s and the following "i" in the word "difficult" are considered to look better when they are rendered as one special glyph as in the second rendering shown here:


One single glyph in the right example is used to render "ffi" whereas three are used in the left example. These two words take up a different amount of space horizontally, so if material were to follow the word at a specific distance, its position would have to be adjusted. In other words, if you right justify a line of text that contains ligature glyphs, the right justification will be ruined if those ligature glyphs are replaced by the individual character glyphs and vice versa. The bottom line is that the creator of the file and the renderer of the file must agree on which ligatures will be used and which will not. By using glyph references in PDF this problem is avoided because the file creator can determine which ligatures will be used and where, by specifying the proper glyphs. The renderer in its slave role just places the selected glyphs where it should.
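The width bookkeeping in points 1 and 3 can be sketched in a few lines of Python. This is only an illustration: the advance widths below are made-up numbers in thousandths of the font size, and the glyph names (including "ffi" for the ligature) and the function place_glyphs are hypothetical. A real renderer reads the widths from the font or from the width information stored in the PDF file.

```python
# Hypothetical advance widths, in 1/1000 of the font size.
# Real values come from the font file or the PDF's width data.
WIDTHS = {
    "d": 500, "i": 278, "f": 333, "c": 444,
    "u": 500, "l": 278, "t": 278,
    "ffi": 830,  # a single ligature glyph standing in for f + f + i
}

def place_glyphs(glyphs, start_x, font_size):
    """Place each glyph after the previous one using only advance widths.

    Returns the list of (glyph, x) positions and the x where the string ends.
    """
    positions = []
    x = start_x
    for glyph in glyphs:
        positions.append((glyph, x))
        x += WIDTHS[glyph] * font_size / 1000.0
    return positions, x

plain = ["d", "i", "f", "f", "i", "c", "u", "l", "t"]
ligated = ["d", "i", "ffi", "c", "u", "l", "t"]

plain_positions, plain_end = place_glyphs(plain, 72.0, 12.0)
lig_positions, lig_end = place_glyphs(ligated, 72.0, 12.0)

print("without ligature the word ends at", round(plain_end, 3))
print("with ligature the word ends at   ", round(lig_end, 3))
```

Only the starting position is given; every other position follows from the widths. Because the two glyph sequences end at different places, anything positioned relative to the end of the word moves if a renderer swaps one sequence for the other.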

Glyph References and Fonts

Different font technologies organize the font glyphs in different ways but basically some identifier is used to select one of the glyphs to use. Several different methods are allowed in PDF in order to support the evolving font technologies but there is one fact that is rather disruptive to being able to extract text from a PDF file. In some cases, all that PDF requires is that the identifier used to select a glyph in the font be used by the file creator. No meaning of it being an "a" is required, just enough information to find the desired glyph in the font. So if we want to extract character text in, say, a Unicode representation, we really do not have enough information. We only know how to render the glyph but not what it represents.

There is an optional auxiliary structure called the "ToUnicode" table that was introduced into PDF to help with this text retrieval problem. A ToUnicode table can be associated with a font that does not normally have a way to determine the relationship between glyphs and Unicode characters (some do). The table maps strings of glyph identifiers into strings of Unicode characters, often just one to one, so that the proper character strings can be made from the glyph references in the file.
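As a rough illustration of what such a table does, here is a minimal Python sketch. It is not the actual ToUnicode format, which is stored in the PDF file as a mapping resource and has its own syntax. The glyph identifier numbers and the function glyphs_to_text are invented for the example.

```python
# Hypothetical glyph identifiers for one font. The numbers mean nothing by
# themselves; they only select a glyph to draw.
TO_UNICODE = {
    17: "d",
    42: "i",
    203: "ffi",  # one ligature glyph maps to a string of three characters
    9: "c",
    88: "u",
    51: "l",
    60: "t",
}

def glyphs_to_text(glyph_ids, table):
    """Map glyph identifiers to Unicode text.

    Glyphs with no entry become U+FFFD (the replacement character), which is
    roughly the situation of a font that has no usable mapping.
    """
    return "".join(table.get(glyph_id, "\ufffd") for glyph_id in glyph_ids)

text = glyphs_to_text([17, 42, 203, 9, 88, 51, 60], TO_UNICODE)
unknown = glyphs_to_text([17, 999], TO_UNICODE)
print(text)
```

Note that the ligature glyph is undone by the table itself: seven glyph references become nine characters.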

The process for extracting text also has to recognize and undo ligatures and remove hyphenation. There are various techniques for this and the best ones rely on additional optional information that can be put into the PDF file called "structure" information. I do not want to go into more details at this point but I think by now you should have a good idea of what it takes to extract text from a PDF and why it is so technical and messy.
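The simplest possible version of hyphenation removal can be sketched as below. This is a naive heuristic written for illustration, not the technique used by any particular product, and the function name join_lines is made up. It will wrongly merge words that are legitimately hyphenated at a line end (for example "well-" followed by "known"), which is one reason the better techniques lean on structure information.

```python
def join_lines(lines):
    """Join extracted text lines, gluing a word split by a trailing hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-"):
            # Treat the hyphen as a line-break hyphen: drop it and glue.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out

joined = join_lines(["a sub-", "stantial change"])
print(joined)
```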

Why Not Just Use Unicode?

An easy question to ask is why PDF doesn't use Unicode to represent the text in a PDF file. The simple answer is that Unicode represents characters independent of their appearance and PDF is all about appearance. But in fact, if I choose a font and devise a method for picking font glyphs given a Unicode character, I can almost get away with using Unicode characters and yet rendering glyphs. Why do I say almost?

Well the issue of ligatures makes it messy because ligatures are glyphs but generally not characters. SVG faces this problem and it partially solves it by having escapes from the Unicode into glyph identifiers when the Unicode cannot handle the cases of interest. Suppose we have the word "difficult" which can be rendered with a ligature for "ffi" or not. But the two choices do require different spacing and placement. So the creator of the PDF file has to be able to specify which ligatures are to be used and which not. Currently whether ligatures are available or not and which ones are available vary by font and even instances of the same font. So more standardization needs to be done before this approach will work.

A New Agreement

The design of PDF established a line in the process of specifying and rendering a document where all the appearance decisions are made by the creator of the PDF file. One could imagine another design where that is not the case. For example, one might leave the line justification, hyphenation, etc. to the rendering component. The difficulty with that is that there can be a relatively large variation in the results so that different renderers making small different choices may produce quite different output. It is also awkward to constrain spacing issues to one part of a document without leaving larger than desired white spaces or overwriting material or laying out the document completely anew.  PDF does hold the position that the creator gets to choose the appearance so shifting partial positioning responsibility to the renderer will be difficult to design.

Enough auxiliary information has been defined for PDF files such that a well written program can extract the text content from PDF pages for reading to the blind, searching, or to extract Unicode strings of characters. Adobe has done this and that code is available through the Adobe Developers Connection using the Acrobat PDF SDK.

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

Jim King  (contact me at:  jking@adobe.com)

July 24, 2008

I'm back

The last full entry I made to this blog was on January 28, 2008, almost 6 months ago. Even though I have been responding to comments and there is interesting information in them, it is really time that I get back to regular entries. I had a lot of distractions both professional and personal. Someone asked me if my blog was dead. My response was that it was in a coma and I would attempt to wake it up. Definitely still alive, just comatose.  So onward!


The ISO 32000-1 (PDF 1.7) specification got officially posted to the ISO website on July 2, 2008, just about 4 months after the final version was sent to them. It was all a done deal on January 24, 2008 when the committee approved the final edits to the specification. But it isn't real until it is real, and that happened on July 2.


You will find that you have to buy the specification from ISO and in the case of ISO 32000-1 the price is 370 CHF (Swiss Francs) the equivalent of about $360 US. Seems like a lot, but ISO has to finance itself somehow. With some luck, and based upon an agreement with ISO, Adobe will be posting a version of this document on our website for free but with references to the ISO copyright removed. The technical content and page numbering will be identical.

I thought it might be useful to cover in more detail some of the questions that came up as comments to my January 2008 post.

1)  Adobe's plan is to follow the ISO 32000-1 specification and not to pursue a direction for PDF competitive to the standard. ISO now owns the specification and Adobe supports that. However (isn't there always a however), Adobe products such as Acrobat and LiveCycle will be making extensions to ISO 32000-1. In fact, we have already done this for Acrobat 9.0.  Extensions, in general, are supported in PDF and ISO 32000-1 and the Acrobat team has documented the Acrobat 9.0 extensions and is submitting them to the ISO committee for consideration to be included in the next version of the standard, ISO 32000-2.  Annex E of ISO 32000-1 describes how extensions from developers are to be noted and documented. Adobe is following those rules.

PDF has been developed as versions (1.0, 1.1, ..., 1.7).  ISO 32000-1 is PDF 1.7 (same as the Adobe PDF 1.7) and the files are marked as such in the header line. Extensions to ISO 32000-1 are extensions to PDF 1.7. Since that PDF version cannot change, a different mechanism was introduced to allow developers to number their extensions and identify them with their developer prefix (Adobe’s prefix is "adbe"). So, Acrobat 9.0 supports PDF BaseVersion 1.7 (ISO 32000-1 version), (adbe) ExtensionLevel 3. The files that use new Acrobat 9 features are so marked with the new Extensions entry in the document catalog and as PDF 1.7 in the header line. Adobe's extensions are soon to be documented on our website. Other companies are free to use these Adobe extensions and they are also free to create and document their own extensions, properly marking the files with their extension numbers. Hopefully, this makes all the extensions public and useable by all.


2) A lot of people have asked if the existence of ISO 32000-1 obviates the need for PDF/A (the archive subset of PDF). The answer is a strong NO. PDF/A was created particularly for archiving documents. It insists that certain PDF features are used (like structure, which makes it easier to read PDFs to the blind and easier to extract the content into other formats) and it insists that certain other features not be used, to make sure that the document appearance is reliable (like not using JavaScripts or device dependent color spaces). The fact that the full PDF is a standard does not impact those reasons for defining PDF/A.

It could be that some organizations that must use public international standards were using PDF/A since PDF/A has been ISO 19005 since 2005. If they are using PDF/A strictly for that reason and not because of its archiving features, they may indeed want to use PDF 1.7 (ISO 32000-1) instead.

PDF/A is a strict subset of PDF 1.4 meaning that the files do not contain anything not defined for PDF 1.4.  Since PDF 1.4 is also a subset of PDF 1.7 and hence ISO 32000-1, all PDF/A files are good ISO 32000-1 files.

3) A PDF file that conforms to the ISO 32000-1 specification may contain a JPEG image, an ICC color space specification, XFA fill-out form information, an OpenType font and possibly many other kinds of material that are not directly specified in the ISO 32000-1 document (nor in the Adobe PDF 1.7 document). There are, however, references in the ISO 32000-1 document to specifications for all of those subordinate standards that are used by PDF. Some of those specifications are themselves ISO standards (e.g., JPEG 2000) while others are documented in some other way. In all cases, they are publicly specified so that developers can read and understand every byte of a conforming PDF file. This was the way Adobe managed PDF and ISO will continue to follow that path.

The way I choose to view this is that the "PDF standard" is documented in a tree of documents starting with the ISO 32000-1 document which references over 80 other specifications. Those, in turn, may make reference to yet other specifications. All of this information is needed to completely and totally understand what can be in a PDF file. So the ISO 32000-1 document and the standard, the way I define them, are different things. This came up in reference to the 3D representations now usable in PDF: U3D and PRC.  U3D is ECMA standard 363 and PRC is an Adobe open specification much like PDF used to be.  U3D is part of the PDF standard, not documented directly in the ISO 32000-1 document but included by reference. PRC is newer than the ISO 32000-1 document and so is documented and referenced, for now, as an Adobe extension.
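The version marking described in point 1 above (PDF 1.7 in the header line plus an Extensions entry in the document catalog) can be illustrated with a small Python sketch. The catalog fragment in SAMPLE is hand-written for this example, and the exact key spelling, including the upper case form of the developer prefix, is an assumption to be checked against Annex E of ISO 32000-1. The function read_version_info is hypothetical, and a regular expression is not a real PDF parser.

```python
import re

# Hand-written fragment for illustration only; key spelling is an assumption.
SAMPLE = b"""%PDF-1.7
1 0 obj
<< /Type /Catalog
   /Extensions << /ADBE << /BaseVersion /1.7 /ExtensionLevel 3 >> >>
>>
endobj
"""

def read_version_info(data):
    """Pull the header version and the first developer extension entry."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    ext = re.search(
        rb"/(\w+)\s*<<\s*/BaseVersion\s*/(\d\.\d)\s*/ExtensionLevel\s*(\d+)",
        data,
    )
    return {
        "header": header.group(1).decode() if header else None,
        "prefix": ext.group(1).decode() if ext else None,
        "base_version": ext.group(2).decode() if ext else None,
        "extension_level": int(ext.group(3)) if ext else None,
    }

info = read_version_info(SAMPLE)
print(info)
```

The point of the design is visible here: the header line still says 1.7, so the PDF version never changes, while the developer prefix and extension level carry the extra information.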

It is good to be back!

Jim King  (contact me at:  jking@adobe.com)





January 28, 2008

ISO 32000 — Document management — Portable document format — PDF 1.7

The title of this article is the official name of the ISO standard based upon PDF 1.7 which, for all purposes, now exists. The International working group meeting in Orlando, held last week (January 21-23, 2008), was able to successfully resolve all 205 comments to the 32000 Draft International Standard (DIS) document. Even the French, with their expert calling into the meeting on a conference call, have agreed to the edited comment resolutions and, as I reported earlier, changed their vote to positive. That makes it unanimous.

The remaining tasks are primarily mine and the editing staff of ISO in Switzerland.  I have two tasks to perform: 1) make a final version of the document that records the resolution to each of the comments and 2) make the final edits to the DIS that the comment resolutions dictate. I hope to finish this in a week or two and then it is up to the Swiss editors to polish it off and get it published on the ISO website as an official document. (They have a reputation for taking months to do this. Maybe we can get them into a more excited mood for this effort.)

This whole process was started less than one year ago when Adobe announced that, with the cooperation of AIIM and ANSI, we would submit PDF 1.7 to become a public ISO standard. Doing such a complicated standard in one year is extraordinary, but it worked so well for two reasons: 1) PDF was a well accepted and well documented existing de facto standard and 2) we developed some basic principles to guide the work and help with decision making.

As we began to make decisions, answer questions and move forward, it became clear to me that the standards process that Adobe had been following and the standards process that AIIM/ANSI/ISO follow are quite different. For the standards organizations the carefully written standards document is supreme. It defines the standard. While Adobe's PDF 1.7 Reference document is intended to do that same thing, it isn't quite so clear. For example, if the billions of files in existence today all contain a construct that has A=1 and the Adobe document says they should have A=2, the document must be changed. That is, the existing files trump the documentation. It would be of no value to have a specification that does not cover the existing files. So, one focus I put forth at each opportunity was that the primary objective of the new ISO PDF 1.7 standard was to document the existing files.

In fact, PDF has been a de facto standard based upon three things: the billions of existing files, the thousands of software offerings that create and process those files, and the Adobe PDF 1.7 Reference. And we decided that the order of preference to resolve any differences was in that order: files, software and then documentation. That is definitely not the standards approach where the document is supreme. So as we examined the Adobe PDF 1.7 Reference and turned it into the ISO 32000 draft we adopted what we called this "three legged stool" approach based upon prioritizing the three contributors to the existing standard. If we could capture the most correct interpretation in each area according to those criteria we could produce an ISO document that was (more) supreme and could be acceptable as the definitive word for what a PDF file contains. Of course, the Adobe PDF 1.7 Reference proved to be fine for 99.9% of the definition.

Most people, after thinking about this three legged stool metaphor came to agree that documenting the existing files should take priority.

The reason I mention this is I have repeatedly found that most work needs some guiding principles to follow. When making decisions between alternatives, especially when each is very reasonable, established project principles can usually make the decision obvious. If you do not establish such principles early, and make them well known, you find yourself reinventing them and discussing them with each decision. If you do it once at first and get agreement, then they can be applied when needed.

This proved true when the ISO 32000 draft was created and it proved true during last week's review of the ballot comments.

So except for the editing work facing me in the next week or so, I am very relaxed because this long (almost) year is about over. It has been quite an experience for me and I sure hope it has been the right action for both the industry and Adobe to have taken. As I said in an earlier blog, the big thing now is to get enough interested people involved in the ISO 32000 working group to allow the public to be as good a shepherd of PDF as Adobe has been in the past. Please get involved by contacting AIIM in the US or your national standards organization affiliated with ISO.

This is my personal blog and my view of things. But I must emphasize that this has been a team effort within Adobe and here are the names of the key players that I worked closely with: Leonard Rosenthol, Ed Taft, Nora Calvillo, Dave McAllister, Isak Tenenboym, Mike Ossesia, Kathy Stone, Sandra Lee-Doersam and Cheryl Shimamoto. Any time you make a list like this you run the risk of leaving someone off the list who should be there, so I apologize in advance. Betsy Fanning from AIIM has been a great boss as the secretariat for TC 171 and this effort. There are many more people who made contributions but these are the ones I worked most closely with on the technical aspects of moving PDF 1.7 to ISO.

Contact me at: jking@adobe.com


January 17, 2008

Now Unanimous!

This is a great day! This is about the program that Adobe, AIIM, and ANSI put together just less than one year ago to move PDF 1.7 under public control as an ISO standard. We submitted the suggestion to ISO and they agreed. A lot has happened since then and I have written several previous articles (here and here) about what was happening in this blog.

Well today the French Standards committee which was the only country committee to submit a negative vote on our recent ballot has reviewed my responses to their comments and decided that if those changes in the specification are made they will change their vote to positive. That will make it unanimous!

Let me say all this a little more carefully. In one of my previous blogs I noted that the result of the ISO Draft International Standard (DIS) ballot for PDF 1.7, due December 2, 2007, was 13 in favor and 1 opposed (the French). The ballots also had room for editing comments against the DIS and 205 came in, including a bunch from the French. Next week (Jan. 21-23, 2008) in Orlando, Florida, the International committee (TC 171/SC 2) will meet to decide which edits, if any, need to be made to the DIS document before it can be published as the official ISO 32000 standard.

After the ballot ended in December, as the acting technical project leader, I drafted responses to all the comments so we would have a starting point for the discussion next week in Orlando. The Secretariat of ISO TC 171/SC 2 sent the comments that the French made back to them, together with my recommendations, asking if they would change their vote to positive if we followed my recommendations. (Knowing the Secretariat I assume there were some very diplomatic exchanges that took place as well.) And I just now got word that they decided that the suggested resolution to their comments would be good enough. Whew!

Now we have to get everyone (14 countries) together next week and go over all the comments and make sure all the countries are comfortable with the suggested treatment or decide on a new treatment. I did recommend rejecting quite a few of the comments as being misunderstandings or out of scope, including some that the French made.

I have talked about my responses to the 205 comments. They are mine as the acting project editor but I must confess that I got a lot of help from PDF experts in drafting them especially a great team of experts within Adobe. So they are only mine in the sense of being responsible. I would single out Ed Taft and Leonard Rosenthol as having been invaluable colleagues in this work.

I am confident that the committee will be able to amiably resolve all 205 comments and we can then send the edited version off to Switzerland to become a published ISO standard. 

If the French had stuck with their negative vote, then we would have had to do pretty much the same thing, except wait for two months after the edited document was produced and sent out as a Final Draft International Standard (FDIS). Then the votes would be either thumbs up or thumbs down with no edits possible. Seems like a wasted 2 months in any case, but now we will not have to do that. Thanks to the French!

Wish me luck next week.

Contact: jking@adobe.com

 


December 15, 2007

ZIP Archives and Portable Directories

This is a topic that is dear to my heart and I would love to spur some interest in creating an open source project or something like that. Since about 1999 I have been talking to my colleagues about a concept that I call "portable directories."  It is a simple idea once you "get it." 

File systems, organized around the notion of directories or folders in which to collect files and other directories, have been the staple for how we save computer material on our hard drives, data CDs and DVDs, etc. I suppose it had its invention from an analogy with a file cabinet, but on the computer we can nest folders inside folders to any depth, something hard to do with real physical folders.

We have also used another concept that is almost the opposite: we have used a single file to hold all of the varied content needed, for example, to hold a presentation document like PDF, DOC, PPT or AI. Each software application goes to some trouble to figure out how to stuff a wide variety of material into a single file so that users can have self contained documents. Some of these file formats that are "native" to particular applications are sophisticated databases and support random access within the file so the whole file does not have to be digested at once.

I have long believed that a single file would not be the first choice of a product designer, but we are forced to that because of the extreme ease with which our product's users can then deal with documents. I have long believed that developers would have naturally used a directory structure to save the varied parts of a complex application file were it not for that.

So the simple notion of a "portable" directory is to have our cake and eat it too. What if I were to give you a software library that implemented a complete file system using a single file as its storage area instead of the underlying hard disk? Just one file. Then the developer could use that library and work in a normal directory structure and yet the user would only see one file. That's it!

Now this is not a new idea and it isn't so original with me. The NeXT machine accomplished the same idea but rather the other way around. It had a binary bit that you could set in a directory that basically said not to expand the contents when displaying this to the user but treat it as a simple file. This is a great way to get the portable directory idea, except you have to convince each operating system to support it. So I favor the simpler approach of having single files, which all OS's support, and which are easy to transport and manage.  We just invent a directory structure inside that single file using a portable directory software library.

If you consider the operations that you need from a basic file system they are naming files, maintaining a file/directory tree structure, creating files and directories, removing or erasing files or directories and managing the available storage space to reuse it when it is freed and to not have more than one thing using the same space. There are some other things like saving dates, times and size values. And designing reliable and efficient storage management is non-trivial, but it is something the industry has been doing for about 40 years so there is a lot known about what works well and why.
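To make the storage management part concrete, here is a toy Python sketch of a container that reuses freed space with a first-fit free list. It is an illustration only and not the design of any existing portable directory library. The class FreeListStore is invented for this example, the "single file" is modeled as an in-memory bytearray, there is no directory tree, and adjacent holes are never merged.

```python
class FreeListStore:
    """Toy storage manager for a single container (modeled as a bytearray)."""

    def __init__(self):
        self.data = bytearray()
        self.files = {}  # name -> (offset, length)
        self.free = []   # list of (offset, length) holes available for reuse

    def _allocate(self, size):
        # First fit: reuse the first hole that is big enough.
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        # No suitable hole: grow the container at the end.
        offset = len(self.data)
        self.data.extend(b"\0" * size)
        return offset

    def write(self, name, payload):
        if name in self.files:
            self.delete(name)  # free the old space before storing the new
        offset = self._allocate(len(payload))
        self.data[offset:offset + len(payload)] = payload
        self.files[name] = (offset, len(payload))

    def read(self, name):
        offset, length = self.files[name]
        return bytes(self.data[offset:offset + length])

    def delete(self, name):
        # The space becomes a hole that later writes may reuse.
        self.free.append(self.files.pop(name))


store = FreeListStore()
store.write("a.txt", b"x" * 100)
store.write("b.txt", b"y" * 50)
store.delete("a.txt")
store.write("c.txt", b"z" * 80)  # fits in the hole left by a.txt
size_after = len(store.data)
print(size_after, store.files, store.free)
```

The container does not grow when c.txt is written because the freed space is reused, and no two sub-files ever share the same bytes.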

Microsoft has had several file format definitions that satisfy the portable directory metaphor like OLE Structured Storage and CAB files. But for some reason they have not caught on as I would have imagined they should have. In fact, there is a company that has been selling a portable directory library for years. Of course there are as many ways to implement a portable directory scheme as there are ways to implement a file system so if this idea is ever to be used widely it will have to be a standard, an open standard.

Now what about ZIP archives? Several new file format designs have, in fact, used ZIP archives as a poor man's portable directory. ZIP implements a rudimentary file system, one where storage management is about as primitive as you can imagine. It just lays each sub-file out in one contiguous hunk within the containing single file and it lays each sub-file in succession one after the other. If you want to replace a sub-file you can overwrite the existing one provided the new sub-file is no bigger than the original, or you can just add to the end of the ZIP file and rename the old file to some garbage name (or better, just remove it from the ZIP directory). You can erase a sub-file by removing its name from the directory. Yes, you then need to write a new directory but that, too, can be just added to the end of the file.

So you can see that a ZIP archive makes a not so bad read-only file system but it is dead poor for a read/write system.
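The append-only behavior is easy to see with Python's standard zipfile module. This is a small sketch using an in-memory buffer, and the sub-file name is made up. It "replaces" a sub-file the second way described above, by adding a new copy at the end. In this sketch the old copy stays listed in the directory as well as in the data, so the archive simply grows.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()

# Create the archive with one sub-file.
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("doc/part1.txt", "first version " * 100)
size_before = len(buf.getvalue())

# "Replace" the sub-file by appending a new copy under the same name.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # zipfile warns about the duplicate name
    with zipfile.ZipFile(buf, "a") as zf:
        zf.writestr("doc/part1.txt", "second version " * 100)
size_after = len(buf.getvalue())

# List what the directory now contains.
with zipfile.ZipFile(buf) as zf:
    entries = zf.namelist()

print(size_before, size_after, entries)
```

Repeating this for every small update is what makes the format a poor fit for read/write use: reclaiming the dead space means rewriting the whole archive.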

So why are OOXML, ODF, Mars, AIR and many others using ZIP archive files when they could be using some superior portable directory implementation? As far as Adobe is concerned it all comes down to the standards question. ZIP archives are a well established and rather stable de facto standard that is widely supported and has been around for ages. No risk of trying to invent something new and to get it standardized. But it is a dreadfully poor design to use for a compound document architecture where you may want to update some small percentage of the content and do it repeatedly without the need to rewrite the whole ZIP archive each time.

So you might ask why Jim King is just writing about this when he has been talking to his colleagues about it for over 8 years? Why hasn't he made it happen? I have no excuse. I have had many more compelling things to champion, and the portable directory evangelism has never quite risen to the top of my list. But I did hire a summer intern, Deepa Tuteja, in 2003 and she did create a prototype portable directory library. But after developing the basic prototype we made one strategic mistake. Just as an exercise to be able to compare ZIP archives with our work we laid our portable directory structure on top of a ZIP archive, making use of temporary files to hold sub-files until the master file is closed. It is an optimization challenge to figure out the most efficient way to support the portable directory interface with the primitive ZIP underpinnings. But when it came time to sell the ideas and the prototype to some of our product people they chose the less risky (standards wise) ZIP implementation. I have yet to get them off of that.

Now when I get PDF handed off to ISO, I might just begin to ... . Well then I have to work on doing the next ISO 32000 version but when that work quiets down I just might begin to ... .

I hope you all are not like me.

Contact me at: jking@adobe.com

 


December 10, 2007

Comments on Comments

After my last blog announcing the positive vote for PDF to complete the Draft International Standard (DIS) phase of becoming an ISO standard, I have gotten a lot of comments. Quite a few were comments about the ballot comments so I thought I should comment.

The 5 month ballot that just completed allowed for yes, no, or abstain votes to be accompanied by comments or editing suggestions to make the specification better or to fix errors. A positive vote implies that the standard is fine but may also include comments about minor edits that do not substantially change the specification.  Any negative ballot must be accompanied by at least one significant technical objection and those have to be addressed one way or another before the DIS can be published as the final ISO 32000-1 standard.

Since the announcement, a large number of comments on my blog have asked one or both of these questions: why did the French vote negatively, and can you point me to a place where I can find all 205 comments that were in the ballots? (See below.)

The answer to the second question is: sorry, but the comments are restricted to committee members, since they are considered part of the committee discussions. If you have a burning desire to read them, you will have to join your country's standards organization that is responsible for ISO TC 171 SC2.

As to the first question about the negative vote by France, I have read their comments and can only give you some general information.

I can tell you that the French committee has members who know a great deal about digital signatures and who spent considerable time reading the section of the ISO 32000 DIS document that covers PDF digital signatures. And I confess that the section of the ISO 32000 DIS on digital signatures is one of the hardest to read. And I think the more you know about digital signatures, the more you may miss some of the peculiarities of how PDF does it and object to them.

This fast-track process to move PDF 1.7 from Adobe ownership to that of ISO is not the normal hammering out of technical details for a new standard that many committee members are accustomed to. The current objective is to move PDF 1.7 to ISO management unchanged. Some experts feel that some of the choices made in PDF are wrong and should be changed or improved, especially when it comes to digital signatures. OK. But we cannot change it as we move it, or we run the risk of making it not apply to those billions of PDF files currently in existence. After ISO owns it, we can carefully work out backward-compatible changes to fix the things people feel need fixing. That will be ISO 32000-2, the second ISO release.

I have nothing but warm feelings toward the French, including the people on this committee who voted no. I am delighted that they spent that much time looking over the specification and got interested enough to make constructive criticisms. They are going to make great partners on the ISO committees which will make PDF better and better over time.  The single negative vote is not a big picture item and no one should carry any negative feelings toward this committee for doing the job as they saw it. This is how this process is supposed to work.

Here is a sampling of the comments I got about these two subjects:

  • Is there a link for the comments on the ballot? What key objections or concerns were raised by France that prevented an unanimous ballot? – Andrew Mossberg
  • I would like to read the negative comments, that would be more interesting than the news itself.. =(  – bob
  • Hey, is the list of comments somewhere on the net? Would love to have a look at it. – Evgeni
  • What are the main reasons why france voted no? Would be interesting to know your analysis on that. thanks. – Sid
  • Why did France object?  – Al
  • Why France has voted negative? I'm french, and I don't agree with the french vote.  – Frenchy
  • Does anyone know which issues made the French vote "no" ???  – Paganel
  • Are those comments public? If yes, where can they be found?  – Alex
  • Is there someplace where we can read all the comments?  –  Lemi4 aka. fERDI:)
  • Congrats on the achievement, and PDF's formalization as an ISO standard. I'd be fascinated to read the comments made, so are these archived anywhere?  I'd be even more interested to hear the rationale behind France's "no" vote.  –  Ian Farquhar
  • Is there anywhere where we can see the comments?  I'm curious to know what France objected to!  –  Hydrargyrum
  • Do you know why France voted against it?  –  Nick

Notice that they are all nicely worded, factual and professional.  No crap. Thanks go to my readers.

My announcement blog also made it to Slashdot, where it accumulated over 300 comments. (This is the first time my tech-heavy blog has hit something like Slashdot. I think my colleague Duane Nickull is responsible.)

Contact me at: jking@adobe.com

December 04, 2007

ISO Ballot for PDF 1.7 Passed!

Adobe has received word that the ballot for approval of PDF 1.7 to become the ISO 32000 Standard (DIS) has passed by a vote of 13:1.

Countries voting positive with no comments:  Australia, Bulgaria, China, Japan, Poland, South Africa, Spain, Sweden, Ukraine. (9)
Countries voting positive with comments: UK (13), USA (125), Germany (11), Switzerland (19).  (4)
Countries voting negative with comments: France (37). (1)   
Countries abstaining: Russia (1)
Italy sent comments but is not a voting (P) member.

Total votes: 14 (the one abstention is not counted).

13 positive is 93% (must be > 66.6%); 1 negative is 7% (must be < 25%). Clear winner!

Total comments (205).

Five countries added comments to their ballots for a total of 205 that will have to be resolved.
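
The tally above is easy to check by hand, but here is a minimal sketch in Python for anyone who wants to re-run the arithmetic. It assumes the simplified rule as stated in this post (more than two-thirds positive, fewer than a quarter negative, with the abstention left out of the total). The function and variable names are made up for this illustration, and the sketch is not the official ISO counting procedure.

```python
# Figures copied from the ballot results reported in this post.
positive_without_comments = 9
comments_per_country = {
    "UK": 13, "USA": 125, "Germany": 11, "Switzerland": 19,  # voted positive
    "France": 37,                                            # voted negative
}
negative_countries = {"France"}


def summarize_ballot(positive, negative):
    """Apply the two thresholds quoted in the post to a yes/no count.

    Abstentions are left out of the total (an assumption that matches
    the 14 votes counted in the post).
    """
    total = positive + negative
    positive_share = positive / total
    negative_share = negative / total
    return {
        "total_votes": total,
        "positive_pct": round(100 * positive_share),
        "negative_pct": round(100 * negative_share),
        # Passing needs more than two-thirds positive and under a quarter negative.
        "passed": positive_share > 2 / 3 and negative_share < 0.25,
    }


positive_with_comments = len(comments_per_country) - len(negative_countries)
summary = summarize_ballot(
    positive=positive_without_comments + positive_with_comments,
    negative=len(negative_countries),
)
total_comments = sum(comments_per_country.values())

print(summary)
print("comments to resolve:", total_comments)
```

Italy's comments are not included in the dictionary because the post does not give a count for them and the 205 total covers only the five voting countries listed.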

I have been nominated by the US Committee to be the technical editor, so for the meeting of the international working group on ISO 32000 on January 21-23, 2008, I will come prepared with responses to all 205 comments. If the group can address all the comments to the satisfaction of all countries, especially the one voting negatively, it is possible to finish at that meeting and publish the revised document. If the resolution is more complicated, then we will enter a two-month Final Draft International Standard (FDIS) vote. FDIS votes are not accompanied by comments, so if we get no more negative votes at that time, the revised document will be the one published as ISO 32000.

It may seem strange that the sponsoring country (US) is the one with the most comments (125), but I think that is a reflection of two things: the US committee contains a lot of knowledgeable people, including several from Adobe, and we honestly found some mistakes that we felt must be corrected. To me this reflects the honesty with which this group has approached this whole effort. We could have held back to reduce the number, but that is not the way this effort has been conducted, and we are not about to start with any trickery.

The challenging part will be to get people to participate in the next release of the standard. Lots of people want standards but it takes a measurable resource commitment to participate.

Contact me at: jking@adobe.com