<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
   <channel>
      <title>Inside PDF</title>
      <link>http://blogs.adobe.com/insidepdf/</link>
      <description>This blog will discuss all aspects of PDF technology.  As PDF Architect and a Senior Principal Scientist, I will cover the recent activities for putting control of PDF into the public hands, present tutorial views on PDF, and similar stuff.  There are a lot of interesting things happening in this area and a lot of misinformation is being generated.  I plan to kill off as much of that misinformation as possible. The postings on this site are my own and don’t necessarily represent Adobe’s positions, views, strategies or opinions. Sometimes I wish they did!   --  Jim King</description>
      <language>en</language>
      <copyright>Copyright 2008</copyright>
      <lastBuildDate>Mon, 28 Jan 2008 09:32:09 -0800</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/?v=3.2</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

            <item>
         <title>ISO 32000 — Document management — Portable document format — PDF 1.7</title>
         <description><![CDATA[
	<p>The title of this article is the official name of the <a href="http://www.iso.org">ISO</a> standard based upon PDF 1.7 which, for all purposes, now exists.  The International working group meeting in Orlando, held last week (January 21-23, 2008), was able to successfully resolve all 205 comments to the 32000 Draft International Standard (DIS) document. Even the French, with their expert calling into the meeting on a conference call, have agreed to the edited comment resolutions and, as I <a href="http://blogs.adobe.com/insidepdf/2008/01/now_unanimous.html">reported earlier</a>, changed their vote to positive. That makes it unanimous. </p>
	<p>The remaining tasks are primarily mine and the editing staff of ISO in Switzerland.&#160; I have two tasks to perform: 1) make a final version of the document that records the resolution to each of the comments and 2) make the final edits to the DIS that the comment resolutions dictate. I hope to finish this in a week or two and then it is up to the Swiss editors to polish it off and get it published on the ISO website as an official document. (They have a reputation for taking months to do this. Maybe we can get them into a more excited mood for this effort.)</p>
	<p>This whole process was started less than one year ago when <a href="http://www.adobe.com/aboutadobe/pressroom/pressreleases/200701/012907OpenPDFAIIM.html">Adobe announced</a> that, with the cooperation of <a href="http://www.aiim.org/standards.asp?ID=28639">AIIM</a> and <a href="http://www.ansi.org/standards_activities/iso_programs/overview.aspx?menuid=3">ANSI</a>, we would submit PDF 1.7 to become a public ISO standard. Doing such a complicated standard in one year is extraordinary, but it worked so well for two reasons: 1) PDF was a well accepted and <a href="http://www.adobe.com/devnet/pdf/pdf_reference.html">well documented</a> existing de facto standard and  2) we developed some basic principles to guide the work and help with decision making. </p>
	<p>As we began to make decisions, answer questions and move forward, it became clear to me that the standards process that Adobe had been following and the standards process that AIIM/ANSI/ISO follow are quite different. For the standards organizations the carefully written standards document is supreme. It defines the standard. While <a href="http://www.adobe.com/devnet/pdf/pdf_reference.html">Adobe's PDF 1.7 Reference</a> document is intended to do that same thing, the situation isn't quite so clear. For example, if the billions of files in existence today all contain a construct that has A=1 and the Adobe document says they should have A=2, the document must be changed. That is, the existing files trump the documentation. It would be of no value to have a specification that does not cover the existing files. So one focus I put forth at each opportunity was that the primary objective of the new ISO PDF 1.7 standard was to document the existing files.</p>
	<p>In fact, PDF has been a de facto standard based upon three things: the billions of existing files, the thousands of software offerings that create and process those files, and the Adobe PDF 1.7 Reference. And we decided that the order of preference to resolve any differences was exactly that: files, then software, then documentation. That is definitely not the standards approach where the document is supreme. So as we examined the Adobe PDF 1.7 Reference and turned it into the ISO 32000 draft we adopted what we called this &quot;three legged stool&quot; approach based upon prioritizing the three contributors to the existing standard. If we could capture the most correct interpretation in each area according to those criteria we could produce an ISO document that <em>was</em> (more) supreme and could be acceptable as the definitive word for what a PDF file contains. Of course, the Adobe PDF 1.7 Reference proved to be fine for 99.9% of the definition. </p>
	<p>Most people, after thinking about this three legged stool metaphor came to agree that documenting the existing files should take priority. </p>
	<p>The reason I mention this is I have repeatedly found that most work needs some guiding principles to follow. When making decisions between alternatives, especially when each is very reasonable, established project principles can usually make the decision obvious. If you do not establish such principles early, and make them well known, you find yourself reinventing them and discussing them with each decision. If you do it once at first and get agreement, then they can be applied when needed.</p>
	<p>This proved true when the ISO 32000 draft was created and it proved true during last week's review of the ballot comments. </p>
	<p>So except for the editing work facing me in the next week or so, I am very relaxed because this long (almost) year is about over. It has been quite an experience for me and I sure hope it has been the right action for both the industry and Adobe to have taken. As I said in an earlier blog, the big thing now is to get enough interested people involved in the ISO 32000 working group to allow the public to be as good a shepherd of PDF as Adobe has been in the past. Please get involved by contacting <a href="http://www.aiim.org/standards.asp?ID=28639">AIIM</a> in the US or your <a href="http://www.iso.org/iso/about/iso_members.htm">national standards organization affiliated with ISO</a>. </p>
	<p>This is my personal blog and my view of things. But I must emphasize that this has been a team effort within Adobe and here are the names of the key players that I worked closely with: Leonard Rosenthol, Ed Taft, Nora Calvillo, Dave McAllister, Isak 
Tenenboym, Mike Ossesia, Kathy Stone, Sandra Lee-Doersam and Cheryl Shimamoto. Any time you make a list like this you run the risk of leaving someone off the list who should be there––my early apology. Betsy Fanning from AIIM has been a great boss as the secretariat for TC 171 and this effort. There are many more people who made contributions but these are the ones I worked most closely with on the technical aspects of moving PDF 1.7 to ISO. </p>
	<p>Contact me at: jking@adobe.com</p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2008/01/iso_32000_document_management.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2008/01/iso_32000_document_management.html</guid>
         <category></category>
         <pubDate>Mon, 28 Jan 2008 09:32:09 -0800</pubDate>
      </item>
            <item>
         <title>Now Unanimous! </title>
         <description><![CDATA[
	<p>This is a great day! This is about the program that Adobe, <a href="http://www.aiim.org/standards.asp?ID=28639">AIIM</a>, and <a href="http://www.ansi.org/standards_activities/iso_programs/overview.aspx?menuid=3">ANSI</a> put together just less than one year ago to move PDF 1.7 under public control as an ISO standard. We submitted the suggestion to ISO and they agreed. A lot has happened since then and I have written several previous articles (<a href="http://blogs.adobe.com/insidepdf/2007/09/high_drama.html">here</a> and <a href="http://blogs.adobe.com/insidepdf/2007/09/fast_tracking.html">here</a>) about what was happening in this blog. </p>
	<p>Well today the French Standards committee which was the only country committee to submit a negative vote on our recent ballot has reviewed my responses to their comments and decided that if those changes in the specification are made they will change their vote to positive. That will make it unanimous! </p>
	<p>Let me say all this a little more carefully.&#160; In one of my <a href="http://blogs.adobe.com/insidepdf/2007/12/iso_ballot_for_pdf_17_passed.html">previous blogs</a> I noted that the results of the ISO Draft International Standard (DIS) ballot for PDF 1.7 due December 2, 2007 was 13 in favor and 1 opposed (the French).&#160; The ballots also had room for editing comments against the DIS and 205 came in including a bunch from the French. Next week (Jan. 21-23, 2008) in Orlando, Florida, the International committee (<a href="http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=53674">TC 171/SC 2</a>) will meet to decide which edits, if any, need to be made to the DIS document before it can be published as the official ISO 32000 standard.</p>
	<p>After the ballot ended in December, as the acting technical project leader, I drafted responses to all the comments so we would have a starting point for the discussion next week in Orlando. The <a href="http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/technical_committee_contact.htm?commid=53674">Secretariat</a> of <a href="http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=53674">ISO TC 171/SC 2</a>  sent the comments that the French made back to them, together with my recommendations, asking if they would change their vote to positive if we followed my recommendations.&#160; (Knowing the Secretariat I assume there were some very diplomatic exchanges that took place as well.) And I just now got word that they decided that the suggested resolution to their comments would be good enough.&#160; Whew!</p>
	<p>Now we have to get everyone (<a href="http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee_participation.htm?commid=53674">14 countries</a>) together next week and go over all the comments and make sure all the countries are comfortable with the suggested treatment or decide on a new treatment. I did recommend rejecting quite a few of the comments as being misunderstandings or out of scope, including some that the French made. </p>
	<p>I have talked about <em>my</em> responses to the 205 comments. They are mine as the acting project editor but I must confess that I got a lot of help from PDF experts in drafting them especially a great team of experts within Adobe. So they are only <em>mine </em>in the sense of being responsible. I would single out <a href="http://www.adobepress.com/content/images/0321115643/excerpts/0321115643_pp23-29.pdf">Ed Taft</a> and <a href="http://www.acrobatusers.com/blogs/leonardr/">Leonard Rosenthol</a> as having been invaluable colleagues in this work. </p>
	<p>I am confident that the committee will be able to amiably resolve all 205 comments and we can then send the edited version off to Switzerland to become a published ISO standard.&#160; </p>
	<p>If the French had stuck with their negative vote, we would have had to do pretty much the same thing, except wait two more months while the edited document went out for ballot as a Final Draft International Standard (FDIS). Then the votes would be either thumbs up or thumbs down with no edits possible. Seems like a wasted 2 months in any case, but now we will not have to do that.&#160; Thanks to the French!</p>
	<p>Wish me luck next week.</p>
	<p>Contact: <a href="mailto:jking@adobe.com">jking@adobe.com</a></p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2008/01/now_unanimous.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2008/01/now_unanimous.html</guid>
         <category></category>
         <pubDate>Thu, 17 Jan 2008 18:50:25 -0800</pubDate>
      </item>
            <item>
         <title>ZIP Archives and Portable Directories</title>
         <description><![CDATA[
	<p>This is a topic that is dear to my heart and I would love to spur some interest in creating an open source project or something like that. Since about 1999 I have been talking to my colleagues about a concept that I call &quot;portable directories.&quot;&#160; It is a simple idea once you &quot;get it.&quot;&#160; </p>
	<p>File systems, organized around the notion of directories or folders in which to collect files and other directories, have been the staple for how we save computer material on our hard drives, data CDs and DVDs, etc. I suppose it had its invention from an analogy with a file cabinet, but on the computer we can nest folders inside folders to any depth, something hard to do with real physical folders.</p>
	<p>We have also used another concept that is almost the opposite: we have used a single file to hold all of the varied content needed, for example, to hold a presentation document like PDF, DOC, PPT or AI. Each software application goes to some trouble to figure out how to stuff a wide variety of material into a single file so that users can have self contained documents. Some of these file formats that are &quot;native&quot; to particular applications are sophisticated databases and support random access within the file so the whole file does not have to be digested at once.</p>
	<p>I have long believed that a single file would not be the first choice of a product designer, but we are forced to that because of the extreme ease with which our product's users can then deal with documents. I have long believed that developers would have naturally used a directory structure to save the varied parts of a complex application file were it not for that.</p>
	<p>So the simple notion of a &quot;portable&quot; directory is to have our cake and eat it too. What if I were to give you a software library that implemented a complete file system using a single file as its storage area instead of the underlying hard disk?&#160; Just one file. Then the developer could use that library and work in a normal directory structure and yet the user would only see one file.&#160; That's it!</p>
	<p>Now this is not a new idea and it isn't so original with me. The <a href="http://en.wikipedia.org/wiki/NeXT">NeXT machine</a> accomplished the same idea but rather the other way around. It had a binary bit that you could set in a directory that basically said not to expand the contents when displaying this to the user but treat it as a simple file. This is a great way to get the portable directory idea, except you have to convince each operating system to support it. So I favor the simpler approach of having single files, which all OS's support, and which are easy to transport and manage.&#160; We just invent a directory structure inside that single file using a portable directory software library.</p>
	<p>If you consider the operations that you need from a basic file system they are naming files, maintaining a file/directory tree structure, creating files and directories, removing or erasing files or directories and managing the available storage space to reuse it when it is freed and to not have more than one thing using the same space. There are some other things like saving dates, times and size values.&#160; And designing reliable and efficient storage management is non-trivial, but it is something the industry has been doing for about 40 years so there is a lot known about what works well and why.</p>
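	<p>To make that list of operations concrete, here is a toy sketch in Python. It is not the prototype mentioned later in this post and not any real library: the class name <code>PortableDirectory</code>, its methods, and the in-memory <code>bytearray</code> standing in for the single container file are all assumptions made for illustration. It shows naming, creating, removing, and a first-fit free list that reuses freed space; dates, nested directory objects and crash safety are left out.</p>

```python
class PortableDirectory:
    """Toy single-container file system (hypothetical, for illustration only)."""

    def __init__(self):
        self.store = bytearray()  # stands in for the one file the user sees
        self.index = {}           # path -> (offset, length) of each sub-file
        self.free = []            # (offset, length) holes available for reuse

    def _allocate(self, size):
        # First fit: reuse a freed hole when one is large enough.
        for i, (off, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    self.free.pop(i)
                else:
                    self.free[i] = (off + size, length - size)
                return off
        # No hole fits, so grow the container at the end.
        off = len(self.store)
        self.store.extend(bytes(size))
        return off

    def write(self, path, data):
        # Replacing a sub-file frees its old space before allocating anew.
        if path in self.index:
            self.remove(path)
        off = self._allocate(len(data))
        self.store[off:off + len(data)] = data
        self.index[path] = (off, len(data))

    def read(self, path):
        off, length = self.index[path]
        return bytes(self.store[off:off + length])

    def remove(self, path):
        # The space goes on the free list; no two sub-files ever share it.
        self.free.append(self.index.pop(path))

    def listdir(self, prefix=""):
        # Directories are implied by path prefixes in this sketch.
        return sorted(p for p in self.index if p.startswith(prefix))
```

	<p>Writing <code>doc/page1.txt</code>, removing it, and then writing a smaller <code>doc/page2.txt</code> leaves the container the same size, because the new sub-file lands in the hole the old one left behind. A real design would also have to coalesce neighboring holes and persist the index inside the file.</p>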
	<p>Microsoft has had several file format definitions that satisfy the portable directory metaphor like <a href="http://msdn2.microsoft.com/en-us/library/aa378734.aspx">OLE Structured Storage</a> and <a href="http://www.microsoft.com/technet/prodtechnol/ie/ieak/techinfo/deploy/60/en/cab.mspx?mfr=true">CAB files</a>. But for some reason they have not caught on as I would have imagined they should have. In fact, there is a <a href="http://www.simple-sw.com/">company</a> that has been selling <a href="http://web.archive.org/web/20070216023234/http%3A//www.simple-sw.com/efsinfo.htm">a portable directory library</a> for years. Of course there are as many ways to implement a portable directory scheme as there are ways to implement a file system so if this idea is ever to be used widely it will have to be a standard, an open standard.</p>
	<p>Now what about <a href="http://www.pkware.com/index.php?option=com_content&amp;task=view&amp;id=64&amp;Itemid=107">ZIP archives</a>? Several new file format designs have, in fact, used ZIP archives as a poor man's portable directory. ZIP implements a rudimentary file system, one where storage management is about as primitive as you can imagine. It just lays each sub-file out in one contiguous hunk within the containing single file and it lays each sub-file in succession one after the other. If you want to replace a sub-file you can overwrite the existing one provided the new sub-file is no bigger than the original or you can just add to the end of the ZIP file and rename the old file to some garbage name (or better, just remove it from the ZIP directory).&#160; You can erase a sub-file by removing its name from the directory. Yes, you then need to write a new directory but that, too, can be just added to the end of the file.</p>
	<p>So you can see that a ZIP archive makes a not so bad read-only file system but it is dead poor for a read/write system.</p>
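	<p>The append-to-the-end behavior is easy to observe with Python's standard <code>zipfile</code> module. The sketch below is only an illustration: the helper names <code>append_replacement</code> and <code>rewrite_without_stale</code> are made up here, and Python's module keeps the old directory entry rather than removing or renaming it. Still, it shows the cost: the replaced sub-file's bytes stay in the archive until the whole thing is rewritten.</p>

```python
import io
import warnings
import zipfile


def append_replacement(buf, name, data):
    """Append a new copy of `name`; the old copy's bytes stay in the archive."""
    with warnings.catch_warnings():
        # zipfile emits a UserWarning about the duplicate name; silence it here.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "a") as zf:
            zf.writestr(name, data)


def rewrite_without_stale(buf):
    """Reclaim space the only way available: copy live entries to a new archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(buf, "r") as src, zipfile.ZipFile(out, "w") as dst:
        # dict.fromkeys drops repeated names; read() returns the newest copy.
        for name in dict.fromkeys(src.namelist()):
            dst.writestr(name, src.read(name))
    return out


# Build a small archive in memory, then "replace" one sub-file.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("page.txt", b"x" * 1000)
    zf.writestr("other.txt", b"y")
size_original = len(archive.getvalue())

append_replacement(archive, "page.txt", b"z" * 1000)
size_after_append = len(archive.getvalue())

compacted = rewrite_without_stale(archive)
size_compacted = len(compacted.getvalue())
```

	<p>Comparing <code>size_original</code>, <code>size_after_append</code> and <code>size_compacted</code> shows the archive growing by roughly the size of the replaced sub-file and shrinking again only after a full rewrite.</p>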
	<p>So why are <a href="http://en.wikipedia.org/wiki/Office_Open_XML">OOXML</a>, <a href="http://www.oasis-open.org/committees/office/">ODF</a>, <a href="http://labs.adobe.com/technologies/mars/">Mars</a>, <a href="http://labs.adobe.com/technologies/air/">AIR</a> and many others using ZIP archive files when they could be using some superior portable directory implementation?&#160; As far as Adobe is concerned it all comes down to the standards question. ZIP archives are a well established and rather stable de facto standard that is widely supported and has been around for ages. No risk of trying to invent something new and to get it standardized. But it is a dreadfully poor design to use for a compound document architecture where you may want to update some small percentage of the content and do it repeatedly without the need to rewrite the whole ZIP archive each time. </p>
	<p>So you might ask why Jim King is just writing about this when he has been talking to his colleagues about it for over 8 years? Why hasn't he made it happen? I have no excuse. I have had many more compelling things to champion, and the portable directory evangelism has never quite risen to the top of my list. But I did hire a summer intern, Deepa Tuteja,  in 2003 and she did create a prototype portable directory library. But after developing the basic prototype we made one strategic mistake. Just as an exercise to be able to compare ZIP archives with our work we  laid our portable directory structure on top of a ZIP archive, making use of temporary files to hold sub-files until the master file is closed. It is an optimization challenge to figure out the most efficient way to support the portable directory interface with the primitive ZIP underpinnings. But when it came time to sell the ideas and the prototype to some of our product people they chose the less risky (standards wise) ZIP implementation. I have yet to get them off of that. </p>
	<p>Now when I get PDF handed off to ISO, I might just begin to ... . Well then I have to work on doing the next ISO 32000 version but when that work quiets down I just might begin to ... .</p>
	<p>I hope you all are not like me.</p>
	<p>Contact me at: jking@adobe.com </p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/12/zip_archives_and_portable_dire.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/12/zip_archives_and_portable_dire.html</guid>
         <category></category>
         <pubDate>Sat, 15 Dec 2007 20:05:08 -0800</pubDate>
      </item>
            <item>
         <title>Comments on Comments</title>
         <description><![CDATA[<p>After my <a href="http://blogs.adobe.com/insidepdf/2007/12/iso_ballot_for_pdf_17_passed.html">last blog</a> announcing the positive vote for PDF to complete the Draft International Standard (DIS) phase of becoming an ISO standard, I have gotten a lot of comments. Quite a few were comments about the ballot comments so I thought I should comment. </p>
	<p>The 5 month ballot that just completed allowed for yes, no, or abstain votes to be accompanied by comments or editing suggestions to make the specification better or to fix errors.&#160;A positive vote implies that the standard is fine but  may also include comments about minor edits that do not substantially change the specification.&#160; Any negative ballot<strong> must </strong>be accompanied by at least one significant technical objection and those have to be addressed one way or another before the DIS can be published as the final ISO 32000-1 standard. </p>
	<p>Since the announcement, a large number of comments to my blog asked one or both of these questions:&#160; why did the French vote negatively, and can you point me to a place where I can find all the 205 comments that were in the ballots.&#160; (See below.) </p>
	<p>The answer to the second question is, sorry but the comments will be restricted to committee members since they are considered part of the committee discussions. If you have a burning desire to read them you will have to join your <a href="http://www.iso.org/iso/about/iso_members.htm">country's standards organization</a> that is responsible for <a href="http://www.iso.org/iso/iso_technical_committee.html?commid=53674">ISO TC 171 SC2.</a></p>
	<p>As to the first question about the negative vote by France, I have read their comments and can only give you some general information. </p>
	<p>I can tell you that the French committee has members that know a great deal about digital signatures and spent considerable time reading the section of the ISO 32000 DIS document that covers PDF digital signatures. And I confess that the section of the ISO 32000 DIS on digital signatures is one of the hardest to read. And I think the more you know about digital signatures the more you may miss some of the peculiarities of how PDF does it and object to them. </p>
	<p>This fast track process to move PDF 1.7 from Adobe ownership to that of ISO is not the normal hammering out of the technical details for a new standard that many committee members are accustomed to. The current objective is to move PDF 1.7 to ISO management <strong>unchanged</strong>. Some experts feel that some of the choices that have been made in PDF are wrong and should be changed or improved, especially when it comes to digital signatures.&#160;OK.  But we cannot change it as we move it or we can run the risk of making it not apply to those billions of PDF files currently in existence. After ISO owns it, we can carefully work out backward compatible changes to fix things people feel need fixing.&#160; That will be ISO 32000-2 the second ISO release.&#160; </p>
	<p>I have nothing but warm feelings toward the French, including the people on this committee who voted no. I am delighted that they spent that much time looking over the specification and got interested enough to make constructive criticisms. They are going to make great partners on the ISO committees which will make PDF better and better over time.&#160; The single negative vote is not a big picture item and no one should carry any negative feelings toward this committee for doing the job as they saw it. This is how this process is supposed to work. </p>
	<p>Here is a sampling of the comments I got about these two subjects:</p>
	<ul>
	  <li>Is there a link for the comments on the ballot?  What key objections or concerns were raised by France that prevented an unanimous ballot? – Andrew Mossberg</li>
	  <li>I would like to read the negative comments, that would be more interesting than the news itself.. =(&#160; – bob</li>
      <li>Hey, is the list of comments somewhere on the net? Would love to have a look at it. – Evgeni</li>
	  <li>What are the main reasons why france voted no? Would be interesting to know your analysis on that. thanks. – Sid</li>
	  <li>Why did France object?&#160; – Al</li>
	  <li>Why France has voted negative? I'm french, and I don't agree with the french vote.&#160; – Frenchy</li>
	  <li>Does anyone know which issues made the French vote "no" ???&#160; – Paganel</li>
	  <li>Are those comments public? If yes, where can they be found?&#160; – Alex</li>
	  <li>Is there someplace where we can read all the comments?&#160; –&#160; Lemi4 aka. fERDI:)</li>
	  <li>Congrats on the achievement, and PDF's formalization as an ISO standard. I'd be fascinated to read the comments made, so are these archived anywhere?&#160; I'd be even more interested to hear the rationale behind France's "no" vote.&#160; –&#160; Ian Farquhar</li>
	  <li>Is there anywhere where we can see the comments?&#160; I'm curious to know what France objected to!&#160; –&#160; Hydrargyrum</li>
	  <li>Do you know why France voted against it?&#160; –&#160; Nick</li>
    </ul>
    <p>Notice that they are all nicely worded, factual and professional.&#160; No crap. Thanks go to my readers. </p>
    <p>My announcement blog also made it to <a href="http://slashdot.org/article.pl?sid=07/12/05/0149227">Slashdot</a> where it accumulated over 300 comments.   (This is the first time my tech-heavy blog has hit something like Slashdot. I think my colleague <a href="http://technoracle.blogspot.com/">Duane Nickull</a> is responsible.)</p>
    <p>contact me at:&#160;<a href="mailto:jking@adobe.com"> jking@adobe.com </a><BR/>
    </p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/12/comments_on_comments.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/12/comments_on_comments.html</guid>
         <category></category>
         <pubDate>Mon, 10 Dec 2007 11:53:57 -0800</pubDate>
      </item>
            <item>
         <title>ISO Ballot for PDF 1.7 Passed!</title>
         <description><![CDATA[    <p>Adobe has received word that the Ballot for approval of PDF 1.7 to become the ISO 32000 Standard (DIS) has passed by a vote of 13:1. <br />
      <br />
      Countries voting positive with no comments: &#160;Australia, Bulgaria, China, Japan, Poland, South Africa, Spain, Sweden, Ukraine. (9)<br />
      Countries voting positive with comments: UK (13), USA (125), Germany (11), Switzerland (19). &#160;(4)<br />
      Countries voting negative with comments: France (37). (1) &#160;&#160;<br />
Countries abstaining: Russia (1)<br />
Italy sent comments but is not a voting (P) member.<br />
<br />
Total votes 14. &#160;<br />
<br />
13 Positive is 93% (must be &gt; 66.6%) &#160;1 Negative is 7% (must be &lt; 25%). &#160;Clear winner!<br />
<br />
Total comments (205).<br />
<br />
Five countries added comments to their ballots for a total of 205 that will have to be resolved.</p>
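    <p>For anyone who wants to check the arithmetic, here is a small Python sketch. The vote and comment counts are the ones listed above. The function name <code>dis_ballot_passes</code> is made up for this illustration, and reading &quot;66.6%&quot; as two-thirds and leaving abstentions out of the total (as the total of 14 above does) are assumptions, not a quotation of the ISO rules.</p>

```python
from fractions import Fraction

# Comments per country, as reported in the ballot results above.
comments = {"UK": 13, "USA": 125, "Germany": 11, "Switzerland": 19, "France": 37}
total_comments = sum(comments.values())

positive, negative = 13, 1  # Russia's abstention is not counted as a vote


def dis_ballot_passes(positive, negative):
    """Apply the two thresholds quoted above (two-thirds is an assumed reading of 66.6%)."""
    total = positive + negative
    enough_support = Fraction(positive, total) > Fraction(2, 3)
    few_objections = Fraction(negative, total) < Fraction(1, 4)
    return enough_support and few_objections


def percent(part, total):
    """Whole-number percentage, rounded the way the figures above are."""
    return round(100 * part / total)


passed = dis_ballot_passes(positive, negative)
positive_pct = percent(positive, positive + negative)
negative_pct = percent(negative, positive + negative)
```

    <p>Exact fractions are used for the threshold tests so that a borderline ballot is not decided by floating-point rounding.</p>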
    <p> I have been nominated by the US Committee to be the technical editor so for the meeting of the International working group on ISO 32000 on January 21-23, 2008 I will come prepared with responses to all of the 205 comments. If the group can address all the comments to the satisfaction of all countries, especially the ones voting negatively, it is possible to finish at that meeting and publish the revised document. If the resolution is more complicated then we will enter a 2 month FDIS vote. The FDIS votes are not accompanied by comments so if we get no more negative votes at that time the revised document will be the one published as ISO 32000.</p>
    <p> It may seem strange that the sponsoring country (US) is the one with the most comments (125) but I think that is a reflection of two things: the US committee contains a lot of knowledgeable people including several from Adobe, and we honestly found some mistakes that we felt must be corrected. To me this reflects the honesty with which this group has approached this whole effort. We could have held back to reduce the number but that is not the way this whole effort has been conducted and we are not about to start with any trickery.<br />
      <br />
The challenging part will be to get people to participate in the next release of the standard. Lots of people want standards but it takes a measurable resource commitment to participate.    </p>
    <p>Contact me at: jking@adobe.com <BR/>
    </p> ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/12/iso_ballot_for_pdf_17_passed.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/12/iso_ballot_for_pdf_17_passed.html</guid>
         <category></category>
         <pubDate>Tue, 04 Dec 2007 13:06:57 -0800</pubDate>
      </item>
            <item>
         <title>Bits and Pieces</title>
         <description><![CDATA[	<p>I think I have left a few loose ends in some of my previous blogs so this one is just to rehash and reemphasize some points I might not have covered or not covered well.</p>
	<p><strong>Archiving Documents</strong> </p>
	<p>In my <a href="http://blogs.adobe.com/insidepdf/2007/10/archiving_documents.html">previous blog</a> on this subject I may not have made it clear that I do not believe that PDF/A or PDF files are the total answer to archiving documents. There is a wide spectrum of needs when archiving that depends upon the anticipated uses for the documents in the distant future. What I did say and do believe, is that a vast number of archiving needs are of the ePaper variety, where I just need to save a birth certificate or a wedding license or a record of a business transaction. I do not see any need in those cases for saving an editable form. In fact, I want to save those documents in a form where I can lock down the appearance and be assured that in 50 years a person will be able to see exactly what I see today. PDF/A is great for that. </p>
	<p><strong><a href="http://www.w3.org/2004/CDF/">CDF</a></strong></p>
	<p>I wrote about CDF because in the first few weeks of November it started popping up in the news, primarily because the <a href="http://fussnotes.typepad.com/plexnex/2007/11/opendocument-fo.html">Open Document Foundation</a> had announced frustration with the direction that ODF was taking and was switching their support to CDF. I found that rather puzzling since CDF and ODF are such different things. At least I think so. So I decided to make clear in my <a href="http://blogs.adobe.com/insidepdf/2007/11/cdf.html">CDF blog</a> what CDF is and what it isn't.&#160; I personally do not think it is a very good horse to bet upon for an ODF replacement unless it takes a <a href="http://blogs.adobe.com/insidepdf/2007/11/cdf.html">rather different direction</a>. </p>
	<p>I am a little slow on this gossipy news but in the first few weeks of November a rather strange sequence of events was reported. First Sam Hiser (I guess with Gary Edwards and Marbux, the reported leaders of the Open Document Foundation) announced their switch of allegiance to CDF after having formed the organization to promote ODF. Then the most startling thing was that shortly after that announcement the Open Document Foundation closed down. There are some blogs about this <a href="http://www.consortiuminfo.org/standardsblog/article.php?story=20071109070012244#comments">here</a> and  <a href="http://www.robweir.com/blog/2007/10/cracks-in-foundation.html">here.</a></p>
	<p>I am still puzzled over how shallow some people who write with great authority can be, and amazed at how they can collectively make a widespread and continuing story out of so little. They just fill in between the lines with great imagination. There were forecasts of the demise of ODF just because <a href="http://fussnotes.typepad.com/plexnex/2007/11/opendocument-fo.html">Hiser</a> claimed their vital support for ODF was moving to CDF. I guess the switch in letters of the alphabet from &quot;O&quot; to &quot;C&quot; had press appeal. In fact, it seems that the ODF Foundation's <em>vital support</em> for ODF has diminished greatly over the years and at this point is rather immaterial.</p>
	<p><strong>Document Formats </strong></p>
	<p>Another thing that I have <a href="http://blogs.adobe.com/insidepdf/2007/10/archiving_documents.html">blogged about previously</a> and still bugs the @#$% out of me is, again, the technical shallowness of some of the people that write about these things. In fact, I have found a lot of this same technical shallowness in some of the feedback we have gotten about PDF and standards. Come on!&#160; You have to be a little responsible and learn a little bit about the technology, don't ya! </p>
	<p>I have been repeatedly asked if ODF will wipe out PDF. I guess it is a reasonable question, but I suppose I know too much for it to sound reasonable to me. Using some old-fashioned terminology, ODF is a word processor format used to save a document while you go to lunch.&#160; PDF is a communications format used to send information between people. Yes, I know that my characterization of ODF is rather shallow itself, but really folks, that is where these formats like .doc, ODF and OOXML have come from. They may have tried to leave the old neighborhood, just like PDF has been trying to leave the old neighborhood of a captured version of what you would print, but those roots show through. Note in the diagram below that PDF and the other document formats live in different places.</p>
	<p align="center"><img src="http://blogs.adobe.com/insidepdf/OfficeFormats.png" width="310" height="302" /></p>
	<p align="left">I guess this all stems from people's desire to read/write something interesting, something dicey. The controversy surrounding Microsoft and OOXML versus ODF is just such a thing. I might dismiss it as being a rather stupid argument except that it is at the root of a potentially huge financial loss for Microsoft. So it is big business. </p>
	<p align="left">Anyway, from my technical viewpoint (ODF and OOXML), PDF, and CDF are three different things. I do lump OOXML and ODF together although there are important technical differences between them as well. But they both do belong at the top of my diagram and neither is very well suited to be a communications document.&#160; Frankly, I haven't figured out where CDF fits in the diagram. </p>
	<p align="left"><strong>Document Format Profiles</strong></p>
	<p align="left">I have written that document format subsets or profiles are a bad thing. Yet I also think that the PDF subset PDF/A is a good thing. How do I reconcile these apparently conflicting positions? Well, it took some thought, but I think I got it. Let's try to use my old colored telephone analogy. That was a what-if we had red phones and blue phones and red phones could only connect to red phones and blue to blue. That would be terrible. You can make it worse, even, by introducing black and blue phones and then make up a bunch of nonsymmetrical rules about which color of phone can talk to which other color of phone. Communications document format <strong>subsets</strong> are like colored phones. And PDF is one of those communications document formats. </p>
	<p align="left">But PDF/A is ok because it is not driven by the capabilities of devices and it doesn't restrict the key attribute of PDF, namely the reliable display of the document's content. What it does do is restrict the file format for functional reasons. The files should be such that the colors, the content and the ... cannot age or change from device to device or over time.</p>
	<p align="left">Let's see if we can fit it into my phone analogy. What if I said that for certain conversations you were not allowed to use foul language. That would not be the same as blue phones versus red phones.&#160; That is the kind of restriction that PDF/A imposes. It isn't a restriction that inhibits the communication of information, it is a restriction on the kind of information that can be exchanged.&#160; It is OK in my communications metaphor. I can still call and talk to someone with a different color phone, I just might have to watch my language. </p>
	<p align="left">I hope you got this because it is really important that we have support for PDF/A and understand its role in the bigger PDF picture.</p>
	<p align="left">Contact me at: <a href="mailto:jking@adobe.com">jking@adobe.com</a></p>
	]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/12/bits_and_pieces.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/12/bits_and_pieces.html</guid>
         <category></category>
         <pubDate>Tue, 04 Dec 2007 09:53:38 -0800</pubDate>
      </item>
            <item>
         <title>CDF</title>
         <description><![CDATA[	<p>I have heard suggestions that <a href="http://www.w3.org/2004/CDF/">CDF</a> (Compound Document Formats) is a good alternative for <a href="http://www.oasis-open.org/committees/office/">ODF</a> or <a href="http://en.wikipedia.org/wiki/Office_Open_XML">OOXML</a> and I suppose by extension, for <a href="http://www.adobe.com/devnet/pdf/pdf_reference.html">PDF</a>. I want to explain a bit about CDF and connect the discussion to my last blog on Communication via Documents. CDF is a set of candidate <a href="http://www.w3.org/">W3C</a> recommendations (their term for standard; candidate meaning not quite baked). One of the things I want to get to in this article is the &quot;s&quot; on the end of Compound Document Formats. But let's start at what CDF is or isn't. </p>
	<p>In my opinion the name CDF, in general, is a little off because these recommendations (aka, standards) are aimed at <a href="http://www.w3.org/DOM/">DOM</a> processing (Document Object Model) and defining how heterogeneous web languages, when processed into a common DOM, can communicate more effectively and have more impact upon each other. To me it doesn't address &quot;document formats&quot; very directly. The W3C has developed a modest set of XML markup languages for various specialized forms of document content (e.g., XHTML, XSL, XForms, SVG, SMIL, MathML and VoiceXML&#160; [see <a href="http://www.w3.org/">here</a> for the recommendations for any of these]). The basic CDF idea is to specify how to make a unified document out of a bunch of these content types. So the first steps of defining a &quot;format&quot; have already been done for the components of a &quot;W3C&quot; compound document.</p>
	<p>The proposed recommendation talks about a document root in a &quot;host&quot; language (the only example given is XHTML) that includes subordinate content in some &quot;child&quot; languages either by &quot;reference&quot; or by &quot;inclusion&quot;.&#160; By reference is the most commonly used today. Web pages are put together with the host HTML page using URIs to reference images, SWF (Flash), SVG, etc. </p>
	<p>CDF uses &lt;object&gt; in XHTML to reference children. Since they mostly talk about XML markup languages, then inclusion is just a matter of sucking up the child into the host XML using name spaces to distinguish which is which. Another possibility, but not discussed, is to do as OOXML, ODF, and <a href="http://labs.adobe.com/technologies/mars/">Mars </a>have done and use <a href="http://www.pkware.com/index.php?option=com_content&amp;task=view&amp;id=64&amp;Itemid=107">ZIP</a> archives to collect all the document components into a single file (package). Like it or not, the CDF folks will have to face the idea that a fully self contained compound document will need: JPEG images (not in an XML markup language), OpenType fonts (not in an XML markup language), ICC color profiles (not in an XML markup language) and much more. So inclusion by &quot;sucking up&quot; into one XML file is not a very universal solution. Of course, they may not be interested in self contained compound documents. </p>
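	<p>To make the package idea a little more concrete, here is a minimal sketch in Python of collecting a host document and its non-XML components into one ZIP file. It is only an illustration: the part names and placeholder payloads are hypothetical and do not follow the actual package layout of OOXML, ODF, or Mars.</p>

```python
import io
import zipfile

# Hypothetical part names and placeholder payloads, for illustration only.
parts = {
    "index.xhtml": b"host document placeholder",
    "images/photo.jpg": b"JPEG bytes placeholder",
    "fonts/body.otf": b"OpenType bytes placeholder",
    "color/display.icc": b"ICC profile placeholder",
}


def build_package(components):
    """Write every (name, bytes) component into one in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in components.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


package = build_package(parts)

# Reading it back: the single file carries every component, XML or not.
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    names = archive.namelist()
    font_bytes = archive.read("fonts/body.otf")
```

	<p>The point of the sketch is only that binary components travel alongside the XML host without having to be expressed in an XML markup language.</p>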
	<p>The real technical content of the CDF recommendations is in details of how to glue these various (XML markup) languages together once they all have been processed into their respective DOMs. It establishes conventions for how a script might reach across DOM boundaries, how events might get propagated across DOM boundaries and stuff like that. I won't go into this any deeper because you will get more accurate information by just reading the W3C <a href="http://www.w3.org/2004/CDF/">documents</a>. But the main idea of CDF is to bring these variously defined content types into a uniform &quot;framework&quot; so that scripts can operate more at a document level instead of being confined to their own document child. </p>
	<p>But one thing that is really interesting to me and that  I disagree with is that they set up a &quot;framework&quot; for an arbitrary set of recommendations (remember that translates loosely to &quot;standards&quot;) one for each different compound document &quot;profile.&quot; You will find the following candidate recommendations on the website: <a href="http://www.w3.org/TR/2007/CR-CDR-20070718/">Compound Document by Reference Framework 1.0</a>, <a href="http://www.w3.org/TR/2007/CR-WICD-20070718/">WICD Core 1.0</a>, <a href="http://www.w3.org/TR/2007/CR-WICDMobile-20070718/">WICD Mobile 1.0 Profile</a>, <a href="http://www.w3.org/TR/2007/CR-WICDFull-20070718/">WICD Full 1.0 Profile</a>&#160;(WI is for Web Integration). So we do not get one compound document format but an arbitrary set of formats that all follow similar rules on how things are glued together. On the surface that sounds like a nice modular approach. But what I worry about is the proliferation of the profiles in what I believe is a &quot;<a href="http://blogs.adobe.com/insidepdf/2007/11/communication_via_documents.html">Communication by Documents</a>&quot; scenario. Given that this is about Web/Internet documents, they are by nature communication documents. Please read my <a href="http://blogs.adobe.com/insidepdf/2007/11/communication_via_documents.html">previous blog</a> where I describe this notion and claim that that was the design point for PDF. It seems strange to me that others haven't picked up on this notion and been more conservative about defining language profiles.</p>
	<p>So what is wrong with language profiles? Well, in a communications application, where you are trying to send document information from one person in the world to another person in the world, you really want a very limited set of standards that have to be followed on both ends to make the communications work. In my <a href="http://blogs.adobe.com/insidepdf/2007/11/communication_via_documents.html">previous blog</a> I talk about having red telephones and blue telephones and blues can only connect to blues and reds to reds. If you let the colors (profiles) proliferate you get a totally useless mess.</p>
	<p>CDF seems to be on a path to institutionalize such a mess. Most of their work so far has been on the framework and that is great and is a long time coming for the web <a href="http://blogs.adobe.com/insidepdf/2007/09/xml_for.html">XML markup languages</a>. But to leave the door open to multiple profiles for compound documents, and it seems like they are thinking <strong>many,</strong> is where things go astray.&#160; </p>
	<p>It is still salvageable if they pick one set of XML markup languages and edict that all conforming processors must process exactly this set, probably their Full profile. This has been another missing element of the web as we often get documents that do not come out correctly because we are missing some browser plug-in or other. </p>
	<p>And of course, PDF has gotten this right! </p>
	<p>Contact me at: <a href="mailto:jking@adobe.com">jking@adobe.com</a></p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/11/cdf.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/11/cdf.html</guid>
         <category></category>
         <pubDate>Thu, 29 Nov 2007 08:07:54 -0800</pubDate>
      </item>
            <item>
         <title>Communication via Documents</title>
         <description><![CDATA[<p>PDF documents are all about<em> information communication</em>. The communication aspect is often not completely appreciated by other people inventing document formats. And I am thinking here of the<a href="http://www.w3.org/2004/CDF/"> CDF</a>, <a href="http://www.oasis-open.org/committees/office/">ODF</a>,<a href="http://en.wikipedia.org/wiki/Office_Open_XML"> OOXML</a> folks. </p>
	<p>Let's look at a more familiar example of communication, the telephone. Within some limits of reason, we absolutely expect to be able to pick up any phone anywhere in the world and successfully talk to any other person in the world who also has a phone. That is amazing! But how is that accomplished?&#160; Answer: by using common standards to which all phones and all components of the phone system adhere. Now this blog is less of a plea on my part for standards than it is a plea to understand the very special properties of those standards that must be followed to allow a communication system to be built upon those standards. There are <em>document </em>standards and there are <em>communication</em> <em>document</em> standards and they can be quite different. </p>
	<p>Here is the kind of thing you want to avoid at all costs: three kinds of phones, red, blue and black and a phone of a given color can only connect to a phone of the same color. Wouldn't that be a mess! There actually is a tiny element of this among the cell phone vendors already when it comes to charging for calls. Some vendors offer free calling to any phone from the same vendor (of the same color). But there would be great alarm if those were the only calls that could be made. And how additionally confusing would it be if red phones could call blue but not black, and blue could only call blue, but black could call black and red but not blue. Something that complex would reduce the usage of phones by an order of magnitude.</p>
	<p>Well these same ideas apply to communicating information via documents. In this case the sender and receiver are software tools for authoring and displaying documents. In many cases the same software plays both roles of sender and receiver. If I put a document onto a website, I can easily anticipate that 100,000 people or more might read it. (I am waiting for those kinds of numbers for this blog.&#160; Tell your friends!)&#160; I want all of those people to be able to read it no matter what their computer or favorite software. So we have to have standards that are strictly adhered to as to the form those document files can take and also the software has to be written to strictly obey the standards. </p>
	<p>Let's look at a different situation, that of application specific save files, or what we sometimes call "native file formats."&#160; The primary function for these files is to save any work I am doing so that I can shut down the application and come back later to pick up where I left off. I am thinking of the .psd files of Adobe Photoshop, or the .doc files of Microsoft Word or any other file format closely aligned with a favorite application. Saving is not the only use but that is a key use. </p>
	<p>Typically these file formats get revised when a revised version of the product gets introduced. Typically the new application will read older versions of the files and the user can then save them in the new format. But if your files go back too far, your modern software may not be able to read them. Adobe's products are quite good at release after release being able to read all older files. Other companies are not so good about this. I think it could have something to do with putting pressure on all users of that software to keep updating to the newer versions. </p>
	<p>Certainly, most new format files will not successfully be read by the old software. They are new formats because they have new information in them unknown to the old programs.</p>
	<p>There is another aspect of these native file formats that is often overlooked. Since they are intended to serve as a snapshot of the work being done with that product, the files generally contain much more than the final appearance of the document. For example, Adobe Illustrator files can contain sets of patterns and color swatches that the user has defined to use with this particular file. They are saved and restored to/from the native .ai files. So these native files may be larger than really needed for communication purposes.</p>
	<p>And native file formats are not very good to use as archive files unless they are "turned over" every few years, updating the files to the new revised formats. I wrote about archiving and file formats earlier.</p>
	<p>So are <a href="http://en.wikipedia.org/wiki/Office_Open_XML">OOXML</a>, <a href="http://www.oasis-open.org/committees/office/">ODF</a> and <a href="http://www.w3.org/2004/CDF/">CDF</a> native file formats or communication formats?  Hard compromises have to be made for a format to be both and, so far, I cannot think of a single example where this has been done truly successfully. PDF is <em>not</em> a native file format and has been designed and managed as a <em>communications</em> format (only).</p>
	<p>One more aspect of communications formats is the toughest one to avoid, that of special versions or "profiles" of the format. Usually the formats are designed to be more ambitious than what can be easily handled by limited-function devices. So there is great pressure to define a subset of the original format that is limited in some ways to be supported by limited devices. This then fractures the standard and we get into the situation analogous to red and blue phones where red can call anyone but blue can only call blue. (Subset files can be read by anyone but full format files cannot be read by limited devices.)&#160; Not good!</p>
	<p>You might have noticed that I mentioned <a href="http://www.w3.org/2004/CDF/">CDF</a> (Compound Document Format<strong>s</strong>) for the first time in any of these blogs. That is because of a recent announcement by the <a href="http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9044723&intsrc=hm_list">Open Document Foundation</a>, Inc., previously a strong supporter of ODF, that it will now be helping to promote CDF. I will follow this blog with one about CDF.</p>
	<p>Anyway, the bottom line for the PDF architect: PDF is about the only format that has been specifically designed to be a communications format and it shows. </p>
	<p>Contact me at: <a href="mailto:jking@adobe.com">jking@adobe.com</a></p>
	]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/11/communication_via_documents.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/11/communication_via_documents.html</guid>
         <category></category>
         <pubDate>Fri, 16 Nov 2007 18:09:47 -0800</pubDate>
      </item>
            <item>
         <title>The Distiller Story</title>
         <description><![CDATA[<p>This just might be apocryphal but to the best of my knowledge this explains the  heritage of the Adobe Acrobat Distiller. The story also motivates one of the key ways that PDF deviates from PostScript. </p>
	<p>When you make a printer, you need to have some pages to print that really show off the great features of the printer. At least this was true in the very early '80s when small desktop copiers were being experimented with as printers. Adobe had invented PostScript as a device and resolution independent way to get sophisticated text, image and graphics output on a laser printer.</p>
	<p><a href="http://www.adobe.com/aboutadobe/pressroom/executivebios/johnwarnock.html">Dr. John Warnock</a>, one of the founders of Adobe, had decided to use the classic <a href="http://www.irs.gov/pub/irs-pdf/f1040.pdf">US 1040 tax</a> form as a great example of a complex and graphically rich document to print. It involves a lot of text, lines and shaded areas, the type of material that a laser printer can do a great job on but that typical printers of the day could not handle. Now remember that PostScript is a programming language and you can write subroutines to do various imaging jobs. So John made heavy use of subroutines as he hand-crafted the US 1040 tax form.  This was a lot of work but by carefully building up a nice hierarchical line drawing, box drawing and shading subroutine library he was able to get a very close 1040 facsimile to print on an experimental PostScript printer. Trouble was it took many minutes to print, many minutes. Not great for the intended purpose of demos.</p>
	<p>John was pretty sure that most of the processing time was due to the extensive subroutine use and not due to the actual rendering of the lines, shades and text. Also, being a typical programmer who would rather have a computer do the work, he set out to see if he couldn't automatically convert his 1040 PostScript program into one without subroutines at all. One of the features of PostScript is that you can redefine all of the operators in the language. So John wrote a PostScript program that, when loaded just before printing the 1040 form, redefined all the graphic rendering primitives to write out a text string that represented the PostScript for what that operator was being asked to do. Did I forget to mention that PostScript can also write out strings that get sent back to the computer to which the printer is attached? Well, it can. This may be a little subtle so let me try to make it more specific and understandable. He wrote a PostScript program to run on a PostScript printer whose purpose was to transform PostScript print jobs and send the transformed versions back to the computer to which the printer was attached. Suppose that one of the uses of the graphics primitives was buried deep in a subroutine stack and was written like this: "X1 Y1 moveto X2 Y2 lineto X3 Y3 lineto stroke", where the Xs and Ys were variables computed in some way, probably within a loop in the subroutines. What would get sent back to the computer in one pass of the loop might look like: "200 200 moveto 250 200 lineto 250 250 lineto stroke". In the second pass of the loop the same PostScript might send back: "300 300 moveto 350 300 lineto 350 350 lineto stroke". If the subroutine loop was executed 15 times then 15 strings like those would get sent back to the computer, but each would have different numbers to draw lines in different spots on the 1040. The interesting thing was that the operators in this new PostScript had no variables and the numbers they contained were the result of executing the original 1040 PostScript program. So if we do this for everything on the page, we end up with a rather unraveled and perhaps voluminous set of basic graphic utterances. If those are subsequently sent down to the printer they will produce the exact same 1040 form but much faster since there is no more programming language to interpret.</p>
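	<p>The redefinition trick can be sketched outside of PostScript. The following is a toy model in Python, not John's program: the class and function names (Recorder, draw_corner, page_program, distill) are hypothetical, and the "page" is just a loop that calls a subroutine with computed coordinates. The drawing primitives are replaced by versions that write out their already-evaluated arguments as text.</p>

```python
# Toy model of the "distiller" trick, in Python rather than PostScript.
# Recorder, draw_corner, page_program and distill are hypothetical names.

class Recorder:
    """Stands in for the redefined primitives: records text instead of drawing."""

    def __init__(self):
        self.lines = []      # one finished, variable-free path per entry
        self._pending = []   # operators seen since the last stroke

    def moveto(self, x, y):
        self._pending.append(f"{x} {y} moveto")

    def lineto(self, x, y):
        self._pending.append(f"{x} {y} lineto")

    def stroke(self):
        self._pending.append("stroke")
        self.lines.append(" ".join(self._pending))
        self._pending = []


def draw_corner(g, x, y, size):
    """A 'subroutine' whose coordinates are computed from variables."""
    g.moveto(x, y)
    g.lineto(x + size, y)
    g.lineto(x + size, y + size)
    g.stroke()


def page_program(g):
    """The original job: a loop that calls the subroutine with new values."""
    for i in range(2):
        draw_corner(g, 200 + 100 * i, 200 + 100 * i, 50)


def distill(program):
    """Run the program against the recording primitives and return flat text."""
    recorder = Recorder()
    program(recorder)
    return recorder.lines


distilled_lines = distill(page_program)
for line in distilled_lines:
    print(line)
```

	<p>Running the program against the recording primitives yields the two flat strings quoted above, with every variable and subroutine call already evaluated away.</p>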
	<p>The result was a 1040 form that could be used very effectively to demo the capabilities of PostScript and laser printers. But a side benefit of the work was this PostScript program which John called a "distiller" that you could load in front of any print job and it would convert it into a "distilled version"&#160; and send it back to the computer. The derived or distilled PostScript would produce the same results but didn't use any of PostScript's (slower) programming constructs or variables. </p>
	<p><a href="http://www.reid.org/glenn/">Glenn Reid</a> picked up John's program and fleshed it out to be more complete and to make it more widely available for people to experiment with. This exercise proved two things: that you could automatically "distill" PostScript programs into a more primitive form that didn't use variables or subroutines, and that the resulting simple PostScript was significantly faster to process and print.&#160; It might be larger, but that was a less serious problem, one that could be addressed by communicating with the printer more quickly. </p>
	<p>If you are following this at all, by now you should realize that this distiller made by John and Glenn went on to become the Adobe Acrobat Distiller and the ideas of simplifying PostScript to remove variables and subroutines led to those features being a mainstay of PDF.&#160; <a href="http://awards.acm.org/citation.cfm?id=4998157&srt=all&aw=149&ao=SOFTWSYS">Doug Brotz</a> next wrote the actual first version of what is now Acrobat Distiller by taking the Adobe PostScript interpreter which was written in C and modifying it to do the distiller function. Having that, you didn't need a PostScript printer to run the PostScript version of the distiller, as John and Glenn had done. Interestingly, Doug did his work on a NeXT machine. That machine's imaging model was based upon Display PostScript. It all swirls around! </p>
	<p>Contact me at:<a href="mailto:jking@adobe.com"> jking@adobe.com </a><BR/>
    </p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/11/the_distiller_story.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/11/the_distiller_story.html</guid>
         <category></category>
         <pubDate>Sat, 03 Nov 2007 13:31:37 -0800</pubDate>
      </item>
            <item>
         <title>PDF by Design</title>
         <description><![CDATA[<p>I named this blog "Inside PDF" because I anticipated telling you a lot about PDF technology – what is inside of a PDF file and why.&#160; I have spent most of the time so far talking about PDF and standards. So, I thought it was about time to do an entry about PDF itself. I believe that PDF has been so successful because of the caldron out of which it was brewed. By 1990 Adobe was quite successful with <a href="http://en.wikipedia.org/wiki/PostScript">PostScript</a>. By then we had helped over 60 other companies make printers, image setters, and other imaging devices that used PostScript. We had also shipped <a href="http://en.wikipedia.org/wiki/Display_PostScript">Display PostScript</a> and the Steve Jobs <a href="http://www.time.com/time/magazine/article/0,9171,968761-1,00.html">NeXT machine</a> was a computer whose operating system's imaging model was Display PostScript. So Adobe had had considerable experience in displaying documents on a screen.</p>
	<p>Here are the problems we found with using PostScript as an electronic document representation:</p>
	<ol>
	  <li>In those days, it was a real challenge to render PostScript pages fast enough for interactive applications.</li>
	  <li>PostScript is a stream oriented language so it is generally impossible to display any given page without having processed all pages preceding it.</li>
      <li>PostScript Level 1.0 had no compression technology so images and fonts made the files bigger than they needed to be (PostScript Level 2.0 which did have compression was released in early 1991).</li>
	  <li>PostScript files were very difficult if not impossible to reprocess (edit). Since PostScript is a programming language one had to face the issue of understanding what a computer program does, and in its most general terms that activity has been proven  to be impossible. </li>
	  <li>Having the correct fonts available when rendering the file.</li>
      <li>A PostScript interpreter/renderer had a relatively large footprint and MACs/PCs of those days were not very suitable targets for a PostScript interpreter. </li>
	</ol>
    <p>Adobe actually made several false starts during 1989 - 1991 at defining an electronic document format before the PDF project really took hold. There was one called Interactive PostScript and another very related to today's <a href="http://en.wikipedia.org/wiki/OpenDocument">ODF</a>. Another fact gets factored into this history as well: <a href="http://www.adobe.com/products/illustrator/">Adobe Illustrator</a> was built to support the PostScript imaging model and a relatively lightweight imager based upon PostScript code was inside Illustrator. That code formed the starting place for Acrobat. Adobe Illustrator also had its own stylized way to use the PostScript language so that it could re-read and edit the files and those ideas made their way into PDF as well.</p>
    <p>Unlike many projects, the Acrobat/PDF project had the benefit of this previous work and the project was founded  to design a language that got around the inherent limitations of PostScript for desktop display. So here are the design objectives set out for PDF from the start:</p>
    <ol>
      <li>Faster processing with software that would be suitable for PC and MAC viewers. </li>
      <li>Portable. This meant using printable ASCII characters throughout the file representation. Interestingly, this is similar to, and based upon, the same motivation as today's XML using Unicode for its file representation.</li>
      <li>Small file size. (use compression technology) </li>
      <li>Font independence. Embedding fonts was not considered to be the best solution because of their size. </li>
      <li>Single pass generation.</li>
      <li>Random access to contents, especially individual pages.</li>
      <li>Incremental updating of the file without a wholesale rewrite.</li>
      <li>Extensibility. The need to be able to add features without making either older files or older viewers obsolete. </li>
    </ol>
    <p>Most of these were actually spelled out at the front of the first <a href="http://www.adobe.com/devnet/pdf/pdfs/PDFReference13.pdf">PDF manual</a>. </p>
    <p>It was decided to throw out the programming language aspects of PostScript in favor of keeping the graphic primitives but having them driven by constant parameters rather than variables. There is another interesting story around this aspect and the Acrobat Distiller I will relate at another time.</p>
    <p>Perhaps the most serious issue with PostScript was its basic stream nature. The files were intended to be read and processed  in one pass from starting byte to ending byte. This allowed for documents of unlimited size to be processed by PostScript printers. But it also meant that a font that might have been downloaded on page 1 may be used on page 485 so you couldn't just take that section of the file pertaining to page 485 and process it without having processed all preceding pages.&#160;Consequently, in PDF each page is specified and processed totally independently. </p>
    <p>Certain features that took advantage of being an electronic document were foreseen, like hyperlinks off of the table of contents or linking to the proper page within document references (e.g., see figure x on page y).&#160; To support this feature with responsiveness, not only do the pages have to be represented independently, but it must be possible to read only that portion of the file that contains a given page and its needed resources (fonts, imaging, etc.). This is the random access item number 6.&#160; </p>
    <p>This requirement (6) lead to introducing the idea of a PDF "object" which is a rather low level building block that can be read from the file randomly. The whole file is made only of objects and of these object types: numbers (995, 3.65), strings ( (a string) ), names (/dog, /Jim), true, false, arrays (heterogeneous) ( [99 /dog 9.4 [1 2 3] true] ), dictionaries (key value pairs) (<</dog (bark) /cat (meow) /size 44>>), and streams (dictionaries followed by a bunch of bytes, arbitrary both in content and length). Space is the primary delimiter (not comma as in many languages) and strings are put into parenthesis.&#160; This follows the PostScript representation for the same primitive objects, but PostScript did not separate them out as addressable objects in the file. In PDF each addressable object (called an indirect object) gets a number. At the very end of the PDF file is a cross reference table that maps object numbers into locations within the file. So PDF readers start by reading the cross reference table from the end of the file and then read the objects randomly by number. There is also a starting root object number held in a special structure at the end of the file. I have an <a href="http://home.comcast.net/~jk05/presentations/PDFTutorials.html">annotated tutorial presentation</a> that goes into more detail about this. </p>
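    <p>The random access idea can be sketched in a few lines of Python. This is a deliberately simplified toy layout, not the actual PDF syntax: the header line, the "table" and "starttable" keywords, and the function names are all assumptions made for illustration. It shows only the mechanism of numbered objects, a table of byte offsets at the end of the file, and a reader that starts from the end and then jumps straight to the object it wants.</p>

```python
# Toy file layout inspired by the description above; NOT real PDF syntax.
# The keywords "table" and "starttable" and all function names are hypothetical.

def build_file(objects):
    """Serialize {number: body}, then append an offset table and a pointer to it."""
    buf = bytearray(b"%TOY-1.0\n")
    offsets = {}
    for num, body in objects.items():
        offsets[num] = len(buf)                      # byte position of this object
        buf += f"{num} obj\n{body}\nendobj\n".encode("ascii")
    table_pos = len(buf)
    buf += b"table\n"
    for num, off in offsets.items():
        buf += f"{num} {off}\n".encode("ascii")
    buf += f"starttable\n{table_pos}\n%%EOF\n".encode("ascii")
    return bytes(buf)


def read_table(data):
    """Start at the end of the file: find the table position, then parse the table."""
    marker = data.rfind(b"starttable\n")
    table_pos = int(data[marker:].split(b"\n")[1])
    lines = data[table_pos:marker].decode("ascii").splitlines()
    return {int(n): int(off) for n, off in (ln.split() for ln in lines[1:])}


def read_object(data, table, num):
    """Jump directly to one object by number without scanning what precedes it."""
    start = table[num]
    end = data.index(b"\nendobj", start)
    header, body = data[start:end].split(b"\n", 1)
    return body.decode("ascii")


data = build_file({1: "995", 2: "(a string)", 3: "[99 /dog 9.4 [1 2 3] true]"})
table = read_table(data)
third = read_object(data, table, 3)
```

    <p>Fetching object 3 touches only the tail of the file and the bytes of that one object, which is the property that makes jumping to an arbitrary page responsive.</p>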
    <p>Well, this is rather rambling on and getting too long. But I have made a couple of the most important points: the PDF design was driven by some well established design principles learned from previous attempts, and the random access feature of the design sets it apart from many if not most other document formats. I will come back to this in the future and fill in some more information on how the problems with PostScript led to really good design decisions for PDF.</p>
    <p>Contact me at <a href="mailto:jking@adobe.com">jking@adobe.com</a></p>
    <p><BR/>
    </p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/10/pdf_by_design.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/10/pdf_by_design.html</guid>
         <category></category>
         <pubDate>Wed, 31 Oct 2007 11:50:19 -0800</pubDate>
      </item>
            <item>
         <title>Archiving Documents</title>
         <description><![CDATA[<p>Archiving is a rather loaded word since doing it can be a widely varying activity. In many situations, archiving PDF files is a very good solution. In fact it was so attractive to some US Government agencies that they encouraged their personnel to work on an ISO committee/working group to define a special subset of PDF called PDF/A that meets their needs better than plain old PDF might.&#160;</p>
	<p>Generally PDF/A<strong> restricts</strong> the use of features of PDF that might make the visual presentation unreliable.&#160; Things like content that might change according to the day on which it is viewed. (Yes, you can do things like that in unconstrained PDF!)&#160; PDF/A also<strong> requires</strong> the use of certain features in PDF, such as including structural information that allows the content to be read, in a superior fashion, to people that are blind and to be extracted for reuse. The point of this blog is not to teach you the technology of PDF/A; you can learn  about PDF/A <a href="http://www.digitalpreservation.gov/formats/fdd/fdd000125.shtml">here</a> and <a href="http://www.pdfa.org/">here</a>. </p>
	<p>I wanted to talk more about a couple of alternatives that come up, namely ODF and OOXML. I sometimes wonder if I am losing it, because I have such a hard time understanding some other people's (wrong) viewpoints. Both ODF and OOXML are editable document formats; in fact, they are defined as companions to editing programs that used to be called word processing software. If you want to change the document you can read one of these files into one of these editors and change away. This is a great virtue, if that is what you want to do.&#160; But would Silver Bow County, Montana want me to be able to edit my birth certificate?&#160; I think not.&#160;And yes, that is my real birth place, Butte, Montana (in Silver Bow County). Do we want our court records to be editable?&#160; I think not. So why would I want to choose one of the editable document formats for such records? And this is not a straw man argument because people are repeatedly saying that ODF and OOXML are ideal archiving formats for such material.</p>
	<p>Now to be fair, they base their belief that these will be great archiving formats primarily because they are "XML." I have already posted a blog about <a href="http://blogs.adobe.com/insidepdf/2007/09/xml_documents.html">XML Documents</a> where I point out that although these document formats do contain XML subfiles, they are in truth <a href="http://www.pkware.com/index.php?option=com_content&task=view&id=64&Itemid=107">ZIP Archives</a>. Nonetheless, to be fair, much of the information you might want to reuse or edit in these archives is in an XML markup language of one kind or another. I also have a <a href="http://blogs.adobe.com/insidepdf/2007/09/xml_for.html">previous blog</a> about why it isn't very helpful to claim something is XML. It is much more informative to say it is XML for [something].  And of course, if some XML is for [something] it might not be suitable for [something else]. </p>
	<p>Of course, I believe that PDF/A is the current best choice for archiving "documents of record." &#160;PDF was originally designed to capture print output from any application in such a way that it could be  viewed and navigated efficiently as well as easily printed to paper.  Adobe has added features to support ways of "editing" PDF files and those features are highly desired by many, but PDF/A does a good job of preserving that original PDF design for capturing documents that might otherwise be printed. Here is a diagram that I made to emphasize the differences between PDF and the other two document formats.<img src="http://blogs.adobe.com/insidepdf/OfficeFormats.png" width="310" height="302" hspace="100" /></p>
	<p>The upper left represents those files using OOXML and the upper right those using ODF.&#160; These two formats, as well as this discussion, cover the three authoring paradigms for: documents, spreadsheets and projector presentations.&#160; PDF can uniformly capture any of these types of documents. ODF and OOXML are what Adobe has traditionally called "native file formats" for their respective editing applications. If I want to go to lunch in the middle of an authoring activity, I save my work in a native file format.&#160; The above diagram also shows the difference between <em>editing</em> and<em> reuse</em>. The dotted line going from the PDF file back to the editing applications indicates that the content of PDF files can be reused. When considering our archiving needs we also need to distinguish between reuse and editing; reuse might be important and editing not. PDF files do support content reuse. </p>
	<p>Back to archiving. As I said at the start, archiving documents can mean many things. I often create documents that I might want to recreate and edit many years after I did the original work and it is great to be able to find the old files to update. Sometimes I cannot find the files and sometimes I no longer have the correct editor or the correct version of the editor to read them. But the claim is that these new "XML-friendly" formats will make that better. I am not sure exactly why this belief is held strongly by some, but I guess if you make the formats ISO standards, the software developers may be more careful to read old files with new programs.</p>
	<p>Adobe has gone to great lengths to make sure that PDF is upward compatible and with few exceptions any proper PDF file that exists can be viewed with Adobe Reader. We plan to continue to do this. </p>
	<p>So there might be archiving reasons that call for using PDF and there might be archiving objectives that call for using ODF or OOXML. Somehow I cannot help but believe that the number of archiving needs for non-editable formats far exceeds that for editable forms.&#160; After all, archiving means preserving things, and usually you want to preserve the total look of a document. PDF/A does that. </p>
	<p>Before I conclude, I want to say just a bit more about XML. There is what I think is a rather technically shallow belief that XML files are easier to work with and will survive the passing of time, even great periods of time, better than other formats. The text held within XML files can usually be viewed with any generic text editor and I guess that gives people a warm feeling that it will therefore also be easier to retrieve with a program. Fair enough. But what is glossed over way too much is that that text is enveloped within XML for [something]. (<a href="http://blogs.adobe.com/insidepdf/2007/09/xml_for.html">See my earlier blog entry.</a>) The envelopes (schemas) offered by ODF and OOXML are different. Different enough that a simple program cannot extract just the raw text from either. And is that all I <em>really</em> want from a document in the future, the raw text? Because when you get to the layout and the images and the color space definitions and the fonts, these things do not lend themselves well to XML and are often stored within the ZIP archives as binary data. So tell me again where the advantage to XML is for this purpose? </p>
	<p>Now I will give two more hits against the myth that ODF and OOXML will be easier to process at some distant date because much of the content is XML. First is the fact that the XML is <a href="http://www.ietf.org/rfc/rfc1951.txt">Flate compressed</a> in both formats, making it impossible to read in a generic editor until after it has been de-Flated, and second, do we really believe that 50 or 100 years from now our software and computer technology will still be in a state where anyone will give a hoot whether something is XML or not? Will there not be something far superior invented by then?</p>
	<p>By the time you worry about these issues you will find that processing PDF files will be as easy or maybe even easier than those XML-friendly versions.</p>
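	<p>The first point is easy to try out. The following is a small sketch, assuming Python and its standard zlib module, that Flate-compresses a made-up XML fragment the way a ZIP archive would (raw DEFLATE per RFC 1951) and then inflates it again. The sample markup is invented for the illustration.</p>

```python
import zlib

# Hypothetical XML content, repeated so there is something worth compressing.
xml_text = b"<para>Archiving means preserving things.</para>" * 20

# Negative wbits selects raw DEFLATE (RFC 1951) with no zlib header or checksum,
# which is the form stored inside ZIP archives.
compressor = zlib.compressobj(level=9, wbits=-15)
deflated = compressor.compress(xml_text) + compressor.flush()

# Only after inflating do we get back bytes that a text editor could show as XML.
inflated = zlib.decompress(deflated, wbits=-15)

print(len(xml_text), "bytes of XML became", len(deflated), "bytes of Flate data")
```

	<p>Opening the deflated bytes in a generic editor shows binary data, not markup; the XML only reappears after the inflate step.</p>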
	<p>Contact me at <a href="mailto:jking@adobe.com">jking@adobe.com</a> </p>
	<p> <BR/>
    </p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/10/archiving_documents.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/10/archiving_documents.html</guid>
         <category></category>
         <pubDate>Mon, 22 Oct 2007 11:18:28 -0800</pubDate>
      </item>
            <item>
         <title>GosCon07</title>
         <description><![CDATA[
	<p>I just completed a day and a half attending the <a href="http://goscon.org/home">Government Open Source Conference</a> (GOSCON) in Portland, Oregon. This was the third such conference, but the first I have attended.&#160; I attended for one very PDF-related reason. I found these sentences on their <a href="http://goscon.org/node/539">web page</a> publicizing the conference:</p>
	<blockquote>
	  <p> &quot;The maintenance and exchange of the most important records such as certificates of birth, marriage and death, taxes, licenses, deeds, laws, regulations, codes and rules -- as well as their role in business processes -- are the responsibility of IT officers around the globe.&quot; (see the description of <a href="http://goscon.org/home">GOSCON</a>'s Executive Panel) </p>
    </blockquote>
    <p>To me this sounded like an exact match for <a href="http://www.pdfa.org/">PDF/A</a>, the archiving subset of PDF. Yet the only document formats mentioned in the rest of the material were <a href="http://en.wikipedia.org/wiki/Office_Open_XML">OOXML</a> and <a href="http://www.oasis-open.org/committees/office/">ODF</a>. The <a href="http://goscon.org/speakers">speaker list</a> seemed to invite yet another back and forth between <a href="http://goscon.org/node/542">Microsoft and ODF supporters</a>. Adobe wants PDF (or <a href="http://www.pdfa.org/">PDF/A</a>, or <a href="http://www.aiim.org/standards.asp?ID=27860">PDF/E</a>, or <a href="http://en.wikipedia.org/wiki/PDF/X">PDF/X</a>) to be the file format of choice when most appropriate. There are more PDF files on the Web than ODF, OOXML, and .doc all together. There are billions of PDF files in existence. PDF was announced in June 1993 and with few exceptions every one of those files is still viewable and printable with today's software from <a href="http://www.pdfstore.com/vendors.asp">many vendors</a> including many <a href="http://www.sourceforge.org">open source projects</a>.&#160; It doesn't attest to what will be true 50 or 100 years from now, but PDF has done better than any other document format for maintaining its archiving ability for over 14 years. Yet PDF was completely unmentioned on a panel about <a href="http://goscon.org/sessions#516">Open Document Formats</a>, and unmentioned in any of the other conference material. &#160; So one of my colleagues sent off a note to the organizers asking if Adobe could get someone on the panel to bring PDF into the discussion. For whatever reason, we had only heard about this conference at the last moment. But to our surprise the organizers agreed that I could be part of that panel. So off to Portland for Monday and Tuesday morning!&#160;</p>
    <p>I had some trepidation since I am certainly not an open source advocate and even though Adobe has some significant <a href="http://www.mozilla.org/projects/tamarin/">open source work (Tamarin) </a> underway, I thought I might be attacked by the vociferous crowd that worships at the open source altar. Might I be provoked to defend myself in an unprofessional manner? It isn't that I don't think people should do open source, it is just that I basically don't get it. I am a true capitalist at heart and I was just a bit old to be one of the back to basics persons of the 60s and in order for me to understand open source I need to know how the programmers get paid for all this &quot;free&quot; work.</p>
    <p>Well I did learn a great deal by attending this conference, am definitely more tolerant and understanding of the idea, but not quite yet a true believer.&#160; This conference had just over 150 attendees with about 1/2 from Oregon, mostly city and state government employees, and the rest of the attendees were from all over the world. <a href="http://opensource.mampu.gov.my/">Malaysian Government </a> had a delegation of 3 or 4 since they have been quite successful in using open source software. The Conference was, how shall I say, definitely Linux supportive, although people did talk about lots of other things. I would rate the average <a href="http://goscon.org/speakers">speaker</a> as &quot;B&quot; or &quot;C+&quot; with a few outstanding and a few not so good. The technical level was somewhat low with too much evangelical promotion. </p>
    <p>I guess my key take-away was as follows. Big government can afford to hire vendors to write custom software for them to accomplish their goals. Smaller government groups have many, if not all, of the same needs and yet they cannot afford to get custom software created or to create it themselves. And I guess the vendor community is not doing a thorough job of building turn-key software that gets a large enough sales coverage to be cheap enough for this market. Something like that. And this conference had a lot of attendees from smaller government groups. It ranged from people from IT departments under correctional institutions to people representing 5-person police departments to city IT personnel. What makes a great deal of sense for these government agencies is to band together to define and get the software that they require created. Open source methods fit this pretty well. Consortia might do just as well. The agencies have money but it would go a lot further if they could share costs 10 ways, or 50 ways or 1000 ways.</p>
    <p>There were three keynotes, which were pretty good (<a href="http://goscon.org/?q=node/508">head of Linux Foundation</a>, <a href="http://goscon.org/?q=node/513">director of Eclipse Ecosystem </a>, <a href="http://goscon.org/?q=node/498">VP from Gartner</a>) and then 4-way parallel sessions. One speaker backed out at the last minute and so they offered <a href="http://goscon.org/speakers#489">Doug Johnson</a> from Sun and me the opportunity to organize a session on Open Source versus Open Standards.&#160; Neither of us is shy about speaking our mind so for a last minute session, we did OK; I would give us a &quot;B-&quot;. </p>
    <p>Overall I'm glad I attended. I forced some attention to PDF, the primary objective, and I learned more about open source and perhaps more importantly how open source advocates reason.&#160; And I did attend <a href="http://goscon.org/?q=node/542">one session</a> where a Microsoft guy and an IBM guy squared off on OOXML versus ODF.&#160;Surprisingly, there were fewer than 15 people attending. Maybe we are getting tired of this argument. But I would have been disappointed if there had not been such a session. I think the IBM guy won by a small bit. &#160; Contact me at <a href="mailto:jking@adobe.com">jking@adobe.com</a>. <BR/>
    </p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/10/goscon07.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/10/goscon07.html</guid>
         <category></category>
         <pubDate>Wed, 17 Oct 2007 05:19:13 -0800</pubDate>
      </item>
            <item>
         <title>High Drama!  -- curtain call</title>
         <description><![CDATA[
	<p>Microsoft recently received an award that I am sure they will just love. The FFII (<a href="http://www.ffii.org/">Foundation for a Free Information Infrastructure</a>) has awarded Microsoft the &quot;<a href="http://press.ffii.org/Press_releases/FFII_awards_Microsoft_%22Best_Campaigner_against_OOXML_Standardization%22_prize">Best Campaigner Against OOXML</a>&quot; distinction. Apparently this is part of a larger campaign on the part of FFII against the ratification of OOXML as an ISO/IEC standard. I guess it is an attempt to put some humor, at Microsoft&#8217;s expense, into a rather serious issue. </p>
	<p>This is a serious issue because it is all about competition in a market that produces critical revenue for Microsoft in the tens of billions of dollars. How would you like to be the Microsoft executive held responsible for allowing competition to take 10% or 20% of that revenue away? The irony is that the people that are threatening to do that are open source advocates; supposedly most of the money not going to Microsoft would be going to the customers.<br />
      <br />
      But all of this makes me nervous. As I posted earlier, I am the technical person and the technical lead in handing the PDF standard off to ISO. I do not want Adobe to get the next FFII award. &#160;I wrote a previous blog about this and especially about fast tracking standards, which I think is where the big difference lies between what Microsoft has done and what Adobe is doing. <a href="http://blogs.adobe.com/insidepdf/2007/09/fast_tracking.html">Check it out.</a><br />
      <br />
      Selling the Adobe products that depend upon PDF is also big business for Adobe. &#160;But I am also very confident that the character of Adobe employees and especially of our executives is strong enough to make the right choices between big money pressures and doing the right thing. &#160;Moving PDF down the ISO fast track is the <em>right thing</em>. &#160;--  Jim King <BR/>
	</p>
	]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/10/high_drama_curtain_call.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/10/high_drama_curtain_call.html</guid>
         <category></category>
         <pubDate>Fri, 05 Oct 2007 14:50:27 -0800</pubDate>
      </item>
            <item>
         <title>Submission of PDF to ISO</title>
         <description><![CDATA[<p>I have written around this subject before but I have never given it a complete run through. I am going to do that now. On January 29, 2007 Adobe Systems announced that it would work with AIIM and ISO’s Technical Committee (TC) 171 to make PDF 1.7 an ISO standard.</p>
	<p>Following this announcement, on July 2, 2007, the PDF 1.7 Reference, converted to follow ISO template and styling rules, was sent out as a Draft International Standard (DIS) ballot, under the ISO “Fast Track” procedure, to the country standards organizations that are members of ISO/TC 171. This process is managed by AIIM on behalf of ISO.</p>
	<p>The ballot closes on December 2, 2007. Following ISO rules, if the ballot passes and all comments can be resolved, then the DIS will become an ISO Standard: “ISO 32000 — Document management — Portable document format — PDF 1.7”. If the ballot passes but one or more ballots contain unresolved comments then, after addressing the comments, a second Final Draft International Standard (FDIS) ballot that lasts 2 months will be conducted. If that passes (no comments allowed) then it will become an ISO Standard.</p>
	<p>Since after a standard has been approved it takes ISO several months to officially publish it, we can expect the ISO 32000 PDF 1.7 standard document to be published by ISO sometime in 2008.</p>
	<p>Additional Notes:</p>
	<ul>
	  <li> There are currently 14 countries that are voting P-members of ISO TC 171.</li>
      <li>Adobe converted the PDF 1.7 Reference into an ISO Draft by 1) using the ISO template and styling directives, 2) removing all Adobe and Acrobat specific information making the document truly vendor neutral, 3) removing obsolete material and 4) adding some material required by ISO including terms and definitions and required introductory material. The draft was also required to use International English spelling and not American English spelling.<br />
      </li>
	  <li>Once PDF becomes an ISO standard, Adobe will be just one of many implementors of that standard. Adobe expects to participate on the ISO committees determining changes that should be included in a revised standard.<br />
      </li>
	  <li>Adobe has agreed to abide by ISO intellectual property rules governing patent licensing.<br />
      </li>
	  <li>Adobe will continue to innovate with respect to Acrobat, LiveCycle and PDF, and will work with the ISO committees on subsequent versions of the ISO 32000 standard.<br />
      </li>
	  <li>AIIM, with Adobe’s support, submitted PDF 1.7 to ISO TC 171 for fast track processing as an existing de facto standard. A rather different approach could have been to have some other organization, like ECMA, standardize PDF and then submit it to ISO as their existing standard. It would also have been possible to submit to the Joint Technical Committee (JTC1) which is a cooperation between ISO and the International Electrotechnical Commission (IEC). The same path through AIIM to ISO TC 171 has been followed in establishing PDF/A (ISO 19005) and PDF/E (ISO 24517). PDF/X (ISO 15930) took yet a different path.<br />
      </li>
    </ul>
	<p>Further Information:</p>
	<ul>
	  <li>AIIM PDF Standard: http://www.aiim.org/standards.asp?ID=33223</li>
      <li>ISO Process: http://www.iso.org/iso/en/stdsdevelopment/whowhenhow/proc/proc.html</li>
	  <li> PDF 1.7 Reference: http://www.adobe.com/devnet/pdf/pdf_reference.html</li>
	  <li> Q&amp;A on Adobe website: http://www.adobe.com/pdf/release_pdf_faq.html</li>
	  <li>Adobe Contact: Nora Calvillo &lt;calvillo@adobe.com&gt;</li>
	</ul>
	<BR/>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/10/submission_of_pdf_to_iso_1.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/10/submission_of_pdf_to_iso_1.html</guid>
         <category></category>
         <pubDate>Tue, 02 Oct 2007 10:00:57 -0800</pubDate>
      </item>
            <item>
         <title>XML Documents</title>
         <description><![CDATA[<p>Today I hope to tie together two previous blogs about <a href="http://blogs.adobe.com/insidepdf/2007/09/high_drama.html">OOXML</a> and  about <a href="http://blogs.adobe.com/insidepdf/2007/09/xml_for.html">XML For ...</a> . &#160;<br />
<br />
I am sure you have often heard the term &#8220;XML Document.&#8221; &#160;I hope you realized that that term is nearly meaningless just like the term &#8220;XML.&#8221; &#160;We should never use either in polite conversation. Let me tell you some of the totally different uses for the term &#8220;XML Document&#8221; which render it a useless term, and maybe you will agree with me to banish it from our vocabularies.<br />
</p>
    <p>A) &#160;First if you know the history of <a href="http://en.wikipedia.org/wiki/IBM_Generalized_Markup_Language">GML</a>, <a href="http://en.wikipedia.org/wiki/SGML">SGML</a> to <a href="http://www.w3.org/xml/">XML</a> (see <a href="http://xml.coverpages.org/sgmlhist0.html">this</a> or <a href="http://www.itwriting.com/xmlintro.php">that</a>, for example) the primary intended use was to mark up text to break it into flowable paragraphs and mark strings of words as headings, lists, footnotes and so on. The notion was to separate the "form" from the "content" by labeling the content semantically not stylistically. The form is introduced later and allows the content to be repurposed multiple times with different styling. In these uses, the predominant material found in an XML file is the words that make up the text. And in most cases the text within a paragraph is uninterrupted with markup except for an occasional emphasized phrase (bold) or font change.<br />
      <br />
This is what I always think of as the typical XML Document.&#160; A good modern example of this is the XML/SGML <a href="http://www.docbook.org/">DocBook</a> markup language. <br />
<br />
B) But I have had conversations with IS people where XML Document to them means the kind of structured data you can withdraw from a database and represent using some XML markup language. For example, think of a business card XML markup language where a person's name, address, phone, etc. are held with appropriately named XML elements. All content, no form. I think of this more as a data packet or dataset than a document but ... .&#160; When I asked the IS folks about the "artwork" that would make the data into a nice looking presentation, their reaction was, "Oh, yes, that. That is handled in another department where they have designers and are concerned about publishing.&#160; But our <em>documents</em> are what you have just referred to as datasets." <br />
      <br />
      C) The one that really blows my mind is the XML markup language <a href="http://www.w3.org/Style/XSL/">XSLT</a>. It actually codes a rather unusual style of programming language into XML elements and attributes. The most interesting thing within these files is the markup and not the text content. Those XSLT files are XML Documents too!  Often <em>XML Document</em> is just a synonym for <em>XML file</em>.<br />
      <br />
      D) The final example, more dear to my heart, is <a href="http://www.w3.org/Graphics/SVG/">SVG</a> (Scalable Vector Graphics), another XML markup language. SVG can be used to express a lot more than graphics as it has extensive text support, and it can also place images into the content. SVG describes the contents of a rectangle, not multiple pages although there are movements to extend it. But I am sure you will hear people talk about SVG Documents. <br />
      <br />
      The most interesting thing is that SVG is not primarily based upon a flow model as is, say, the <a href="http://www.w3.org/MarkUp/">XHTML</a> markup language, but it is based upon a 2D model of the rectangular canvas and anything can be placed at any starting place within the canvas. Using this model, text must be broken up into strings whose characters all stay along the same baseline. So a paragraph of text can be represented, not typically in a flowable form but broken into the individual lines. First it must be formatted in order to be output as this kind of SVG. &#160;What is worse, there is nothing to stop one from making an SVG file where the order of lines occurring in the SVG file bears no relationship to where they fall on the canvas or their normal reading order. So, for such a file, the lines have to be sorted by position on the canvas in order to convert them back into flowable paragraphs. And what if there were lines with hyphenations? &#160;The hyphenations have to be removed to make flowable text paragraphs.<br />
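      <br />
      To show what that recovery step involves, here is a rough sketch in Python using the standard xml.etree.ElementTree module. The sample SVG and the reflow function are invented for this illustration, and the logic is deliberately naive: it assumes a single column, one text element per line, and that every trailing hyphen is a line-break hyphen (which would wrongly join a genuinely hyphenated word).<br />

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical SVG: the three lines are stored out of reading order.
svg_source = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <text x="10" y="60">ment back into a paragraph.</text>
  <text x="10" y="20">Lines can appear in any</text>
  <text x="10" y="40">order, so sort to recover the docu-</text>
</svg>"""


def reflow(svg_text):
    """Sort text lines by canvas position and join them into one paragraph."""
    root = ET.fromstring(svg_text)
    lines = []
    for node in root.iter(SVG_NS + "text"):
        # Sort key: top to bottom (y), then left to right (x).
        lines.append((float(node.get("y")), float(node.get("x")), node.text or ""))
    lines.sort()

    paragraph = ""
    for _y, _x, text in lines:
        if paragraph.endswith("-"):
            # Naive de-hyphenation: drop the hyphen and join the word halves.
            paragraph = paragraph[:-1] + text
        elif paragraph:
            paragraph += " " + text
        else:
            paragraph = text
    return paragraph


flowed = reflow(svg_source)
print(flowed)
```

      Even this toy needs a sort and a hyphen rule before the text flows again, and a real page with columns, rotated text, or varying fonts would need considerably more.<br />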
      -----<br />
      OK, enough of XML Documents. &#160;The point being it is better to talk about the specific XML markup language being used than XML Documents in general. &#160;"XML Document" is meaningless.<br />
      <br />
      But it gets worse. &#160;And this one I blame on the technical people letting the marketing and PR people get away with too much liberty in terminology. &#160;I personally fought this battle within Adobe and so far have won. &#160;<br />
      <br />
      If you look at the new standard XML office document formats (e.g., <a href="http://en.wikipedia.org/wiki/Office_Open_XML">OOXML</a>, <a href="http://www.odfalliance.org/">ODF</a> and <a href="http://www.idpf.org/">OPS</a>) that are widely referred to as XML Documents, none of them are XML files! &#160;They are all <a href="http://www.pkware.com/index.php?option=com_content&task=view&id=64&Itemid=107">ZIP</a>&#160;archives that contain, among  the sub-files in the archive,  <em>some</em> XML files. So at the very best they are a single file representation of a <em>set</em> of XML files. &#160;This is true. &#160;Look at <a href="http://en.wikipedia.org/wiki/Office_Open_XML">OOXML</a>, <a href="http://www.odfalliance.org/">ODF</a>,  <a href="http://www.idpf.org/">OPS</a> and, yes, even Adobe&#8217;s new XML-ized PDFs or <a href="http://www.adobe.com/go/mars">Mars</a>. As files they are all ZIP archives with some additional conventions and rules (e.g., <a href="http://openxmldeveloper.org/archive/category/1002.aspx">OOXML's OPC</a> and <a href="http://www.idpf.org/ocf/ocf1.0/index.htm">IDPF's OCF</a>) controlling how to organize the sub-files. &#160;And the reason they are ZIP archives is that there are many binary formats such as JPEG files that just cannot by any reasonable means be formatted as an XML markup language. These &#8220;office document formats&#8221; all want to represent compound documents that contain binary images, binary fonts, binary color profiles and so on. These binary resources go into a ZIP archive just fine.&#160; They just don't work as XML.<br />
      <br />
      Another great reason to use ZIP archives is that the format supports <a href="http://www.ietf.org/rfc/rfc1951.txt">Flate</a> compression (used by <a href="http://www.w3.org/TR/PNG/">PNG</a>) of the sub-files and for XML files that can make them much smaller. It is also possible to reach into a ZIP archive and read any particular sub-file without &#8220;unpacking&#8221; the archive. That gives the software N access points into the archive if there are N sub-files. And any sub-file can be read without reading any of the others.<br />
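      <br />
      As a small illustration of those N access points, here is a sketch using Python's standard zipfile module. The archive is built in memory, and the sub-file names and contents are made up for the example; they do not follow the actual layout rules of OOXML, ODF or any other format.<br />

```python
import io
import zipfile

# Build a small archive in memory with two XML sub-files and one binary one.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", "<doc><p>Hello</p></doc>")
    zf.writestr("styles.xml", "<styles/>")
    # Stand-in bytes only; this is not a complete JPEG file.
    zf.writestr("images/logo.jpg", bytes([0xFF, 0xD8, 0xFF, 0xE0]))

# Reopen the archive and pull out a single sub-file by name.
# zipfile locates it through the directory stored at the end of the archive,
# so the other sub-files are never decompressed.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()
    styles = zf.read("styles.xml")

print(names)
print(styles)
```

      Only the requested sub-file is inflated; the rest of the archive stays packed.<br />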
      <br />
      Admittedly the primary sub-files in the document archive files are written in some XML markup language and in most cases make up the majority of the material. &#160;But to call these ZIP files XML Documents, or even XML files, somehow seems dishonest and an attempt to build unfairly upon the tremendous  marketing hype that XML has obtained. <br />
      <br />
      If you look at the Adobe <a href="http://www.adobe.com/go/mars">Mars literature</a> you will find that Mars is called <em>XML-friendly</em>. &#160;That is my doing. &#160;I just thought it was too much of a stretch to call a ZIP archive an XML Document or an XML file. &#160;I think the other groups should also change their terminology to something more technically accurate.&#160; When we misuse technical terminology, or even stretch meanings too much, we just create massive confusion and misunderstandings and we make it even harder to communicate with each other. <br />
      <br />
      In yet another blog I will go into why XML is rather over hyped and why we need to be more accepting of other technologies such as ZIP archives and binary JPEG files. &#160;I also hope to post an entry on why ZIP archives are really not the best technical solution for these applications. But those are for another day.<br />
      <br />
      Contact me at: &#160;<a href="mailto:jking@adobe.com">jking@adobe.com</a> <BR/>
	</p>
  ]]></description>
         <link>http://blogs.adobe.com/insidepdf/2007/09/xml_documents.html</link>
         <guid>http://blogs.adobe.com/insidepdf/2007/09/xml_documents.html</guid>
         <category></category>
         <pubDate>Mon, 17 Sep 2007 16:50:33 -0800</pubDate>
      </item>
      
   </channel>
</rss>
