Archive for September, 2007

September 17, 2007

XML Documents

Today I hope to tie together two previous blogs about OOXML and about XML For … .  

I am sure you have often heard the term “XML Document.”  I hope you realized that that term is nearly meaningless just like the term “XML.”  We should never use either in polite conversation. Let me tell you some of the totally different uses for the term “XML Document” which render it a useless term, and maybe you will agree with me to banish it from our vocabularies.

A)  First, if you know the history of GML, SGML to XML (see this or that, for example), the primary intended use was to mark up text to break it into flowable paragraphs and mark strings of words as headings, lists, footnotes and so on. The notion was to separate the “form” from the “content” by labeling the content semantically, not stylistically. The form is introduced later and allows the content to be repurposed multiple times with different styling. In these uses, the predominant material found in an XML file is the words that make up the text. And in most cases the text within a paragraph is uninterrupted by markup except for an occasional emphasized phrase (bold) or font change.

This is what I always think of as the typical XML Document.  A good modern example of this is the XML/SGML DocBook markup language.

B) But I have had conversations with IS people where XML Document to them means the kind of structured data you can withdraw from a database and represent using some XML markup language. For example, think of a business card XML markup language where a person’s name, address, phone, etc. are held with appropriately named XML elements. All content, no form. I think of this more as a data packet or dataset than a document but … .  When I asked the IS folks about the “artwork” that would make the data into a nice looking presentation, their reaction was, “Oh, yes, that. That is handled in another department where they have designers and are concerned about publishing.  But our documents are what you have just referred to as datasets.”
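To make the "dataset" reading concrete, here is a minimal Python sketch of such a business card. The element names (businessCard, name, address, phone) and the sample values are invented for illustration and do not come from any real business card markup language.

```python
import xml.etree.ElementTree as ET


def build_business_card(name, address, phone):
    """Build a data-style XML record. Element names are illustrative only."""
    card = ET.Element("businessCard")
    ET.SubElement(card, "name").text = name
    ET.SubElement(card, "address").text = address
    ET.SubElement(card, "phone").text = phone
    return card


card = build_business_card("Pat Example", "1 Sample Street, Anytown", "555-0100")
xml_text = ET.tostring(card, encoding="unicode")
print(xml_text)

# Reading it back is a lookup by field name, not a reading of flowing text.
parsed = ET.fromstring(xml_text)
fields = {child.tag: child.text for child in parsed}
print(fields["phone"])
```

Nothing in the record says how the card should look on paper; that is exactly the "all content, no form" situation described above.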

C) The one that really blows my mind is the XML markup language XSLT. It actually codes a rather unusual style of programming language into XML elements and attributes. The most interesting thing within these files is the markup and not the text content. Those XSLT files are XML Documents too! Often XML Document is just a synonym for XML file.

D) The final example, more dear to my heart, is SVG (Scalable Vector Graphics), another XML markup language. SVG can be used to express a lot more than graphics as it has extensive text support, and it can also place images into the content. SVG describes the contents of a rectangle, not multiple pages although there are movements to extend it. But I am sure you will hear people talk about SVG Documents.

The most interesting thing is that SVG is not primarily based upon a flow model as is, say, the XHTML markup language, but upon a 2D model of the rectangular canvas, and anything can be placed at any starting place within the canvas. Using this model, text must be broken up into strings whose characters all stay along the same baseline. So a paragraph of text can be represented, not typically in a flowable form but broken into the individual lines. First it must be formatted in order to be output as this kind of SVG.  What is worse, there is nothing to stop one from making an SVG file where the order of lines occurring in the SVG file bears no relationship to where they fall on the canvas or their normal reading order. So, for such a file, the lines have to be sorted by position on the canvas in order to convert them back into flowable paragraphs. And what if there were lines with hyphenations?  The hyphenations have to be removed to make flowable text paragraphs.
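The sort-and-rejoin step can be sketched in a few lines of Python. This is a deliberately naive sketch under strong assumptions: every line is its own text element with plain numeric x and y attributes, there are no transforms or nested tspan elements, and every line-ending hyphen is a line-break hyphen (which real text does not guarantee). The function name and the sample SVG are made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_lines_to_paragraph(svg_text):
    """Rebuild one flowable paragraph from positioned SVG text lines."""
    root = ET.fromstring(svg_text)
    lines = []
    for el in root.iter(SVG_NS + "text"):
        x = float(el.get("x", "0"))
        y = float(el.get("y", "0"))
        lines.append((y, x, (el.text or "").strip()))

    # SVG y grows downward, so ascending y is top-to-bottom reading order;
    # x breaks ties for strings that share a baseline.
    lines.sort(key=lambda item: (item[0], item[1]))

    paragraph = ""
    for _, _, text in lines:
        if paragraph.endswith("-"):
            # Assumption: a trailing hyphen always marks a broken word.
            paragraph = paragraph[:-1] + text
        elif paragraph:
            paragraph += " " + text
        else:
            paragraph = text
    return paragraph


# The lines are stored out of reading order on purpose.
sample = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <text x="10" y="60">up into individual lines.</text>
  <text x="10" y="20">A paragraph must be for-</text>
  <text x="10" y="40">matted before it is broken</text>
</svg>"""

print(svg_lines_to_paragraph(sample))
```

Even this toy version has to guess about hyphens, which shows how much information is lost once flowable text has been formatted into positioned lines.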
—–
OK, enough of XML Documents.  The point being it is better to talk about the specific XML markup language being used than XML Documents in general.  “XML Document” is meaningless.

But it gets worse.  And this one I blame on the technical people letting the marketing and PR people get away with too much liberty in terminology.  I personally fought this battle within Adobe and so far have won.  

If you look at the new standard XML office document formats (e.g., OOXML, ODF and OPS) that are widely referred to as XML Documents, none of them are XML files!  They are all ZIP archives that contain, among the sub-files in the archive, some XML files. So at the very best they are a single file representation of a set of XML files.  This is true.  Look at OOXML, ODF, OPS and, yes, even Adobe’s new XML-ized PDFs or Mars. As files they are all ZIP archives with some additional conventions and rules (e.g., OOXML’s OPC and IDPF’s OCF) controlling how to organize the sub-files.  And the reason they are ZIP archives is that there are many binary formats such as JPEG files that just cannot by any reasonable means be formatted as an XML markup language. These “office document formats” all want to represent compound documents that contain binary images, binary fonts, binary color profiles and so on. These binary resources go into a ZIP archive just fine.  They just don’t work as XML.

Another great reason to use ZIP archives is that the format supports Flate compression (used by PNG) of the sub-files and for XML files that can make them much smaller. It is also possible to reach into a ZIP archive and read any particular sub-file without “unpacking” the archive. That gives the software N access points into the archive if there are N sub-files. And any sub-file can be read without reading any of the others.
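Here is a small Python sketch of both points, using the standard zipfile module. The archive is built in memory, and the member names (content.xml, images/photo.bin) and their contents are invented for the example; they are not the layout of any of the formats named above.

```python
import io
import zipfile

# A repetitive XML sub-file and a stand-in for a binary resource.
xml_part = ("<doc>" + "<p>Hello, world.</p>" * 200 + "</doc>").encode("utf-8")
binary_part = bytes(range(256)) * 4

# Write both sub-files into one archive, Flate-compressed (ZIP_DEFLATED).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", xml_part)
    archive.writestr("images/photo.bin", binary_part)

# Reopen the archive and read one sub-file without extracting the others.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    info = archive.getinfo("content.xml")
    data = archive.read("content.xml")

print(names)
print("content.xml:", info.file_size, "bytes stored as", info.compress_size)
```

The reader finds content.xml through the archive's directory and decompresses only that member, which is the "N access points" idea in practice.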

Admittedly the primary sub-files in the document archive files are written in some XML markup language and in most cases make up the majority of the material.  But to call these ZIP files XML Documents, or even XML files, somehow seems dishonest and an attempt to build unfairly upon the tremendous marketing hype that XML has obtained.

If you look at the Adobe Mars literature you will find that Mars is called XML-friendly.  That is my doing.  I just thought it was too much of a stretch to call a ZIP archive an XML Document or an XML file.  I think the other groups should also change their terminology to something more technically accurate.  When we misuse technical terminology, or even stretch meanings too much, we just create massive confusion and misunderstandings and we make it even harder to communicate with each other.

In yet another blog I will go into why XML is rather over hyped and why we need to be more accepting of other technologies such as ZIP archives and binary JPEG files.  I also hope to post an entry on why ZIP archives are really not the best technical solution for these applications. But those are for another day.

Contact me at: jking@adobe.com

4:50 PM Comments (8) Permalink
September 13, 2007

XML For …

In my opinion, XML is one of the most misunderstood technologies in existence today. And it is amazing because XML, itself, is so elementary. I think that is the source of the misunderstanding – the simplicity. Let me explain.

XML was first established as a “recommendation” (their name for a standard) by the W3C in February 1998, 9 years ago. Around 2001 one of the Adobe executives told all the Adobe product groups that they had to make use of XML in their products. For the most part no one knew what that meant. At the same time, I was also experiencing crazy statements being made at the Seybold conferences about XML that made no sense to me at all.

So I put together a presentation about XML that I presented at the 2002 Adobe internal Technology Summit where all the engineers in the company gather for a 3-day conference. In that presentation (see the entry for XML/PDF) I said two rather outlandish things:
o XML is not a markup language despite what its name says (extensible markup language), and
o You may not just say "XML" anymore at Adobe; you must say "XML for <something>," like XML for news feeds, or XML for Graphics, or XML for Job Tickets.

Yes, this will sound rather self-serving, but I think that talk made a huge impact on the Adobe product groups’ ability to act on the directive to support XML in their products. And now Adobe does make extensive use of XML for a lot of things.

XML is not in and of itself a markup language, but it is a set of rules and a notation for inventing markup languages. The term markup came from the earlier practice of editorially marking text galleys for publication with red pencil. It later evolved into ways to mark up text for typesetting. Some interesting history.

I have often made a rather simplistic analogy: using the term “XML” is similar to saying “the Roman alphabet,” as in "I just wrote a poem using the Roman alphabet." That statement isn’t too illuminating since there are dozens of languages that use the Roman alphabet. Much more information is conveyed if instead I said "I just wrote a poem in German which uses the Roman alphabet." So to say something is XML hardly says anything. It says only that you used the rules and notation.

Much more useful is to say what particular XML markup language was used. And there are thousands of XML markup languages that all use the XML notation and rules, just as there are dozens of natural languages that use the Roman alphabet. See also.
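One rough way to see the difference in code: a generic XML parser accepts any well-formed file, so knowing "it parses" tells you almost nothing, while the namespace of the root element usually tells you which markup language you are holding. The Python sketch below assumes the file declares a namespace on its root element; the function name and the short lookup table are made up for the example, although the three namespace URIs are the published ones for SVG, XSLT and XHTML.

```python
import xml.etree.ElementTree as ET

# A tiny, incomplete table of well-known namespace URIs.
KNOWN_VOCABULARIES = {
    "http://www.w3.org/2000/svg": "SVG",
    "http://www.w3.org/1999/XSL/Transform": "XSLT",
    "http://www.w3.org/1999/xhtml": "XHTML",
}


def identify_vocabulary(xml_text):
    """Guess the XML markup language from the root element's namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced tag as "{namespace-uri}localname".
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""
    return KNOWN_VOCABULARIES.get(
        namespace, "unknown (root element: %s)" % root.tag
    )


print(identify_vocabulary('<svg xmlns="http://www.w3.org/2000/svg"/>'))
print(identify_vocabulary(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>'
))
print(identify_vocabulary("<businessCard/>"))
```

All three inputs are equally "XML" to the parser; only the name of the markup language says what to do with them.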

So please don’t say anything is just XML but talk about particular XML markup languages. You will find that your communications will take a big step toward being more meaningful. Really!

And if you have time, go look at the annotated slide presentation I mentioned above. That presentation also covers one other hang-up I have about people’s understanding of XML and I will address that in yet another blog titled something like “XML Documents”.

(As an aside and to help you get to know me better, I do have some credentials in this area since I worked at IBM Research for 19 years and used their GML, which later evolved to become an ISO standard called SGML (ISO 8879:1986), which was simplified and had a couple of key attributes added and became XML. So I have lived with this stuff for over 30 years.)

Contact me at: jking@adobe.com


11:15 AM Comments (0) Permalink
September 7, 2007

Fast Tracking

Microsoft has just had a tough defeat in getting OOXML approved as an ISO/IEC Fast Track standard:
See official results.

Is Adobe facing the same problem in our submission of PDF to ISO on a Fast Track?
See press release.

Has the Microsoft effort given Fast Tracking a bad name and will Adobe suffer for this? The Fast Track was meant to skip over the technology development stage because the technology was already a developed standard of one kind or another. At least a couple of methods have been OK’ed for ISO and ISO/IEC Fast Track processing.

Adobe is relying on one method, adopting an existing de facto standard, where Microsoft is relying on a second method, adopting a standard from a duly recognized other standards group, in this case ECMA.

Fast Tracking seems like a very reasonable approach to making PDF an ISO standard. There is no need to go through the standards development stages since PDF already exists and is well defined by Adobe documentation and has been for 14 years. There are billions of PDF files in existence and thousands of applications. See also. Partnering with AIIM and ANSI, Adobe submitted PDF directly to ISO for Fast Track processing to become ISO Standard 32000. We didn’t try to make it some other kind of standard first, but just depended upon the fact that it is a well established de facto standard.

The catch in the Microsoft path seems to be that although OOXML was an existing ECMA-376 standard, it had only been so for a short time before it was submitted to ISO. Making it an ECMA standard could appear to have been a ploy to avoid the more rigorous processes of ISO to get it done more quickly. If what I am surmising is anywhere near true, then this does seem like an abuse of the Fast Track idea.

So, my goal is to distance Adobe’s effort from that of Microsoft since they just had a bad flameout and we do not want the same thing to happen to us. The use of Fast Track for PDF seems to be exactly what Fast Tracking was made for. After all, if we want PDF to remain PDF what would the technical development phase of an ISO committee do? It shouldn’t change the current definition. It shouldn’t try to improve it at the same time as transferring ownership. We all know about having too many balls in the air. So there is nothing to do in those first stages, so skip them.

You might be able to argue that OOXML is based upon the old standard (binary and proprietary) Office file formats (.doc, .xls, and .ppt). So why can’t OOXML ride on that de facto-ness? Maybe it should be able to. But the exact representation in OOXML is all new and can and should be subject to extensive technical review, probably more than ECMA was able to give it. Probably not a good candidate for Fast Tracking.

The solution for the future? Let me devise a Fast Tracking rule that would admit PDF but not admit OOXML. What if a submission had to be a proven de facto standard (same as now) or have been a standard of another standards organization for at least a year or two? That way, if some standards organizations do quickie reviews of their standards simply as a route onto the ISO Fast Track, avoiding the more rigorous ISO development review, the waiting period might discourage it.

Skipping the standards development steps for PDF makes infinite sense to me. (Of course it would.) And “Fast” and “Track” don’t seem like dirty words. They seem right.

Contact me at: jking@adobe.com

6:34 PM Comments (1) Permalink

High Drama!

Wow! Who would ever believe that votes on standardizing document formats would draw such attention and be so dramatic. Will they make a TV mini-series about this?

Of course, I am referring to the recent rejection of OOXML by ISO’s JTC1, and this in spite of all the behind the scenes lobbying Microsoft has been doing to get enough votes.

Here is the official ISO announcement.


And what is even more amazing is Microsoft is now declaring this a success!

And look at this, also.

I am not one of those who hates Microsoft and I do not believe in taking pot shots at them even when they seem to ask for it, but this one seems to indicate that their PR machinery is much too powerful. I’m a technical guy and we technical types need to stick together to keep our companies from thinking they can change reality with press releases. They lost this one, folks, big time, and declaring it a success doesn’t make it so.

My name is Jim King and I am the PDF Architect for Adobe. As I hope you know, Adobe is in the process of moving PDF, an existing de facto standard, to be under public (ISO) standards control.

Here is the Adobe press release about it.

I am the Adobe technical lead for this effort so am watching other standards activities with more than great interest. Although I will explain why what Microsoft is doing is very different from what Adobe is doing, when the area around you starts to feel like Gettysburg on July 2, 1863, you have two choices: hide or flee. At Adobe we are not sure which to do.

Maybe there is a third alternative: just jump up and shout “Don’t shoot, our hats are blue (or gray)!” The trouble with that one is you might get shot anyway if your shouts are not heard or if the fire is coming from both sides.

Maybe you can help me guide Adobe to a successful transfer of the ownership of PDF into ISO’s public hands over the next few months. To me this is a very noble effort: giving away to the public something that Adobe holds near and dear.

In this blog I also plan to talk about many other PDF topics, not just standards. So let’s see if we can keep you coming back. Next time I will give my opinion of what Adobe is doing with PDF and why it is really different from what Microsoft is doing with OOXML.

Contact me at: jking@adobe.com

4:15 PM Comments (1) Permalink