by lrosenth

Created

September 17, 2007

Today I hope to tie together two previous blogs about OOXML and about XML For … .  

I am sure you have often heard the term “XML Document.”  I hope you realize that the term is nearly meaningless, just like the term “XML.”  We should never use either in polite conversation. Let me describe some of the totally different uses of “XML Document” that render it a useless term, and maybe you will agree with me to banish it from our vocabularies.

A)  First, if you know the history from GML and SGML to XML (see this or that, for example), the primary intended use was to mark up text: to break it into flowable paragraphs and to mark strings of words as headings, lists, footnotes and so on. The notion was to separate the “form” from the “content” by labeling the content semantically, not stylistically. The form is introduced later, which allows the content to be repurposed multiple times with different styling. In these uses, the predominant material found in an XML file is the words that make up the text. And in most cases the text within a paragraph is uninterrupted by markup except for an occasional emphasized phrase (bold) or font change.

This is what I always think of as the typical XML Document.  A good modern example of this is the XML/SGML DocBook markup language.

B) But I have had conversations with IS people where XML Document to them means the kind of structured data you can withdraw from a database and represent using some XML markup language. For example, think of a business card XML markup language where a person’s name, address, phone, etc. are held with appropriately named XML elements. All content, no form. I think of this more as a data packet or dataset than a document but … .  When I asked the IS folks about the “artwork” that would make the data into a nice looking presentation, their reaction was, “Oh, yes, that. That is handled in another department where they have designers and are concerned about publishing.  But our documents are what you have just referred to as datasets.”
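To make the “dataset” reading concrete, here is a minimal Python sketch. The business card markup and its element names (card, name, address, phone) are invented for illustration and do not come from any real vocabulary. It shows that such a file reduces naturally to a record of named fields, with no presentation information at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical business card markup; these element names are made up
# for this example and are not part of any standard.
CARD_XML = """<card>
  <name>Pat Example</name>
  <address>123 Main Street, Springfield</address>
  <phone>555-0100</phone>
</card>"""


def card_to_record(xml_text):
    """Map each child element of the card to a field name -> text entry."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


record = card_to_record(CARD_XML)
print(record)
```

Everything in the result is content. Any styling would have to come from somewhere else, which is exactly what the IS folks were saying.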

C) The one that really blows my mind is the XML markup language XSLT. It actually codes a rather unusual style of programming language into XML elements and attributes. The most interesting thing within these files is the markup and not the text content. Those XSLT files are XML Documents too! Often XML Document is just a synonym for XML file.

D) The final example, more dear to my heart, is SVG (Scalable Vector Graphics), another XML markup language. SVG can be used to express a lot more than graphics, as it has extensive text support and can also place images into the content. SVG describes the contents of a single rectangle, not multiple pages, although there are movements to extend it. But I am sure you will hear people talk about SVG Documents.

The most interesting thing is that SVG is not primarily based upon a flow model as, say, the XHTML markup language is. It is based upon a 2D model of a rectangular canvas, and anything can be placed at any starting point within the canvas. Using this model, text must be broken up into strings whose characters all stay along the same baseline. So a paragraph of text is typically represented not in a flowable form but broken into its individual lines; it must first be formatted before it can be output as this kind of SVG.  What is worse, there is nothing to stop one from making an SVG file where the order in which the lines occur in the file bears no relationship to where they fall on the canvas or to their normal reading order. So, for such a file, the lines have to be sorted by position on the canvas in order to convert them back into flowable paragraphs. And what if there were lines with hyphenations?  The hyphenations have to be removed to make flowable text paragraphs.
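The recovery step described above can be sketched in Python. This is only a rough illustration under strong assumptions: every line is its own text element with single numeric x and y attributes, there are no transforms or tspan positioning, the page has one column, and every trailing hyphen marks a word broken across lines. The sample SVG and the function name reflow are invented for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Invented sample: three lines of one paragraph, stored out of reading order.
SVG_TEXT = """<svg xmlns="http://www.w3.org/2000/svg" width="300" height="100">
  <text x="10" y="60">the canvas.</text>
  <text x="10" y="20">Lines can be writ-</text>
  <text x="10" y="40">ten anywhere on</text>
</svg>"""


def reflow(svg_text):
    """Sort text lines by canvas position and join them into one paragraph."""
    root = ET.fromstring(svg_text)
    lines = []
    for el in root.iter(SVG_NS + "text"):
        x = float(el.get("x", "0"))
        y = float(el.get("y", "0"))
        lines.append((y, x, "".join(el.itertext()).strip()))
    lines.sort()  # top to bottom first, then left to right

    paragraph = ""
    for _, _, text in lines:
        if paragraph.endswith("-"):
            # Naive assumption: a trailing hyphen is a line-break hyphen.
            paragraph = paragraph[:-1] + text
        elif paragraph:
            paragraph += " " + text
        else:
            paragraph = text
    return paragraph


print(reflow(SVG_TEXT))
```

Real files are messier. A genuinely hyphenated compound that happens to fall at a line end would be joined incorrectly here, which shows how much information is lost once text has been formatted into positioned lines.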
—–
OK, enough of XML Documents.  The point is that it is better to talk about the specific XML markup language being used than about XML Documents in general.  “XML Document” is meaningless.

But it gets worse.  And this one I blame on the technical people letting the marketing and PR people get away with too much liberty in terminology.  I personally fought this battle within Adobe and so far have won.  

If you look at the new standard XML office document formats (e.g., OOXML, ODF and OPS) that are widely referred to as XML Documents, none of them are XML files!  They are all ZIP archives that contain, among the sub-files in the archive, some XML files. So at the very best they are a single file representation of a set of XML files.  This is true.  Look at OOXML, ODF, OPS and, yes, even Adobe’s new XML-ized PDFs or Mars. As files they are all ZIP archives with some additional conventions and rules (e.g., OOXML’s OPC and IDPF’s OCF) controlling how to organize the sub-files.  And the reason they are ZIP archives is that there are many binary formats such as JPEG files that just cannot by any reasonable means be formatted as an XML markup language. These “office document formats” all want to represent compound documents that contain binary images, binary fonts, binary color profiles and so on. These binary resources go into a ZIP archive just fine.  They just don’t work as XML.
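A quick way to see the distinction is to build a tiny package and ask an XML parser to read it. The Python sketch below uses made-up member names (content.xml, images/photo.jpg) and placeholder bytes instead of a real JPEG; it does not follow the OPC or OCF conventions, only the general shape of a ZIP archive holding XML and binary parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_package():
    """Build a toy in-memory package: one XML part plus one binary part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", "<doc><p>Hello</p></doc>")
        # Placeholder bytes standing in for a real JPEG image.
        zf.writestr("images/photo.jpg", b"\xff\xd8\xff\xe0 not a real JPEG")
    return buf.getvalue()


def is_xml(data):
    """Return True if the bytes parse as a well-formed XML file."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


package = build_package()
print(package[:2])                               # ZIP signature, not "<?xml"
print(zipfile.is_zipfile(io.BytesIO(package)))   # the package is a ZIP archive
print(is_xml(package))                           # the package is not XML

with zipfile.ZipFile(io.BytesIO(package)) as zf:
    names = zf.namelist()
    print(names, is_xml(zf.read("content.xml")))  # only a sub-file is XML
```

The package as a whole fails the XML test while one of its members passes, which is the whole point: the XML lives inside the file, it is not the file.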

Another great reason to use ZIP archives is that the format supports Flate compression (used by PNG) of the sub-files and for XML files that can make them much smaller. It is also possible to reach into a ZIP archive and read any particular sub-file without “unpacking” the archive. That gives the software N access points into the archive if there are N sub-files. And any sub-file can be read without reading any of the others.
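Both properties are easy to observe with Python's standard zipfile module. The following is a small sketch with invented member names and deliberately repetitive content; actual compression ratios depend entirely on the data.

```python
import io
import zipfile


def build_archive():
    """Build a toy in-memory archive with Deflate-compressed XML members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        body = "<doc>" + "<p>repetitive markup</p>" * 500 + "</doc>"
        zf.writestr("content.xml", body)
        zf.writestr("styles.xml", "<styles><style name='body'/></styles>")
    return buf.getvalue()


archive = build_archive()
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    # Sizes come from the archive's directory; nothing is unpacked here.
    info = zf.getinfo("content.xml")
    print(info.file_size, info.compress_size)

    # Read one member directly, without extracting or reading the other.
    styles = zf.read("styles.xml")
    print(styles.decode("utf-8"))
```

The reader locates styles.xml through the archive's directory and decompresses only that member, so content.xml is never touched.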

Admittedly the primary sub-files in the document archive files are written in some XML markup language and in most cases make up the majority of the material.  But to call these ZIP files XML Documents, or even XML files, somehow seems dishonest and an attempt to build unfairly upon the tremendous marketing hype that XML has obtained.

If you look at the Adobe Mars literature you will find that Mars is called XML-friendly.  That is my doing.  I just thought it was too much of a stretch to call a ZIP archive an XML Document or an XML file.  I think the other groups should also change their terminology to something more technically accurate.  When we misuse technical terminology, or even stretch meanings too much, we just create massive confusion and misunderstandings and we make it even harder to communicate with each other.

In yet another blog I will go into why XML is rather overhyped and why we need to be more accepting of other technologies such as ZIP archives and binary JPEG files.  I also hope to post an entry on why ZIP archives are really not the best technical solution for these applications. But those are for another day.

Contact me at: jking@adobe.com