by James C. King
Author’s note: Link to a Romanian translation of this article (by Alexander Ovsov)
Data versus Information
As part of an Open Government initiative, the US Government has created guidelines and goals for making more of the information held by government agencies accessible to the public. See, for example, President Obama’s December 2009 directive.
Of course, I advocate the judicious use of PDF for disseminating that government information. Well, in most situations.
I will use the terms data and information to distinguish at least two ways government agencies are being asked to provide information to the public in electronic form. Data is just the raw numbers, names, places, etc.: the stuff you might pull out of a database. But data can be shaped into information by providing interpretations, drawing conclusions, pointing out inconsistencies and then packaging it in an attractive form. And then there is the most extreme McLuhan view that the medium is the message!
When I go to a government website, I usually want information, not data. When an analyst goes to a government website, she might want just the raw data so that she can interpret, shape and analyze it, and turn it into a specific document containing information not provided by the government website.
There is the saying that if you have a hammer, then everything looks like a nail. I know for sure my 3-year-old grandson thinks so when he has a real hammer in his hand. And I have to confess that, from my view, almost all needs are best addressed using PDF. PDF is my metaphorical hammer. I have some justification for this position as the Project Leader for the ISO 32000 PDF standard.
Seriously, in most cases, government agencies can, and do, use PDF for effective information distribution. The information as a PDF can be downloaded, read in its electronic form, saved for later reference, shared and printed. Nearly everyone has a free PDF reader. PDF documents can also be infused with what we at Adobe like to call "rich document" features. The final representation of the information can be very important. As noted earlier, the medium is the message, or it certainly can make an important contribution.
However, for the person who wants raw data, PDF isn’t the right choice. See, I am willing to refrain from hitting everything in sight with my PDF hammer! But raw data isn’t of much use unless you have metadata (data about data) describing how to access the data elements, the ranges and meanings for the individual data items, where the data comes from, how old the data is, how authentic it is, and so many other important properties. So I will argue later that, in fact, PDF can play a very important role in distributing raw data, offering a clever means to accompany that data with meaningful and accurate metadata. (I am using the term metadata here in a very general way as data about data, and maybe a little differently than you are used to.)
The government initiative has surfaced people who have XML hammers. I think some XML enthusiasts, but certainly not all, go overboard. I think it hurts their cause. As I have blogged earlier (XML for •••), XML is one of the most misunderstood, yet useful, technologies we have in today’s toolkit. We need to drop our hammers and consider the facts.
Using XML for raw data is the kernel of a good idea. But major problems arise when the discussion stops there:
- Those who are not familiar with XML need to realize that XML isn’t a single markup language for a single use; it is a method for defining and using specialized markup languages. That is why we have to say XML for business cards, XML for invoices, XML for classifying political action committees and so on. There are thousands of such XML markup languages and there will be thousands more to cover all those government datasets where XML is appropriate. Please see my previous blog for more on this topic.
- Large raw datasets can be prohibitively large when expressed in an XML markup language. Unnecessarily large, from an information-theoretic view. For example, here is an XML data file that can be found on www.data.gov at this webpage. Note that, when you download this file, it is a ZIP file whose size is 11,903,362 bytes. When you unzip it, though, you get a 220,655,917 byte XML file. In this case, the EPA personnel know that XML files can be very large and have packaged it in a ZIP for downloading, reducing the transmission time by a factor of over 18. In other words, if it takes a minute to download the ZIP’ed version, it will take over 18 minutes to download a raw XML version. After unzipping, the XML file is identical to the original. So any advocacy for XML should always be accompanied by a file size discussion and consideration of using something like ZIP. To do otherwise would be irresponsible.
- XML files need additional metadata in order to make use of the data that is found within them. If you are given three numbers (007, 56, 00010) you might not guess that they represent a birthday, which is more commonly supplied as 7/10/1956 in the US. But even that form isn’t representative of how it is often expressed in other countries, as 10-07-56. Unless given additional information, there is an ambiguity between the 7 and the 10 as to which is the month and which is the day. And this is a trivial example. We need extra information, i.e., metadata, in order to be able to interpret data accurately and make proper use of it. The basic syntactic rules used for a markup language can and should be provided by offering an XML Schema (.xsd file), but even that does not go far enough. We also need to explain the semantics, which usually requires a technical document.
- There are other raw data formats that might be more suited to particular needs, such as plain delimited text files (.csv, comma-separated values) or the Microsoft and OpenOffice spreadsheet formats (.xlsx and .ods, respectively). The latter two can also include the formulae and rules needed to compute dependent and summary data. In addition, there may be very application-specific files, not in XML format, appropriate for specific needs. For example, shapefiles (.shp) hold spatial feature information, as in the www.data.gov webpage referenced earlier.
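The birthday ambiguity described above can be made concrete with a minimal Python sketch. The element name `birthday` and the helper `interpret_birthday` are hypothetical, chosen only for illustration; the point is that the same well-formed XML value yields two different dates depending on a convention the file itself does not state.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical record: the markup is well-formed, but nothing in it
# says whether the value is month/day/year or day/month/year.
RECORD = "<person><birthday>7/10/1956</birthday></person>"

def interpret_birthday(xml_text):
    """Return the (US-style, day-first) readings of the same raw value."""
    raw = ET.fromstring(xml_text).findtext("birthday")
    us_reading = datetime.strptime(raw, "%m/%d/%Y").date()
    day_first_reading = datetime.strptime(raw, "%d/%m/%Y").date()
    return us_reading, day_first_reading

us_reading, day_first_reading = interpret_birthday(RECORD)
print(us_reading.isoformat())         # 1956-07-10
print(day_first_reading.isoformat())  # 1956-10-07
```

Both readings parse without error, so only accompanying documentation can tell a consumer which one the publisher intended.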
Note: An argument can be made that compressing and decompressing files is so time consuming that the time lost there is not made up in reduced transmission time. With today’s lightning-fast CPUs, the compression and decompression times are relatively minor, but the transmission times can be a problem if you don’t have the latest and greatest Internet connection. So the choice to use compressed data is nearly always the right one for data expressed in an XML markup language.
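As a rough check on the file size figures quoted above, here is a small Python sketch. The two byte counts come from the EPA example in this article; the sample XML and its element names are made up for illustration, and the ratio achieved on such synthetic data says nothing precise about any real dataset. It uses the standard `zlib` module, which implements the same deflate method that ZIP most commonly uses.

```python
import zlib

ZIP_BYTES = 11_903_362    # size of the downloaded ZIP, from the article
XML_BYTES = 220_655_917   # size of the unzipped XML, from the article

def size_ratio(uncompressed, compressed):
    """How many times larger the uncompressed form is."""
    return uncompressed / compressed

def make_sample_xml(rows):
    """Build repetitive, made-up XML; the element names are hypothetical."""
    lines = ["<facilities>"]
    for i in range(rows):
        lines.append(f"  <facility><id>{i}</id><state>TX</state></facility>")
    lines.append("</facilities>")
    return "\n".join(lines).encode("utf-8")

def deflate_roundtrip(data):
    """Compress with deflate, then decompress to confirm nothing is lost."""
    compressed = zlib.compress(data, 9)
    restored = zlib.decompress(compressed)
    return compressed, restored

print(round(size_ratio(XML_BYTES, ZIP_BYTES), 1))  # 18.5

sample = make_sample_xml(1000)
compressed, restored = deflate_roundtrip(sample)
print(restored == sample)                 # True: compression is lossless
print(len(sample), "->", len(compressed), "bytes")
```

The repeated tag names are what make XML compress so well: the markup overhead is highly redundant, and deflate removes most of it.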
PDF instead of ZIP
In a previous blog (PDF File Attachments), I described the file attachment features of standard PDF (ISO 32000). To summarize, any number of file attachments can be embedded into any PDF file. They can be extracted for use by anyone receiving the PDF file. In addition, those file attachments may be compressed using the most common compression method that is used in ZIP files: deflate/flate. For the purposes of distributing government data, this is nearly ideal. The PDF file can carry the .xml or other raw data files as compressed attachments and that base PDF document, itself, can provide all the additional semantic information that would be needed in order to make use of the data — the metadata. If the raw data is in XML form, then a compressed XML Schema file (.xsd) can also be attached to the PDF document. So when using PDF, the points made above are addressed: filesize, necessary metadata to define the XML markup language used, and formats other than XML.
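To make the envelope idea concrete, here is a minimal sketch in Python that writes a one-page PDF by hand with a single Flate-compressed attachment, using only the standard library. It is an illustration under stated assumptions, not how any Adobe product does it: the function name `build_pdf_envelope` is hypothetical, the page is blank, the attachment name is assumed to be plain ASCII without parentheses or backslashes (no string escaping is done), and a real workflow would normally use a PDF library and put the explanatory metadata on the page itself.

```python
import zlib

def build_pdf_envelope(attachment_name, payload):
    """Return bytes of a minimal PDF carrying `payload` as a compressed attachment."""
    compressed = zlib.compress(payload)
    name = attachment_name.encode("ascii")  # assumption: no ( ) or \ in the name
    objects = [
        # 1: catalog, with a name tree listing the embedded files
        b"<< /Type /Catalog /Pages 2 0 R /Names << /EmbeddedFiles "
        b"<< /Names [(" + name + b") 4 0 R] >> >> >>",
        # 2: page tree with one page
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # 3: a blank page (a real envelope would describe the data here)
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
        # 4: file specification pointing at the embedded file stream
        b"<< /Type /Filespec /F (" + name + b") /EF << /F 5 0 R >> >>",
        # 5: the attachment itself, stored with the FlateDecode filter
        b"<< /Type /EmbeddedFile /Filter /FlateDecode /Length %d "
        b"/Params << /Size %d >> >>\nstream\n" % (len(compressed), len(payload))
        + compressed + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_position)
    return bytes(out)

xml_payload = b"<dataset>" + b"<row>42</row>" * 500 + b"</dataset>"
pdf_bytes = build_pdf_envelope("data.xml", xml_payload)
print(len(xml_payload), "bytes of XML in a", len(pdf_bytes), "byte PDF")
```

Writing `pdf_bytes` to a file and opening it in a reader that shows attachments should list `data.xml`; a schema file could be added the same way as a second entry in the name tree.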
I have created a sample PDF envelope starting from this government dataset. You might want to look it over. Note that both the XML dataset and the associated Schema file, which helps to define the XML markup language used for this file, are attachments to the PDF. I took the general introduction from the government web page and made up a brief description for each of the XML elements found in the file. Make sure to use a PDF reader that can display the attachment annotations and that can extract the attachments. Adobe Reader can do that.
Another important benefit that just works out when using PDF envelopes is that the digital signature technology available for PDF files can also cover the attachments automatically, since they are an official part of the PDF file. Government agencies can send digitally certified PDF files containing data files, and their customers can authenticate that the PDF, and all the attachments, came from that agency and have not been tampered with. See my previous blog about Authenticated PDF Documents.
Other ways to use PDF attachments for government information delivery
So far, we have described how we can use PDF to provide a complete package for raw data downloading. But we can also augment richly formatted PDF information files with the raw data files that were used to create the PDFs. For example, if a document contains pie charts, bar charts or tabular information, the raw data that was used to create that formatted information can be attached to the PDF file. An annotation can be placed on the chart or table that allows the appropriate attachment to be extracted. This is covered a bit more in my previous blog.
A third use for PDF attachments is to create hybrid files. These are PDF files that attach the original editable source document (e.g., an .odt or .docx file) that was used to create the PDF. In a sense, this makes PDF editable, something that people ask for. Quoting from the OpenOffice.org website: "A hybrid PDF/ODF file is a PDF file that contains an embedded ODF source file. Hybrid PDF/ODF files will be opened in OpenOffice.org as an ODF file without any layout changes. Users without this extension can open the PDF part of the hybrid file with their PDF viewer." Adobe Acrobat’s Microsoft Office tools can also create PDF files with the Office file that created them as an attachment.
So hammer away, you PDF enthusiasts.
Jim King (firstname.lastname@example.org)