Exporting accessibility friendly HTML from a PDF in Acrobat XI

In the Save as HTML feature of Acrobat XI, we have significantly improved the export of numbered lists, tables, and other content formatting. The goal is HTML output that largely preserves the look and feel of the PDF, using styling statements both at the document level and inline. This lets users adjust styling easily in the HTML file, and because content such as lists and tables is recognized properly, final publishing is faster. We have also introduced various settings in this feature to match our customers’ requirements, but those will be the topic of another blog post.

This post focuses on a requirement that some of our users have brought to our attention. These customers, especially those in the accessibility domain, want the export to be driven by the document’s structure (tags) and to ignore artifacts. They would also like the HTML output to contain no styling code, so that it is cleaner and more readable. The first screenshot illustrates some examples of artifacts present in a PDF, and the second shows the HTML exported from Acrobat XI for the same PDF. Arrows mark the artifacts that appear in the HTML.

To help our customers meet these requirements, we are uploading two files (HTML 3.2 and HTML 4.01), which need to be copied to “Acrobat\plug_ins\SaveAsXML\MappingTables” in the Windows installation directory, or to “Contents/Built-in/SaveAsXML.acroplugin/Contents/MacOS/MappingTables” in your Mac OS installation. Once you copy these files and restart Acrobat, new save options appear in the Save As dialog, as shown in the screenshot below: HTML 3.2 and HTML 4.01 with CSS 1.0. Our customers may remember these options from Acrobat 9.0.
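For anyone who needs to repeat the copy step on several machines, it can be scripted. The following is only a minimal sketch: the function name `install_mapping_tables` and the file name in the usage note are hypothetical, and the installation root is passed in by the caller because it varies from machine to machine. Only the two relative MappingTables locations are taken from this post.

```python
import shutil
from pathlib import Path

# Relative locations quoted in this post. The installation root they sit
# under is not assumed here; the caller must supply it.
WINDOWS_SUBDIR = Path("Acrobat", "plug_ins", "SaveAsXML", "MappingTables")
MAC_SUBDIR = Path("Contents", "Built-in", "SaveAsXML.acroplugin",
                  "Contents", "MacOS", "MappingTables")


def install_mapping_tables(files, install_root, subdir):
    """Copy the downloaded mapping files into the MappingTables folder.

    `files` is an iterable of paths to the downloaded files (hypothetical
    names; use whatever the download actually contains). Returns the list
    of destination paths that were written.
    """
    target = Path(install_root) / subdir
    if not target.is_dir():
        # Refuse to create the folder: a missing folder most likely means
        # the installation root passed in is wrong.
        raise FileNotFoundError(f"MappingTables folder not found: {target}")

    copied = []
    for source in files:
        destination = target / Path(source).name
        shutil.copy2(source, destination)  # copies content and timestamps
        copied.append(destination)
    return copied
```

A call might look like `install_mapping_tables(["downloaded-mapping-file"], install_root, WINDOWS_SUBDIR)`, where both the file name and `install_root` are placeholders. Writing into an application folder may require elevated permissions, and Acrobat still needs to be restarted afterwards.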

Using these options, you can get HTML output that is much more accessibility friendly. Below is the HTML output for the PDF shown above, exported with the HTML 3.2 option. As you can see, this output doesn’t contain the artifacts.

The screenshots below also show the difference in the HTML source. The first is the original Acrobat XI output, and the second is the output with the HTML 3.2 option.
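The difference in styling code can also be checked without reading the source by eye. The sketch below is not part of Acrobat; it uses Python's standard `html.parser` module to count document-level `<style>` blocks and inline `style` attributes in an exported file. The class and function names are made up for this illustration, and externally linked stylesheets are not counted.

```python
from html.parser import HTMLParser


class StyleCounter(HTMLParser):
    """Counts <style> elements and inline style="..." attributes."""

    def __init__(self):
        super().__init__()
        self.style_blocks = 0   # document-level styling
        self.inline_styles = 0  # per-element styling

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "style":
            self.style_blocks += 1
        if any(name == "style" for name, _ in attrs):
            self.inline_styles += 1


def count_styling(html_text):
    """Return (style_blocks, inline_styles) for the given HTML source."""
    counter = StyleCounter()
    counter.feed(html_text)
    counter.close()
    return counter.style_blocks, counter.inline_styles
```

Running `count_styling` on the text of each exported file gives two pairs of numbers to compare. If the HTML 3.2 export carries no styling code, as described above, its counts should be lower than those of the default Acrobat XI export.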

We hope our customers find this solution helpful when working with Acrobat XI.