This is a blog I promised to write months ago, and I have finally bitten the bullet. Extracting text from PDF documents is a difficult and highly technical task, and I hope to explain here why that is the case.
The Role of PDF
First, I want to position PDF properly for this discussion. A PDF file represents an ordered sequence of fixed pages. The planned appearance of everything that each page contains is completely specified, down to the smallest detail. All the graphics, images, and text are specified to appear at precise spots on the page, in a particular color, of a given and fixed size, etc.
In this way, the creator of the PDF file gets to determine the exact appearance that the viewer will see. Renderers of PDF (like Adobe Reader or Adobe PDF Print Engine) are required to display the page contents exactly as specified by the creator of the PDF file. The renderer is a slave to the authoring tool. In fact, I like to think of the PDF specification as a contract between the software that creates the PDF file and the renderer that displays it: they must agree on precisely what the positioning, size, color, etc. noted in the PDF file mean.
Text in PDF Page Contents
So now let’s turn our attention to the text content of PDF pages. Text is rendered for display or printing using specific graphic designs captured in fonts, and each individual graphic is called a glyph. The lower case letter "a" in the Latin alphabet may have hundreds of different glyph appearances, such as those shown here:
These all serve the purpose of being a lower case Latin "a" but have distinctly different glyph appearances; notice that they also have different widths and require different spacing to be displayed properly.
PDF specifies the text content of pages as glyphs, not characters. That is, one of the appearances for an "a" is chosen by the creator of the PDF file by choosing a font from which the "a" glyph can be taken. PDF page contents do not specify abstract characters such as the Latin letter "a" itself.
The rub comes when we want to work with characters, not glyphs. Unicode is widely used because it is a character encoding technology, not a glyph encoding one. In fact, for many purposes, such as searching for text strings, we do not want to search by appearance; we want to search by the Latin letters (or, commonly, by the Unicode encoding of characters).
This issue of how PDF pages represent text is further complicated by three more facts.
1) The representation of PDF page contents is tightly specified so that the creator of the PDF file can determine the exact position on the page of every glyph chosen. But this is usually not done by specifying a position for each glyph individually. The creator of the PDF file is expected, and required, to have access to the same font as the renderer that processes the PDF file, including the width and spacing information for each glyph supplied in the font file. For a string of text, if we agree to strictly space the glyphs, one from the other, using the spacing information given in the font, then all we need to specify in the PDF file is the starting position of the first glyph in the string; the positions of all the rest are determined by that. This saves space and is a common and easy convention to work with.
There is no strict control over font files, so to make sure that the creator of the PDF file and the renderer of the PDF file use the same information, it is wise to embed the font that the creator used, with its important spacing information, in the PDF file itself. When that cannot be done, the width and spacing information for the font glyphs used is placed in the PDF file so that the spacing the creator desired can be followed even when using a substituted font or one that might be slightly different.
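The advance-width convention can be sketched in a few lines. The widths below are invented illustrative values (real widths come from the font program, expressed in units of the em square), but the arithmetic is the same: only the starting pen position is stored in the page contents, and every subsequent glyph begins where the previous glyph's advance width ends.

```python
# Hypothetical advance widths in font units (1000 units per em is common).
# These numbers are made up for illustration, not taken from a real font.
ADVANCE_WIDTHS = {"d": 560, "i": 280, "f": 330, "c": 480, "u": 560, "l": 280, "t": 330}

def glyph_positions(text, start_x, font_size, units_per_em=1000):
    """Return (glyph, x) pairs, advance-spaced from the single start_x."""
    positions = []
    x = start_x
    for ch in text:
        positions.append((ch, x))
        # The next glyph starts where this glyph's advance width ends.
        x += ADVANCE_WIDTHS[ch] * font_size / units_per_em
    return positions

for ch, x in glyph_positions("difficult", start_x=72.0, font_size=12):
    print(f"{ch}: x = {x:.2f}")
```

This is why a substituted font with even slightly different widths shifts every subsequent glyph on the line, and why PDF carries the width information along when the font itself cannot be embedded.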
2) Since PDF page contents primarily specify the appearance of what is to be rendered for that page, things like line breaks, line justification, and hyphenation are all done by the creator and not within a PDF renderer. As noted earlier, a PDF renderer is a total slave to the representation the creator put in the PDF file. This means that if the word "substantial" is broken across two lines so that "sub-" appears at the end of one line and "stantial" at the beginning of the next, then you will never find the string of glyphs for "substantial" in the file, but you will find a string for "sub-" and another for "stantial".
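A text extractor therefore has to rejoin words like "substantial" itself. Here is a minimal, naive sketch of that step; it assumes every trailing hyphen marks a line break, whereas a real extractor must also decide whether a hyphen is genuine (as in "well-known"), typically using dictionaries or other cues.

```python
def dehyphenate(lines):
    """Join extracted lines, merging words split by a trailing hyphen.

    Naive assumption: every line-final hyphen is a line-break artifact.
    """
    out = []
    carry = ""
    for line in lines:
        line = carry + line
        carry = ""
        if line.endswith("-"):
            # Hold back the fragment before the hyphen and glue it
            # onto the start of the next line.
            head, _, frag = line[:-1].rpartition(" ")
            if head:
                out.append(head)
            carry = frag
        else:
            out.append(line)
    if carry:
        out.append(carry)
    return " ".join(out)

print(dehyphenate(["this task is sub-", "stantial in scope"]))
# → this task is substantial in scope
```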
3) Ligatures are glyphs specially designed to enhance the appearance of certain groups of characters when they appear in sequence. For example, the two "f"s and the following "i" in the word "difficult" are considered to look better when they are rendered as one special glyph, as in the second rendering shown here:
One single glyph in the right example is used to render "ffi", whereas three are used in the left example. The two renderings of the word take up different amounts of horizontal space, so if material were to follow the word at a specific distance, its position would have to be adjusted. In other words, if you right-justify a line of text that contains ligature glyphs, the justification will be ruined if those ligature glyphs are replaced by the individual character glyphs, and vice versa. The bottom line is that the creator of the file and the renderer of the file must agree on which ligatures will be used and which will not. By using glyph references, PDF avoids this problem: the file creator determines which ligatures will be used, and where, by specifying the proper glyphs. The renderer, in its slave role, just places the selected glyphs where directed.
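On the extraction side, the consequence of point 3 is that the ligature is a single glyph in the page content, and the extractor must expand it back into characters. A small sketch, using the Unicode compatibility code points that exist for the common Latin ligatures:

```python
# Unicode assigns compatibility code points to common Latin ligatures;
# an extractor can use them to expand one ligature glyph into its
# constituent characters.
LIGATURES = {
    "\ufb00": "ff",   # ﬀ
    "\ufb01": "fi",   # ﬁ
    "\ufb02": "fl",   # ﬂ
    "\ufb03": "ffi",  # ﬃ
    "\ufb04": "ffl",  # ﬄ
}

def expand_ligatures(glyph_text):
    """Replace any ligature code point with its character sequence."""
    return "".join(LIGATURES.get(ch, ch) for ch in glyph_text)

word = "di\ufb03cult"         # rendered form: 7 glyphs, one of them ﬃ
print(len(word))              # → 7
print(expand_ligatures(word)) # → difficult  (9 characters)
```

Note the count changes from 7 glyphs to 9 characters, which is exactly why a search for "difficult" fails against the raw glyph string but succeeds after expansion.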
Glyph References and Fonts
Different font technologies organize the font glyphs in different ways, but basically some identifier is used to select which glyph to use. Several different methods are allowed in PDF in order to support evolving font technologies, but there is one fact that is rather disruptive to extracting text from a PDF file. In some cases, all that PDF requires is the identifier the file creator uses to select a glyph in the font. No meaning of it being an "a" is required; just enough information to find the desired glyph in the font is needed. So if we want to extract character text in, say, a Unicode representation, we really do not have enough information. We only know how to render the glyph, not what it represents.
There is an optional auxiliary structure called the "ToUnicode" table that was introduced into PDF to help with this text retrieval problem. A ToUnicode table can be associated with a font that does not otherwise provide a way to determine the relationship between glyphs and Unicode characters (some do). The table maps strings of glyph identifiers to strings of Unicode characters, often just one to one, so that the proper character strings can be reconstructed from the glyph references in the file.
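Conceptually, a ToUnicode table behaves like the mapping below. The glyph identifiers here are invented for illustration (real tables are stored as CMap data inside the PDF), but the shape of the problem is accurate: the content stream holds opaque glyph IDs, and the table, when present, turns them into Unicode strings, sometimes several characters per glyph.

```python
# Hypothetical ToUnicode-style mapping: glyph ID -> Unicode string.
# The IDs are made up; a real table is a CMap embedded in the PDF.
TO_UNICODE = {
    17: "d",
    23: "i",
    41: "ffi",   # one ligature glyph maps to three characters
    55: "c",
    60: "u",
    62: "l",
    70: "t",
}

def glyphs_to_text(glyph_ids, to_unicode):
    """Replace each glyph ID with its Unicode string; U+FFFD if unknown."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)

# The content stream references 7 glyphs, but extraction yields 9 characters.
print(glyphs_to_text([17, 23, 41, 55, 60, 62, 70], TO_UNICODE))
# → difficult
```

When the table is absent and the font's own encoding carries no character meaning, an extractor is reduced to guesswork, which is the disruptive case described above.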
The process for extracting text also has to recognize and undo ligatures and remove hyphenation. There are various techniques for this; the best ones rely on additional optional information, called "structure" information, that can be put into the PDF file. I do not want to go into more detail at this point, but by now you should have a good idea of what it takes to extract text from a PDF and why it is so technical and messy.
Why Not Just Use Unicode?
So an easy question is: why doesn’t PDF use Unicode to represent the text in a PDF file? The simple answer is that Unicode represents characters independent of their appearance, and PDF is all about appearance. But in fact, if I choose a font and devise a method for picking font glyphs given a Unicode character, I can almost get away with using Unicode characters while still rendering glyphs. Why do I say almost?
Well, the issue of ligatures makes it messy, because ligatures are glyphs but generally not characters. SVG faces this problem and partially solves it by providing escapes from Unicode into glyph identifiers when Unicode cannot handle the cases of interest. Suppose we have the word "difficult", which can be rendered with a ligature for "ffi" or not. The two choices require different spacing and placement, so the creator of the PDF file has to be able to specify which ligatures are to be used and which not. Currently, whether ligatures are available at all, and which ones, varies by font and even between instances of the same font. So more standardization needs to be done before this approach will work.
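The "almost" can be made concrete with a sketch of the creator's side of the problem: given Unicode text and the set of ligatures a particular font happens to offer, choose the glyph sequence. The ligature inventory below is hypothetical; the point is that a different font with a different inventory would yield a different glyph string, and therefore different spacing.

```python
# Hypothetical ligature inventory for one particular font.
# A different font might offer only "fi" and "fl", or none at all.
AVAILABLE_LIGATURES = {"ffi": "\ufb03", "fi": "\ufb01", "fl": "\ufb02"}

def pick_glyphs(text, ligatures):
    """Greedily substitute the longest available ligature at each position."""
    glyphs = []
    i = 0
    while i < len(text):
        # Try longer ligature sequences before shorter ones.
        for seq in sorted(ligatures, key=len, reverse=True):
            if text.startswith(seq, i):
                glyphs.append(ligatures[seq])
                i += len(seq)
                break
        else:
            glyphs.append(text[i])
            i += 1
    return glyphs

print(pick_glyphs("difficult", AVAILABLE_LIGATURES))
# → ['d', 'i', 'ﬃ', 'c', 'u', 'l', 't']
```

Because this choice depends on the font at hand, it cannot be left to the renderer without giving up the creator's control over exact appearance.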
A New Agreement
The design of PDF established a line in the process of specifying and rendering a document: all the appearance decisions are made by the creator of the PDF file. One could imagine another design where that is not the case. For example, one might leave line justification, hyphenation, etc. to the rendering component. The difficulty is that there can be relatively large variation in the results, so that different renderers making slightly different choices may produce quite different output. It is also awkward to confine spacing changes to one part of a document without leaving larger-than-desired white space, overwriting material, or laying out the document completely anew. PDF holds the position that the creator gets to choose the appearance, so shifting partial positioning responsibility to the renderer would be difficult to design.
Enough auxiliary information has been defined for PDF files that a well-written program can extract the text content from PDF pages for reading aloud to the blind, for searching, or for extracting Unicode strings of characters. Adobe has done this, and that code is available through the Adobe Developers Connection via the Acrobat PDF SDK.
The PDF design is tailored to the creator being able to specify, quite directly and without ambiguity, the exact output desired. That is a strong virtue of PDF, and more difficult text extraction is a price worth paying for that design.
Jim King (contact me at: firstname.lastname@example.org)