I’ve been hearing a lot of questions around SEO and PDF files lately. Duff Johnson wrote an excellent article on this topic titled Make your PDFs work well with Google (and other search engines). The article was written back in 2009 but the concepts are still just as valid today. What’s changed is Acrobat. Prior to Acrobat X, the steps necessary to follow Duff’s guidelines were manual… and tedious.
To make SEO for PDF files easier to implement, I’ve created an Action for use in Acrobat X Pro that will automate setting the properties of the PDF file to match Duff’s recommendations.
The “SEO this PDF file” Action will do the following in this order…
- Optimize a scanned PDF
- Add a document description
- Add tags to a document
- Reduce the file size, setting the version compatibility to Acrobat 9 or higher
- Rename the file to something meaningful to search engines
Download the “SEO this PDF file” Action, then open it to install it into Acrobat X Pro.
Let’s look at Duff’s recommendations and see how the Action helps. I’m going to present these issues in the order they appear in the article rather than in the Action. It’s easier to follow along that way.
In terms of PDF files, the blue underlined text in Google’s search results comes from one of two places. First, Google looks in the “Title” document information field. If it finds nothing, Google’s indexer tries to guess the document’s title by scanning the text on the first few pages. This usually doesn’t work, producing incorrect and improperly formatted results.
Of course, if the indexer DOES find text in the Title field, it will use it, regardless of whether that text is garbage or not. As a result, there are millions of PDF files on the web with gloriously informative titles such as: “Brkg2RechBrntGrtA122a.qxd”. Well, apart from the fact that the designer used Quark, there’s nothing else I know (or care to know) about this document from looking at the search results.
Ensuring that each and every PDF file contains a valid and meaningful title is the single easiest way to ensure that search results will display information that’s vital to getting users where they want to go. Leave PDF titles unattended, and the certain result is slower, less-reliable searches for every user, every time they search.
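To find unattended titles before Google does, a rough check like the one below can help. It is only a sketch of one possible heuristic, not something the Action does: the function name and the list of extensions are my own assumptions, and it expects the Title value to have already been read out of the PDF by some other tool.

```python
# Hypothetical list of authoring-file extensions that tend to leak into Title fields.
LEFTOVER_EXTENSIONS = (".qxd", ".indd", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".pdf")


def title_needs_attention(title):
    """Return True when a Title value is empty or looks like a leftover file name."""
    cleaned = (title or "").strip()
    if not cleaned:
        # No title at all: the indexer would have to guess from the page text.
        return True
    # A title ending in an authoring-file extension was probably never edited.
    return cleaned.lower().endswith(LEFTOVER_EXTENSIONS)
```

A title such as “Brkg2RechBrntGrtA122a.qxd” is flagged, while a plain-language title passes. The check cannot tell whether a title is actually meaningful; it only catches the most obvious leftovers.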
The Action will prompt you to select the title of the document and copy it to the clipboard. Because the Action has already run text recognition by this point, you should be able to select the title even if you started with an image-only PDF file. With the text on the clipboard, you can paste it into the metadata dialog that pops up in the next step, or simply type the title into the dialog if that’s easier. In the final step, the Title field becomes the file name of your PDF file, with spaces replaced by underscores. A meaningful Title field that corresponds to the file name will raise the search relevance of your document.
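The title-to-file-name rule is simple enough to sketch outside Acrobat. The snippet below is an illustration in Python, not the code the Action runs. The function name is hypothetical, and stripping characters that are unsafe in file names is my own assumption; the Action may handle them differently.

```python
import re


def title_to_filename(title):
    """Turn a document title into a PDF file name, replacing spaces with underscores."""
    # Splitting on whitespace trims the ends and collapses repeated spaces,
    # so double spaces do not turn into double underscores.
    name = "_".join(title.split())
    # Assumption: drop characters that are not allowed in file names on common systems.
    name = re.sub(r'[\\/:*?"<>|]', "", name)
    return name + ".pdf"
```

For example, the title “Make your PDFs work well with Google” becomes Make_your_PDFs_work_well_with_Google.pdf.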
This is also an opportunity to fill in the Author, Subject and Keywords fields, though, as Duff made abundantly clear, it’s the title that is essential.
In the “Other considerations” section, Duff covers the following…
PDF Specification: In early 2006, Google couldn’t index PDF files above the 1.5 specification. As of July 2009, Google indexes content from even Adobe’s latest specification version — 1.7 Adobe Extension Level 3 (Acrobat 9.x).
The Action’s “Reduce File Size” step recompresses the file to be compatible with Reader and Acrobat version 9 or higher. This keeps the file as small as possible while still allowing Google to index it.
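If you want to confirm which specification version a file declares, the header at the start of the file is the quickest place to look. The sketch below reads it from raw bytes. The function name and the 1024-byte search window are assumptions of this example. Keep in mind that the header is not the whole story: a file’s document catalog can declare a later version than the header does, and an Adobe Extension Level is not recorded in the header at all.

```python
import re


def pdf_header_version(data):
    """Return the (major, minor) version declared in a PDF header, or None if absent."""
    # Some files carry a few stray bytes before the header, so search a short
    # prefix rather than only the very first bytes (window size is an assumption).
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

To use it, open the file in binary mode, read the first 1024 bytes and pass them in. A file saved for Acrobat 9 compatibility would normally report (1, 7).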
OCR: Many PDF files are simply scanned pages with no searchable text. Until recently, these PDF files were effectively invisible to search engines because, well, there’s no text to search.
As of November, 2008, Google OCRs the image-based PDF files it downloads, no extra charge. This means that even plain, scanned pages will be searchable, even if the owner didn’t explicitly make them so. So far as I know, Google is (still) the only search engine to offer this feature.
Now Google’s OCR isn’t the greatest; it’s optimized for speed rather than accuracy. You can do this yourself, and get better search results, by running and quality controlling your own OCR, then posting the results.
To match Duff’s recommendation, the first step of the “SEO this PDF file” Action is to optimize a scanned PDF file. This runs OCR on any pages that are image-only and makes them searchable. It won’t have any effect on PDF files that didn’t originate on a scanner; the PDF files you create through the PDFMakers for Office, for example, won’t be changed.
Content reading order: Take a close look at search results showing your search terms in context, and you may find oddly spaced, duplicate or jumbled text. If you take Google’s offer to “View as HTML”, as I often do, the text often looks REALLY bad, with disjointed paragraphs, headings demoted to text and tables, columns and sidebars hopelessly confused.
If the way search engines retrieve and display search results matters to you, or if you must adhere to accessibility or Section 508 standards for web content, plan to get familiar with reading order in PDF-based content and with tagging, the structure information (headings, lists, tables, etc.) of PDF files.
The third step of the “SEO this PDF file” Action is to add tags to the document. As with OCR, this step is skipped if the document is already tagged, but it’s especially helpful if you started with a scanned file. You’re much more likely to get a reasonable summary in the Google results if the proper reading order is shown. For a lengthier discussion of this topic, read the original article.
As I mentioned earlier, it’s important that both the document title and the file name be meaningful.
Name that file! A PDF file’s name is often a vital part of content-management strategies. To ensure the filename also serves search engine optimization interests, try to give your files meaningful names that include keywords or reflect the file’s Title.
If the settings I’m using are not what you’re looking for, you can easily modify the Action.