Archive for December, 2007

Batch OCR using Acrobat Professional


Paper, or its digitally scanned equivalent, is still a large component of regulatory filings for many pharmaceutical firms.

Note that the FDA encourages submission of PDF documents created from computer applications instead of scanned PDFs.

The FDA offers this guidance on the CDER site in a PDF document called Portable Document Format Specifications:

Avoid image based PDF files whenever possible. PDF documents created directly from an electronic source such as a word processing file provides many advantages over PDF documents created by scanning paper documents. Scanned documents are more difficult to read and do not allow the reviewer to search or copy and paste text for editing in other documents.

Unfortunately, paper is an unavoidable part of submissions for many firms.

To assist agency reviewers, it is a best practice to run Optical Character Recognition (OCR) on scanned originals to create searchable PDF documents.

Background: Making Paper Searchable

Most devices that scan to the PDF format produce an image-only PDF. An image-only PDF contains a picture of a page (the scan) in a PDF wrapper; it does not contain searchable text.
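Before running OCR on a large folder, it can help to find out which files are actually image-only. The Python sketch below is one rough way to do that, not an Acrobat feature: it assumes that a PDF containing real text will reference a font somewhere, so a file with no /Font entry is probably a bare scan. The function names (looks_image_only, pdfs_needing_ocr) are hypothetical, and the check is a heuristic that encrypted or unusually structured files may defeat.

```python
import re
import zlib
from pathlib import Path

# Assumption: text on a page requires a font resource, so a PDF with
# no /Font entry anywhere is very likely an image-only scan.
FONT_MARKER = b"/Font"

# Captures the raw bytes between "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Return True when no font reference is found (file likely needs OCR)."""
    if FONT_MARKER in pdf_bytes:
        return False
    # Newer PDFs may store object dictionaries inside Flate-compressed
    # streams, so inflate each stream and look there as well.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
        if FONT_MARKER in inflated:
            return False
    return True


def pdfs_needing_ocr(folder: str) -> list:
    """List PDF files under a folder that appear to lack searchable text."""
    return sorted(
        path
        for path in Path(folder).rglob("*.pdf")
        if looks_image_only(path.read_bytes())
    )
```

Calling pdfs_needing_ocr on a folder of scans returns the paths that still appear to need OCR. Running it again after OCR is a quick sanity check, since files that gained a text layer should drop off the list.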

Acrobat’s OCR feature adds an invisible layer of searchable text to a scanned document to assist document reviewers. The feature is available via a menu selection.

Acrobat Standard can perform OCR on only one document at a time.

Acrobat Professional, however, can create Batch Sequences, which automate OCR across many documents.

In this article, I offer step-by-step instructions to create a batch sequence that allows for the efficient processing of hundreds or thousands of documents.

Note that this article does not cover every aspect of the FDA’s guidance for PDF creation, but I plan to address additional topics in the future.
