Batch OCR using Acrobat Professional


Paper, or its digitally scanned equivalent, is still a large component of regulatory filings for many pharmaceutical firms.

Note that the FDA encourages submission of PDF documents created from computer applications instead of scanned PDFs.

The FDA offers this guidance on the CDER site in a PDF document called Portable Document Format Specifications:

Avoid image based PDF files whenever possible. PDF documents created directly from an electronic source such as a word processing file provides many advantages over PDF documents created by scanning paper documents. Scanned documents are more difficult to read and do not allow the reviewer to search or copy and paste text for editing in other documents.

Unfortunately, paper is an unavoidable part of submissions for many firms.

In order to assist agency reviewers, it is a best practice to use Optical Character Recognition (OCR) to create a searchable PDF document from scanned originals.

Background: Making Paper Searchable

Most devices that scan to the PDF format produce an image-only PDF. An image-only PDF contains a picture of a page (a scan) in a PDF wrapper; it does not contain searchable text.
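If you want a rough way to spot which files in a folder are probably image-only before running OCR, the sketch below may help. It is not part of Acrobat. It relies on a crude assumption of mine: a PDF that draws text needs a font resource, so a file with no `/Font` key in its raw bytes probably has no text layer. Files saved with compressed object streams can hide that key, so treat the result as a hint only. The function names are hypothetical.

```python
from pathlib import Path


def looks_image_only(pdf_bytes):
    """Crude heuristic (an assumption, not an Acrobat feature):
    no /Font key in the raw bytes suggests there is no text layer."""
    return b"/Font" not in pdf_bytes


def find_image_only(folder):
    """Return the names of PDF files in `folder` that look image-only."""
    names = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() == ".pdf":
            if looks_image_only(path.read_bytes()):
                names.append(path.name)
    return names
```

Calling `find_image_only` on your scan folder returns a list of file names worth opening in Acrobat to confirm that they really lack searchable text.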

Acrobat’s OCR (Optical Character Recognition) feature allows the addition of an invisible layer of searchable text to assist document reviewers:

Acrobat offers OCR via a menu selection

Acrobat Standard can perform OCR on only one document at a time.

Acrobat Professional, however, can create Batch Sequences which offer OCR automation.

In this article, I offer step-by-step instructions to create a batch sequence that allows for the efficient processing of hundreds or thousands of documents.

Note that this article does not cover every aspect of the FDA’s guidance for PDF creation, but I plan to address additional topics in the future.

Batch Processing to the Rescue

Setting up and using Batch Processing in Acrobat Professional takes only a few clicks. You will need to:

  1. Set up the Batch Sequence
  2. Run the Batch Sequence

Once you create the Batch Sequence, it may be reused for additional projects.

Setting up for Best Possible Throughput

OCR speed is dependent on three factors:

  1. Location of the source and destination files.
    Reading from and writing files to a local PC hard drive will greatly improve speed compared to using a network folder.
  2. Processing Speed of the host PC
    Modern, faster computers will complete the OCR process faster. For best results, your host PC should have at least 1GB of RAM.
  3. Complexity of the source document
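Since local files process faster than files on a network folder, one option is to copy the scans to a local working folder before running the batch. The following is a minimal sketch of that idea, not something Acrobat provides. The function name, the folder arguments, and the list of extensions are my own assumptions.

```python
import shutil
from pathlib import Path


def stage_locally(source_dir, work_dir, extensions=(".pdf", ".tif", ".tiff")):
    """Copy scan files from source_dir (e.g. a network share) to a local work_dir.

    Returns the names of the files that were copied, in sorted order.
    """
    src = Path(source_dir)
    dst = Path(work_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the local folder if needed

    copied = []
    for path in sorted(src.iterdir()):
        # Only copy regular files with a scan-like extension (case-insensitive).
        if path.is_file() and path.suffix.lower() in extensions:
            shutil.copy2(path, dst / path.name)  # copy2 also keeps timestamps
            copied.append(path.name)
    return copied
```

After the batch finishes, the processed files can be copied back to the network folder the same way.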

Document Types for Input

Scan your documents locally or send them to a PC where Acrobat Pro is installed.

For easiest processing, scan directly to PDF or to an MTIFF (multi-page TIFF). These formats keep all of the pages of a document together in a single file.

Creating a Batch Sequence

  1. Open the Batch Sequence window:
    In Acrobat Professional 7, choose Advanced -> Batch Processing
    In Acrobat Professional 8, choose Advanced -> Document Processing -> Batch Processing

    Superior OCR Features in Acrobat 8

    Acrobat 8 offers significantly more accurate OCR than previous versions.

    Acrobat 8 offers a total of 35 different languages including double-byte languages such as Japanese, Korean, and Chinese.

    Acrobat always uses English as a secondary OCR language. For example, English words will be properly recognized when mixed in a Japanese language document.

    Acrobat 8 allows OCR to work on documents to which PDF headers and footers were added. These would previously cause a "Renderable Text" error.

  2. Click the New Sequence button.
  3. Give the sequence a name.
    Naming the Sequence

  4. Click Select Commands
  5. Next, choose the commands to run on your file set:
    + Choose Recognize Text Using OCR and click the Add button.
    + Choose Make Accessible and click the Add button.

    Make Accessible adds structure to the PDF which allows for easier content re-use. This feature also offers visually impaired individuals the structure necessary to make best use of their screen reading software. Creation of Accessible Documents should be considered a Best Practice.

    Adding commands to the sequence

  6. Select Recognize Text Using OCR on the right side of the window.
    Click the Edit button.
    Changing the settings for OCR
  7. Choose the appropriate language from the Primary OCR Language menu.

    You may downsample the following document types to 300 dpi to reduce file size:

    • Handwritten notes (black ink)
    • Plotter output graphics
    • High-pressure liquid chromatography

    Click OK again to get back to the main window.
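To see why downsampling to 300 dpi reduces file size, it helps to work out the pixel counts. The short calculation below is only an illustration. The 600 dpi source resolution and the 8.5 x 11 inch page are assumptions of mine, and the actual file size also depends on the compression Acrobat applies.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_ratio(from_dpi, to_dpi):
    """Fraction of the original pixels left after downsampling.

    Resolution changes in both directions, so the ratio is squared.
    """
    return (to_dpi / from_dpi) ** 2


# Assumed example: a US Letter page scanned at 600 dpi, downsampled to 300 dpi.
print(pixel_dimensions(8.5, 11, 600))  # (5100, 6600)
print(pixel_dimensions(8.5, 11, 300))  # (2550, 3300)
print(pixel_ratio(600, 300))           # 0.25
```

Halving the resolution leaves one quarter of the pixels, which is why the saving is larger than the change in dpi alone suggests.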

Other Available Options

Other options in the Edit Batch Sequence window allow you to:

  • Set a source and output folder for files

The Output Options button allows you to:

  • Set a prefix or suffix to documents
  • Use the PDF Optimizer to selectively compress files and run other document operations.
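If downstream scripts need to predict the names of the processed files, a small helper like the one below can model a prefix or suffix setting. This is a sketch under an assumption: that the suffix is inserted after the base name and before the extension. Check a real output file from your own sequence before relying on it. The function name is hypothetical.

```python
from pathlib import Path


def output_name(filename, prefix="", suffix=""):
    """Predict an output file name for the given prefix and suffix.

    Assumes the suffix goes between the base name and the extension.
    """
    p = Path(filename)
    return f"{prefix}{p.stem}{suffix}{p.suffix}"


print(output_name("report.pdf", suffix="_ocr"))   # report_ocr.pdf
print(output_name("report.pdf", prefix="ocr_"))   # ocr_report.pdf
```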

Need to log operations?

Choose Edit -> Preferences and enable "Save Warnings and Errors in Log File"

Run a Batch Sequence

To run the batch sequence:

  1. Place all the files you wish to process in a single folder on your hard drive.
  2. In Acrobat Professional 7, choose Advanced -> Batch Processing
    – or –
    In Acrobat Professional 8, choose Advanced -> Document Processing -> Batch Processing
  3. Select the sequence to run
  4. Click OK
  5. Select the folder to process
  6. Click the Select button.
  7. Select the Output Folder

That’s it! Sit back and enjoy a cup of coffee as Acrobat does the work for you.
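With hundreds or thousands of files, it is worth confirming afterwards that every input produced an output. The sketch below compares the two folders outside of Acrobat. It assumes the output files are PDFs that keep the same base name as the input (no prefix or suffix set), and the function name is my own.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir):
    """Return the input file names that have no matching PDF in output_dir.

    Matching is by base name, ignoring case and the input extension,
    so scan01.tif is considered done if scan01.pdf exists.
    """
    produced = {
        p.stem.lower()
        for p in Path(output_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    }
    return sorted(
        p.name
        for p in Path(input_dir).iterdir()
        if p.is_file() and p.stem.lower() not in produced
    )
```

An empty list means every input has a counterpart. Any names returned are good candidates to look up in the log file or to re-run.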