Batch OCR Multiple Files Easily

Most regulatory agencies require that scanned PDFs submitted to them be text searchable. Optical Character Recognition (OCR) is what makes a scanned PDF searchable.

In a previous article on Batch OCR, I discussed using Acrobat Pro to create a Batch Sequence to process multiple files.

Acrobat 9 Standard and Pro now offer an easier way to OCR multiple files with a new feature, Recognize Text in Multiple Files using OCR:

[Screenshot: OCR Multiple window]

Read on to learn how to use this new feature.

OCR Multiple Files in Acrobat

To use this new feature in Acrobat 9:

  1. Choose Document > OCR Text Recognition > Recognize Text in Multiple Files using OCR.

  2. Click the Add Files button at the top of the window:
    [Screenshot: Add Files button]
    Add Files allows you to select one or more files from different folders.
    Add Folders will OCR every file in a folder, including its subfolders.
    Add Open Files will OCR the files currently open in Acrobat.

  3. After adding files, click the OK button.
  4. The Output Options screen appears:
    [Screenshot: OCR Multiple Output Options window]
    A) You can overwrite the files or save a copy to another folder.
    B) You can rename the files with a prefix or suffix, e.g. document_ocr.pdf.
    C) You can enable Fast Web View, which makes PDFs display faster on the web.
    D) PDF Optimizer (Pro only) allows you to specify PDF versions and resample images.

  5. Click OK.
  6. The Recognize Text – Settings window appears:
    [Screenshot: Recognize Text settings window]
    - Choose the language needed.
    - Set the PDF Output style. You might try the new ClearScan style, which smooths type.

  7. Click OK. Acrobat begins processing your files and the Progress window appears:
    [Screenshot: OCR Progress window]
    Any errors will be displayed in this window.
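
If you prepare or check files with a script outside Acrobat, the Add Folders and prefix/suffix options described above can be approximated in a few lines of Python. This is only a sketch of the behavior as described in the steps, not Acrobat's implementation; the function names `find_pdfs` and `output_name` are hypothetical.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return every PDF under folder, including subfolders (similar to Add Folders)."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_name(path, prefix="", suffix=""):
    """Build an output file name with an optional prefix and/or suffix.

    The original extension is kept; only the base name changes.
    """
    p = Path(path)
    return p.with_name(f"{prefix}{p.stem}{suffix}{p.suffix}")
```

For example, `output_name("document.pdf", suffix="_ocr")` yields `document_ocr.pdf`, matching the naming example in step 4.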

I can’t OCR using Acrobat. What does the "renderable text" warning mean?

One OCR error you might encounter is "The page contains renderable text."

This error means the page already contains vector text elements. There are several common causes:

  1. Acrobat cannot OCR a document which has text elements on top of image elements.

    Acrobat 8.1 and up will ignore renderable text objects within a margin defined as 20% of the width or height of the page.

    For example, for a standard 8.5″ by 11″ page:

    • Left Margin = 1.7″ (8.5″ × 0.20)
    • Top Margin = 2.2″ (11″ × 0.20)

    Here’s a PDF version of the above illustration which you can use to check your own files.

  2. Acrobat cannot OCR a document that was output from an electronic file (such as Word) directly to PDF.
    • If you zoom into the document in question and it looks very clear, that is likely the issue.
    • Good news: these documents are already searchable.
  3. Acrobat cannot OCR a file that does not contain an image.
    • You won’t be able to OCR a Visio drawing containing only vector elements, for example.
    • More good news. When converted directly from a CAD program, the text is already searchable.
  4. Acrobat cannot OCR a document that has already been OCR’d using the ClearScan option.
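
The 20% margin rule from item 1 is simple arithmetic: the side margin is 20% of the page width and the top margin is 20% of the page height. As a minimal sketch (the function name `margin_sizes` is hypothetical, and it only reproduces the calculation described in this article, not Acrobat's internal check), you can compute the margins for any page size like this:

```python
def margin_sizes(width_in, height_in, fraction=0.20):
    """Return (side_margin, top_margin) in inches for a page.

    fraction is the share of the page width/height treated as margin;
    0.20 is the 20% figure quoted for Acrobat 8.1 and up.
    """
    return (width_in * fraction, height_in * fraction)


side, top = margin_sizes(8.5, 11.0)
print(f"Left margin: {side:.1f} in, top margin: {top:.1f} in")
# prints "Left margin: 1.7 in, top margin: 2.2 in"
```

For other page sizes, pass the width and height in inches, e.g. `margin_sizes(11.0, 17.0)` for a tabloid page.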

Final Thoughts

OCR is a great solution if you only have access to paper documents. However, I see many companies using paper-based workflows when there are clearly better, "greener" alternatives.

Business-critical workflows based on paper are fragile, prone to error and costly. Look into using PDF forms to collect data. Use SAFE digital signatures instead of "wet" signatures. Use electronic document review instead of paper.