One requirement for scanned PDFs submitted to most regulatory agencies is that the files be searchable. Optical Character Recognition (OCR) is what makes a scanned PDF text-searchable.
In a previous article on Batch OCR, I discussed using Acrobat Pro to create a Batch Sequence to process multiple files.
Acrobat 9 Standard and Pro now offer an easier way to OCR multiple files with a new feature: Recognize Text in Multiple Files using OCR.
Read on to learn how to use this new feature.
OCR Multiple Files in Acrobat
To use this new feature in Acrobat 9:
- Choose Document > OCR Text Recognition > Recognize Text in Multiple Files using OCR
- Click the Add Files button at the top of the window
Add Files allows you to select one or more files from different folders.
Add Folders will OCR every file in a folder, including the files in its subfolders.
Add Open Files will OCR the files currently open in Acrobat.
- After adding files, click the OK button.
- The Output Options screen appears:
A) You can overwrite the files or save a copy to another folder
B) You can rename the files with a prefix or suffix, e.g. document_ocr.pdf
C) Enable Fast Web View, which lets PDFs display faster on the web
D) PDF Optimizer (Pro only) allows you to specify PDF versions and resample images
- Click OK.
- The Recognize Text – Settings window appears:
- Choose the language needed.
- Set the PDF Output style. You might try the new ClearScan style, which smooths type.
- Click OK. Acrobat begins processing your files and the Progress window appears.
Any errors will be displayed in this window.
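If you want to predict which files a batch will touch and what the copies will be called before you run it, the folder and naming options above can be mimicked with a few lines of Python. This is only a planning sketch and does not drive Acrobat in any way. The function names (collect_pdfs, output_name), the "_ocr" default suffix and the choice to look only at .pdf files are my own assumptions for illustration.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Recursively list the PDF files under a folder.

    Mimics the idea behind Add Folders; restricting the list to
    .pdf files is an assumption made for this sketch.
    """
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_name(source, target_folder=None, prefix="", suffix="_ocr"):
    """Build the path of the saved copy for one source file.

    target_folder=None keeps the copy next to the original;
    prefix and suffix are added around the original file name.
    """
    source = Path(source)
    folder = Path(target_folder) if target_folder else source.parent
    return folder / f"{prefix}{source.stem}{suffix}{source.suffix}"


# Example: the suffix style shown in option B
print(output_name("document.pdf").name)  # document_ocr.pdf
```

Calling collect_pdfs on a project folder and passing each result to output_name gives a quick list of the files you should expect to find after the batch finishes.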
I can’t OCR using Acrobat. What does the "renderable text" warning mean?
One OCR error you might encounter is "The page contains renderable text."
When this error is indicated, the page contains vector text elements:
- Acrobat cannot OCR a document which has text elements on top of image elements.
Acrobat 8.1 and up will ignore renderable text objects within a margin defined as 20% of the width or height of the page.
For example, for a standard 8.5″ by 11″ page:
- Left Margin = 1.7″ (8.5″ × 0.20)
- Top Margin = 2.2″ (11″ × 0.20)
Here’s a PDF version of the above illustration which you can use to check your own files.
- Acrobat cannot OCR a document that was output from an electronic file (such as Word) directly to PDF.
- If you zoom in on the document in question and the text looks very crisp, that is likely the issue.
- Good news: these documents are already searchable.
- You won’t be able to OCR a Visio drawing containing only vector elements, for example.
- More good news. When converted directly from a CAD program, the text is already searchable.
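To make the 20% margin rule above concrete, here is a minimal Python sketch of the arithmetic for a Letter-size page. The function name ocr_margins and its parameters are invented for illustration; this is not how Acrobat works internally, only the calculation described above.

```python
def ocr_margins(width_in, height_in, fraction=0.20):
    """Return (left_margin, top_margin) in inches.

    The margin is a fraction of the page width (left margin)
    or of the page height (top margin); 0.20 is the 20% figure
    quoted for Acrobat 8.1 and up.
    """
    return (width_in * fraction, height_in * fraction)


# Standard 8.5 x 11 inch page
side, top = ocr_margins(8.5, 11.0)
print(f"Left margin: {side:.1f} in")  # Left margin: 1.7 in
print(f"Top margin:  {top:.1f} in")   # Top margin:  2.2 in
```

Swap in your own page dimensions to see how far into the page a header, footer or stamp can reach before it is likely to trigger the warning.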
OCR is a great solution if you only have access to paper documents. However, I see many companies using paper-based workflows when there are clearly better, "greener" alternatives.
Business-critical workflows based on paper are fragile, prone to error and costly. Look into using PDF forms to collect data. Use SAFE digital signatures instead of "wet" signatures. Use electronic document review instead of paper.