Batch OCR using Acrobat Professional

Have you ever received a PDF file that did not contain searchable text? You may know that you can use Acrobat’s OCR (Optical Character Recognition) to add an invisible layer of searchable text on top of the file. This allows you to select, copy and search text on a paper document. Great!

What do you do when you have hundreds of TIFFs and image-only PDF files that you need to search for a big case? Working with these documents one at a time is not efficient.

If you have Acrobat Professional, you can batch OCR and let your computer do the work for you.

NOTE: Acrobat 9 and up make this process much easier. Simply select Document>OCR Text Recognition>OCR Multiple Files. If you have Acrobat 9 and you just want to OCR a bunch of files, this is probably all you need! Acrobat X can do OCR as part of an Action, so you can combine OCR with other operations as part of a document processing workflow.

Read on to learn how…

Batch Processing to the Rescue

There are two steps to follow:

  1. Set up a Batch Sequence
  2. Run a Batch Sequence

Set up a Batch Sequence

Scan your documents locally or send them to a PC where Acrobat Pro is installed.

If you have the capability, scan directly to PDF or to an MTIFF (multi-page TIFF). These formats allow all of the pages of a document to be maintained as a single file.

  1. In Acrobat Professional 7, choose Advanced—>Batch Processing
    — or —
    In Acrobat Professional 8, choose Advanced—>Document Processing—>Batch Processing
  2. Click the New Sequence button.
  3. Give the sequence a name.
  4. Click Select Commands
  5. Choose Recognize Text Using OCR and click the Add button.
  6. Double-click the Recognize Text using OCR text (right side of the window) to set OCR Options.
    - Set Downsample Images to 300 dpi, then click OK.
  7. Click OK again to get back to the main window.
  8. Click Output Options

    Note:
    Output Options allows you to specify where the OCR’d files should be written. I suggest writing them to a local drive and copying them later to a network store.
  9. Enable PDF Optimizer and Do not overwrite existing files.
  10. Click the Settings Button.

    Adjust the settings to make the smallest possible files, especially for Black and White (monochrome) files:

    JBIG2 Lossless is very efficient and preserves the exact appearance of the text.

    Consider trying JBIG2 Lossy which causes some visual degradation, but can be up to 70% smaller than JBIG2 Lossless.

  11. Click OK.
  12. Give the revised settings a name such as “B&W Lossy”.
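
To see how much a given set of PDF Optimizer settings actually saves on your own documents, you can compare the total size of the input folder with the total size of the output folder after a test run. The following is only a minimal sketch in Python; the function names and the folder arguments are illustrative assumptions, not part of Acrobat.

```python
import sys
from pathlib import Path


def folder_size(folder, suffixes=(".pdf",)):
    """Return the total size in bytes of matching files directly inside folder."""
    return sum(
        p.stat().st_size
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


def percent_saved(before_bytes, after_bytes):
    """Return the percentage reduction from before to after (0.0 if before is 0)."""
    if before_bytes == 0:
        return 0.0
    return 100.0 * (before_bytes - after_bytes) / before_bytes


if __name__ == "__main__":
    # Hypothetical usage: python compare_sizes.py <input folder> <output folder>
    before = folder_size(sys.argv[1])
    after = folder_size(sys.argv[2])
    print(f"Before: {before} bytes, after: {after} bytes")
    print(f"Saved:  {percent_saved(before, after):.1f}%")
```

Running this on a small sample with the Lossless settings and again with the Lossy settings gives a concrete basis for choosing between them before processing a large batch.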

Run a Batch Sequence

Now, all you need to do is to run the batch sequence.

  1. Place all the files you wish to process in a single folder on your hard drive.
  2. In Acrobat Professional 7, choose Advanced—>Batch Processing
    – or –
    In Acrobat Professional 8, choose Advanced—>Document Processing—>Batch Processing
  3. Select the sequence to run
  4. Click OK
  5. Select the folder to process
  6. Click the Select button.
  7. Select the Output Folder

That’s it!

Sit back and enjoy a cup of coffee as Acrobat does the work for you.
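
Step 1 of running the sequence asks for all files in a single folder. If your scans are scattered across subfolders, a small script can gather them first. The sketch below is written in Python and makes some assumptions: the function name, the list of extensions, and the folder paths are all illustrative, and it copies files (rather than moving them) so the originals stay untouched.

```python
import shutil
from pathlib import Path

# Extensions treated as scans; adjust to match your own files.
SCAN_SUFFIXES = {".pdf", ".tif", ".tiff"}


def gather_scans(source_root, batch_folder):
    """Copy every PDF/TIFF found under source_root into one flat batch_folder.

    If two files share a name, the later one gets a numeric suffix
    (report.pdf, report_1.pdf, ...) so nothing is overwritten.
    Returns the list of destination paths.
    """
    source_root = Path(source_root)
    batch_folder = Path(batch_folder)
    batch_folder.mkdir(parents=True, exist_ok=True)
    batch_resolved = batch_folder.resolve()

    copied = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in SCAN_SUFFIXES:
            continue
        # Skip files already sitting inside the batch folder itself.
        if batch_resolved in src.resolve().parents:
            continue
        dest = batch_folder / src.name
        counter = 1
        while dest.exists():
            dest = batch_folder / f"{src.stem}_{counter}{src.suffix}"
            counter += 1
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

For example, `gather_scans("C:/Cases/Smith", "C:/OCR/Input")` (both paths hypothetical) would fill the input folder, which you can then select in step 5.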

24 Responses to Batch OCR using Acrobat Professional

  1. rob light says:

    I have a question for you regarding batch processing of OCR. i am trying to convert a large group of .pdf files to searchable .pdf. however when i follow the batch processing steps that you outline here [http://blogs.adobe.com/acrolaw/?p=118], the software still makes me hit ok after each document is processed. is there anyway to have it automatically convert all 20,000 files w/o pressing ok 20,000 times?
    ————
    Rick’s Reply
    You don’t want to use the instructions from this article since it only applies to PDF Portfolios. Instead, use the instructions here: http://blogs.adobe.com/acrolaw/?p=7

  2. Brian says:

    This is great stuff. Do you know of a way to take the same process as described in your article except make it so that any pdf which makes its way into a “monitored folder” is automatically added with an OCR layer?

    Example might make this clearer. Suppose you had a high speed network scanner dumping files to a shared directory called “\\LAW-SERVER1\SCANS\NEEDS-OCR”.

    I am looking for a way to have the software automatically detect there is a new .pdf file in “\\LAW-SERVER1\SCANS\NEEDS-OCR” and then have it OCR’d, and as soon as the OCR process is complete, have it forwarded to “\\LAW-SERVER1\SCANS\ALREADY-BEEN-OCR” (obviously the names are just for illustration).

    In any case, if this can be done with acrobat, or acrobat in combination with some other process it would be amazingly useful for us and I am sure many other attorneys who don’t want (or don’t have the resources) to spend thousands on enterprise type scanning software.
    —– Rick’s Reply —-
    Acrobat doesn’t support hot folders, but you can easily set up a Batch Sequence that will take everything in one folder, OCR it, optionally rename it, optimize it and then place the files in another folder.

  3. Fred says:

    What happens when a PDF already has OCR made?
    Can I avoid rework?
    —————- Rick’s Reply ——————————
    That depends on what you mean. If the PDF is a “PDF Normal” file, such as one converted directly from Word, Acrobat will not OCR it. Acrobat can OCR a scanned file more than once. In fact, you might want to do that when upgrading to a newer version of Acrobat that offers more accurate OCR, like Acrobat X. The exception is ClearScan OCR. Once you use this “flavor” of PDF, you cannot re-OCR the file.

  4. Liana Joubert says:

    How can I straighten my documents?
    —- Rick’s Reply —-
    Optimize Scanned PDF

  5. Tere says:

    The routine is great, but when I run the sequence I have to keep hitting enter at every file. Can I run this routine without having the system keep asking to hit enter all the time? I would appreciate your comments.

  6. Michael says:

    If I have my Sequence created, is there a way to call it from a command-line? I would like to schedule this to run on a scheduled basis on a server rather than for a person to have to start the process.

    In addition, if it is possible to run via command-line, can I supply a folder name to search as well as a folder to place completed OCR’d files? That would come in very handy. Thanks for any info!

  7. Scott Tilden says:

    What affects the speed of searching with OCR’d files. I download large, old 600-page books — not tabular, nothing special, mostly text books. In some, searches for a name take seconds; in others minutes. Is it possible to embed some manner of preprocessed index I could do? In my case, the additional size isn’t particularly an issue.

  8. Brad Meyer says:

    This is AWESOME! Thanks for the help!!!

  9. Mike says:

    Is it possible to automatically OCR a document at the time of the creation/conversion when I print to PDF Printer?

    • Rick Borstein says:

      When you use an electronic source like Word, Excel, PowerPoint, etc., the files created are always searchable. IOW, you don’t need to OCR files created from electronic (non-paper) sources.

  10. Chris Johnson says:

    What is the difference between “optimized scanned PDF” and “PDF Optimizer?” Should I add “optimize scanned PDF” to my batch OCR routine (in addition to PDF Optimizer) and if so, before or after OCR?

    Thanks!!

    • Rick Borstein says:

      Good question and different use cases. Optimize Scanned PDF applies adaptive compression, creating B&W and color areas and OCRing the document. It is best for color documents. The PDF Optimizer is a general and powerful tool for changing resolution, compression and removing file items.

  11. Great blog/topic, and you/Rick came SO close to addressing my question in the last post. In Adobe 8 Pro, is there a way to run a batch so that the docs are processed via “Optimize Scanned PDF” – i.e. both OCR and optimized/reduced in file size? It’s strange to me that this is not a default setting option in the batch processing dialog box, as I would think this would (and or should) be one of the first things users (and attorneys in particular) would want to do…Thanks a lot in advance.

    • Rick Borstein says:

      Yes, you can OCR and then use the PDF Optimizer via a Batch Process in Acrobat 8.

  12. David says:

    Quick update, I deleted the previous job and re-created, the process works fine now, no confirmation boxes for every document. One difference was a triple “…” in the OCR options drop down box, although I cannot find any documentation of what the triple full stop refers to. Hope this helps someone!

  13. Johann Ferreira says:

    This works in version 9 but what do I do with Acrobat X as the sequence functionality was removed. I used to OCR, optimize and add a watermark in 1 sequence. Now, with Acrobat X, I can not do this

    • Rick Borstein says:

      You can do that in Acrobat X using the Action Wizard. It’s tons easier, too!

  14. Hunter Goff says:

    Is there a way to set up a scheduled task for doing this? As an example, I want to scan a set folder every night and have those files OCRed and moved to another. Any ideas for an easy way to do this?

  15. Mike Davis says:

    Hello, is there a method to schedule batch OCR to start at a specified time i.e. overnight when networks are quiet? We have Pro X. Thanks in advance.