In the AEC industry we often receive documents that have been scanned. A lot of the time, though, the user who originally scanned the document either did not have the ability or did not take the time to OCR (Optical Character Recognition) it. Why is OCR important? Well, if the original creator did not OCR the document, you essentially have a nice picture but no intelligence in the document. If you search for a word in the document, you will not get any results even though you can see that the word is there. This is especially important when you have a large volume of documents and you need to search across them for a key word or phrase. Most people do not know that Adobe Acrobat gives you the ability to OCR a document. I am going to use this blog entry to take you through the steps of how to OCR a document, starting with an unintelligent scanned image.
What cracks me up when I go out and meet with AEC companies is that when I start to discuss how they handle scanned images, I am usually met with the response, “Well, we use TIFF because it is a better format!” My follow-up response is, “Do you know who owns the specification for the TIFF file format?” Yes, that is correct: Adobe owns the TIFF specification. Okay, enough of the history lesson; let’s take a look at how you can start with a scanned image and make it an intelligent document.
The first step is simply to locate the scanned image and convert it to PDF. There are several ways to do this: choose File – Create PDF – From File, right-click the document and choose Convert to PDF, or simply drag and drop the file or files into an open Acrobat window. If you drag multiple files, Acrobat will give you the option of combining them into a single document. Converting the document to PDF does not OCR it, so it is still just an image. As you can see from the image below, I have searched for the word “Ozone” and no matches were found, even though you can clearly see the word Ozone in the document.

To OCR the document, choose Document – OCR Text Recognition – Recognize Text Using OCR from the top menu, which will bring up the following dialog box.

You can click on the Edit button and make any changes to the settings that you need to. For a detailed explanation of each setting you can visit the Help section to determine which one is best for you. For the most part you can start with the defaults that are given.

Simply click OK and the OCR process will begin. It will go through and recognize the text in the document as well as clean up and deskew the image. Now if you perform the same search for the word “Ozone”, it will find all the instances of Ozone in the document.

While the OCR engine is very accurate, it may have trouble with some text or with poorly scanned documents. As a check, you can choose Document – OCR Text Recognition – Find First OCR Suspect. This lets you step through the words that the OCR engine had trouble with, and you can either confirm or correct each one.
Sometimes you start with a document that has been OCR’d but still needs work. A good tip for these types of documents is the Document – Optimize – Scanned PDF functionality. A customer I was with not too long ago simply wanted to deskew a document, and this is the feature they used to make the correction.
This is one of those “wow” features that a lot of people do not even know exists in Acrobat. Scanned images are a staple in the AEC industry, though, and this is a great feature for taking what is a very dumb document and actually giving it intelligence!
Have fun,
Jonathan

What steps would you take to OCR a scanned CAD drawing? Would OCR software be able to discern between various sized fonts and geometry?
Hello,
If you scanned the original CAD drawing, you need to make sure that it was scanned at 72 dpi or higher. For the settings you can probably use either Searchable Image or Searchable Image (Exact). I would recommend a setting of 300 dpi.
Jonathan
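To put the 72 and 300 dpi figures in perspective, here is a small arithmetic sketch. The sheet size (a 36 x 24 inch ARCH D sheet) and the 1/8 inch text height are assumptions picked for illustration, not values from the post or from Acrobat.

```python
# Illustrative arithmetic only. The sheet size and text height are
# assumptions chosen for the example, not Acrobat settings.

def pixels(inches: float, dpi: int) -> int:
    """Pixels a scan produces along a given length at a given resolution."""
    return round(inches * dpi)

SHEET_W_IN, SHEET_H_IN = 36.0, 24.0  # assumed ARCH D sheet, landscape
TEXT_HEIGHT_IN = 0.125               # assumed 1/8 inch annotation text

for dpi in (72, 150, 300):
    width = pixels(SHEET_W_IN, dpi)
    height = pixels(SHEET_H_IN, dpi)
    text_px = TEXT_HEIGHT_IN * dpi   # height of one line of text in pixels
    print(f"{dpi:>3} dpi: {width} x {height} pixels, "
          f"text about {text_px:.1f} px tall")
```

Under these assumptions, small annotation text is only about 9 pixels tall at 72 dpi but 37.5 pixels tall at 300 dpi, which suggests why the higher setting gives the OCR engine more to work with.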
Good blog post. I have some scanned material that is a mix of Japanese and English. Do you have any advice on handling this? Also, the Japanese has furigana (characters indicating the pronunciation) under the kanji, which may complicate matters further. Thanks.
Hello,
Acrobat offers a total of 35 different languages including double-byte languages such as Japanese, Korean, and Chinese.
Acrobat always uses English as a secondary OCR language. For example, English words will be properly recognized when mixed in a Japanese language document.
Have Fun,
Jonathan
Hi
Thanks a lot for your blog. I have Acrobat prof. 9. When I go to "OCR text recog.", in "find OCR suspects" window the area under "setting" is empty! & nothing happens! Seems it doesn't work for my pdf!
Hello,
You have to first OCR the document, then in the second step you go back and check for suspects. Hope this helps. Jonathan
Can someone tell me if I can take multiple documents in a folder and convert them all at once to OCR?
Hello,
Yes, you can use the Batch Processing feature. Go to the Advanced menu – Document Processing – Batch Processing, then set up a sequence for OCR.
Jonathan
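Before setting up a batch sequence, it can help to know which PDFs in a folder still lack a text layer. The following is a rough sketch that runs outside Acrobat and uses only the Python standard library. It rests on an assumption: a PDF with recognized text normally references a font, so a file with no `/Font` entry is probably image-only. The function names `looks_searchable` and `needs_ocr` are hypothetical, and the check is a heuristic that can be wrong in either direction.

```python
import re
import zlib
from pathlib import Path

# Matches the raw bytes between "stream" and "endstream" in a PDF file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def looks_searchable(pdf_bytes: bytes) -> bool:
    """Heuristic (assumption): a PDF with a text layer references a font."""
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs may keep their dictionaries inside compressed streams,
    # so also look inside every stream that inflates cleanly.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (for example a JPEG image); skip it
        if b"/Font" in inflated:
            return True
    return False

def needs_ocr(folder: str) -> list[str]:
    """Names of the PDFs in `folder` that show no sign of a text layer."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.pdf")
        if not looks_searchable(path.read_bytes())
    )

if __name__ == "__main__":
    for name in needs_ocr("."):
        print("Probably image-only:", name)
```

Anything the script lists is a candidate for the OCR sequence. Files it does not list may still deserve a spot check, since a scan with a font-based stamp or header would pass this test without having been OCR'd.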