Acrobat OCR: Make your scanned documents searchable
Though OCR was added to Acrobat a while back, we still see many users who are pleasantly surprised to learn that Acrobat can do OCR. The topic also gets a fair amount of coverage on Twitter and Blogworld, so we thought it would be a good idea to provide a quick overview of OCR in Acrobat 9. Read on if you want to know how to make a document searchable using Acrobat 9.
You can make a document searchable using Acrobat in two different ways. The sections below provide more details on each of them.
- During the scan process
- On an already scanned PDF or image
OCR during the scan process itself
If you have a paper document that you need to scan and also make searchable, you can use Acrobat to do both in a single step. In the taskbar, go to Create >> PDF from Scanner and choose any of the 3 document presets (Black & White Document, Grayscale Document, Color Document). These 3 presets have the OCR option enabled by default, so you can get a fully searchable scanned PDF in a single click.
To verify that the OCR option is enabled in the chosen preset, click the “Configure Presets” menu option (Create >> PDF from Scanner >> Configure Presets). Choose the appropriate preset in the Preset textbox, and check whether the Make Searchable (Run OCR) checkbox is on. If it is not checked, check it, save the preset, and then click OK to close the Configure Presets dialog.
Click Options (next to the OCR checkbox) to view and change the OCR settings. Change the Primary OCR Language if required, and then choose the PDF Output Style from the following options.
- Searchable Image / Searchable Image (Exact): Use this option if you want to keep the scanned image, but still want the text to be searchable. In this case, Acrobat adds a hidden text layer on top of the image. This text is searchable using Acrobat or other desktop search engines.
- ClearScan: This is a new option added in Acrobat 9. Use this option if you want to keep the look of the document the same but still want to convert the scanned image to text so as to reduce the file size. Rick Borstein has more details on ClearScan in his blog posting here.
Once you have saved the preset with your changes to the OCR options, place the paper in the scanner’s feeder and click the same preset in the Create taskbar menu (Create >> PDF from Scanner >>…). You will get a scanned PDF which is fully searchable.
To verify that OCR completed successfully and the document is indeed searchable, try selecting text using the Select tool in Acrobat. If text is selected, then OCR has completed successfully. You can also copy this text and paste it into other applications such as Word or Notepad.
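If you have many scanned PDFs, the same check can be scripted. The sketch below is only an illustration, not an Acrobat feature: it assumes you have already extracted the text of each page with a PDF library of your choice (that step is not shown), and the helper names `pages_missing_text` and `is_fully_searchable` are hypothetical.

```python
def pages_missing_text(page_texts, min_chars=1):
    """Return the 1-based numbers of pages with fewer than min_chars visible characters.

    page_texts is a list with one entry per page; an entry may be None
    if the extractor returned nothing for that page.
    """
    missing = []
    for number, text in enumerate(page_texts, start=1):
        # Strip all whitespace so a page of blank lines counts as empty.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            missing.append(number)
    return missing


def is_fully_searchable(page_texts):
    """True only if the document has pages and every page yielded some text."""
    return len(page_texts) > 0 and not pages_missing_text(page_texts)


# Hypothetical extraction result for a 4-page scan: pages 2 and 3 have no text.
sample = ["Invoice 2009", "  \n", None, "Total due"]
print(pages_missing_text(sample))
print(is_fully_searchable(sample))
```

Pages reported as missing text are the ones worth opening in Acrobat and running OCR on again.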
Tip for improving OCR accuracy during Scanning process
You can improve OCR accuracy while scanning by modifying the compression options under the Optimization frame, either in the Configure Presets dialog or in the Custom Scan dialog.
Clicking Options opens the Optimization Options dialog. Change to Custom Settings and, under the compression frame, choose Lossless compression for Color/Grayscale and CCITT Group 4 for Monochrome images. This ensures that OCR works on the highest quality image, thereby improving OCR accuracy. It is especially useful for low resolution scans (150 DPI or lower).
If you do not want to use presets for scanning and would rather use your own scanning settings, you can use the Custom Scan option (Create >> PDF from Scanner >> Custom Scan), which prompts you for all scan related settings and then proceeds with scanning. Also note that this is the only option available for scanning if you are using Acrobat Pro on a Mac (Create >> PDF from Scanner).
In the Custom Scan dialog, check the OCR option if you want your scanned PDF to be searchable.
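As a quick reference, the compression advice above can be written down as a small lookup. This is a minimal sketch for documentation purposes only: the function names and the mode strings are hypothetical, and nothing here talks to Acrobat. It simply encodes the choices described in this tip.

```python
LOW_RES_DPI = 150  # scans at or below this resolution benefit most, per the tip above


def recommended_compression(color_mode):
    """Map a scan's color mode to the compression setting suggested in this post."""
    mode = color_mode.strip().lower()
    if mode == "monochrome":
        return "CCITT Group 4"
    if mode in ("grayscale", "color"):
        return "Lossless"
    raise ValueError("unknown color mode: " + repr(color_mode))


def is_low_resolution(dpi):
    """True when the scan resolution is in the range where the tip matters most."""
    return dpi <= LOW_RES_DPI


print(recommended_compression("Monochrome"))
print(recommended_compression("Color"))
print(is_low_resolution(150))
```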
OCR on already scanned PDFs/Images
If you already have a scanned PDF, for example one that you received from someone else, and you want to make it searchable, go to Document >> OCR Text Recognition >> Recognize Text using OCR.
Choose the pages that you want to OCR, click Edit to choose the Primary OCR Language and PDF Output Style, and then click OK and OK again to start the OCR process on the currently open scanned PDF.
Another Tip for Improving OCR accuracy if you have a scanned image in TIFF format
If you have a scanned image in an image format like TIFF, you can make it searchable in Acrobat by first converting it to PDF and then running OCR. Before you convert the image, change the compression options that Acrobat uses for the conversion. To do that, go to Edit >> Preferences >> Convert to PDF, select the image format (e.g. TIFF), click Edit Settings, and then change the image compression options.
- For Monochrome, change the compression to either JBIG2 (Lossless) or CCITT G4
- For Grayscale/Color, change to ZIP which is a lossless compression format.
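To see why a lossless setting matters for OCR, here is a small illustration using Python's standard `zlib` module, which implements Deflate, the algorithm behind ZIP compression. It is not how Acrobat does the conversion internally, and the pixel data is made up; the point is only that the pixels come back byte for byte, so OCR sees exactly what the scanner captured.

```python
import zlib


def zip_round_trip(pixels):
    """Compress raw pixel bytes with Deflate, then decompress them again."""
    compressed = zlib.compress(pixels, 9)
    restored = zlib.decompress(compressed)
    return compressed, restored


# Made-up 8-bit grayscale scanline: mostly white with a short dark stroke.
scanline = bytes([255] * 90 + [40] * 10)
page = scanline * 200  # 200 identical rows, 20,000 bytes in total

compressed, restored = zip_round_trip(page)
print(restored == page)             # every pixel value survives unchanged
print(len(compressed) < len(page))  # the data still takes less space
```

Both lines print True. A lossy setting would shrink the data too, but it would not guarantee the first line.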
After changing the compression settings for the image format, convert the image to PDF, either by dragging the image and dropping it onto Acrobat or by using Create PDF from File. Then run OCR from the Document menu as explained above.
This is a broad overview of how you can make your scanned documents searchable using Acrobat. Let us know if you have any questions.
Aman Deep Nagpal, Acrobat Product Manager