Legal
The views expressed in this blog are my own and do not necessarily reflect the views of Adobe Systems Incorporated.
October 27, 2005
Troubleshooting Acrobat OCR
Searchable PDFs are critical in litigation and matter management. Using Acrobat's OCR function, you can turn mountains of paper into searchable PDFs that look just like the original.
Occasionally, you may run into some issues.
Read on to learn about some workarounds and key considerations.
Acrobat OCR Troubleshooting
Acrobat OCR generally works well, but occasionally you might run into the following problems:
Slow Processing
Solutions:
Read and Write Locally
Make sure your source files are read from, and your OCR'd files are written to, local volumes. Reading from or writing to the network, a CD, or a DVD is much slower. If you are short on space, try using an external USB 2.0 drive.
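If the source PDFs live on a network share or a disc, one way to follow this advice is to copy them to a local working folder before running OCR. The following is only a minimal sketch: the function name stage_locally and the "ocr_work_" folder prefix are made up for illustration, and it assumes the system temp folder sits on a local drive.

```python
import shutil
import tempfile
from pathlib import Path


def stage_locally(source_dir, work_dir=None):
    """Copy every .pdf in source_dir to a local working folder; return the new paths."""
    source = Path(source_dir)
    # Assumption: the system temp directory is on a local disk, not a network volume.
    target = Path(work_dir) if work_dir else Path(tempfile.mkdtemp(prefix="ocr_work_"))
    target.mkdir(parents=True, exist_ok=True)

    staged = []
    # Note: the "*.pdf" pattern is case-sensitive on some systems.
    for pdf in sorted(source.glob("*.pdf")):
        # copy2 also preserves timestamps, which helps when tracking originals.
        staged.append(Path(shutil.copy2(pdf, target / pdf.name)))
    return staged
```

Run OCR on the staged copies, then move the finished files back to the network location in one step.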
Input Resolution
Have you scanned above 300 dpi? 600 dpi? There are diminishing returns on OCR accuracy above 600 dpi.
Output Resolution
I generally recommend that you downsample after OCR. For example, scanning at 600 dpi yields slightly better accuracy than scanning at 300 dpi, but downsampling back to 300 dpi to make a smaller PDF can add 20% or more to your conversion times.
Did you scan in color or greyscale?
Scanning B&W documents in color mode results in dramatically bigger files. Acrobat cannot convert color documents to black and white. (Adobe Photoshop can, and in batch if you need to.) An image-only, black and white, letter-sized document should almost never be more than a 50K PDF if properly compressed. If your PDFs are a lot bigger than this, check your scanning settings.
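To see why the scan mode matters so much, it helps to compare the raw image data behind one page before any compression. This is a rough sketch, assuming a letter-size page at 300 dpi and the usual bit depths (1 bit per pixel for black and white, 8 for greyscale, 24 for color). The function name is illustrative, and compression is not modeled.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes for one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed bit depths for each scan mode.
MODES = {"black and white": 1, "greyscale": 8, "color": 24}

for name, bits in MODES.items():
    size = raw_page_bytes(8.5, 11, 300, bits)
    print(f"{name:>15}: {size / 1_000_000:.1f} MB uncompressed")
```

The black and white page starts at about 1.1 MB of raw data, greyscale at about 8.4 MB, and color at about 25.2 MB. The compressor has 24 times more data to work with when a B&W original is scanned in color.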
Large PDFs
Solutions:
Scan in Black and White
To limit the size of your PDFs, make sure you do not scan in color.
Use the PDF Optimizer in Acrobat Professional 7 or 8
Take advantage of JBIG2 Lossy compression to create smaller PDFs. Most incoming image-only PDF files use CCITT Group 4 fax compression. This compression flavor was designed for fax machines with limited processing power. It was great technology... in 1980. Choose Advanced-->PDF Optimizer.
Use Optimize Scanned PDF in Acrobat 8 Standard
This new feature of Acrobat 8 makes it easy to reduce the file size of scanned images. It can also deskew scanned pages, remove dirt, and so on. Choose Document-->Optimize Scanned PDF.
Scan to Size
If your scanner supports it, choose automatic page size if you regularly scan documents smaller than 8.5 by 11 inches. Remember that PDF documents can support multiple page sizes. Scanning a business card at letter size makes a larger file.
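A quick bit of arithmetic shows how much extra image a letter-size scan of a small original captures. This sketch assumes 300 dpi and a 3.5 by 2 inch business card. Both are assumptions for illustration, as is the function name.

```python
def scanned_pixels(width_in, height_in, dpi=300):
    """Number of pixels the scanner captures for a page of the given size."""
    return round(width_in * dpi) * round(height_in * dpi)


letter = scanned_pixels(8.5, 11)
card = scanned_pixels(3.5, 2)  # assumed business card size in inches

print(f"Letter page:   {letter:,} pixels")
print(f"Business card: {card:,} pixels")
print(f"A letter-size scan captures {letter / card:.1f}x the pixels")
```

Blank white areas compress well, so the difference in final file size is usually smaller than the raw pixel ratio suggests, but the extra area is never free.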
Slow Scanning
Solutions:
Buy a Faster Scanner
If you are using a scanner that is more than three years old, it may be time to upgrade. Newer units are dramatically faster. Consider buying a dedicated document scanner. I like the Fujitsu ScanSnap (about $400 street price), which includes a full version of Acrobat 7.0 Standard. The Fujitsu can scan 15 double-sided pages per minute directly to PDF Image-only format! The input bin can hold 50 pages. The downside of the ScanSnap is that it is not TWAIN or ISIS compliant (these are the two standard methods applications use to communicate with scanners), so it cannot be used directly from Create PDF from Scanner in Acrobat or with Photoshop, etc.
I also wrote an article about the Canon DR-2580C. This scanner may be used directly from Acrobat and works particularly well with Acrobat 8. The DR-2580C scans at 25 double-sided pages per minute.
Use a Scanning Service Bureau
Send out those bankers boxes of documents to a local scanning provider. They can return image-only PDFs for you to OCR. If your service bureau offers OCR'd PDFs, make sure you test them first. In many cases, we've found that selecting OCR'd text on the PDF is iffy. Ask them what kind of image compression they use. Test the documents to see if they are tagged. Most times, you'll get better results OCRing in Acrobat.
Acrobat Won't OCR Your File Because It Contains Renderable Text
You'll see the Renderable Text Error when the PDF you are trying to OCR contains vector elements such as stamps, annotations, or Bates numbers. It's a particular problem with federal court files that are image-only PDFs with stamped Bates numbers.
Solutions:
Remove Headers and Footers or Bates Numbers
Go to Document-->Add Headers and Footers and remove all entries. This solution only works if the headers and footers or Bates numbers were stamped using Acrobat 8.
Remove the Header/Footers Manually
You can select and delete the vector elements by choosing Tools-->Advanced Editing-->TouchUp Object Tool in Acrobat 8 Professional.
Another option is to use the Redaction Tools in Acrobat 8 Professional to remove them.
Comments
Hi,
Article very helpful.
Dm
What if it says no OCR because of renderable text, but there's no header or footer?
The renderable text issue may also arise from the use of watermarks, third-party stamping utilities, etc. Annotations don't cause renderable text issues because they are considered a separate layer from the document. Some third-party stamps are placed in the bytestream of the PDF, so Acrobat does not see them as annotations.
Is there any way to remove the third party stamps that are placed in the bytestream of the PDF so it can be OCRed?
----- Rick's Reply ---
Try: Tools-->Advanced Editing-->TouchUp Object Tool. Alternately, you could save the PDF to TIFFS and recombine the files, then OCR. Neither option is very attractive.
So if there is no header, no footer, no watermarks, etc., how can you make your product searchable when OCR still says renderable text?
---- Rick's Reply ----
Somewhere on the page, there is non-raster text. I have seen fax servers and other apps silently add this to PDFs. You can use the Touchup tools to delete them, or save back to TIFF and recombine, then OCR.
Regarding the renderable text issue: You can work around this problem by rasterizing the entire PDF and re-PDF'ing it. In other words, convert it to an image file and then back to a PDF. This eliminates any added headers/footers, page numbers, or other unknown renderable text.
For example, you can do a SAVE AS and use the TIF file format. This will save each page in your PDF as a separate TIF file. Then, in Windows Explorer, highlight all of the TIF files and select "Combine in Adobe Acrobat". This will convert them all back to PDF and combine them into one file. Save the new PDF file, and then you will be able to run OCR on it without getting the renderable text error.
----------- Rick's Reply -------
Thx, Carol. I do know about this workaround, but it is complex. In Acrobat Pro, you can do the TIFF conversion in batch, but you still have separate TIFFs for each page.
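One practical wrinkle with the per-page TIFF workaround is page order when the files are gathered by a script or a tool that sorts names as plain text, because "Page_10" then lands before "Page_2". The sketch below is a small helper that lists a folder's TIFFs in numeric page order before they are recombined. The function names and the file naming pattern are hypothetical, and the actual conversion and OCR still happen in Acrobat.

```python
import re
from pathlib import Path


def page_order_key(path):
    """Sort key that compares the digit runs in a file name as numbers."""
    # re.split with a capture group alternates text, digits, text, digits, ...
    parts = re.split(r"(\d+)", Path(path).name.lower())
    # Odd positions are always the digit runs, so convert only those.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def tiffs_in_page_order(folder):
    """Return the .tif/.tiff files in folder, ordered by page number."""
    files = [p for p in Path(folder).iterdir() if p.suffix.lower() in {".tif", ".tiff"}]
    return sorted(files, key=page_order_key)
```

For example, sorted(names, key=page_order_key) places a hypothetical brief_Page_2.tif ahead of brief_Page_10.tif, which a plain alphabetical sort does not.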
When I OCR an image only PDF I have always been confused about the settings. Actually, I get the first two (Primary OCR Language, and PDF Output Style), but what is meant by "Downsample"? What puzzles me is that the different options are described as "Lowest (600 dpi)," "Low (300 dpi),"... "High (72 dpi)." Why is 600 dpi called the lowest, while 72 dpi the highest?
Secondly, is this the setting you referred to above (under the heading Slow Processing Solutions > Output Resolution) when you recommended that we "downsample after OCR" (my italics)?
------------------- Rick's Reply ----------------------
Downsampling means to remove data from the image.
E.g. if you scan at 600 dpi, Acrobat can reduce the size of the file by downsampling the image of the page to 300 dpi.
Low to High means the least to the most downsampling.
Generally speaking, scanning at 300 dpi is the best option for everybody. There are diminishing returns scanning at higher resolution and the downsampling step requires additional processing time.
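To make the "least to most downsampling" labels concrete, here is a rough sketch of how much of the original image each setting keeps. It assumes a 600 dpi scan, that pixel count scales with the square of the resolution, and it uses only the three settings quoted in the question. The function name is illustrative, not anything from Acrobat.

```python
def retained_fraction(scan_dpi, target_dpi):
    """Fraction of the original pixels left after downsampling to target_dpi."""
    if target_dpi >= scan_dpi:
        return 1.0  # target is not lower than the scan, so nothing is removed
    return (target_dpi / scan_dpi) ** 2


# Only the settings quoted in the question above.
SETTINGS = {"Lowest (600 dpi)": 600, "Low (300 dpi)": 300, "High (72 dpi)": 72}

for label, dpi in SETTINGS.items():
    kept = retained_fraction(600, dpi)
    print(f"{label}: keeps {kept:.1%} of the pixels from a 600 dpi scan")
```

"Lowest" leaves a 600 dpi scan untouched, "Low" keeps a quarter of the pixels, and "High" keeps under 2%, which is why the label describes the amount of downsampling and not the resulting resolution.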
Thanks for this info - very helpful with my awful assignment. I think removing headers/footers is easier than the TIFF option. Still, it's possible that in some cases, if the text quality is poor, you could touch up the text in Photoshop; increasing contrast, blackening blacks and neutrals, and whitening whites might make for a better OCR result. You would need to be dedicated, though!
When finishing a large scan, our person gets the following message: "Unable to process the page because the Paper Capture feature recognition service experienced an error." They are using Acrobat 7 Standard. Any suggestions?
-------- Rick's Reply -----
1) Reinstall Acrobat and update to the latest dot release
2) Add more memory to the computer
3) Make sure you are OCRing locally instead of on a network or CD
For the "renderable text" problem, I just now tried the remove header/footer trick but there were no headers or footers. There was, however, a date in the upper corner of the doc so I just used the object touch-up tool and deleted it and then the page OCR'ed fine.
-------- Rick's Comment -----------
Yes, that is the expected behavior when Acrobat 8's Bates feature was not used to stamp the document.