Posts in Category "OCR – Optical Character Recognition"

How can I detect if a PDF needs to be OCRd?

You just received 1000 PDFs from the other side which are a mix of PDFs created from Office applications and scans. Some of the documents might have been OCRd and some not.

How can you quickly detect which files need to be OCRd?

Further, how can you pull out and separate searchable and non-searchable PDFs?

I have written on this subject previously in my article “Is that PDF Searchable?” That post included information on how to test if individual documents are searchable and offered a basic way to detect searchability across files.

Why is detecting searchability hard?
When would you call a PDF searchable? When one word is searchable? When 100 words are searchable? When a page is searchable? When all the pages are searchable? What about pictures or text inside of pictures?
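One way to make those questions concrete is to pick explicit thresholds. The sketch below is only an illustration, not part of the Preflight workflow: it assumes you already have a count of extractable text characters for each page, and the names `is_searchable`, `min_chars` and `min_page_fraction` (and their default values) are hypothetical choices you would tune for your own documents.

```python
def is_searchable(chars_per_page, min_chars=50, min_page_fraction=1.0):
    """Return True if enough pages carry enough extractable text.

    chars_per_page: number of extractable text characters found on each page
    min_chars: assumed threshold for calling a single page searchable
    min_page_fraction: share of pages that must pass (1.0 = every page)
    """
    if not chars_per_page:
        return False  # no pages, nothing to search
    passing = sum(1 for count in chars_per_page if count >= min_chars)
    return passing / len(chars_per_page) >= min_page_fraction


# A three-page scan where only the first page was OCR'd
counts = [1200, 0, 0]
print(is_searchable(counts))                         # strict rule: every page must pass
print(is_searchable(counts, min_page_fraction=0.3))  # lenient rule: about a third is enough
```

The same file is "not searchable" under the strict rule and "searchable" under the lenient one, which is why any automated test needs its definition spelled out.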

I’ve been doing some research and in this article I offer up another way to check for searchable text.

To accomplish this, we will use the Preflight feature of Acrobat Pro. Acrobat’s Preflight feature offers hundreds of different tests including the ability to check for characters on the page. Preflight can be used on a single document or it can be automated using a batch sequence.

The following workflow isn’t perfect, but I offer it here to legal professionals who want to experiment with it.

In this article, you’ll learn how to create a Batch Sequence to run across folders of files which will:

  1. Separate searchable PDFs from non-searchable PDFs and place them in named folders
  2. Ignore non-PDF documents
  3. Create a Summary Report of searchability
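For readers who think in code, the logic of those three steps can be sketched outside Acrobat. This is not the Batch Sequence described in the article, only a rough Python outline of the same idea. The `has_text` argument is a hypothetical, caller-supplied check (the article uses Preflight for that part), and the folder names are assumptions.

```python
import shutil
from pathlib import Path


def sort_pdfs(source, has_text):
    """Copy PDFs in `source` into 'Searchable' / 'Non-Searchable' subfolders.

    has_text is a caller-supplied function (hypothetical here) that takes a
    Path and returns True when the PDF contains searchable text.
    Returns a list of (filename, folder_name) pairs as a summary report.
    """
    source = Path(source)
    report = []
    for path in sorted(source.iterdir()):
        # Skip subfolders and anything that is not a PDF
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        folder = "Searchable" if has_text(path) else "Non-Searchable"
        target_dir = source / folder
        target_dir.mkdir(exist_ok=True)
        # Copy rather than move, so the original folder stays untouched
        shutil.copy2(path, target_dir / path.name)
        report.append((path.name, folder))
    return report
```

Copying instead of moving is a deliberate choice in this sketch: with production documents from the other side, it is safer to leave the originals exactly as received.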

Continue reading…

How to OCR Documents in a PDF Portfolio

PDF Portfolios are increasingly being used by law firms and legal departments across many disciplines.

As a result, my email inbox receives questions such as this one, which I received today from the law department of a large US city:

We are curious as to whether it’s possible to OCR all documents within a PDF Portfolio?

Acrobat 9 allows you to run several operations across a PDF Portfolio including OCR.

What are PDF Portfolios?

A PDF Portfolio may be thought of as a digital container. This container can include PDF documents, Office docs, image formats and a lot more.

Once files are placed in a PDF Portfolio, they may be sorted and viewed various ways.

For more info, watch this short video or go to the Portfolio section of my blog.

Ideally, you’ll want to have Acrobat 9 Pro to OCR all the files in a PDF Portfolio, but you can use Acrobat 9 Standard, subject to the limitation I outline in the article.

Continue reading…

Reducing the File Size of Scanned PDFs

It seems like a lot of folks are struggling with the size of scanned PDFs. Below are excerpts from two emails I received recently:

My [Fujitsu] ScanSnap makes PDFs that are too big . . . like around 60K per page! What can I do to make these smaller in Acrobat?

I have to eFile [with the Federal Court] and am having to split the filings into many segments to go through the [Court] gateway. The issue seems to be with documents that are scanned on our network scanner. PDFs produced directly from Word are a lot smaller. Is there some trick to reduce the size of scanned files?

Before covering how to reduce the size of scanned documents in detail, let’s discuss four factors that affect the size of scanned images:

  1. Scanning Resolution
    A scan at 600 dpi results in a much larger file than at 300 dpi.
  2. Color Space
    Color and grayscale files result in much larger files than black and white files.
  3. Physical dimensions of the scanned page
    A legal-size scan will be larger than a letter-size scan, with all other factors being equal.
  4. Compression
    Raw scan data can be compressed to make it smaller.
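The first three factors can be put into rough numbers before any compression is applied. The following is a back-of-the-envelope sketch, assuming typical bit depths of 1 bit per pixel for black and white, 8 for grayscale and 24 for color; the function name `raw_scan_bytes` is just an illustrative label.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes for one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


LETTER = (8.5, 11)   # page size in inches
LEGAL = (8.5, 14)

print(raw_scan_bytes(*LETTER, 300, 1))    # black and white, 300 dpi: 1051875
print(raw_scan_bytes(*LETTER, 600, 1))    # black and white, 600 dpi: 4207500
print(raw_scan_bytes(*LETTER, 300, 8))    # grayscale, 300 dpi: 8415000
print(raw_scan_bytes(*LETTER, 300, 24))   # color, 300 dpi: 25245000
print(raw_scan_bytes(*LEGAL, 300, 1))     # legal size, black and white: 1338750
```

Doubling the resolution quadruples the raw data, and switching from black and white to color multiplies it by 24, which is why resolution and color space matter so much before compression even enters the picture.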

 

Compression Types

Lossless compression retains the exact appearance of the original.

Two common types of lossless compression are ZIP and CCITT Group 4.

Lossy compression makes some (hopefully) non-noticeable visual trade-offs to further reduce file size.

JPEG is a common lossy compression method.
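To see what "lossless" means in practice, here is a small sketch using Python's standard `zlib` module, which implements the Deflate algorithm used by ZIP. The "page" is synthetic data invented for the demonstration (mostly white with a few black runs), not a real scan, and CCITT Group 4 is not shown because the standard library does not include it.

```python
import zlib

# Synthetic stand-in for a black and white page:
# mostly white (0x00 bytes) with a few black runs (0xFF bytes).
page = (b"\x00" * 300 + b"\xff" * 20) * 1000

compressed = zlib.compress(page, level=9)
restored = zlib.decompress(compressed)

print(len(page), "bytes raw")
print(len(compressed), "bytes compressed")
print(restored == page)  # lossless: the round trip is byte-for-byte identical
```

A lossy method such as JPEG would not pass that last check, since it trades exact reproduction for a smaller file.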

Ideally, you would control all of the above factors yourself by scanning at 300 dpi, black and white and using an efficient compression algorithm.

Unfortunately, you may not have that option. Many desktop and network scanners offer limited or confusing options, or the scanned PDFs may have arrived from outside your firm.

Legal Scanning Recommendations
In almost all situations, scan at 300 dpi, black and white.

For the purpose of this article we will make a couple of assumptions:

  1. You have a black and white scanned document of unknown dpi and compression
  2. You have already OCR’d the document, or don’t need OCR

Read on to learn how to reduce the file size of scanned documents using Acrobat.

Continue reading…

Better PDF OCR. ClearScan is smaller, looks better

Optical Character Recognition (OCR) converts scanned paper documents into searchable PDF documents. This technology has been available in Acrobat for about ten years.

While OCR accuracy and language support have improved over the years, the default OCR “flavor,” Searchable Image, was the only useful choice.

Searchable Image retains the underlying scanned image and adds an invisible layer of text on top which may be selected:

 

Searchable Image OCR has some shortcomings:

  1. File Size
    For 300 dpi black and white scans, a typical file size is 15-40K per page. Scanning at a higher resolution (600 dpi vs. 300 dpi) increases file size about three to four times.
  2. Print Speed
    Because of the image-heavy content, searchable image PDFs can take a long time to print.
  3. Visual Quality
    At 300 dpi, scanned documents are easily distinguishable in quality from computer-generated files.

In Acrobat 9, Adobe engineers added a new flavor of OCR called ClearScan. ClearScan offers improved text quality with a decrease in file size:

I’ve recently completed some benchmarking which shows dramatic file size decreases and quality gains. Read on to learn about size comparisons, how to use ClearScan OCR and a bit more about how it all works.

Continue reading…

Creating a Non-Searchable PDF from Office Documents

Every once in a while, I receive an email that has me scratching my head a bit, such as this one:

When you PDF a document that you generate in MS Word, is there a way to produce an "image-only" PDF, with non-searchable text? The only way I know how is to print out and scan the document back into Acrobat.

Why would someone want to take a perfectly good, fully-searchable document and turn it into an image-only PDF which is just a picture of the page in a PDF wrapper?

The answer is that in the course of vigorously defending a client, some firms desire to make using documents as difficult as possible for the other side.

Compared with the other PDF flavors, an image-only PDF . . .

  • Is 3 to 5 times larger in file size
  • Looks worse on screen
  • Prints slower
  • Is not searchable

"Dumbing down" a PDF to an image probably doesn’t cripple the other side very much. Using OCR, the other side can quickly make the document searchable.

It is not without some trepidation that I share this tip. After all, a compact, searchable PDF should be what we all aspire to create.

However, since I suspect that many firms are printing out documents and rescanning them, I want to offer a greener alternative.

It’s not for me to comment on whether this is fair game or not as you work with the other side, but following is a workaround that will create an image-only, non-searchable PDF from an existing PDF document.

Continue reading…

Cleaning up Scanned Images

I recently received this message from a legal technology consultant:

I have had several clients (and have wondered myself) why there’s no way to delete something from a PDF. For example, if I scan a document and want to delete the black marks made by the staple holes in the top left corner, I can’t do that without cropping the entire image.  What is the reasoning for not including a feature that would allow me to draw a box around those staple holes and delete them from the image? 

Actually, Adobe did include a feature to clean up scanned images!

You can easily clean up scanned images using the Redaction tool:
Mark for Redaction Tool
Normally, redactions appear as a black box which obscures the underlying document. Did you know that Acrobat can redact to “No Color” as well?

In this article, I’ll offer step-by-step instructions for cleaning up scanned PDFs using the Redaction tool in Acrobat Professional.

Using this workflow, you can easily delete staple marks, hole punches, shadows, dirt and more from PDFs.

Read on to learn how…

Continue reading…

Acrobat 8.1 Update: Fix for Renderable Text Issue

Note: The renderable text issue noted below should not be an issue with any release of Acrobat after 8.1

Normally, a dot release to one of Adobe’s major product offerings isn’t that exciting.

Besides offering support for Microsoft Windows Vista and Microsoft Office 2007, the latest dot release to Acrobat 8 (v8.1) offers an OCR enhancement that will be very welcome indeed!

Acrobat 8.1 offers a fix to a most vexing OCR problem, the dreaded renderable text error:

Renderable Text Error Warning Dialog

Renderable Text is vector (computer generated) text that is placed on top of an image layer.

You may encounter this error when you try to OCR an image-only PDF containing a Bates stamp. In some federal court districts, stamped image-only PDFs are commonly distributed.

The Acrobat 8.1 Update offers a fix that works for just about every file that has Bates stamps.

For a complete list of fixes in the 8.1 Update, check out this Adobe Knowledge Base Article. (Opens in a new window)

Read on to learn how to get the Acrobat 8.1 Update and some limitations of the fix.

Continue reading…

Troubleshooting Acrobat OCR

Searchable PDFs are critical in litigation and matter management. Using Acrobat’s OCR function, you can turn mountains of paper into searchable PDFs that look just like the original.

Occasionally, you may run into some issues.

Read on to learn about some workarounds and key considerations.

Continue reading…

Is that PDF Searchable?

Most law firms and even solos have a scanner that can create PDFs from paper documents. Overwhelmingly, these devices create image-only, non-searchable PDFs.

Using Optical Character Recognition (OCR), Acrobat can add an invisible layer of searchable text while maintaining the original appearance.

The resulting searchable file is referred to as an image+text PDF.

An image+text PDF looks no different than a PDF which is not searchable. That creates a problem.

How can you tell if a PDF is searchable or not?

Continue reading…

Batch OCR using Acrobat Professional

Have you ever received a PDF file that did not contain searchable text? You may know that you can use Acrobat’s OCR (Optical Character Recognition) to add an invisible layer of searchable text on top of the file. This allows you to select, copy and search text on a paper document. Great!

What do you do when you have hundreds of TIFFs and image-only PDF files that you need to search for a big case? Working with these documents one at a time is not efficient.

If you have Acrobat Professional, you can batch OCR and let your computer do the work for you.

NOTE: Acrobat 9 and up make this process much easier. Simply select Document>OCR Text Recognition>OCR Multiple Files. If you have Acrobat 9 and you just want to OCR a bunch of files, this is probably all you need! Acrobat X can do OCR as part of an Action, so you can combine OCR with other operations as part of a document processing workflow.

Read on to learn how…

Continue reading…