Comparing Scanned Documents Tips and Workarounds

Acrobat Pro includes tools that allow you to compare two PDFs to quickly spot the differences.

In Acrobat X Pro, the Document Comparison feature is available by choosing View>Compare Documents.

Recently, I received an email from a law firm that was having trouble comparing two PDFs. The firm reported that Acrobat couldn’t find any differences between the documents, even though differences were plainly visible when the pages were inspected side by side.

I was able to examine the documents and discovered that:

  1. Both source documents were scans.
  2. One document was scanned in black and white; the other was scanned in greyscale.
  3. Both documents had been OCR’d.

The difference in color space (black and white vs. greyscale) was enough to seriously affect Acrobat’s ability to detect changes. In effect, Acrobat saw these as being two completely different documents.

That’s probably a bug (I reported it), but we all still have a job to do. Fortunately, I was able to come up with a workaround.


Control the Scanning

When you can, make sure that scanned documents which need to be compared are scanned using identical settings.
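
If you keep a record of how each document was scanned, a quick check can flag mismatched settings before you run a comparison. The following is a minimal sketch only: the setting names (`color_mode`, `dpi`) and their values are hypothetical and do not come from Acrobat or any scanner software.

```python
# Hypothetical scan settings; the field names are illustrative only
# and are not read from Acrobat or from any scanner API.
def settings_mismatches(a, b):
    """Return {setting: (value_a, value_b)} for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


scan_a = {"color_mode": "black_and_white", "dpi": 300}
scan_b = {"color_mode": "greyscale", "dpi": 300}

mismatches = settings_mismatches(scan_a, scan_b)
for name, (left, right) in mismatches.items():
    print(f"{name}: {left} vs {right}")
```

With the two example records above, only the color mode is reported, which is the same kind of mismatch that tripped up the comparison in the law firm's documents.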

That isn’t always possible if you receive documents from the other side or via the court. What to do?

ClearScan to the Rescue

Acrobat offers different "flavors" of OCR. The type used most frequently by legal professionals is Searchable Image, sometimes referred to as "image+text". When Searchable Image OCR is performed on a scanned document, the original scanned image is left in place in the PDF and an invisible layer of searchable text is added.

Acrobat 9 introduced a new OCR option called ClearScan, designed to enhance the quality of the document post-scan.

ClearScan works by turning the images which represent text characters on the page into smoothed vector outlines. Each character on the page is compared and all matching characters are replaced with an outline character:

[Figure: a 300 dpi scan shown at 800% view in Acrobat, Original vs. ClearScan]

You can read more about ClearScan in my article "Better PDF OCR: ClearScan is smaller, looks better."
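
The matching idea described above can be pictured with a toy sketch. This is not Adobe's algorithm, which is not documented here. It only illustrates the general principle of storing each distinct character shape once and reusing it. The tiny "bitmaps" and the exact-equality test are assumptions made for the example; a real matcher would presumably have to tolerate scanning noise.

```python
# Toy illustration only: glyphs are tiny bitmaps written as tuples of strings.
# Exact equality is used for matching, which is an assumption for this sketch.
def build_glyph_table(glyphs):
    """Map each distinct bitmap to one shared ID; return (table, page_as_ids)."""
    table = {}     # bitmap -> shared glyph ID
    page_ids = []  # the page rewritten as a sequence of glyph IDs
    for bitmap in glyphs:
        if bitmap not in table:
            table[bitmap] = len(table)
        page_ids.append(table[bitmap])
    return table, page_ids


GLYPH_I = ("#", "#", "#")      # hypothetical bitmap for "I"
GLYPH_L = ("#.", "#.", "##")   # hypothetical bitmap for "L"

page = [GLYPH_L, GLYPH_I, GLYPH_L, GLYPH_L]
table, page_ids = build_glyph_table(page)
print(len(table), page_ids)  # 2 [0, 1, 0, 0]
```

Four characters on the page collapse to two stored shapes, which hints at why a ClearScan file can be smaller than one that keeps the full scanned image.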

Re-OCRing Using ClearScan

I found that after running ClearScan OCR on the documents supplied to me by the customer, Acrobat could find the differences in the text.
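
To see what a text-level comparison looks for once both documents have a usable text layer, here is a minimal sketch using Python's standard `difflib` module. It is not how Acrobat's comparison engine works internally; the sample sentences and the `changed_words` helper are made up for illustration.

```python
import difflib


def changed_words(old_text, new_text):
    """Return (operation, old_words, new_words) for each differing run of words."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, old_words[i1:i2], new_words[j1:j2]))
    return changes


# Hypothetical text extracted from two OCR'd versions of the same page.
old = "The tenant shall pay rent on the first day of each month"
new = "The tenant shall pay rent on the fifth day of each month"
print(changed_words(old, new))  # [('replace', ['first'], ['fifth'])]
```

A comparison like this is only as good as the recognized text: any character the OCR misreads will also show up as a "change".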

Since ClearScan is not the default type of OCR that Acrobat uses, many folks never discover it on their own.

Here’s how to OCR using ClearScan in Acrobat X:

  1. Open the Tools Panel
  2. Go to the Recognize Text section and choose In this File
  3. The Recognize Text window appears. Click the Edit button.
  4. In the Settings window, choose ClearScan from the PDF Output Style menu, then click OK.
  5. Click OK again to OCR your document using ClearScan.

Caveats

Like some other choices in Acrobat, making the change above (Step 4) is a "sticky" setting. If you don’t want to use ClearScan the next time you OCR, you’ll need to remember to change it back. If you think you might occasionally want to use ClearScan, you might consider creating a quick Action which includes the ClearScan option.

ClearScan OCR has many benefits over traditional OCR (smaller file size, faster printing), but I do not recommend using it in all legal workflows.

Because ClearScan changes the original scanned image, replacing it with a cleaner vector representation, using it could call the validity of your documents into question.

Finally, while you can re-OCR an existing OCR’d PDF using ClearScan, you cannot revert to the Searchable Image flavor later. For that reason, you might wish to duplicate the documents and work on a copy.

10 Responses to Comparing Scanned Documents Tips and Workarounds

  1. Mary Patten says:

    Hi Rick – still having many issues with the Compare Functionality using scanned PDFs. I tried your OCR setting above with no change. Having same issue similar to above where documents are identical except for textual change. But when scanned and compared, Adobe Pro 9 doesn’t see any differences at all. There are times when it compares two docs 3 pages long and sees many differences – when actually it’s converting the number 1 to a letter ‘i’ or changing WLN.doc to WL:-<.doc. Very frustrating! I've posted to the community with a screen capture but no help yet. Thanks for your time.

    • Rick Borstein says:

      OCR is imperfect and even subtle changes might be flagged as differences. This is especially a problem when scans are generations of the same original document.

  2. Troy Nutall says:

    I’m comparing PDF documents and it’s flagging too much that isn’t actually different, such as footers that are the same. Is there a way to give it exceptions as to what not to compare?

    • Rick Borstein says:

      Yes, you can tell it to ignore headers and footers and (in Acrobat X and XI) only do a text comparison.

  3. Raj says:

    Hi Rick,
    I’ve started OCRing my legal docs using ClearScan, but when OCRing a page with a graphic on it (usually colour), the ClearScan version prints with a grey box in the background of the page. Do you know what this is and how I can get rid of it? Curiously, it only appears when I’m printing to my colour laser printer. If I print to a bog standard inkjet, I don’t get the box. Weird!

    I’m trying to convince my IT department to switch over to ClearScan for all OCRing – the output file size is way smaller and the on-screen quality is way better, but this grey background thing is a bit of a deal breaker at the mo.

    Any help would be appreciated.

    Raj

    • Rick Borstein says:

      That sounds like an artifact of your print driver. You might want to ask your IT folks to update your print driver.

  4. MC says:

    I have a spreadsheet of data I need to OCR. The data is of reasonable quality, from a scanned hardcopy. I have tried repeatedly to “Recognize text using OCR”. Acrobat 9 seems to complete the process, but no selectable text appears. What am I doing wrong?

  5. Scott Schrum says:

    Rick, I’m using Acrobat Pro X to compare two documents. Unlike earlier versions of Acrobat, the comparison I get doesn’t show deleted text as strikethrough, or inserted text as underline. Instead, it shows some highlighting with a comment note. I’ve checked all the settings I can find, but I can’t figure out how to change this behavior. Is there some setting I’m missing? Is this a known problem with Acrobat Pro X? It makes the output useless.
    Thanks,
    Scott

    • Rick Borstein says:

      That changed in Acrobat 9 over five years ago. The comparison engine changed, and you can no longer mark deletions and insertions with strikethrough and underline as in previous versions. You can change the colors, though. Try creating a Comments Summary after the Compare.