You just received 1,000 PDFs from the other side: a mix of files created from Office applications and scans. Some of the documents might have been OCR'd and some not.
How can you quickly detect which files still need to be OCR'd?
Further, how can you pull out and separate searchable and non-searchable PDFs?
I have written on this subject previously in my article “Is that PDF Searchable?” That post included information on how to test if individual documents are searchable and offered a basic way to detect searchability across files.
Why is detecting searchability hard?
When would you call a PDF searchable? When one word is searchable? When 100 words are searchable? When a page is searchable? When all the pages are searchable? What about pictures, or text inside pictures?
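One way to make those questions concrete is to turn them into explicit thresholds. The sketch below is only an illustration and not how Acrobat's Preflight decides anything: the function name `is_searchable` and both thresholds are hypothetical, and it assumes you already have the text of each page from whatever extraction tool you use.

```python
from typing import Sequence


def is_searchable(page_texts: Sequence[str],
                  min_chars_per_page: int = 20,
                  min_page_fraction: float = 1.0) -> bool:
    """Decide whether a document counts as searchable.

    page_texts: extracted text of each page, in order (assumed input).
    min_chars_per_page: non-whitespace characters a page needs to count
        as having text (hypothetical threshold).
    min_page_fraction: share of pages that must have text; 1.0 means
        every page, 0.5 means at least half (hypothetical threshold).
    """
    if not page_texts:
        # A document with no pages has nothing to search.
        return False
    pages_with_text = sum(
        1 for text in page_texts
        # Ignore whitespace so blank-looking pages do not count as text.
        if len("".join(text.split())) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_page_fraction
```

With the defaults, a single scanned page without OCR text makes the whole document non-searchable. Lowering `min_page_fraction` relaxes that rule. Text inside pictures is not counted at all unless the extraction tool reports it.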
I’ve been doing some research, and in this article I offer another way to check for searchable text.
To accomplish this, we will use the Preflight feature of Acrobat Pro. Acrobat’s Preflight feature offers hundreds of different tests including the ability to check for characters on the page. Preflight can be used on a single document or it can be automated using a batch sequence.
The following workflow isn’t perfect, but I offer it here to legal professionals who want to experiment with it.
In this article, you’ll learn how to create a Batch Sequence that runs across folders of files and will:
- Separate searchable PDFs from non-searchable PDFs and place them in named folders
- Ignore non-PDF documents
- Create a Summary Report of searchability
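For readers who think in code, the three goals above can be sketched as a short script. This is not the Acrobat Batch Sequence itself, only a rough illustration of the same logic under stated assumptions: the function `sort_pdfs`, the folder names `Searchable` and `Non-Searchable`, and the report name `summary.csv` are all hypothetical, and the actual searchability check is passed in as `has_text`, a stand-in for whatever test you trust.

```python
import csv
import shutil
from pathlib import Path
from typing import Callable


def sort_pdfs(source: Path, dest: Path,
              has_text: Callable[[Path], bool]) -> Path:
    """Copy PDFs from source into named folders under dest and write a report.

    has_text is a caller-supplied check (hypothetical placeholder) that
    returns True when a PDF contains searchable text.
    dest should sit outside source so copies are not scanned again.
    """
    dest.mkdir(parents=True, exist_ok=True)
    report_path = dest / "summary.csv"
    with report_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "searchable"])
        for path in sorted(source.rglob("*")):
            # Ignore folders and anything that is not a PDF.
            if not path.is_file() or path.suffix.lower() != ".pdf":
                continue
            searchable = has_text(path)
            folder = dest / ("Searchable" if searchable else "Non-Searchable")
            folder.mkdir(parents=True, exist_ok=True)
            # Copy rather than move, so the original production stays intact.
            # Files with the same name in different subfolders would collide.
            shutil.copy2(path, folder / path.name)
            writer.writerow([str(path.relative_to(source)),
                             "yes" if searchable else "no"])
    return report_path
```

The script copies instead of moving so the files as received are never altered. It does not guard against duplicate file names across subfolders, so a real run would need a renaming rule.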