Generating TIFF and Text files from PDF for Concordance and Summation

Adobe is the custodian for both PDF and TIFF (Tagged Image File Format) formats.

While PDF is superior in many ways, TIFF remains a popular format for use in large case litigation support systems such as Concordance and Summation.

If your production contains a lot of PDFs, working with these systems can be a challenge: they do not robustly support PDF, so conversion is necessary. These systems want to ingest a . . .

  • TIFF file to represent each individual document page
  • TEXT file of the text of each page

Processing several hundred documents to individual TEXT and TIFF files is a candidate for some serious automation!

Fortunately, repetitive tasks like this can be easily accomplished using Acrobat Professional. Since Acrobat can be automated using JavaScript, it is possible to string together several steps and save a lot of time.

In this article, I’ve included a downloadable Tiff-Text Processing batch script that handles all of this conversion automatically. Here are the results:

Sample Results

What does the script do exactly?

The TIFF-TEXT Processing script performs the following steps:

  1. Tags the file for accessibility and text reflow.
    This should make the text files easier to review in litigation support products.
  2. Splits the PDF into individual PDFs by page
  3. Exports the individual PDFs as TIFF
  4. Exports the individual PDFs as Text files
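
For the curious, steps 2 through 4 can be sketched in a few lines of Acrobat JavaScript. This is not the code inside the downloadable sequence, just a minimal illustration of the idea. The function names and the _0001 naming scheme are my own assumptions, and the tagging step is left out. The calls it relies on (extractPages, saveAs with a conversion ID, closeDoc) are part of the Acrobat JavaScript API and only work in a privileged context such as a batch sequence.

```javascript
// Sketch only: NOT the code in the downloadable sequence file.
// Conversion IDs accepted by Doc.saveAs in Acrobat.
var TIFF_CONV_ID = "com.adobe.acrobat.tiff";
var TEXT_CONV_ID = "com.adobe.acrobat.plain-text";

// Hypothetical naming scheme: "Exhibit A.pdf", page index 0 -> "Exhibit A_0001"
function pageBaseName(docName, pageIndex) {
    var base = docName.replace(/\.pdf$/i, "");
    var num = String(pageIndex + 1);
    while (num.length < 4) { num = "0" + num; }
    return base + "_" + num;
}

// doc: an Acrobat Doc object (in a batch sequence, pass `this`).
// destPath: device-independent folder path ending in "/", e.g. "/c/dest/".
function splitAndExport(doc, destPath) {
    var written = [];
    for (var p = 0; p < doc.numPages; p++) {
        var base = destPath + pageBaseName(doc.documentFileName, p);
        // With no cPath argument, extractPages returns the new one-page Doc.
        var page = doc.extractPages({ nStart: p, nEnd: p });
        page.saveAs({ cPath: base + ".pdf" });
        // Acrobat may adjust the names of exported image files.
        page.saveAs({ cPath: base + ".tif", cConvID: TIFF_CONV_ID });
        page.saveAs({ cPath: base + ".txt", cConvID: TEXT_CONV_ID });
        page.closeDoc(true); // true = close without prompting to save
        written.push(base);
    }
    return written;
}
```

Inside a batch sequence's Execute JavaScript command, this would be called as splitAndExport(this, "/c/dest/").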

    Have you OCRd your files first?

    Acrobat can’t export text if the file hasn’t been OCRd first. Check out this article on Batch OCR.

     

What’s Covered . . .

  • Installing the script
  • Setting related Acrobat Preferences
  • Tweaking the Script
  • Running the Script
  • Troubleshooting

Download the Sequence File

Below is a PDF file containing the sequence file. Select the file in the Attachments panel of the PDF and click the Save button to extract it.

Installation_instructions (52K)

Installing the Sequence File for Acrobat 8

The instructions below have been tested with Acrobat 8.

Caution
Use of the TIFF-TEXT sequence file is not supported by Adobe Systems Incorporated. The sequence file is made available as-is and without warranty. Use at your own risk!
Use on a copy of your files!

The above obligatory warning aside, it seems to work.

  1. Quit Acrobat if it is open.
  2. Extract the sequence file contained in the Installation_instructions PDF to your desktop or other location you can find easily.
    The sequence file is called Tiff-Text Processor.sequ
  3. Select the Tiff-Text Processor.sequ file, right-click and choose Copy to place the file on the clipboard
  4. Paste the file into the following location:

    WINDOWS
    C:\Documents and Settings\<username>\Application Data\Adobe\Acrobat\8.0\Sequences

    MAC OSX PPC
    /Users/<username>/Library/Acrobat User Data/8.0_ppc/Sequences

    MAC OSX INTEL
    /Users/<username>/Library/Acrobat User Data/8.0_x86/Sequences

  5. Restart Acrobat

Note
If you or your IT administrator has customized your installation of Acrobat, you may not be able to find the correct folder at the location noted above. Consult your IT department or use the Search function to find the correct folder.

 

Can’t see Files on Windows?

  1. Go to the Control Panel
  2. Choose Folder Options
  3. Click on the View tab
  4. Find Hidden Files and Folders in the list and double click to open it
  5. Enable "Show hidden files and folders"

Set TIFF Conversion Preferences

The majority of case documents can be represented well as B&W TIFFs at 300 dpi resolution. Acrobat’s default preference, however, is to make an intelligent conversion of the document, which could result in the creation of grayscale or color TIFFs . . . these can be really large!
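
To put rough numbers on "really large", here is a quick back-of-the-envelope calculation of the uncompressed bitmap size of one letter-size page at 300 dpi. The function name is mine, and real TIFFs are normally compressed, so actual files will be smaller. Treat the results as a comparison of relative sizes only.

```javascript
// Uncompressed bitmap size of one page, in bytes.
// Each pixel row is padded to a whole number of bytes.
function rawPageBytes(widthInches, heightInches, dpi, bitsPerPixel) {
    var widthPx = Math.round(widthInches * dpi);
    var heightPx = Math.round(heightInches * dpi);
    var bytesPerRow = Math.ceil(widthPx * bitsPerPixel / 8);
    return bytesPerRow * heightPx;
}

// Letter page (8.5 x 11 in) at 300 dpi = 2550 x 3300 pixels:
// rawPageBytes(8.5, 11, 300, 1)  -> 1052700   (monochrome, about 1 MB)
// rawPageBytes(8.5, 11, 300, 8)  -> 8415000   (grayscale, about 8.4 MB)
// rawPageBytes(8.5, 11, 300, 24) -> 25245000  (color, about 25 MB)
```

Before compression, a color page carries roughly 24 times the data of a monochrome one, which adds up quickly across several hundred documents.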

Let’s make some changes:

  1. Choose Edit—>Preferences . . .
    (Acrobat—>Preferences . . . on the Mac)
  2. In the Preferences window
    A) Choose Convert from PDF
    B) Choose TIFF
    C) Click the Edit Settings button

    Convert to Preferences

  3. Make the following changes in the Settings Window:
    A) Change Colorspace to Monochrome
    B) Change Resolution to 300 pixels/inch
    Click OK

    Save as TIFF Settings

  4. Set Batch Conversion Preferences.
    A) Click on the Batch Processing category (far left)
    B) Enable "Save warnings and errors in log file"
    Click OK

    Batch Processing Prefs

    Sometimes a file may not convert properly. You can view a log file created by Acrobat to help with troubleshooting.

Destination Locations

One thing you should know about the script: the destination folder is hard-wired.

If you run the script sample as-is, it will prompt you to find the PDFs to process and then write individual PDFs, TIFFs and Text files into a folder at C:\dest.

Windows:
At the very least, you will need to create the "dest" folder at the root level of your C drive to use the script.

Mac:
On my Mac, I was surprised that Acrobat actually created a folder at /C/dest.

Still, you will probably want more control over where the files go. See below.

Changing the Destination Location

To change the place where files will be written:

  1. Advanced—>Document Processing—>Batch Processing . . .
    A) Scroll down to find the Tiff-Text Processor sequence
    B) Click Edit Sequence . . .

    Select the TIFF-Text Processor script

  2. Click the Select Commands . . . button in the Edit Batch Sequence window

    Select Commands

  3. In the Edit Sequence window:
    A) Choose Execute JavaScript from the list on the right and
    B) Click the Edit button.

    Edit the JavaScript

  4. In the JavaScript editor window, scroll down to find the line:

    var destPath = "/c/dest/"

    Understanding JS Paths

    What’s in a path?
    /c/dest/ represents the drive letter and path. If you want to put the transformed files into a folder on your desktop, you might change that portion to:

    Windows
    /c/Documents and Settings/USERNAME/Desktop/FOLDERNAME

    Macintosh
    /Users/USERNAME/Desktop/FOLDERNAME
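
If you would rather not hand-translate paths, a small helper along these lines can turn a native Windows path into the slash-separated form shown above. The function name toAcrobatPath is my own invention and is not part of Acrobat or the sequence file. The trailing slash it adds assumes the script appends file names directly to destPath, as the /c/dest/ example suggests.

```javascript
// Hypothetical helper: convert a native path to Acrobat's
// device-independent form, with a trailing slash.
function toAcrobatPath(nativePath) {
    // Backslashes become forward slashes.
    var p = nativePath.replace(/\\/g, "/");
    // A leading drive letter such as "C:" becomes "/c/".
    var m = /^([A-Za-z]):\/?(.*)$/.exec(p);
    if (m) {
        p = "/" + m[1].toLowerCase() + "/" + m[2];
    }
    // Make sure the folder path ends with a slash.
    if (p.charAt(p.length - 1) !== "/") {
        p += "/";
    }
    return p;
}

// toAcrobatPath("C:\\dest")                -> "/c/dest/"
// toAcrobatPath("/Users/USERNAME/Desktop") -> "/Users/USERNAME/Desktop/"
```

Remember that a backslash inside a JavaScript string literal has to be doubled, which is one more reason to stick with forward slashes in the script itself.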

Source File Location

It isn’t always convenient to have to select the file(s) for conversion, especially if your document production spans several nested folders.

To process a folder and all subfolders within:

  1. Advanced—>Document Processing—>Batch Processing . . .
    A) Scroll down to find the Tiff-Text Processor sequence
    B) Click Edit Sequence . . .

    Selecting the script

  2. In the Edit Batch Sequence window, change the following:
    A) Change Run commands on to "Selected Folder"
    B) Click the Browse button and locate your source folder.
    Click OK.

    Setting the location

    Warning!
    Do not change the Output location via the window above or the script will not work. Leave this as "Ask When Sequence is Run".

Running the Sequence

This part is easy!

  1. Advanced—>Document Processing—>Batch Processing . . .
    A) Scroll down to find the Tiff-Text Processor sequence
    B) Click the Run Sequence button

    Selecting the script

  2. Acrobat will prompt you to select files if you did not change the Source File location.
  3. Acrobat will display the Run Confirmation window.
    You can turn this off in Preferences (Batch Processing category).

    Run confirmation dialog

  4. Acrobat will process the files. This could take a while!
  5. Open your destination folder to view the results:
    Results

Troubleshooting

The script is not perfect. Here are the issues I have run into:

  • If no text is created from a file, ensure that it has been OCRd.
  • You might receive messages that files could not be tagged or were already tagged. Generally, you can ignore these.
  • Not all PDF forms can be saved as TIFF. You might need to flatten them first using the PDF Optimizer
  • Corrupted PDFs may cause a crash
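
For the first item, a quick way to spot pages with no text is to count the words Acrobat can see on each page. The sketch below is a suggestion of mine rather than part of the sequence file. The function name is hypothetical, but getPageNumWords and numPages are standard Acrobat JavaScript Doc members.

```javascript
// Returns the 1-based numbers of pages on which Acrobat finds no words,
// which usually means the page is an image that still needs OCR.
function pagesWithoutText(doc) {
    var empty = [];
    for (var p = 0; p < doc.numPages; p++) {
        if (doc.getPageNumWords(p) === 0) {
            empty.push(p + 1);
        }
    }
    return empty;
}

// In the Acrobat JavaScript console, with the PDF open:
//   console.println(pagesWithoutText(this));
```

An empty result means every page has at least some extractable text. A page that is genuinely blank will also be listed, so check before re-running OCR.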

I need to convert Word, Excel, etc. to PDF

You can use Batch Processing to convert any type of file supported by Acrobat to PDF. You would need to conduct this operation as a separate batch sequence before running the TIFF-Text Processor script. Regrettably, Acrobat doesn’t allow you to chain together PDF Creation and secondary processing.

Source File Options

How do I learn more about scripting?

Acrobat Developer Center

AcrobatUsers.Com Javascript Corner

    Thank you, Leonard!

    Thanks to Leonard Rosenthol, Adobe’s Technical Standards Evangelist, for his help in developing this script. I don’t know anybody who knows more about the technical intricacies of PDF than Leonard!
