Understanding “Flavors” of PDF

Most people know that Acrobat files can contain a variety of types of information: text, images, and OCR’d information.

Each of these is a “flavor” of PDF with different capabilities and issues. PDF flavors are behind some oft-heard questions I receive such as:

  • Why isn’t this PDF searchable?
  • Why is this PDF 50K and this one is 10K?
  • Why does this PDF print slowly?
  • Why does this PDF look funny on screen?
  • Why can’t I select text from this PDF?

Not all PDFs are created equal. Some PDFs are more usable or offer benefits that other types do not.

I’ll examine the different flavors below and make some recommendations.

Why does this matter?

If you choose the wrong flavor of PDF or compression, you may run into the following problems:

  • Wasted storage on local computers and networks from using the wrong type of compression
  • Longer print times
  • Excessively large PDFs that make it difficult to meet court rules when you eFile

I’ve met with a great many law firms and have seen some pretty wacky methods of creating PDFs. It is not uncommon to see someone print out a Word document and then scan it back in to create a PDF! Ack!

Flavors of PDF

The breakdown below covers the four basic flavors of PDF:

PDF Normal
  • What is it? Often called an “electronic PDF”, this type of PDF has never hit paper and was converted directly from an electronic source.
  • Where does it come from? Produced directly from a software application by “printing” to PDF or using the 1-button PDF creators supplied by Acrobat.
  • Is it searchable? Yes. 100% accurate since no OCR has taken place.
  • Notes: Prints fastest. Prints at best quality. Smallest file size.

PDF Image Only
  • What is it? An image in a PDF wrapper. Could be an image of a page of text or a JPEG, etc. inside a PDF.
  • Where does it come from? Scanners, digital copy machines, TIFFs converted to PDF.
  • Is it searchable? No. Does not contain any searchable text.
  • Notes: Recommend no more than 300dpi for scanning. A good format to use in discovery when you don’t want to give the other side an advantage.

PDF Image+Text
  • What is it? An image inside a PDF with an invisible layer of searchable text.
  • Where does it come from? An image-only file that has been OCR’d using Acrobat Standard or Professional.
  • Is it searchable? Yes. OCR is not a perfect process. Do not expect 100% accuracy.
  • Notes: Best way to make paper documents searchable.

Combination
  • What is it? Any of the types above.
  • Where does it come from? Create from Multiple Files in Acrobat allows you to combine any kinds of PDFs together.
  • Is it searchable? Depends. If the combined PDFs are searchable, yes.
  • Notes: Can contain multiple document sizes.

PDF Settings Affecting File Size

PDF Normal offers the best performance, smallest file size and best searchability. These fully electronic files contain all the fonts needed for printing. If you have an option to create PDF Normal, always use it!

When creating PDFs from paper, carefully choose your compression and scanning resolution.
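To see why resolution matters before any compression is applied, here is a minimal sketch of the arithmetic. It assumes a letter-size page scanned in 1-bit black & white, with each pixel row padded to a whole byte; the function name is only illustrative.

```python
def raw_bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scanned page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Assumption: each pixel row is padded to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


for dpi in (200, 300):
    size = raw_bitmap_bytes(8.5, 11, dpi)
    print(f"{dpi} dpi: {size:,} bytes uncompressed")
```

Under these assumptions, going from 200 to 300 dpi makes the raw image about 2.25 times larger, because the pixel count grows with the square of the resolution. The compression scheme then decides how much of that ends up in the PDF.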

There are three common black & white compression algorithms used for scanned images:

File size by compression, from larger to smaller:

  • CCITT Group 4 (larger)
  • JBIG2 Lossless
  • JBIG2 Lossy (smaller)

If you choose Create PDF from Scanner in Acrobat, the default compression is JBIG2 Lossless. This offers a great balance between file size and quality.

Other hardware and software products that scan to PDF generally use CCITT Group 4 compression, which produces considerably larger files.

CCITT Group 4 compression was developed as a fax compression technology. The rudimentary processors of fax machines in the early 1980s had just enough power to decompress CCITT Group 4 files. Surprisingly, it is still widely used, but is an inefficient compression scheme.
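If you are curious which scheme a scanned PDF uses, the following is a rough sketch that counts the stream filter names found in the raw file. The filter names (/CCITTFaxDecode, /JBIG2Decode, /DCTDecode) come from the PDF specification; the helper names are hypothetical. Treat it as a crude heuristic: it cannot tell Group 3 from Group 4 fax compression, nor lossless JBIG2 from lossy, and it is not a substitute for a real PDF parser.

```python
import re

# Stream filter names defined by the PDF specification.
FILTERS = {
    "CCITT fax": rb"/CCITTFaxDecode\b",
    "JBIG2": rb"/JBIG2Decode\b",
    "JPEG": rb"/DCTDecode\b",
}


def count_image_filters(pdf_bytes):
    """Count how often each filter name appears in the raw PDF bytes."""
    return {
        label: len(re.findall(pattern, pdf_bytes))
        for label, pattern in FILTERS.items()
    }


def report(path):
    """Print the filter counts for the PDF file at the given path."""
    with open(path, "rb") as handle:
        counts = count_image_filters(handle.read())
    for label, count in counts.items():
        print(f"{label}: {count}")
```

Calling report("scan.pdf") on a file of your own (the file name here is just a placeholder) prints one count per filter. A scanned document showing only CCITT fax hits is a likely candidate for recompression.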

Although it is rarely relevant in the legal market, Acrobat can also compress files selectively using Adaptive compression. A color brochure may have black text, a color image and line art, each of which can have a different compression scheme. If you need to scan color brochures and the like, perhaps in an intellectual property dispute, choose the Searchable Image-Compact option.

I’ve conducted several visual tests on JBIG2 Lossless versus Lossy. It is difficult to detect the differences between these two compression schemes on good quality scanned documents. If you have good originals, go ahead and use the Lossy JBIG2.

File Size Comparison

The table below compares the file sizes of a typical 8.5″ by 11″ legal document for various flavors of PDF:

Single Page Legal Document – 200 DPI

  Flavor                     Compression and Notes      File Size
  PDF Normal                 Fonts embedded, no tags    9.71K
  PDF Image Only, 200 dpi    CCITT G4                   40.79K
  PDF Image Only, 200 dpi    JBIG2 Lossless             20.91K
  PDF Image Only, 200 dpi    JBIG2 Lossy                9.4K
  PDF+Text, 200 dpi          JBIG2 Lossy                26.64K

Single Page Legal Document – 300 DPI

  Flavor                     Compression and Notes      File Size
  PDF Normal                 Fonts embedded, no tags    9.71K
  PDF Image Only, 300 dpi    CCITT G4                   53.77K
  PDF Image Only, 300 dpi    JBIG2 Lossless             31.02K
  PDF Image Only, 300 dpi    JBIG2 Lossy                10.7K
  PDF+Text, 300 dpi          JBIG2 Lossy                34.34K
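To put the image-only numbers in perspective, the short sketch below works out how much smaller each JBIG2 result is than the CCITT G4 file at the same resolution. The sizes are copied from the tables above; the function name and the choice of CCITT G4 as the baseline are my own for illustration.

```python
# Image-only file sizes in KB, copied from the tables above.
sizes_kb = {
    200: {"CCITT G4": 40.79, "JBIG2 Lossless": 20.91, "JBIG2 Lossy": 9.4},
    300: {"CCITT G4": 53.77, "JBIG2 Lossless": 31.02, "JBIG2 Lossy": 10.7},
}


def savings_vs_baseline(sizes, baseline="CCITT G4"):
    """Percent reduction of each entry relative to the baseline entry."""
    base = sizes[baseline]
    return {name: round(100 * (1 - kb / base), 1) for name, kb in sizes.items()}


for dpi, sizes in sizes_kb.items():
    for name, percent in savings_vs_baseline(sizes).items():
        print(f"{dpi} dpi, {name}: {percent}% smaller than CCITT G4")
```

For this one test page, JBIG2 Lossless comes out roughly 40 to 50 percent smaller than CCITT G4, and JBIG2 Lossy roughly 75 to 80 percent smaller. Your own documents will vary.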

Testing Protocol

NOTE: I did these tests back in the Acrobat 7 timeframe. Current versions of Acrobat offer more robust compression (Adaptive Compression in Acrobat X) and generally work better.

  1. The PDF Normal file was created by choosing the Adobe PDF print driver. [Note 1]
  2. The PDF Normal file was opened in Acrobat and saved as either 200 or 300 dpi uncompressed TIFFs.
  3. PDF Optimizer was used to target three types of compression: CCITT G4, JBIG2 Lossless and JBIG2 Lossy.
  4. All image and image+text PDFs were created using Acrobat 7 by choosing Recognize Text Using OCR.

Recommendations

Here are my tips for making the best choices when working with PDF files:

  1. Where did that PDF come from? You need to know . . .
    Unless you scanned it yourself using the Create PDF from Scanner option in Acrobat, your PDF file could most likely be made a lot smaller using the PDF Optimizer in Acrobat Professional. Chances are the image-only and image+text PDFs you get from outside your firm use old, inefficient CCITT Group 4 compression.
  2. Keep Electronic Documents Electronic
    Always convert electronic documents directly to PDF using the 1-button PDF Creators installed by Acrobat into Office applications or using the Adobe PDF print driver. You’ll have a considerably smaller file if you do so and searchability is much better.
  3. Scan at 300dpi, OCR and then Downsample if Necessary
    You’ll get more accurate OCR when you scan at 300 dpi. Always downsample and compress using the PDF Optimizer in Acrobat Professional after performing OCR. Acrobat Professional can also downsample in batches.
  4. Try JBIG2 Lossy Compression
    Although the Lossy word is a bit scary, give this compression scheme a try. Documents still look good on-screen and file sizes can be 50% smaller.

Notes

1. Multi-page PDF Normal files are considerably smaller than multi-page image-only PDFs. A single-page PDF Normal file must contain all the fonts necessary to render the page, but this information does not need to be duplicated for successive pages.

2 Responses to Understanding “Flavors” of PDF

  1. james manning says:

    question: how can one determine the DPI setting used when PDF was created?
    We are trying to read a barcode from PDF files. Sometimes it works and other times it does not, because the PDF files come from different sources.
    Is there a tool which can determine the DPI setting used when the PDF was created?

    • Rick Borstein says:

      You can use Preflight in Acrobat Pro for this. In Acrobat Pro X or XI, go to the Tools panel and click the flyout to show the Print Production section. Then, choose Preflight. In the Preflight window, go to the PDF Analysis section and choose “List Page Objects, Grouped by Type of Object”.
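Once a tool reports an image's pixel dimensions and the size at which it is placed on the page, the effective resolution is simple arithmetic. The sketch below assumes the placed size is given in PDF points (1 point = 1/72 inch); the function name is hypothetical and is not part of Acrobat.

```python
POINTS_PER_INCH = 72  # PDF page units: 1 point = 1/72 inch


def effective_dpi(pixels, placed_points):
    """Effective resolution of an image along one dimension."""
    inches = placed_points / POINTS_PER_INCH
    return pixels / inches


# A 2550-pixel-wide scan placed across a full 8.5 inch (612 point) page.
print(effective_dpi(2550, 612))
```

If the horizontal and vertical results differ, the image was stretched when it was placed, so check both dimensions.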