Legal
The views expressed in this blog are my own and do not necessarily reflect the views of Adobe Systems Incorporated.
November 18, 2005
Understanding "Flavors" of PDF
Most people know that Acrobat files can contain a variety of types of information: text, images, and OCR'd information.
Each of these is a "flavor" of PDF with different capabilities and issues. PDF flavors are behind some oft-heard questions I receive such as:
- Why isn't this PDF searchable?
- Why is this PDF 50K and this one is 10K?
- Why does this PDF print slowly?
- Why does this PDF look funny on screen?
- Why can't I select text from this PDF?
Not all PDFs are created equal. Some PDFs are more usable or offer benefits that other types do not.
I'll examine the different flavors below and make some recommendations.
Why does this matter?
If you choose the wrong flavor of PDF or compression, you may run into the following problems:
- Wasteful storage on local computer and networks by using the wrong type of compression
- Longer print times
- Excessively large PDFs that make it difficult to meet court rules when you eFile
I've met with a great many law firms and have seen some pretty wacky methods of creating PDF. It is not uncommon to see someone print out a Word document and then scan it back in to create a PDF! Ack!
Flavors of PDF
The table below discusses the four basic flavors of PDF:
| | PDF Normal | PDF Image Only | PDF Image+Text | Combination |
|---|---|---|---|---|
| What is it? | Often called an "electronic PDF", this type of PDF has never hit paper and was converted directly from an electronic source. | An image in a PDF wrapper. Could be an image of a page of text or a JPEG, etc. inside a PDF. | An image inside a PDF with an invisible layer of searchable text. | Any of the types at left. |
| Where does it come from? | Produced directly from a software application by "printing" to PDF or using the 1-button PDF creators supplied by Acrobat. | Scanners, Digital Copy Machines, TIFFs converted to PDF. | An image-only file that has been OCR'd using Acrobat Standard or Professional. | Create from Multiple Files in Acrobat allows you to combine any kinds of PDFs together. |
| Is it searchable? | Yes. 100% accurate since no OCR has taken place. | No. Does not contain any searchable text. | Yes. OCR is not a perfect process. Do not expect 100% accuracy. | Depends. If the combined PDFs are searchable, yes. |
| Notes | Prints fastest. Prints at best quality. Smallest file size. | Recommend no more than 300dpi for scanning. A good format to use in discovery when you don't want to give the other side an advantage. | Best way to make paper documents searchable. | Can contain multiple document sizes. |
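For a quick guess at which flavor a file is, the distinctions in the table can be approximated by looking for font and image markers in the raw bytes of the PDF. The Python sketch below is only a rough heuristic and is not how Acrobat identifies anything: the function name `guess_flavor` and the labels it returns are made up for illustration, it will miss fonts that are stored inside compressed object streams, and it cannot tell an Image+Text PDF from a PDF Normal that simply contains pictures.

```python
import re

# Dictionary entries as they appear when a PDF object is stored uncompressed.
# \b keeps "/Font" from matching "/FontDescriptor".
FONT_MARKER = re.compile(rb"/Type\s*/Font\b")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def guess_flavor(pdf_bytes: bytes) -> str:
    """Return a rough guess at the PDF flavor (hypothetical labels)."""
    has_font = FONT_MARKER.search(pdf_bytes) is not None
    has_image = IMAGE_MARKER.search(pdf_bytes) is not None
    if has_image and not has_font:
        return "image only"  # nothing to search or select
    if has_image and has_font:
        return "image+text or normal with pictures"
    if has_font:
        return "normal"
    return "unknown"  # markers may be hidden in compressed objects
```

To try it on a real file, pass the file's bytes, for example `guess_flavor(open("brief.pdf", "rb").read())`, where `brief.pdf` is a placeholder name. Treat an "unknown" or surprising answer as a prompt to open the file in Acrobat and try selecting text.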
PDF Settings Affecting File Size
PDF Normal offers the best performance, smallest file size and best searchability. These fully electronic files contain all the fonts needed for printing. If you have an option to create PDF Normal, always use it!
When creating PDFs from paper, carefully choose your compression and scanning resolution.
There are three common black & white compression algorithms used for scanned images:
| File Size | Compression |
|---|---|
| Larger | CCITT Group 4 |
| | JBIG2 Lossless |
| Smaller | JBIG2 Lossy |
If you choose Create PDF from Scanner in Acrobat, the default compression is JBIG2 Lossless. This offers a great balance between file size and quality.
Other hardware and software products that scan to PDF generally use CCITT Group 4 compression, which produces considerably larger files.
CCITT Group 4 compression was developed as a fax compression technology. The rudimentary processors of fax machines in the early 1980s had just enough power to decompress CCITT Group 4 files. Surprisingly, it is still widely used even though it is an inefficient compression scheme.
While rarely relevant in the legal market, Acrobat is intelligent enough to compress files selectively using Adaptive compression. A color brochure may have black text, a color image and line art, each of which can have a different compression scheme. If you need to scan color brochures and the like (perhaps in an Intellectual Property dispute), choose the Searchable Image-Compact option.
I've conducted several visual tests on JBIG2 Lossless versus Lossy. It is difficult to detect the differences between these two compression schemes on good quality scanned documents. If you have good originals, go ahead and use the Lossy JBIG2.
File Size Comparison
The table below compares the file sizes of a typical 8.5" by 11" legal document for various flavors of PDF:
Single Page Legal Document - 200 DPI

| | PDF Normal | PDF Image Only, 200 dpi | PDF Image Only, 200 dpi | PDF Image Only, 200 dpi | PDF+Text, 200 dpi |
|---|---|---|---|---|---|
| File size | 9.71K | 40.79K | 20.91K | 9.4K | 26.64K |
| Compression and Notes | Fonts Embedded, no tags | CCITT G4 | JBIG2 Lossless | JBIG2 Lossy | JBIG2 Lossy Compression |

Single Page Legal Document - 300 DPI

| | PDF Normal | PDF Image Only, 300 dpi | PDF Image Only, 300 dpi | PDF Image Only, 300 dpi | PDF+Text, 300 dpi |
|---|---|---|---|---|---|
| File size | 9.71K | 53.77K | 31.02K | 10.7K | 34.34K |
| Compression and Notes | Fonts Embedded, no tags | CCITT G4 | JBIG2 Lossless | JBIG2 Lossy | JBIG2 Lossy Compression |
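To make the comparison easier to read, the image-only figures above can be turned into percentage savings. The short Python sketch below uses only the sizes reported in the table; the names `SIZES_K` and `percent_smaller` are made up for illustration.

```python
# Image-only file sizes from the comparison table, in K.
SIZES_K = {
    200: {"CCITT G4": 40.79, "JBIG2 Lossless": 20.91, "JBIG2 Lossy": 9.4},
    300: {"CCITT G4": 53.77, "JBIG2 Lossless": 31.02, "JBIG2 Lossy": 10.7},
}


def percent_smaller(baseline_k: float, candidate_k: float) -> float:
    """How much smaller the candidate is than the baseline, in percent."""
    return (baseline_k - candidate_k) / baseline_k * 100


for dpi, sizes in SIZES_K.items():
    baseline = sizes["CCITT G4"]
    for name, size in sizes.items():
        if name == "CCITT G4":
            continue  # the baseline is not compared with itself
        saving = percent_smaller(baseline, size)
        print(f"{dpi} dpi, {name}: {saving:.1f}% smaller than CCITT G4")
```

On this one sample page, JBIG2 Lossless comes out at roughly half the size of CCITT G4 at 200 dpi, and JBIG2 Lossy is smaller again. A single page is a small sample, so expect the percentages to vary with your own documents.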
Testing Protocol
- The PDF Normal file was created by choosing the Adobe PDF print driver. [Note 1]
- The PDF Normal file was opened in Acrobat and saved as either 200 or 300 dpi uncompressed TIFFs.
- PDF Optimizer was used to target three types of compression: CCITT G4, JBIG2 Lossless and JBIG2 Lossy.
- All image and image+text PDFs were created using Acrobat 7 by choosing Recognize Text Using OCR.
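To put the uncompressed TIFFs from the protocol in perspective, the arithmetic below estimates how large a raw black & white page is before any compression. It is a sketch under two assumptions that the post does not state: that the TIFFs were 1-bit (black & white) images, and that "K" in the comparison table means 1024 bytes. The function name `raw_bilevel_bytes` is made up for illustration.

```python
def raw_bilevel_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Bytes needed for an uncompressed 1-bit page image (assumed bit depth)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8  # each row is padded to a whole byte
    return bytes_per_row * height_px


# CCITT G4 sizes from the comparison table, in K (assumed to mean 1024 bytes).
CCITT_G4_K = {200: 40.79, 300: 53.77}

for dpi, size_k in CCITT_G4_K.items():
    raw = raw_bilevel_bytes(8.5, 11, dpi)
    ratio = raw / (size_k * 1024)
    print(f"{dpi} dpi: {raw:,} raw bytes, about {ratio:.0f}:1 with CCITT G4")
```

The raw size grows with the square of the resolution, which is why the choice of scanning resolution and compression both matter so much for scanned pages.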
Recommendations
Here are my tips for making the best choices when working with PDF files:
- Where did that PDF come from? You need to know . . .
  Unless you scan it in yourself using the Create PDF from Scanner option in Acrobat, most likely your PDF file could be made a lot smaller using the PDF Optimizer in Acrobat Professional. Chances are the image-only and image+text PDFs you get from outside your firm use old, inefficient CCITT Group 4 compression.
- Keep Electronic Documents Electronic
  Always convert electronic documents directly to PDF using the 1-button PDF Creators installed by Acrobat into Office applications or using the Adobe PDF print driver. You'll have a considerably smaller file if you do so and searchability is much better.
- Scan at 300dpi, OCR and then Downsample if Necessary
  You'll get more accurate OCR scanning at 300 dpi. Always downsample and compress using the PDF Optimizer in Acrobat Professional after performing OCR. Acrobat Professional can also downsample in batches.
- Try JBIG2 Lossy Compression
  Although the Lossy word is a bit scary, give this compression scheme a try. Documents still look good on-screen and file sizes can be 50% smaller.
Notes
1. Multiple-page PDF Normal files are considerably smaller than multi-page image-only PDFs. Single page PDF Normal files must contain all the fonts necessary to render the page. This information does not need to be duplicated for successive pages.
Comments
Rick: Great Blog! Keep the tips coming. I had a question about a PDF image + text. When the PDF image is OCR'd within Acrobat, is a separate file created? And if so, when the PDF file is copied to, say, a laptop, how do you ensure that the text file travels with it? Finally, I've noticed that when I OCR'd a PDF file, the size of the file actually decreased. How?
Thanks for your help.
Is it possible to edit a scanned PDF file by converting it to a Word document? I have tried several trial pdf-to-Word programs but I think after reading your info, a "Normal PDF" is what they are assuming I want to convert.
--- Rick's Reply--
First, OCR the document. You can then convert a PDF to Word by choosing File--SAVE AS and selecting .DOC from the format list. Note that tags are automatically added. Remember that OCR is an imperfect process, so your mileage, as they say, may vary.
1. PDF image + text is only one file, so you do not have to worry about copying an image and a text file separately. This is one of the advantages of PDF files over traditional image files.
--- Rick's Reply: Exactly!
Microsoft has added a similar feature to the next version of TIFF documents but it does not have the maturity of the PDF standard.
--- Rick's Reply: And, unfortunately, neither Photoshop nor Acrobat can open the files. They have created some other kind of format, but it is not standardized.
2. You can reduce a PDF files size after OCRing a document if you save the file as a PDF Normal document type. A PDF image + text file type is two layers, an image layer, and a searchable text layer. A PDF Normal document is one layer. The OCR software recognizes the text from the PDF image file but then it removes the large image component and saves the recognized text as page text objects.
--- Rick's Reply: Sam is speaking about the "Formatted Text and Graphics" option for OCR. Note that this format is not admissible in court. These files are indeed a lot smaller. A good use for this option is a paper form that you want to repurpose.
To reduce the file size of Image+Text PDFs, use the PDF Optimizer in Acrobat Professional. ----
The drawback to saving image files as PDF Normal file types is that OCR is not 100% accurate, and any translation errors are displayed in the PDF Normal file type.
Also, a PDF Normal file tries to recreate the formatting of the original page. OCR products do not have algorithms to accurately determine the difference between the various fonts and line spaces, so they make a best guess. These guesses can result in poor displays of the original page.
Since Rick hasn't jumped in with an answer to this, I figured I would give it a go. The OCR process actually removes identified characters from the image and converts them to a text representation with corresponding font information. The converted text is actually part of the .PDF file itself, so there isn't any worry about a separate text file.
--- Rick's Note: That is only true for the Formatted Text and Graphics flavor of OCR. I do not recommend this flavor in litigation as it is not admissible in court. Law firms should stick with Image+Text or Image+Text (compact) for litigation work. -----
As to why the size of the file is smaller after using OCR, that is due to the fact that the image data required to display the text is smaller. Since the image data no longer contains the characters that were OCR'd, compression of that image data would result in less space being occupied by the image. Additionally, text data, even with font information, would still take up less room than an image taken of the same text. That is why Rick pointed out the importance of using the OCR option and then downsampling to reduce the size of the .PDF file.
--- Rick's Reply---
All true, thanks Josh.
How can you adjust the "downsampling" that occurs upon OCRing something? I can't find a setting for it anywhere in Acrobat Pro.
---- Rick's Reply---
Go to Document--Recognize Text Using OCR and click on the Edit button.
I recently attended your Adobe legal seminar in Boston and wanted to congratulate you on making it entertaining. Great job! My question is this: why doesn't PageMaker have a one button PDF creator? Or does it and I just don't know it yet?
-------- Rick's Reply----
PageMaker versions back to 6.0 have PDF Export. It's been ages since I've used it. Adobe bundled in a version of Distiller which did the conversion. These days, almost everyone has moved to Adobe InDesign which has the fastest, most amazing PDF export. Seriously, I smile every time I use it!
When I open a PDF in Acrobat 7 Pro and direct it to Reduce file size, it always crashes Windows. I have HotDocs installed and wonder if there is a problem having both HotDocs and Acrobat Pro installed on the same computer.
-- Rick's Reply --
I'd suggest re-installing Acrobat and updating to the very latest dot release. Make sure you are not performing this function over a network.