In the previous article we gave a quick introduction to PKI (Public Key Infrastructure) technology, because that is what is used for PDF and most other digital signatures. Now we are going to talk more specifically about how that technology is used for digital signatures in PDF.
If I receive a digitally signed PDF file, what should I expect? First, I would like to know that the file is unchanged since it was signed. A central notion is that with essentially all encryption techniques, if the encrypted file is changed it cannot be successfully decrypted. The decryption produces garbage instead of meaningful results. So the person who is digitally signing the document encrypts it with their private key. Then if I can successfully decrypt it with their public key, I know it has been unchanged since they encrypted it.
So we need to know that the file is unchanged, but also since when, and who says so. In theory, at least, I know exactly who the signer is, because the certificate that holds her public key was created by a certificate authority (CA) that I trust, or by one that is trusted by one that I trust, or … . Her signature is also time stamped, so I know when the file was signed. So the “since when” and the “who says so” are handled by the PKI certificates and by the encryption that takes place when the PDF file is signed. The act of signing establishes exactly which version of the document is to remain unchanged.
An interesting side note: there is no way to stop someone from changing a file, but we do have the means to detect if they did.
Message digests (cryptographic hash functions)
Encrypting or decrypting a large file is computationally intensive and can take a noticeable amount of time even on today’s powerful computers. So in cases where we would use encryption as a technique, not to hide the contents, but to assure that a file has not changed, another method has been developed based upon what are called “message digests” (MDs) or “hashes”.
What if we could just number all documents or files that were ever made? Then I could just tell you the document number and you would know which document I meant. Very crudely, this is the idea of a hash or MD. If we truly numbered every possible document, the numbers would be as large as the documents themselves, so nothing would be gained even if we could do this.
The idea of a hash or MD is that the number of real documents is extremely tiny compared to the number of all possible documents. So people have developed algorithms that derive a small, fixed-size number from any given document, and the algorithms are mathematically justified to 1) rarely, if ever, produce the same number for two different documents, 2) not be invertible (it is nearly impossible to generate the document from the number), and 3) make it nearly impossible to construct a second document, different from the first, that has the same number. These statements are full of “rarely” and “nearly” because the guarantees rest on statistically rare events not happening, or on computations so complicated and time consuming that they are impractical today.
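To make the “small sized number” concrete, here is a minimal sketch using SHA-256 from Python’s standard hashlib module. The choice of SHA-256 and the sample documents are my assumptions for illustration; a real signer may use a different algorithm.

```python
import hashlib

# Two sample "documents" that differ by a single character (made-up data).
doc_a = b"Pay Alice 100 dollars."
doc_b = b"Pay Alice 900 dollars."

digest_a = hashlib.sha256(doc_a).digest()
digest_b = hashlib.sha256(doc_b).digest()

# A much larger document still yields a digest of the same size.
digest_big = hashlib.sha256(doc_a * 100_000).digest()

# Count how many bits differ between the two small-document digests.
differing_bits = sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))

print(len(digest_a), len(digest_big))   # digest length in bytes for each
print(digest_a.hex())
print(digest_b.hex())
print(differing_bits, "of", len(digest_a) * 8, "bits differ")
```

The digest has the same length no matter how big the input is, and a one-character change in the document gives a digest that looks unrelated to the first.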
OK, so what is the deal here? It is faster to compute an MD over the document than to encrypt the document, and the resulting digest is limited to, say, 32 bytes of data, so compared to the actual document it is very small. I send you both the document and an MD I computed over its bytes. Then you compute the MD again on your copy of the file, using the exact same algorithm that I used, and you compare your MD to the one I provided. If they match, you are nearly assured that your copy is the same as mine. Again, computing the MD is faster than encryption/decryption, and besides that, the document can remain unencrypted and hence readable without any special computation. That is useful if I am not so concerned about the chance that it might have been altered but just want to view it.
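The exchange just described can be sketched in a few lines. This assumes SHA-256 as the agreed algorithm; the function names compute_md and copy_matches are hypothetical names chosen for this illustration.

```python
import hashlib
import hmac

def compute_md(data: bytes) -> bytes:
    """Compute the message digest (here SHA-256, 32 bytes) over the document bytes."""
    return hashlib.sha256(data).digest()

def copy_matches(received_doc: bytes, received_md: bytes) -> bool:
    """Recompute the MD on the received copy and compare it with the MD that came along."""
    return hmac.compare_digest(compute_md(received_doc), received_md)

# Sender side: send the document together with its MD.
document = b"The quarterly report, final version."
md = compute_md(document)

# Receiver side: an intact copy checks out, an altered copy does not.
intact_ok = copy_matches(document, md)
altered_ok = copy_matches(document + b" (edited)", md)
print(intact_ok, altered_ok)
```

Note that the document itself travels in the clear here; only the comparison of the two digests tells the receiver whether the copy is the same.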
But there is one glitch in what I have said so far. What if someone intercepted the document and its MD in transit, changed the document, recomputed the correct MD to match the changed document, and forwarded both to you? It would check out OK. So in addition to what was said above, we encrypt the MD with the signer’s private key so that it cannot be altered without detection. We use the same document signing technology discussed earlier, but we apply it only to the computed MD and not to the document. We let the MD serve as a document surrogate or “digest” for these purposes. As long as we trust and believe the inventors of these hashing algorithms, the chances of making a mistake are so small that it will not happen to us in our lifetime (or some strong statement like that).
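Here is a small sketch of that glitch, again assuming SHA-256 and made-up sample data. It shows that an unprotected MD offers no defense against someone who can replace both the document and the MD.

```python
import hashlib

def compute_md(data: bytes) -> bytes:
    # The algorithm is public, so anyone can run it, including an interceptor.
    return hashlib.sha256(data).digest()

# What the sender transmits: the document plus an unprotected MD.
original = b"Pay Alice 100 dollars."
sent = (original, compute_md(original))

# The interceptor changes the document and simply recomputes a matching MD.
tampered = b"Pay Mallory 100 dollars."
forwarded = (tampered, compute_md(tampered))

# The receiver's check passes even though the document is not the original.
received_doc, received_md = forwarded
check_passes = compute_md(received_doc) == received_md
print(check_passes, received_doc == original)
```

The check passes because nothing ties the MD to the sender. Encrypting the MD with the signer’s private key is what closes this gap.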
So to summarize what happens when a PDF document is signed: 1) the signing software computes an MD using the bytes of the document file and encrypts it with the signer’s private key and 2) the signer’s public key is made available in a signed certificate (package), the certificate having been issued and signed by a CA which may be authenticated by a chain of CAs via their certificates.
When the signature for a document is checked: 1) the MD of this copy of the document is computed, 2) the signer’s public key is obtained from the certificate, and if needed the CA’s signature on the certificate is checked and the certificate is examined to make sure the identity of the signer is as expected, 3) the encrypted MD sent with the file is decrypted using the signer’s public key from step 2 and compared with the MD computed in step 1. If all this checks out, then the document is an identical copy of the one that was signed and the signer is who they say they are.
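The signing and checking steps can be sketched with a toy “textbook” RSA key in plain Python. Everything here is an assumption for illustration: the key is built from two publicly known Mersenne primes, there is no padding and no certificate handling, and the names sign and verify are hypothetical. It is not secure and is not how real PDF signing software is implemented; real software uses a vetted cryptographic library.

```python
import hashlib

# Toy key pair built from two well-known Mersenne primes. INSECURE, illustration only.
P = 2**127 - 1
Q = 2**521 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent (goes in the certificate with N)
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

def md_as_int(document: bytes) -> int:
    """SHA-256 digest of the document, read as one big integer (always smaller than N here)."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big")

def sign(document: bytes) -> int:
    """Signer: compute the MD and 'encrypt' it with the private key."""
    return pow(md_as_int(document), D, N)

def verify(document: bytes, signature: int) -> bool:
    """Checker: 'decrypt' the signature with the public key and compare with a fresh MD."""
    return pow(signature, E, N) == md_as_int(document)

document = b"Pay Alice 100 dollars."
signature = sign(document)

genuine_ok = verify(document, signature)
tampered_ok = verify(b"Pay Mallory 100 dollars.", signature)
print(genuine_ok, tampered_ok)
```

An interceptor can still recompute the MD of a changed document, but without the private exponent D they cannot produce a signature that decrypts to that new MD, so the check fails.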
One last glitch for the basic story. A PDF document is a single file and we want to keep that property. But we have the MD which is a hash over the whole PDF file which would itself then contain the MD. That is a circular problem that is impossible to solve directly: computing something that is based on the results of the computation. So the complete PDF file with all the signature information is saved onto the disk but with a reserved “hole” where the package containing the MD will be placed. Then the MD is computed over all the bytes on the disk, except for the hole. Then the MD package is written into the hole. Of course, when the signature is checked, the hole is again ignored when the MD is computed. A little funky but it works, and we can have a signed PDF document that maintains its property of being a single self-contained file.
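The “hole” trick can be sketched as follows. The file layout, the HOLE_SIZE constant, and the function names are hypothetical, and the package written into the hole is just the hex-encoded MD, standing in for the real signed package. In an actual PDF the signature dictionary records which byte ranges were covered (its ByteRange entry) and holds the package itself (its Contents entry).

```python
import hashlib

HOLE_SIZE = 64  # hypothetical: room for a 32-byte MD written as 64 hex characters

def build_with_hole(before: bytes, after: bytes):
    """Lay out the whole file with a zero-filled placeholder where the package will go."""
    data = bytearray(before + b"0" * HOLE_SIZE + after)
    return data, len(before)

def digest_outside_hole(data: bytes, hole_start: int, hole_size: int) -> bytes:
    """Compute the MD over every byte of the file except the hole."""
    h = hashlib.sha256()
    h.update(data[:hole_start])
    h.update(data[hole_start + hole_size:])
    return h.digest()

# Signing: save the file with the hole, hash around it, then fill the hole.
file_bytes, hole_start = build_with_hole(b"HEADER document content <", b"> trailer END")
size_before = len(file_bytes)
md = digest_outside_hole(bytes(file_bytes), hole_start, HOLE_SIZE)
file_bytes[hole_start:hole_start + HOLE_SIZE] = md.hex().encode("ascii")

# Checking: ignore the hole again, recompute, and compare with what the hole contains.
stored_md = bytes.fromhex(file_bytes[hole_start:hole_start + HOLE_SIZE].decode("ascii"))
recomputed = digest_outside_hole(bytes(file_bytes), hole_start, HOLE_SIZE)
print(recomputed == stored_md, len(file_bytes) == size_before)
```

Because the hole is reserved at its full size before hashing, filling it in later does not shift any of the bytes that were hashed, which is what lets the signature live inside the very file it covers.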
And, of course, all of this is documented in the public ISO 32000-1 standard.
Next time I will go over some of the things specific to the European Union with respect to electronic and digital signatures.
Jim King (contact: firstname.lastname@example.org)