January 19, 2009

Digital Signatures: the European Union

According to the European Union definitions, PDF (ISO 32000-1) supports Advanced Electronic Signatures (AES) and Qualified Electronic Signatures (QES). There has been some confusion on this point and I will explain this more in this blog.

Background

The European Union (EU) is going through a very interesting and exciting time of trying to bring together many hitherto independent countries into, well, a European Union. There is particular interest in government-to-constituent electronic communications and in conducting business transactions across the EU electronically. So the EU is pushing the electronic envelope in many ways that, by the nature of the activity, are very "standards" oriented. This is in contrast to, say, the US, where various government agencies and businesses are approaching these things less from a standards view and more from a view of what works and what is available. At least these are my observations, for what they are worth. This is not to say that the US is not interested in standards but just that the EU seems to be nearly consumed by a standards consciousness.

Electronic signatures (e-signatures) play an important role in assuring trustworthy and legally sound communication between governments and those governed and for businesses conducting electronically assisted transactions with other businesses and with customers also on a sound and legal basis.

Directive 1999/93/EC of the European Parliament and of the Council, dated December 13, 1999, on a Community Framework for Electronic Signatures, is commonly referenced as the test(s) that electronic signature technology has to pass to be used in various legally binding manners within the EU. In fact, all EU countries have agreed to accept "Qualified Electronic Signatures" (QES) on a par with plain old ink-on-paper signatures, while other electronic signatures, including "Advanced Electronic Signatures" (AES), "cannot be denied legal effectiveness or admissibility as evidence".

PDF digital signatures can be AES and QES

One reason I bring all this up is that PDF digital signatures can be QES according to the EU definitions, provided that the certificates used are "Qualified Certificates" (QC). PDF digital signatures can also be AES.

A lot of this has to do with establishing a hierarchy of trusted "Certification Service Providers" (CSPs). In practice these are Internet servers that deliver certificates that establish an association between people and their public keys. The most important point I want to make is that, since PDF digital signatures are based upon the same PKI (Public Key Infrastructure) standards that these CSPs use, PDF can provide AES and QES. PDF is very suitable for conducting business in the EU, and now that PDF is an international standard (ISO 32000-1), the EU owns PDF just as much as anyone else. Certainly, Adobe no longer owns PDF.

Meeting requirements

Here is a key quote from the 1999/93/EC Directive (Article 2.2):

"advanced electronic signature" means an electronic signature
which meets the following requirements:

(a) it is uniquely linked to the signatory;
(b) it is capable of identifying the signatory;
(c) it is created using means that the signatory can maintain
under his sole control; and
(d) it is linked to the data to which it relates in such a
manner that any subsequent change of the data is
detectable;

All of these properties can be satisfied by a PDF digital signature. The standardized PKI certificates can satisfy (a) through (c), and (d) is satisfied by typical PKI signing technology using message digests. More background about PKI and PDF signatures is provided in my previous blogs. QES adds to these requirements additional ones about the quality of the certificates used and of the CSPs, as spelled out in Annexes I and II of the Directive. PDF can use these qualified certificates and hence can support QES also.

Watch the wording

PDF digital signatures have many optional choices, and exactly which ones are used for any given signature depends upon the software used and in some cases on the signer's choices. For example, which signer certificate is used and who issued it are the signer's choice. I have tried to word my claims carefully by saying "PDF digital signatures can be" QES and AES, since it is possible to choose options that will not satisfy the EU requirements.

ETSI/ESI and Electronic Signatures

The European Telecommunications Standards Institute (ETSI) is recognized by the European Commission as a European Standards Organization. Its Electronic Signatures and Infrastructure Technical Committee (ESI) has established standards as its title suggests, in particular CAdES (CMS Advanced Electronic Signatures: TS 101 733) and XAdES (XML Advanced Electronic Signatures: TS 101 903). These standards were carefully crafted to follow the European Commission Directives and have become relatively synonymous with those directives.

PDF digital signatures and CAdES share the same infrastructure. They both use the Cryptographic Message Syntax (CMS) including particularly PKCS#7. PDF also allows the use of PKCS#1 and other schemes so this is a point where we must say that PDF supports PKCS#7 but not exclusively. We note that the European Directive is not so specific that it spells out the use of these technologies but CAdES and PDF have made these implementation choices.

There are some very particular differences in the exact way that PDF uses PKCS#7 and the way that CAdES uses it. Since some people treat CAdES as synonymous with the European Directive, they conclude that these differences make PDF not comply with the Directive. The fact is that CAdES and the European Directive are not the same thing, and although CAdES is an outstanding standard that follows the Directive, it is not, nor will it be, the only technology that does so. The differences between PDF digital signatures and those of CAdES are very minor, and the European Directive does not give enough technical detail to distinguish between them.

ETSI/ESI and ISO 32000-1

ETSI has recently established a Task Force (TF) within ESI to establish standards common between CAdES, XAdES and PDF digital signatures as specified in ISO 32000-1. This TF is in the process of making sure that these technologies come together to everyone's satisfaction, and it will make special efforts to make sure there is no doubt that they follow the European Directives as they evolve. In particular, it will spell out and standardize which choices for PDF (ISO 32000-1) assure that the signatures are AES or QES. The TF also plans to work with the ISO working group on any changes for the future digital signature technology in PDF (anticipated as ISO 32000-2).

This is good news for both ETSI/ESI and the ISO PDF working group.  It is especially good for users who want to use standard digital signatures.

I think I have one more blog article in me about digital signatures, so stay tuned for a few more details about PDF digital signatures and how they work.

Jim King (contact: jking@adobe.com)

 

January 6, 2009

Happy New Year

Welcome to 2009.  I hope you took some time off during the "holiday" period and refreshed your relationships with friends and family. It is great to have one time of year where this is the objective. I had a very relaxing time away from work and have come back refreshed and surprisingly quite a few pounds lighter as my wife put me on a diet in early November and it is working. I did spend good times with my children and now my two grandchildren (15 month old twins, one boy and one girl).

If you have topics related to PDF that you would like me to publish blog articles about, just send me a note.

Again, happy new year and the best to you in this upcoming, probably financially difficult, year.

Jim King  (contact: jking@adobe.com)

Digital Signatures: PDF

In the previous article we gave a quick introduction to PKI technology (Public Key Infrastructure) because that is what is used for PDF and most other digital signatures. Now we are going to talk more specifically about how that technology is used for digital signatures in PDF.

If I receive a digitally signed PDF file what should I expect?  First I would like to know that the file is unchanged since it was signed. A central notion is that there are encryption techniques, essentially all of them, where if the encrypted file is changed it cannot be successfully decrypted. The decryption produces some garbled garbage instead of meaningful results. So the person who is digitally signing the document encrypts it with their private key. Then if I can successfully decrypt it, I know it has been unchanged since they encrypted it.

So we need to know that the file is unchanged, but also since when and who says so. At least, in theory, I know exactly who that signer is because the certificate that has her public key was created by an authority (CA) that I can trust or one that is trusted by one that I can trust or ... .  Her signature is also time stamped so I know when the file was signed. So the "since when" and "who says so" is handled by the use of PKI certificates and the encryption that takes place when the PDF file is signed. At the time of the signing the document that is to remain unchanged is established.

An interesting side note: there is no way to stop someone from changing a file, but we do have the means to detect if they did.

Message digests (cryptographic hash functions)

Encrypting or decrypting a large file is computationally intensive and can take a noticeable amount of time even on today's powerful computers. So in cases where we would use encryption as a technique, not to hide the contents, but to assure that a file has not changed, another method has been developed based upon what are called "message digests" (MDs) or "hashes".

What if we could just number all documents or files that were ever made? Then I could just tell you the document number and you would know what document I meant. Very crudely, this is the idea of a hash or MD. If we truly numbered all documents, the size of the numbers would be huge, so nothing would be gained even if we could do this.

The idea of a hash or MD is that the number of real documents is extremely tiny compared to the number of all possible documents so people have developed algorithms that will derive a small sized number for any given document and the algorithms are mathematically justified to 1) rarely, if ever, produce the same number for two different documents 2) not be invertible, (nearly impossible to generate the document from the number) and 3) make it nearly impossible to make a second document, different from the first that has the same number. These phrases have rarely's and nearly's because this work is based upon statistically rare events not happening or extremely complicated and time consuming computations that are, today, impractical.

OK, so what is the deal here? It is faster to compute an MD over the document than to encrypt the document, and the resulting digest is limited to, say, 32 bytes of data, so compared to the actual document it is very small. I send you both the document and an MD I computed over its bytes. Then you compute the MD again on your copy of the file, using the exact same algorithm that I used, and you compare your MD to the one I provided. If they match, you are nearly assured that your copy is the same as mine. Again, computing the MD is faster than encryption/decryption, and besides that, the document can remain unencrypted and hence readable without any special computation. That is useful if I am not so concerned about the chance that it might have been altered but just want to view it.
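The compare-two-digests step can be sketched in a few lines of Python. This is only an illustration: SHA-256 is assumed here because it happens to produce a 32-byte digest, and the function name and sample documents are made up for the example.

```python
import hashlib


def message_digest(data: bytes) -> bytes:
    """Return the SHA-256 digest (32 bytes) of the given bytes."""
    return hashlib.sha256(data).digest()


# The sender computes a digest over the document bytes and sends both.
original = b"I agree to pay 100 euros."
sent_digest = message_digest(original)

# The receiver recomputes the digest over the copy that arrived.
received_ok = b"I agree to pay 100 euros."
received_bad = b"I agree to pay 900 euros."

matches_ok = message_digest(received_ok) == sent_digest
matches_bad = message_digest(received_bad) == sent_digest
```

The digest stays 32 bytes long no matter how large the document is, which is what makes it a convenient stand-in for the document.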

But there is one glitch in what I have said so far. What if someone intercepted the document and its MD in transit, changed the document, recomputed the correct MD to match the changed document, and forwarded it to you? It would check out OK. So in addition to what was said above, we encrypt the MD with the signer's private key so that it cannot be altered without detection. We use the same document signing technology discussed earlier but we only apply it to the computed MD and not to the document. We let the MD serve as a document surrogate or "digest" for these purposes. As long as we trust and believe the inventors of these hashing algorithms, the chances of making a mistake are so rare they will not happen to us in our lifetime (or some strong statement like that).

So to summarize what happens when a PDF document is signed: 1) the signing software computes an MD using the bytes of the document file and encrypts it with the signer's private key and 2) the signer's public key is made available in a signed certificate (package), the certificate having been issued and signed by a CA which may be authenticated by a chain of CAs via their certificates. 

When the signature for a document is checked: 1) the MD of this copy of the document is computed, 2) the signer's public key is obtained from the certificate and if needed the certificate is decrypted and examined to make sure the identity of the signer is as expected, 3) the encrypted MD sent with the file is decrypted using the signer's public key from step 2 and compared with the MD computed in step 1.  If all this checks out then the document is an identical copy to the one signed and the signer is who they say they are.
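The sign and check steps above can be sketched with textbook RSA using only the Python standard library. Everything specific here is an assumption made for illustration: the two primes, the exponent 65537, SHA-256 as the digest, and the function names. The key is far too small and there is no padding, so this shows the mechanics only; real signing software relies on a vetted cryptographic library and on certificates to deliver the public key.

```python
import hashlib

# Toy key pair built from two known Mersenne primes. NOT secure.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                      # public modulus
PUBLIC_EXP = 65537             # public key exponent
# Private exponent: modular inverse of the public one (Python 3.8+).
PRIVATE_EXP = pow(PUBLIC_EXP, -1, (P - 1) * (Q - 1))


def digest_as_int(document: bytes) -> int:
    """Compute the MD of the document and reduce it to a number below N."""
    md = hashlib.sha256(document).digest()
    return int.from_bytes(md, "big") % N


def sign(document: bytes) -> int:
    """Signer: 'encrypt' the MD with the private key."""
    return pow(digest_as_int(document), PRIVATE_EXP, N)


def verify(document: bytes, signature: int) -> bool:
    """Checker: 'decrypt' the signed MD with the public key and
    compare it with an MD freshly computed over this copy."""
    return pow(signature, PUBLIC_EXP, N) == digest_as_int(document)


contract = b"Pay the bearer 100 euros."
signature = sign(contract)
```

Calling `verify(contract, signature)` succeeds, while calling it with an altered copy of the document fails, because the freshly computed MD no longer matches the one recovered from the signature.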

One last glitch for the basic story. A PDF document is a single file and we want to keep that property.  But we have the MD which is a hash over the whole PDF file which would itself then contain the MD. That is a circular problem that is impossible to solve directly: computing something that is based on the results of the computation. So the complete PDF file with all the signature information is saved onto the disk but with a reserved "hole" where the package containing the MD will be placed. Then the MD is computed over all the bytes on the disk, except for the hole. Then the MD package is written into the hole. Of course, when the signature is checked, the hole is again ignored when the MD is computed.  A little funky but it works, and we can have a signed PDF document that maintains its property of being a single self-contained file.
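Here is a minimal Python sketch of the "hole" idea, under simplifying assumptions: the hole holds a bare hex-encoded SHA-256 digest rather than the signed package a real PDF carries, and the function names, the 64-byte hole size and the sample file bytes are all hypothetical. In an actual PDF the signature dictionary records which byte ranges were digested (its ByteRange entry), so the checker knows where the hole is.

```python
import hashlib
from typing import Tuple

HOLE_SIZE = 64  # room for a SHA-256 digest written as 64 hex characters


def digest_outside_hole(data: bytes, hole_start: int) -> bytes:
    """Digest every byte of the file except those inside the hole."""
    hasher = hashlib.sha256()
    hasher.update(data[:hole_start])
    hasher.update(data[hole_start + HOLE_SIZE:])
    return hasher.digest()


def sign_file(before: bytes, after: bytes) -> Tuple[bytes, int]:
    """Lay the file out with a reserved hole, digest around it,
    then write the digest into the hole."""
    placeholder = b"0" * HOLE_SIZE
    data = before + placeholder + after
    hole_start = len(before)
    digest_hex = digest_outside_hole(data, hole_start).hex().encode("ascii")
    signed = data[:hole_start] + digest_hex + data[hole_start + HOLE_SIZE:]
    return signed, hole_start


def check_file(signed: bytes, hole_start: int) -> bool:
    """Recompute the digest, again ignoring the hole, and compare."""
    stored = signed[hole_start:hole_start + HOLE_SIZE]
    expected = digest_outside_hole(signed, hole_start).hex().encode("ascii")
    return stored == expected


signed_file, hole_at = sign_file(b"%PDF-1.7 page content ", b" trailer %%EOF")
```

Because the hole has a fixed size and is skipped both when signing and when checking, filling it in does not disturb the digest, and the result is still one self-contained file.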

And, of course, all of this is documented in the public ISO 32000-1 standard.

Next time I will go over some of the things specific to the European Union with respect to electronic and digital signatures.

Jim King (contact: jking@adobe.com)


December 16, 2008

Digital Signatures: basics

This article is on the basics of digital signature technology. To start with we need to get up to speed on what is called PKI technology (Public Key Infrastructure) because that is what is used for PDF and most other digital signatures.

Some simple basics

The PKI is based, in large part, on the clever asymmetric public key cryptography. Keys are relatively small strings of data that are needed to encrypt and decrypt much larger strings of data. For us, those larger strings of data are PDF document files. The clever invention is to provide a user with two keys that are unique to that individual, one being a "private" key that only that person has and the other a "public" key that is openly given to others. The term "asymmetric" derives from the notion that if one key of a pair is used to encrypt something then the other key of that pair is what must be used to decrypt it. If the same key is used to both encrypt and decrypt then it is a "symmetric" key.

So for example, if I use the private key, of my asymmetric pair, to encrypt a PDF file then I can send that file and my public key to anyone and they will be able to decrypt it.  Why would I send a file and the means to decrypt it together? The answer is, if the file can be successfully decrypted with my public key then, with statistically high confidence, you know that I was the one, and the only one, that could have encrypted it because I am the only one in possession of my private key. Furthermore it cannot have been changed since I did the encryption, or the decryption would have failed. Keeping private keys private is, of course, fundamental to this whole business.

In the reverse direction, if you want to send me something that only I can open, you can encrypt it with my public key and given that my paired private key is only known to me, I am the only one that can decrypt it. Neat, huh!
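Both directions can be tried out with a toy key pair in Python. The primes, the exponent and the helper name below are chosen purely for illustration; the numbers are tiny so they can be followed by hand, whereas real keys are hundreds of digits long and are used with padding schemes that this sketch leaves out.

```python
# Toy asymmetric key pair (textbook RSA with tiny primes). NOT secure.
p, q = 61, 53
n = p * q                      # 3233, shared by both keys
phi = (p - 1) * (q - 1)        # 3120
public_exp = 17                # the "public" key
private_exp = pow(public_exp, -1, phi)  # the "private" key (Python 3.8+)


def apply_key(value: int, exponent: int) -> int:
    """Encrypting and decrypting are the same operation with different keys."""
    return pow(value, exponent, n)


message = 65  # any number smaller than n

# Direction 1: I encrypt with my private key; anyone holding my
# public key can decrypt, which shows the message came from me.
sealed_by_me = apply_key(message, private_exp)
opened_by_anyone = apply_key(sealed_by_me, public_exp)

# Direction 2: you encrypt with my public key; only my private
# key can decrypt, so only I can read it.
sealed_for_me = apply_key(message, public_exp)
opened_by_me = apply_key(sealed_for_me, private_exp)
```

In both directions the key that undoes the operation is the other member of the pair, which is the asymmetry described above.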

So, if it were a perfect world, we would all have our pairs of keys, with our private key only held and known by us and our public key available and known to everyone. But two things complicate this picture. How do I associate you, the person, with your public key, a string of bytes on my computer? I could get the key from you directly, but I still have to worry about it being tampered with while in my possession (within my computer). And if you are a stranger to me, how can I associate your public key with you? So now enter the "Infrastructure" part of Public Key Infrastructure.

What the industry has proposed and implemented is a system of automated notaries that run on (Internet) servers, called "Certificate Authorities" (CAs), or in the EU "Certification Service Providers" (CSPs). A means to protect the public keys has also been invented using cryptographic documents called "certificates". This is the "infrastructure" part of PKI. The CAs "sign" the certificate documents containing my public key with their private keys just as I sign documents using my private key. The process is, of necessity, hierarchical, since I can then ask how I know for sure that I am using the proper public key for that particular CA to check its signing. The answer is by using a certificate created by a higher authority CA that securely supplies me the lower level CA's public key. This stops at some root where I check the proper public key by some other, perhaps manual, means.
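The hierarchical walk from a certificate up to a root can be sketched as follows. This shows only the chain-following logic: the names and the issuer table are invented for the example, and the signature check that a real validator performs at each step is deliberately left out.

```python
from typing import Dict, List, Optional, Set

# Hypothetical certificates, reduced to "subject -> issuer".
ISSUED_BY: Dict[str, str] = {
    "Alice": "Regional CA",
    "Regional CA": "Root CA",
    "Root CA": "Root CA",      # a root signs its own certificate
    "Mallory": "Unknown CA",   # issuer we know nothing about
}

# Roots whose public keys I have checked by some other means.
TRUSTED_ROOTS: Set[str] = {"Root CA"}


def chain_to_trusted_root(subject: str,
                          issued_by: Dict[str, str],
                          trusted_roots: Set[str]) -> Optional[List[str]]:
    """Follow issuers upward until a trusted root is reached.
    Returns the chain, or None if it dead-ends or loops."""
    chain = [subject]
    current = subject
    while current not in trusted_roots:
        issuer = issued_by.get(current)
        if issuer is None or issuer in chain:
            return None
        chain.append(issuer)
        current = issuer
    return chain


alice_chain = chain_to_trusted_root("Alice", ISSUED_BY, TRUSTED_ROOTS)
mallory_chain = chain_to_trusted_root("Mallory", ISSUED_BY, TRUSTED_ROOTS)
```

Here `alice_chain` ends at the trusted root, while `mallory_chain` is `None` because the walk never reaches a root that I trust.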

The second complication is that I might want to have more than one identity for different roles that I may need to play, so I can have multiple key pairs and hence multiple certificates. There is also a spectrum of how strongly the CAs check out who you are before giving you a certificate, so I may have more than one for that reason.

This stuff is all in place and operational but user education seems to be a major obstacle. Each person has to obtain one or more certificates from an authority at a suitable level and establish trusted root authorities for checking others' certificates. I also think that there are way too many options and choices, which all fall to the users, who are mostly naive in this technology.

In a subsequent blog I will cover how all this works with PDF files.

Jim King (contact: jking@adobe.com)

 

December 11, 2008

Digital Signatures: a series of articles

When the PDF 1.7 documentation was being reviewed as an ISO Draft last year, one of the most deeply questioned and apparently confusing parts of the standard had to do with the digital signature support defined for PDF. This was especially challenged by European experts on digital signatures. As a consequence of this, I have been learning as much as I can about how digital signatures are handled in PDF, electronic/digital signatures in general, the European Union Directives for Electronic Signatures and the European Telecommunications Standards Institute (ETSI) and its Electronic Signature and Infrastructure Technical Committee (ETSI/ESI). I want to share some of what I have learned with you in this series of blog articles.

Introduction: There is signing and then there is signing

First, you need to understand that the phrase "digital signature" has very special technical meaning. Normally when the word "signature" is used we think of something like this:

I learned early in my adult life, through some painful real estate transactions, that the world's system for conducting business is truly built around honest people who trust each other. You can write contracts that are hundreds of pages long and if someone wants to violate the spirit of the transaction they will still find a way to do it. Our system, based upon these simple ink on paper signatures, is technically very flawed, but it is still the core method used to conduct business worldwide because we rely upon the basic honesty of people.

Someone can change a signed paper document, and if done with skill, the change can be detectable only by an expert. With our widespread availability of high quality copiers it is easy to take a signature off one document and put it onto a different one. Executing criminal activity around paper and ink signatures is rather simple. Not getting caught may be a different story.

We dramatize the idea that only an expert forger can draw my contorted signature just like I do, and we dramatize that there are equally expert analysts that can detect forgeries. The courtroom drama with the expert's testimony is seldom how things end up. Usually a signature is used on some document of record to be primarily a memory aid to remind us when and under what conditions we agreed to something. In fact, recording the act of agreeing via a "signing act" is the legal force for normal signatures. Even the "X" used by illiterates to sign documents is legal because it is the act of making the agreement that is important not the symbol that represents it. The symbol or signature is part of the reminder that you agreed.

The term "electronic signature" is a very general term that includes the idea of putting an electronic image of a signature on an electronic page.  A "digital signature" is much stronger and involves assurances that the document hasn't changed and the signer is who they claim to be. In fact, such digital signatures go way beyond paper and ink signatures with respect to assuring that the document has not been tampered with and that the signature is really a valid signature for that person. We are going to talk primarily about digital signatures.

Going digital

I guess the digital signature is driven to take a stronger path than normal ink signatures because we can make changes to digital documents so much more easily, and the changes can be done in ways that make it impossible to know what was original and what was newly modified. Additionally, when we have digital material we take advantage of the fact that we do not have to be face to face to conduct business. We can send documents world-wide via e-mail or the Internet. So with whom we are really dealing becomes an issue. With normal signatures and signing, if we have a concern that who we are dealing with is who they say they are, we turn to notaries who give stronger confidence that the person is who they say they are. That same approach is used for digital signing by setting up Certificate Authorities (CAs) run by Certification Service Providers (CSPs).

So two things, document integrity and personal identification, are tied together into a system based upon clever digital encryption technology to give us digital signatures.

Much more on that in my next few blog articles. 

There is also more material about electronic and digital signatures to be found on the Adobe Security Blog.  For example:

So what is an electronic signature anyway?

Jim King (contact: jking@adobe.com)

 

 

 

October 14, 2008

A Couple of Things

Two things to write about quickly at this time:

1)  Here is a great treatise by Jon Bosak on why PDF and PDF/A are the correct formats for distribution and archiving of information, especially by government departments. Jon is considered by most as the father of XML. Having someone as knowledgeable and as well respected as Jon speak out about the strength of PDF is wonderful.

2)  Adobe has posted a version of the ISO 32000-1 document on its web site and it is freely available. Adobe has an agreement with ISO that it can post the standard for free provided it isn't the "official" ISO version. So the running headings and footings have been changed and the introductory pages are different, but the technical chapters are identical including page and section numbering.  If you need an official version please pay for it at the ISO site. This is one source of income for this important standards organization.


July 29, 2008

Text Content in PDF Files

This is a blog I promised to write months ago and I have finally bitten the bullet. Extracting text from PDF documents is a rather difficult and highly technical task and I hope to explain, here, why that is the case.

The Role of PDF

First, I want to position PDF properly for this discussion.  A PDF file represents an ordered sequence of fixed pages. The planned appearance of everything that each page contains is completely specified, down to the smallest detail. All the graphics, images, and text are specified to appear at precise spots on the page, in a particular color, of a given and fixed size, etc.

In this way, the creator of the PDF file gets to determine the exact appearance that the viewer will see. Renderers of PDF (like Adobe Reader or Adobe PDF Print Engine) are required to display the page contents exactly as specified by the creator of the PDF file. The Renderer is a slave to the authoring tool. In fact, I like to think of the PDF specification as a contract between the software that creates the PDF file and the renderer that displays it. They must agree on what the positioning, size, color, etc. noted in the PDF file precisely means.

Text in PDF Page Contents

So now let's turn our attention to text content of PDF pages. Text is rendered for display or printing by using specific graphic designs captured in fonts and each individual graphic is called a glyph. The lower case letter "a" in the Latin alphabet may have hundreds of different glyph appearances such as those shown here:


These all serve the purpose of being a lower case Latin a but have distinctly different glyph appearances. Notice that they also have different widths and require different spacing to display them properly.

PDF specifies text content of pages as glyphs not characters. That is, one of the appearances for an "a" is chosen by the creator of the PDF file by choosing a font from which the "a" glyph can be taken. PDF page contents do not specify characters such as just the Latin letter "a".  

The rub comes when we want to work with characters not glyphs. Unicode is widely used because it is a character encoding technology not a glyph encoding one. In fact, for many purposes, such as searching for text strings, we do not want to search by appearance but we want to search by the Latin letters (or commonly by the Unicode encoding of characters).

This issue of how PDF pages represent text is further complicated by three more facts.

1) The representation of PDF page contents is tightly specified so that the creator of the PDF file can determine the exact position on the page of every glyph chosen. But this is usually not done by actually specifying a position for each glyph individually. It is expected and required that the creator of the PDF file have access to the same font as the renderer that processes the PDF file, including the width and spacing information for each glyph supplied in the font file. For a string of text, if we agree to strictly space the glyphs, one from the other, using the spacing information given in the font then all we need to specify in the PDF file is the starting position for the first glyph in the string and the positions of all the rest are determined by that. This saves space and is a common and easy convention to work with.

There is no strict control over font files and so to make sure that the creator of the PDF file and the renderer of the PDF file use the same information, it is wise to embed the font that the creator used, with its important spacing information, in the PDF file itself. When that cannot be done, the width and spacing information for the font glyphs used is placed in the PDF file so that the spacing that the creator desired can be followed even when using a substituted font or one that might be slightly different.

2) Since PDF page contents primarily specify the appearance of what is to be rendered for that page, things like line breaks, line justification and hyphenation are all done by the creator and not within a PDF renderer. As noted earlier, a PDF renderer is a total slave to the representation within the PDF file that the creator put there. This means that if the word "substantial" is broken across two lines so that "sub-" appears at the end of one line and "stantial" at the beginning of the next, then you will never find the string of glyphs for "substantial" in the file but you will find a string for "sub-" and another for "stantial".

3)  Ligatures are glyphs specially designed to enhance the appearance of certain groups of characters when they follow in sequence. For example the two "f"'s and the following "i" in the word "difficult" are considered to look better when they are rendered as one special glyph as in the second rendering shown here:


One single glyph in the right example is used to render "ffi" whereas three are used in the left example. These two words take up a different amount of space horizontally, so if material was to follow the word at a specific distance, its position would have to be adjusted. In other words, if you right justify a line of text that contains ligature glyphs, the right justification will be ruined if those ligature glyphs are replaced by the individual character glyphs and vice versa. The bottom line is that the creator of the file and the renderer of the file must agree on which ligatures will be used and which will not. By using glyph references in PDF this problem is avoided because the file creator can determine which ligatures will be used and where, by specifying the proper glyphs. The renderer in its slave role just places the selected glyphs where it should.
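To make points 1) and 3) concrete, here is a small Python sketch that places a string of glyphs from a starting position using per-glyph advance widths, and then compares the same word set with and without an "ffi" ligature. The glyph names, the width values and the starting position are invented for the example; widths are taken to be in thousandths of the font size, which is the usual convention for simple fonts in PDF.

```python
from typing import Dict, List, Tuple

# Hypothetical advance widths, in thousandths of the font size.
WIDTHS: Dict[str, int] = {
    "d": 500, "i": 250, "f": 300, "ffi": 800,
    "c": 450, "u": 500, "l": 250, "t": 300,
}


def layout(start_x: float, glyphs: List[str],
           font_size: float) -> Tuple[List[Tuple[str, float]], float]:
    """Place each glyph after the previous one using the font's widths.
    Only the starting position is given; the rest follow from it."""
    positions = []
    x = start_x
    for name in glyphs:
        positions.append((name, x))
        x += WIDTHS[name] * font_size / 1000.0
    return positions, x  # x is where following material would start


plain = ["d", "i", "f", "f", "i", "c", "u", "l", "t"]
with_ligature = ["d", "i", "ffi", "c", "u", "l", "t"]

plain_positions, plain_end = layout(72.0, plain, 10)
ligature_positions, ligature_end = layout(72.0, with_ligature, 10)
```

With these made-up widths the ligature version ends half a unit earlier than the plain one, which is exactly the kind of difference that would spoil right justification if the renderer swapped one form for the other.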

Glyph References and Fonts

Different font technologies organize the font glyphs in different ways but basically some identifier is used to select one of the glyphs to use. Several different methods are allowed in PDF in order to support the evolving font technologies but there is one fact that is rather disruptive to being able to extract text from a PDF file. In some cases, all that PDF requires is that the identifier used to select a glyph in the font be used by the file creator. No meaning of it being an "a" is required, just enough information to find the desired glyph in the font is needed. So if we want to extract character text in, say, a Unicode representation, we really do not have enough information. We only know how to render the glyph but not what it represents.

There is an optional auxiliary structure called the "ToUnicode" table that was introduced into PDF to help with this text retrieval problem. A ToUnicode table can be associated with a font that does not normally have a way to determine the relationship between glyphs and Unicode characters (some do). The table maps strings of glyph identifiers into strings of Unicode characters, often just one to one, so that the proper character strings can be made from the glyph references in the file.
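The effect of such a table can be sketched in Python as a simple lookup. The glyph identifiers and the table contents below are invented; in a real PDF file the mapping is stored as a CMap stream attached to the font, not as a dictionary like this one.

```python
from typing import Dict, List

# Hypothetical glyph identifiers mapped to Unicode strings.
# Note the ligature glyph 203 maps to three characters.
TO_UNICODE: Dict[int, str] = {
    17: "d", 42: "i", 203: "ffi", 9: "c", 88: "u", 51: "l", 77: "t",
}


def glyphs_to_text(glyph_ids: List[int], to_unicode: Dict[int, str]) -> str:
    """Turn glyph references into a character string. Unknown glyphs
    become U+FFFD, the Unicode replacement character."""
    return "".join(to_unicode.get(glyph_id, "\ufffd") for glyph_id in glyph_ids)


page_glyphs = [17, 42, 203, 9, 88, 51, 77]
extracted = glyphs_to_text(page_glyphs, TO_UNICODE)
```

Seven glyph references yield the nine-character string "difficult", which is why the mapping has to be from glyph identifiers to strings rather than to single characters.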

The process for extracting text also has to recognize and undo ligatures and remove hyphenation. There are various techniques for this and the best ones rely on additional optional information that can be put into the PDF file called "structure" information. I do not want to go into more details at this point but I think by now you should have a good idea of what it takes to extract text from a PDF and why it is so technical and messy.
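As one very crude illustration of removing hyphenation, the following Python sketch glues a line-ending fragment such as "sub-" onto the start of the next line. It is a heuristic under a loud assumption: every hyphen at the end of a line is treated as a line-break hyphen, so a word that legitimately contains a hyphen at that spot would be joined wrongly. The function name is made up, and the better techniques that rely on structure information are not shown.

```python
from typing import List


def join_hyphenated(lines: List[str]) -> str:
    """Join extracted text lines, undoing end-of-line hyphenation."""
    pieces = []
    carry = ""  # word fragment waiting for its second half
    for line in lines:
        text = carry + line.strip()
        carry = ""
        if text.endswith("-"):
            # Hold back the last word, minus its hyphen, for the next line.
            head, _, tail = text.rpartition(" ")
            if head:
                pieces.append(head)
            carry = tail[:-1]
        else:
            pieces.append(text)
    if carry:
        pieces.append(carry)
    return " ".join(pieces)


rejoined = join_hyphenated(["This is a sub-", "stantial change."])
```

The two strings found in the file, "sub-" and "stantial", come back together as the single word a reader would search for.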

Why Not Just Use Unicode?

So an easy question is: why doesn't PDF use Unicode to represent the text in a PDF file? The simple answer is that Unicode represents characters independent of their appearance and PDF is all about appearance. But in fact, if I choose a font and devise a method for picking font glyphs given a Unicode character, I can almost get away with using Unicode characters and yet rendering glyphs. Why do I say almost?

Well the issue of ligatures makes it messy because ligatures are glyphs but generally not characters. SVG faces this problem and it partially solves it by having escapes from the Unicode into glyph identifiers when the Unicode cannot handle the cases of interest. Suppose we have the word "difficult" which can be rendered with a ligature for "ffi" or not. But the two choices do require different spacing and placement. So the creator of the PDF file has to be able to specify which ligatures are to be used and which not. Currently whether ligatures are available or not and which ones are available vary by font and even instances of the same font. So more standardization needs to be done before this approach will work.

A New Agreement

The design of PDF established a line in the process of specifying and rendering a document where all the appearance decisions are made by the creator of the PDF file. One could imagine another design where that is not the case. For example, one might leave the line justification, hyphenation, etc. to the rendering component. The difficulty with that is that there can be a relatively large variation in the results so that different renderers making small different choices may produce quite different output. It is also awkward to constrain spacing issues to one part of a document without leaving larger than desired white spaces or overwriting material or laying out the document completely anew.  PDF does hold the position that the creator gets to choose the appearance so shifting partial positioning responsibility to the renderer will be difficult to design.

Enough auxiliary information has been defined for PDF files such that a well written program can extract the text content from PDF pages for reading to the blind, searching, or to extract Unicode strings of characters. Adobe has done this and that code is available through the Adobe Developers Connection using the Acrobat PDF SDK.

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

Jim King  (contact me at:  jking@adobe.com)