Recently I was involved in a discussion around protecting the contents of a form. The request was to be able to protect the form logic and parts of the form data from tools that could extract this information from the PDF. There are various reasons behind this request, including:
- Some of the form data is "office use only". The data is not bound to any form fields, but it is used in some of the form logic/calculations. This data could be sensitive.
- Script embedded in the form represents intellectual property that the form author does not want to share.
- Having access to the form definition makes it easy to create a spoofed version of the form for a phishing attack.
I have sympathy for all these reasons. But the bottom line is that there are many different methods and tools you can use to extract the contents of a PDF. I’ve distributed samples on this blog to do just that. The latest example is the lint checking tool that extracts all the script from your form and highlights coding issues.
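To give a sense of how little effort extraction takes, here is a minimal Python sketch. It assumes the form definition has already been saved as XDP text; the sample form, the field name and the helper names (`extract_scripts`, `local_name`) are hypothetical and are not taken from any of the tools mentioned above. Tags are matched by local name only, so the template namespace version should not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed form definition; a real XDP carries far more.
SAMPLE_XDP = """<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">
  <template xmlns="http://www.xfa.org/schema/xfa-template/2.8/">
    <subform name="form1">
      <field name="Total">
        <calculate>
          <script contentType="application/x-javascript">Price.rawValue * Rate.rawValue;</script>
        </calculate>
      </field>
    </subform>
  </template>
</xdp:xdp>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_scripts(xdp_text):
    """Return (owner name, script text) for every <script> element found."""
    root = ET.fromstring(xdp_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = []
    for element in root.iter():
        if local_name(element.tag) != "script":
            continue
        # Walk upwards to the nearest ancestor that carries a name attribute.
        owner = parent_of.get(element)
        while owner is not None and "name" not in owner.attrib:
            owner = parent_of.get(owner)
        owner_name = owner.get("name") if owner is not None else "(unnamed)"
        found.append((owner_name, (element.text or "").strip()))
    return found


for owner, script in extract_scripts(SAMPLE_XDP):
    print(owner + ": " + script)
```

A couple of dozen lines of standard-library code is enough to list every script in the sample, which is why hiding the form definition is so hard.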
There are lots of great workflows that are enabled because the contents of the PDF are shareable. It’s one of the strengths of the solution. But on the other hand, we don’t always want to share everything. And we want to offer protection to our clients. There are some strategies to help you along:
You can encrypt the contents of your form. Users then need a digital ID or password to open the file. The contents of the PDF, including the form definition and the data, are hidden from anyone who does not have the necessary credentials. However, encryption doesn’t satisfy all needs:
- Those who have credentials have full access to the form definition and data. It’s not possible to hide selected parts of the form contents.
- It is not viable to require credentials to access forms distributed to the general public.
You can specify a password on your PDF to restrict editing of the form. This prevents users from bringing the form into Designer and modifying it. However, it does not prevent them from using various tools to extract the contents of the form and make an editable copy. The edit password sends a message to your friendly users to ‘keep their hands off’, but it is not a deterrent to a hacker.
Certification is the best defence against a phishing attack. Your form definition can always be forged. Even if we prevent access to the form definition, the attacker can always imitate the appearance and behaviour of your form from scratch. But while someone might be able to make a copy of your form, the attacker cannot spoof certification. Train your users to download their forms from your website and train them to expect valid certification on these forms. With certification they can be certain that the form they’re using has been authored by you.
We worry about the safety of user data during submission. To protect user data, make sure the connection you provide for submission is secure. Use HTTPS for submissions and the data will be encrypted during transmission.
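One way to keep yourself honest here is to check the form definition for submit targets that are not HTTPS before the form ships. The Python sketch below is only an illustration: the sample template, the URLs and the function name `insecure_submit_targets` are hypothetical, and it assumes submit buttons are expressed as `<submit target="..."/>` elements in the template.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical template fragment with one secure and one insecure submit.
SAMPLE_TEMPLATE = """<template xmlns="http://www.xfa.org/schema/xfa-template/2.8/">
  <subform name="form1">
    <field name="SubmitSecure">
      <event activity="click">
        <submit format="xml" target="https://forms.example.com/submit"/>
      </event>
    </field>
    <field name="SubmitPlain">
      <event activity="click">
        <submit format="xml" target="http://forms.example.com/submit"/>
      </event>
    </field>
  </subform>
</template>"""


def insecure_submit_targets(template_text):
    """Return every submit target whose URL scheme is not https."""
    root = ET.fromstring(template_text)
    insecure = []
    for element in root.iter():
        # Match on the local name so the namespace version does not matter.
        if element.tag.rsplit("}", 1)[-1] != "submit":
            continue
        target = element.get("target", "")
        # Note: this also flags mailto: and empty targets, not only http:.
        if urlparse(target).scheme.lower() != "https":
            insecure.append(target)
    return insecure


print(insecure_submit_targets(SAMPLE_TEMPLATE))
```

The check is deliberately strict. Anything that is not explicitly HTTPS is reported, and it is left to the form author to decide whether a flagged target is acceptable.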
Any discussion about secure data is not complete without also mentioning XML Signatures. XML Signatures allow the end-user to sign the data that they submit. Signing data has two primary benefits:
- Establishes the identity of the person who signed the data
- Confirms that the data has not been tampered with en route
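The tamper-detection benefit can be illustrated with a much simpler mechanism. The Python sketch below is not an XML Signature: it swaps in an HMAC over the raw data bytes, with a hypothetical shared key, purely to show how a signature check fails once the data changes. A real XML Signature also involves canonicalizing the XML, and it is the certificate-based form of signing that establishes who signed; a shared key like this one does not.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical shared secret


def sign(data):
    """Return a hex HMAC-SHA256 tag over the exact bytes of the data."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()


def verify(data, signature):
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign(data), signature)


submitted = b"<form1><Amount>100</Amount></form1>"
signature = sign(submitted)

# Simulate someone altering the amount between the client and the server.
tampered = submitted.replace(b"100", b"900")

print(verify(submitted, signature))  # True
print(verify(tampered, signature))   # False
```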
Sensitive form processing can be delegated to the server. When the logic runs on the server, the execution details can be hidden. There are several strategies to accomplish this:
- Mark calculations to run at the server. When using LiveCycle Forms, this round-trips the form data to the host so that the calculation runs in the context of the server.
- Use a web service. Send a SOAP request to the server and get data back.
- Communicate with a server-side process directly using an HTTP submit or the FormCalc Put/Post/Get functions.
Of course, the disadvantages of relying on the server are:
- Parts of your form logic will work only when the user is online.
- Frequent server interaction may limit the scalability of the solution.
- Server-based solutions are expensive.
Script logic can be stored in byte code format — or more specifically — in an embedded SWF file. Today this is an option only for static XFA forms that run on the client. It’s not possible for dynamic forms or for forms that need their logic to run on the server.
It’s pretty hard to hide your data, especially given that Acrobat users can extract it using the menu command: Forms/Manage Form Data/Export Data… But there are some things you can do to prevent easy access:
- Put the sensitive parts of your data somewhere other than <xfa:datasets/>. For example, you can put it in a generic packet under the <xdp/> element. To process it at runtime, load it into an E4X object. Storing it in a different location hides it from the average user who knows how to export data.
- Sensitive data can be disguised. Look at how you tag your data. If your element is called <ManagerSalary/> then the meaning is pretty clear. However, if it’s tagged as <value37/> then more context is needed to figure out what the data holds.
- You can apply basic encoding to the contents of the data. For example, you could take a fragment of your data and store it as a base64-encoded blob. Your form would then need to include logic to decode and process this data. This will not fool the determined hacker, who will figure out that the script to decode the data lives inside the form. But it will hide the data from the casual observer.
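The last two ideas, disguised tags and base64 encoding, can be combined. Here is a minimal Python sketch of the round trip; the function names `obfuscate` and `reveal` are made up for illustration, the tag names are borrowed from the example above, and inside a real form the decoding side would have to be written in the form's own script.

```python
import base64
import xml.etree.ElementTree as ET


def obfuscate(fragment_xml, disguised_tag):
    """Wrap a base64-encoded copy of the fragment in a neutral-looking tag."""
    element = ET.Element(disguised_tag)
    encoded = base64.b64encode(fragment_xml.encode("utf-8"))
    element.text = encoded.decode("ascii")
    return ET.tostring(element, encoding="unicode")


def reveal(obfuscated_xml):
    """Reverse obfuscate(): the work a curious reader would have to do."""
    element = ET.fromstring(obfuscated_xml)
    return base64.b64decode(element.text).decode("utf-8")


sensitive = "<ManagerSalary>90000</ManagerSalary>"
hidden = obfuscate(sensitive, "value37")

print(hidden)          # the tag name and the content no longer give the game away
print(reveal(hidden))  # but two lines of code bring the original back
```

Note that `reveal` needs no key and no secret. That is the difference between encoding and encryption, and it is why this only stops the casual observer.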
Ultimately, obfuscation does not provide complete protection; it is merely a deterrent. We should always assume that obfuscated contents can be reverse-engineered into something meaningful.