Digital Editions: May 2009 Archives

May 11, 2009

EPUB generation library written in Java

It's come to my attention that I didn't really give EPUBGen a proper introduction, and so there's been some confusion about the project.

EPUBGen project

EPUBGen is a Java library that demonstrates EPUB generation from a variety of document formats and may be a useful starting point or reference for other EPUB generation needs. In other words, it's an effort to promote the development of a variety of tools and workflows.

EPUBGen has both a set of back-end code generation modules and a set of front-end format importer modules. The back-end modules generate EPUB and illustrate more advanced functionality, including font subset embedding with obfuscation. The front-end modules are experimental and incomplete at this time. Sample Java applications are provided that illustrate the conversion of the respective file types (rtf2epub, word2epub, fb2epub). There's a build script and an available jar file for the 'rtf2epub' project.

This project is being provided to the development community to build on, so contributions and derivative works within the scope of the BSD license are invited.

epub-tools

This project is really part of our efforts to support the community effort around EPUB, and as such is hosted with other tools for generating and working with EPUB documents.

The project is hosted with epub-tools.
You can find the source code via subversion.
For rtf2epub, there is an entry in the downloads.

A quick look at the packages.

Without going into too much detail, here's a list of the top level packages (by which the project is organized) and a short description of each:

com.adobe.epub

This is where you'll find EPUB related functionality, including handling of OPF, OPS, NCX, and CSS generation.

com.adobe.otf

This has most of the font management code (although Font Subsetting is actually found in com.adobe.epub).

com.adobe.conv.rtf2epub, com.adobe.conv.word2epub, com.adobe.fb2

These are the front-end modules. Each has a main entry point and conversion code for reading and parsing the source format.

com.adobe.office.word

Functionality used by both rtf2epub and word2epub.

Font obfuscation code, available for review

I realize this blog might have a diverse readership, so I'll warn those of you who aren't developers that this particular post probably isn't for you. Unless you find reading Java code interesting, you can probably skip this article. I've implemented the font obfuscation algorithm recommended by the IDPF. The code can be found in the EPUBGen project, so you can take it, review it, and use it. In this article, I'm going to highlight just the portions of that project that are related to embedding and obfuscating the font information. (The project also has font subsetting and other features which I will not cover.)

If you'd like to follow along, the code I'm referring to is hosted on epub-tools, and you'll find it in the epubgen section within epub-tools.

As I mentioned, this should match the approach outlined in the informational document from the IDPF, which describes an algorithm for obfuscating a font, how to identify the document key for that algorithm, and how those fonts should be identified.

Obfuscation Algorithm

The code that actually modifies the font is in the "IDPFFontResource.java" file. You'll find it in .../com.adobe.epub/src/com/adobe/epub/opf/IDPFFontResource.java within the project.

The loop is a little different than that of the recommendation document, but the end result is the same. The obfuscation key (see below) is applied to the first 1040 bytes of the font file.

The code that actually does this is in the serialize method:

/* 
 * Implements the Obfuscation Algorithm from
 * http://www.openebook.org/doc_library/informationaldocs/FontManglingSpec.html
 * 
 */
public void serialize(OutputStream out) throws IOException {
	try {
		byte[] buffer = new byte[4096];
		int len;
		InputStream in = source.getInputStream();
		// Number of font bytes obfuscated so far, counted across reads.
		int offset = 0;
		while ((len = in.read(buffer)) > 0) {
			if( mask != null ) {
				// XOR only the first 1040 bytes of the font. A single read may
				// return fewer than 1040 bytes, so never index past len.
				for( int i = 0 ; i < len && offset < 1040 ; i++, offset++ ) {
					buffer[i] = (byte)(buffer[i] ^ mask[offset%mask.length]);
				}
			}
			out.write(buffer, 0, len);
		}
	} catch (IOException e) {
		e.printStackTrace();
	}
	out.close();
}
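To see why the same routine can both obfuscate and restore a font, here is a minimal, self-contained sketch that works on a byte array instead of streams. The class and method names (XorMaskSketch, applyMask) and the sample data are made up for illustration and are not part of EPUBGen. The only things taken from the text above are the 1040-byte limit and the repeating mask.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of EPUBGen.
final class XorMaskSketch {
    // Number of leading bytes that get obfuscated, as described above.
    static final int OBFUSCATED_LENGTH = 1040;

    // Returns a copy of the data with its first 1040 bytes (or fewer, if the
    // data is shorter) XORed against the repeating mask.
    static byte[] applyMask(byte[] font, byte[] mask) {
        byte[] result = Arrays.copyOf(font, font.length);
        int limit = Math.min(OBFUSCATED_LENGTH, result.length);
        for (int i = 0; i < limit; i++) {
            result[i] = (byte) (result[i] ^ mask[i % mask.length]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in data: not a real font and not a real SHA-1 key.
        byte[] font = new byte[2000];
        for (int i = 0; i < font.length; i++) {
            font[i] = (byte) i;
        }
        byte[] mask = new byte[20];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (0xA0 + i);
        }

        byte[] obfuscated = applyMask(font, mask);
        byte[] restored = applyMask(obfuscated, mask);

        System.out.println("changed by obfuscation: " + !Arrays.equals(font, obfuscated));
        System.out.println("restored by second pass: " + Arrays.equals(font, restored));
    }
}
```

Because XOR is its own inverse, a reading system that knows the key can recover the original font by running the same pass again.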

Identifying the Obfuscation Key

Ok, so we need an obfuscation key. The recommendation tells us to use the EPUB's 'unique-identifier'. Every EPUB has a unique-identifier, and its unique-identifier is referenced in the root element of the OPF file. Look through the metadata in the OPF and you'll find an identifier with an id that matches the unique-identifier. That's the EPUB's unique identifier, and it is the basis for our obfuscation key.
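For anyone working from the reading side, the lookup described above might look something like the following sketch. It uses the standard Java DOM parser. The class name, method name, and sample OPF content (including both identifier values) are hypothetical, and this code is not taken from EPUBGen.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; not part of EPUBGen.
final class OpfIdentifierSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the text of the dc:identifier whose id matches the
    // unique-identifier attribute on the OPF root element, or null if none does.
    static String findUniqueIdentifier(String opfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(opfXml)));

            String wantedId = doc.getDocumentElement().getAttribute("unique-identifier");
            if (wantedId.isEmpty()) {
                return null;
            }
            NodeList identifiers = doc.getElementsByTagNameNS(DC_NS, "identifier");
            for (int i = 0; i < identifiers.getLength(); i++) {
                Element identifier = (Element) identifiers.item(i);
                if (wantedId.equals(identifier.getAttribute("id"))) {
                    return identifier.getTextContent();
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the OPF", e);
        }
    }

    public static void main(String[] args) {
        // Made-up OPF fragment with two identifiers; only "bookid" is referenced by the root.
        String opf =
            "<package xmlns=\"http://www.idpf.org/2007/opf\" version=\"2.0\" unique-identifier=\"bookid\">"
            + "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
            + "<dc:identifier id=\"isbn\">urn:isbn:0000000000</dc:identifier>"
            + "<dc:identifier id=\"bookid\">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>"
            + "</metadata></package>";
        System.out.println(findUniqueIdentifier(opf));
    }
}
```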

Of course, because EPUBGen is creating the EPUB rather than parsing it, we need to create the identifier. This is done in the addUID method.

Both the addUID method and the makeXORMask method (below) are found in Publication.java.

	/*
	 * Every EPUB needs a unique identifier, this could be an ISBN or other identifier.
	 * In this case we're generating a random identifier.
	 * 
	 * For the purposes of font obfuscation, this does not need to be random, just unique (like an ISBN)
	 */
	public String addUID() {
		String uid = this.generateRandomIdentifier();		
		return uid;
	}

The value of the unique-identifier needs to have the whitespace removed. The resulting string will be passed through the SHA-1 hash algorithm to give us our key or mask for the font files.

Note: I'm only showing the portion of makeXORMask that relates to the IDPF approach. The project can also create a mask for the deprecated approach.

private byte[] makeXORMask() {
	if(opfUID == null)
		return null;
	ByteArrayOutputStream mask = new ByteArrayOutputStream();
	if (useIDPFFontMangling){
		/*
		 * This starts with the "unique-identifier", strips the whitespace, and applies SHA1 hash
		 * giving a 20 byte key that we can apply to the font file.
		 * 
		 * See: http://www.openebook.org/doc_library/informationaldocs/FontManglingSpec.html
		 */
		try {
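			// SHA-1 ships with the default JDK providers, so no extra provider needs to be registered.
			MessageDigest sha = MessageDigest.getInstance("SHA-1");
			String temp = strip(opfUID);
			// Hash the encoded bytes. The character count of the string can differ
			// from the byte count, so use the length of the byte array.
			byte[] idBytes = temp.getBytes("UTF-8");
			sha.update(idBytes, 0, idBytes.length);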
			mask.write(sha.digest());
		} catch (NoSuchAlgorithmException e) {
			System.err.println("No such Algorithm (really, did I misspell SHA-1?)");
			System.err.println(e.toString());
			return null;
		} catch (IOException e) {
			System.err.println("IO Exception. check out mask.write...");
			System.err.println(e.toString());
			return null;
		}
		if (mask.size() != 20) {
			System.err.println("makeXORMask should give a 20 byte mask, but didn't");
			return null;
		}
	}
	else {
		/*
		 * This is where the deprecated approach is implemented.
		 * Removed for clarity.
		 */
	}
	return mask.toByteArray();
}
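Stripped of the project plumbing, the key derivation comes down to two steps: remove the whitespace, then hash. Below is a minimal standalone sketch of that idea. The class and method names are hypothetical, and the exact set of characters treated as whitespace (space, tab, carriage return, line feed) is my assumption about what the project's strip helper does, since that helper isn't shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of EPUBGen.
final class ObfuscationKeySketch {

    // Derives a 20-byte key from the unique identifier:
    // strip whitespace, encode as UTF-8, then apply SHA-1.
    static byte[] makeKey(String uniqueIdentifier) {
        // Assumption: "whitespace" means space, tab, CR and LF.
        String stripped = uniqueIdentifier.replaceAll("[ \\t\\r\\n]", "");
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            return sha1.digest(stripped.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Made-up identifier with stray whitespace around it.
        byte[] key = makeKey("  urn:uuid:00000000-0000-0000-0000-000000000000\n");
        StringBuilder hex = new StringBuilder();
        for (byte b : key) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(key.length + " bytes: " + hex);
    }
}
```

Since the whitespace is removed before hashing, an identifier that picks up a stray newline or indentation in the OPF still produces the same key.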

Specifying Obfuscated Resources

The result is put together by the serialize method in Publication.java. You'll see the call to makeXORMask passing that mask to the IDPFFontResource, the call to the resource serialization (res.serialize), and then generating the encryption.xml file.

public void serialize(ContainerWriter container) throws IOException {
	Enumeration names = resourcesByName.keys();
	byte[] mask = makeXORMask();
	boolean needEnc = false;
	while (names.hasMoreElements()) {
		String name = (String) names.nextElement();
		Resource res = (Resource) resourcesByName.get(name);
		if (mask != null && res instanceof AdobeFontResource) {
			((AdobeFontResource) res).setXORMask(mask);
			needEnc = true;
		}
		if (mask != null && res instanceof IDPFFontResource) {
			((IDPFFontResource) res).setXORMask(mask);
			needEnc = true;
		}
		OutputStream out = container.getOutputStream(name, res.canCompress());
		res.serialize(out);
	}
	if (needEnc) {
		XMLSerializer ser = new XMLSerializer(container.getOutputStream("META-INF/encryption.xml"));
		ser.startDocument("1.0", "UTF-8");
		ser.startElement(ocfns, "encryption", null, true);
		names = resourcesByName.keys();
		while (names.hasMoreElements()) {
			String name = (String) names.nextElement();
			Resource res = (Resource) resourcesByName.get(name);
			if ((res instanceof FontResource) && (useIDPFFontMangling)) {
				SMapImpl attrs = new SMapImpl();
				ser.startElement(encns,"EncryptedData", null, true);
				attrs.put(null, "Algorithm", "http://www.idpf.org/2008/embedding");
				ser.startElement(encns, "EncryptionMethod", attrs, false);
				ser.endElement(encns, "EncryptionMethod");
				ser.startElement(encns, "CipherData", null, false);
				attrs = new SMapImpl();
				attrs.put(null, "URI", name);
				ser.startElement(encns, "CipherReference", attrs, false);
				ser.endElement(encns, "CipherReference");
				ser.endElement(encns, "CipherData");
				ser.endElement(encns, "EncryptedData");
			}
			else if (res instanceof FontResource) {
				/*
				 * This is where the deprecated approach is implemented.
				 * Removed for clarity.
				 */
			}
		}
		ser.endElement(ocfns, "encryption");
		ser.endDocument();
	}
	XMLSerializer ser = new XMLSerializer(container.getOutputStream("META-INF/container.xml"));
	/*
	 * Code removed for brevity.
	 * Can be viewed in the project, it just fills out the container.xml file.
	 */	
}

And with that, you should have the IDPF recommended font obfuscation approach. The font still needs to be listed in the manifest, and can be referenced by the CSS as it normally would.
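If you want a feel for what ends up in META-INF/encryption.xml, the sketch below builds an equivalent document as a plain string for a list of font paths. It is not the project's XMLSerializer code. The class name, the "enc" prefix, and the sample font path are made up, and the two namespace URIs are my assumption of what ocfns and encns hold (the OCF container namespace and the W3C XML Encryption namespace). The Algorithm value is the one used in the listing above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of EPUBGen.
final class EncryptionXmlSketch {
    // Assumed values for the ocfns and encns constants used in the listing above.
    static final String OCF_NS = "urn:oasis:names:tc:opendocument:xmlns:container";
    static final String ENC_NS = "http://www.w3.org/2001/04/xmlenc#";
    static final String IDPF_ALGORITHM = "http://www.idpf.org/2008/embedding";

    // Minimal escaping so a path can be placed safely in a double-quoted attribute.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // Builds one EncryptedData entry per obfuscated font path.
    static String build(List<String> fontPaths) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<encryption xmlns=\"").append(OCF_NS)
           .append("\" xmlns:enc=\"").append(ENC_NS).append("\">\n");
        for (String path : fontPaths) {
            xml.append("  <enc:EncryptedData>\n");
            xml.append("    <enc:EncryptionMethod Algorithm=\"")
               .append(IDPF_ALGORITHM).append("\"/>\n");
            xml.append("    <enc:CipherData>\n");
            xml.append("      <enc:CipherReference URI=\"")
               .append(escapeAttribute(path)).append("\"/>\n");
            xml.append("    </enc:CipherData>\n");
            xml.append("  </enc:EncryptedData>\n");
        }
        xml.append("</encryption>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Made-up font path inside the container.
        System.out.println(build(Arrays.asList("OEBPS/fonts/Example.otf")));
    }
}
```

Each obfuscated font gets its own EncryptedData element, and the URI in CipherReference is the resource's path within the container, matching the name passed to container.getOutputStream in the listing above.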

May 8, 2009

InDesign 6.0.2 and EPUB export

There's a new update to InDesign CS4: version 6.0.2. The update fixes lots of other things too, so you may want to look at the release notes to see what's fixed outside of EPUB export. I'm only going to talk about the EPUB plug-in.

Issues fixed by the update are:

Content is dropped during export to Digital Editions when two or more documents are included in a book. [2262891]*
This was an issue where, if you have more than one document in a book file, and you export an EPUB from the book file, the first document is the only one that's guaranteed to have any content.
Export for Digital Editions will put underscores, rather than spaces, in the Contents if TOCStyle is not used. [2296603]*
This is the case where you use the InDesign Book feature to export a set of documents, and the file names end up in the "Contents" panel in ADE. The problem was that if you had spaces in the filenames, the export filter would replace them with underscores. You'll now get spaces rather than having them replaced by underscores.

The issues that were fixed in previous updates, in case you missed them:

Bold (and in some cases italic) won’t work when exporting an InDesign document to Digital Editions. [2296606]
It was actually that bold wasn't working when it wasn't in the paragraph style, and italic worked only if you did not embed the font. Both were fixed in the 6.0.1 update, but are also fixed in the 6.0.2 update.
Exporting to Digital Editions fails with an error in the Italian and French version. [1923992]
The dialog wasn't working, which had prevented the plug-in from being used.

So with this update you should be getting much better output from InDesign CS4.