Publishing DocBook content for Digital Editions

This week I was looking at DocBook markup and experimented with converting DocBook content into ePub. There seems to be a very good fit there. I was able to create an XSLT stylesheet that transforms a couple of O’Reilly-published books into ePub. Here is a sample: Greg Kroah-Hartman’s “Linux Kernel in a Nutshell” book as ePub (IE sniffs it as zip, but it is really an epub file). Unfortunately, I can only post it with freely distributable clip art, no embedded fonts and low-quality GIF images for the book’s illustrations, so it does not look quite as good as the original (but with the right resources it is possible to make it look just as good).

The XSLT stylesheet, a simple bash script that drives it and the free art that I have used can be downloaded here. If you want to run it yourself, this is what you need:

  • You need a UNIX-like environment with the bash, xsltproc and zip commands. I used Cygwin on Windows XP.
  • Unzip the downloaded file into a folder.
  • Copy the source DocBook XML (available at the book’s web site) into that folder; the main book file should be named book.xml.
  • Copy the images into epub/OEBPS/images.
  • Run ./epub.sh
  • If everything goes right, the ePub file will be written to book.epub.
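
For readers curious about the final packaging step, here is a minimal Python sketch of how an ePub container can be zipped. It is not the epub.sh script from the download (which relies on the zip command); the function name package_epub and the example folder layout are assumptions made for illustration. The one thing it is meant to show is that the mimetype entry is written first and stored without compression, so the string application/epub+zip appears as plain bytes near the start of the file.

```python
import os
import zipfile

MIMETYPE = "application/epub+zip"


def package_epub(root: str, out_path: str) -> None:
    """Zip the directory `root` into an ePub container at `out_path`.

    Hypothetical helper for illustration; `out_path` should live
    outside `root` so the archive is not added to itself.
    """
    with zipfile.ZipFile(out_path, "w") as zf:
        # The mimetype entry comes first and is stored, not deflated.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for folder, _dirs, files in os.walk(root):
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Archive names are relative to root and use forward slashes.
                arc = os.path.relpath(full, root).replace(os.sep, "/")
                if arc == "mimetype":
                    continue  # already written above
                zf.write(full, arc, compress_type=zipfile.ZIP_DEFLATED)
```

With the layout described above, a call such as package_epub("epub", "book.epub") would zip the epub folder into book.epub in the current directory.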

For this particular book you should also add the author’s name and the book identifier to the book source (otherwise they will be missing from the metadata and the title page). Insert the following into book.xml after the title tag:

<bookinfo>
<isbn role="13">9780596510480</isbn>
<author><firstname>Greg</firstname><surname>Kroah-Hartman</surname></author>
</bookinfo>

I have only tried the XSLT stylesheet with one other DocBook book, which is certainly not enough testing, so I don’t think it will work with arbitrary DocBook content. If someone wants to take it from there, that would be fantastic, because I won’t have time to polish it. The right thing to do is, of course, to integrate this capability into the existing XSLT framework for DocBook, which is used to publish DocBook content into XSL-FO and PDF.

And thank you, Greg, for writing this book and for making the DocBook sources available!

21 Responses to Publishing DocBook content for Digital Editions

  1. I have a question about font embedding. I understand if I want to embed them, they have to be OpenType (True Type and Type 1 are not officially supported, right?). However, embedding the fonts amounts to providing a copy of the font files within the zip archive file. Anyone would be able to unzip the file and get a copy of the OpenType files. What font foundries would allow such a plain and simple embedding of fonts? Does Adobe allow such embedding in non-protected epub files? I read somewhere that InDesign CS3 embeds the fonts in a way that makes them usable only with the current epub file. Is this true? If so, how does it do that? Is there a way or a tool to allow partial font embedding (removing all the glyphs that are not used in the epub file from the font files)? These questions would not be relevant if the font foundries allowed the publishers of epub files to include the font files as is, but I doubt they would be willing to allow that.

  2. That looks good. But I suppose some non-conformancy points ought to be noted:

       * content.opf / manifest is missing titlepage.css
       * toc.ncx / head / dtb:uid content is missing
       * toc.ncx / head / dtb:depth content is negative (should be ‘5’)
       * toc.ncx / all playOrder attributes are zero (should be a sequence starting at 1)
       * style.css and titlepage.css are invalid (there is no ‘adobe-page-master’ property)
       * coverpage.xhtml / style / import document is not CSS

     (It is too long since I used XSLT… 1, 2, and 4 seem straightforward changes, but 3 (counting depth of tree) seems awkward…)

  3. OK, I was really excited to see an “epub” file created from something other than Adobe® InDesign® that will work in Adobe® DigitalEditions®, and, given the O’s are already down by 4 despite Bedard starting, took time to look, but something’s odd about that file:

     Um, where did this come from in the sample DocBook file?

     <item id="template" href="template.xpgt" media-type="application/vnd.adobe-page-template+xml"/>

     And how did this file

     <ade:template xmlns="http://www.w3.org/1999/xhtml" xmlns:ade="http://ns.adobe.com/2006/ade" xmlns:fo="http://www.w3.org/1999/XSL/Format">

     get into the epub document?

     There are already a couple scripts written that’ll generate .epub from valid XML, but as I’ve asked repeatedly, is .xpgt (not required in the OPF standard) required for use in Adobe® DigitalEditions®? I’ve yet to see a file that lacks the .xpgt actually working in DE.

  4. Peter Sorotokin says:

    Ahmed, your observations are very good. I will add font embedding issues to my list of topics for future posts.

    Harrison, these are valid points, although I am not sure whether the OPS/OPF specs prohibit unknown properties or non-CSS stylesheets; I don’t think this is the case.

    David, template.xpgt is a stylesheet and it came (along with style.css) from the conversion tool. DocBook does not contain any styling information. The xpgt file is certainly not required for an epub to work in Digital Editions. This file is already on my list of things to cover here.

  5. OPS allows non-CSS stylesheets linked from the head of the HTML, but it also says any CSS used must be conformant. So if it is refactored so the xpgt is linked by the HTML, rather than imported by the CSS, I think that is then cool. And other reader apps just ignore (correctly) any unrecognized stylesheet links, which is good. (I wrote an epub construction guide recently, so I had to check the details of the specs.)

  6. Peter Sorotokin says:

    Harrison, where exactly in the spec are the conformance requirements for CSS that you mention? My interpretation of the spec is that unknown CSS properties and constructs should just be ignored by reading systems that do not understand them, but they do not make the document invalid. CSS contains an elaborate mechanism to ignore unrecognized content, built exactly for this purpose. Maybe I missed something, though.

  7. MishaS says:

    What about the requirement that the mimetype file be stored as is?

  8. Peter Sorotokin says:

    Misha, I am not sure what your question is. The script that I have seems to do that part correctly:

    $ dd if=LinuxKernel.epub bs=1 skip=38 count=20 2>/dev/null
    application/epub+zip

  9. MishaS says:

    Sorry for not being specific: your script does not seem to do anything to ensure that the file is not compressed. That’s it. I’m not sure how zip works: it might compress the data depending on the surrounding happenings :) I’ll check it out and come back if anything is wrong.

  10. Peter Sorotokin says:

    Ah, I understand now. Here is the secret: the zip command (per its man page) will always do what’s best in terms of storing vs. compression. Since it determines that storing is better in this case, it will always store the file. Some other zip utilities have this property as well (e.g. the Windows built-in “compressed folder” handler), while others do not.

  11. Wellllll… I infer: extensions to the XHTML subset are explicitly regulated, but there is no mention of CSS extensions. CSS itself seems strict: http://www.w3.org/TR/CSS21/conform.html section ‘Valid style sheet’ says unknown properties (etc.) are invalid. And with the old default ethic of being conservative in what you provide (and tolerant in what you require), the summation/simplification seemed to be: write valid CSS2.1 OPS subset.

  12. Peter Sorotokin says:

    Harrison, CSS2.1 validity criteria are developed specifically for pure CSS styling using the box model, so it is not a very useful yardstick for OPS applications. For instance, using any OPS-specific values for the display property would make it just as invalid as using an Adobe-specific property. Even using any SVG-specific properties would make it invalid in the CSS2.1 sense (even though SVG was developed by W3C).

    I see your point, though. It is certainly possible to move all Adobe-specific styling into the xpgt file.

  13. vuchrfvdvr says:

    Hello! Good Site! Thanks you! ptjkllomemg

  14. Joseph Gray says:

    Since I don’t see any way to post a question, other than as a comment to a blog entry, I’m posting here.

    The following style will create a dropcap that is two lines high, pushing both lines to the right, as it should. This displays ok in Firefox 2 and IE 6, with negligible difference. In Digital Editions, this only displays like a large letter on the first line. No dropcap effect. I arrived at these values by experimenting until I got a reasonable result in both Firefox and IE.

    span.dropcap {
      float: left;
      font-size: 3em;
      line-height: 0.7em;
      margin-top: 0em;
      margin-bottom: 0;
    }

    Some other CSS that doesn’t seem to work in Digital Editions:

    text-align: justify;
    margin-left: auto;
    margin-right: auto;

    As I mentioned on the TeleRead blog, a list of exactly which XHTML and CSS statements/properties are supported in Digital Editions would be very helpful.

  15. Peter Sorotokin says:

    I’ll look at the drop cap code. It’s most likely a bug in Digital Editions. All of these properties are supported and should work.

    Support for justification is optional per the W3C CSS spec, and support for a margin value of auto is optional per the IDPF OPS spec. We are planning to implement them at some point, but you can never rely on a reading system to support those.

  16. Tomas Ulej says:

    Great thing, but the XSLT ignore tag. Has anyone idea how to fix it?

  17. kader says:

    I want to hide the numbers that appear on epub pages viewed in Adobe Digital Editions. Thanks.

  18. Mangal says:

    I have a question about the page numbers that appear in the right margin in ePub files. Is this device-specific, or generated by something in the ePub files themselves, or by a standard applied across platforms? Can you help me understand this? In the creation of ePub files, do we have any control over how these page numbers are displayed? Is it possible to turn them off?

  19. Anyone interested in doing DocBook->.epub should use the dbtoepub tool released in DocBook-XSL 1.74.0.

  20. Kuldeep Kumar says:

    Is there any way to turn off the page numbers that appear on the right-hand side of every page of an ePub file?

  21. Peter Sorotokin says:

    There is no way to turn off page number marks at this point, but it is on our list.