Friday, May 3, 2013

A Messy Process

I am just about ready to publish another book for the Kindle. Unlike the previous two, which were children's books with simple captioned graphics that I authored directly in HTML, this one is an article that I originally wrote using LibreOffice, one of the open-source alternatives to Microsoft Office.

Which would have worked out fine if the book had been destined for print. But as it's only about 4300 words long, it will be sold only as a Kindle edition. The problem lies in going from OpenDocument format to HTML using the export built into LibreOffice.

Now, the HTML is perfectly legitimate, and it works, to a point. But what it fails to do, and it's something that I find very important, is keep the page breaks in the original. Now, understand, I don't really care if it keeps all of the original formatting. In fact, to make it most compatible across devices, it shouldn't try to keep any of the formatting, except for styles (bold, italic, underline), paragraph breaks, and page breaks. I really want to keep my page breaks.

But what LibreOffice (and OpenOffice, and Microsoft Office) tries to do is make the page look the same in your browser as on the printed page. Which is pretty much the opposite of what HTML is supposed to be about, and exactly what you don't want when formatting a book for a device that could be a 19-inch computer monitor or a 4-inch phone screen, or anything in between.

Flexible formats like HTML, EPUB, and MOBI are meant to be contextual, not graphic. You need to preserve bold and italic and underline because they mean something in the context of the text. And the same goes for page breaks, which, like paragraph breaks, have a profound effect on the flow of what you are reading. Which is why it's so maddening that LibreOffice insisted on keeping my 72-point title page font size (which won't even fit on one Kindle page) and yet ignored all the manual page breaks, putting the title, the copyright, and the dedication all together.

We need a better solution for converting these Office documents (whether Open-, Libre-, or Microsoft) into HTML that works for real devices, not some half-assed ghost of the printed page that can never be fully realized, and would not read well for all if it were.

I've seen some solutions on the Internet, but most of them involve a lot of command-line scripting that most of us don't have any time for. I may resort to some myself, or some PHP programming.

At some point. For this book, I will probably just try to manually strip away most of the superfluous mess, clean up whatever damage it left, and leave it in less than perfect—but still much more readable—shape.

And I will add back my page breaks.
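
For anyone who would rather script the cleanup than do it by hand, here is a rough sketch in Python of the kind of stripping described above. It is not a finished converter, and it rests on a few assumptions of mine: that the only tags worth keeping are `p`, `b`, `i`, `u`, `strong`, and `em`; that you mark each page break in the original document with a paragraph containing only the made-up marker `[[PAGEBREAK]]`; and that the Kindle conversion tools honor the CSS `page-break-before: always` rule, which is worth checking against Amazon's current guidelines. The names `Cleaner` and `clean_html` are hypothetical, and only the standard library is used.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: these are the only tags that carry meaning for the book.
KEEP = {"p", "b", "i", "u", "strong", "em"}
# Everything inside these elements is thrown away, text included.
SKIP_CONTENT = {"head", "style", "script"}
# Hypothetical marker typed into the document as its own paragraph.
PAGE_BREAK_MARKER = "[[PAGEBREAK]]"
# Assumption: the target converter treats this CSS rule as a page break.
PAGE_BREAK_HTML = '<div style="page-break-before: always"></div>'


class Cleaner(HTMLParser):
    """Re-emit only the kept tags (without attributes) and the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            # Attributes (fonts, sizes, classes) are dropped on purpose.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape &, <, > so the output is still valid HTML.
            self.out.append(escape(data, quote=False))


def clean_html(source):
    """Strip layout markup and turn marker paragraphs into page breaks."""
    parser = Cleaner()
    parser.feed(source)
    parser.close()
    text = "".join(parser.out)
    marker_paragraph = r"<p>\s*" + re.escape(PAGE_BREAK_MARKER) + r"\s*</p>"
    return re.sub(marker_paragraph, PAGE_BREAK_HTML, text)
```

Calling `clean_html` on the exported file's contents returns only the body markup, so it still needs to be wrapped in a minimal HTML skeleton before conversion. Unknown tags such as `font` and `span` are dropped while their text is kept, which is how the 72-point title loses its size but not its words.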