lardbucket: 2012 book archive: now with pdfs

11/30/2014

2012 Book Archive: Now With PDFs

Filed under: General — Andy @ 10:58 am

Almost two years ago, I launched (what became) my 2012 book archive. There’s a bit of background on that project on that page. While there have been a few minor developments since then, none of them have been noteworthy enough to document here. Recently, however, I decided that I would try to make PDF copies of the books. I wanted these to be good-quality PDFs, as although a number of PDFs of the books had circulated from other groups in the past, they weren’t particularly visually appealing: they were essentially what you would get by quickly printing each of the HTML files from the publisher.

My guess was that a simple HTML file with all of the book content could be combined with a quick print CSS style and get good-quality output without too much effort. I figured that with a decent number of iterations on the book CSS and test runs, the project might take a week or so. Over two months later, I’m finally set to release the PDFs. I’ve made a few notes of the things that I did along the way, in case anybody finds them useful in the future.

If you just want the PDFs, you can visit the 2012 book archive and download them there. There are a few ways to download each book: you can download a whole-book PDF file, or a PDF for each chapter. Both are accessible from a book’s table of contents, and the whole-book PDFs can be downloaded from the archive’s book list as well.

Getting a PDF

It’s worth noting that the content I have from the books is already in an HTML format, not necessarily the raw input to a book creator. I decided early on to try to work with HTML and CSS, because although the HTML I have is structured and could probably be parsed into a different format, doing that correctly for over a hundred books would be iffy. With that said, the HTML was well-structured, and lent itself to easy use of CSS to style specific things.

If you’ve been following CSS development, you’ll know that there are a lot of features in CSS that are supposed to allow styling of printed content. Unfortunately, browser support for some of these is spotty, and they don’t necessarily provide all of the features that you’d like when styling a book. (In particular, footnotes, page-relative positioning, page numbering, and PDF bookmarks are all difficult. There are likely other features I’ve forgotten about as well.)

At first, I was hoping to be able to use something like wkhtmltopdf, which wraps WebKit and converts an input file to a PDF. This gave me a number of problems, and didn’t seem to support the concept of pages as natively as would be desirable. (It’s still impressive that they managed to get the project to work as well as it does, but it doesn’t produce nice-looking books yet.) After that, I decided that perhaps Firefox’s support for printing would work for me: it works pretty well with many of the CSS printing features, and I can probably script Firefox to output PDFs if I want. Unfortunately, I again ran into bugs with the rendering. I don’t recall the details at the moment, but I believe that content had a tendency to not wrap between pages in the right spots. Either way, this sent me in search of a high-quality PDF rendering solution.

If you look online for advice about generating PDFs from HTML, you will inevitably run across many people suggesting PrinceXML (or, as it seems to be rebranding itself, Prince). They’re probably right. It is a commercial piece of software in a case where I had hoped to use free software, but it is still the best solution I have found by far, both in ease of use and functionality.

Princely Things

Prince itself is not cheap. A personal license is $495 at the time of writing, and even that may not cover what I intend to do in terms of converting books automatically. (To be clear, it might be covered, but only just barely if it is. I haven’t asked, for reasons I’ll explain shortly.) If you are doing anything serious with Prince, you’re probably looking at a $3800 license per server generating PDFs, or a $1900 one if you’re only doing academic things. Upgrades are available for an additional annual cost. To be clear, if you’re generating revenue from the PDFs (or even just saving yourself loads of time), Prince is almost certainly worth every penny, but it’s prohibitive for side projects.

For non-commercial projects, Prince offers a free version with the requirement that you allow it to add a logo and link to the corner of your document’s first page, link to their website wherever you have Prince PDFs for download, and link to their website on a sponsors/partners page. This is mostly unintrusive (although a tad confusing at first: I’ve considered trying to style in a little “Made with” above the logo to explain why it’s there), and very nice of Prince to allow. (To get the “Non-commercial” license, just download the software: you don’t need a special license key or anything.)

In fact, I had a question about their licensing (“The books are licensed under a Creative Commons license that doesn’t allow me to add restrictions to them, so is it required for people who receive the PDFs from me to keep the Prince logo on them? If so, I can’t use the noncommercial license.”), emailed them, and got an email back quite quickly from Håkon Wium Lie, Prince’s Director (not to mention CTO at Opera and founding member of the Pirate Party of Norway). He’s definitely on top of things, and was quite happy to help. (The answer is no, other people can do whatever they want to the PDFs. In my case, they’re still subject to the Creative Commons license they always were, but that’s not because of Prince.) Later, I had a question about how to get something to render correctly (a somewhat minor, obscure layout bug), and quickly received a comment from Mike Day, the CEO, noting that they were looking into the issue. When I followed up, the bug hadn’t yet been fixed (it undoubtedly has tricky interactions with their page layout code), but I quickly received an alternative suggestion complete with example code. Definitely a pleasant experience all the way around.

If you’re looking for a cheaper option to start with, you’ll probably run into DocRaptor as well. DocRaptor started out as Prince-as-a-service, providing an API to allow people to generate PDFs using Prince. It now appears to support Excel files, although I haven’t looked into those features. For many people, the benefits of being able to rely on DocRaptor to scale up as your workloads do (they claim “thousands of documents a second”) and the lower initial costs are probably a great benefit. They also provide well-supported libraries for a number of languages, where Prince usage is largely done by command line (although Prince has a PHP API as well). Overall, DocRaptor almost certainly provides benefits for many people. However, their plans aren’t super cheap either, and they’re targeted at recurring use, not one-shot uses like mine. I generated over 2500 PDFs in my final output (one per book, plus one per chapter), which would probably have cost me $149 in a month, assuming I didn’t want to tweak them later. Still far cheaper than the cheapest Prince license, but pricey for a personal side project like mine.

DocRaptor does have a 7-day free trial, which probably would have allowed me to generate whatever I wanted during that time, but that’s not exactly ideal, either. (Nor do I mind paying something for the service, but over a hundred dollars a shot is high for my purposes.) I emailed the DocRaptor folks about a pay-as-you-go plan (so I wasn’t paying monthly fees when I wasn’t using the service), because I had found references to such a plan elsewhere. I got a very nice response from Matt Gordon, the “lead vocalist” for the group running DocRaptor. Unfortunately, they no longer offer that plan, because they found that disproportionately more of their support costs (and they do provide good support) were going to users who didn’t spend much on the service anyway. We had a nice conversation about the possibility of plans that might support alternative uses such as mine, but it doesn’t sound like there’s anything planned in the immediate future. (I can’t blame them, as they need to make money and do what makes sense for their business to continue existing.) They did make a very nice offer (I won’t disclose the details) that I turned down for unrelated reasons, but they’re definitely nice folks too.

My conclusion is that you pretty much can’t go wrong with Prince or DocRaptor. Both have very nice and responsive folks behind them, and seem to be quite well done.

Tables of Contents and Bookmarks

One of the things relatively unique to printed books is cross-references with page numbers. Most of the book content doesn’t include these. This is primarily because any existing cross references are links to a specific section, and I didn’t think it necessary to include a page number along with the section number. However, the table of contents for the book definitely benefits from page numbers. Pulling a table of contents together in Prince is relatively easy. It could possibly be done automatically with JavaScript, but I chose to create tables of contents in a Ruby preprocessor as I was assembling whole-book files anyway. Prince makes it easy to include page numbers for links to given anchors, so I only needed to pull out the anchor for each section. (Luckily for me, I already had the anchors in a database.)
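For illustration only, here is a rough sketch of how a table of contents like this could be assembled. It is written in Python rather than the Ruby preprocessor mentioned above (which isn't published), and the function name, the "toc" class name, and the sample anchors are all made up. The CSS relies on target-counter() and leader(), which Prince documents for page-number cross-references; treat the exact rule as an assumption rather than the stylesheet actually used for the books.

```python
from html import escape

# CSS asking the formatter to append the page number of each link's target.
# target-counter() and leader() come from CSS Generated Content for Paged
# Media; the selector and class name here are hypothetical.
TOC_CSS = """
ul.toc a::after {
  content: leader('.') target-counter(attr(href), page);
}
"""


def build_toc(sections):
    """Build a TOC list from (anchor_id, title) tuples in reading order."""
    items = []
    for anchor, title in sections:
        href = escape(anchor, quote=True)
        items.append(f'  <li><a href="#{href}">{escape(title)}</a></li>')
    return '<ul class="toc">\n' + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    # Hypothetical anchors; in practice these would come from the database.
    print(build_toc([("ch01", "Introduction"), ("ch02", "Supply & Demand")]))
```

The HTML only needs the anchors: the page numbers themselves are filled in by the formatter at layout time, which is why nothing in the preprocessor has to know where a section ends up.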

Secondly, I wanted to make sure that chapters and sections were listed in the PDF list of bookmarks. This list is sometimes useful when navigating a book in a PDF viewer, although some viewers don’t show it. Prince again makes this quite easy, simply requiring a CSS annotation for the items you wish to be bookmark headings. (In fact, by default it uses h1-h6 tags, but I disabled that default because it picked up way too many bookmarks.)

Optimization

In creating the full-book files, I noticed that some books created particularly large files. In general, this appeared to be because they embedded the full source images, rather than resampling them. While an option to resample the images inside Prince would be great, it doesn’t exist at this time. Some of the source images were quite large, and clearly intended to be printed at >= 300 dpi, while most users of the PDFs wouldn’t benefit from such images. My first attempt at reducing file size was to use Ghostscript to resample the images. Ghostscript has some features that work similarly to the now-unavailable Acrobat Distiller, and seemed likely to do the job. Unfortunately, after getting Ghostscript working (Ubuntu 14.04’s version appears to crash on larger documents, but 14.10’s works), I found that it removed page numbering information and bookmarks. The next step was to try to export this metadata using PDFtk before using Ghostscript, and then import it again afterward. Unfortunately, while PDFtk will output page numbering details, it won’t import them into a PDF, and there doesn’t appear to be any easily-available way to do so.

So, I temporarily abandoned the option to resample the images using Ghostscript. (It also may or may not have been worth it in the first place: some Ghostscript-generated files were larger than the Prince originals, so I had to handle both cases.) It may be worth patching Ghostscript in the future to keep the metadata around, but that seems likely to be quite involved. In many cases, you may get some benefit out of using Ghostscript with appropriate options (“gs -q -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dBATCH -dNOPAUSE -sOutputFile=[outfile] [inputfile]” seems to work well besides the additional metadata), but it was unfortunately unsuitable for my purposes at this time.
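If you want to try the Ghostscript route anyway, a small wrapper along these lines handles the "sometimes the output is larger" case by keeping whichever file is smaller. This is a sketch, not the script used for the archive: the function names and the temporary-file suffix are invented, and the only Ghostscript options used are the ones quoted above. It still drops bookmarks and page numbering, exactly as described.

```python
import os
import shutil
import subprocess


def gs_command(src, dst, preset="/printer"):
    """Build the Ghostscript command line quoted in the post."""
    return [
        "gs", "-q", "-sDEVICE=pdfwrite", f"-dPDFSETTINGS={preset}",
        "-dBATCH", "-dNOPAUSE", f"-sOutputFile={dst}", src,
    ]


def pick_smaller(original, candidate):
    """Return the path of the smaller file, preferring the original on ties."""
    if os.path.getsize(candidate) < os.path.getsize(original):
        return candidate
    return original


def shrink(src, dst):
    """Run Ghostscript on src and write the smaller of the two files to dst."""
    tmp = dst + ".gs.tmp"  # hypothetical scratch file name
    subprocess.run(gs_command(src, tmp), check=True)
    best = pick_smaller(src, tmp)
    shutil.copyfile(best, dst)
    os.remove(tmp)
    return best
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.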

Chapter Files

Following up on the “whole book PDFs can be pretty big” issue, and after trying to open some such books and experiencing slow loading times, I decided that it may be appropriate to create one PDF per chapter as well as the whole-book PDF. My first pass at creating these PDFs was to use PDFtk to pull out just the pages from a given chapter. This posed a few problems: first, I had to figure out which pages belonged to which chapter. Luckily, the bookmarks inserted by Prince, combined with PDFtk’s metadata output, gave me the starting page for each chapter (although for a few minor reasons, this link was a bit iffy: the generated bookmark title did not always match the section name I had in my database), and I could assume that a chapter ended just before the next one began. Unfortunately, this ran into the same problem I had before: I would lose the page numbering and bookmarks. (Not to mention the fact that I would need to separately render a new first page to describe the licensing and get the Prince logo back on the first page.) Finally, I decided to simply depend on Prince once again.
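The "a chapter ends just before the next one begins" step can be sketched as follows. This assumes the text output of `pdftk book.pdf dump_data`, which lists each bookmark as BookmarkTitle, BookmarkLevel and BookmarkPageNumber lines, and it assumes that chapters are the level-1 bookmarks; both function names are made up for the example.

```python
def parse_bookmarks(dump_text):
    """Return (level, title, page) tuples from pdftk dump_data output."""
    bookmarks = []
    title = level = None
    for line in dump_text.splitlines():
        key, _, value = line.partition(": ")
        if key == "BookmarkTitle":
            title = value
        elif key == "BookmarkLevel":
            level = int(value)
        elif key == "BookmarkPageNumber":
            # The page number is the last field of each bookmark record.
            bookmarks.append((level, title, int(value)))
            title = level = None
    return bookmarks


def chapter_ranges(bookmarks, total_pages, level=1):
    """Return (title, first_page, last_page) for each bookmark at `level`."""
    starts = [(title, page) for lvl, title, page in bookmarks if lvl == level]
    ranges = []
    for i, (title, first) in enumerate(starts):
        # A chapter runs until the page before the next chapter starts;
        # the final chapter runs to the end of the document.
        last = starts[i + 1][1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((title, first, last))
    return ranges
```

Each range could then be handed to something like `pdftk book.pdf cat 3-14 output chapter1.pdf`, with the caveat already noted: the extracted file loses its bookmarks and page numbering.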

I got Prince to log the page number and ID for each chapter heading by using a ::before pseudo-element with a content property of “prince-script(log, counter(page), attr(id))”, and a small “log” function in the JavaScript on each page. This allowed me to use the IDs to match up with my database, and easily identify where each chapter started. Because I already had whole-chapter HTML files, I could then use those HTML files to render the chapter in Prince, and everything would still be in sync, without having to try to render and merge together separate front pages for each chapter. (I still needed to get the page numbers to Prince for rendering purposes, but for this, I simply placed the page number in a CSS block in the HTML file.)
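As a sketch of the "page number in a CSS block" step: the snippet below injects a style element that restarts the page counter at a chapter's first page. The exact rule used for the books isn't given here, so the `body { counter-reset: page N; }` rule is an assumption (Prince documents resetting the page counter with counter-reset), and both function names are hypothetical.

```python
def start_page_style(first_page):
    """Return a <style> block that starts page numbering at first_page."""
    if first_page < 1:
        raise ValueError("page numbers start at 1")
    # Assumed rule: reset the built-in page counter on the body element.
    return (
        '<style type="text/css">\n'
        f"body {{ counter-reset: page {first_page}; }}\n"
        "</style>"
    )


def inject_style(html, first_page):
    """Insert the style block just before </head>, or prepend it if absent."""
    style = start_page_style(first_page)
    idx = html.lower().find("</head>")
    if idx == -1:
        return style + "\n" + html
    return html[:idx] + style + "\n" + html[idx:]
```

With the starting page taken from the whole-book render, a chapter file processed this way should number its pages the same way the whole-book PDF does, provided the chapter lays out identically in both renders.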

This solution appears to have worked surprisingly well, with the page numbers matching up where expected. Because the files were rendered separately, there is the possibility of some unforeseen issue (I certainly didn’t inspect the thousands of files by hand), but it seems unlikely.

Math

Finally, when reviewing one of the math textbooks in the collection, I noticed that Prince’s MathML rendering wasn’t particularly great. It is definitely better than nothing, but the rendering quality did leave something to be desired. Unfortunately, the most common web-based solution here, MathJax, doesn’t work very well with Prince. (This is a noted todo item on Prince’s release notes, but it’s not available yet.) After stumbling through a number of other options to try, I ended up using PhantomJS together with MathJax to prerender the math to MathJax’s “HTML-CSS” output (the SVG output didn’t look very good and produced a very large PDF file after the required fixes to make Prince display the SVG output). I forced MathJax to use the STIX fonts (which I installed on my computer), and after the math was rendered, I output the document’s HTML form again (after removing the MathJax wrapper divs). This produced files with reasonably good-looking math, the way they were intended to look. The prerendering code hasn’t been published yet because I haven’t taken the time to clean it up, but if someone is interested, I can definitely post it.

Prerendering with MathJax is a step that seems to have very poor asymptotic time complexity. I haven’t formally benchmarked it, but a chapter’s sections took about two minutes to prerender in total, while the whole chapter itself took roughly twelve minutes to prerender. The whole book took roughly four days to prerender. It’s not clear why this occurred, but the prerendering did eventually succeed. It’s also not clear if this is a bug in MathJax, or simply some inefficiency in PhantomJS, so I have yet to report it as a bug to either project (and may never report it – it’s unlikely to come up in common use).

Fin

So, to summarize, getting PDFs of a quality I’m comfortable with took quite a bit of effort. In the end, Prince does most of the work, and I rarely had problems with Prince itself. I think it was worth the effort, at least for a personal learning experience. Hopefully the books will be useful to other people as well. Once again, they’re all available at http://2012books.lardbucket.org. Please feel free to copy or redistribute them as you see fit, pursuant to the terms of the associated Creative Commons by-nc-sa license.

Andy Schmitz

P.S. If you’re interested in any of the print-specific (or Prince-specific) things I did to make the books look decent when printed, it’s all left in the book’s CSS file toward the bottom, under the “prince” @media type. Feel free to reuse any of that styling for any purpose you see fit, in any situation. I do not believe it is covered under the Creative Commons license: you may consider it to be public domain.

17 Comments »

  1. Hi there Andy. Can you advise on who to contact for permission to use material from one of the books in the 2012books archive for commercial use? Thanks!

    Comment by Ilka — 12/11/2014 @ 7:37 am

  2. Ilka: Unfortunately, I don’t have the ability to license the books for commercial use, as I am not the original author (or publisher). The publisher has asked to remain anonymous, but I will contact you privately to see if it is okay to pass along your information to them and have them contact you.

    Comment by Andy — 12/11/2014 @ 9:45 pm

  3. Dear Andy, do you know how to access business cases videos “How Would You Handle This?” from Beginning Management of Human Resources book? The videos ask for password (for ex. p.30). Thank you.

    Comment by Tetiana — 1/8/2015 @ 8:18 am

  4. Tetiana: Unfortunately, it looks like those Wistia videos have been removed. For the moment, they’re inaccessible. I do have copies of those videos that I saved when I downloaded the books, so I’ll have to see about getting those put up. Unfortunately, getting videos to display online isn’t particularly straightforward, so it will likely take quite a while for me to get them all together.

    Comment by Andy — 2/15/2015 @ 12:47 pm

  5. Hi Andy,
    Thanks for your work preserving the Creative Commons book archive. I had bought individual chapters from Flatworld for my students under their old business model, but they no longer offer that fee structure and I didn’t want to buy their entire books when I only used a couple chapters, so I stopped assigning the chapters. It is nice to be able to find the books for use as a reference again.
    I’m thinking about trying to move more of my reading material over to a Nook or Kobo e-reader. My ancient Kindle just broke, so I need to do something different anyhow. I’m wondering if you have any plans for posting these in the EPUB format or if you have any advice about reading them on an EPUB reader?
    Thanks,
    -Jonathan

    Comment by Jonathan Andreas — 4/2/2015 @ 11:24 am

  6. Jonathan: Thanks! I don’t have any near-term plans for posting the books in an EPUB format, mostly because of my currently oversubscribed free time. It’s a reasonable idea, and shouldn’t really be all that difficult: EPUB is basically an HTML document anyway. Unfortunately, if I do end up making EPUBs out of the books, I would probably want to spend a bunch of time making sure I got everything right. (Obviously some things, like embedded YouTube videos, wouldn’t transfer well, but most things should be fine.) I’ll put it on my list, at any rate. Sorry about that!

    Comment by Andy — 4/2/2015 @ 4:50 pm

  7. What’s the difference between your two files for “Principles of general chemistry”? What does “1.0M” have that “1.0” doesn’t?

    Comment by Jay — 4/13/2015 @ 1:10 pm

  8. Jay: That’s a good question. At a quick glance, I wasn’t able to see any differences in the table of contents, although it is possible that some of the content is different. I presented them as separate books because they were separate books in the original source of the Creative Commons books, but I wasn’t able to determine why they were separate. Sorry I couldn’t be of more help!

    Comment by Andy — 4/20/2015 @ 6:57 pm

  9. It’s my honor to write and commend you on your effort,even though i could get access to you publication.

    Comment by IBRAHIM — 4/22/2015 @ 10:06 am

  10. Dear owner of 2012books.lardbucket.org/books Website

    Many thanks for posting books on your website! A student in my class has brought it to my attention that the atomic mass of Ag (silver) in the Periodic Table in section 2.7 in the book “Introduction to Chemistry: General, Organic, and Biological” (v. 1.0) is incorrect. The atomic mass should be 107.868, but it is listed as 196.56655. If this can be corrected, it would be great because it would help students who search for basic information from the web.

    Thank you very much for paying attention to this issue.

    Best regards,

    Youxue Zhang
    Professor, Univ. Michigan

    Comment by Youxue Zhang — 9/20/2015 @ 7:38 pm

  11. A colleague of mine and I wanted to use a diagram that I found in one of the books from Kurt Lewin in a book we are writing. How do I get permission to use that diagram?
    Thanks!

    Comment by Anil Saxena — 9/24/2015 @ 11:12 am

  12. IBRAHIM: Thank you! Are you indicating that you can’t view them? If so, can you give a bit more information on what’s not working? Thanks,

    Youxue Zhang: Thanks for the correction! I’ll try to update it soon. Unfortunately, I’ll have to figure out a way to do so, as I’ve thus far tried to keep everything very similar to the books as published, so it may take a while.

    Anil Saxena: Thanks for asking. Unfortunately, I don’t have any rights to sublicense the books. If you can work with the Creative Commons license listed on each page, then you don’t need any extra permission. If you need something beyond what’s allowed by that license, you will need to reach out to the current rightsholders. They have wanted to avoid me directly naming them, so I will contact them and see if I can forward along your request.

    Comment by Andy — 10/26/2015 @ 5:01 pm

  13. Hello,
    Thank you for the work you are doing to make textbooks accessible to all students. I am a community college instructor working to create a free textbook for students. My goal is to combine text from two of the Lardbuck textbooks to create a single composition and analysis book my students can use. My question is about attribution. Do you have any guidelines regarding citing our use of full and partial chapters? For example, should we include source information at the beginning of each chapter, before or after each section borrowed, in the preface, footnotes, in-text citations? I look forward to hearing from you.

    Comment by Tammy — 3/5/2016 @ 10:53 am

  14. Hello, I’m writing to request information about attributing the cc-sa-by textbooks. I’m an instructor at a community college and we are interested in utilizing chapters of several of the textbooks in our attempt to create a first year composition textbook for our students. I am hoping to learn how you expect users to attribute sections of the text (chapter introduction, foot notes, in-text citations, preface). Thank you.

    Comment by Tammy — 3/7/2016 @ 1:18 pm

  15. Hi Andy,

    The .PDF link for Advertising Campaigns: Start To Finish seems to be broken. I click on the .PDF link but I just get a black screen. Any advice would be appreciated, thank you.

    Comment by Steve G — 3/9/2016 @ 12:02 pm

  16. I have read through the licensing but still have a question. There are chapters of the book that I do not want to use, and I would like to add some of my own material. Can I delete or add material to the book? If so, is there anything that I need to do to modify the book? Thank you for your work on these books!

    Comment by Lynne — 5/31/2016 @ 8:15 am

  17. Sorry for not getting back to everyone sooner!

    Tammy: Unfortunately, I can’t really offer great citation recommendations, other than “cite them as you would anything else”. The Creative Commons organization has a set of best practices for attribution that may cover what you need. Because the original publisher has asked for the authors’ names and the publisher’s name to be removed from the attribution, you may be best off citing “an unnamed author”. You are welcome to link to my copies of the books as well, as the original publisher has asked that the copies not be linked to them. (I am also not a lawyer, and if you’re concerned about the legal aspects of citing a Creative Commons work, you may wish to talk to a lawyer.) Don’t forget that you can’t use the CC by-nc-sa license commercially. If you’d like me to get you in touch with the original publisher to talk to them about a different license, let me know.

    Steve G: The full PDF file for Advertising Campaigns: Start to Finish seems to load for me. It’s somewhat large, so it takes a bit to load, but it does successfully load. Can you try again? If it doesn’t work in your browser, try right-clicking the link, saving it, and opening the resulting file instead.

    Lynne: Yep! The Creative Commons license allows you to modify the book and reuse it, as long as you don’t do so for commercial purposes. To do so, just make sure you attribute the chapters that came from other authors to them. (In this case, that may be to “an unnamed author”: see my response to Tammy just above.) You’ll need to make sure you follow the license that’s linked to from the top of each chapter, but that should be relatively simple. Unfortunately, if you want to sell access to the books, you would need to get a license from the original publisher, as I don’t have the ability to give that permission. If you want to talk with them, let me know, and I can put you in touch.

    Comment by Andy — 6/17/2016 @ 4:56 pm
