Hathi Download Helper Alternative Method

Download books from the HathiTrust website quickly and easily

Brought to you by: hdh-creator

This page describes an alternative download method that also supports institutional HathiTrust login. It is adapted from guides published on the web and on a wiki (see the sources at the end of this page).

You may have access to the HathiTrust library through your university. If so, you can download books from it with a bit of work. HathiTrust provides access to lots of books unavailable on libgen and other places, so it’s worth having this trick in your back pocket, just in case.

You’ll need to install:
+ Firefox with the DownThemAll extension
+ img2pdf
+ ocrmypdf
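
Before starting, it can help to confirm that the two command-line tools are actually on your PATH. The snippet below is only a sketch: `missing_tools` is a name made up for this example, and it relies on nothing but the standard `command -v` shell builtin. Firefox and DownThemAll are not checked, since they are not command-line tools.

```shell
# Print the name of every given command that is not found on PATH.
# missing_tools is a hypothetical helper name, not part of any tool.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools img2pdf ocrmypdf)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Please install:" $missing
else
    echo "img2pdf and ocrmypdf are both installed."
fi
```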

I’m assuming you’re on Mac OS X, but this should work on Linux and, mutatis mutandis, on Windows, too.

1. Download raw pages from HathiTrust.

Using Firefox, access HathiTrust using your institutional login.
Search for your book and check it out.
In the PDF viewer, navigate to the last page of the PDF and copy the URL from the address bar of your browser.
Open a text editor and paste the URL you just copied. It will look something like this:
https://babel.hathitrust.org/cgi/pt?id={ID NUMBER}&view=image&seq={number}

e.g. https://babel.hathitrust.org/cgi/pt?id=njp.32101073965608&seq=14

On the line below this URL, paste the following template:
https://babel.hathitrust.org/cgi/imgsrv/image?id={ID NUMBER};seq={number};width=2000
Copy the {ID NUMBER} (everything after ?id= up to the next &) and the {number} of the last page (everything after seq= up to the next &, or to the end of the URL) into the template. In the example above, the ID is njp.32101073965608 and the page number is 14.
Then copy the new URL.
Now click on the DownThemAll button, and select “Manager.”
Paste it into the “Download” box. You’ll have something that looks like this:

https://babel.hathitrust.org/cgi/imgsrv/image?id=njp.32101073965608;seq=10;size=150;rotation=0

This link corresponds to only one page. To download the entire book, we’ll need to make DownThemAll fetch the whole sequence using its bracket notation: [1:24] stands for pages 1 through 24, so put your own last page number in place of 24. We can also request an arbitrary image size using “width” or “size.”

https://babel.hathitrust.org/cgi/imgsrv/image?id=njp.32101073965608;width=2000;seq=[1:24]

Alternatively, use the "res=0" parameter to get the best resolution available for the title:

https://babel.hathitrust.org/cgi/imgsrv/image?id=njp.32101073965608;res=0;seq=[1:24]
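
If you would rather not assemble the batch URL by hand, the copy-and-paste steps above can be sketched as a small shell helper. This is only an illustration: the function names `url_param` and `batch_url` are made up here, the width of 2000 is taken from the template above, and the script assumes the viewer URL carries its `id` and `seq` values as ordinary query parameters, as in the example.

```shell
# Extract one query parameter (e.g. id or seq) from a HathiTrust viewer URL.
# Parameters may be separated by "&" or ";". Hypothetical helper name.
url_param() (
    url=$1
    name=$2
    # Put every parameter on its own line, then keep the value of the one we want.
    printf '%s\n' "$url" | tr '?&;' '\n\n\n' | sed -n "s/^$name=//p" | head -n 1
)

# Build a DownThemAll batch URL covering page 1 through the page in the given URL.
batch_url() (
    id=$(url_param "$1" id)
    last=$(url_param "$1" seq)
    printf 'https://babel.hathitrust.org/cgi/imgsrv/image?id=%s;width=2000;seq=[1:%s]\n' "$id" "$last"
)

# Pass the URL of the LAST page, in single quotes so the shell leaves "&" alone.
batch_url 'https://babel.hathitrust.org/cgi/pt?id=njp.32101073965608&seq=14'
```

For the example URL this prints the imgsrv address ending in `seq=[1:14]`, ready to paste into DownThemAll.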

Note: Image downloads are limited to roughly 20 MB per minute.

To download PDF files, use the following link structure:

https://babel.hathitrust.org/cgi/imgsrv/download/pdf?id=njp.32101073965608;seq=[1:24]

Note 1: Firefox cannot display the PDF pages downloaded from HathiTrust correctly and shows only blank pages!
Note 2: PDF downloads are limited to 15 pages every 5 minutes!
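
To get a feel for what the limit in Note 2 means for a whole book, here is a rough back-of-the-envelope calculation. It assumes the limit behaves like a fixed window (15 pages, then a 5-minute wait before the next 15), which is my reading of the note rather than documented behaviour; `pdf_wait_minutes` is a made-up name.

```shell
# Rough lower bound, in minutes, for the waiting involved in downloading
# a number of PDF pages at 15 pages per 5-minute window (figures from Note 2).
# ASSUMPTION: the limit is a fixed window; the real rule may differ.
pdf_wait_minutes() (
    pages=$1
    per_window=15
    window_minutes=5
    # Number of windows needed, rounded up.
    windows=$(( (pages + per_window - 1) / per_window ))
    # The first window starts immediately, so it adds no waiting.
    if [ "$windows" -gt 0 ]; then
        windows=$(( windows - 1 ))
    fi
    echo $(( windows * window_minutes ))
)

echo "A 300-page book needs at least $(pdf_wait_minutes 300) minutes of waiting."
```

A 300-page book needs 20 windows, so at least 19 waits of 5 minutes, or 95 minutes.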

The defaults for DownThemAll work fine. I like to set a custom subfolder for the project. Click “Download,” then “Batch Download.”

DownThemAll should begin working its magic. HathiTrust limits the number of images you can view in a given time period and returns a server error once you exceed it. So if you’re downloading a bunch of pages, you’ll want to modify the network preferences. Click on DownThemAll, “Preferences,” then “Network.” Here’s what I use.

Concurrent downloads: 1
Number of retries of downloads on temporary errors: 99
Retry every (in minutes): 2

You may need to step away for a couple of hours while DownThemAll gets through all the pages.

2. Merge raw pages, OCR them.

You should now have all of your images in a single directory on your computer. Using your command line, navigate to that directory.

cd ~/Downloads/example_book/

Use img2pdf since it’s always lossless. We need to sort the images correctly. They also come in a combination of jpg and png, in my experience. So this one-liner does the trick.

img2pdf --fit shrink --output out.pdf $(ls *.{jpg,png}|sort -V)
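
The `$(ls ...)` form above splits on whitespace, so it can trip over file names that contain spaces, and `ls` complains if one of the two extensions is absent. A more defensive variant might look like the sketch below. It assumes your `find`, `sort`, and `xargs` support NUL separators (`-print0`, `-z`, `-0`), which the GNU versions do, and the function names `sorted_pages` and `merge_pages` are made up for this example.

```shell
# Print the .jpg/.png files directly inside a directory, NUL-separated,
# in natural (version) order so that page 2 sorts before page 10.
sorted_pages() {
    find "$1" -maxdepth 1 -type f \( -name '*.jpg' -o -name '*.png' \) -print0 |
        sort -z -V
}

# Merge the sorted pages into one PDF with the same img2pdf options as above.
# For an extremely long file list, xargs may split the work into several
# img2pdf runs; a single book should fit in one.
merge_pages() {
    sorted_pages "$1" | xargs -0 img2pdf --fit shrink --output "$2"
}

# Usage, from inside the download directory:
#   merge_pages . out.pdf
```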

Now OCR this PDF with your preferred method. The simplest one is likely ocrmypdf. A good alternative is FineReader. This may take a while.

ocrmypdf out.pdf out_ocr.pdf

You’ve now got a nice pdf of your book, DRM free.

sources:
* http://matthewdelhey.com/inc/hathi.html
* https://en.wikisource.org/wiki/User:Mukkakukaku/Guide