DIY scanner and Scan Tailor processed books on Google Books

Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

DIY scanner and Scan Tailor processed books on Google Books

Post by Misty »

I wanted to let people know that my institution has a set of DIY scanned and Scan Tailor processed books up on Google Books now. They provide a good example of Scan Tailor book quality (though both books are using fonts that didn't shrink especially well - that's not a Scan Tailor problem). They both have illustrations in addition to text, and binarized lineart.

Oakland Township: Two Hundred Years by Stuart A. Rammage (also available: volumes 2, 3, 4.1, 4.2, 5.1, and 5.2)
Herons and Cobblestones: A History of Bethel and the Five Oaks Area of Brantford Township, County of Brant by the Grand River Heritage Mines Society

(Would people prefer this in the Scan Tailor forum?)
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by daniel_reetz »

Holy kamole, these look FANTASTIC!

Are these the first books on Google Books from members of this forum? I know people have contributed to the Internet Archive and to BookShare, but Google Books?

This is a day-maker, thanks Misty! (and I think this is a fine place to post these results).
litchie
Posts: 18
Joined: 04 Mar 2014, 00:53

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by litchie »

Very impressive!

Especially love the high quality pictures. Are they manually processed?
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Misty »

The photos were shot on a Canon PowerShot G10 using its raw mode. I batched one custom colour setting across all of the images. When I took them into Scan Tailor, I did have to manually clean up its picture selection for many pages.

While this thread is alive again - these books will soon be available as free PDF downloads, including the illustrations.
JDSimmons

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by JDSimmons »

How does one go about getting a book on Google Books? I've donated a few to the Internet Archive and I wonder if Google Books would be a good alternative to IA?
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by rob »

I tried to look at the scans on Google Books, but they are all coming up as limited preview?
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Misty »

JD: You can only upload books you own copyright to or that you have permission from the copyright holder to upload. I don't think Google takes PD book submissions.

Rob: Yes, they're limited preview on Google Books. The Library doesn't own the copyright to those books, so while we received permission to display them in their entirety on our own website (and in some cases to make them downloadable), we don't always have permission to put them up for full display on other sites.
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Misty »

I have a couple of downloadable PDF versions available now - the rest of the set will be available later on. Here are the links:

Herons and Cobblestones, 11MB download
Oakland Township: Two Hundred Years, Volume I, 34MB download

The files are a bit bigger than I'd like, but they're within reason and have full OCR text.

I produced these using a script I wrote that yokes together a few different tools to produce layered PDFs. That helps keep the file sizes down by using an efficient bitonal compression on the text while downscaling the illustrations to 100 DPI and compressing them with medium-quality JPEG. I've received permission from my employer to release the script under the GPL, so I'll make that available soon; I just have a few improvements to make first. The steps it performs are:

- Separate Scan Tailor images into separate TIFF files for bitonal text and images, using ST Separator (Note: Currently this isn't a part of the script, but I'm hoping to be able to integrate that functionality into it)
- Make white background in illustration files transparent, and convert to PDF with medium quality JPEG
- Encode to DJVU and back to TIFF, for symbol merging
- Encode text to PDF (currently using Group4 - I'm not sure if there's an open-source solution for JBIG2 encoding)
- Merge text and illustration into a single page with two layers
- Merge all pages into a single document
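
Two of the steps above (downscaling illustrations to 100 DPI and making the white background transparent) can be illustrated without any imaging library. This is only a rough sketch, not the script described in this post: the 600 DPI source resolution in the usage note, the whiteness threshold of 250, and the function names `downscaled_size` and `white_to_transparent` are all assumptions made for the example.

```python
def downscaled_size(width_px, height_px, source_dpi, target_dpi=100):
    """Return pixel dimensions after resampling from source_dpi to target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    # Multiply before dividing to limit floating-point error,
    # and never round a dimension down to zero pixels.
    new_width = max(1, round(width_px * target_dpi / source_dpi))
    new_height = max(1, round(height_px * target_dpi / source_dpi))
    return new_width, new_height


def white_to_transparent(pixels, threshold=250):
    """Convert RGB tuples to RGBA, giving near-white pixels zero alpha.

    The threshold is an assumed value; real scans may need tuning.
    """
    result = []
    for r, g, b in pixels:
        is_white = r >= threshold and g >= threshold and b >= threshold
        result.append((r, g, b, 0 if is_white else 255))
    return result
```

For example, an 8 x 11 inch page output at an assumed 600 DPI is 4800 x 6600 pixels, and `downscaled_size(4800, 6600, 600)` gives the 100 DPI size of that illustration layer. A real implementation would hand the resampling and alpha handling to an imaging tool rather than loop over pixels in Python.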

I currently don't have an open-source solution to provide OCR, unfortunately. I've been using Acrobat for that. If there's an acceptable open-source way to do that, I'd love to integrate that into the script too.
Tim

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Tim »

Misty wrote:I currently don't have an open-source solution to provide OCR, unfortunately. I've been using Acrobat for that. If there's an acceptable open-source way to do that, I'd love to integrate that into the script too.
There are three decent (depending on your needs and skills) options for open source OCR right now: Tesseract, Ocropus, and Cuneiform.
Tesseract - http://code.google.com/p/tesseract-ocr/ - the development version has to be built from source in order to get page layout analysis. See http://code.google.com/p/tesseract-ocr/wiki/ReadMe
Ocropus - http://code.google.com/p/ocropus/ - very much in development; I'm not sure there is a Windows version
Cuneiform - https://launchpad.net/cuneiform-linux for the Linux version and http://www.cuneiform.ru/eng/ for the Windows version. When I tried the Windows version in Wine, it seemed to be all in Russian, unlike the Linux version, which is command line only.

The first two take significant configuration and tweaking, but people who put in that work report very good results, and really good results require training. Cuneiform seems just OK, but it works without all the configuring and tweaking. The Windows Russian version is the only one of the above that has the familiar OCR GUI right now, though there is a kind of manual GUI for Tesseract on Linux - http://sourceforge.net/projects/gimagereader/ - and one for Windows - http://www.paperfile.net/ . I'm not aware of any GUI efforts that use the latest Ocropus.

In the end I still use Omnipage because of its accuracy and ease, but I hope the above can progress. Many people use them already. I haven't used either of the Tesseract GUI tools; I just linked them for completeness. Open source OCR seems to be progressing. Tesseract had been stalled for a while, but is developing again. Ocropus was developing fast, but seems to have slowed while they are working on the Decapod project.
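
For anyone who wants to script Tesseract rather than use a GUI, here is a minimal sketch of a batch wrapper. It assumes a newer Tesseract release than the builds discussed in this thread, one that ships the `pdf` output config (which writes a PDF with a hidden text layer), and it assumes the `tesseract` binary is on the PATH. The function names are made up for the example.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, lang="eng"):
    """Build the command that writes <image stem>.pdf next to the image."""
    image = Path(image_path)
    # Tesseract appends ".pdf" to the output base itself.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_pages(image_paths, lang="eng"):
    """Run Tesseract on each page image; raises CalledProcessError on failure."""
    for path in image_paths:
        subprocess.run(tesseract_pdf_command(path, lang), check=True)
```

Calling `ocr_pages(["page001.tif", "page002.tif"])` would produce one searchable PDF per page. Whether the accuracy is acceptable on Scan Tailor output is something to test on your own material.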
dingodog
Posts: 110
Joined: 22 Jul 2010, 18:19
Number of books owned: 1000
Country: on the net
Location: on the net

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by dingodog »

Misty wrote:- Encode text to PDF (currently using Group4 - I'm not sure if there's an open-source solution for JBIG2 encoding)
Dear Misty

Oh, an open-source solution for JBIG2 encoding exists!

It's a project named

*jbig2enc*
- http://github.com/agl/jbig2enc

A prebuilt binary version (for both Windows and Linux) is available here; it needs Python:
- http://github.com/agl/jbig2enc

syntax:

jbig2 -s -p -v *.tiff (if you are using TIFFs; otherwise use the extension of your images)

then use the

*pdf.py* script I attach (Python is needed)

syntax:

pdf.py output > file.pdf

this wraps the jbig2 output in a PDF container

Note that you can combine these two tasks (in a Linux shell script):

jbig2 -s -p -v *.tiff ; pdf.py output > file.pdf

JBIG2 (sometimes mixed with JPEG 2000 for color images) is the standard compression used by Google Books

I usually

- Scan
- Process the images with Scan Tailor (B/W output)
- Then process them again with jbig2 to get a higher compression ratio

Often the resulting compression ratio is better than DjVu's
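
The two commands above can also be driven from Python, which stops if the encoder fails instead of carrying on to the PDF step. This is a sketch under a few assumptions: `jbig2` is on the PATH, `pdf.py` sits in the working directory and runs under the same Python interpreter, and the encoder writes its files with the default `output` basename. The function names are hypothetical.

```python
import glob
import subprocess
import sys


def jbig2_command(tiff_paths):
    """Build the jbig2enc call: -s symbol coding, -p PDF-ready output, -v verbose."""
    if not tiff_paths:
        raise ValueError("no input images given")
    return ["jbig2", "-s", "-p", "-v", *tiff_paths]


def build_pdf(pattern="*.tiff", pdf_path="file.pdf", pdf_script="pdf.py"):
    """Encode all matching pages, then wrap the 'output' files in a PDF."""
    # subprocess does not expand wildcards, so expand the pattern here.
    pages = sorted(glob.glob(pattern))
    subprocess.run(jbig2_command(pages), check=True)
    with open(pdf_path, "wb") as out:
        # Equivalent of `pdf.py output > file.pdf` in the shell.
        subprocess.run([sys.executable, pdf_script, "output"], stdout=out, check=True)
```

Sorting the file list keeps the pages in name order, so zero-padded file names (page001.tiff, page002.tiff, ...) matter here just as they do with the shell wildcard.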
Attachments
pdf.py.zip
(1.76 KiB) Downloaded 1768 times