DIY Book Scanner

Posted: **13 Mar 2014, 21:58**

I continue to mass-convert my scanned books to djvu. These are not the ugly low-quality scans that I have kind of made a hobby out of dressing up. These are all 600dpi sheet-fed scans of books which have had their spines chopped off. Recall that they were Adobe ClearScan files before, and I was generally pleased with their small size. Converting these scans to djvu has proven DjVu's superiority to me once and for all, though.

Many of the books are half the size once converted with the process outlined here (minidjvu for the black-and-white portions, c44 for any background images at 1/4 size, manually mashed together with djvumake and scripts). I have even seen some that are 1/8 the size of the PDF. Others come out about the same size, and two or three books have been slightly larger. I can't predict which outcome I'll get before the conversion, as there doesn't appear to be any rhyme or reason to it. I have to imagine that sometimes ClearScan builds up lots of redundant font images, which DjVu/JB2 manages to share between pages.
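For readers who want to see the shape of that process, here is a minimal sketch that only builds the three command lines for a single page; it does not run them. The file names, the `page_commands` helper, and the reading of "1/4 size" as 1/4 linear resolution (600 dpi text over a 150 dpi background) are assumptions for illustration, not the actual scripts used above. It also skips the shared-dictionary handling that makes multi-page minidjvu output worthwhile, and the flags should be checked against the minidjvu, c44, and djvumake man pages before relying on them.

```python
import shlex


def page_commands(stem, width, height, dpi=600, bg_scale=4):
    """Return argv lists for one page: mask, background, then assembly.

    Assumes (hypothetically) that <stem>.text.pbm is the bitonal text layer
    at full resolution and <stem>.bg.ppm is the background image already
    downsampled by bg_scale in each dimension.
    """
    bg_dpi = dpi // bg_scale
    mask = f"{stem}.mask.djvu"
    background = f"{stem}.bg.djvu"
    return [
        # Bitonal text layer -> JB2-coded mask.
        ["minidjvu", "-d", str(dpi), f"{stem}.text.pbm", mask],
        # Low-resolution background -> IW44 wavelet image.
        ["c44", "-dpi", str(bg_dpi), f"{stem}.bg.ppm", background],
        # Assemble one compound page from the two layers.
        ["djvumake", f"{stem}.djvu",
         f"INFO={width},{height},{dpi}",
         f"Sjbz={mask}",
         f"BG44={background}"],
    ]


if __name__ == "__main__":
    # A 5.5 x 8.5 inch page at 600 dpi is 3300 x 5100 pixels.
    for argv in page_commands("page0001", 3300, 5100):
        print(shlex.join(argv))
```

Printing the commands instead of executing them makes it easy to eyeball each step before pointing it at a whole book.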

Example: Plato's Republic, 397 pages, all black and white except the cover image.

PDF: 14MB ClearScan
DJVU with OCR: 6MB
DJVU without OCR: 4MB

Example: Suzuki, Zen and Japanese Culture, 577 pages, color cover and several grayscale photo pages.

PDF: 19MB ClearScan
DJVU with OCR: 11MB
DJVU without OCR: 8MB

I have mostly been avoiding the books with lots of colored text, as those are a lot of manual labor with free tools (minidjvu only understands bitonal images, but csepdjvu won't share image dictionaries, so... my ideals force me to use minidjvu and create foreground color masks by hand). Luckily, the majority of books mix plain bitonal text with color images, and those are easy to split out with scripts.

I've generally been dropping the OCR data in the conversion, because when I look at it... it's not that great. ClearScan OCR has tons of mistakes in it, and I don't search in my PDFs that often in the first place. For large books it will save 2 or 3 megabytes to leave it out, so I've been doing that. In the numbers above I gave the with-OCR sizes to make the comparison to the OCR'ed PDFs fair.

[EDIT: I should say all the ClearScan files were created with Acrobat X, and saved as "Reduced Size PDF" afterward. Don't know if Acrobat XI does a better job or not.]

Posted: **13 Mar 2014, 23:09**

Another interesting one from tonight. O'Connor, Understanding Jung, 162 pages of black-and-white text. ClearScan PDF: 3.1MB ... DJVU with OCR: 1.4MB ... DJVU without OCR: 800KB. All 600dpi.
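To put the three examples side by side, here is a small sketch that turns the sizes quoted in these posts into ratios. The numbers are the ones given above (with 800KB taken as 0.8MB); the `summarize` helper and the short titles are just illustrative names.

```python
# (short title, ClearScan PDF MB, DjVu with OCR MB, DjVu without OCR MB)
BOOKS = [
    ("Republic", 14.0, 6.0, 4.0),
    ("Zen and Japanese Culture", 19.0, 11.0, 8.0),
    ("Understanding Jung", 3.1, 1.4, 0.8),
]


def summarize(pdf_mb, with_ocr_mb, without_ocr_mb):
    """Return DjVu size as a fraction of the PDF, plus the OCR overhead in MB."""
    return {
        "ratio_with_ocr": with_ocr_mb / pdf_mb,
        "ratio_without_ocr": without_ocr_mb / pdf_mb,
        "ocr_overhead_mb": with_ocr_mb - without_ocr_mb,
    }


if __name__ == "__main__":
    for title, pdf_mb, with_ocr_mb, without_ocr_mb in BOOKS:
        s = summarize(pdf_mb, with_ocr_mb, without_ocr_mb)
        print(
            f"{title}: {s['ratio_with_ocr']:.0%} of the PDF with OCR, "
            f"{s['ratio_without_ocr']:.0%} without, "
            f"OCR layer about {s['ocr_overhead_mb']:.1f} MB"
        )
```

The with-OCR ratio is the fair comparison, since the ClearScan PDFs carry their OCR text too.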

So was it a waste of time scanning my library to PDF if I've decided to slog through the djvu creation process after all? I don't think so, because Adobe's ClearScan process does a lot of the tedious part for me, splitting the pages into foreground and background images. It also does some amount of lossy matching on the characters like JBIG2 does, so it could very well be improving the results from minidjvu (or at least making its job easier). So I think the Adobe "pre-processing step" makes up a bit for the fact that I'm not using a similar commercial-quality MRC tool on the DjVu side.

What really blows my mind is when I take an electronic document PDF--not a scan--and the DjVu turns out smaller. There's no reason that should be possible, ever. Doesn't say much for the PDF generators that people use.

Posted: **14 Mar 2014, 14:04**

Today I'm back to my hobby of cleaning up bad scans I find on the net. Our subject for today is a computer game manual for Ultima 4, which I bought from GOG.com. It is a poor scan, and djvudigital just wants to make every page all-background. So this makes for a good example of how I go about extracting the text from a complicated image. Here's a tiny version of an example page:

[Image: origsmall.png]

As you can see, the background is busy, and the contrast around the inking isn't great:

[Image: Know_CU1.PNG]

Still, I've seen much worse. Every document is different, but the main plan of attack I start with is to stretch the contrast in different channels of different colorspaces, and look for the one that seems most "split."
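A rough sketch of that channel search, offered as an illustration and not as the workflow used in this thread: it pulls R, G, B, H, S, V and luma (Y) out of a list of RGB pixels, stretches each channel to the full 0..1 range, and scores how cleanly it falls into two groups. The score is Otsu's between-class variance, which is my stand-in for judging "split" by eye; the function names and the pure-Python pixel lists are likewise assumptions to keep the example self-contained.

```python
import colorsys


def channels(pixels):
    """Map channel name -> list of 0..1 values for (r, g, b) pixels in 0..255."""
    out = {name: [] for name in "RGBHSVY"}
    for r, g, b in pixels:
        rf, gf, bf = r / 255, g / 255, b / 255
        h, s, v = colorsys.rgb_to_hsv(rf, gf, bf)
        y, _, _ = colorsys.rgb_to_yiq(rf, gf, bf)
        for name, value in zip("RGBHSVY", (rf, gf, bf, h, s, v, y)):
            out[name].append(value)
    return out


def stretch(values):
    """Linearly stretch values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def split_score(values, bins=64):
    """Best between-class variance over all thresholds (0 = flat, 0.25 = max)."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * count for i, count in enumerate(hist))
    best, below, sum_below = 0.0, 0, 0
    for i, count in enumerate(hist[:-1]):  # threshold sits just above bin i
        below += count
        sum_below += i * count
        above = total - below
        if below == 0 or above == 0:
            continue
        mean_below = sum_below / below
        mean_above = (sum_all - sum_below) / above
        variance = (below / total) * (above / total) * (mean_below - mean_above) ** 2
        best = max(best, variance)
    return best / (bins - 1) ** 2  # normalise away the bin count


def rank_channels(pixels):
    """Return {channel name: split score} after contrast stretching."""
    return {name: split_score(stretch(vals)) for name, vals in channels(pixels).items()}


if __name__ == "__main__":
    # Made-up sample: tan background at varying brightness, neutral gray ink.
    sample = [(200, 180, 140), (150, 135, 105), (230, 207, 161), (120, 108, 84),
              (60, 60, 60), (90, 90, 90), (40, 40, 40)]
    for name, score in sorted(rank_channels(sample).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.3f}")
```

On a real page you would feed in the actual pixel data and still look at the winning channel yourself; a number like this only narrows down which channels are worth a closer look.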

Learning to Create Tiny DJVU files