DIY Book Scanner

Posted: **15 Apr 2015, 09:05**

After following the boards and doing about six months of research, I have just finished assembling an Archivist, installed CHDK on two cameras, and installed SpreadPi on a Raspberry Pi 2. It actually went faster than I expected, so now I'm unexpectedly at the software stage.

My first project with the scanner is to digitize a cookbook collection and run OCR to make the books searchable. Some of the older books are text only, but the newer ones often mix images and text. Some of the larger ones are 500+ pages (a couple are even 1000+ pages). Like many cookbooks, the text has an intro, then blocks of ingredients, followed by recipe steps. The newer ones also have inset boxes or margin printing. In other words, the text layout can be irregular.

I'd like to draw on the community experience to start with an appropriate workflow and filetype for this project.

1) I think my first fork is whether to use the PDF or DjVu file format. From what I've read, I can see pros and cons to each. PDF seems to produce more universal and searchable OCR, but the files can get large and take a long time to work with. DjVu seems to be great for storing this type of book efficiently, but may not be as good with OCR (?) and requires special software to open. What would some of you recommend as the right choice for a searchable cookbook database?

2) Once I've settled that, I'm interested in workflow recommendations. I have an older copy of Adobe Acrobat X on an academic license, but the just-released Acrobat DC is pretty accessible with the new subscription model ($14/mo vs. a $500 outlay...), so ClearScan becomes an option. Tesseract-based OCR seems to be doing great these days too, though. A workflow of Spreads -> ScanTailor -> Acrobat seemed to be the simplest, but I'm up for a more complicated workflow that gives better results.
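For comparing the Tesseract route against Acrobat, here is a minimal Python sketch that builds the command for turning each ScanTailor output page into a searchable PDF. It assumes the `tesseract` command-line tool is installed and on the PATH, that the pages are `.tif` files in one folder, and that the text is English; the function names and folder names are made up for illustration. Treat it as a starting point, not a tested workflow.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(image, output_dir, lang="eng"):
    """Build the tesseract command that writes <output_dir>/<image stem>.pdf."""
    image = Path(image)
    # tesseract takes an output *base* name and appends ".pdf" itself
    output_base = Path(output_dir) / image.stem
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_folder(input_dir, output_dir, lang="eng", dry_run=True):
    """Build one command per .tif page; run them only when dry_run is False."""
    commands = [
        tesseract_pdf_command(page, output_dir, lang)
        for page in sorted(Path(input_dir).glob("*.tif"))
    ]
    if not dry_run:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            # check=True stops the batch at the first page that fails
            subprocess.run(cmd, check=True)
    return commands
```

Calling `ocr_folder("out", "ocr")` with the default `dry_run=True` only returns the commands, so they can be inspected before anything is run. This gives one PDF per page; merging them into a single book would be a separate step.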

So.... thoughts and guidance? Thank you!

Posted: **15 Apr 2015, 11:51**

It is great to see you up and running!

If you have Acrobat, I think that Spreads -> ScanTailor -> Acrobat is probably the best workflow at the moment for a lot of people.

For your situation, I would recommend avoiding ST's mixed mode or binarization stuff. It looks like you will have a lot of pictures that you'd want to keep in color, so it would take a long time to go through each page and make sure that the pictures are tagged correctly in mixed mode. Instead, just leave everything as color. Then let Acrobat worry about compressing or sharpening the text.

Try it on one cookbook or even just a single chapter and get your process worked out on a small scale before trying it on a bigger scale. In my experience, it is common for the post-processing software workflow to take two or three times as long as the bookscanning itself if you let it.
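To put rough numbers on that rule of thumb, here is a small Python sketch. The seconds-per-page figure is purely an assumed input that you would measure on your own pilot chapter, and the function name is made up; only the two-to-three-times range comes from the paragraph above.

```python
def estimate_hours(pages, scan_seconds_per_page, post_low=2.0, post_high=3.0):
    """Return (scan_hours, post_hours_low, post_hours_high) for one book.

    post_low and post_high are the multipliers for post-processing time
    relative to scanning time (the two-to-three-times rule of thumb).
    """
    scan_hours = pages * scan_seconds_per_page / 3600
    return scan_hours, scan_hours * post_low, scan_hours * post_high


# Example: a 600-page book at an assumed 6 seconds per page
scan, post_min, post_max = estimate_hours(600, 6)
print(f"scanning: {scan:.1f} h, post-processing: {post_min:.1f} to {post_max:.1f} h")
```

Timing a single chapter first gives a real seconds-per-page value to plug in before committing to the 1000+ page books.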

One other trick: The Archivist has a little notch through the middle of the cradle. Part of the purpose of that notch is that you can stick little pieces of foam or other material in there to push up the spines of the books for ideal scanning. Given the variety of cookbook bindings, you might find it very important to put the right jig there for the right book.

Best of luck.

-D

Posted: **17 Apr 2015, 22:41**

The good thing about pdf, and Acrobat in particular, is that you will be able to index all of your cookbooks in the same index. Searching is very quick. Adobe Reader is able to use the index for searching as well.

I have mostly used a flatbed scanner on the books and magazines I have copied, then taken the pdf or jpg file straight into OmniPage for OCR. I did try taking photos of some books on a trip, with a camera on a tripod pointing downwards and two pages per shot. Not a good idea. After splitting the shots into single pages and cropping them with Acrobat, I then used OmniPage.
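The splitting step can also be scripted rather than done by hand. Here is a minimal Python sketch that only works out the crop boxes for the left and right pages of a two-page shot. It assumes the gutter is roughly vertical and near the middle of the image unless you pass its position; the function name and the `overlap` parameter are made up for illustration. The boxes use the (left, top, right, bottom) pixel order, which is the same order Pillow's `Image.crop` expects.

```python
def split_spread(width, height, gutter_x=None, overlap=0):
    """Return (left_box, right_box) crop boxes for a two-page photo.

    gutter_x is the pixel column of the book's gutter (default: image centre).
    overlap keeps a few extra pixels on each side of the gutter in both boxes.
    """
    if gutter_x is None:
        gutter_x = width // 2
    if not 0 < gutter_x < width:
        raise ValueError("gutter_x must fall inside the image")
    left_box = (0, 0, min(width, gutter_x + overlap), height)
    right_box = (max(0, gutter_x - overlap), 0, width, height)
    return left_box, right_box


# Example: a 4000x3000 shot with the gutter slightly left of centre
left, right = split_spread(4000, 3000, gutter_x=1900, overlap=50)
```

This does nothing about page curl or skew from a hand-positioned tripod shot, so it does not fix the underlying problem with that setup.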

I have also OCRed a pdf file from Internet Archive. The pdf was an image of the original photos, so I used YASW on it before using OmniPage, as I had seen David Landin's video in the past. I have never used ScanTailor.
I also use InFix as a pdf editor as I find it easy to use. My Acrobat is v9; I see v11 has some editing built in.

Posted: **01 May 2015, 09:26**

Thanks duerig and BruceG. I had to do some work travel so I'm just picking this back up and am delayed by a struggle to get SpreadPi software up and running on my RPi 2 - so nothing to report just yet. I'm going to start with the Spreads -> ScanTailor -> Acrobat workflow and will report back.

Workflow & Filetype for a Cookbook collection
