Best OCR for high quality imaging.

Posted: 07 Sep 2016, 05:53
by Kaytk
I think this is the right section to post this. I have a collection of rare books that was passed down to me by my uncle. The books are very old. I recently read about the benefits of document imaging ... nformation . So, I decided to store them in a digital format. First, I need a good, high-resolution OCR to scan the books. I want each page to appear without any underlying shadows. I'd appreciate suggestions for a good OCR device.


Posted: 07 Sep 2016, 11:03
by duerig
I wanted to resolve a bit of confusion and point you in the right direction. In general, book scanning works in three phases:

(1) Capturing photographs of book pages.
(2) Cleaning up these photographs, cropping, 'post-processing' them.
(3) Converting the resulting photographs into a format suitable for easy reading (epub, pdf, djvu, etc.).

For the first step, there are a lot of options for an individual. The DIY solutions all involve some kind of physical framework for holding the book gently, keeping the pages flat, and lighting the pages evenly. Cameras (usually cheap point-and-shoot models) are mounted on the rig, along with some kind of controller that triggers them.

Take a look at the bottom of the front page here to see a schematic diagram of these:

For a list of potential options of scanner rigs that you could build yourself or make from a kit, see here:

We also have a gallery of many of the scanners people have built themselves and you can click on the picture to go to the proper forum thread. Take a look here:

As you can see, there are a ton of options. They range in complexity from cut up cardboard boxes to aluminum monstrosities. Generally, the more elaborate and/or expensive builds get better quality scans.

Once you have photographs of the pages, there are a number of things you need to do to clean them up. Usually you want to crop out any part of the photo that is not the page, rotate the photos so they all face the same direction, and possibly do a number of other 'cleanup' style things. There is free software that performs these operations on a whole batch of photos at once. You could do them one by one yourself with Photoshop and the like, but that would be very time consuming.
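To make the cropping step a little more concrete, here is a rough sketch of the idea in Python. It is not how any particular post-processing tool works; it just assumes a light page photographed against a dark background, represents each photo as rows of grayscale values (0 = black, 255 = white), and uses a made-up brightness threshold of 128. The function names (page_bounds, crop, crop_batch) are my own for illustration.

```python
def page_bounds(pixels, threshold=128):
    """Find the bounding box of the bright (page) area in a grayscale photo.

    pixels is a list of rows, each row a list of ints from 0 to 255.
    Returns (top, left, bottom, right) with bottom/right exclusive,
    or None if no pixel is brighter than the threshold.
    Assumption: the page is lighter than the background around it.
    """
    rows = [y for y, row in enumerate(pixels)
            if any(value > threshold for value in row)]
    if not rows:
        return None
    cols = [x for x in range(len(pixels[0]))
            if any(row[x] > threshold for row in pixels)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(pixels, bounds):
    """Cut the photo down to the given bounding box."""
    top, left, bottom, right = bounds
    return [row[left:right] for row in pixels[top:bottom]]


def crop_batch(photos, threshold=128):
    """Crop every photo in a batch; photos with no bright area are kept as-is."""
    result = []
    for pixels in photos:
        bounds = page_bounds(pixels, threshold)
        result.append(crop(pixels, bounds) if bounds else pixels)
    return result


# A tiny 4x5 "photo": dark background (10) around a bright page,
# with one dark ink pixel (40) inside the page.
photo = [
    [10, 10, 10, 10, 10],
    [10, 200, 220, 10, 10],
    [10, 210, 40, 10, 10],
    [10, 10, 10, 10, 10],
]
cropped = crop(photo, page_bounds(photo))
```

Real tools do much more than this (deskewing, splitting two-page spreads, fixing uneven lighting), but the batch idea is the same: work out the page area for each photo automatically instead of dragging a crop box by hand.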

Finally, you want them in a usable form to read easily on your tablet or computer. This is where OCR comes in. OCR is software that takes a scanned image and tries to figure out what the actual letters are. Some people use OCR to generate an 'epub', which lets you read the letters that the computer finds. Others want to read the cleaned up images (so no formatting is lost and you see the original page), but use OCR to provide a 'search' function that takes you to the right page photo. In this case, you will end up with a 'pdf' or 'djvu' file. There are both paid and free options for OCR software. Typically the paid options provide a slicker UI, and they may do a better job at actually recognizing the letters.
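If it helps to picture the 'search' use of OCR, here is a small Python sketch. It assumes you already have the recognized text for each page as a plain string (the sample page texts below are invented), and it builds a simple word-to-page lookup. Actual pdf and djvu files keep the recognized text alongside each page image rather than in a separate index like this, so treat build_index and find_pages as hypothetical names for illustration only.

```python
import re
from collections import defaultdict


def build_index(page_texts):
    """Map each word to the sorted list of 1-based page numbers it appears on.

    page_texts is a list of OCR output strings, one per page, in page order.
    """
    index = defaultdict(set)
    for page_number, text in enumerate(page_texts, start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(pages) for word, pages in index.items()}


def find_pages(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Invented OCR output for a three-page book.
ocr_pages = [
    "The whale surfaced at dawn.",
    "At dawn the crew rowed.",
    "No whale was seen.",
]
index = build_index(ocr_pages)
whale_pages = find_pages(index, "whale")
```

The reader still sees the original page photo; the recognized text is only used to decide which photo to jump to, so OCR mistakes affect search results but never the page you read.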

Hopefully this gives you a good overview of the whole process. Let me know if you have any particular questions about any step along the way. Best of luck with your project!

-Jonathon Duerig



Posted: 08 Sep 2016, 00:12
by Kaytk
Thanks a lot for explaining in so much detail.