OCR for text-only book copy


Post by Canadane » 11 Oct 2011, 21:16

Hey guys. I have finally built my first scanner! I slightly modified the new standard build with a single camera mounted directly above both pages. Thanks in large part to this tutorial, I figured out a batch process to create a nice semi-searchable PDF, but I want to go one step further and create a clean text document that can be used, for instance, on a Kindle. The page size that I'm scanning is just too big to view on such a device in PDF format (I think - if not please tell me!!!)

I need help figuring out if this is possible. The wiki (and many forum posts) list lots of OCR software, but my copy of Acrobat X has built-in OCR tools - are others better for my purpose? What has been your experience with conversion to a text-only format?

Sample image from my scan:

Thanks a million!

Re: OCR for text-only book copy

Post by aguncan » 12 Oct 2011, 03:37

I have faced the same dilemma. I have used two OCR programs, ABBYY FineReader and Adobe Acrobat X (with built-in OCR, as you said), for my scanned books. Both programs give good results in English and in my language, Turkish. OCR quality depends on several factors, especially the quality of the images you get from the camera: the better the images, the more accurate the OCR. In my opinion, the built-in OCR of Adobe Acrobat X is enough; you do not need to install an additional program. You do need to calibrate your camera (try different settings, different lighting conditions, etc.) and shoot test images under each set of conditions. Here is my step-by-step protocol for finding the optimal OCR output:
1) Shoot images under the different conditions mentioned above.
2) Process them with Scan Tailor (my preference, because of its ease of use and user-friendly interface) and save the outputs. Different output resolutions in Scan Tailor can affect OCR quality.
3) Combine the outputs with Adobe Acrobat X.
4) Run the built-in OCR (Adobe Acrobat X).
5) Copy the text and paste it into any word processor.
6) Decide which settings give the best results.
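
For the last step, comparing the pasted text from each test run by eye can be slow. A rough sketch of one way to rank the runs automatically, assuming you have a plain word list for your language: count what fraction of the recognised words appear in that list. The names `word_hit_rate`, `rank_outputs`, and the sample texts are hypothetical, and this only approximates OCR accuracy (proper nouns and rare words count as misses).

```python
import re


def word_hit_rate(text, known_words):
    """Fraction of alphabetic tokens in `text` that appear in `known_words`."""
    # [^\W\d_]+ matches runs of letters only, including non-ASCII letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


def rank_outputs(outputs, known_words):
    """Return the run names from `outputs` (name -> OCR text), best first."""
    return sorted(
        outputs,
        key=lambda name: word_hit_rate(outputs[name], known_words),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical word list and OCR results from two test runs.
    words = {"the", "quick", "brown", "fox"}
    runs = {
        "run_a": "The quick brown fox",
        "run_b": "Tbe qu1ck brown f0x",
    }
    for name in rank_outputs(runs, words):
        print(name, round(word_hit_rate(runs[name], words), 2))
```

The run with the highest hit rate is a reasonable first candidate, but it is still worth proofreading a page or two before settling on the settings.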


Re: OCR for text-only book copy

Post by Canadane » 16 Oct 2011, 21:56

Thanks. Do you think there is room to improve for the sample I posted?

Re: OCR for text-only book copy

Post by daniel_reetz » 17 Oct 2011, 12:28

Can you post an image straight from the camera?

In general, yes, it is possible to get a better image for post-processing, but it's actually not my forte - many people here are better at the postprocessing stuff than me, and understand the software better. If you're just prepping the text for OCR, consider using Scan Tailor's black and white mode to get things very clean. What you've posted is still a color image and it could be better if it were binarized.
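
To illustrate what binarizing means here: every pixel becomes either pure black or pure white, based on a brightness threshold. The sketch below picks the threshold with Otsu's method, a common global thresholding technique. It is only an illustration on a flat list of 8-bit grey values; it is not necessarily what Scan Tailor does internally, and the function names are made up for this example.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for a sequence of 8-bit grey values."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate threshold
    sum_bg = 0     # sum of grey values of those pixels

    for level in range(256):
        weight_bg += hist[level]
        sum_bg += level * hist[level]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); larger is better.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels):
    """Map dark pixels (ink) to 0 and light pixels (paper) to 255."""
    threshold = otsu_threshold(pixels)
    return [0 if value <= threshold else 255 for value in pixels]


if __name__ == "__main__":
    # Toy "page": 100 dark ink pixels and 200 light paper pixels.
    page = [20] * 50 + [30] * 50 + [220] * 100 + [230] * 100
    result = binarize(page)
    print(result.count(0), "black,", result.count(255), "white")
```

A single global threshold like this tends to struggle when the lighting is uneven across the page, which is one more reason to get even illumination at capture time.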

Re: OCR for text-only book copy

Post by Misty » 17 Oct 2011, 12:54

Absolutely seconded on Scan Tailor; it may get you better OCR results than the current book. The bleeding colour on the edges of your text is probably making OCR worse.

Unfortunately I'm not familiar with reflowable content for e-readers. That may be tricky for Kindle. However: Calibre is capable of converting from a PDF with OCR to Kindle-readable formats like MOBI. The formatting won't be ideal, but at least it'll work.

In addition to Acrobat, dedicated OCR software like ABBYY (commercial) or Tesseract and Cuneiform (free open source) would be another option. Those can output as plain text rather than keep the page image that Acrobat is giving you. That plain text can either be read on your e-reader directly or converted to something like MOBI. ABBYY costs money but will give you the best text recognition. Tesseract and Cuneiform are free but have worse accuracy; they will probably be close to the accuracy from Acrobat, or slightly worse.
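
As a rough sketch of the plain-text route with Tesseract: its command-line form is `tesseract <image> <output base> -l <language>`, and it writes the recognised text to `<output base>.txt`. The Python wrapper below just runs that command per page and joins the results. The file names and helper names are hypothetical, and it assumes Tesseract is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the tesseract command for one page image."""
    image = Path(image_path)
    # Tesseract appends ".txt" to the output base itself.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pages(image_paths, lang="eng"):
    """Run tesseract on each page in order and return the combined text."""
    texts = []
    for path in image_paths:
        subprocess.run(tesseract_command(path, lang), check=True)
        text_file = Path(path).with_suffix(".txt")
        texts.append(text_file.read_text(encoding="utf-8"))
    return "\n".join(texts)


if __name__ == "__main__":
    # Hypothetical file name; this only prints the command, it does not run it.
    print(" ".join(tesseract_command("page_001.tif")))
```

The combined text can then be saved as a single .txt file and either copied to the e-reader or fed to Calibre for conversion.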

Re: OCR for text-only book copy

Post by Tim » 18 Oct 2011, 10:43

I also think you may be able to get better images for OCR. Make sure you're getting at least 300 dpi in both dimensions. If not, try taking a picture of one page at a time. It seems your light may be uneven, since the letters are bolder on one side, or that could be because your camera is not parallel to the page you are imaging. Hard to tell without a raw picture directly from the camera. Also, more light, evenly dispersed, just about always helps. After that I think Acrobat X should do your OCR fine. I don't know if its workflow is as easy as a standalone OCR program, though.
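
To check the 300 dpi figure for a camera setup, divide the pixel count along each side of the frame by the physical size (in inches) of the area that side covers. A small sketch of that arithmetic follows; the camera resolution and page sizes are made-up example numbers, not measurements from this thread.

```python
def effective_dpi(pixels_wide, pixels_high, width_in, height_in):
    """Return (horizontal_dpi, vertical_dpi) for a frame covering a given area."""
    if width_in <= 0 or height_in <= 0:
        raise ValueError("physical dimensions must be positive")
    return pixels_wide / width_in, pixels_high / height_in


def meets_target(pixels_wide, pixels_high, width_in, height_in, target=300):
    """True if both dimensions reach the target dpi."""
    horizontal, vertical = effective_dpi(
        pixels_wide, pixels_high, width_in, height_in
    )
    return horizontal >= target and vertical >= target


if __name__ == "__main__":
    # Hypothetical 12-megapixel camera (4000 x 3000 pixels).
    # Two-page spread, 14 in wide by 10.5 in tall, shot in landscape:
    print(effective_dpi(4000, 3000, 14, 10.5), meets_target(4000, 3000, 14, 10.5))
    # Single page, 7 in wide by 10.5 in tall, camera rotated to portrait:
    print(effective_dpi(3000, 4000, 7, 10.5), meets_target(3000, 4000, 7, 10.5))
```

With these example numbers the two-page shot falls a little short of 300 dpi, while the single-page shot clears it comfortably, which is why shooting one page at a time can help.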

Re: OCR for text-only book copy

Post by stearn » 31 May 2012, 19:03

FineReader works best at 300dpi - tests I carried out some time ago using various flatbed resolutions confirmed this - and at around £100 I think it is the best value out-of-the-box solution. I did quite a bit of work using another of their products - FlexiCapturePro - as structured text extraction is possible, and I was looking to create an automatically populated database of information - the project ended up as http://www.radiotimesarchive.com and straightforward text-embedded PDFs. Talking to the UK ABBYY team, I believe that colour information is also taken into consideration with FCP, but I am not sure if this extends to FineReader, which is a slightly different engine IIRC.

Acrobat is fine for most things if you already have it, but I certainly wouldn't recommend buying it just for OCR as it is hugely expensive in comparison with FR11, and has some strange quirks on some settings. I have just upgraded to Acrobat X and the batch processing is superb but, as someone pointed out on another thread here and on some other forums, it seems to introduce some weird spac ing pro blems which can cause word searching to fail. I am just testing FineReader 11 Corporate edition to see if the batch processing on this does all I need, although I believe only Recognition Server (serious money) will output to multiple file types at the same time, so if you want text embedded PDFs, XML and plain text, you will have to run jobs several times using FR. ABBYY FotoReader was a separate programme at one point, but I think is now bundled within FR and I know that early experiments a few years ago with that and phone camera quick point and click images were impressive.

The bottom line with any OCR software is the better the original image, the better the results. One thing to remember is that if you are scanning to a compressed format, any processing is likely to introduce additional artefacts which could cause different problems. All of my scanning (via flatbeds) is done at 300dpi and usually to TIFF.

