Pdfbeads + ABBYY Finereader

Post by Enki » 07 Jan 2012, 15:46

I'm trying to add an OCR layer to a PDF generated by pdfbeads. I've got the best results with ABBYY FineReader, but I have a problem saving the recognized text into the original PDF. It seems the only way is to create an entirely new and ridiculously oversized file directly in FineReader (this new file is 4-5 times larger than the original pdfbeads output!).
I need a way either to save the OCR layer in the original file (without touching the data generated by pdfbeads) or to export the OCR'd text to an hOCR file. Another possibility is to extract the OCR layer from the PDF created by FineReader and save it as hOCR. Unfortunately, I haven't found any program capable of this; there's hocr2pdf, but it seems to work in only one direction (hOCR to PDF).

Do you have any ideas?
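
For illustration, the second route (taking the text layer out of the FineReader PDF and saving it as hOCR) mostly comes down to a coordinate conversion. The sketch below is only that, under stated assumptions: it assumes some other tool has already pulled each word and its bounding box out of the PDF (that extraction step is not shown), and the function names, the `(text, box)` input shape, and the 300 dpi default are all hypothetical. PDF boxes are in points (1/72 inch) with the origin at the bottom-left, while hOCR `bbox` values are pixels with the origin at the top-left, so each box has to be scaled and flipped. Whether pdfbeads accepts markup this minimal is untested.

```python
from html import escape


def pdf_box_to_hocr(box, page_height_pt, dpi):
    """Convert a PDF box (x0, y0, x1, y1) in points, origin bottom-left,
    to an hOCR bbox (left, top, right, bottom) in pixels, origin top-left."""
    x0, y0, x1, y1 = box
    left = round(x0 * dpi / 72)
    right = round(x1 * dpi / 72)
    # Flip the vertical axis: the PDF top edge (y1) becomes the hOCR top.
    top = round((page_height_pt - y1) * dpi / 72)
    bottom = round((page_height_pt - y0) * dpi / 72)
    return left, top, right, bottom


def words_to_hocr(words, page_width_pt, page_height_pt, dpi=300):
    """Build a minimal single-page hOCR document.

    `words` is a hypothetical list of (text, (x0, y0, x1, y1)) tuples,
    assumed to come from an external PDF text extractor.
    """
    page_w = round(page_width_pt * dpi / 72)
    page_h = round(page_height_pt * dpi / 72)
    spans = []
    for i, (text, box) in enumerate(words, 1):
        l, t, r, b = pdf_box_to_hocr(box, page_height_pt, dpi)
        spans.append(
            f"<span class='ocrx_word' id='word_{i}' "
            f"title='bbox {l} {t} {r} {b}'>{escape(text)}</span>"
        )
    body = " ".join(spans)
    return (
        "<html><head>"
        "<meta name='ocr-capabilities' content='ocr_page ocrx_word'/>"
        "</head><body>"
        f"<div class='ocr_page' id='page_1' title='bbox 0 0 {page_w} {page_h}'>"
        f"{body}</div></body></html>"
    )


if __name__ == "__main__":
    # One word on a US Letter page (612 x 792 pt), scanned at 300 dpi.
    sample = [("Hello", (72, 72, 144, 144))]
    print(words_to_hocr(sample, 612, 792, dpi=300))
```

The dpi passed in would need to match the resolution of the page images fed to pdfbeads, otherwise the text layer would not line up with the scan.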

Post by wwang » 12 Jan 2012, 14:17

There are a few ways to shrink the PDF output in FineReader: apply MRC compression, use B&W mode, and lower the image resolution. Depending on the document, some files get much smaller, while others shrink very little.

If another output format is acceptable, FineReader can also export to DjVu or ePub, which is compatible with many ebook readers.

Post by Enki » 09 Feb 2012, 10:36

wwang wrote: There are a few ways to shrink the PDF output in FineReader: apply MRC compression, use B&W mode, and lower the image resolution. Depending on the document, some files get much smaller, while others shrink very little.
But pdfbeads gives much better results. It is probably the only software that splits each input page into two layers (text and graphics) and processes them with two different algorithms. When a page contains both text and graphics, Acrobat or FineReader applies standard JPG/JP2 compression to both layers. In such cases the text becomes blurred and the file size increases drastically.

After trying many options, I got the best results with Acrobat's ClearScan method, but there's a big problem with errors: there's no way to correct them. It seems rather ironic: OCR without error correction. Only Adobe could do this. :D