
How should I create OCR text from existing DJVU image-only file and make a searchable text layer in the DJVU file?



Post by johnwmcc » 12 Sep 2016, 18:19

Thank you for your advice, and your trial conversions.

I didn't scan the book myself; I was given the DJVU image file by someone who had already scanned it. Like you, I now think it was probably scanned at 300 dpi, but other apps (particularly DJView and Mac Preview) report it as 72 dpi, which makes the pages too big for OmniPage 15. I tried reducing the image size by 50%, but in retrospect I think I should just have changed the dpi setting. I assumed (wrongly) that OmniPage was rejecting my TIF input because it had too many pixels, but subsequent checking suggests the real problem was that the calculated page size in inches (pixel dimensions divided by dpi) was too big.
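To illustrate why the dpi setting matters more than the pixel count here, this is a rough sketch of the arithmetic. The pixel dimensions are hypothetical (I have not measured the actual pages), and `page_size_inches` is just an illustrative helper, not part of any OCR tool.

```python
def page_size_inches(width_px, height_px, dpi):
    """Return the (width, height) in inches implied by a pixel size and a dpi value."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return width_px / dpi, height_px / dpi


# Hypothetical page dimensions, NOT taken from the real scan:
# 2550 x 3300 pixels is what a letter-size page would be at 300 dpi.
WIDTH_PX, HEIGHT_PX = 2550, 3300

for dpi in (72, 300):
    w, h = page_size_inches(WIDTH_PX, HEIGHT_PX, dpi)
    print(f"{dpi:>3} dpi -> {w:.2f} x {h:.2f} in")
```

With these assumed numbers, the same pixels read as roughly 35 x 46 inches at 72 dpi but 8.5 x 11 inches at 300 dpi, so correcting the dpi value should shrink the reported page size without throwing away any image detail, whereas halving the pixel count loses resolution that the OCR could have used.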

I'm using this sample chapter to get quicker results from each trial stage of processing, but once I have a working process I would ideally want a workflow that handles all 636 pages, or at least the 600 or so pages of the main chapters, in one go.

I tried the OCR built into Infix, with partial success. It did fairly well on the text, but wrongly interpreted quite a lot of the 'knots' as scrambled, meaningless text. Going through it page by page, I was able to delete the extra text boxes, but I wasn't able to respecify the boundaries of the real text to stop it running OCR on the wrong parts of the page. Unfortunately, it also kept crashing intermittently after some of my edits. So again, rather frustrating: so near, but not quite all there.

But it did produce a useful, though incomplete, searchable text layer in the PDF file. It seems to have skipped some lines altogether.

PS. I tried to respond yesterday, late at night here, then thought I'd lost the draft reply which I'd saved, so gave up until now, when I see you've added another reply. Thank you. Too late to follow it up tonight, and I'm busy most of tomorrow, but I will follow it up in the next day or two and post here how I get on.
