How should I create OCR text from an existing image-only DJVU file and make a searchable text layer in the DJVU file?

johnwmcc
Posts: 10
Joined: 02 Sep 2016, 12:59
E-book readers owned: Sony Reader PRS650. iPad Mini, Nexus 7 Android tablet
Number of books owned: 2000
Country: United Kingdom

Post by johnwmcc »

I own a printed copy of the classic Ashley Book of Knots by Clifford Ashley, originally published in 1944, then (in the edition I have) reprinted with corrections in 1963 or thereabouts. I love browsing it, but it is big and very heavy, and as I get older, hard to handle. So I'd like to make an eBook out of it, just for my own use.

Someone else has done a good scan of the book, in DJVU format, occupying around 20MB, which includes images (but only images) of all the pages - the prefatory pages (8 pages before p1 of Chapter 1 of the book itself), 41 chapters, a few with subheadings, then a bibliography, glossary, a very comprehensive and detailed index, and scans of the B&W photos on the inside front and back endpapers. It has 636 pages altogether.

Each Chapter opening page includes line diagrams, which occupy an inverted L-shaped box around the text, and each page within the Chapters has a vertical rectangular block of text alongside a rectangle of diagrams. But the widths and proportions of the L-shapes and rectangles differ between Chapters. The bibliography and glossary have a two-column text-only layout, and the index a three-column text-only layout. Index terms may point to more than one page - sometimes a lot more than one page for a term used frequently throughout the book.

I have succeeded in a few hours in using DJView for Mac to create an Outline with links to each chapter, to some subheadings in the chapters that have them, and between index term initial letters and each page of the comprehensive Index. I've also created links from the outline to the title page, contents page, bibliography and glossary pages. I then used the same DJView for Mac software to embed the Outline into the DJVU file. So the eBook is now usable on my Android tablet and phone using the eBookDroid viewer.

I have a format which is readable, which I can use to look up an item in the Index and then use the eBookDroid viewer to go to that page. The eBookDroid app allows you to set a page offset (-8 in my case) to map the index page numbers to the DJVU image file page numbers.
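As a rough illustration of the page offset described above, the mapping between printed page numbers and DJVU page numbers can be sketched in Python. This assumes that exactly 8 unnumbered prefatory pages precede printed page 1, so that printed page 1 is the 9th page image in the file; the function names are made up for the example.

```python
# Assumption: 8 prefatory page images come before printed page 1.
PREFATORY_PAGES = 8


def printed_to_djvu(printed_page: int, offset: int = PREFATORY_PAGES) -> int:
    """Return the 1-based DJVU page that shows the given printed page."""
    if printed_page < 1:
        raise ValueError("printed page numbers start at 1")
    return printed_page + offset


def djvu_to_printed(djvu_page: int, offset: int = PREFATORY_PAGES) -> int:
    """Return the printed page number shown on a 1-based DJVU page."""
    return djvu_page - offset
```

With these assumptions, an index entry pointing to printed page 1 would be found on DJVU page 9, which corresponds to the -8 offset set in the viewer.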

But I'd like to do more. I'd like to create (using OCR) a searchable text layer for the whole book, and also create links from the Index to the relevant page(s) in the book for each index term, and embed these back into the DJVU file.

I've also manually made screenshots of the Index pages (13 pages) from viewing the DJVU file in DJView, saved them as PNG files, and used OmniPage 15 to OCR them into an RTF document edited in Word, which I've proofread and corrected for scanning and OCR errors. That took quite a few more hours! But so far, I haven't found a way to use it, or embed it back in the DJVU file.

Now what I'd like to do is to OCR the rest of the text - Prefatory pages, Chapter text, bibliography and glossary pages - and create a searchable text layer within a new DJVU file, to include at least both the Chapter text and the corrected Index.

I've found an on-line program - https://www.newocr.com - which allows me to upload the whole DJVU file and OCR it, but it only allows OCR and download of one page at a time, which would be impossibly time-consuming, even though I have a very fast Internet connection.

Aside from OmniPage 15 for Windows - if that can help further - I'd like to use free software as far as possible.

As well as the native OS X El Capitan on my iMac I have three virtual machines - Windows 7, Windows 10, and Linux Mint 17.3 Rosa. I've installed djvusmooth on Linux Mint, because I read that it can manage the creation of OCR, but I can't see how to do that in the program.
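For context, djvusmooth is an editor for an existing text layer, outline and annotations; the OCR step itself is usually done by a companion command-line tool, ocrodjvu, which writes the recognised text straight into a DJVU file. The Python sketch below only builds and runs such a command. It assumes ocrodjvu and an OCR engine it supports (Tesseract here) are installed on the Linux VM; the file names are placeholders, and which engines are available depends on the ocrodjvu version.

```python
import shutil
import subprocess
from typing import List, Optional


def ocrodjvu_command(src: str, dst: str, language: str = "eng",
                     engine: str = "tesseract",
                     pages: Optional[str] = None) -> List[str]:
    """Build an ocrodjvu command that writes an OCR'd copy of src to dst."""
    cmd = ["ocrodjvu", "--engine", engine, "--language", language]
    if pages:
        # Page range such as "1-10", handy for a trial run on one chapter.
        cmd += ["--pages", pages]
    # --save-bundled writes a new single-file document and leaves src untouched.
    cmd += ["--save-bundled", dst, src]
    return cmd


def run_ocr(src: str, dst: str, **kwargs) -> None:
    """Run ocrodjvu if it is installed; raise a clear error otherwise."""
    if shutil.which("ocrodjvu") is None:
        raise RuntimeError("ocrodjvu is not installed or not on PATH")
    subprocess.run(ocrodjvu_command(src, dst, **kwargs), check=True)
```

Calling `run_ocr("knots.djvu", "knots-ocr.djvu", pages="1-10")` would try the first ten pages only; the resulting text layer could then be inspected and corrected in djvusmooth.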

What workflow should I use?

So far, I've found ones that depend on having a set of image pages in TIF or TIFF format, then processing those automatically. How would I create those from the DJVU file?
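One possible way to get per-page TIFF files is the `ddjvu` decoder that ships with DjVuLibre (the same package DJView is built on). The sketch below is a minimal Python wrapper, assuming `ddjvu` and `djvused` are on the PATH; the output directory and file naming scheme are arbitrary choices for the example.

```python
import subprocess
from pathlib import Path
from typing import List


def ddjvu_page_command(djvu_path: str, page: int, out_dir: str = "tiff") -> List[str]:
    """Build the ddjvu command that renders one page to a TIFF file."""
    out_file = Path(out_dir) / f"page_{page:04d}.tif"
    return ["ddjvu", "-format=tiff", f"-page={page}", djvu_path, str(out_file)]


def page_count(djvu_path: str) -> int:
    """Ask djvused for the number of pages (its 'n' command prints it)."""
    result = subprocess.run(["djvused", "-e", "n", djvu_path],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def export_all_pages(djvu_path: str, out_dir: str = "tiff") -> None:
    """Render every page of the document to its own TIFF file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for page in range(1, page_count(djvu_path) + 1):
        subprocess.run(ddjvu_page_command(djvu_path, page, out_dir), check=True)
```

Writing one file per page keeps each OCR result tied to a known page number, which matters later when the text has to be put back on the right page.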

And having got the text OCR'd, how do I edit it and insert an edited searchable text layer into the DJVU file?
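For the embedding step, DjVuLibre's `djvused` can replace the hidden text of a page from a small text file using its `set-txt` command. That file is a nested parenthesised description of the page, its lines and its words, each with a bounding box. The Python sketch below generates such a description from a list of words and builds the matching `djvused` command. The `Word` type, the sample coordinates and the file names are invented for illustration, and only ASCII text is handled; DJVU measures coordinates from the bottom-left corner of the page, so boxes from an OCR tool that counts from the top-left would need their y values flipped first.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    xmin: int
    ymin: int
    xmax: int
    ymax: int
    text: str


def escape(text: str) -> str:
    """Escape backslashes and double quotes for a djvused string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def line_sexpr(words: List[Word]) -> str:
    """Describe one text line; its box is the union of its word boxes."""
    xmin = min(w.xmin for w in words)
    ymin = min(w.ymin for w in words)
    xmax = max(w.xmax for w in words)
    ymax = max(w.ymax for w in words)
    inner = " ".join(
        f'(word {w.xmin} {w.ymin} {w.xmax} {w.ymax} "{escape(w.text)}")'
        for w in words
    )
    return f"(line {xmin} {ymin} {xmax} {ymax} {inner})"


def page_sexpr(width: int, height: int, lines: List[List[Word]]) -> str:
    """Describe a whole page of the given pixel size."""
    body = "\n ".join(line_sexpr(line) for line in lines)
    return f"(page 0 0 {width} {height}\n {body})"


def set_txt_command(djvu_path: str, page: int, txt_path: str) -> List[str]:
    """Build the djvused call that loads txt_path as the text of one page.

    Assumes txt_path contains no spaces, since it is embedded in the script.
    """
    return ["djvused", "-e", f"select {page}; set-txt {txt_path}", "-s", djvu_path]
```

A typical round trip would be to write `page_sexpr(...)` to a file per page, run the command from `set_txt_command` with `subprocess.run`, and then check the result by searching in DJView. Existing text can be dumped in the same format with the `print-txt` command, which is a convenient way to get a file to correct by hand.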

I'm moderately technically competent in using the command line in Linux or Mac (and to a lesser extent in Windows batch files, but not PowerShell).

I would like advice, please, on workflow and suitable free software for each stage of the process I should follow.

Many thanks in advance.
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Post by BruceG »

John,
I have been making EPUB books lately, though only for reading straight through like a novel, not for searching, and not using DJVU.
Starting from PDF scans, I use OmniPage Ultimate to OCR and save as EPUB.
To edit the EPUB I use Calibre's EPUB editor, which is free. Usually some pages from OmniPage do not flow into the next, and some hyphens need to be removed.
For reading I have a Kobo reader.
It may be worth checking out Calibre to see what it can do.
johnwmcc
Post by johnwmcc »

Thanks for the suggestions, and for taking the trouble to reply. I appreciate it.

I already use Calibre, and have the latest version. Unfortunately, its Editor can't handle DJVU files, though Calibre itself can view them using DJView (but not with the built-in viewer).

I tried converting the 20MB DJVU file into a PDF. It exploded in size to nearly 900MB, which is unworkably large.

I really need, I think, to work with the DJVU format. I know in principle it can do what I want, I just don't know how to do it.
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Post by cday »

johnwmcc wrote: I tried converting the 20MB DJVU file into a PDF. It exploded in size to nearly 900MB, which is unworkably large.
That is a matter of the compression settings used...

Although DJVU is considered in some respects to be a technically superior format, it is little used in the west and using PDF would give you a wider choice of tools and well-established workflows, albeit probably with a small increase in file size. Some people do use DJVU, though.

Regarding your overall process, I think you will probably need to obtain a good set of page images and then run them through OmniPage or FineReader: I don't think you will be able to use your existing text file.

If you have good-quality images, Adobe Acrobat ClearScan (the absolute opposite of freeware, I realise) would probably do the best job of preserving the layout of the pages; you should be able to simply run the images through it and obtain usable output. You would also have all the tools required for the other things you need. It is now available by subscription, so for one book that might possibly be an option. It works best on high-quality images, where it can produce excellent results and very small file sizes; otherwise file size can increase substantially.

Edit:

As you are looking for the usual searchable-image output, preserving the page layout is not, of course, an issue, whereas it could be a major issue if you were looking to produce vector text output. OmniPage or FineReader should therefore be viable tools in that respect at least.

If you are lucky enough to have really good scans Clearscan could produce excellent, if rather expensive, vector output with much smaller file sizes and guaranteed layout preservation straight off. Getting a bit off topic, I realise.
BruceG
Post by BruceG »

Create PDF and TIFF
Can you print from DJVU? If you can, then print to PDF. PDF printer drivers (if that's what they are called) are freely available. You can then use OmniPage to produce TIFF. OmniPage will also allow you to reduce the size of the PDF.
johnwmcc
Post by johnwmcc »

Thanks to all of you.

I have found that I can export just the Glossary pages (for trial on a small number of pages) as a multipage TIFF file with page images, and get OmniPage to recognise them and output as a Word or RTF file.

Needs quite a lot of work to clean up the recognition, though.

Tried export as PDF with text, but the text doesn't seem to be coming over - at least, the PDF isn't searchable. Maybe I have the settings in OmniPage wrong? Will experiment further.

I'll try again on the main text.

I'll also try print to PDF - I know I can do that, but haven't tried it, nor seen what else I can do to reduce the monstrous file size of my first PDF output.
cday
Post by cday »

johnwmcc wrote: Tried export as PDF with text, but the text doesn't seem to be coming over - at least, the PDF isn't searchable. Maybe I have the settings in OmniPage wrong? Will experiment further.
OmniPage has a complex interface that is generally considered inferior to that of ABBYY FineReader; in addition, you need to grasp the underlying concepts, such as the compression options available.
BruceG
Post by BruceG »

John
When I said to put the large PDF file through OmniPage, I also meant saving it as an image file to reduce its size. I am not sure about V15, but in my version there is an Options button near Save with a lot of settings that you can play with. I suggest that you extract a few pages of the large file to experiment with. If you can extract a few pages, I am happy to have a look at them.
johnwmcc
Post by johnwmcc »

Using the DJView program, I have extracted a short (ten page) chapter 4 from the book, and saved it as DJVU, file size 299KB.

I have also Exported it as a TIF Document (I'm guessing that means multi-page TIF) size 679KB using the following settings in the Export dialogue:
[Attachment: Screen Shot 2016-09-10 at 11.09.37.png - screenshot of the Export dialogue settings]
How do I attach or upload a file here please? Maybe I just don't have permission yet? When I tried using the Attachments tab below, I got a general HTTP error on clicking [Submit], and it seems a slightly earlier version of this post got submitted anyway - I've removed it.

Let me try links to Dropbox - that ought to work, if I have understood the FAQ correctly. Here's the folder with Chapter 4 in DJVU and TIF format:
https://www.dropbox.com/sh/djjlyayir92h ... 4Uw8a?dl=0

What I'd like to do now, in the light of comments here, is turn the whole book (most of the chapters look quite like this sample) into a searchable, viewable PDF, sized not more than (say) 50MB in comparison to the 20MB of my existing image-file DJVU document, to which I've added an Outline.
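To put the size target in perspective, the figures mentioned in this thread can be turned into a per-page budget. The short Python calculation below uses the approximate sizes quoted above (20MB for the DJVU, 50MB for the target PDF, nearly 900MB for the first PDF attempt) and the 636-page count, and assumes 1MB = 1024KB; the results are therefore rough.

```python
PAGES = 636  # total page images in the book, as stated earlier in the thread


def kb_per_page(total_mb: float, pages: int = PAGES) -> float:
    """Average size per page in KB for a file of total_mb megabytes."""
    return total_mb * 1024 / pages


for label, size_mb in [("existing DJVU", 20),
                       ("target PDF", 50),
                       ("first PDF attempt", 900)]:
    print(f"{label}: about {kb_per_page(size_mb):.1f} KB per page")
```

This works out to roughly 32KB per page for the DJVU, about 80KB per page for the 50MB target, and around 1.4MB per page for the oversized PDF, which suggests the first export stored the pages with little or no compression.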

I'd like to include a searchable index (for which I have a proofread and corrected RTF document), ideally with clickable links to the relevant page, but more likely just by going to that page manually. The page numbers in the index are 8 lower than the corresponding page numbers in the DJVU document, so I want to introduce an 8-page offset somehow in the page numbering in my reader.