How should I create OCR text from existing DJVU image-only file and make a searchable text layer in the DJVU file?


johnwmcc
Posts: 10
Joined: 02 Sep 2016, 12:59
E-book readers owned: Sony Reader PRS650. iPad Mini, Nexus 7 Android tablet
Number of books owned: 2000
Country: United Kingdom

Re: How should I create OCR text from existing DJVU image-only file and make a searchable text layer in the DJVU file?

Post by johnwmcc »

Very frustrating: I now find that the TIFF file I've generated can't be read by OmniPage 15, because the image size is too large - around 2500-2600px x 1300-1400. And it won't see a file with a .tiff extension, only .tif - another hiccup it took me a while to work out.

And when I try to use the Mac Preview app to reduce the image size to 50% of the original and change resolution from 72 to 300 dpi, I get either the image or the thumbnail inverted to negative. Seems I can't get both to stay black on white! If I use the black and white sliders in Adjust Colour, I can invert the image back to black on white, but then the thumbnail goes negative, and vice versa. Then I find when I have got the images (one page at a time) back to black on white, OmniPage sees most of the pages rotated 90 degrees or 270 degrees.

This is all getting too difficult. Nothing seems straightforward, all the things I try have a hiccup somewhere, and there are a confusing number of settings, most of which I don't understand, or where I don't see the pros and cons of different choices.

I have at least managed in parallel to output a PDF of more manageable size, by changing the output options in DJView. But although the PDF is now only 48MB (vs my first effort of almost 900MB) I find it has completely lost the chapter outline and links. GRRR!

I'm wondering if I should just stop where I've got to - a small DJVU file that my Android tablet can view, with an index I can read manually, showing a page number I can put into the Go To Page dialogue. Return for effort is diminishing sharply as the unproductive effort keeps going up, and no benefit is resulting.

To get me further, I would welcome a description of a workflow that will:
- generate a multipage TIF file of a resolution that OmniPage can read (DJView doesn't seem to have a resolution option anywhere I can find when exporting to TIFF from DJVU)
- scan it to generate searchable text of the main chapters - preferably using the OCR software (OmniPage 15) that I already have
- combine this with a proof-read and corrected OCR'd Glossary and Index
- put the text and images together into a searchable document of a manageable size, with page numbers matching the book image page numbers
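One possible route for the first step above, offered only as a sketch: the DjVuLibre command-line decoder `ddjvu` (from the same project as DjView) can write TIFF and, as far as I understand its man page, accepts `-format`, `-size` and `-page` options. The file names, page range and pixel limit below are hypothetical, and the size OmniPage 15 will actually accept is not known from this thread. The helper only builds the command so it can be inspected before it is run.

```python
from typing import List, Optional


def ddjvu_tiff_command(djvu_path: str, tiff_path: str,
                       max_width: int, max_height: int,
                       pages: Optional[str] = None) -> List[str]:
    """Build a ddjvu command line that renders DjVu pages to TIFF.

    Assumption: ddjvu's -size=WxH caps the output size in pixels while
    keeping the aspect ratio, and -page selects a page range.
    Check `ddjvu --help` on your own install before relying on this.
    """
    cmd = ["ddjvu", "-format=tiff", f"-size={max_width}x{max_height}"]
    if pages:
        cmd.append(f"-page={pages}")
    cmd += [djvu_path, tiff_path]
    return cmd


# Hypothetical names and sizes, for illustration only.
cmd = ddjvu_tiff_command("book.djvu", "chapter4.tif", 1700, 2200, pages="1-10")
print(" ".join(cmd))

# To actually run it (requires DjVuLibre to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Note the `.tif` extension on the output name, since OmniPage 15 did not see `.tiff` files.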
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Post by BruceG »

John
I took the DJVU file and converted it to PDF online (the ten pages together).
OmniPage would not accept the file because of the page size.
I changed the page size to A4.
OmniPage then accepted the file for OCR.
Results attached.
I see that a few things need fixing.
Ashley Book of Knots.pdf
OCR output from OmniPage
(1.82 MiB)

Post by johnwmcc »

Thank you for that. It has produced a searchable PDF. The file is a bit big - around ten times the size of the DJVU file - are the images in it still rather larger than necessary?

Which online service did you find that would convert DJVU to PDF? I have tried several, which produce different PDF file sizes from yours - one bigger (2.7MB), one smaller (1.5MB). A third offered a choice to reduce the image size, but failed twice to upload the file, so I gave up on that one.

Could you identify the online service, and the OmniPage process and settings you used, so I can try to repeat it for the rest of the book?

Thank you.

Post by BruceG »

djvu to pdf
http://djvu2pdf.com/
It took the 10 pages as one file. Not sure if there is a limit.
It was the first in my google search

Used InFix to re-size the page size. Other programs may also do this.

Did the OCR again with no editing. Instead of saving the graphics "as is", I reduced them to 100 dpi; I could have gone to 72 dpi.
knots.pdf
OmniPage graphic 100dpi no editing
(333.56 KiB)
Yes, the OCR was done in OmniPage.
Re-did zones where needed.
For the text on each page: removed bold, made the text all the same size, removed superscript/subscript text, etc.
Before saving this time, used the options to reduce the graphic dpi.

Trust this helps.

Post by johnwmcc »

Thanks again. Infix looks pricey to buy ($159), and I don't have much need for it otherwise.

Do you know of any other program for Windows or Mac, or online service that would let me change the image size in the pdf? My version of OmniPage won't load either your PDF or the other one I tried that produced a 1.5MB file.

Tried three different online PDF resize programs. All produced a file no smaller, or bigger, than the original.

Still feeling thwarted at every turn!
Last edited by johnwmcc on 11 Sep 2016, 05:58, edited 1 time in total.

Post by BruceG »

I see InFix in UK is 6 pounds a month
I have been using it for a while now, the new version 7 is by subscription rather than buying outright.
If you have a lot of material at once a one month sub may be worthwhile.
I would have thought Acrobat would do it. After Googling, I am not too sure. Most say to print to PDF with the page size set.

My OmniPage is V19 or Ultimate

Post by johnwmcc »

Have just downloaded free trial of Infix.

It took in your PDF, but seems not to resize it - I changed page size to A4, and told it to reduce contents in proportion, but the saved version seems no different in file size. The book feels large and heavy in the hand, but I've just measured the pages, and they are pretty close to US Letter - 11 x 8.5in near enough, so not surprising that changing page size makes little difference. Bigger than most books, but not as big as it feels subjectively.

Can't see any option to reduce image size or reduce image resolution specifically, but maybe I've missed it.

But I did run the built-in OCR just using default settings. It works to produce a very usable searchable PDF, but the file size goes up again to 4.2MB - just for one ten page chapter. Multiply by say 60 for 636 pages, most of which have images, and the file would be getting very big again.
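As a rough check on that extrapolation, the estimate can be worked through as below. It assumes, purely for illustration, that every ten pages of the book would come out at about the same 4.2MB as this chapter; pages with fewer images would pull the real figure down.

```python
def estimate_total_mb(sample_mb: float, sample_pages: int, total_pages: int) -> float:
    """Scale a sample's file size linearly to the whole book."""
    return sample_mb * total_pages / sample_pages


# 4.2 MB for a 10-page chapter, 636 pages in the book (figures from the post).
total = estimate_total_mb(4.2, 10, 636)
print(f"Estimated size for the whole book: about {total:.0f} MB")
```

Rounding to 60 chapters, as in the post, gives about 250MB, so either way the result is several times the 48MB PDF exported from DJView.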

It seems to have done a good job in recognising the text, but has also 'created' text from the knot images - and while I can see how to delete the contents of these incorrect text boxes, I can't see how to delete the boxes themselves. [LATER] Found it - have to choose the arrow select instead of text select tool.

At least part of my problem in converting from DJVU to PDF and recognising text seems to be that the original DJVU file had quite large images - around 2400x1400px, though individual pages vary a bit. And I haven't found any way to reduce that either online or locally. But I suppose for this page size, it isn't actually that high a resolution - 2400 px / 11 inches is about 220 dpi.

Post by BruceG »

I used InFix just to give the PDF file a reasonable page size so that OmniPage would accept it. The conversion from DJVU to PDF produced pages 45 by 35 inches. I mostly use InFix for editing; it is good for moving whole text boxes or pictures away from the edges, i.e. gutters.
I had not used the OCR feature before. It looks like it just puts a text layer over an image, so it does not help with file size, as you found out.

Whereas OmniPage, and I expect Abbyy FineReader, splits the page into text and graphics. Text pages produce small files. It's the graphics that increase file sizes. OmniPage allows you to save graphics at different dpi:
as is: 1.82 MB
100 dpi: 334 KB - no editing
72 dpi: 210 KB - no editing
With re-zoning these may increase.
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Post by cday »

I've been following this thread but with many issues raised and not much time it has been difficult to comment usefully.

Did you scan the book yourself, and if so do you recall what DPI value you used, whether you scanned in colour or black and white, and do you still have the original scans?

Without rereading the whole thread I'd make the following comments in case they are any help:

First, DPI determines the print size of the page, and the value can normally be changed easily using image editing software. As I'm a Windows user and your example file is in DjVu format, I've used Irfanview (Windows-only freeware) to open your 10-page file to view it. I've also extracted the individual pages to TIFF format with CCITT G4 'Fax' compression, which is normally a good choice for black and white images. I noted, unconfirmed, that the extracted pages seemed to be 24-bit colour images and converted them to 1-bit black and white, which would normally produce smaller file sizes.

Below is page 1 of your file as extracted as a TIFF, and then the same page after the DPI had been changed to 300, an arbitrary value I tried but also a common flatbed scanner setting. The page size shown in the file properties of the extracted file is 35.56 x 45.01 inches, and the page size of the edited file 8.53 x 10.8 inches: as that size is close to the U.S. letter size you mentioned, it looks as if the original page was likely that size and that it was actually scanned at 300DPI. You might note that both the image pixel dimensions (2560 x 3241) and file size (85KB) are unchanged.
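The page sizes quoted above follow from dividing the pixel dimensions by the DPI value stored in the file. A small sketch of that arithmetic is below; note that the 72 DPI figure for the extracted file is my inference from the reported 35.56 x 45.01 inch size, not something read from the file properties.

```python
def print_size_inches(width_px: int, height_px: int, dpi: float) -> tuple:
    """Return the print size (width, height) in inches for a given DPI."""
    return (round(width_px / dpi, 2), round(height_px / dpi, 2))


# Pixel dimensions of page 1 as reported above: 2560 x 3241.
as_extracted = print_size_inches(2560, 3241, 72)   # assumed DPI of the extracted file
after_edit = print_size_inches(2560, 3241, 300)    # DPI set in the edited file

print("As extracted:", as_extracted)
print("After setting 300 DPI:", after_edit)
```

Only the DPI tag changes between the two cases; the pixels are the same, which is why the file size stays at 85KB.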

Page_1 as extracted to 1-bit TIFF
10-page file converted to 1-bit, DPI=300 multipage TIFF
Irfanview has a batch conversion facility so I've also converted your 10-page file to a multipage TIFF 300DPI, 1-bit file in case it is any use in your tests. The file size has increased, which tends to support the view that DjVu is a very efficient format.
The Ashley Book of Knots - DJVU - Chapter 4_300DPI_1-bit_300DPI_CCITT_G4.tif
Page_1 as extracted to 1-bit TIFF, DPI=300
(663.84 KiB)
OCR is a demanding application and there has been gradual progress in the results that can be obtained, so your OmniPage 15 is now getting rather old; I'm not suggesting that you buy the latest version, but you might note in passing that Abbyy FineReader can both open DjVu files and also save to the format.

E&OE: All based on quick tests due to time limits!

Post by cday »

In addition to my post above, you might also find the second tool DjVuToy in this new post interesting; reading the French, it looks potentially very interesting for what you wish to do, and the link to the English language version works and produces a utility that opens successfully:
DjVuToy.png