Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

b0bcat
Posts: 42
Joined: 30 Nov 2012, 21:37
Number of books owned: 0
Country: UK

Post by b0bcat » 12 Sep 2016, 16:43

A question about dtic's "tiffdjvuocr", plus a couple of notes, among other things to put "DjVuToy" on record.

I've decided I want to try creating Djvu files under MS Windows 7 x64 and/or adding an ocr text layer to the one or two djvu files I have which omit it. In the course of my neophyte searches I have come upon the answers to some of my questions but also more questions...

1) How to make a djvu file from the output tif images of ScanTailor and/or add an ocr text layer?

It does seem to be the case that a lot of djvu knowledge and know-how is fragmented across the net and often in non-English sources. What I have found are these:

a) tiffdjvuocr by dtic, originating from
http://www.diybookscanner.org/forum/vie ... =319#p3300
"Adding positionally aware ocr to a djvu scan"
and located at http://nod5.dcmembers.com/tiffdjvuocr.html
I like the look of this but a question for dtic or anyone else who happens to know: on startup there is a splash screen prompting for the program paths of djvulibre and tesseract. In my case they are

C:\Program Files (x86)\DjVuLibre
and
C:\Program Files (x86)\Tesseract-OCR

(Tesseract now at https://github.com/tesseract-ocr)

However, entering the above two paths at the splash screen prompt does not permit a save, whether or not each is followed by a trailing \ and whether or not they are enclosed in double quotation marks. How to fix? Without surmounting this obstacle I can't try the program.

b) https://sourceforge.net/projects/cuneidjvu/
"CuneiDjVu is a graphical frontend to a set of the Windows console utilities providing the DjVu OCR capability based on the CuneiForm-Linux OCR Engine." The accompanying literature states the ocr quality from the Windows port is basic-amateur, so I've not yet tried it.

c) "Introducing djvubind for djvu file creation"
http://www.diybookscanner.org/forum/vie ... 521&p=4839
When I last looked, for Linux (?) only.

2) DjVuToy - a Windows (gui, self-contained) application for creating djvu files, modifying metadata etc.

I came across this a few years ago but was recently impressed by its capability of creating small djvu files from the ScanTailor output tifs. The author home page is http://www.cnblogs.com/stronghorse/ (in Chinese) but I found what seems to be the latest version (v2.08, in English) here (hosting webpage in French):
http://www.gratilog.net/xoops/modules/m ... 2&lid=2796
Interestingly this application can apparently also ocr the djvu file, but this capability is dependent on some Microsoft application.


To anyone who is up to speed on djvu file creation and ocr, do these Windows applications seem to be the current offerings? Any others known and used successfully under Windows? Also, if anyone has the solution to the paths problem in "tiffdjvuocr" that would be great to know! tia

dtic
Posts: 463
Joined: 06 Mar 2010, 18:03

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by dtic » 13 Sep 2016, 16:54

b0bcat wrote:A question regarding dtic's "tiffdjvuocr"
...
However inputting the above two paths at the splash screen prompt does not permit a save, whether each is followed by a trailing \ or whether they are both in double quotation marks. How to fix? Without surmounting this obstacle I can't try the program.
Hi. The first time you start TiffDjvuOcr the settings window is shown. Add the two paths you mentioned:
C:\Program Files (x86)\DjVuLibre
C:\Program Files (x86)\Tesseract-OCR
No slash at the end of the path string is needed and you shouldn't add quotation marks. Then press "save". The red "not found" text at the bottom of the settings window should now disappear. If it doesn't, double check that the files mentioned in the red text exist in the two folders you have set the paths for. Next press Esc to close the settings window. The program should now work.

You can also open the ini file TiffDjvuOcr.exe.ini in Notepad and check that the two paths were saved correctly.
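
For anyone who prefers to script the "not found" check, here is a minimal Python sketch. It is an illustration only: the function name is made up, and the executable names (`djvused.exe`, `tesseract.exe`) are an assumption about what has to be present. The red text in the settings window is the authoritative list.

```python
from pathlib import Path

# ASSUMPTION: these file names are guesses for illustration; adjust them to
# match whatever the red "not found" text in the settings window names.
ASSUMED_REQUIRED = {
    "djvulibre": ["djvused.exe"],
    "tesseract": ["tesseract.exe"],
}

def missing_files(djvulibre_dir, tesseract_dir, required=ASSUMED_REQUIRED):
    """Return the full paths of required files that do not exist."""
    folders = {"djvulibre": Path(djvulibre_dir), "tesseract": Path(tesseract_dir)}
    missing = []
    for key, names in required.items():
        for name in names:
            candidate = folders[key] / name
            if not candidate.is_file():
                missing.append(str(candidate))
    return missing

if __name__ == "__main__":
    # Paths exactly as typed in the settings window: no trailing slash, no quotes.
    for path in missing_files(r"C:\Program Files (x86)\DjVuLibre",
                              r"C:\Program Files (x86)\Tesseract-OCR"):
        print("not found:", path)
```

An empty result means both folders contain the files listed in `ASSUMED_REQUIRED`.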

I should note that I haven't myself actively used TiffDjvuOcr for quite some time. But when I quickly tested it now, including the setup, it seems to work. I used Tesseract 3.02 and DjVuLibre 3.5.27.

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by b0bcat » 13 Sep 2016, 23:24

Hi dtic,

My experience is all as you explain (in one iteration I'd also manually edited the *.ini file to insert the paths), but the stumbling block remained in each case: after pressing save and seeing the red warning disappear, pressing save again did not dismiss the splash screen. Now I know I should have pressed Esc! :oops:

Just tried a small test run, 11 tif totalling 737kB produce a djvu with ocr of 101kB; it ran some minutes (tesseract a lot slower than Acrobat in doing recognition), and I think an ocr layer is there - I didn't see a cmd window with tesseract running, but if I search in DjView for e.g. 'the' it highlights it on all pages (I haven't worked out yet how if at all in DjVu you can either save the layer as text or select and copy part of the text layer as in Adobe Acrobat). btw I used

https://github.com/UB-Mannheim/tesseract/wiki tesseract v.3.05 and
DjVuLibre v.3.5.27+4.10.4

-- so I may explore the earlier versions.
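
On saving the text layer as plain text: DjVuLibre ships a command-line tool, `djvutxt`, that writes the hidden text of a DjVu file to a text file. The Python sketch below only builds and runs that command. The helper names are hypothetical, and it assumes `djvutxt` is on the PATH (otherwise pass the full path to the executable in the DjVuLibre folder).

```python
import subprocess

def djvutxt_command(djvu_path, txt_path, page=None, exe="djvutxt"):
    """Build the djvutxt command line: djvutxt [-page=N] input.djvu output.txt"""
    cmd = [exe]
    if page is not None:
        cmd.append(f"-page={page}")  # restrict output to one page or page range
    cmd += [djvu_path, txt_path]
    return cmd

def extract_text(djvu_path, txt_path, page=None, exe="djvutxt"):
    """Run djvutxt; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(djvutxt_command(djvu_path, txt_path, page, exe), check=True)

if __name__ == "__main__":
    # Prints the command that would be run for the whole document.
    print(djvutxt_command("book.djvu", "book.txt"))
```

If the resulting text file is empty, the DjVu most likely has no text layer.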

Two other questions which you or someone else may be able to answer:

a) each quadrant of the program screen is greyed out except the upper left hand one ".tiff to .djvu ocr". I assume this is not normal and if the earlier two versions of the helper programs were installed, all quadrants would be potentially active.

b) is the compression scheme the lossy one that for b&w images has been observed in the past to alter certain characters, making it unsuited for accurate ocr (or images)? I can't remember its name off the cuff but I saw a post on this site the other day referring to it (jb2 not jbig2?)

Thanks for your input!

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by dtic » 14 Sep 2016, 07:57

b0bcat wrote: a) each quadrant of the program screen is greyed out except the upper left hand one ".tiff to .djvu ocr". I assume this is not normal and if the earlier two versions of the helper programs were installed, all quadrants would be potentially active.

b) is the compression scheme the lossy one that for b&w images has been observed in the past to alter certain characters, making it unsuited for accurate ocr (or images)? I can't remember its name off the cuff but I saw a post on this site the other day referring to it (jb2 not jbig2?)
Hi, the grey boxes are active. Bad GUI design to make them grey, I know. I guess I wanted to highlight the main "tiff to djvu ocr" command in the GUI. You can also click the small "m" in the top right corner to show more commands, including non-lossy djvu creation (potentially better quality, but larger file size). For more details on what commands are run under the hood, look around in the .ahk source.
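
As a rough illustration of the lossy versus non-lossy choice, the Python sketch below builds DjVuLibre command lines for encoding b&w pages with `cjb2` and bundling them with `djvm -c`. This is not taken from the .ahk source, so treat it as an assumption about a typical DjVuLibre workflow rather than what TiffDjvuOcr actually runs. The helper names and the 600 dpi default are made up for the example.

```python
def cjb2_command(tiff_path, djvu_path, dpi=600, lossy=False, exe="cjb2"):
    """Build a cjb2 command for one bitonal page.

    cjb2 is lossless unless told otherwise; -lossy trades exact pixel
    fidelity for a smaller file.
    """
    cmd = [exe, "-dpi", str(dpi)]
    if lossy:
        cmd.append("-lossy")
    cmd += [tiff_path, djvu_path]
    return cmd

def djvm_bundle_command(page_djvus, book_path, exe="djvm"):
    """Build a djvm command that bundles single-page files into one document."""
    return [exe, "-c", book_path, *page_djvus]

if __name__ == "__main__":
    pages = ["p001", "p002"]
    for p in pages:
        print(cjb2_command(p + ".tif", p + ".djvu"))
    print(djvm_bundle_command([p + ".djvu" for p in pages], "book.djvu"))
```

Leaving `lossy=False` is the cautious choice where character shapes matter, at the cost of a larger file.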

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by b0bcat » 21 Sep 2016, 15:26

Thanks very much for the further input dtic, much appreciated.

Whenever I have in mind to post here I always seem to find that despite lengthy searching, immediately after doing so I find (am told) the answer was out there all along and I just somehow missed it. . . so please all bear that in mind when considering the following.

I'm impressed with DjVu's ability to produce small files from tiffs with good visual appeal; what has restrained my enthusiasm is the knotty problem of how to create the searchable ocr text layer to an acceptable standard (or at all).

(For old books I discarded the idea a few years ago that ocr text only was impractical for faithful reproduction of the original, absent many dedicated hours of professional proofing work).

Anyway, the main point of this post is to report one solution which may be unknown to, and of interest to, others with the same objective: how to create a DjVu multi-image file with an accurate, searchable ocr text layer? Answer: DjVuToy v2.08 with MODI (Microsoft Office Document Imaging).

I'm going to set out some potentially turgid findings on the interaction of DjVu and CuneiForm / tesseract ocr below, but for those who just want the bottom line:

(A) Assuming you already have DjVuToy v.2.08 installed (see link in earlier post), but don't have Microsoft MODI (MS Office 2003 or 2007), you can get it as a free download from MS in the form of a component of MS's (32 bit) SharePoint Designer 2007:

https://www.microsoft.com/en-us/downloa ... n&id=21581

Follow the instructions for installing the component here:
https://support.microsoft.com/en-us/kb/982760
and then run DjVuToy, choosing the 'Maker' tab, selecting the options to taste (I used lossless compression for the DjVu creation step).

The ocr text layers produced on two specimen runs I've done (500+pp in each file, platform Intel i5 and 8GB RAM, Windows 7 Pro. x64) were at least comparable to Adobe Acrobat 8 ocr accuracy on pdfs created for test purposes from the same source tiffs (ScanTailor Featured output). Here's an example:

DjVuToy/MODI:
"Mr. Norman himself has never considered it
necessary to give any information about the reasons
responsible for his decision to favour the restoration
of sterling to its old parity in 1925. Questioned
about it by the Macmillan Committee only five or
six years after the decision, he surpassed himself
in evasiveness by answering that he could not
remember the sequence of events. If he was
unwilling to explain in 1930, it is most unlikely
that he would ever do so after the collapse of sterling,
since any explanation might be taken for an attempt
to vindicate himself, which is the last thing he
would ever think of doing. Rather than volunteer
the briefest of explanations in defence of his policy,
he would go down to history as the cause of all our
troubles. For this reason alone, it is the duty of
his critics to be as fair to him as is humanly possible."

Adobe Acrobat 8:
"Mr. Norman himself has never considered it
necessary to give any information about the reasons
responsible for his decision to favour the restoration
of sterling to its old parity in 1925. Questioned
about it by the Macmillan Committee only five or
six years after the decision, he surpassed himself
in evasiveness by answering that he could not
remember the sequence of events. If he was
unwilling to explain in 1930, it is most unlikely
that he would ever do so after the collapse of sterling,
since any explanation might be taken for an attempt
to vindicate himself, which is the last thing he
would ever think of doing. Rather than volunteer
the briefest of explanations in defence of his policy,
he would go down to history as the cause of all our
troubles. For this reason alone, it is the duty of
his critics to be as fair to him as is humanly possible."

I also tested an earlier version of DjVuToy (v.2.02) and the same version of MODI on an Intel Pentium 4, 2.66GHz, 2GB RAM desktop running Windows XP SP3 (version 2.08 would not run on XP, repeatedly terminating after commencing a make operation). In that case the ocr log stated 'OCR failed' for about 15 of 512 pages. All but two of these were fully blank pages, but two had half pages of text. This problem has not recurred in my three tests with version 2.08: the log reports 'OCR failed' for some pages, but I checked and all are 100% blank images.

(B) ocr by tesseract:
What else is there? Well, of course, dtic's tiffdjvuocr, which I found runs very nicely in creating DjVu files and doing ocr operations on them, except that in my testing it (like minidjvu v0.8) wouldn't accept grayscale/color images, whereas DjVuToy accepted those as well as b&w. Also (and here is the other main point of this post) there seems to be some interaction between tesseract and DjVu image files that results in a garbage ocr text layer. I created a lossless b&w-only DjVu file twice in each of DjVuToy and tiffdjvuocr, and this is representative of tesseract's output on them:

"Mr. Norman himself has never considered it
necessary to give any information about the reasons
responsible for his decision to favour the restora-
tion of sterling to its old parity in 1925. Questioned
about it by the Macmillan Committee only veor
si xye arsaf terth ede cision,he su rpassedhi
mselfin ev asivenessby an sweringth athe co uldno
tre memberth ese quenceof ev ents.If he wa
sun willingto ex plainin 19 30,it is mo stun
likelyth athe wo uldev erdo so af terth eco llapseof st
erling,si ncean yex planationmi ghtbe ta kenfo ran at
temptto vi ndicatehi mself,wh ichis th ela stth inghe
wo uldev erth inkof do ing.Ra therth anvo
lunteerth ebr iefestof ex planationsin de fenceof hi spo
licy,he wo uldgo do wnto hi storyas th eca useof al lou
rtr oubles.Fo rth isre asonal one,it is th edu tyof
hi scr iticsto be as fa irto hi mas is hu manlypo
ssible."

Worse gibberish was produced with a few other 'takes' using different tiffs from ScanTailor made into DjVu files using the lossless option in DjVuToy, tiffdjvuocr and minidjvu v0.8, variously. So I next tried their output with CuneiForm via CuneiDjVu. Incidentally, bemused by the poor output of tesseract with the DjVu image files, I tried tesseract at the command line on a single grayscale tiff from ScanTailor; a selection:

"No poultry pigeons rabbits or any other animals whatsoever shall at any time
be kept or allowed to remain on the Scheduled Property except for domestic pets on that
part of the Scheduled Property described in paragraph 1 of the First Schedule hereto"
[parcel 1 on the ?led plan] "8. Nothing shall be done or permitted to be done on the
Scheduled Property or any part thereof which shall be or become a nuisance or
annoyance damage or any inconvenience to the Vendor or the owner or occupier of any
other house 9."

Not perfect but a radically better result compared with the DjVu file's ocr text. So what is the DjVu create process (via different programs but lossless images selected for all) doing that tesseract can't handle whereas Cuneiform can?
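
For anyone wanting to repeat the single-tiff comparison, a minimal Python sketch of the command-line call follows. It uses the standard `tesseract image outputbase -l lang [configs]` form. The helper name is hypothetical, and it assumes `tesseract` is on the PATH.

```python
import subprocess

def tesseract_command(image_path, output_base, lang="eng", configs=(), exe="tesseract"):
    """Build: tesseract image outputbase -l lang [config ...]

    Tesseract appends the extension itself, so output_base "page" gives page.txt.
    configs can hold config names such as "makebox" (box output in Tesseract 3.x).
    """
    return [exe, image_path, output_base, "-l", lang, *configs]

def ocr_image(image_path, output_base, lang="eng", configs=(), exe="tesseract"):
    """Run tesseract; raises CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_command(image_path, output_base, lang, configs, exe),
                   check=True)

if __name__ == "__main__":
    # Prints the command for a plain-text run on one ScanTailor tiff.
    print(tesseract_command("page001.tif", "page001"))
```

Running this on the tiff directly, before any DjVu step, separates recognition quality from whatever happens when the text is inserted into the DjVu.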

(C) ocr by CuneiDjVu:

Different source book and tiffs (also from ScanTailor), but comparable font type and size:

"But if by I933 political science in Oxford was in the process of
severing these earlier links, it may be held that this was more
than compensated for by its new position as part of the triad
of social studies, then ambitiously styled 'Modern Greats' and
now, more modestly, P.P.E. When in the following year Mr.
Adams was succeeded by Sir Arthur Salter (as he then was), it
was natural that the choice should have fallen upon someone
with so distinguished a record in the administration of national
and international economic affairs. For it was with economics
(and modern philosophy) that the study of politics in
Oxford now seemed to have found its appropriate resting"

CuneiDjVu seems to work if no greyscale or color pages are present, as in my test. It doesn't seem to like images with long dashes (like MS Word's (unicode?) version of --) or words printed much smaller than most of the rest of the text, e.g. footnotes.

Compare tesseract's efforts on a DjVu with tiff images from the same book (different page but uniform in fonts etc):

"oses,fo undth emselvesin po ssessionof ac o nsiderablesu r-pl
us,an dwe reme tto ap proveas c hemefo rit sus e.Th epr o-po
salbe foreth emwa sex plainedby Si rWi lliamAn son,M. P.,Wa
rdenof Al lSo ulsCo llege,wh osa idth at?t heidea wast oap pl
ythes urp lustowa rdsrais ingther ead ershipinpo li ticalscie
ncealre adyesta blishedatOx fo rdtoth es tat usofap ro - fess
orship,tobe ca ll edtheG lad stoneProf essorshipofpo li ticaltheo
ryandi nst itutions?.Hehadr ea son tobeli ev e,hesaid ?t hatt herewou
ldben odiff ic ul tyinobtain in gtheassen tof theUni ve rsi
ty?.Thisisnots urpr is ing .Nouniversi ty thatIknowo fhas e vert
ho ugh tof' TimeoDa na osetdo nafere nt es?a sapossiblem ot t
o.?Ifthe propos alwer eca rried?,c onti nuedAnson,? thenameof Mr.Gla
dstone woul db epe rmanently conne ct edwithOxfor d,aplacet
owhi chhewas l oyalt ot hever yd ayo fhisd ea th. ?But ift he
his toryofthe "

Incidentally, it took tesseract appx 1hr on the i5 to do the ocr of a 512 page image DjVu file with results similar to the above, whereas DjVuToy took about 12m.
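
Put per page, the timings above work out as in this small Python sketch. It takes "appx 1hr" as 60 minutes and "about 12m" as 12 minutes, so the results are only as precise as those estimates.

```python
def seconds_per_page(total_minutes, pages):
    """Average OCR time per page in seconds."""
    return total_minutes * 60 / pages

PAGES = 512
tesseract_spp = seconds_per_page(60, PAGES)  # roughly 7 seconds per page
djvutoy_spp = seconds_per_page(12, PAGES)    # roughly 1.4 seconds per page

if __name__ == "__main__":
    print(f"tesseract: {tesseract_spp:.2f} s/page")
    print(f"DjVuToy/MODI: {djvutoy_spp:.2f} s/page")
    print(f"ratio: {tesseract_spp / djvutoy_spp:.1f}x")
```

On these figures DjVuToy with MODI was about five times faster per page.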

(D) As I think dtic indicated, tiffdjvuocr was made several years ago and hasn't been in active maintenance since; contrast DjVuToy, which is fairly regularly updated. So I also scouted around for any other Windows applications that could make a DjVu file from ScanTailor's output tiffs and ocr them with CuneiForm or tesseract. There are a few, such as at http://www.ocrivist.com (language files would not download) and at Sourceforge: djvuplus, djvupp and minidjvu v0.8 (create only), but the first three seemed to have basic functionality problems when I tried them; maybe they're in early development. For instance, djvuplus (DjVu++), when I tried getting an EN ocr of one page (again tesseract), gave in part:

"M. Human mu .. IKVUV =..._.... A
r-muav%|:v[g1-¢k1m‘xrlwwI(h>v; mm: 7:-an
._, .»,.n....,. .-.4 my ‘. an o:.....'�
W . Iy .. M...“.u.. Cum-nu-onky 5». W
.. y... ulna -u Atunnn. As w-vud mu
"mm ... ...u.... .« M. n .. ..
mm x «M M mm x. \- MW unlxkdy
W my ~..\..u... mu. .. him .. .n ...n..
. W... mm, mm ‘. ... .. M ».
...,m EVKV «NM am. am. 0... W...
1. ...‘.‘.. .1; ..u...:: m .‘._:.K. .. 3.5m
uwbkn n. m. M" I1/Hz, . ‘. KM duw 9;
...,...,...‘....u.¢.‘.,.....u..», mu.
m.....,.n.......m;x...«.y..y ....y
a. W ...m ..: mam: ..4 mm... mm"


None of the applications does automatic image rotation as in Adobe Acrobat. I think DjVuToy allows manual rotation so I suppose you could create a DjVu image file in it without ocr, do the rotations, then run its MODI ocr. tesseract seems to make no ocr output of e.g. an up-ended landscape page whereas Cuneiform does heroically try, but outputs garbage.
All instances of tesseract used were the last official Windows binary or the unofficial v.3.05dev linked to in the first post above.

(E) Conclusion:

DjVuToy v.2.08 made a 368 page DjVu (lossless) from ScanTailor mixed, b&w and color output tiffs and ocr'd it on the i5 in something like 10 minutes; output file 7.85MB. Acrobat 8 created a pdf (with normal parameters) of 31.76MB, and that is pre-ocr and without optimising using its internal jbig[?] compression. The image text outlines in the DjVu file version look slightly less 'jagged'.

Verdict so far: DjVuToy is victor ludorum with tiffdjvuocr a near second; the rest of the field far, far behind. What's most concerning is how and why tesseract operated so badly in relation to DjVu image files in my tests; and sadly, why CuneiForm gives the impression of disappearing when its ocr output seems to be superior and more worthy of further development. Check out the Wikipedia pages of both and you may find the links for CuneiForm are either dead or suspended/inactive.

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by b0bcat » 21 Sep 2016, 15:49

Keyboard's wearing out LOL... in the above
"discarded the idea a few years ago that ocr text only was impractical for faithful reproduction"
obviously =
"discarded the idea a few years ago that ocr text-only was practical for faithful reproduction"...

Yet another question so far eluding response from Google, Bing etc etc:
how in Windows to write in simple metadata to a DjVu file like author, title. I thought DjVuToy did this but now I can't find it...
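
One route that may work on Windows is DjVuLibre's `djvused`, whose `set-meta` command reads a small text file of `key "value"` lines. The Python sketch below writes such a file and builds the command. The helper names are hypothetical, the key names (`author`, `title`) are a common convention rather than a requirement, and the claim that `set-meta` with no page selected writes document-wide metadata is from memory of the djvused documentation, so test on a copy first.

```python
def format_djvused_meta(fields):
    """Render a dict as djvused metadata lines: key "value" (one per line)."""
    lines = []
    for key, value in fields.items():
        # Escape backslashes and double quotes inside the quoted value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key} "{escaped}"')
    return "\n".join(lines) + "\n"

def set_meta_command(djvu_path, meta_path, exe="djvused"):
    """Build: djvused book.djvu -e "set-meta meta.txt" -s   (-s saves the file).

    Keep meta_path free of spaces, since it is embedded in the djvused script.
    """
    return [exe, djvu_path, "-e", f"set-meta {meta_path}", "-s"]

if __name__ == "__main__":
    # Placeholder values only; substitute the real author and title.
    print(format_djvused_meta({"author": "Author Name", "title": "Book Title"}), end="")
    print(set_meta_command("book.djvu", "meta.txt"))
```

`djvused book.djvu -e print-meta` should then show what was stored.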

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by dtic » 26 Sep 2016, 18:35

b0bcat wrote:tiffdjvuocr ... there seems to be some interaction between tesseract and DjVu image files that results in garbage ocr text layer creation.
The tesseract ocr appears to get almost all characters right, they're just incorrectly spaced. In your sample the ocr missed the two characters "fi" in "five" and the characters after that are AFAICT correct but two positions off. "si xye arsaf terth ede cision" should be "six years after the decision". I think there is a bug in tiffdjvuocr's formatpl() function that processes txt and box data from tesseract before insertion into the djvu. Unfortunately I don't have the time to work out a fix for it ATM.
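
The "right characters, spaces two positions off" diagnosis can be checked mechanically. The Python sketch below is only a diagnostic on the sample strings from this thread, not a fix for formatpl(), and the function names are made up. It strips whitespace to compare the letters, then records how many non-space characters precede each space.

```python
def strip_ws(text):
    """Remove all whitespace so only the recognised characters are compared."""
    return "".join(text.split())

def space_offsets(text):
    """For each run of spaces, count the non-space characters that precede it."""
    offsets = []
    count = 0
    prev_space = False
    for ch in text:
        if ch.isspace():
            if not prev_space and count > 0:
                offsets.append(count)
            prev_space = True
        else:
            count += 1
            prev_space = False
    return offsets

ocr = "si xye arsaf terth ede cision"
ref = "six years after the decision"

if __name__ == "__main__":
    print(strip_ws(ocr) == strip_ws(ref))  # same letters in the same order
    print(space_offsets(ref))
    print(space_offsets(ocr))
```

The letters match exactly, and every space in the OCR string sits two characters later than in the reference, which fits the two dropped characters "fi" in "five" just before this excerpt.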

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by b0bcat » 05 Oct 2016, 13:56

Thanks for the heads-up @dtic.

DjVuToy certainly provided in my testing a good - and fast - standard of ocr as well as DjVu file creation, but I'm hopeful the 'open source' tradition can be continued with it in parallel via TiffDjVuOcr as and when you have the time to update or - if willing - to make the code available to someone who might, if not already out there.

btw thanks for the support in this thread - a link to a *personalised* tip jar via PayPal is in order.

L.Willms
Posts: 132
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by L.Willms » 05 Oct 2016, 17:55

b0bcat wrote: 1) How to make a djvu file from the output tif images of ScanTailor and/or add an ocr text layer?
ABBYY FineReader can save the OCR result as a DjVu file.

I admit not having tried that yet.

I use version 11

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by dtic » 06 Oct 2016, 05:10

b0bcat wrote:when you have the time to update or - if willing - to make the code available to someone who might
The source code is included in the zip, as an .ahk file that can be read in any text editor.
