
Missing OCR text in djvubind?

Posted: 19 Jan 2012, 15:14
by Mangan
Hi, I have a TIFF that looks like this (in Swedish) after running it through ScanTailor:
out1.tif
I now try djvubind on it, and get a resulting .djvu with embedded OCR. Great! :D
But some text is missing. If I open it in djview4, copy the entire page as text and paste it, I get:
När kan jag använda jordbrukstraktorn
Traktorns grundutrustning
Traktorns skogsutrustning
Montering av vinsch
Montering av linkran
Kontroll och skötsel av utrustning
Personlig skyddsutrustning
Planering av drivningstrakten
Sortiment till stickväg
Stammar eller stamdelar till stickväg
Stammar till avlägg
Sortiment till avlägg
Ekonomi
As you can see, many complete lines are missing, and so are the page numbers and the heading (I hope you don't have to understand Swedish to see the missing lines?).
In djvubind I have selected tesseract as the OCR engine and added "-l swe" as an option in the config (otherwise it didn't work at all).

When I run "tesseract -l swe" from the command line, though, it gives me all the lines. So something is strange, and I would appreciate help with debugging.
|NNEHÅLL
När kan jag använda jordbrukstraktorn
iskogen?
Traktorns grundutrustning
Traktorns skogsutrustning
Kraftöverföringsaxlar
Traktorkärran
Vinschar
Montering av vinsch
Linkranar
Montering av linkran
Griplastare
Hjälpmedel
Stàllinor
Kontroll och skötsel av utrustning
Personlig skyddsutrustning
Körteknik
Planering av drivningstrakten
Drivningsmetoder
Sortiment till stickväg
Stammar eller stamdelar till stickväg
Stammar till avlägg
Sortiment till avlägg
Ekonomi
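
To make the difference easier to see, the two outputs can be compared line by line. This is only a minimal sketch: the two texts are pasted in by hand from above, and the variable names are made up for the illustration.

```python
# Text copied from djview4 (the embedded OCR layer written by djvubind).
djview_text = """
När kan jag använda jordbrukstraktorn
Traktorns grundutrustning
Traktorns skogsutrustning
Montering av vinsch
Montering av linkran
Kontroll och skötsel av utrustning
Personlig skyddsutrustning
Planering av drivningstrakten
Sortiment till stickväg
Stammar eller stamdelar till stickväg
Stammar till avlägg
Sortiment till avlägg
Ekonomi
"""

# Text printed by "tesseract -l swe" on the command line.
tesseract_text = """
|NNEHÅLL
När kan jag använda jordbrukstraktorn
iskogen?
Traktorns grundutrustning
Traktorns skogsutrustning
Kraftöverföringsaxlar
Traktorkärran
Vinschar
Montering av vinsch
Linkranar
Montering av linkran
Griplastare
Hjälpmedel
Stàllinor
Kontroll och skötsel av utrustning
Personlig skyddsutrustning
Körteknik
Planering av drivningstrakten
Drivningsmetoder
Sortiment till stickväg
Stammar eller stamdelar till stickväg
Stammar till avlägg
Sortiment till avlägg
Ekonomi
"""

# Lines that tesseract produced but that never reached the .djvu text layer.
embedded = {row.strip() for row in djview_text.strip().splitlines()}
missing = [row.strip() for row in tesseract_text.strip().splitlines()
           if row.strip() not in embedded]

for row in missing:
    print(row)

# Check whether every dropped line consists of exactly one word.
print(all(len(row.split()) == 1 for row in missing))
```

For these two samples, every line that goes missing is a single word, while all multi-word lines survive.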

djvubind-1.1.0
tesseract-3.01
djvulibre-bin 3.5.24-8
Ubuntu 11.10 amd64

Learning to scan, my aim is mostly to scan misc documents, but I got this book I'd like to scan first.

Re: Missing OCR text in djvubind?

Posted: 19 Jan 2012, 15:21
by Mangan
BTW, when I scan and do OCR directly with the HP Solution Center that came with the scanner, I get a perfect result:

"INNEHÅLL
När kan jag använda jordbrukstraktorn
i skogen? 2
Traktorns grundutrustning 4
Traktorns skogsutrustning 8
Kraftöverföringsaxlar 13
Traktorkärran 16
Vinschar 20
Montering av vinsch 23
Linkranar 26
Montering av linkran 31
Griplastare 37
Hjälpmedel 38
Stållinor 46
Kontroll och skötsel av utrustning 54
Personlig skyddsutrustning 60
Körteknik 61
Planering av drivningstrakten 66
Drivningsmetoder 70
Sortiment till stickväg 74
Stammar eller stamdelar till stickväg 78
Stammar till avlägg 80
Sortiment till avlägg 86
Ekonomi 92"

Will HP always have better scans?

Re: Missing OCR text in djvubind?

Posted: 19 Jan 2012, 16:15
by strider1551
Thank you for reporting this and giving me an image to test. I can confirm the bug on my machine and have opened an issue on the djvubind issue tracker. You can get email notifications on any updates there if you star the issue, but I do plan to update this thread once everything is fixed. This is actually big enough that I'll release a new version of djvubind.

My offhand, haven't-looked-at-that-code-in-months guess is that it has to do with the extended character set. When I wrote the section that handles moving text data from tesseract format to djvused format, I actually based it on a Ruby script that I found on a blog. For some reason, that script was very careful to strip out non-ASCII characters. I thought I had removed that later on, but perhaps I didn't...

In any case, in all likelihood I'm both at fault on this one and able to correct it. Unfortunately you hit me in the first week of possibly my busiest semester in grad school, but such is life.

Re: Missing OCR text in djvubind?

Posted: 19 Jan 2012, 16:20
by Mangan
Give me a hint and I can maybe look at the code myself?

Re: Missing OCR text in djvubind?

Posted: 19 Jan 2012, 16:55
by strider1551
All of that should be handled by djvubind/ocr.py. A good starting place is Tesseract.analyze() (line 385 in the repo version).

Re: Missing OCR text in djvubind?

Posted: 19 Jan 2012, 23:30
by Anonymous2
Fixed it (in Bindery, but I'm using a Python2 version of djvubind):

Code:

def translate(boxing):
    """
    Translate djvubind's internal boxing information into a djvused format.

    .. warning::
       This function will eventually migrate to djvubind.encode
    """

    page = djvuPageBox()
    line = djvuLineBox()
    word = djvuWordBox()
    for entry in boxing:
        if entry == 'newline':
            if (line.children != []):
            ^^^^^^^^^^^^^^^^^^^^^^^^^
            Delete this line and re-indent the code
            ^^^^^^^^^^^^^^^^^^^^^^^^^
                if (word.children != []):
                    line.add_element(word)
                page.add_element(line)
            line = djvuLineBox()
            word = djvuWordBox()
        elif entry == 'space':
            if (word.children != []):
                line.add_element(word)
            word = djvuWordBox()
        else:
            word.add_character(entry)
    if (word.children != []):
        line.add_element(word)
    if (line.children != []):
        page.add_element(line)

    if (page.children != []):
        return page.encode()
    else:
        return ''
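That line causes single-word lines to be dropped from the output: when 'newline' arrives, the line's only word has not been added to the line yet, so line.children is still empty and the whole line is skipped.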
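
The effect can be reproduced outside djvubind with a stripped-down version of the same loop. This is a sketch only: Box, tokenize() and group() are hypothetical stand-ins written for this illustration, not djvubind's real classes, and the result is returned as plain strings instead of djvused output.

```python
class Box:
    """Hypothetical stand-in for djvuLineBox/djvuWordBox: just a list of children."""

    def __init__(self):
        self.children = []

    def add(self, child):
        self.children.append(child)


def tokenize(text):
    """Turn plain text into a flat stream of characters, 'space' and 'newline'."""
    tokens = []
    for index, row in enumerate(text.split('\n')):
        if index > 0:
            tokens.append('newline')
        tokens.extend('space' if char == ' ' else char for char in row)
    return tokens


def group(boxing, guard_on_line):
    """Group the token stream into lines of words.

    guard_on_line=True keeps the check on the line's children before the
    pending word is flushed; guard_on_line=False is the version with that
    check deleted.
    """
    lines = []
    line, word = Box(), Box()
    for entry in boxing:
        if entry == 'newline':
            if guard_on_line:
                # The pending word is not part of the line yet, so a
                # one-word line still looks empty here and is skipped.
                if line.children:
                    if word.children:
                        line.add(''.join(word.children))
                    lines.append(' '.join(line.children))
            else:
                # Flush the pending word first, then keep the line.
                if word.children:
                    line.add(''.join(word.children))
                lines.append(' '.join(line.children))
            line, word = Box(), Box()
        elif entry == 'space':
            if word.children:
                line.add(''.join(word.children))
            word = Box()
        else:
            word.add(entry)
    # Whatever is left after the last token is flushed unconditionally,
    # which is why a one-word final line survives either way.
    if word.children:
        line.add(''.join(word.children))
    if line.children:
        lines.append(' '.join(line.children))
    return lines


sample = "Traktorns grundutrustning\nVinschar\nMontering av vinsch"
with_check = group(tokenize(sample), guard_on_line=True)
without_check = group(tokenize(sample), guard_on_line=False)
print(with_check)
print(without_check)
```

With the check in place "Vinschar" disappears; without it all three lines come back. The last line of a page is handled after the loop, which would explain why "Ekonomi" showed up in the original output even though it is a single word.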

Re: Missing OCR text in djvubind?

Posted: 20 Jan 2012, 03:28
by Mangan
Many thanks !!!

Here's a diff of the changes I made (my patched ocr.py is listed first, the original second).

Code:

*** ocr.py      2012-01-20 08:13:17.245665722 +0100
--- ocr.py.orig 2011-02-23 03:33:57.000000000 +0100
***************
*** 482,490 ****
          word = djvuWordBox()
          for entry in boxing:
              if entry == 'newline':
!                 if (word.word != ''):
!                     line.add_word(word)
!                 page.add_line(line)
                  line = djvuLineBox()
                  word = djvuWordBox()
              elif entry == 'space':
--- 482,491 ----
          word = djvuWordBox()
          for entry in boxing:
              if entry == 'newline':
!                 if (line.words != []):
!                     if (word.word != ''):
!                         line.add_word(word)
!                     page.add_line(line)
                  line = djvuLineBox()
                  word = djvuWordBox()
              elif entry == 'space':

Re: Missing OCR text in djvubind?

Posted: 20 Jan 2012, 07:53
by strider1551
Thank you both. Glad to see that it wasn't a character set problem, and that it should only have affected lines with a single word in them. A fix has been made in the repository. Once I see if I can clear out some other issues easily, there should be a new release.

@Anonymous2: This would be a really good time to get the dummy progress() function you mentioned elsewhere into a release.

Re: Missing OCR text in djvubind?

Posted: 20 Jan 2012, 21:04
by Anonymous2
I'm still working on translating Bindery into Python3 because I can't import djvubind while Bindery still uses Python2. I'm almost done. Once I get that working I'll delete my stale copy of djvubind and start importing.

Re: Missing OCR text in djvubind?

Posted: 21 Jan 2012, 04:37
by Anonymous2
Okay, I'm done converting Bindery to Python 3. I've also attached my patch for djvubind (the forum doesn't allow the .patch extension, so I had to rename it to .txt).