Missing OCR text in djvubind?


Mangan
Posts: 17
Joined: 19 Jan 2012, 14:33
E-book readers owned: Sony Xperia Arc
Number of books owned: 1000

Missing OCR text in djvubind?

Post by Mangan »

Hi, I have a TIFF (in Swedish) that looks like this after running it through ScanTailor:
out1.tif
I then ran djvubind on it and got a .djvu with embedded OCR. Great! :D
But some text is missing. If I open it in djview4, copy the entire page as text, and paste it, I get:
När kan jag använda jordbrukstraktorn
Traktorns grundutrustning
Traktorns skogsutrustning
Montering av vinsch
Montering av linkran
Kontroll och skötsel av utrustning
Personlig skyddsutrustning
Planering av drivningstrakten
Sortiment till stickväg
Stammar eller stamdelar till stickväg
Stammar till avlägg
Sortiment till avlägg
Ekonomi
As you can see, many complete lines are missing, as are the page numbers and the heading (I hope you don't have to understand Swedish to see the missing lines?).
In djvubind I have selected tesseract as the OCR engine and added "-l swe" as an option in the config (otherwise it didn't work at all).

When I run "tesseract -l swe" from the command line, though, it gives me all the lines. So something is strange, and I would appreciate help with debugging.
|NNEHÅLL
När kan jag använda jordbrukstraktorn
iskogen?
Traktorns grundutrustning
Traktorns skogsutrustning
Kraftöverföringsaxlar
Traktorkärran
Vinschar
Montering av vinsch
Linkranar
Montering av linkran
Griplastare
Hjälpmedel
Stàllinor
Kontroll och skötsel av utrustning
Personlig skyddsutrustning
Körteknik
Planering av drivningstrakten
Drivningsmetoder
Sortiment till stickväg
Stammar eller stamdelar till stickväg
Stammar till avlägg
Sortiment till avlägg
Ekonomi

djvubind-1.1.0
tesseract-3.01
djvulibre-bin 3.5.24-8
Ubuntu 11.10 amd64
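
The comparison above can also be done mechanically. The following is a minimal sketch, not part of djvubind: the helper names are made up for illustration, and it assumes that djvulibre's djvutxt is on the PATH and prints the hidden text layer as UTF-8 on stdout.

```python
import subprocess


def normalize(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def missing_lines(reference_text, embedded_text):
    """List the lines of reference_text that do not appear in embedded_text."""
    embedded = set(normalize(embedded_text))
    return [ln for ln in normalize(reference_text) if ln not in embedded]


def djvu_text(djvu_path):
    """Dump the text layer of a DjVu file (assumes djvutxt writes UTF-8 to stdout)."""
    return subprocess.check_output(["djvutxt", djvu_path]).decode("utf-8")
```

For example, after running "tesseract out1.tif out1 -l swe" (which writes out1.txt), calling missing_lines() with the contents of out1.txt and the result of djvu_text() on the bound file (the .djvu file name is whatever djvubind produced) lists every line that tesseract found but that never reached the DjVu text layer.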

Learning to scan, my aim is mostly to scan misc documents, but I got this book I'd like to scan first.

Re: Missing OCR text in djvubind?

Post by Mangan »

BTW, when I scan and do OCR directly with the HP Solution Center software that came with the scanner, I get a perfect result:

"INNEHÅLL
När kan jag använda jordbrukstraktorn
i skogen? 2
Traktorns grundutrustning 4
Traktorns skogsutrustning 8
Kraftöverföringsaxlar 13
Traktorkärran 16
Vinschar 20
Montering av vinsch 23
Linkranar 26
Montering av linkran 31
Griplastare 37
Hjälpmedel 38
Stållinor 46
Kontroll och skötsel av utrustning 54
Personlig skyddsutrustning 60
Körteknik 61
Planering av drivningstrakten 66
Drivningsmetoder 70
Sortiment till stickväg 74
Stammar eller stamdelar till stickväg 78
Stammar till avlägg 80
Sortiment till avlägg 86
Ekonomi 92"

Will HP always have better scans?
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Missing OCR text in djvubind?

Post by strider1551 »

Thank you for reporting this and giving me an image to test. I can confirm the bug on my machine and have opened an issue on the djvubind issue tracker. You can get email notifications on any updates there if you star the issue, but I do plan to update this thread once everything is fixed. This is actually big enough that I'll release a new version of djvubind.

My offhand, haven't-looked-at-that-code-in-months guess is that it has to do with the extended character set. When I wrote the section that handles moving text data from tesseract format to djvused format, I actually based it on a Ruby script that I found on a blog. For some reason, that script was very careful to strip out non-ASCII characters. I thought I had removed that later on, but perhaps I didn't...

In any case, in all likelihood I'm both at fault on this one and able to correct it. Unfortunately you hit me in the first week of possibly my busiest semester in grad school, but such is life.

Re: Missing OCR text in djvubind?

Post by Mangan »

Give me a hint and I can maybe look at the code myself?

Re: Missing OCR text in djvubind?

Post by strider1551 »

All of that should be handled by djvubind/ocr.py. A good starting place is Tesseract.analyze() (line 385 in the repo version).
Anonymous2
Posts: 97
Joined: 18 Oct 2011, 16:05

Re: Missing OCR text in djvubind?

Post by Anonymous2 »

Fixed it (in Bindery, but I'm using a Python2 version of djvubind):

Code:

def translate(boxing):
    """
    Translate djvubind's internal boxing information into a djvused format.

    .. warning::
       This function will eventually migrate to djvubind.encode
    """

    page = djvuPageBox()
    line = djvuLineBox()
    word = djvuWordBox()
    for entry in boxing:
        if entry == 'newline':
            # Delete the following line and re-indent the code under it:
            if (line.children != []):
                if (word.children != []):
                    line.add_element(word)
                page.add_element(line)
            line = djvuLineBox()
            word = djvuWordBox()
        elif entry == 'space':
            if (word.children != []):
                line.add_element(word)
            word = djvuWordBox()
        else:
            word.add_character(entry)
    if (word.children != []):
        line.add_element(word)
    if (line.children != []):
        page.add_element(line)

    if (page.children != []):
        return page.encode()
    else:
        return ''
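That line prevents single-word lines from ending up in the OCR text. When the newline arrives, the pending word has not been added to the line yet, so line.children is still empty for a one-word line and the whole line is skipped.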
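
To see the effect in isolation, here is a small self-contained sketch. It does not use djvubind's classes: plain Python lists and strings stand in for djvuPageBox, djvuLineBox and djvuWordBox, and the function name and the skip_empty_lines flag are invented for the illustration.

```python
def group_lines(boxing, skip_empty_lines):
    """Group a flat stream of characters, 'space' and 'newline' tokens into lines of words.

    skip_empty_lines=True mimics the original check (test the line before the
    pending word has been added); False mimics the code with that check removed.
    """
    page, line, word = [], [], ""
    for entry in boxing:
        if entry == "newline":
            if skip_empty_lines:
                # Original behaviour: a one-word line is still empty here.
                if line:
                    if word:
                        line.append(word)
                    page.append(line)
            else:
                # Check removed: always flush the pending word and the line.
                if word:
                    line.append(word)
                page.append(line)
            line, word = [], ""
        elif entry == "space":
            if word:
                line.append(word)
            word = ""
        else:
            word += entry
    # Flush whatever is left when the input does not end with a newline.
    if word:
        line.append(word)
    if line:
        page.append(line)
    return page


tokens = (list("Vinschar") + ["newline"]
          + list("Montering") + ["space"] + list("av") + ["space"]
          + list("vinsch") + ["newline"])
print(group_lines(tokens, skip_empty_lines=True))   # "Vinschar" is dropped
print(group_lines(tokens, skip_empty_lines=False))  # both lines are kept
```

With the check in place only the three-word line survives, which matches the pattern in the first post: every missing line consists of a single word.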

Re: Missing OCR text in djvubind?

Post by Mangan »

Many thanks !!!

Here's a diff of the changes I made.

Code:

*** ocr.py      2012-01-20 08:13:17.245665722 +0100
--- ocr.py.orig 2011-02-23 03:33:57.000000000 +0100
***************
*** 482,490 ****
          word = djvuWordBox()
          for entry in boxing:
              if entry == 'newline':
!                 if (word.word != ''):
!                     line.add_word(word)
!                 page.add_line(line)
                  line = djvuLineBox()
                  word = djvuWordBox()
              elif entry == 'space':
--- 482,491 ----
          word = djvuWordBox()
          for entry in boxing:
              if entry == 'newline':
!                 if (line.words != []):
!                     if (word.word != ''):
!                         line.add_word(word)
!                     page.add_line(line)
                  line = djvuLineBox()
                  word = djvuWordBox()
              elif entry == 'space':

Re: Missing OCR text in djvubind?

Post by strider1551 »

Thank you both. Glad to see that it wasn't a character set problem and that it should only have affected lines with a single word in them. A fix has been made in the repository. Once I see whether I can easily clear out some other issues, there should be a new release.

@Anonymous2: This would be a really good time to get the dummy progress() function you mentioned elsewhere into a release.

Re: Missing OCR text in djvubind?

Post by Anonymous2 »

I'm still working on translating Bindery into Python3 because I can't import djvubind while Bindery still uses Python2. I'm almost done. Once I get that working I'll delete my stale copy of djvubind and start importing.

Re: Missing OCR text in djvubind?

Post by Anonymous2 »

Okay, I'm done with converting Bindery to Python 3. I've also attached my patch for djvubind (the forum doesn't allow the .patch extension, so I had to rename it to .txt).
Attachments
progress.txt
Dummy progress patch