Homer glitch with OCR component

General discussion about software packages and releases, new software you've found, and threads by programmers and script writers.

Moderator: peterZ

victoriaaustralia
Posts: 55
Joined: 07 Nov 2011, 16:22
E-book readers owned: newton
Number of books owned: 2
Country: Australia
Location: Castlemaine, Victoria, Australia

Homer glitch with OCR component

Post by victoriaaustralia »

I think Homer is an excellent conversion of pdfbeads to a Windows environment, one that simple-minded people like myself can use without having to learn any programming.
I am having an intermittent glitch with the OCR component that I cannot fathom. As per this thread:
http://www.diybookscanner.org/forum/vie ... =19&t=2835

I photograph, process left and right, rename, and then run the pages through Scan Tailor and then Homer. The pdfbeads compression is reliable and always works. The Tesseract OCR component works for some Scan Tailor projects and not others. For example, I have just processed an old magazine, only 50 pages, using the mixed-mode output in Scan Tailor, and OCR worked fine. I then had a 200-page book, output in BW only, and it compressed but did not OCR. It does not seem to be a matter of size: I just put 50 pages of the BW book through Homer and again got no OCR layer. I am now re-processing it in Scan Tailor to see whether output in mixed mode helps with OCR.

Has anybody else had this experience with an otherwise excellent little program?
Freeware Windows workflow in 2020
viewtopic.php?f=19&t=3620

Re: Homer glitch with OCR component

Post by victoriaaustralia »

Nope, re-processing in ST mixed mode did not work.

Some further hints. This is the Tesseract log file for the 50-page mixed-output magazine, where OCR worked well:
Tesseract Open Source OCR Engine with Leptonica
Tesseract Open Source OCR Engine with Leptonica
Tesseract Open Source OCR Engine with Leptonica
Tesseract Open Source OCR Engine with Leptonica

This is what is in the Tesseract log file for the BW or mixed output that did not OCR:
Error: -l must be arg3, not 5
Usage:c:\opt\Tesseract-OCR\tesseract.exe imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
Error: -l must be arg3, not 5
Usage:c:\opt\Tesseract-OCR\tesseract.exe imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
Error: -l must be arg3, not 5
Usage:c:\opt\Tesseract-OCR\tesseract.exe imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
Error: -l must be arg3, not 5

Both projects were photographed with the same cameras and processed on the same laptop running Vista, with Ant Renamer, Scan Tailor, and Homer.

Re: Homer glitch with OCR component

Post by victoriaaustralia »

Figured it out!! Sorry I bothered to post. I was looking at the Tesseract log error and thought: what if the l means length, and it wants the file name length to be between 3 and 5 digits? And it worked. The 50-page magazine had four-digit file numbers, and the larger book I had been trying to do had six-digit ones. I used Ant Renamer to chop them down to four digits, ran the book through Homer again, and the file has been OCR'd nicely!
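
A note on the log itself: the Usage line lists the option as [-l lang], so the error seems to be about where -l lands on the command line rather than about file name length. One reading, which is an assumption and not something confirmed in this thread, is that the original file names contained a space and were passed to tesseract unquoted, so each name split into two arguments and pushed -l from position 3 to position 5. The sketch below only illustrates that counting. The file names and the "eng" language code are hypothetical, and it does not reproduce how Homer actually builds its command.

```python
def position_of_lang_flag(image_name, output_base, lang="eng"):
    """Return the argv index of "-l" when the command line is split on spaces.

    Hypothetical model: an unquoted command line is broken into arguments at
    every space, with the program name at index 0.
    """
    command_line = f"tesseract.exe {image_name} {output_base} -l {lang}"
    argv = command_line.split()
    return argv.index("-l")


# Names without spaces: "-l" is argument 3, as the Usage line expects.
print(position_of_lang_flag("0001.tif", "0001"))

# Hypothetical names containing one space each: "-l" slides to argument 5.
print(position_of_lang_flag("book 000001.tif", "book 000001"))
```

If that reading is right, any renaming that removes the spaces would have the same effect as shortening the numbers.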

Take-home message: in my experience, Homer's OCR step only works when the input TIFF files are named in the format 0001.tiff, 0002.tiff, etc.
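
The renaming step could also be scripted instead of done by hand in Ant Renamer. This is a minimal sketch, assuming the pages already sort correctly by their current names and that copying them into a separate folder is acceptable. The function name and folder arguments are made up for illustration.

```python
import shutil
from pathlib import Path


def copy_with_four_digit_names(src_dir, dst_dir, extensions=(".tif", ".tiff")):
    """Copy TIFF pages to dst_dir as 0001.tif, 0002.tif, ... in sorted order."""
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # Keep only TIFF files and order them by their current names.
    pages = sorted(p for p in src.iterdir()
                   if p.is_file() and p.suffix.lower() in extensions)
    if len(pages) > 9999:
        raise ValueError("more than 9999 pages will not fit in four digits")

    new_paths = []
    for number, page in enumerate(pages, start=1):
        target = dst / f"{number:04d}{page.suffix.lower()}"
        shutil.copy2(page, target)  # copy, so the originals stay untouched
        new_paths.append(target)
    return new_paths
```

For example, copy_with_four_digit_names("out", "out_renamed") would fill a hypothetical out_renamed folder that Homer can then be pointed at.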