
Re: Introducing djvubind for djvu file creation

Posted: 30 Oct 2010, 10:25
by Lazy_Kent
strider1551, could you please change "/usr/etc" to "/etc"?
While building packages for openSUSE I got this error:

Code:

... checking filelist
djvubind: "/usr/etc" is not allowed anymore in FHS 2.2.
djvubind: "/usr/etc/djvubind" is not allowed anymore in FHS 2.2.
djvubind: "/usr/etc/djvubind/config" is not allowed anymore in FHS 2.2.
Thanks.

Re: Introducing djvubind for djvu file creation

Posted: 30 Oct 2010, 10:46
by strider1551
Someone else caught that horrifically embarrassing mistake last night. I adjusted the distutils script (setup.py) and released version 1.0.1 a little bit ago this morning.

Nothing like writing software to teach you some humility time and time again.
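
For anyone wondering how a path like "/usr/etc" can come out of a setup script: distutils resolves a relative data_files directory against the installation prefix, while an absolute directory is used as given. The sketch below only imitates that rule. resolve_data_dir is a hypothetical helper, and whether this was the actual cause in djvubind's setup.py is an assumption, not something stated in this thread.

```python
import posixpath


def resolve_data_dir(directory, prefix="/usr"):
    """Imitate where distutils would place a data_files directory.

    Hypothetical helper for illustration only; it is not part of
    distutils or djvubind.
    """
    if posixpath.isabs(directory):
        # Absolute directories are used as given.
        return directory
    # Relative directories end up below the installation prefix.
    return posixpath.join(prefix, directory)


print(resolve_data_dir("etc/djvubind"))   # /usr/etc/djvubind
print(resolve_data_dir("/etc/djvubind"))  # /etc/djvubind
```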

Re: Introducing djvubind for djvu file creation

Posted: 31 Oct 2010, 08:38
by Lazy_Kent
1. I can't disable OCR.
User config:

Code:

cores = 2
ocr = False
ocr_engine = cuneiform
cuneiform_options = -l ruseng
tesseract_options =
bitonal_encoder = minidjvu
color_encoder = csepdjvu
c44_options =
cjb2_options = -lossy
csepdjvu_options =
minidjvu_options = --dpi 300 --pages-per-dict 80 --verbose

Code:

% djvubind -v --no-ocr
djvubind version 1.0.1
Executing with these parameters:
{'ocr_engine': 'cuneiform', 'tesseract_options': '', 'verbose': True, 'cjb2_options': '-lossy', 'cuneiform_options': '-l ruseng', 'bitonal_encoder': 'minidjvu', 'color_encoder': 'csepdjvu', 'ocr': False, 'quiet': False, 'minidjvu_options': '--dpi 300 --pages-per-dict 80 --verbose', 'win_path': 'C:\\Program Files\\DjVuZone\\DjVuLibre\\;C:\\Program Files\\Tesseract-OCR;C:\\Program Files\\ImageMagick-6.6.5-Q16', 'cores': 2, 'csepdjvu_options': '', 'c44_options': ''}

* Collecting files to be processed.
* Analyzing image information.
  Spawning 2 processing threads.
* Performing optical character recognition.
  Spawning 2 processing threads.
* Encoding all information to book.djvu.
2. During recognition the program removes the output hocr files, so there is no text in book.djvu. Tested with Tesseract 3.00 and Cuneiform 1.0.0,
Python 3.0 and 3.1, djvulibre 3.5.21 and 3.5.22.

Re: Introducing djvubind for djvu file creation

Posted: 31 Oct 2010, 11:12
by strider1551
1.
Thanks for catching that. I reworked a lot of the code recently and that fell through the cracks. It's now fixed in the repository, and a 1.0.2 release should be out later this week.

2.
This seems to be the same as issue 15. I've been developing with cuneiform-0.8.0, since that is the latest available from Gentoo's Portage. cuneiform-0.9.0 and above changed the format of their .hocr files, and so my parser doesn't know how to read these newer versions. I have some examples of the new format and just need to find the time to sit down and work with them. In the meantime you could switch the ocr-engine to tesseract, which should still work (it only switches automatically if cuneiform crashes, not when it gives output I didn't expect).
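
On point 1, the log above shows 'ocr': False in the parameters while the OCR stage still runs, so the setting was parsed but never consulted. A minimal sketch of the intended behaviour, assuming a "key = value" config file and an argparse front end, could look like this. The function names and the stage list are hypothetical and are not djvubind's actual code.

```python
import argparse


def parse_config(text):
    """Read 'key = value' lines into a dict of strings."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        options[key.strip()] = value.strip()
    return options


def build_options(config_text, argv):
    """Merge config-file values with command-line overrides."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not given" apart from "flag given".
    parser.add_argument("--no-ocr", dest="ocr", action="store_false", default=None)
    parser.add_argument("--ocr-engine", dest="ocr_engine", default=None)
    args = parser.parse_args(argv)

    options = parse_config(config_text)
    # Config values arrive as strings, so convert 'ocr' to a real boolean.
    options["ocr"] = options.get("ocr", "True").lower() in ("true", "1", "yes", "on")
    for key, value in vars(args).items():
        if value is not None:
            options[key] = value
    return options


def planned_stages(options):
    """List the processing stages, skipping OCR when it is disabled."""
    stages = ["collect", "analyze"]
    if options["ocr"]:
        stages.append("ocr")
    stages.append("encode")
    return stages


config = "cores = 2\nocr = False\nocr_engine = cuneiform\n"
print(planned_stages(build_options(config, [])))
print(planned_stages(build_options("ocr = True\n", ["--no-ocr"])))
```

The point of the sketch is only that the OCR stage is gated on the merged option, whichever of the config file or the command line disabled it.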

Re: Introducing djvubind for djvu file creation

Posted: 31 Oct 2010, 12:27
by Lazy_Kent
I think there is another problem with inserting the text layer.
Trying the tesseract engine:

Code:

% djvubind --ocr-engine=tesseract --tesseract-options="-l rus" -v
djvubind version 1.0.1
Executing with these parameters:
{'ocr_engine': 'tesseract', 'tesseract_options': '-l rus', 'verbose': True, 'cjb2_options': '-lossy', 'cuneiform_options': '-l ruseng', 'bitonal_encoder': 'minidjvu', 'color_encoder': 'csepdjvu', 'ocr': False, 'quiet': False, 'minidjvu_options': '--dpi 300 --pages-per-dict 80 --verbose', 'win_path': 'C:\\Program Files\\DjVuZone\\DjVuLibre\\;C:\\Program Files\\Tesseract-OCR;C:\\Program Files\\ImageMagick-6.6.5-Q16', 'cores': 2, 'csepdjvu_options': '', 'c44_options': ''}

* Collecting files to be processed.
* Analyzing image information.
  Spawning 2 processing threads.
* Performing optical character recognition.
  Spawning 2 processing threads.
* Encoding all information to book.djvu.
At the same time, monitoring the working directory:

Code:

% inotifywait -m -r --format '%:e %f' tst
...
CREATE 155_box.box
OPEN 155_box.box
MODIFY 155_box.box
MODIFY 155_box.box
CLOSE_WRITE:CLOSE 155_box.box
...
CLOSE_NOWRITE:CLOSE 155.tif
CREATE 155_txt.txt
OPEN 155_txt.txt
MODIFY 155_txt.txt
MODIFY 155_txt.txt
CLOSE_WRITE:CLOSE 155_txt.txt
OPEN 155_box.box
ACCESS 155_box.box
CLOSE_NOWRITE:CLOSE 155_box.box
OPEN 155_txt.txt
ACCESS 155_txt.txt
CLOSE_NOWRITE:CLOSE 155_txt.txt
DELETE 155_box.box
DELETE 155_txt.txt
...
CLOSE_NOWRITE:CLOSE 157.tif
CREATE enc_temp.djvu
OPEN enc_temp.djvu
MODIFY enc_temp.djvu
...
As far as I can see, the tesseract output files were deleted before djvu encoding started. The whole log is attached.

Re: Introducing djvubind for djvu file creation

Posted: 01 Nov 2010, 13:10
by strider1551
"As far as I can see tesseract output files were deleted before starting djvu encoding."
Yes, that is what happens. All the files are read, parsed, and formatted into the djvused format. That information is then kept internally in a variable.
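
To illustrate why deleting the temporary files is harmless, here is a rough sketch of the read, parse, format, and discard sequence. It assumes box lines of the form "symbol left bottom right top [page]" and emits a simplified djvused-style text zone with only page and char levels. All function names are hypothetical, and this is not djvubind's real parser.

```python
import os
import tempfile


def parse_box(text):
    """Parse box lines into (symbol, left, bottom, right, top) tuples."""
    chars = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        chars.append((parts[0], left, bottom, right, top))
    return chars


def escape(symbol):
    """Escape backslashes and double quotes for a quoted string."""
    return symbol.replace("\\", "\\\\").replace('"', '\\"')


def to_djvused(chars, width, height):
    """Format characters as a simplified (page ... (char ...)) zone."""
    out = ["(page 0 0 {} {}".format(width, height)]
    for symbol, left, bottom, right, top in chars:
        out.append('  (char {} {} {} {} "{}")'.format(
            left, bottom, right, top, escape(symbol)))
    return "\n".join(out) + ")"


def read_and_discard(path):
    """Read a temporary OCR output file, then delete it."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    os.remove(path)
    return text


# Demo: the file is gone afterwards, but the text layer lives on in a variable.
fd, demo_path = tempfile.mkstemp(suffix="_box.box")
with os.fdopen(fd, "w", encoding="utf-8") as handle:
    handle.write("H 10 20 30 60 0\ni 32 20 40 60 0\n")
text_layer = to_djvused(parse_box(read_and_discard(demo_path)), 2550, 3300)
print(os.path.exists(demo_path))
print(text_layer)
```

So a DELETE event before the encoding step is expected. If the text layer is missing, the parse or format step is the place to look.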

I do not have the Russian language files for tesseract, but just running with "-l eng" produced a normal (albeit completely incorrect) text layer. If you could give me the tesseract output files, I'll take a look and see if something language/encoding related in them is messing up the parser:

Code:

tesseract "input.tif" "out_box" -l rus batch makebox
tesseract "input.tif" "out_txt" -l rus batch

Re: Introducing djvubind for djvu file creation

Posted: 01 Nov 2010, 14:19
by Lazy_Kent
Doesn't work for me even with "-l eng".
The Russian output files are attached.

Re: Introducing djvubind for djvu file creation

Posted: 04 Nov 2010, 03:47
by Lazy_Kent
strider1551
It works now. Thanks.
May I set custom DPI?

Re: Introducing djvubind for djvu file creation

Posted: 04 Nov 2010, 06:46
by strider1551
That's great news, glad it works.

djvubind will determine the resolution of the images on its own and give it to the encoder. If you also specify a resolution in an encoder option in the config file, it will end up running a command like "cjb2 -dpi 300 -lossy -dpi 400 image.tif". I have no idea what happens when you pass along two resolutions; it may take the second one and use that, or it may crash and complain. Hence, the config file recommends you don't specify a resolution.
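
A small sketch of how the doubled flag comes about, assuming the detected resolution is placed first and the user's option string is appended after it, as in the cjb2 example above. build_command and user_sets_resolution are hypothetical names, and the flag table only covers the two spellings that appear in this thread.

```python
import shlex

# Assumption: only the resolution flags seen in this thread are listed.
RESOLUTION_FLAGS = {"cjb2": "-dpi", "minidjvu": "--dpi"}


def build_command(encoder, detected_dpi, user_options, image):
    """Assemble the encoder command: detected dpi, user options, image."""
    flag = RESOLUTION_FLAGS[encoder]
    return [encoder, flag, str(detected_dpi)] + shlex.split(user_options) + [image]


def user_sets_resolution(encoder, user_options):
    """Return True if the user's option string already carries a resolution."""
    flag = RESOLUTION_FLAGS[encoder]
    return any(token == flag or token.startswith(flag + "=")
               for token in shlex.split(user_options))


print(build_command("cjb2", 300, "-lossy -dpi 400", "image.tif"))
print(user_sets_resolution("cjb2", "-lossy -dpi 400"))
print(user_sets_resolution("cjb2", "-lossy"))
```

A check like user_sets_resolution could be used to warn about the conflict instead of passing two resolutions to the encoder, though that is a suggestion rather than what djvubind does.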

So if the images themselves have the correct resolution, there's no need to set it in djvubind. If they have an incorrect resolution, you can take your chances and see what happens, or you can fix the images with ImageMagick (-density, I believe).
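
If you want to script the ImageMagick route, something along these lines might do, assuming ImageMagick's convert is on the PATH. Pairing -density with -units PixelsPerInch is my assumption about how to set the stored resolution, so check it against the ImageMagick documentation before relying on it. The helper names are hypothetical.

```python
import shutil
import subprocess


def density_command(source, target, dpi):
    """Build an ImageMagick command that rewrites the stored resolution."""
    return ["convert", source,
            "-units", "PixelsPerInch",
            "-density", str(dpi),
            target]


def fix_density(source, target, dpi):
    """Run the command, failing early if ImageMagick is not installed."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    subprocess.run(density_command(source, target, dpi), check=True)


# Only prints the command; call fix_density() to actually run it.
print(" ".join(density_command("155.tif", "155_fixed.tif", 300)))
```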

Images coming from scantailor should have the correct resolution.

Re: Introducing djvubind for djvu file creation

Posted: 23 Nov 2010, 22:44
by caudwell
Got this working on Ubuntu 10.04, it's great!

Thanks strider!