Introducing djvubind for djvu file creation

strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Introducing djvubind for djvu file creation

Post by strider1551 »

Djvubind is my one-step, post-Scantailor utility for creating highly compressed djvu files complete with metadata and positional ocr. I figured out various pieces of it during the past half-year or so that I've been hanging around the forum, and finally pulled everything together with the most professional code and project management that I can muster. All open-source, naturally.

What does it do?
  • bitonal images are compressed with minidjvu for the greatest compression that I know of, everything else with djvulibre
  • front and back cover images inserted automatically at front/back
  • positional ocr added with the help of tesseract
  • metadata can be added to the file (author, title, etc.)
  • bookmarks/outlines can be added to the file
What does it need?
  • Presently it is Linux-only. I don't have any experience in Windows/Mac development, so someone else would have to step in and port it. I know enough to know that it shouldn't be hard.
  • Python 3.0 or greater. Most distros have python3 available but don't use it as the default, so you may have to manually use "python3 /usr/bin/djvubind" for the time being
  • djvulibre, imagemagick, minidjvu
  • tesseract, although technically you could not have it and always use --no-ocr
I've made an ebuild for any other gentoo users out there, and I very well may put together rpm and deb packages in the near future. One mea culpa for the first release: if installed to /usr/bin and python2 is the default python, you have to type out "python3 /usr/bin/djvubind". I'm fairly certain changing one line (the shebang) will correct the behavior such that "djvubind" will suffice, but that will have to wait for the next release.
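
As an illustration of the shebang issue described above, here is a sketch of what the top of such a script might look like. The `#!/usr/bin/env python3` line is an assumption about what the one-line fix would be, not the released code, and `check_interpreter` is a hypothetical helper that fails with a readable message if the script is still started under python2.

```python
#!/usr/bin/env python3
# Assumed fix: let env look up "python3" on the PATH instead of relying
# on whatever the distro's default "python" happens to be.
import sys


def check_interpreter(version_info=None):
    """Exit with a clear message unless running on Python 3.0 or greater."""
    if version_info is None:
        version_info = sys.version_info
    if version_info[0] < 3:
        raise SystemExit(
            "djvubind needs Python 3.0 or greater, found {0}.{1}".format(
                version_info[0], version_info[1]
            )
        )
    return True


if __name__ == "__main__":
    check_interpreter()
```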

The closest alternative to djvubind that I know of is dtic's TiffDjvuOcr. I've never used it personally (because it is a Windows-only gui), but I do know that it will put together a djvu file with positional ocr.
Last edited by strider1551 on 14 Aug 2010, 07:03, edited 1 time in total.
univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Introducing djvubind for djvu file creation

Post by univurshul »

Awesome! A few questions come to mind:

How large are the file sizes of the djvu completed ebooks vs. a conventionally built PDF?

High resolution?

Faster Rendering on iPads?
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Introducing djvubind for djvu file creation

Post by strider1551 »

univurshul wrote:How large are the file sizes of the djvu completed ebooks vs. a conventionally built PDF?
If we're talking a file created from scanned images, a djvu file will be significantly smaller in size. Djvu was made specifically for scanned images and makes use of jb2 compression, whereas the best compression for a pdf is Group4. In addition, minidjvu will create a shared dictionary for multiple images, which reduces the file size further. I know that isn't an in-depth explanation, but it should give you enough to start exploring with Google.

A while ago I scanned a 545 page book, completely bitonal (black and white) except for the front cover which was full color. The pdf version of it was 37.5 MB, excluding the cover and ocr. The djvu version was 12.8 MB, including cover and ocr. Later I remade the djvu with minidjvu, which came to 6.2 MB, including cover and ocr. So, quite significant.
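
To put those three figures side by side, here is a small worked calculation. It uses only the sizes and page count quoted above; the labels and the choice of 1 MB = 1000 KB are assumptions made for the illustration.

```python
PAGES = 545

# Sizes in MB as quoted in the post above.
SIZES_MB = {
    "pdf": 37.5,              # excluding cover and ocr
    "djvu_djvulibre": 12.8,   # including cover and ocr
    "djvu_minidjvu": 6.2,     # including cover and ocr
}


def percent_of(size, baseline):
    """Size expressed as a percentage of the baseline, to one decimal."""
    return round(100.0 * size / baseline, 1)


def kb_per_page(size_mb, pages=PAGES):
    """Average kilobytes per page, assuming 1 MB = 1000 KB."""
    return round(size_mb * 1000.0 / pages, 1)


if __name__ == "__main__":
    for name, size in sorted(SIZES_MB.items()):
        print(name, percent_of(size, SIZES_MB["pdf"]), "% of pdf,",
              kb_per_page(size), "KB/page")
```

By this arithmetic the first djvu is about 34.1% of the pdf's size, and the minidjvu version about 16.5% of the pdf and 48.4% of the first djvu, or roughly 11.4 KB per page.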

On the other hand, if the book is not scanned but created from another computer program like Scribus or OpenOffice, djvu shouldn't even be an option. Djvu is a container for compressed images, unlike pdf's, which can also render pages from document data (such as "put this text in this font at this size here").
univurshul wrote:High resolution?
Non-issue. Both djvu and pdf can contain images at whatever resolution they want as far as I am aware.
univurshul wrote:Faster Rendering on iPads?
Now that's an interesting question. I couldn't find anything on differences in rendering speed between pdf's and djvu files. My money would be on djvu since it is working with a smaller file size to begin with. Ultimately, though, I think this would need to be a comparison of compression codecs, i.e. how does jb2's decoding speed compare to Group4 or LZW or any of the others.

Edit:
I forgot to mention that several months ago I heard of a new compression for pdf's that is very similar to jb2 and would produce similar file sizes. For the life of me I can't remember the name of it. I do remember that there was a big issue of it being encumbered by patents. If it does take off, it would probably be a few years before it makes it into pdf reader software, and who knows when it would be accessible in the open source world if there are patent questions.
univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Introducing djvubind for djvu file creation

Post by univurshul »

Wow. Many thanks for the detailed thought-out responses. I need my go-to mac builder to port this so I can start working with it.

The final tier of ebook construction hasn't really been a much-discussed or much-designed endeavor.
Last edited by Anonymous on 14 Aug 2010, 10:22, edited 1 time in total.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Introducing djvubind for djvu file creation

Post by dtic »

Great, strider! The features are clearly much more powerful than TiffDjvuOcr. I'm going to try it later; I have to install a linux distro etc. in virtualbox first.

Your "what does it need" list forgot tesseract (though it is mentioned further up already).

re ImageMagick: just in case you didn't see it, forum member dott found that tesseract 3 lets us bypass imagemagick. Also, the djvulibre dev has added an option to output uncompressed tiff and so bypass imagemagick even with tesseract 2 but djvulibre binaries are yet to be updated.
http://www.diybookscanner.org/forum/vie ... t=10#p4338

univurshul: djvu files render at ok speeds on my android phone, so it should likely be ok on iPads too.
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Introducing djvubind for djvu file creation

Post by strider1551 »

dtic wrote:Your "what does it need" list forgot tesseract (though it is mentioned further up already).
Nice catch. Thank you.
dtic wrote:re ImageMagick: just in case you didn't see it, forum member dott found that tesseract 3 lets us bypass imagemagick.
Yes, but I use ImageMagick for other purposes (insert evil laugh). Along with the familiar "convert" command, ImageMagick provides "identify" which I use to identify bitonal images (since minidjvu can only handle bitonals) and image dpi.
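
A rough sketch of the kind of "identify" call described above, written for illustration and not copied from djvubind. It assumes ImageMagick's `%k` (number of unique colours) and `%x` (horizontal resolution) format escapes, a single-frame image, and a resolution reported in pixels per inch. Treating "at most two colours" as bitonal is a simplification, and the function names are hypothetical.

```python
import subprocess


def parse_identify(output):
    """Parse '<unique colours> <x resolution> [units]' into a small dict."""
    fields = output.split()
    colours = int(fields[0])
    # Some ImageMagick versions append the unit name after the number,
    # so only the leading numeric field is used here.
    dpi = int(round(float(fields[1])))
    return {"bitonal": colours <= 2, "dpi": dpi}


def inspect_image(path):
    """Run ImageMagick's identify on one image and parse its answer."""
    output = subprocess.check_output(["identify", "-format", "%k %x", path])
    return parse_identify(output.decode("utf-8"))
```

Keeping the parsing separate from the system call means the logic can be checked without ImageMagick installed, for example with `parse_identify("2 300 PixelsPerInch")`.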

Here's the weird thing, though. I've been running tesseract-2.04 on ddjvu extracted pages and Scantailor pages, both of which are compressed (LZW?), without any problem. The whole need for tesseract 3 didn't even make sense until now. I guess on the Linux side tesseract has access to libraries that aren't typically installed in Windows?

Edit:
There is a python api to ImageMagick, so there is potential to trade the ImageMagick dependency for a python module dependency and avoid making some system calls. I didn't go that route yet because I don't know if it's supported in python3.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Introducing djvubind for djvu file creation

Post by dtic »

If I remember correctly, tesseract uses libtiff if available; it is in linux distros but needs installation in windows.
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Introducing djvubind for djvu file creation

Post by Tulon »

Great news, strider1551!

Now I am going to do something I don't really appreciate when done to me :), that is ask you for features:
1. The ability to process mixed text / graphics pages. It's easy to separate text from graphics (see below) in Scan Tailor's Mixed output files. Having done that, you encode the text part with minidjvu and the graphical part with c44 and then merge them with csepdjvu. Now, I haven't yet tried your program, so accept my apologies if it's already implemented.
2. I'd like to see support for Cuneiform as an alternative to Tesseract. Tesseract may be developed more actively, but Cuneiform was already quite good when it was open-sourced. It supports more languages and I believe it's still ahead of Tesseract quality-wise.


* It's easy to separate text from graphics in Scan Tailor as it makes sure it doesn't use pure black and pure white colors in pictures. Textual content, on the contrary, uses exclusively pure black and pure white colors.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Introducing djvubind for djvu file creation

Post by strider1551 »

Tulon wrote:Now I am going to do something I don't really appreciate when done to me :), that is ask you for features:
Oh, people can ask for features all they want; whether it will ever happen... Seriously, though, I'm glad to take feature requests. I want this to be useful for other people; that's the whole point of putting it out there. I've put both of your feature requests in the project issue tracker. Speaking of which:
Tulon wrote:1. The ability to process mixed text / graphics pages. It's easy to separate text from graphics (see below) in Scan Tailor's Mixed output files. Having done that, you encode the text part with minidjvu and the graphical part with c44 and then merge them with csepdjvu.
Right now a mixed page would be detected as non-bitonal, encoded with cpaldjvu, and inserted at the proper place. So they would be processed, just not in the most optimal way that you suggest. And actually, compared to c44, cpaldjvu probably doesn't like a page with photo images, which I never thought about since my test book is just text. If someone tries that before I get to it, let me know. Anyway, let me see if I understand the mixed mode. Scantailor creates a tiff file in mixed mode. If I flip the pure-black pixels to white, I have the graphical version of the image. If I flip non-pure-black pixels to white, I have the textual version of the image. Yes?
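
The current behaviour described above (bitonal pages to minidjvu, everything else to cpaldjvu) could be sketched roughly like this. The function name is hypothetical and the code is an illustration, not djvubind's implementation. It assumes minidjvu's `-d` and cpaldjvu's `-dpi` resolution options, and that a non-bitonal source has already been converted to PPM (for instance with ImageMagick's convert), since cpaldjvu reads PPM.

```python
def encoder_command(source, target, bitonal, dpi):
    """Pick the encoder for one page and return its argument list."""
    if bitonal:
        # minidjvu only handles bitonal images.
        return ["minidjvu", "-d", str(dpi), source, target]
    # Mixed, greyscale and colour pages all take this branch for now.
    return ["cpaldjvu", "-dpi", str(dpi), source, target]
```

A Scantailor mixed-mode page would take the second branch, which is why it gets processed but not in the optimal way Tulon suggests.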

Suppose someone gives me a greyscale or rgb image that didn't go through scantailor (the horror!). How would I know that this non-bitonal is a regular non-bitonal and not a scantailor mixed mode non-bitonal? Is there a clue I could find with "identify -verbose", or would I have to rely on user input via a command line option or whatnot?
Tulon wrote:2. I'd like to see support for Cuneiform as an alternative to Tesseract. Tesseract may be developed more actively, but Cuneiform was already quite good when it was open-sourced. It supports more languages and I believe it's still ahead of Tesseract quality-wise.
It will happen unless I run into some type of issue. I thought about structuring the code for easy drop-in of additional ocr engines from the start, but got lazy. Still, it will just be a matter of me playing with cuneiform and learning how it works.
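
One possible shape for the "easy drop-in of additional ocr engines" idea mentioned above, as a sketch only: each engine is a function that builds a command line, and a dictionary maps engine names to those functions. All names here are hypothetical. The tesseract call assumes the `tesseract image outputbase -l lang` form, and the cuneiform call assumes its `-l`, `-f hocr` and `-o` options.

```python
def tesseract_command(image, output, language="eng"):
    """Tesseract writes its result next to the given output base name."""
    return ["tesseract", image, output, "-l", language]


def cuneiform_command(image, output, language="eng"):
    """Cuneiform is asked for hOCR output, which carries word positions."""
    return ["cuneiform", "-l", language, "-f", "hocr", "-o", output, image]


# Adding another engine means adding one function and one entry here.
OCR_ENGINES = {
    "tesseract": tesseract_command,
    "cuneiform": cuneiform_command,
}


def ocr_command(engine, image, output, language="eng"):
    """Look up the requested engine and build its command line."""
    try:
        builder = OCR_ENGINES[engine]
    except KeyError:
        raise ValueError("unknown ocr engine: {0}".format(engine))
    return builder(image, output, language)
```

With this layout a command line option such as a hypothetical `--ocr-engine cuneiform` would only need to pass its value through to `ocr_command`.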
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Introducing djvubind for djvu file creation

Post by Tulon »

strider1551 wrote:If I flip the pure-black pixels to white, I have the graphical version of the image. If I flip non-pure-black pixels to white, I have the textual version of the image. Yes?
Right.
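
The split confirmed here can be sketched on a plain grid of pixel values. This is a minimal illustration, assuming an 8-bit greyscale page held as a list of rows where 0 is pure black and 255 is pure white; `split_mixed` is a hypothetical name, and a real tool would read and write image files instead.

```python
BLACK = 0
WHITE = 255


def split_mixed(pixels):
    """Split a mixed-mode page into (text layer, graphics layer).

    Text layer: every pixel that is not pure black is flipped to white.
    Graphics layer: every pure-black pixel is flipped to white.
    """
    text = [[BLACK if p == BLACK else WHITE for p in row] for row in pixels]
    graphics = [[WHITE if p == BLACK else p for p in row] for row in pixels]
    return text, graphics
```

The text layer is then bitonal and suitable for minidjvu, while the graphics layer keeps only the picture content.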
strider1551 wrote:Suppose someone gives me a greyscale or rgb image that didn't go through scantailor (the horror!). How would I know that this non-bitonal is a regular non-bitonal and not a scantailor mixed mode non-bitonal? Is there a clue I could find with "identify -verbose", or would I have to rely on user input via a command line option or whatnot?
I think you can just ignore this possibility, because nothing horrible would happen. Some picture parts will be encoded as B/W, but only the pure black and white ones. Even though that's not optimal, the result should still look fine.