Learning to Create Tiny DJVU files

RichardT
Posts: 27
Joined: 24 Apr 2012, 10:17
E-book readers owned: Kindle 3rd, Kindle Fire, iPad3
Number of books owned: 0
Country: USA
Contact:

Re: Learning to Create Tiny DJVU files

Post by RichardT »

I continue to mass-convert my scanned books to djvu. These are not the ugly low-quality scans that I have kind of made a hobby out of dressing up. These are all 600dpi sheet-fed scans of books which have had their spines chopped off. Recall that they were Adobe ClearScan files before, and I was generally pleased with their small size. Converting these scans to djvu has proven DjVu's superiority to me once and for all, though.

Many of the books are half the size once converted with the process outlined here (minidjvu for the black-and-white portions, c44 for any background images at 1/4 size, manually mashed together with djvumake and scripts). I have even seen some that are 1/8th the size of the PDF. Then some are about the same size, and two or three books have been slightly larger. I can't predict which outcome I'll get before the conversion, as there doesn't appear to be any rhyme or reason to it. I have to imagine that sometimes ClearScan builds up lots of redundant font images, which DjVu's JB2 encoding manages to share between pages.

Example: Plato's Republic, 397 pages, all black and white except the cover image.
  • PDF: 14MB ClearScan
  • DJVU with OCR: 6MB
  • DJVU without OCR: 4MB
Example: Suzuki, Zen and Japanese Culture, 577 pages, color cover and several grayscale photo pages.
  • PDF: 19MB ClearScan
  • DJVU with OCR: 11MB
  • DJVU without OCR: 8MB
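To put the two examples above on a common scale, here is a small sketch that expresses each DjVu size as a percentage of the ClearScan PDF. The helper name pct_of is made up for illustration, and the inputs are the rounded megabyte figures quoted above, so the percentages are only approximate.

```shell
# pct_of SIZE BASE: print SIZE as a rounded percentage of BASE.
# (Hypothetical helper; inputs below are the rounded MB figures from this post.)
pct_of() {
  awk -v size="$1" -v base="$2" 'BEGIN { printf "%.0f%%\n", 100 * size / base }'
}

echo "Plato, with OCR:     $(pct_of 6 14) of the PDF"
echo "Plato, without OCR:  $(pct_of 4 14) of the PDF"
echo "Suzuki, with OCR:    $(pct_of 11 19) of the PDF"
echo "Suzuki, without OCR: $(pct_of 8 19) of the PDF"
```

With these numbers the all-text book shrinks noticeably more than the one with photo pages, which fits the idea that most of the savings come from the shared text shapes.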
I have mostly been avoiding the books with lots of colored text, as those are a lot of manual labor with free tools (minidjvu only understands bitonal images, but csepdjvu won't share image dictionaries, so... my ideals force me to use minidjvu and create foreground color masks by hand). Luckily, the majority of books mix plain bitonal text with color images, and those are easy to split out with scripts.

I've generally been dropping the OCR data in the conversion, because when I look at it... it's not that great. ClearScan OCR has tons of mistakes in it, and I don't search in my PDFs that often in the first place. For large books it will save 2 or 3 megabytes to leave it out, so I've been doing that. In the numbers above I gave the with-OCR sizes to make the comparison to the OCR'ed PDFs fair.

[EDIT: I should say all the ClearScan files were created with Acrobat X, and saved as "Reduced Size PDF" afterward. Don't know if Acrobat XI does a better job or not.]

Re: Learning to Create Tiny DJVU files

Post by RichardT »

Another interesting one from tonight. O'Connor, Understanding Jung, 162 pages of black-and-white text. ClearScan PDF: 3.1MB ... DJVU with OCR: 1.4MB ... DJVU without OCR: 800KB. All 600dpi.

So was it a waste of time scanning my library to PDF if I've decided to slog through the djvu creation process after all? I don't think so, because Adobe's ClearScan process does a lot of the tedious part for me, splitting the pages into foreground and background images. It also does some amount of lossy matching on the characters like JBIG2 does, so it could very well be improving the results from minidjvu (or at least making its job easier). So I think the Adobe "pre-processing step" makes up a bit for the fact that I'm not using a similar commercial-quality MRC tool on the DjVu side.

What really blows my mind is when I take an electronic document PDF--not a scan--and the DjVu turns out smaller. There's no reason that should be possible, ever. Doesn't say much for the PDF generators that people use.

Re: Learning to Create Tiny DJVU files

Post by RichardT »

Today I'm back to my hobby of cleaning up bad scans I find on the net. Our subject for today is a computer game manual for Ultima 4, which I bought from GOG.com. It is a poor scan, and djvudigital just wants to make every page all-background. So this makes for a good example of how I go about extracting the text from a complicated image. Here's a tiny version of an example page:
[Image: origsmall.png]
As you can see, the background is busy, and the contrast around the inking isn't great:
[Image: Know_CU1.PNG]
Still, I've seen much worse. So, every document is different, but the main plan of attack I start with is to stretch the contrast in different channels of different colorspaces, and look for the one that seems most "split." For example:

Code: Select all

  convert infile.png -separate -contrast-stretch 7%x30% +append sepRGB.png
  convert infile.png -colorspace LAB -separate -contrast-stretch 7%x30% +append sepLAB.png 
  convert infile.png -colorspace CMYK -separate -contrast-stretch 7%x30% +append sepCMYK.png 
Here's the result in RGB-space:
[Image: sep.png]
Now, you might initially think that the two on the right are better candidates because the text looks darker, but my main concern isn't the dark text. When I threshold the image it will get the rest of the way dark. My main concern is that the gray splotches in the background need to go away. So the Red channel, the leftmost image, is the best candidate as far as I am concerned. Now I imagine a simple '-threshold' should get me home from here, but I somehow never seem to get that command right. So, I tend to use '-white-threshold' and then just replace everything that's not white with pure black. I'm sure someone with stronger IM-fu would laugh, but it's what works easiest for me. After playing with the inputs a couple times, this looks pretty good to me:

Code: Select all

convert in.png -channel R -separate +channel -colorspace gray -contrast-stretch 7%x30%  -white-threshold 50% -fill black +opaque white out.png
  • -channel R -separate +channel says grab the red channel, then go back to operating on all channels. At this point we have the red channel as a grayscale image.
  • -colorspace gray I'm frankly not sure we need this, but I'm "helpfully" telling IM that we are pure grayscale at this point.
  • -contrast-stretch 7%x30% I want to warp the contrast of the image, forcing a ton of the lightest colors to white (30%), and a few of the darkest colors to black (7%), and normalizing in between.
  • -white-threshold 50% -fill black +opaque white binarizes the image: everything lighter than 50% is forced to white, then everything that is not white is filled with black.
Here's the output:
[Image: test3.png]
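Since finding a good threshold took some trial and error, one way to speed that up is to generate several candidates in one go. The sketch below only prints the convert commands (the same options as above, with the -white-threshold value varied) so they can be reviewed before running. The function name, the default 40/50/60 values and the output naming are assumptions for illustration, and it does not handle filenames containing spaces.

```shell
# sweep_cmds INPUT.png [THRESHOLD...]: print one convert command per threshold.
# Hypothetical helper; defaults to 40, 50 and 60 percent if none are given.
sweep_cmds() {
  in=$1
  shift
  [ "$#" -gt 0 ] || set -- 40 50 60
  for t in "$@"; do
    # %% in the format string prints a literal percent sign
    printf 'convert %s -channel R -separate +channel -colorspace gray -contrast-stretch 7%%x30%% -white-threshold %s%% -fill black +opaque white %s\n' \
      "$in" "$t" "${in%.png}-t$t.png"
  done
}

sweep_cmds in.png            # review the commands first
# sweep_cmds in.png | sh     # then run them (needs ImageMagick installed)
```

Flipping through the in-t40.png, in-t50.png and in-t60.png results in an image viewer makes it fairly quick to see where the background splotches disappear without eating the thin strokes.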
So now we have a black-and-white mask of the foreground. What do we do with it?? I'll make a second post for that.

Re: Learning to Create Tiny DJVU files

Post by RichardT »

Ok, so in the last post we found an incantation that gets us a reasonable foreground mask for a particular document. I'm afraid every document is different, but if you're lucky the same formula will at least work for every page of the document at hand. So, the first order of business is to split out the foreground for all the pages of the document.

Code: Select all

 mkdir text
 for x in page-*.png ; do 
   convert $x -channel R -separate +channel -colorspace gray -contrast-stretch 7%x30%  -white-threshold 50% -fill black +opaque white text/${x%png}tiff 
 done
This leaves the text/ directory with all the pages' masks. I saved them as TIFF because the next thing to do is run minidjvu on them:

Code: Select all

  cd text
  minidjvu -p 30 -l -r -i -d 600 *.tiff index.djvu
This will produce an indirect djvu file of our foreground. Now, if all you wanted was black-and-white text, you're done! But if you want to recreate the colors and the background, you've got more to do.

So, let's create the backgrounds. I would really like the "hole-filling" section here: http://www.imagemagick.org/Usage/masking/#hole_filling to be filled in, because I'm sure there's a lot I don't know. But I'm not too concerned for basic stuff like these pages. The steps I like to use are:
  • Pick a suitable background color to replace the text with. I open the original image and select a suitably bland color from the background. In this case I pick rgb(235,232,199).
  • Create our background image by overlaying our text mask on top of the original, setting the colors to our background, then subsampling it to 150 dpi
  • Create a subsampled mask to help c44 know where it can ignore image errors
  • Run c44 on all the backgrounds, and extract their BG44 chunks.
The commands to overlay the mask, create the c44-mask, run c44 and extract the bg44 chunk are:

Code: Select all

 for x in page-*.png ; do 

  # make a small background with the foreground "erased"
  convert  ${x} \( text/${x%png}tiff -alpha set -transparent white -fill rgb\(235,232,199\) -opaque black \) -compose Over -composite \
              -resize $(~/bin/subsamp.sh $x 4)\! -quality 100 -density 150 back${x%png}jpg

  # make a small PBM for c44 to mask with
  convert text/${x%png}tiff -resize $(~/bin/subsamp.sh $x 4)\! mask${x%png}pbm

  # run c44  (probably add a -decibel option to reduce quality to your liking)
  c44 -dpi 150 -mask mask${x%png}pbm back${x%png}jpg
  djvuextract back${x%png}djvu BG44=${x%png}bg44
done
So we wind up with a bunch of background images that look like this, then subsampled and djvu'ed:
[Image: testBack.jpg]
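The background loop above relies on ~/bin/subsamp.sh, which is not shown in this thread. Judging only from how it is called, it presumably prints a WIDTHxHEIGHT geometry equal to the image size divided by the given factor. Here is a hypothetical stand-in under that assumption; the function name and the round-up behaviour are guesses, not the original script.

```shell
# subsample_geometry WIDTH HEIGHT FACTOR: print "WxH" divided by FACTOR.
# Hypothetical replacement for subsamp.sh; rounds up so that a partial
# block of pixels at the right/bottom edge still gets a background pixel.
subsample_geometry() {
  w=$1 h=$2 f=$3
  echo "$(( (w + f - 1) / f ))x$(( (h + f - 1) / f ))"
}

subsample_geometry 2550 3300 4
# With ImageMagick installed, the pixel size can come from identify:
#   subsample_geometry $(identify -format '%w %h' page-002.png) 4
```

Rounding up rather than down is a deliberate choice here: if the subsampled background and mask come out one pixel too small for the 600 dpi foreground, the layers may not line up when they are combined later.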
So, now we have the foreground text, and a bunch of background images. The only step left is to merge them together. Let's do that in a third post.

Re: Learning to Create Tiny DJVU files

Post by RichardT »

Ok, so if you've followed the last two posts, you know we are splitting a page into foreground/background, processing them separately, and recombining them into a djvu file. We're now on the last step... how do you take the foreground minidjvu output, and combine it with the background bg44 chunks? It's actually not too bad.

First, rename all the indirect page files, to get them out of the way. I just put an 'x' in front of the name:

Code: Select all

cd text
for x in page*.djvu ; do mv $x x$x ; done
Now you use djvumake to create a file that:
  • Has an INCL chunk that points to the shared shapes minidjvu made for us.
  • Has a Sjbz chunk from our renamed x file from minidjvu. This is the foreground layer.
  • Has a FGbz chunk that colors our foreground layer a dark brown to match the ink on the original. (I chose #64462B here).
  • Has a BG44 chunk that is our background image.
Actually, the only slightly tricky thing is getting the INCL chunk right. If you run:

Code: Select all

 djvudump xpage-002.djvu 
You can see which iff file goes with page-002, but it's pretty straightforward. In my case I have the following .iff files: page-002.iff, page-032.iff, page-062.iff. So, I will need to point all the pages up to 31 to page-002.iff. Then all the pages from 32 to 61 need to point to page-032.iff. Then page 62 to the end need page-062.iff. That's how minidjvu does it, every time. (We start with page-002 because I skip the cover page, in case you are wondering).

The djvumake command you want for the first group looks like this:

Code: Select all

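for x in xpage-0[0-2][0-9].djvu xpage-030.djvu xpage-031.djvu ; do
  y=${x#x}   # output name without the leading 'x', e.g. page-002.djvu
  # the .bg44 files were named after the original pages (no 'x'), so build the path from $y
  djvumake $y INFO=,,600 INCL=page-002.iff Sjbz=$x FGbz=#64462B BG44=../${y%djvu}bg44
done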
(...and similar commands for the other iff files).
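Rather than writing one loop per .iff file, the dictionary name can be computed from the page number. This is a sketch that assumes the layout described above (first page is 2, and 30 pages per dictionary because of minidjvu's -p 30). The function name dict_for_page is made up, and the loop only prints the djvumake commands so they can be checked against djvudump before running anything. The BG44 path is built from the un-prefixed page name, since that is how the .bg44 files were named in the previous post.

```shell
# dict_for_page PAGE [FIRST] [PER_DICT]: print the shared-dictionary file
# that covers PAGE. Hypothetical helper; defaults match this document.
dict_for_page() {
  page=$1 first=${2:-2} per=${3:-30}
  printf 'page-%03d.iff\n' "$(( first + (page - first) / per * per ))"
}

# Dry run: print one djvumake command per renamed page file in text/.
for x in xpage-*.djvu; do
  [ -e "$x" ] || continue                  # no matching files: do nothing
  y=${x#x}                                 # page-045.djvu
  n=${y#page-}; n=${n%.djvu}               # 045
  n=$(echo "$n" | sed 's/^0*//')           # 45 (avoid octal parsing)
  echo djvumake "$y" INFO=,,600 "INCL=$(dict_for_page "${n:-0}")" \
    "Sjbz=$x" 'FGbz=#64462B' "BG44=../${y%djvu}bg44"
done
```

If the printed INCL names agree with what djvudump reports for a few sample pages, the same loop can be run for real by removing the echo.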

Finally, convert the indirect djvu file into a bundled one, along with the front and back cover (which I just did with c44 as an image, and didn't discuss here):

Code: Select all

 djvm -c Ultima4_Spellbook.djvu  ../cover.djvu index.djvu ../zzcover.djvu
Here's the final version of our example page (all images are scaled way down for posting purposes, obviously):
[Image: outpicSmall.png]
... and here's a close-up of the same "the" from a couple posts back:
[Image: theRealClose.PNG]
The final product looks better than the original, and is less than half the size. That's hard to beat!

Re: Learning to Create Tiny DJVU files

Post by RichardT »

Just a note: until today, when I cropped all the pages in a PDF, I had to use acrobat to export them to images, then create a new PDF before running through djvudigital. This is because ghostscript was ignoring the cropping. It turns out I can pass "-dUseCropBox" through to ghostscript and process the original PDF directly.

Code: Select all

 djvudigital -gsarg=dUseCropBox --bg-subsample=... ... ...
(and of course if you are using gs to create the PDF images then the same argument will make the files the right size there, too)

Nice! It seems to me -dUseCropBox should be on by default, but at least now I know the option is there. There is always more to learn... one of the reasons I write this stuff here is in hopes that someone comes along and says "you know, doing x-y-z will make your life five times easier."

Re: Learning to Create Tiny DJVU files

Post by RichardT »

Ok, so last time, I showed pulling some text off a 'parchment'-style background on a fuzzy jpeg scan. That was very successful. Today's experiment was only partially successful.

This time, I'm working with a terrible scan I found at http://astrolibrary.org/ebooks/. There is tons of bleed-through on every page. Unfortunately, it is black and white, with no halftoning or dithering of any kind. So... the text is almost illegible in places. Here's a sample:
[Image: inf.png]
So, there's no color information to go by, to try to get rid of the "background"... the only thing I could think to do is go by the thickness of the black areas. We can at least say that the foreground looks thicker, right? So, here's the plan.
  • Dilate the white area quite a bit, so that only the thickest parts of the image will leave a trace. We're hoping to be left with at least a few dots where every foreground letter was. The bleed-through letters should be mostly gone now.
  • Erode the white area by even more than we dilated it, so that we take the dots that are left and make them cover the area that the foreground letters covered. Here's what I mean:
[Image: dilero_small.png]
Now, think of the above image as a mask, and we'll compose it with the original, only allowing the original to show if the mask is black. And since the mask is black where the letters are, that should clean up the original quite a bit. Here's the outcome:
[Image: outfile.png]
To generate the mask:

Code: Select all

convert infile.png -morphology Dilate:3 Rectangle:5x5 \
    -morphology Erode:3 Rectangle:10x10 mask.png
To compose the mask and the image:

Code: Select all

convert infile.png mask.png -alpha set -transparent white -compose Dst_In -composite -background white -alpha remove final.png
Or, to do it all in one step:

Code: Select all

convert infile.png \( +clone -morphology Dilate:3 Rectangle:5x5 \
   -morphology Erode:3 Rectangle:10x10 \) \
   -alpha set -transparent white -compose Dst_In -composite \
    -background white -alpha remove final.png
Now, if you look closely, you can see that cleaning up this file fully automatically was beyond my abilities. A few of the thinner letters (like the 'l' in 'comelieft' on the second line) got destroyed. If I widen the erosion rectangle to 50x10, then the mask covers any letters to the left or right. But, unfortunately, that also means any bleed-through between letters will become visible again. I think it's best to over-delete, then proofread the document, pasting in missing letters from the original. That is, if I really care about the document. Since this was just for practice, I'm taking my leave of this one forever!

Are there other techniques that would work better on a file as evil as this one?

Re: Learning to Create Tiny DJVU files

Post by RichardT »

Tip of the day... if only a few files have backgrounds you can just generate them specifically. But if most pages have a background, it's easier to just generate them all:

Code: Select all

  for x in page*.png ; do convert $x [options]  back${x%png}jpg ; done 
... then check and erase the ones with only one color (which means there was nothing left after the black part was removed):

Code: Select all

  for x in back*.jpg ; do ( identify -format "%k" $x | grep -q '^1$' ) && rm $x ; done

Re: Learning to Create Tiny DJVU files

Post by RichardT »

To celebrate the 50-year anniversary of the BASIC programming language, I was looking around the net and found the PDF version of the "10 PRINT" book. It's offered here: http://10print.org/. The PDF is 50 megabytes. Of course, I couldn't let that stand! In short order, I produced a 600-dpi DjVu at 4 megabytes. Fantastic!
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Learning to Create Tiny DJVU files

Post by dtic »

Very useful posts Richard!