Any images in this post are not the originals. They have been resized, converted to jpg, etc. to be a little more manageable in reading this post.
The Task at Hand
Original image, followed by the scantailor output in color, mixed, and black/white modes, respectively:
I must have had my light settings off on the camera, because the page is a good bit yellower than normal. That doesn't make a difference for anything I'm doing here. The first thing is to run the image through scantailor. As you can see in the second image, only the color/grayscale mode preserves the red text, so that is the base image I'll have to work with. I'm a bit curious now how things would change if I didn't use the white margins and equalize illumination options, but that can wait for some other day.
Reducing to White, Black, and Red
My goal is to get the image as close as I can to being purely white, black, and red. My tool of choice is ImageMagick, since it is powerful and easily scripted. The specific values I used for various options were tailored to this image.
original: final: whole process:
Step 1: Remove the background
convert test.tif -fuzz 20% -fill white -opaque "#fff8cf" test_01.tif
Step 2: Saturate the red color
convert test_01.tif -modulate 100,150,100 test_02.tif
Step 3: Make reddish colors red
convert test_02.tif -fuzz 30% -fill red -opaque red test_03.tif
Step 4: Make blackish colors black
convert test_03.tif -fuzz 50% -fill black -opaque black test_04.tif
Step 5: Reduce the number of colors
convert test_04.tif -colors 5 test_05.tif
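The five steps can probably be chained into a single ImageMagick call, which avoids the intermediate files. The sketch below is only that: I have not compared its output against the step-by-step files, and the function name reduce_colors and the DRY_RUN switch are made up for this example. The option values are the ones used above, so they are still tailored to this one image.

```shell
#!/bin/sh
# Sketch: run steps 1-5 in one convert invocation.
# reduce_colors and DRY_RUN are hypothetical names, not part of ImageMagick.
# On ImageMagick 7 the command is "magick" rather than "convert".
reduce_colors() {
    src=$1
    dst=$2
    # Build the command line; options are applied left to right.
    set -- convert "$src" \
        -fuzz 20% -fill white -opaque "#fff8cf" \
        -modulate 100,150,100 \
        -fuzz 30% -fill red -opaque red \
        -fuzz 50% -fill black -opaque black \
        -colors 5 \
        "$dst"
    if [ -n "$DRY_RUN" ]; then
        # Only print the command instead of running it.
        echo "$@"
    else
        "$@"
    fi
}
```

Calling reduce_colors test.tif final.tif runs the whole chain; setting DRY_RUN=1 first just prints the command so the options can be checked before touching any files.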
Encoding to djvu
This is where things got really interesting. Normally we get really good compression on black/white images, but those encoders only work on black/white and nothing more. My options were c44, cpaldjvu, and csepdjvu. c44 is made more for images with many colors, such as photos. cpaldjvu is made for images with a few colors, like we have here. csepdjvu encodes the black/white portion separately, then combines it with the colored portions, so I had no idea how well or badly it would perform.
I encoded the image at each step of the process, and here were the results (file sizes are in kB):
cpaldjvu clearly pulls ahead as the winner once the number of colors in the image starts to drop. In fact, by step 5, the other encoders were increasing the image size, not decreasing it. c44 did better with more colors, as expected. csepdjvu held fairly steady throughout, and only beats c44 once there is more pure black text to work with.
Overall, for a non-bitonal image, c44 is the best choice if no processing has been done on the image, csepdjvu if the image is a scantailor mixed mode with a lot of black text, and cpaldjvu if the image has been reduced to a handful of colors.
Of course, this data is from only one image, so it's laughably unscientific. Still, I think it provides a good baseline expectation and an insight into what affects the performance of the djvulibre encoders.
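To repeat the comparison on more pages, something like the following might do. It is a rough sketch, not the exact commands behind the numbers above: compare_encoders, run, and DRY_RUN are names invented here, and the encoders are called with their default settings. It assumes c44 and cpaldjvu are given a PPM file, so each TIFF is converted first. csepdjvu is left out because, as far as I know, it expects its own separated-layer input rather than a plain PPM.

```shell
#!/bin/sh
# Sketch: encode each given TIFF with c44 and cpaldjvu and list the sizes.
# run, compare_encoders and DRY_RUN are hypothetical names for this example.

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "$@"
    else
        "$@"
    fi
}

compare_encoders() {
    for tif in "$@"; do
        base=${tif%.tif}
        # Assumption: both encoders read PPM, so convert the TIFF first.
        run convert "$tif" "$base.ppm"
        run c44 "$base.ppm" "$base.c44.djvu"
        run cpaldjvu "$base.ppm" "$base.cpal.djvu"
        # Print the resulting file sizes in bytes (skipped in dry-run mode).
        [ -n "$DRY_RUN" ] || wc -c "$base.c44.djvu" "$base.cpal.djvu"
    done
}
```

For example, compare_encoders test_01.tif test_05.tif would encode the first and last step of the process and print the size of each result.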