Preserving colored text

Post by strider1551 »

A couple of days ago I needed to copy a few pages from the Liturgy of the Hours (a Catholic liturgical text). This presented an unusual challenge for two reasons: the pages are super thin, and some of the text is red. The red text needs to be preserved as red, because it indicates what you are doing or how to do it, whereas black text is what you are saying (from what I know, this is the origin of the word "rubrics"). I know this is a pretty unusual situation, but I thought I should share what I learned about digitizing texts that have a few colors (more than just black and white). For this little case study I took one of the pages, worked out a process to improve its quality, and then compared the efficiency of different DjVu encoding methods.

Any images in this post are not the originals. They have been resized, converted to jpg, etc. to be a little more manageable in reading this post.

The Task at Hand

Original image: [image: original.jpg]
Color, mixed, and black/white modes, respectively: [image: mode-comparison.jpg]
I must have had the light settings off on the camera, because the page is a good bit yellower than normal. That doesn't actually make a difference for anything I'm doing here. The first step is to run the image through scantailor. As you can see in the second image, only the color/grayscale mode preserves the red text, so that is the base image I have to work with. I'm a bit curious now how things would change if I didn't use the white margins and equalize illumination options, but that can wait for some other day.

Reducing to White, Black, and Red

My goal is to get the image as close as I can to being purely white, black, and red. My tool of choice is ImageMagick, since it is powerful and easily scripted. The specific values I used for various options were tailored to this image.

Original: [image: start.jpg]
Final: [image: final.jpg]
Whole process: [image: enhance.gif]

Step 1: Remove the background

Code:

convert test.tif -fuzz 20% -fill white -opaque "#fff8cf" test_01.tif

In human language, this says: take the color #fff8cf (-opaque), plus any color within 20% of the color range of it (-fuzz), and replace those pixels with white (-fill). I obtained the color code with Gimp. This is the most visually dramatic step, since the ugly background is removed all in one go.

Step 2: Saturate the red color

Code:

convert test_01.tif -modulate 100,150,100 test_02.tif

The three numbers of -modulate are brightness, saturation, and hue, so the middle number says to increase saturation to 150% while leaving the other two unchanged. This step makes the black text a little bit redder, but not enough to mess with later steps.

Step 3: Make reddish colors red

Code:

convert test_02.tif -fuzz 30% -fill red -opaque red test_03.tif

If something is within 30% of the color red, make it actually red.

Step 4: Make blackish colors black

Code:

convert test_03.tif -fuzz 50% -fill black -opaque black test_04.tif

A little more aggressive here - if something is within 50% of black, make it pure black.
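
Steps 1, 3, and 4 all rely on the same "replace anything close to this color" idea. The Python sketch below mimics it on three made-up pixels. It assumes that the fuzz percentage means Euclidean distance in RGB, measured as a fraction of the black-to-white diagonal of the color cube; that is a simplification for illustration, not ImageMagick's actual code. The function names and sample pixel values are hypothetical.

```python
import math

# Length of the black-to-white diagonal of the 8-bit RGB cube.
# Assumption: a fuzz of 100% corresponds to this distance.
MAX_DISTANCE = math.sqrt(3) * 255


def within_fuzz(pixel, target, fuzz_percent):
    """True if pixel is within fuzz_percent of target (Euclidean RGB distance)."""
    return math.dist(pixel, target) <= MAX_DISTANCE * fuzz_percent / 100


def replace_fuzzy(pixels, target, fill, fuzz_percent):
    """Return a copy of pixels with every fuzzy match of target replaced by fill."""
    return [fill if within_fuzz(p, target, fuzz_percent) else p for p in pixels]


WHITE, BLACK, RED = (255, 255, 255), (0, 0, 0), (255, 0, 0)
PAPER = (255, 248, 207)  # "#fff8cf", the background color used in step 1

# Hypothetical pixels: yellowish paper, red ink, black ink.
sample = [(250, 240, 200), (200, 40, 50), (30, 25, 28)]

step1 = replace_fuzzy(sample, PAPER, WHITE, 20)  # remove the background
step3 = replace_fuzzy(step1, RED, RED, 30)       # reddish -> red
step4 = replace_fuzzy(step3, BLACK, BLACK, 50)   # blackish -> black
print(step4)
```

Under this distance model, pure red is not within 50% of black, which would explain why step 4 can be so aggressive without swallowing the red text that step 3 just cleaned up.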

Step 5: Reduce the number of colors

Code:

convert test_04.tif -colors 5 test_05.tif

This collapses the number of colors in the image to 5. Going to 4 or fewer produces junk output, but 5 works.
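
Since the whole point of using ImageMagick was that it scripts easily, here is a minimal Python sketch that strings the five commands together. It only builds and prints the command lines; the options are copied from the steps in this post, while the function name build_commands and its parameters are hypothetical. The values were tuned for one image, so treat them as starting points.

```python
import shlex


def build_commands(source="test.tif", stem="test"):
    """Build the five convert command lines, each reading the previous output."""
    steps = [
        ["-fuzz", "20%", "-fill", "white", "-opaque", "#fff8cf"],  # step 1
        ["-modulate", "100,150,100"],                              # step 2
        ["-fuzz", "30%", "-fill", "red", "-opaque", "red"],        # step 3
        ["-fuzz", "50%", "-fill", "black", "-opaque", "black"],    # step 4
        ["-colors", "5"],                                          # step 5
    ]
    commands = []
    current = source
    for number, options in enumerate(steps, start=1):
        output = f"{stem}_{number:02d}.tif"
        commands.append(["convert", current, *options, output])
        current = output  # the next step works on this step's result
    return commands


for command in build_commands():
    print(shlex.join(command))
```

Each entry is an argument list, so it could be handed to subprocess.run(command, check=True) to actually run the step, assuming ImageMagick's convert is on the PATH.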

Encoding to djvu

This is where things got really interesting. Normally we get really good compression on black/white images, but those encoders only work on black/white and nothing more. My options were c44, cpaldjvu, and csepdjvu. c44 is made for images with many colors, such as photos. cpaldjvu is made for images with a few colors, like we have here. csepdjvu encodes the black/white portion separately, then combines it with the colored portions, so I had no idea how well or badly it would perform. I encoded the image at each step of the process, and here are the results (file sizes are in kB):
[image: encoding_efficiency.png]

cpaldjvu clearly pulls ahead as the winner once the number of colors in the image starts to drop. In fact, by step 5, the other encoders were increasing the image size, not decreasing it. c44 did better with more colors, as expected. csepdjvu held fairly steady throughout, and only beats c44 once there is more pure black text to work with. Overall, for a non-bitonal image, c44 is the best choice if no processing has been done on the image, csepdjvu if the image is a scantailor mixed mode with a lot of black text, and cpaldjvu if the image has been reduced to a handful of colors.

Of course, this data is from only one image, so it's laughably unscientific. I think it provides a good baseline expectation and an insight into what affects the performance of the djvulibre encoders.

Post by spamsickle »

I still haven't explored DJVU, but I'm impressed by the results you're getting and the way you're using ImageMagick. I wonder if it might be possible to use ImageMagick to create masks for the red and black text and bypass the ScanTailor step altogether. I may have an old King James around somewhere from which I could get a couple of pages to play with.

Post by cccshit »

No and no. This is a messy process and wrong encoding scheme.

1) Just as you convert to b/w by setting a threshold level for illumination, in order to convert to three colours you may as well only consider illumination and set two thresholds (this method is valid since black and white are unique RGB colours in terms of illumination).

2) CPalDjVu is lossless! Not to mention it supports quantization natively, so most of your IM commands are odd. A 600 DPI page should be around 10KB in a multipage DjVu file. Your results are far from that, even considering they are single pages.

There isn't yet any free automated method for good colour text compression using DjVu, because it would require exporting the coloured text coordinates and feeding them to djvumake after proper lossy encoding.

CAN WE HAVE A SAMPLE OF THE ORIGINAL PICTURE TAKEN PLEASE?

Post by spamsickle »

cccshit wrote: 1) Just as you convert to b/w by setting a threshold level for illumination, in order to convert to three colours you may as well only consider illumination and set two thresholds (this method is valid since black and white are unique RGB colours in terms of illumination).
By "illumination" do you mean the brightness of a given pixel? If so, I don't see how what you're proposing would improve the results strider1551 got with his "within 30% of red" method. In fact, I don't think using brightness would work, since both overall brightness of a pixel and "red channel" brightness of a pixel could be the same for a red pixel in the middle of a red letter as for a gray pixel on the edge of a black letter. It seems to me that only by actually considering the "redness" of the pixel (i.e., the difference between the red channel and the g/b channels) can one accurately identify the red text.
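
A small Python sketch of that point. The pixel values and thresholds are invented for illustration and are not measured from the scan. Lightness here is the HSL-style midpoint of the largest and smallest channel, which is only one reading of what "illumination" could mean, and the function names are hypothetical.

```python
def lightness(pixel):
    """HSL-style lightness: midpoint of the largest and smallest channel (0-255)."""
    return (max(pixel) + min(pixel)) / 2


def classify_by_lightness(pixel, dark=85, light=170):
    """Two-threshold scheme: dark -> black, middle band -> red, bright -> white."""
    value = lightness(pixel)
    if value < dark:
        return "black"
    if value < light:
        return "red"
    return "white"


def redness(pixel):
    """How far the red channel stands above the stronger of green and blue."""
    r, g, b = pixel
    return r - max(g, b)


def classify_by_redness(pixel, min_redness=60, dark=85):
    """Pick out red by channel difference first, then split the rest by lightness."""
    if redness(pixel) >= min_redness:
        return "red"
    return "black" if lightness(pixel) < dark else "white"


red_ink = (200, 40, 40)      # hypothetical pixel in the middle of a red letter
gray_edge = (120, 120, 120)  # hypothetical pixel on the edge of a black letter

for pixel in (red_ink, gray_edge):
    print(pixel, lightness(pixel), redness(pixel),
          classify_by_lightness(pixel), classify_by_redness(pixel))
```

Both pixels have the same lightness, so the two-threshold scheme puts them in the same class, while the channel difference separates them cleanly.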

Most likely this means I'm misunderstanding what you're saying, so could you perhaps say it a different way?

Post by cccshit »

Yes. You are absolutely right to point that out, and I was not.

Actually my proof was incorrect, and my method worked fine only because a despeckle filter was applied afterwards.

By illumination I meant lightness. It will probably work with brightness as well, but I really don't know the difference between the two.

Post by strider1551 »

cccshit wrote: CAN WE HAVE A SAMPLE OF THE ORIGINAL PICTURE TAKEN PLEASE?
Whoa... no need to raise voices. I can't attach the original because it is too large, but I'll attach a section of it.
cccshit wrote: No and no. This is a messy process and wrong encoding scheme.
Well yeah. This was an entirely new problem for me, and I had less than an hour to go from book-on-desk to printed copies. The next day I thought, "now what in the world did I do, and was it worth it?", and then I thought I might as well share it here. I know my first thought when looking at the original file was "well I'm screwed". I figure other people here would think the same, and I like expanding the realm of what we consider doable.

I look forward to learning from your expertise.
Attachment: sample.jpg

Post by daniel_reetz »

cccshit wrote: No and no. This is a messy process and wrong encoding scheme.
Note here from your friendly moderators. Keep the yelling to a minimum and don't jump on people when they're sharing. Or ever.

I was really glad to see this post and learned a few things myself. Your contributions here are held in very high esteem, Strider. Thanks for being so open and for consistently taking the risk of sharing.

Post by strider1551 »

daniel_reetz wrote: Your contributions here are held in very high esteem, Strider. Thanks for being so open and for consistently taking the risk of sharing.
Thanks Daniel.

It sounds like cccshit knows what he is talking about, so I do hope he gets back to us with some more details. In the meantime I've been spurred on to keep exploring the encoding part of the problem, and took some time to read a few white papers on DjVu to really try to understand how the format actually works. My first post got the DjVu file down to 148 kB, but my current best is 27.2 kB. A big problem was that the black characters were faintly outlined by red pixels, and removing those makes a big difference for cpaldjvu. I got it smaller still by learning how to use djvumake. I'm a bit too busy today to organize everything and write a good post explaining it all, but I will do so, hopefully before the weekend.
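
The post does not say how the red outline was removed, so the following Python sketch is only one possible approach and not the method actually used. It assumes the page has already been reduced to three labels (W for white, K for black, R for red) and recolors as black any red pixel that touches a black pixel, including diagonally. The function name, the labels, and the tiny test grid are all hypothetical.

```python
def remove_red_fringe(grid):
    """Recolor red pixels that touch a black pixel (8-neighborhood) as black.

    grid is a list of equal-length strings using W (white), K (black), R (red).
    Neighbors are looked up in the original grid, so the fix does not cascade.
    """
    result = [list(row) for row in grid]
    for y, row in enumerate(grid):
        for x, label in enumerate(row):
            if label != "R":
                continue
            touches_black = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy, dx) == (0, 0):
                        continue
                    if 0 <= ny < len(grid) and 0 <= nx < len(grid[ny]):
                        if grid[ny][nx] == "K":
                            touches_black = True
            if touches_black:
                result[y][x] = "K"
    return ["".join(row) for row in result]


page = [
    "WWWWWWW",
    "WRRRWWW",
    "WRKRWRR",  # a black pixel ringed by red, next to genuine red text
    "WRRRWRR",
    "WWWWWWW",
]

for line in remove_red_fringe(page):
    print(line)
```

Recoloring the fringe as black thickens the glyph by one pixel; filling with white instead would be an equally plausible choice, and which one looks better depends on the scan.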

One reason I'm pursuing this further, by the way, is the recent discussion on vectorization. If you take a scantailor bitonal tiff and run it through potrace, outputting a .pgm file, the image really does look a lot better at ridiculous levels of zoom (obviously pgm is a raster format, not a vector format, so there is some blockiness the more you zoom). Of course the image is now grayscale and not bitonal, so you don't get the same level of compression. I'm hoping that the more I learn about efficiently compressing palettized images (images with few colors), the closer I can get grayscale to an acceptable size for people who aren't satisfied with the perceived quality of the bitonal. And who knows, maybe OCR works better with a grayscale image?

...but that's all crazy thoughts that I have yet to explore (and a new topic if it actually works). Oh, and I'll try attaching the .tif I'll be working from, since I guess the filesize limits are gone? It's already been through scantailor and had the yellow-ish background removed.
Attachment: test_01.tif

Post by Misty »

strider1551 wrote: And who knows, maybe ocr works better with a grayscale image?
I would doubt that. OCR works on bitonal text; if you feed a greyscale or colour image into OCR software, it converts it internally to bitonal for recognition. The reason that using the original scan can in some circumstances produce better results than ST images is that, in certain situations, the OCR's internal bitonalization produces results more suitable for OCR than ST does. In the case of vectorization you're talking about going scan --> bitonal --> vector --> bitonal, and since the vectorization can't produce any truly new information I would be surprised if it helped results at all. In a worst case scenario it could slightly reduce accuracy.
Last edited by Misty on 10 Nov 2010, 12:29, edited 1 time in total.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

Post by daniel_reetz »

strider1551 wrote: since I guess the filesize limits are gone?
Yep, and it's going to stay that way forever unless there is a problem. I actually thought I'd fixed that a year ago, but it turns out that images have their own category, buried somewhere in phpBB... Next I need to see if I can get it to thumbnail your TIFF.