Preserving colored text

strider1551 · Post by **strider1551** » 12 Nov 2010, 08:57

My current best is 17.3 kB. The key to getting a smaller djvufile is to rely upon the layered structure of the djvu format. To put things simply, there is a foreground, a background, and a mask. The mask is a simple black/white image of the text, encoded with cjb2. The black portions of the mask will use the foreground image for color information and the white portions will use the background image (or white by default). Both foreground and background are typically iw44 images as made with c44. From here on I will only talk about the mask and foreground layers, since I'm only concerned about colored text.

So to show this visually, this would be the mask plus the foreground creating the final djvufile.
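For anyone who wants to check which of these layers an existing file contains, here is a minimal sketch using `djvudump` from DjVuLibre, which lists the chunks in a DjVu file. The helper name `list_layers` is my own placeholder, and it only looks for the three chunk types discussed in this post.

```shell
#!/bin/bash

# list_layers is a hypothetical helper name, not part of DjVuLibre.
# It prints which of the layer chunks discussed above appear in a DjVu file:
#   Sjbz = bitonal mask (as written by cjb2)
#   FG44 = IW44 foreground
#   BG44 = IW44 background
list_layers() {
    djvudump "$1" | grep -oE 'Sjbz|FG44|BG44' | sort -u
}

# Example (assumes test_01.djvu exists in the current directory):
# list_layers test_01.djvu
```

A file built from only a mask and a foreground should list Sjbz and FG44 but no BG44.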

Creating the foreground image is fairly straightforward. Take the colorized original, threshold it to black and white, invert the black and white colors, then use that as a mask for c44. The code is below, and the final djvufile size was 49.5 kB. The test_01.tif file is attached to a post above.

Code:

# Create the iw44 foreground.
convert test_01.tif _test_01.ppm
convert _test_01.ppm -threshold 99% -negate _foreground_mask.pbm
c44 -dpi 600 -decibel 30 -mask _foreground_mask.pbm _test_01.ppm _foreground.djvu
djvuextract _foreground.djvu BG44=_foreground.iw4

# Create the text layer that will be colored.
convert test_01.tif -threshold 99% _text.tif
cjb2 -dpi 600 -lossy _text.tif _text.djvu

# Put it all together.
djvumake test_01.djvu INFO=,,600 Sjbz=_text.djvu FG44=_foreground.iw4
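The scripts in this thread call several command-line tools from ImageMagick and DjVuLibre. As a rough sketch, assuming a POSIX-style shell, the following checks that each one is on the PATH before anything is run. The helper name `missing_tools` is hypothetical.

```shell
#!/bin/bash

# missing_tools is a hypothetical helper: it prints every command from its
# argument list that cannot be found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands used by the scripts in this post (ImageMagick and DjVuLibre).
missing_tools convert composite c44 cjb2 djvuextract djvumake
```

If nothing is printed, all six commands were found.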

As always, fewer colors, and colors isolated into larger segments, produce smaller foreground images. In my case I don't want the true color of the text as captured, I just want red and black. So modifying the colors not only creates an image that I consider better looking, but also one that is smaller. The important part is to eliminate the stray black pixels in the red text and vice versa, not to keep the shape of the text (the shape lives in the mask layer, not the foreground layer), hence the blurring and aggressive -fuzz settings. This time the final djvufile size is 17.3 kB. Note that the bitonal djvufile produced by cjb2 alone from this image was 12.3 kB, so adding color costs only about 5 kB.

Code:

#! /bin/bash

# Create a better base image to work with.  Bring out the black and red colors.
convert test_01.tif -fuzz 40% -fill black -opaque black -modulate 100,150,100 -fuzz 30% -fill red -opaque red _base.tif

# Isolate black and red colors to only the sections of the image where those colors should be.
# Note that we can "lose" the shape of the characters, all we need to do is get red and black
# in the general areas they should be in.
convert _base.tif -fill white +opaque black -despeckle -blur 10 -fuzz 50% -fill white -opaque white -colors 2 _black.tif
convert _base.tif -fill white +opaque red -despeckle -blur 10 -fuzz 50% -fill white -opaque white -colors 2 _red.tif
composite -compose multiply _red.tif _black.tif _composite.ppm

# Create the iw44 foreground.
convert _composite.ppm -threshold 99% -negate _foreground_mask.pbm
c44 -dpi 600 -decibel 30 -mask _foreground_mask.pbm _composite.ppm _foreground.djvu
djvuextract _foreground.djvu BG44=_foreground.iw4

# Create the text layer that will be colored.
convert _base.tif -threshold 99% _text.tif
cjb2 -dpi 600 -lossy _text.tif _text.djvu

# Put it all together.
djvumake test_01.djvu INFO=,,600 Sjbz=_text.djvu FG44=_foreground.iw4
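To put the sizes quoted in this post side by side, here is a small sketch that does the arithmetic with awk. The helper names `extra_kb` and `percent_of` are my own. The numbers are the ones reported above: 49.5 kB for the first attempt, 17.3 kB for the cleaned-up version, and 12.3 kB for the bitonal cjb2 file.

```shell
#!/bin/bash

# Hypothetical helpers for comparing the file sizes quoted in this post (kB).
# extra_kb A B   -> A minus B, one decimal place
# percent_of A B -> A as a percentage of B, one decimal place
extra_kb() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a - b }'; }
percent_of() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'; }

extra_kb 17.3 12.3     # color overhead over the bitonal cjb2 file, in kB
percent_of 17.3 49.5   # cleaned-up result as a percentage of the first attempt
```

In other words, the color layer adds about 5 kB on top of the bitonal file, and the cleaned-up version is roughly a third the size of the first attempt.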

ibr4him · Post by **ibr4him** » 15 Nov 2010, 03:08

Can this exact processing be done in any other editor or Photoshop CS5? I'm having a hard time installing ImageMagick on Mac, tons of errors.

univurshul · Post by **univurshul** » 15 Nov 2010, 08:58

see notes and sample at bottom of page

Post by **daniel_reetz** » 15 Nov 2010, 11:22

I won't get into the procedure in this thread, because I used commercial software to do it; I don't want to veer off course from the good work and service to the DIY community Strider1551 is doing here.

As a moderator, I want to make clear: It is absolutely fine to talk about commercial software and/or how to use it on this forum at any time. Alternative approaches are welcome although sometimes it *is* appropriate to start a new thread rather than drop in somewhere, but that's up to your judgment. Saying "you can do this with Acrobat too" is a helpful thing to do. Or starting another thread ("How to preserve colored text using tool Z") is also fine. There is no way that I would discourage sharing information and helpful tutorials, ever.

I think the important thing here is just to keep doing what you're doing -- sharing information and testing out tools. The value will be inherently apparent (already is, IMO). The whole "closed VS open" argument has been had a million jillion times all over the internet (and for good reason) -- we don't really need to repeat that here, ONLY because it has been done (and done well) elsewhere and people tend to get fighty over it.

Many people from the Open Source community (in the very broad sense) believe that closed software reduces your freedom and does you harm. I am not going to say if this is right or wrong, but it is important to recognize that software is deeply politicized. By virtue of their beliefs, they will have strong opinions and perhaps react in a way that seems disproportionate, unless you agree/can see it from their POV. However, on the flip side, these same people are often the programmers who are coding up Free alternatives to closed applications. They aren't stopping to throw stones - they are throwing stones by building things.

I'm sure my views on Open Source and Free Software are plainly visible to anyone who uses this forum. But I am deeply invested in getting people scanning as much as they can, as quickly as they can, and as easily as they can. That may mean using commercial software, which means that this forum should be the source of the best tutorials on using commercial software. I know you are among the best, if not the best source of knowledge and tutorials on that right now, so don't hold back too much.

ibr4him · Post by **ibr4him** » 15 Nov 2010, 11:30

@univurshul,

No problem, I use PDFs only atm, and OCR doesn't matter to me because 99% of the books I scan are in non-English languages (Arabic, Urdu, etc.).

univurshul · Post by **univurshul** » 15 Nov 2010, 13:17

Alright then, no worries. Just rip it through Scan Tailor with color + equalized illumination. Compress it, bind it, done deal.

Don't forget white balance.

univurshul · Post by **univurshul** » 15 Nov 2010, 13:27

...and Strider's example is an extreme one on biblical vellum-type paper. Most books shouldn't give you background issues like this.

univurshul · Post by **univurshul** » 16 Nov 2010, 01:55

ibr4him wrote: Can this exact processing be done in any other editor or Photoshop CS5? I'm having a hard time installing ImageMagick on Mac, tons of errors.

Color text preservation for GUI OSX users. 4 minutes of tone-curve adjustment tests in an image editor. White Balance correction. Normal ST processing. Full vectorization OCR searchable. Quality resolution. No bleaching. 26k PDF finished file size:

reggilbert · Post by **reggilbert** » 21 Nov 2010, 21:07

Misty wrote:
strider1551 wrote: And who knows, maybe OCR works better with a grayscale image?
I would doubt that. OCR works on bitonal text; if you feed a greyscale or colour image into OCR software, it converts it internally to bitonal for recognition. The reason that using the original scan can in some circumstances produce better results than ST images is that, in certain situations, the OCR's internal bitonalization produces results more suitable for OCR than ST does. . . . .

I have nearly zero technical knowledge on OCR or really anything computer / graphical, but I have scanned quite a few books over the last five years on a Plustek OpticBook 3600 /3600 Plus, usually pairs of pages scanned to 300 dpi BMP format then assembled and OCR'd in Acrobat Pro, and at some point about a year ago determined (with a half dozen comparisons) that the Acrobat OCR did clearly better on greyscale scans vs. b&w ones. Close examination of the BMP scans in FastStone Image Viewer (at 300%) showed pretty broken letters in the b&w in comparison to the greyscale. Speaking with a complete lack of knowledge about imaging technology, it seems Adobe OCR is simply better at determining the edge of a letter when the on/off, 0/1, b/w decision has not already been made for it by the scanner software. I have no idea where Scan Tailor fits into this - is there reason to believe that the algorithm it uses to deduce b/w from original color or greyscale text-image data is superior to that of Acrobat Pro applied to the same data?

BTW, if anybody has either 1) a better suggestion for the employment of my consumer scanner than my current strategy of 300 dpi BMP scan format / Acrobat Pro assembly / Acrobat Pro OCR, in terms of final image and OCR quality (without massively increasing the total human processing time), or 2) knowledge that DIY / camera-based scanning would likely lead to better image and OCR quality (again without massively increasing required human time), I would really appreciate it, as I scan many of the books I must read as part of my effort to get a doctorate in history, and I worry about future searches for stuff I know I have read that fail solely due to bad OCR.

Anonymous1 · Post by **Anonymous1** » 22 Nov 2010, 00:12

I would discourage a flatbed scanner. It hurts the bindings of books, it's slow, and it doesn't look cool.
Also, I would check out Scan Tailor, as it is more or less the best in the business. It splits pages, finds content areas, makes it selectively bitonal (images are excluded), and all you do is run djvubind on those files and you have an almost perfect digital book with OCR!
