Colored Text

strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Colored Text

Post by strider1551 »

Oh, ok. I thought you were doing that for each page, which would be a crazy amount of effort.
clemd973 wrote:Can you sharpen the text in your sample?
Not really. I would have to introduce more colors around the edges of the letters to smooth the transition from black to white (or color to white), but compression becomes more efficient with fewer colors. I might play with it just to see what happens, but I'm not expecting that it would be worth it. (ooohh... or there's a thought: keep the edges in a low resolution background layer... probably still won't be worth it)

You mentioned Clear Scan, but in a way that's a whole other world. From what I understand, Clear Scan turns the text into vector images, whereas djvu is made for raster images. A vector image can be zoomed indefinitely, but a raster image is meant to be viewed at a specific DPI and not zoomed. Now, from what I've seen in my samples, if the raw input from the camera has sufficient resolution (mine come in at ~370 dpi for smaller books), there is no visual difference between vector and raster until the zoom exceeds 300% or so.

Even if I could get the text sharper, I don't think I would work that into djvubind. Other than encoders that have lossy compression, djvubind does no modifications to the input images, enhancements or otherwise. I intend to keep it that way, so that there is a clear difference between post-processing software like Scantailor which enhances images and binding software like djvubind that puts everything together into a single file.
clemd973 wrote:[P]retty soon I'll be scanning the funeral rite, also containing colored text, if you're interested in those images as well. Let me know.
Sure, that would give me two different sources, which should help test things.
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Colored Text

Post by clemd973 »

strider1551 wrote:Also, care to post the file size of the final pdf? It will give me a reference to work from.
Strider, I finished processing the RCIA Ritual, and the 400+ page final pdf saved at 6.9 MB.

Re: Colored Text

Post by strider1551 »

Sorry to take so long to get back to you on this (pesky thing called real life kept interrupting). My end result was a djvu file weighing in at 11.9 MB. Compared to your pdf, the size is a bit bloated because each page has its own shape dictionary instead of sharing one dictionary across several pages. I can't figure out any way to produce a multipage, colored foreground djvu file with a shared dictionary using the existing open source tools, whether alone or in combination (and if someone out there knows an open source solution, please share). If the file had a shared dictionary, the file size would be closer to or lower than your pdf. Of course, our two files can't be directly compared because we processed the raw images separately.

The script I used and a sample screenshot from djview are below. Scantailor was set on color/greyscale mode at 600 dpi without white margins or equalized illumination. The script doesn't bother to preserve the original colors, but instead tries to force everything to black, white, or red.

Code:

#! /bin/bash

set -e -u

TMP=$(mktemp -d)

for i in page_*.tif; do

    # Create a better base image to work with by getting rid of background junk.
    convert "${i}" \
            -fuzz 15% -fill white -opaque grey \
            "${TMP}/_base.tif"

    # Isolate black and red colors to only the sections of the image where those colors should be.
    # Note that we can "lose" the shape of the characters; all we need is to get red and black
    # in the general areas.
    convert "${TMP}/_base.tif" \
            +dither -posterize 2 -fill black -opaque blue \
            -fill white +opaque black \
            "${TMP}/_black.tif"

    convert "${TMP}/_base.tif" \
            +dither -posterize 2 -fill black -opaque blue \
            -fill white -opaque black \
            -blur 10 -fuzz 50% -fill white -opaque white \
            -colors 2 \
            "${TMP}/_color.tif"

    composite -compose multiply "${TMP}/_color.tif" "${TMP}/_black.tif" "${TMP}/_composite.ppm"

    # Create the iw44 foreground.
    convert "${TMP}/_composite.ppm" -threshold 99% -negate "${TMP}/_foreground_mask.pbm"
    c44 -decibel 16 -crcbfull -mask "${TMP}/_foreground_mask.pbm" "${TMP}/_composite.ppm" "${TMP}/_foreground.djvu"
    djvuextract "${TMP}/_foreground.djvu" BG44="${TMP}/_foreground.iw4"

    # Create the text layer that will be colored.
    convert "${TMP}/_base.tif" -threshold 85% "${TMP}/_text.tif"
    cjb2 -dpi 600 -lossy -losslevel 120 "${TMP}/_text.tif" "${TMP}/_text.djvu"

    # Put it all together.
    djvumake "${TMP}/out.djvu" INFO=,,600 Sjbz="${TMP}/_text.djvu" FG44="${TMP}/_foreground.iw4"
    if [ -f final.djvu ]; then
        djvm -i final.djvu "${TMP}/out.djvu"
    else
        mv "${TMP}/out.djvu" final.djvu
    fi

done

rm -r "${TMP}"

A screenshot:
sample.tif

Re: Colored Text

Post by clemd973 »

Nice work! Clear and vibrant, and a nice shade of red. Just curious, how long would it take you to process these 400+ pages from start to finish?

Re: Colored Text

Post by strider1551 »

The 425-page file took 240 minutes to produce, or exactly 4 hours. That works out to about 34 seconds per page.
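
For anyone who wants to redo the per-page arithmetic for their own run, here is a minimal sketch. It assumes awk is available for the floating-point division, and the variable names are only for illustration.

```shell
#!/bin/bash
# Per-page time = total minutes * 60 / number of pages.
pages=425
minutes=240

# Bash arithmetic is integer-only, so awk does the division.
per_page=$(awk -v m="${minutes}" -v p="${pages}" 'BEGIN { printf "%.1f", m * 60 / p }')

echo "${per_page} seconds per page"   # prints: 33.9 seconds per page
```

Swap in your own page count and run time to get a rough estimate for a different book.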
ibr4him
Posts: 102
Joined: 18 Oct 2010, 10:36

Re: Colored Text

Post by ibr4him »

How do I use this script? I want to remove a pale yellow background from a book. The magic wand in Photoshop would be perfect, but of course I can't use it for every book.

Many thanks!

Re: Colored Text

Post by strider1551 »

The script above would do more than you need, since the bulk of it tries to keep only the text and to render it as either black or red. All you should need is a mogrify command. This would clean up all the .tif images in the directory in which it is run, and then you could bind the images into PDF/Djvu/whatever with the program of your choice.

Code:

#! /bin/bash
for i in *.tif; do
    mogrify -fuzz 15% -fill white -opaque grey "${i}"
done
Mogrify is part of the ImageMagick suite - pretty much the same thing as convert, except it modifies the image you give it instead of making a new image.
  • -opaque This is the color that will be replaced. You can also use a hex color code.
  • -fill This is the color that will be inserted instead.
  • -fuzz This also replaces colors within a certain percentage range of the -opaque value.
Obviously, make a backup of your images before running mogrify on anything. Try it with a few samples first - you will need to adjust -opaque and -fuzz for your specific circumstances. For the -opaque value, it will probably be easiest to open an image in Photoshop/GIMP/whatever and find a hex code for the background color you want to remove. The more accurate the -opaque value, the lower you can set the -fuzz value, and the less chance of mucking up text/images that you want to keep.
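
The backup-then-test advice could be sketched roughly like this. The hex code and fuzz value are made-up placeholders rather than values from any real scan, and the clean_background function name is hypothetical; only the mogrify options come from the command above.

```shell
#!/bin/bash
set -e -u

# Assumed values for illustration only: substitute the hex code you sampled
# from your own scans and tune the fuzz percentage on a few test pages.
BG_COLOR="#f3ecc9"
FUZZ="10%"

# clean_background DIR
# Copies every .tif in DIR to DIR/backup (unless a copy is already there),
# then replaces colors near BG_COLOR with white in the original file.
clean_background() {
    local dir="$1"
    mkdir -p "${dir}/backup"
    for i in "${dir}"/*.tif; do
        [ -e "${i}" ] || continue    # the glob matched nothing
        local copy="${dir}/backup/$(basename "${i}")"
        [ -e "${copy}" ] || cp "${i}" "${copy}"
        mogrify -fuzz "${FUZZ}" -fill white -opaque "${BG_COLOR}" "${i}"
    done
}
```

Calling clean_background ./samples on a folder holding a handful of pages lets you check the result before touching the whole book. Because an existing backup copy is never overwritten, you can restore the originals and retry with a different color or fuzz value.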

Let me know if that is enough to get you going.
Last edited by strider1551 on 17 Jul 2011, 14:15, edited 1 time in total.

Re: Colored Text

Post by ibr4him »

Great, so I should just type this in Terminal (mac)?

Thanks!

Re: Colored Text

Post by strider1551 »

Yep, with one little revision - I noticed I forgot a "done" to finish off the for loop. Of course, no need for the "#! /bin/bash" line if you're already in the terminal.
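
In Terminal, that might look roughly like the sketch below. It assumes ImageMagick may not be installed yet, so it checks for mogrify first. The check_mogrify helper and the folder path are hypothetical.

```shell
# Report whether ImageMagick's mogrify can be found on the PATH.
check_mogrify() {
    if command -v mogrify >/dev/null 2>&1; then
        echo "mogrify found"
    else
        echo "mogrify not found: install ImageMagick first"
        return 1
    fi
}

# Then, from the folder that holds the scans (hypothetical path), the whole
# loop can be typed as a single line:
#   cd ~/scans/my_book
#   check_mogrify && for i in *.tif; do mogrify -fuzz 15% -fill white -opaque grey "${i}"; done
```

The last two commands are left as comments so that pasting the block does not modify any images by accident.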