Colored Text

Moderator: peterZ

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Colored Text

Post by clemd973 »

I'm about to begin scanning a number of books with colored text (mostly red and black). Is it possible to keep the text color, and if so, how do I do that? I'm not looking to reinvent the wheel if there are already threads out there. Thanks for the help.
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Colored Text

Post by Misty »

It's an interesting question. It isn't a problem I've run into personally, but I could see it being a real issue for some people.

Strider was looking into the issue earlier, so you should check his thread: http://diybookscanner.org/forum/viewtopic.php?f=3&t=655 He might be able to help you.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
reggilbert
Posts: 49
Joined: 28 Sep 2010, 19:57
Number of books owned: 3000
Location: Buffalo, New York

Re: Colored Text

Post by reggilbert »

clemd973 wrote:Is it possible to keep the text color, and if so, how do I do that?
Forgive me for potentially being dense, but I am not clear on the problem. Does simply compiling color photo images and then OCRing the result yield too big a final file and/or cause OCR difficulties? I scan full-color books on a regular scanner, combine the resulting 24 MB images, and then OCR them unmodified in Acrobat Standard (perhaps with some bulk cropping to fine-tune after combining). The resulting file for a 300-page book (150 images) is, incredibly, 80 MB at most (much less if only some pages are color), with the OCR quite accurate and the resolution seemingly no different from the original images: it can be magnified to 400 percent and still be pretty sharp. Acrobat Standard is expensive at nearly $300, but $80 versions are available to students, and older versions of Acrobat (at least back to 7; it is now at version 10) work just as well for this purpose and are available on eBay for under $100.

Reg Gilbert
ceeann1
Posts: 106
Joined: 17 Nov 2010, 20:00
E-book readers owned: Several Palm PDA's
Number of books owned: 700
Location: Albuquerque, New Mexico

Re: Colored Text

Post by ceeann1 »

Misty and I were chatting the other day about capturing pictures in true color and using EyeFi cards to do this. It may be helpful; the link is below.
http://diybookscanner.org/forum/viewtopic.php?f=1&t=722
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Colored Text

Post by strider1551 »

Yeah, a more specific question would help. On the software end, my first questions would be "what ebook format do you prefer?" and "what platform are you working on?" General advice: before you scan an entire book, experiment with just one page. Take multiple shots with different settings on your camera, lighting adjustments, etc. to see what produces images with the truest color. Then experiment with getting the images into your preferred format with the best color and smallest file size.

The file size will get a lot smaller if you palettize the image (reduce the number of colors), which is what I was experimenting with in the thread Misty mentioned. Obviously, if you care about preserving/archiving the book exactly as it appears, palettization might be too much image modification. (Palettization? Anyone know if I'm making words up?)
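
To put rough numbers on why reducing the colors shrinks the file, here is a minimal Python sketch of the uncompressed arithmetic only. The 2000x3000 pixel page size and the function names are assumptions made up for illustration, and real formats such as DjVu or PDF compress on top of this, so actual savings will differ.

```python
def bits_per_pixel(num_colors: int) -> int:
    """Smallest whole number of bits that can index num_colors palette entries."""
    if num_colors < 2:
        raise ValueError("a palette needs at least 2 colors")
    return (num_colors - 1).bit_length()


def raw_truecolor_bytes(width: int, height: int) -> int:
    """Uncompressed 24-bit RGB: 3 bytes per pixel."""
    return width * height * 3


def raw_palettized_bytes(width: int, height: int, num_colors: int) -> int:
    """Uncompressed packed palette indices plus a 3-byte RGB entry per color."""
    index_bits = width * height * bits_per_pixel(num_colors)
    index_bytes = (index_bits + 7) // 8  # round up to whole bytes
    return index_bytes + 3 * num_colors


if __name__ == "__main__":
    width, height = 2000, 3000  # hypothetical page size, not from the thread
    print("truecolor:", raw_truecolor_bytes(width, height), "bytes")
    print("3 colors: ", raw_palettized_bytes(width, height, 3), "bytes")
```

A three-color palette (say black, white, and red) needs 2 bits per pixel instead of 24, so the raw pixel data is roughly one twelfth the size before any compression is applied.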

I am interested in playing with colored text more and seeing if I could get something into djvubind for it, but I need a larger sample set to work with... heck, I could even get an ftp server setup for you to send me a batch of original images.

In any case, please share whatever you learn with the rest of us.
dansheffler

Re: Colored Text

Post by dansheffler »

I have several fine press editions that are entirely text but have the occasional red chapter title. I want these parts of the text to be binarised just like they are in Scan Tailor, but then be able to go back and select the text and make it red. Since these are not images, it is not all that important to keep the true color, only to set them off from the main text, and the file size should be kept low. I try to keep most of my 300-500 page books under the 5 MB mark. I'm also thinking of Bibles where the words of Christ are in red (a feature that I carefully try to avoid when purchasing Bibles). It seemed like strider's mask solution would do this well, but I'm not sure how to do it in PDF.
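
One way to picture "binarise, but remember which text was red" is to build two masks per page: one marking all ink and one marking only the red ink. The sketch below is a plain-Python illustration under stated assumptions; it is not strider's actual mask method and not something Scan Tailor does. The thresholds `ink_threshold` and `red_margin` are hypothetical starting values that would need tuning against real scans, and pixels are assumed to arrive as rows of (r, g, b) tuples from whatever imaging library you use.

```python
def classify_pixel(rgb, ink_threshold=128, red_margin=60):
    """Label one (r, g, b) pixel as 'red', 'black', or 'paper'.

    Both thresholds are illustrative guesses, not tuned values.
    """
    r, g, b = rgb
    # Red ink: the red channel clearly exceeds both green and blue.
    if r - max(g, b) >= red_margin:
        return "red"
    # Dark ink: approximate luminance (integer Rec. 601 weights) is low.
    luminance = (299 * r + 587 * g + 114 * b) // 1000
    if luminance < ink_threshold:
        return "black"
    return "paper"


def split_masks(rows):
    """Return (ink_mask, red_mask) as nested lists of booleans."""
    labels = [[classify_pixel(p) for p in row] for row in rows]
    ink_mask = [[label != "paper" for label in row] for row in labels]
    red_mask = [[label == "red" for label in row] for row in labels]
    return ink_mask, red_mask


# Tiny made-up 2x2 "page": paper, black ink, red ink, paper.
page = [
    [(240, 235, 220), (30, 28, 35)],
    [(190, 40, 45), (245, 240, 230)],
]
ink_mask, red_mask = split_masks(page)
```

The ink mask plays the role of the usual bitonal page, and the red mask records where the foreground should be painted red instead of black. How the two masks would then be written into a PDF is a separate question that this sketch does not answer.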
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Colored Text

Post by clemd973 »

strider1551 wrote:I am interested in playing with colored text more and seeing if I could get something into djvubind for it, but I need a larger sample set to work with... heck, I could even get an ftp server setup for you to send me a batch of original images.
Strider, if you set up the ftp server, I'll send you over 403 pages of the Rite of Christian Initiation of Adults. I just scanned it in to experiment with colored text as well. It's all red/black as are all of the ritual books. You should recognize it, and it should be plenty enough pages for you.
strider1551 wrote:In any case, please share whatever you learn with the rest of us.
In answer to your request, here goes:

If you've read any of my posts, you know that I'm a closed-source junkie! I have neither the expertise nor the time to reinvent the wheel, although I admire the effort and ingenuity of open-source software, as well as that of coders. That being said, I arrived at my results by using Adobe Lightroom 3, Scan Tailor, and Adobe Acrobat Professional X.

In the sample page below, I began with the raw .jpg and imported it into LR3, where I enhanced the red text. After enhancement, I exported it as a .tif and imported it into Scan Tailor, going through steps 1-5 as usual. However, at step 6, "Output," it's important to select under "Mode," <color/grayscale>, and leave "White Margins" and "Equalize Illumination" UNCHECKED. This will be an important step once in Acrobat.

Once the .tif was output from Scan Tailor, I imported it into Acrobat and ran the OCR with "Clear Scan," which enhances the text even more. Then I went to "Preflight" under <Edit>, selected "Create separate layers for vector objects, text, and images," and ran "Analyze and Fix." This separates the text and vector objects from images (the splotchy background, which is the actual paper page from the original .jpg) and puts each on a separate layer, although there are no vector objects in my sample page below. The file then has to be closed and reopened in order to access layer navigation in the left-hand pane. After that, you can change the layer properties to hide the background and leave only the text layer visible.

The first image below is how the page looks when output from Scan Tailor. The second image is the same page, but with the "background" image layer hidden.
The page as it appears when output from Scan Tailor. Although not bad, notice the splotchy, off color, background.
The page as it appears when exported from Acrobat X with the image layer hidden. Much cleaner!!!
Again, I'm a closed-source junkie, but perhaps the steps outlined above can help some of you open-source developers and coders achieve the same results. And perhaps a layer feature could be developed for Scan Tailor, which would be too cool.

I am in the process of dealing with a small problem, though. Some of the pages in this particular book contain musical scores. When going through the layering process, the musical score gets chopped up: the words to the songs are recognized as text, the bars in the score are recognized as vector objects, and the musical notes are recognized as images. Therefore, when I hide the image layer, the musical notes go away. The workaround is to deal with it in Scan Tailor by marking the paragraphs of red text as "picture zones" (which maintains the red text) and processing the page in "mixed" mode, again with "White Margins" and "Equalize Illumination" UNCHECKED. From that point, it is output from ST and carried through processing in Acrobat as described above. This seems to work, but I have yet to have an opportunity to process all 430 pages. Your comments would be helpful.
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Colored Text

Post by strider1551 »

clemd973 wrote:Strider, if you set up the ftp server, I'll send you over 403 pages of the Rite of Christian Initiation of Adults.
Excellent. I'll send you the server details through a PM shortly. Also, care to post the file size of the final pdf? It will give me a reference to work from.
clemd973 wrote:However, at step 6, "Output," it's important to select under "Mode," <color/grayscale>, and leave "White Margins" and "Equalize Illumination" UNCHECKED. This will be an important step once in Acrobat.
Why is that? I would think that the equalized illumination would effectively eliminate the background, and then you wouldn't have to hide the image layer and would still see the musical notes.
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Colored Text

Post by clemd973 »

strider1551 wrote: Also, care to post the file size of the final pdf? It will give me a reference to work from.
File size of the single page .pdf is 82KB. Haven't processed the entire book yet. Still working out some particulars in the workflow.
clemd973 wrote:However, at step 6, "Output," it's important to select under "Mode," <color/grayscale>, and leave "White Margins" and "Equalize Illumination" UNCHECKED. This will be an important step once in Acrobat.
strider1551 wrote:Why is that? I would think that the equalized illumination would effectively eliminate the background, and then you wouldn't have to hide the image layer and would still see the musical notes.
For some reason, when "Equalize Illumination" was selected, it effectively "bleached" the background, yet not well enough, and it prevented Acrobat from recognizing the background as an "image," so Acrobat failed to put it on a layer. While it seems that it would "effectively eliminate the background," it didn't do as good a job as hiding the background layer altogether. While it was an "acceptable" output, it wasn't the best.
Anonymous1

Re: Colored Text

Post by Anonymous1 »

Closed source junkies and their overkill solutions ;)

I do this all the time with just ImageMagick. What you're looking for is just converting the image into a three-color palette, namely black, white, and red. I'll post a script when I get home, but here's something you can look at (just don't scroll up. It never ends): http://www.imagemagick.org/Usage/quantize/#handling.
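
As a rough picture of what a three-color palette conversion involves, here is a minimal Python sketch that maps every pixel to the nearest of three fixed colors. It is not Anonymous1's script and not ImageMagick's implementation; the palette values and function names are assumptions, and it uses plain squared distance in RGB with no dithering.

```python
# Illustrative palette; the exact RGB values are assumptions.
PALETTE = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
}


def nearest_color(rgb, palette=PALETTE):
    """Return the palette color closest to rgb by squared RGB distance."""
    def dist2(candidate):
        return sum((a - b) ** 2 for a, b in zip(rgb, candidate))
    return min(palette.values(), key=dist2)


def remap(rows, palette=PALETTE):
    """Map every pixel in a nested list of (r, g, b) tuples onto the palette."""
    return [[nearest_color(pixel, palette) for pixel in row] for row in rows]


# Tiny made-up sample: off-white paper, dark ink, red ink.
sample = [[(240, 235, 220), (30, 28, 35), (190, 40, 45)]]
flattened = remap(sample)
```

Dark red ink can land closer to black than to pure red (for example, (120, 20, 20) maps to black with this palette), so it may help to replace the red entry with a color sampled from the actual ink on the page.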