Converting Color/Grayscale Text Scans to Black & White

cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Converting Color/Grayscale Text Scans to Black & White

Post by cday »

Fabian wrote:You're right, I should have chosen one of the many examples where no B&W version is offered. Unfortunately, your OCR'd copy is not suitable for my purposes because the original typeface is sacrificed. File size and copy-and-paste is important to me too but I'd also like to preserve as much fidelity to the original as possible.

Is the bottom line, therefore, that each page must be individually processed in order to remove the tint? There is no program that will do it for me in one book-length pass?
cday wrote:Based on a quick test, I think it may be possible to do what you want directly using the cross-platform freeware XnConvert, although you would still need to OCR the resulting file if you need searchability. Note that you will also need to install the freeware utility Ghostscript and to set a DPI value in XnConvert to obtain suitable image quality. As there is little or no documentation other than the XnView forum, the easiest way to learn the program is to explore the interface fully and run some tests.
Out of interest, I did a quick test using the above freeware program, which can be run on a Mac, and obtained a proof-of-concept result, although as stated above, if searchability is needed the output file would have to be run through OCR software. There are a number of practical considerations, though: a little knowledge of image processing would be quite helpful, and the usual file-size issues that apply to creating multi-page image PDFs also apply to the output file.

For my test I opened the downloaded colour file with the yellow tint in an image editor and applied a fairly aggressive levels adjustment to a typical page, to determine white and black points that effectively removed the yellow tint without causing too much collateral damage to the text. I then ran the file through XnConvert to apply that levels adjustment automatically to each page and save the result as a PDF. The processing was direct PDF-to-PDF, without extracting the images from the file, as originally requested.
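For readers who want to see what a levels adjustment does numerically, here is a minimal Python sketch. It is not XnConvert's implementation: it assumes 8-bit grayscale pixels held in plain nested lists, a simple linear stretch, and hypothetical black and white points (60 and 190). The function names are made up for illustration.

```python
def apply_levels(pixels, black, white):
    """Linearly stretch 8-bit gray values so `black` maps to 0 and `white` to 255.

    Values at or below `black` are clipped to 0 (pure black); values at or
    above `white` are clipped to 255 (pure white), which is what removes a
    light paper tint.
    """
    if not 0 <= black < white <= 255:
        raise ValueError("need 0 <= black < white <= 255")
    span = white - black
    # Precompute a 256-entry lookup table using integer arithmetic.
    lut = [min(255, max(0, (v - black) * 255 // span)) for v in range(256)]
    return [[lut[v] for v in row] for row in pixels]


def process_book(pages, black, white):
    """Apply the same black/white points to every page (hypothetical batch helper)."""
    return [apply_levels(page, black, white) for page in pages]


# Hypothetical values: tinted paper around 200, ink at 60, a soft edge at 90.
page = [[200, 60, 90],
        [205, 198, 60]]
print(apply_levels(page, black=60, white=190))
```

The points chosen on one typical page are reused unchanged for the whole book, which is why small or faint text elsewhere can suffer.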

Saving the output as colour or grayscale images produced, as expected, quite a large file even with JPEG compression, although there may be scope for experimenting with higher compression settings. I then repeated the test, converting the images to black and white after the levels adjustment and saving with Fax (CCITT 4) compression: as expected, that produced a much smaller file, and image quality was in fact only slightly degraded.
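Part of the size saving comes before any fax compression is applied: a black and white image needs only 1 bit per pixel instead of 8 for grayscale. The sketch below, which assumes 8-bit grayscale values in nested lists and a hypothetical cut-off of 128, shows that step only. The CCITT Group 4 encoding itself is not implemented here, and the function names are illustrative.

```python
def to_bilevel(pixels, threshold=128):
    """Return rows of 0/1 bits, where 1 means black (gray value below threshold)."""
    return [[1 if v < threshold else 0 for v in row] for row in pixels]


def pack_row(bits):
    """Pack a row of 0/1 bits into bytes, most significant bit first.

    A final partial byte is padded with 0 (white).
    """
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        chunk = chunk + [0] * (8 - len(chunk))
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


# Hypothetical 16-pixel row: white paper (255) with two dark strokes (0).
row = [255, 255, 0, 0, 255, 255, 255, 255,
       255, 0, 0, 0, 255, 255, 255, 255]
packed = pack_row(to_bilevel([row])[0])
print(len(row), "bytes as grayscale ->", len(packed), "bytes as 1-bit")
```

Fax-style compression then works on top of this 1-bit data, and text pages with long white runs are the kind of content it was designed for.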
Composite_p15.tif
Sadly the image doesn't display without downloading: note that the file size of the composite image is irrelevant, as lossless compression was used to ensure that both images reflected the originals, I hope...

Although the output for the book text looks reasonable on a quick check, there are necessarily trade-offs involved in the processing, and some text with small character sizes in the front matter of the book is compromised in the output file, or possibly in some cases missing. The 248-page grayscale version had a file size of 98MB, and the black and white version around 8MB, which means I can upload it for inspection:
So proof of concept only, a lot of scope for experimentation and optimisation, and not an entirely easy ride for anyone with no image editing experience, but potentially a direct solution other than the need to OCR the resulting output file if searchability is required...
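To put the two file sizes quoted above in per-page terms, here is a short worked calculation. It assumes 1 MB = 1000 KB and takes "around 8MB" as exactly 8, so the results are approximate. The helper name is illustrative.

```python
def size_summary(pages, gray_mb, bw_mb):
    """Return (KB per grayscale page, KB per B&W page, size ratio).

    Assumes decimal units (1 MB = 1000 KB).
    """
    return (gray_mb * 1000 / pages, bw_mb * 1000 / pages, gray_mb / bw_mb)


gray_kb, bw_kb, ratio = size_summary(pages=248, gray_mb=98, bw_mb=8)
print(f"grayscale: {gray_kb:.0f} KB/page, B&W: {bw_kb:.0f} KB/page, ratio: {ratio:.2f}x")
```

That works out to roughly 395 KB per grayscale page against roughly 32 KB per black and white page, a reduction of about twelve times.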
russca
Posts: 53
Joined: 04 Mar 2014, 00:53
Number of books owned: 0
Country: ____

Re: Converting Color/Grayscale Text Scans to Black & White

Post by russca »

cday wrote:...other than the need to OCR the resulting output file if searchability is required...
My off-topic two cents: these days text searchability is a must. Obviously a lot depends on the quality of the scanned files and on the OCR program. I don't buy the "file size" argument anymore. Storage drives are now so large, up to something like 8 terabytes in a single external unit, and so cheap, and internet speeds are very fast.

Abbyy FineReader does an excellent OCR job out of the box and can be trained to improve character recognition. Here is a snapshot of page 43 as recognized in Abbyy FineReader. I am also copy/pasting the unaltered recognized text in the quotation box below.
43-FineReader.tif
PROSATEURS
43
days of Margaret of Navarre; and obviously what the Typical English Novelist had always aimed at— if he had aimed at any Form at all—and what the Typical English Critic looked for—if ever he condescended to look at a Novel—was a series of short stories with linked characters and possibly a culmination. Indeed, that conception of the Novel has been forced upon the English Novelist by the commercial exigencies of hundreds of years. The Romances of Shakespeare, novels written for ranted recitation and admirable in the technique of that Form, were moulded by the necessity for concurrent action in varying places : the curtain had to be used. So you had the Strong Situation in order that the psychological stages of Othello should be firm in the hearer's mind whilst Desdemona was alone before the audience. The Novels of Fielding, of Dickens, and of Thackeray were written for publication in Parts : at the end of every part must come the Strong Situation, to keep the Plot in the reader's head until the First of Next Month. So with the eminent contemporaries of ours in the 'nineties of last century : if the writer was to make a living wage he must aim at Serialisation : for that once again you must have a Strong Scene before you write " To be continued," or the reader would not hanker for the next number of the magazine you served. But you do not need to go to Commercial Fiction to find the origin of the tendency : if the reader has ever lain awake in a long school dormitory or a well-peopled children's bedroom, listening to or telling long, long tales that went on from day to day or from week to week, he will have known, or will have observed, the necessity to retain the story in the hearer's mind, and to introduce, just
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Converting Color/Grayscale Text Scans to Black & White

Post by cday »

russca wrote:These days text searchability is a must...
Yes, usually, but someone scanning fiction, for example, might not need it...
russca wrote:Abbyy FineReader does an excellent OCR job out of the box and can be trained to improve character recognition.
Yes indeed, and when searchability is required, image enhancement can be done within FineReader before OCR is performed, as you no doubt already know. Here is the downloaded colour file with a quite aggressive levels adjustment (W=135, B=100) applied before recognition:
thustorevisitsom00ford_FR12_Levels_W135B100_MRC.pdf
(10.91 MiB)
So when, as is usually the case, searchability is required, FineReader, or the Nuance OmniPage equivalent if there is one, would likely be the easiest direct solution to removing background colouration, although always subject to the level of collateral damage when the images are processed, which may be an issue for lower quality original images.

Interestingly, Adobe Acrobat doesn't seem to have any image enhancement options, at least none that I can find in Acrobat Standard XI. Otherwise the ClearScan output option would be ideal for maintaining the fidelity of the original images while also potentially greatly reducing the file size, but in my tests the yellow tint remained.
Fabian
Posts: 6
Joined: 09 May 2016, 14:13
Country: Canada

Re: Converting Color/Grayscale Text Scans to Black & White

Post by Fabian »

Thank you to everyone who responded to my initial question. I've been a little overwhelmed with some of the technical details but have now spent a number of hours experimenting with options.

In particular, russca, I've been following the instructions from your video tutorial and have been largely successful in reproducing your results using blending mode, including running my first Photoshop automated batch. It works well when the page is just text but I've had a lot of difficulty when the page consists of both text and image. I attach a copy of another test page which you can see contains a rather faint image. I can't seem to find the right balance that preserves both text and image using Screen/Multiply; in almost every variation, the image vanishes into the whitened background. Any suggestions?
Test page with both text and image
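A rough numerical illustration of why a faint image can vanish under Screen is sketched below. The exact steps of the video tutorial are not reproduced in this thread, so this assumes the simplest case: an 8-bit grayscale layer blended with a copy of itself, using the standard Multiply and Screen formulas with integer rounding. The pixel values are hypothetical.

```python
def multiply(a, b):
    """Standard Multiply blend for 8-bit values: darkens, never lightens."""
    return a * b // 255


def screen(a, b):
    """Standard Screen blend for 8-bit values: lightens, never darkens."""
    return 255 - (255 - a) * (255 - b) // 255


# Hypothetical values: tinted paper 235, a faint pencil line 225, ink 40.
for name, v in [("paper", 235), ("pencil", 225), ("ink", 40)]:
    print(name, "original:", v,
          "screen:", screen(v, v),
          "multiply:", multiply(v, v))
```

With these numbers the gap between paper and pencil line shrinks from 10 levels to 2 under Screen, so the drawing all but disappears into the background, while Multiply widens the gap to 18 but darkens the paper again. That is the trade-off being described: whatever whitens the background also compresses the lightest tones.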
russca
Posts: 53
Joined: 04 Mar 2014, 00:53
Number of books owned: 0
Country: ____

Re: Converting Color/Grayscale Text Scans to Black & White

Post by russca »

cday wrote:So when, as is usually the case, searchability is required, FineReader, or the Nuance OmniPage equivalent if there is one, would likely be the easiest direct solution to removing background colouration, although always subject to the level of collateral damage when the images are processed, which may be an issue for lower quality original images.
Your linked file came out quite nicely. As a general rule, I avoid "one-stop-all-you-can-eat" programs and prefer doing image prep/enhancement separately from OCR. E.g. I hate Adobe Acrobat's file optimization, but love its deskew function for automatic text straightening.
russca
Posts: 53
Joined: 04 Mar 2014, 00:53
Number of books owned: 0
Country: ____

Re: Converting Color/Grayscale Text Scans to Black & White

Post by russca »

Fabian wrote:In particular, russca, I've been following the instructions from your video tutorial and have been largely successful in reproducing your results using blending mode, including running my first Photoshop automated batch. It works well when the page is just text but I've had a lot of difficulty when the page consists of both text and image. I attach a copy of another test page which you can see contains a rather faint image. I can't seem to find the right balance that preserves both text and image using Screen/Multiply; in almost every variation, the image vanishes into the whitened background. Any suggestions?
It really depends on what quality you require for your use. For example, you can preserve color images while turning the page background white with black letters. That would require additional post-conversion steps, with the file being saved as a color image.

If it's a "pencil" drawing as in your uploaded sample page, then b & w conversion with the threshold method is the best route to take, IMO.

I made a GIF animation of your sample page converted into b & w with the threshold level varying from 190 to 220 in 5-point increments:

testpage-sample.gif
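The threshold sweep shown in the animation can be sketched in a few lines of Python. This is an illustration only, not the tool used for the GIF: it assumes 8-bit grayscale values in nested lists, treats pixels strictly darker than the level as black, and uses made-up sample values for ink, pencil strokes and paper.

```python
def threshold(pixels, level):
    """Pixels darker than `level` become black (0); all others become white (255)."""
    return [[0 if v < level else 255 for v in row] for row in pixels]


def black_fraction(pixels, level):
    """Fraction of pixels that would turn black at the given threshold level."""
    flat = [v for row in pixels for v in row]
    return sum(1 for v in flat if v < level) / len(flat)


# Hypothetical values: ink 40, pencil strokes 195-215, paper 230.
sample = [[40, 195, 230, 205],
          [230, 215, 40, 230]]

# Same sweep as in the post: 190 to 220 in steps of 5.
for level in range(190, 221, 5):
    print(level, black_fraction(sample, level))
```

As the level rises, progressively lighter pencil strokes cross over to black while the paper stays white, so the right level is the highest one that still leaves the background clean.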
Fabian
Posts: 6
Joined: 09 May 2016, 14:13
Country: Canada

Re: Converting Color/Grayscale Text Scans to Black & White

Post by Fabian »

Thank you again to everyone for your patient and detailed replies!
dpc
Posts: 379
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: Converting Color/Grayscale Text Scans to Black & White

Post by dpc »

It's interesting to see the difference in that same page that was scanned by Google here.

The pencil drawing in the center of the page has darker lines in some of the areas that might say something about the post-processing algorithm they are using.
mrwarper
Posts: 18
Joined: 29 Dec 2012, 21:50
E-book readers owned: 10x iRex DR1000, 15x iRex DR800
Number of books owned: 10000
Country: Spain

Re: Converting Color/Grayscale Text Scans to Black & White

Post by mrwarper »

Sorry for going somewhat off-topic -- this is a most interesting thread, and I nearly missed it. Shouldn't this be in Tutorials/How-Tos in the Software and Processing section? (I mean, if it's possible to move threads from one subforum to another)
Jetigen
Posts: 1
Joined: 01 Sep 2015, 02:54
Number of books owned: 0
Country: Kyrgyz Republic - USA

Re: Converting Color/Grayscale Text Scans to Black & White

Post by Jetigen »

dpc wrote:It's interesting to see the difference in that same page that was scanned by Google here.

The pencil drawing in the center of the page has darker lines in some of the areas that might say something about the post-processing algorithm they are using.
I preferred Google's conversion of the pencil drawing, though it looks washed-out overall. The sharper lines of the far-away house are neat. For text, I think there is no difference. The two methods from my video tutorial allow thicker or thinner letters in the black & white conversion.

BTW, this is me, russca.

Here they are next to each other. Click on the images to see them at original size:

Google's black and white conversion
google-black-white-conversion.jpg
Generic black & white conversion using Threshold
test-black-white-conversion.jpg