Converting Color/Grayscale Text Scans to Black & White


intermediatic
Posts: 11
Joined: 23 Apr 2010, 23:14

Re: Converting Color/Grayscale Text Scans to Black & White

Post by intermediatic »

Is there any way to do this to an existing PDF? I've encountered a large number of scanned books that I am trying to make more readable.
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Converting Color/Grayscale Text Scans to Black & White

Post by cday »

intermediatic wrote:Is there any way to do this to an existing PDF? I've encountered a large number of scanned books that I am trying to make more readable.
One way would be to extract the images from the PDF, convert them to black and white (or perform any other enhancement operations), and then create a new PDF file from those images. That's all possible using batch processing with a choice of freeware software, so it's not too difficult. The catch, if there is one, is that searchability will be lost if the original text was searchable. The new files can easily be OCR'ed to restore it, though, and as OCR software has moved on, accuracy may well be improved if the original images weren't high quality.
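To make the "convert them to black and white" step concrete, here is a minimal sketch of one common approach, global thresholding with Otsu's method, in plain Python. It is not what any particular freeware tool necessarily does. It assumes a page has already been extracted and loaded as rows of 0-255 grayscale values; the `Gray` type and both function names are made up for illustration.

```python
from typing import List

# Hypothetical page representation: rows of 0-255 grayscale values.
Gray = List[List[int]]


def otsu_threshold(pixels: Gray) -> int:
    """Return the cut-off that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_dark = 0  # pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey values
    for t in range(256):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        count_light = total - count_dark
        if count_dark == 0:
            continue
        if count_light == 0:
            break
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        variance = count_dark * count_light * (mean_dark - mean_light) ** 2
        if variance > best_var:
            best_t, best_var = t, variance
    return best_t


def binarize(pixels: Gray) -> Gray:
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold(pixels)
    return [[0 if value <= t else 255 for value in row] for row in pixels]


# Tiny made-up "page": dark ink values mixed with light paper values.
page = [[30, 40, 200, 210],
        [35, 220, 215, 45]]
bw_page = binarize(page)
```

A single global threshold works best on evenly lit scans; pages with shadows or heavy staining usually need a local (adaptive) method instead.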

I'm not aware of a way of enhancing the images within the PDF while maintaining searchability although there may possibly be one, but extracting the images and then creating a new PDF isn't too difficult anyway...

Update:

Based on a quick test, I think it may be possible to do what you want directly using the cross-platform freeware XnConvert, although you would still need to OCR the resulting file if you need searchability. Note that you will also need to install the freeware utility Ghostscript, and to set a DPI value in XnConvert to obtain suitable image quality. As there is little or no documentation other than the XnView forum, the easiest way to learn the program is to explore the interface fully and run some tests.
Fabian
Posts: 6
Joined: 09 May 2016, 14:13
Country: Canada

Re: Converting Color/Grayscale Text Scans to Black & White

Post by Fabian »

intermediatic wrote:Is there any way to do this to an existing PDF? I've encountered a large number of scanned books that I am trying to make more readable.
I also have a number of PDF books (in this case downloaded from Internet Archive) that have a similar yellowish background hue/tint, and I would like to find a simple way to convert them to a plain white background. Is there a (Mac) program that will do the job to an existing PDF, or must I extract and then process each individual page image as the above posts describe? Please be as specific as possible.

This is my first post on this forum, so please forgive my ignorance about these matters.
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Converting Color/Grayscale Text Scans to Black & White

Post by BruceG »

Fabian
Can you give a link to the book you have from Internet Archive (e.g. via Dropbox)?

The first entry about Photoshop will allow batch changes.
Fabian
Posts: 6
Joined: 09 May 2016, 14:13
Country: Canada

Re: Converting Color/Grayscale Text Scans to Black & White

Post by Fabian »

BruceG wrote:Fabian
Can you give a link to the book you have from Internet Archive (e.g. via Dropbox)?
An example would be this work: https://archive.org/details/thustorevisitsom00ford (the PDF download option is halfway down the page).

I know I can export the individual page images to TIFF files and then process them thru a program like ScanTailor, but that involves multiple stages and can be time-consuming. I'm wondering if there is a much simpler solution to replace the yellowish background with a white one.
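For what "replace the yellowish background with a white one" amounts to numerically, here is a minimal sketch in plain Python, under stated assumptions: the page is a flat list of RGB tuples, most of the page is bare paper (so the per-channel median is a fair estimate of the paper colour), and the function names are hypothetical. Scaling each channel so the paper colour lands on white removes the tint while leaving the letter shapes as grey levels rather than forcing them to pure black and white.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def estimate_paper_colour(pixels: List[RGB], percentile: float = 0.5) -> RGB:
    """Estimate the background colour as a percentile of each channel.

    Assumption: more than half of the page is bare paper, so the
    median of each channel belongs to the background.
    """
    idx = min(len(pixels) - 1, int(percentile * len(pixels)))
    return tuple(sorted(p[c] for p in pixels)[idx] for c in range(3))


def whiten(pixels: List[RGB], paper: RGB) -> List[RGB]:
    """Scale each channel so that the paper colour maps to pure white."""
    out = []
    for p in pixels:
        out.append(tuple(
            min(255, round(p[c] * 255 / max(1, paper[c])))  # clamp, avoid /0
            for c in range(3)
        ))
    return out


# Made-up sample: eight yellowish paper pixels and two dark ink pixels.
sample = [(240, 225, 180)] * 8 + [(40, 35, 30)] * 2
paper = estimate_paper_colour(sample)
cleaned = whiten(sample, paper)
```

The same per-channel scaling is what a "levels" or "white point" adjustment does in most image editors, so it can in principle be applied to every page of a book in one batch.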
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Converting Color/Grayscale Text Scans to Black & White

Post by BruceG »

Fabian
Did you have a look at the B&W PDF version at Internet Archive? There is some tint on some pages, but not over the whole page like the normal PDF.
Thus to Revisit OCR.pdf
OCR'ed by OmniPage
This OCR was done with the B&W PDF. Using the tinted version, I would have to remove the tint one page at a time after OCR, either while fixing OCR errors or as a job by itself.
A lot depends on how you want to use the material. Copy and paste is important to me, as is file size, so I OCR. Try copy and paste with either PDF from Internet Archive.
Fabian
Posts: 6
Joined: 09 May 2016, 14:13
Country: Canada

Re: Converting Color/Grayscale Text Scans to Black & White

Post by Fabian »

BruceG wrote:Fabian
Did you have a look at the B&W PDF version at Internet Archive? There is some tint on some pages, but not over the whole page like the normal PDF.
You're right, I should have chosen one of the many examples where no B&W version is offered. Unfortunately, your OCR'd copy is not suitable for my purposes because the original typeface is sacrificed. File size and copy-and-paste are important to me too, but I'd also like to preserve as much fidelity to the original as possible.

Is the bottom line, therefore, that each page must be individually processed in order to remove the tint? There is no program that will do it for me in one book-length pass?
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Converting Color/Grayscale Text Scans to Black & White

Post by duerig »

Fabian, I don't know the Mac ecosystem that well. But if you are comfortable with the command line, I think that ImageMagick might be your best bet. It is easily scriptable and I am almost certain that it can do everything you need here.
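A rough sketch of what such a script could look like, written in Python so the command is built as a list and can be inspected before anything runs. The options `-colorspace Gray` and `-threshold` are standard ImageMagick settings; the function names, the 50% default, and the `*.tif` naming are assumptions for illustration. On ImageMagick 6 the program is called `convert` rather than `magick`.

```python
from pathlib import Path
from typing import List


def binarize_command(src: Path, dst: Path,
                     threshold_percent: int = 50,
                     tool: str = "magick") -> List[str]:
    """Build an ImageMagick command that turns one page image black and white.

    The 50% default is only a starting point; tinted or faded pages
    usually need a different value found by trial and error.
    """
    return [
        tool, str(src),
        "-colorspace", "Gray",                 # drop the colour tint first
        "-threshold", f"{threshold_percent}%",  # then force pure black/white
        str(dst),
    ]


def batch_commands(src_dir: Path, dst_dir: Path,
                   pattern: str = "*.tif") -> List[List[str]]:
    """One command per extracted page image, in filename order."""
    return [binarize_command(p, dst_dir / p.name)
            for p in sorted(src_dir.glob(pattern))]


# Example: inspect the command for a single (hypothetical) page file.
cmd = binarize_command(Path("page-043.tif"), Path("bw") / "page-043.tif")
print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once ImageMagick is installed and the output directory exists. The page images still have to be extracted from the PDF first and reassembled afterwards.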

-D
Fabian
Posts: 6
Joined: 09 May 2016, 14:13
Country: Canada

Re: Converting Color/Grayscale Text Scans to Black & White

Post by Fabian »

duerig wrote:Fabian, I don't know the Mac ecosystem that well. But if you are comfortable with the command line, I think that ImageMagick might be your best bet. It is easily scriptable and I am almost certain that it can do everything you need here.
Thanks for the suggestion, duerig, but I've never really been comfortable working from the command line. And since I almost never work with photos, my familiarity with image-editing programs like Photoshop or Gimp is pretty rudimentary. :cry:
russca
Posts: 53
Joined: 04 Mar 2014, 00:53
Number of books owned: 0
Country: ____

Re: Converting Color/Grayscale Text Scans to Black & White

Post by russca »

Fabian wrote:And since I almost never work with photos, my familiarity with image-editing programs like Photoshop or Gimp is pretty rudimentary. :cry:
Just follow my Photoshop video tutorial and you'll get the results you want. It can't get any easier once it's batch automated. For OCR you can use ABBYY FineReader. It will put the OCR'ed text under the scanned page image, letting you preserve the original printed look while adding searchability and the ability to copy the text.

For demonstration purposes, I extracted page 43 in TIFF format from the PDF file you linked in this thread, ran it through the Photoshop steps, and finally OCR'ed it using FineReader. The searchable PDF for page 43 is at the very bottom of my reply.

Here they are:

Extracted Page 43
Thus to Revisit-43c.tiff
Converted B&W Page 43
Thus to Revisit-43.tiff
Searchable B&W Page 43 PDF
Thus to Revisit OCR-43.pdf