Preprocessing RAW images for Scantailor

Posts: 79
Joined: 15 Sep 2010, 15:33
Number of books owned: 2000
Country: USA
Location: Ohio

Post by abmartin » 16 Feb 2013, 16:18

This tutorial explains how to prepare your RAW images for Scantailor. In the process, you will fix colors with a gray card and fix perspective and lens distortions. Having discovered the ppmunwarp tool a few weeks ago, I wrote a single script that uses it for the geometry and ufraw for the colors. What used to be a complex process is now quite simple.

This tutorial assumes the use of Linux. I hope that Windows and Mac people can figure out how to use this approach with their own scripts.

Software Tools:
ppmunwarp (http://diybookscanner.org/forum/viewtop ... =19&t=2589)
UFRaw (ufraw and ufraw-batch), GIMP, ImageMagick (mogrify), and bc, all of which the script calls

Meatspace Tools:
Gray card (I use an 18%, which has rgb values of 128 128 128)
Calibration image for ppmunwarp (see its thread). My post in that thread includes EPS files for printing the image. Because the calibration image also serves as a rough crop, it is handy to cut several sizes for differently proportioned books.

Steps to do before running the script
1. When photographing the book in RAW, take a photo of the calibration image and the gray card with each camera.
2. Copy images from your cameras into left and right directories.
3. Rename the gray card image as "color.ext" (ext being whatever extension your RAW photos use; on my Canons, it is CRW).
4. Rename the calibration image as "calibration.ext"
5. Rename your other files as appropriate. (I use numbers with an _L or _R ending)

Running the script...
1. Run the script in the directory containing the images

Fixing Color:
1. The script opens the color file in UFRaw and asks you for three values.
2. In UFRaw, draw a selection box over much of your gray card, and click the eyedropper button, which creates a true gray.
color calibration 1.jpg
gray values equalized
3. Using the exposure adjust tool, adjust until the spot values reach their true value. (In the case of 18%, it's 128, 128, 128)
color calibration 2.jpg
exposure adjusted to appropriate values
4. Enter the resulting temperature, green, and exposure values into the script.
color calibration 3.jpg
values entered into script
5. Now, your raw input is color adjusted and converted to the ppm format using ufraw-batch

Fixing Distortions
1. The script uses the file calibration.ppm to prepare for unwarping
2. The script now runs ppmunwarp on all images in the directory

Preparing Images for Scantailor
1. The script opens the corrected calibration image in GIMP.
2. Using the measuring tool (shift-M), determine the pixels per centimeter. (The grid points are half a centimeter apart, so one centimeter spans two grid intervals.)
Measuring DPI.jpg
GIMP measuring tool
3. Enter that number into the script
4. The script, using that value, now converts the image format into lzw compressed tif files with correct DPI information.

Housekeeping
1. The script deletes the intermediate ppm and png files.
2. The script moves the final output up one directory (out of the left or right folder).

Run the script again on the other directory...

You are ready for Scantailor!

Things for you to adjust as needed in the script:
1. Input format for RAW files. Currently, it is entered as CRW. Change to whatever file format is appropriate.
2. Path to ppmunwarp. I installed mine in /usr/bin, so I can just call it with ppmunwarp. You may need to include the path to the binary
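One way to keep these two adjustments in one place is to put them in variables at the top of the script. This is only a sketch: the variable names RAW_EXT and PPMUNWARP are my own invention, not part of the script below, and the defaults simply mirror the values described above.

```shell
# Hypothetical variable names; the defaults mirror the values described above.
# Override them from the environment, e.g. RAW_EXT=CR2 ./script.sh
RAW_EXT="${RAW_EXT:-CRW}"            # extension of your RAW files
PPMUNWARP="${PPMUNWARP:-ppmunwarp}"  # name of, or full path to, the ppmunwarp binary

# The script's commands could then refer to the variables instead of fixed names:
color_file="color.$RAW_EXT"
echo "Color file:  $color_file"
echo "Unwarp tool: $PPMUNWARP"
```

With that in place, a line such as "ufraw color.CRW &" would become "ufraw "$color_file" &", and changing cameras means editing one line.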

The Script:

#This script takes RAW files and fixes color and geometry and prepares images for scantailor.
#It requires the use of a gray card and the ppmunwarp grid.
#Change input file formats in the script as appropriate for your camera.
#Before running the script, rename the gray card as color and the calibration image as calibration
#Name all other files as you choose

##Color Calibration from a gray card.
#Using UFRaw: 1. select an area on your gray card, 
#2. click the eyedropper, which equalizes RGB values
#3. adjust the exposure control to get RGB values to their ultimate goal. 
#(e.g. an 18% gray card is 128 128 128)
#The script will ask for the resulting values from the gui and apply them to all images.

echo "The color calibration image is being loaded in UFRaw. Enter the following values for color correction"
ufraw color.CRW &
echo "Color temperature?: "
read temperature
echo "Green Value?: "
read green
echo "Exposure change?: "
read exposure

echo "Running Ufraw-batch, this will take a few minutes..."

ufraw-batch --temperature=$temperature --green=$green --exposure=$exposure --out-type=ppm *.CRW

##Using ppmunwarp and the calibration image, lens and perspective distortion are fixed.

echo "Calibrating geometry..."
ppmunwarp -m check.ppm calibration.ppm > calibration.bin

echo "Correcting geometry.  This will take some time..."
for i in *.ppm; do
 if [ -e "$i" ]; then
   file=$(basename "$i" .ppm)
   #Use the full path to ppmunwarp here if it is not on your PATH
   ppmunwarp -d calibration.bin "$i" > "$file.png"
 fi
done

##Using gimp, determine the pixels per centimeter

echo "Use the measuring tool in GIMP (shift-m) to determine the number of pixels per centimeter (two dots)."
gimp calibration.png &
echo "PixelsPerCM?: "
read ppc
ppi="$(echo "$ppc*2.54" | bc)"

##Imagemagick converts images to a scantailor-compliant file

echo "ImageMagick will now convert the format into one usable by Scantailor"
mogrify -verbose -format tif -density "$ppi" -units PixelsPerInch -compress lzw *.png

##Housekeeping -- comment out what you don't want to happen.
##Clean temporary files
rm *.png
rm *.ppm

##Move files up a directory (into the parent), in cases of Left/Right folders
mv *.tif ..

Example images:

Calibration Image -- I tried to get some keystoning and some good lens distortion in this image. I normally zoom more than that, but this is to show how it works.
calibration image.jpg
Geometry data from calibration image visualized
geometrydata.jpg
Calibration image fixed!
geometryfixed.jpg

Sample page
original.jpg
Color Fixed
color-fixed.jpg
Distortions fixed
perspectiveandcropped.jpg

Photos taken with two canon sx130 cameras, with CHDK.

Now that I've got that figured out, I guess I can't ignore my terrible lighting anymore....

The image of the book comes from Richard Stallman's excellent essay collection Free Software, Free Society. As I understand the licensing terms, I am supposed to offer the complete work to you. If you want the final djvu file of this book, let me know. It's kind of unnecessary though, since you can get the source files from which the book was printed on the FSF website. (http://www.gnu.org/philosophy/fsfs/rms-essays.pdf) I strongly suspect that the FSF would consider this a fair use case too, making the offer doubly-unnecessary.

Let me know if you have ideas to make this better! This was my first bash script, so I really had no clue what I was doing. The internet is an amazing resource to find answers to questions. (The only tricky bit was figuring out how to do floating point math, which, rather hilariously, isn't possible in bash itself, so the script hands it to bc.)

Happy bookscanning!

Posts: 39
Joined: 12 Sep 2012, 16:54
E-book readers owned: Iliad
Number of books owned: 200
Country: Spain

Re: Preprocessing RAW images for Scantailor

Post by pablitoclavito » 17 Feb 2013, 12:04

Thanks for your effort, I will try to use this...
but I have a question about dpi.
In your image you measure 113 pixels, but don't you need to multiply that by 2.54 to get DPI (dots per inch),
so your dpi would be 287?


Re: Preprocessing RAW images for Scantailor

Post by abmartin » 17 Feb 2013, 13:02

That's correct. The script multiplies the dots per centimeter by 2.54 to get the DPI that is written into the final image. That was the weirdest thing to figure out, since bash doesn't handle decimals natively. The following line takes the pixels-per-centimeter value and does the conversion:
ppi="$(echo "$ppc*2.54" | bc)"
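As a worked example of that conversion, here is a small sketch using the 113 pixels per centimeter from the question above. The function name ppc_to_ppi is made up for illustration, and it uses awk instead of bc only because awk is present on nearly every system; the arithmetic is the same.

```shell
# ppc_to_ppi is a hypothetical helper name, not part of the original script.
# It multiplies pixels per centimeter by 2.54 (centimeters per inch).
ppc_to_ppi() {
    awk -v ppc="$1" 'BEGIN { printf "%.2f\n", ppc * 2.54 }'
}

ppc_to_ppi 113    # prints 287.02
```

So 113 pixels per centimeter works out to roughly 287 pixels per inch, which is the value that ends up in the -density option.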

Perhaps it might be easier to make calibration images with the dots every quarter or third of an inch? I chose to stick with what the author of the ppmunwarp tool produced, but basically anything would work just as well, provided enough points are detected.

I'd love to somehow get rid of that gimp step. If the size of the grid is known, there should be a way to do it in software. As I am not a programmer (LaTeX is about as far as I go into the murkier realms), I don't, as of yet, see a way of using the geometry data to also spit out a DPI that could be used. If there are any programming gurus out there... Getting rid of one data-gathering step would be outstanding! In an active thread in the hardware forum, the issue of processing with little user input seems to be something that folks want.

Re: Preprocessing RAW images for Scantailor

Post by pablitoclavito » 17 Feb 2013, 14:02

Sorry, I didn't read the script.

For the dpi matter, I would like an automated approach too, but for the moment I use the IKEA paper ruler stuck to the glass :) (it has two rulers cm/in, so I use the one that measures in inches)

...and then I measure in IrfanView.
How I measure dpi
IMG_0441.jpg
Do you think your process takes much memory, in order to use it inside a virtual machine?
I am using win7 now, and I have ubuntu somewhere around, but I have never used it...

One more question
I have scanned some books and I have seen that the deformation of the pages is not the same at the beginning, the middle or at the end of the book.
Do you have to take more than one "calibration grid sample" as you advance through the book?
And then divide the process of unwarping in various "lots"?


Re: Preprocessing RAW images for Scantailor

Post by abmartin » 21 Feb 2013, 13:24

It doesn't use much memory. Ufraw-batch doesn't seem to ever go over 1 percent of my ram according to top (but that is 16 gigs). PPMunwarp likewise is very light since it only does one image at a time. It seems to max at .4%. Imagemagick is also very lightweight, since it also just does one image at a time. The script itself is basically nothing, since it's just a sequence of text. The most memory-intensive step is having the file open in gimp.

You should be able to do all of this in Windows without having to use a virtual machine. (Although I encourage anyone to get familiar with gnu/linux.) If you do use a virtual machine, I'd encourage you to consider a lightweight distro, since Ubuntu itself is pretty heavy (much more than the tasks being done here). Ufraw-batch is installed alongside Ufraw in Windows. Likewise, PPMunwarp can be compiled in Windows too. Imagemagick also has Windows binaries. While my script won't work, the workflow will work perfectly in Windows too. I posted my script with the hope that some Windows gurus can write a quick batch file that does the same stuff. You can determine the dpi the same way you always have!

Regarding changing deformations: With my hardware setup, there doesn't seem to be much change across a book, because the cameras, focus, and zoom are stationary and the platen is good at staying in the same place. The biggest issue of changing page shape is that the book doesn't always want to press up against my platen. To fix that, I have foam inserts that I place under the books to help push the pages up. With a fixed platen like on Dan's kits, that simplifies everything. There will still be issues of rotation, but Scantailor is magnificent at fixing that. If there are significant changes in page geometry in your setup, by all means, take multiple calibration images. It would work perfectly fine! (The author of ppmunwarp does talk about that a little in his tutorial.)

Re: Preprocessing RAW images for Scantailor

Post by pablitoclavito » 22 Feb 2013, 14:32

Is it necessary to do the process with raw images?

About the gray card: when using this method, do I still have to use the white balance setting in my camera ("daylight"), or does the color-fixing step have nothing to do with that? I am sorry, I know almost nothing about photography. In fact, this is the first time I have heard of gray cards.

However, I guess the settings in your camera when you take the photo of the gray card initially, are meant to be kept afterwards in the scanning process. Right?

Thanks again.

Re: Preprocessing RAW images for Scantailor

Post by abmartin » 09 Apr 2013, 18:03

Wow -- Sorry I've been so out of it that I haven't responded yet. By now, I expect you've played around a bit and have already figured out a lot. If anyone has improved on this, please let me know!

Re the Raw input -- Unfortunately, I don't think that ufraw can be used with non-raw files. I just tried some jpegs and they didn't load. That's a shame. Similar things could be done with Imagemagick, but I could never figure a simple way to do it. (As in I was slamming my head against the desk (figuratively) trying to properly script that)

A Raw file is, basically, a dump of what the sensor picked up rather than what the camera produces after its software goes to work to fix colors and geometry. Cameras have to do that because it is impossible to make a sensor that responds to all wavelengths of light equally and to make lenses with no distortion at all zoom levels, which is why, for the real stuff, we get out our expensive prime lenses on the SLR. When they make the cameras, they know how off it is and then have the internal software correct things. (Mine are mega blue, for example, without any fixing, and on the point and shoot, you can see how bad the lens distortions are.) However, that correction is the best compromise rather than perfect for every circumstance. What the gray card allows is for us to apply the correct white balance for each specific scenario. (In my sample images, the camera's suggested values were already applied in the pre-corrected image, which is actually not too bad, so if I were just going to a black and white final, the color correction wouldn't really be necessary.) When I get out the DSLR for other photography, I like to get a gray card somewhere in the shot, then take the same image again without it. Since I am color blind, I really don't have any other option to get accurate color, since I sure can't trust my eyes to adjust things properly.

It really doesn't matter what the camera's white balance is set on when taking in raw, since you will fix it afterwards anyway. (Although I still set it as best as I can) So, when answering the second question: no, but you still can. (It's at least a rough estimate already done)

Yes, you should use identical settings in all of the photos. (and identical lighting conditions too) Otherwise, the calibration data isn't going to be correct for the others in the series.

Forgive the lack of clarity. That's something I really need to work on.

Re: Preprocessing RAW images for Scantailor

Post by pablitoclavito » 09 Apr 2013, 19:39

Thanks for all the explanations! Very clear indeed!

Posts: 19
Joined: 27 Nov 2012, 19:43
E-book readers owned: ipad
Number of books owned: 0
Country: Norway

Modified version to the scripts, v1.0

Post by royeven » 29 Apr 2013, 13:39

Hi abmartin.

I have taken the liberty to change a few parts of your script and a few parts of ppmunwarp in an attempt to automate one of the steps that you do manually, i.e. the PPI-detection. I have tested it in ubuntu and debian and think I have removed all the errors, but there might still be some, of course. I hope that some of you who read this will test it, try it out and report back any bugs you may find. Please also tell me whether my calculations are correct.

I have made some modifications to the program ppmunwarp. See here for documentation of the original version: http://diybookscanner.org/forum/viewtop ... =19&t=2589. The program now reports the mean calculated distance (in pixels) between two adjacent calibration dots. When issuing the command

ppmunwarp -mul 5.08 -m check.ppm calibration.ppm > calibration.bin
you will get output like this:
Number of detected points: 2526
Only 2317 detected points used for calibration!
Average: 90.880333, calculated from 2276 of 2277 data points. PPI: 462
Deskewed picture size: 5360 x 3724 (99.97% x 99.97%)
It means that 2526 calibration dots were found, but only 2317 of them were used for the calibration process. The mean distance between dots was 90.880333 pixels, and 2276 of the 2277 data points were used to calculate it. With 2 dots per cm and 2.54 cm per inch, we multiply the 90.880333 pixels by 5.08 to get the number of pixels per inch, about 462 PPI; hence the -mul (for multiplier) option in the command.
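For anyone who wants to pick that number up in a script, here is a rough sketch that pulls the PPI out of the report and rechecks the arithmetic. It works on the sample line quoted above; the helper name extract_ppi is made up, and how you capture the report from your own build (and on which stream it is printed) is something you would need to check yourself.

```shell
# extract_ppi is a hypothetical helper: it reads report text on stdin
# and prints the integer that follows "PPI:".
extract_ppi() {
    sed -n 's/.*PPI: *\([0-9][0-9]*\).*/\1/p'
}

# Sample line copied from the output shown above.
sample='Average: 90.880333, calculated from 2276 of 2277 data points. PPI: 462'

ppi=$(printf '%s\n' "$sample" | extract_ppi)
echo "Reported PPI: $ppi"

# Cross-check: mean dot distance times 5.08 (2 dots per cm * 2.54 cm per inch).
mean=$(printf '%s\n' "$sample" | sed -n 's/^Average: *\([0-9.]*\),.*/\1/p')
awk -v m="$mean" 'BEGIN { printf "Recomputed PPI: %.0f\n", m * 5.08 }'
```

The recomputed value comes from 90.880333 * 5.08 = 461.67, which rounds to the 462 that the program reports.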

Here is the source code for the program:
Source file to modified version of ppmunwarp
To compile it use

g++ -o ppmunwarp ppmunwarp.cpp
And here is a modified version of your bash script:
Bash script to correct images
I have commented out the parts of the code that do color correction, since I do not know how to automate this part of the process yet. Since you have the same uneven lighting as I have, you want to manually select the part of your reference card that is least affected by it, and I have not found a good way to automate this. Does anybody have a good solution for how to do it? I can implement it in software if I could get some tips as to how to do it.

But the script does detect the PPI from the deskewed photos, opens the calibration image in gimp so you can manually measure the PPI, and lets you compare it to the output from ppmunwarp. If ppmunwarp doesn't compute it correctly, input the correct PPI value when prompted, and this will be used instead.

Please give me feedback to the accuracy of the PPI estimate that ppmunwarp now does. I will do my best to make it perfect, but this will require feedback from the rest of you guys.

Re: Preprocessing RAW images for Scantailor

Post by abmartin » 29 Apr 2013, 19:34

Brilliant! I will compile your version and try it out as soon as I can. (Maybe even as early as tonight) I really like the average approach which should improve accuracy! Thank you very much for doing that.

I was actually thinking about the color thing recently and have a thought that might be useful for you. The beauty of UFraw is that I am able to select an area which is averaged to provide a more accurate benchmark than a single pixel. I was thinking that we might be able to get the average color of a region with imagemagick.

1. Crop an area in the center of the image where the gray card presumably is. (maybe 250x250 or 500x500 pixels?)
$ convert color.png -gravity center -crop 250x250+0+0 color-out.png
2. Resize the cropped image to 1x1 pixels. My belief is that this averages the color. (It seems to, but I haven't tested it thoroughly yet.) Then use the text mode of imagemagick to get the RGB values of that one pixel.
$ convert color-out.png -resize 1x1 txt:

The final output of that step looks like this:
# ImageMagick pixel enumeration: 1,1,255,srgb
0,0: (102,100,104) #666468 srgb(102,100,104)

I'm not sure how to get that value into a script yet, nor have I done the math to use either ufraw or imagemagick to fix colors from that RGB value.
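One possible way to get that value into a script is to pick the three numbers out of the text output with sed. This is only a sketch: the helper name parse_rgb is made up, it is run here against the sample output quoted above, and it assumes the output keeps this exact layout, which may differ between ImageMagick versions or bit depths. It does not attempt the color-correction math.

```shell
# parse_rgb is a hypothetical helper: it reads the "txt:" output on stdin and
# prints the three channel values of pixel 0,0 separated by spaces.
parse_rgb() {
    sed -n 's/^0,0: *(\([0-9]*\),\([0-9]*\),\([0-9]*\)).*/\1 \2 \3/p'
}

# Sample output copied from above. In real use the input would instead come from:
#   convert color-out.png -resize 1x1 txt: | parse_rgb
sample='# ImageMagick pixel enumeration: 1,1,255,srgb
0,0: (102,100,104) #666468 srgb(102,100,104)'

# Read the three values into shell variables.
read -r r g b <<EOF
$(printf '%s\n' "$sample" | parse_rgb)
EOF

echo "R=$r G=$g B=$b"
```

Once the values sit in r, g, and b, the rest of the script can compare them against the gray card's target values.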

I look forward to trying out your revisions.
