
Scantailor preprocess script

Posted: 17 May 2013, 19:33
by abmartin
I've been posting the slow progress of development in another thread as this script morphed from a tutorial into something that stands more on its own. ( ... =19&t=2795) It is now at a point where I feel I can post it here as complete. I wrote it for *nix systems. I believe very little would need to change to get it running on Macs. I have no clue how to make it work on Windows. If anyone has knowledge of Windows or Mac scripting, I would be pleased to see those additions.

ppmunwarp as modified by royeven ( ... 795#p15975)
(UFRaw and GIMP are optional)

What the script does:
Color correction with a gray card, either manually with UFRaw and UFRaw-batch or automatically with ImageMagick (or turned off entirely).
Geometry correction and cropping using ppmunwarp as modified by royeven ( ... 795#p15975)
Automatic DPI calculation using ppmunwarp and royeven's new -mul option, or manual DPI entry with GIMP.
ImageMagick creates final images ready for Scantailor.

In the following example, I show the steps in the automatic modes using jpg inputs, as that is, I expect, a preferred choice for most camera wielders. In my description, I will refer to the "input-format" rather than JPG, since that is a variable that the user should set for their own workflow. Images here are resized and compressed to save bandwidth, but the necessary points should be clear. (Compressed far too much...)

1. Color Correction
The script looks for the gray card calibration image, which is named in relation to user-defined variables in the script (default: color.input-format).
Input gray card image
ImageMagick crops a box, centered in the image, of a user-defined size (default: 500x500) to eliminate all non-gray-card areas:
Gray area used for color calibration
Imagemagick then determines the average RGB values of this small box and calculates the necessary transformations to reach the target value of the gray card (defined in script variables). These transformations are applied to all files of the user-defined input format contained in the directory. Also, at this stage, the input format is converted into ppm files that ppmunwarp can use.
Gray card after color matrix has been applied
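
The crop and the gain calculation in step 1 can be illustrated with the numbers from the log further down. This is only a sketch in Python, not the script's own code. The function names are made up for illustration, the 4000x3000 image size is a placeholder, and the per-channel target of (162, 162, 160) is an assumption inferred from the logged multipliers rather than a documented default.

```python
def center_crop_box(image_size, box_size=(500, 500)):
    """Return (left, top, right, bottom) of a box centered in the image."""
    width, height = image_size
    box_w, box_h = box_size
    left = (width - box_w) // 2
    top = (height - box_h) // 2
    return (left, top, left + box_w, top + box_h)


def channel_gains(measured, target):
    """Return the per-channel multipliers that map the measured gray-card
    average onto the target gray value."""
    return tuple(t / m for m, t in zip(measured, target))


# ASSUMPTION: a 4000x3000 camera image, used only to show the crop geometry.
box = center_crop_box((4000, 3000))

# Average RGB of the centered gray-card crop, as reported in the log.
measured = (151, 146, 152)
# ASSUMPTION: target inferred from the logged multipliers, not a known default.
target = (162, 162, 160)

gains = channel_gains(measured, target)
print(box)
print(", ".join(f"{g:.10f}" for g in gains))
```

The printed multipliers should agree with the logged ones to within rounding in the last digit. Each multiplier is then applied to its channel in every input image.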
2. Geometry Correction
Using a calibration image for ppmunwarp (also user-defined file name -- default is calibration.input-format), ppmunwarp will determine the lens and perspective distortions. The size of the calibration image provides crop information. (So make your calibration image a bit bigger than the book!)
Calibration Image
Using the calibration data, ppmunwarp will apply the necessary geometry fixes to all ppm files in the directory. During this stage, the images are also cropped.
Calibration image after geometry correction
3. DPI Calculation
In automatic mode, ppmunwarp will recalibrate itself on the now-corrected calibration image. It determines the average distance between calibration points. (ppmunwarp estimates the positions of any dots it fails to detect.) Using the -mul option (configurable), ppmunwarp will multiply the average distance by the user-supplied multiplier to determine DPI.

In manual mode, the image is opened in GIMP, and the user works out the correct DPI with the measuring tool and a little math, then enters it.

(Additionally, there is a mode that uses the autodetect but lets the user double-check the result.)
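
The automatic DPI step boils down to a single multiplication. The Python sketch below reproduces the PPI values from the log further down. The multiplier 5.08 is an assumption (it corresponds to calibration dots spaced 5 mm apart, since 25.4 / 5 = 5.08), and the rounding is a guess at how the result becomes a whole number. The manual-mode helper shows the kind of math a user might do after measuring in GIMP; its name and the example figures are made up.

```python
MM_PER_INCH = 25.4


def dpi_from_dot_spacing(avg_pixels_between_dots, multiplier):
    """Automatic mode: average dot distance in pixels times the -mul value."""
    return round(avg_pixels_between_dots * multiplier)


def dpi_from_measurement(pixels, millimetres):
    """Manual mode: pixels measured across a known physical length."""
    return round(pixels / (millimetres / MM_PER_INCH))


# ASSUMPTION: dots are 5 mm apart, so the multiplier is 25.4 / 5 = 5.08.
multiplier = MM_PER_INCH / 5

first_pass = dpi_from_dot_spacing(64.829646, multiplier)   # average from the first calibration
second_pass = dpi_from_dot_spacing(64.579329, multiplier)  # average after geometry correction
manual = dpi_from_measurement(1291, 100)                   # hypothetical: 1291 px across 100 mm

print(first_pass, second_pass, manual)
```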

4. Final steps
ImageMagick converts the corrected ppm files into the output format of the user's choice (the default is LZW-compressed TIFF). Using the DPI determined in step three, ImageMagick also writes the DPI into the images, which saves a step in Scantailor.

The script deletes intermediate files.
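
For anyone adapting this step, here is a rough Python sketch of how such a conversion command could be assembled. It is not taken from the script: the function name and file names are hypothetical, and the options shown (-units, -density, -compress) are standard ImageMagick settings that I am assuming fit this purpose. On ImageMagick 7 the program is usually invoked as "magick" rather than "convert".

```python
import shlex
from pathlib import Path


def convert_command(ppm_path, dpi, out_ext="tif"):
    """Build an ImageMagick command that writes an LZW-compressed image
    with the given DPI stored in its metadata."""
    src = Path(ppm_path)
    dst = src.with_suffix("." + out_ext)
    return [
        "convert", str(src),
        "-units", "PixelsPerInch",  # interpret the density as DPI
        "-density", str(dpi),       # DPI value from step 3
        "-compress", "LZW",
        str(dst),
    ]


# Hypothetical file name; 328 is the PPI reported in the log.
cmd = convert_command("page_0001.ppm", 328)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems with file names that contain spaces.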

Here's the camera photo. There is quite a bit of lens distortion and some obvious keystoning. (Kind of fun to misaim the cameras for an example...)
Example page - input
After color correction, geometry correction, and cropping, the final image looks much better:
Example-page output
After running the image through Scantailor, here is an example of the result:
Example page - after Scantailor
Here is the log file:


Your camera images are now being prepared for Scantailor

Determining RGB values of color.JPG...
Detected RGB values are 151, 146, 152
RGB values to be adjusted by 1.0728476821, 1.1095890410, 1.0526315789

Running Imagemagick to adjust colors and convert JPG to ppm...

Calculating geometry calibration data from image: calibration.ppm
Number of detected points: 1938
Average: 64.829646, calculated from 1900 of 1900 data points. PPI: 329
Deskewed picture size: 3358 x 2518   (83.98% x 83.98%)

Correcting geometry.  This will take some time...

Calculating DPI from image: calibration_corrected.ppm
Number of detected points: 1938
Average: 64.579329, calculated from 1900 of 1900 data points. PPI: 328
Deskewed picture size: 3355 x 2516   (99.91% x 99.93%)
Calculated PPI is: 328

ImageMagick will now prepare images for Scantailor...

Deleting temporary files...

Your images are ready for Scantailor
And here is the script:
Preprocess script
I guess I'll call that RC1 for now.

I apologize for the badly compressed images. A 75% JPG quality is obviously not enough... And yes, my lighting is poor and there are a lot of reflections in my old New Standard. It used to be better, but that involved a behemoth black shroud that was just too massive for the living room. I'm trying to save up for one of Dan's new kits, which would also solve the reflection problems. At least the software workflow is ready for that day!!

Happy bookscanning!

Re: Scantailor preprocess script

Posted: 20 May 2013, 09:52
by wbest1
OK. And how does that look after running through something like Tesseract? Or ABBYY? Does the OCRed file translate better than the original?


Re: Scantailor preprocess script

Posted: 20 May 2013, 16:52
by abmartin
If your goal is text output, fixing the color doesn't make a whole lot of difference. Since Scantailor will binarize text, there is no real reason to do it unless you have images. (I just do it all of the time regardless, since I have to run an ImageMagick step anyway to get ppm images for unwarping, and it doesn't take any more time. Fixing the colors also helps with the ppmunwarp step, since it is trying to detect points of a particular color.) As for the geometry, it doesn't make a significant difference to OCR if the problems are small. OCR software seems to cope pretty well.

With that sample image, there are a couple of odd spots with the original that aren't a problem with the fixed version. (Not a huge deal though -- and I would never actually take an image that bad either)

I prefer to have djvu output with the text layer stored in the background (using djvubind), so, for me, having correct geometry makes reading a lot more seamless. I have attached a 2-page djvu file: one page is the sample processed without the correction (just Scantailor on the camera image, without dewarping) and the other is the corrected version. You can see the slight differences in the text layer, especially where the lines are not straight. I used Tesseract for the OCR.

I had to add a txt extension to upload the file to the forum, so you will have to download the file and delete the extension before viewing it.
Example -- Delete the .txt extension

Re: Scantailor preprocess script

Posted: 21 May 2013, 11:57
by Misty
Wow, this looks fantastic! Going to have to borrow some of your techniques for my own postprocessing script.

Re: Scantailor preprocess script

Posted: 25 May 2013, 02:19
by tkr
i) I'm curious to know whether ppmunwarp will handle the case where the book surface is not perfectly flat, but has a curve to it?

ii) Thanks for posting the 11x17 and 11x14 calibration sheets. However, when I printed them out, it appears that the printed area does not change and is the same size as 8.5x11. Could you please check your .eps code?

iii) Also, could somebody please post an exe file for the ppmunwarp - I was able to compile and run it exactly once (on Windows 7), but I made some changes to my environment (uninstalled Visual Studio) and now cannot get back to a working state (Error Msg: "The application was unable to start correctly").


Re: Scantailor preprocess script

Posted: 25 May 2013, 10:11
by pablitoclavito
tkr wrote: ii) Thanks for posting the 11x17 and 11x14 calibration sheets. However, when I printed them out, it appears that the printed area does not change and is the same size as 8.5x11. Could you please check your .eps code?
Did you print eps files or pdf files? (I ask this because I don't know if you can print eps directly)
I am going to tell you what I did to get correct dimensions in the ones I use.

I converted the eps I modified to pdf in Linux, using epstopdf. Then I printed the pdf, and it had the correct dimensions.
I don't know if there is a binary for Windows.
Some converters have trouble keeping the boundaries you specify in the EPS.

Re: Scantailor preprocess script

Posted: 25 May 2013, 11:36
by abmartin

1. ppmunwarp has no problems with curves, provided the calibration image shares that curve. (I haven't tested how far it can be pushed.) I believe the author of the program was intending to build a scanning setup with a curve in it! (I assume for loose sheets rather than books.) From the computer's perspective, there isn't really a difference between a curve and the lens distortion in my sample image. (Again, it is only effective if the calibration image shares the same distortions as the page.)

2. pablitoclavito has it right. Perhaps I should have posted PDF files instead, but they are more than 1000x the size of the EPS text. Convert the EPS files to a printable format, and you will have it. I also like the ghostscript toolset, but you probably don't have access to that on Windows. You can also render the EPS code into image files using GIMP. When opening the file in GIMP, just set the DPI to match that of the printer, then save it as a massive image file. (And I mean HUGE.) It might also be possible to print directly from GIMP.

3. Hopefully someone can step up and make a Windows binary. I don't have any Windows machines, so I can't do it, unfortunately.

Re: Scantailor preprocess script

Posted: 25 May 2013, 12:35
by tkr
I used a free online service (forget the name now) to convert the eps file to pdf - it is possible that the error was introduced there.
I'll try pablitoclavito's method of using Linux to generate the pdf.

Re: Scantailor preprocess script

Posted: 28 May 2013, 05:17
by andyh2000
GSView and ghostscript toolsets are available for Windows for all your ps, eps and pdf converting needs. Start here:


Re: Scantailor preprocess script

Posted: 11 Jun 2013, 18:59
by lab_rat
sorry for the total n00b question... what is the proper syntax to define the location of the ppmunwrap.cpp file?
I keep getting a ./ line 118: ppmunwrap: command not found