Scantailor preprocess script


abmartin
Posts: 79
Joined: 15 Sep 2010, 15:33
Number of books owned: 2000
Country: USA
Location: Ohio

Scantailor preprocess script

Post by abmartin »

I've been posting the slow progress of development in another thread as this script morphed from a tutorial into something that stands more on its own (http://diybookscanner.org/forum/viewtop ... =19&t=2795). It is now at a point where I feel I can post it here as complete. I have written this for *nix systems. I believe that very little would need to be changed to get it to run on Macs. I have no clue how to make it work on Windows. If anyone has knowledge of Windows or Mac scripting, I would be pleased to see those additions.

Dependencies:
ImageMagick
ppmunwarp as modified by royeven (http://diybookscanner.org/forum/viewtop ... 795#p15975)
(UFRaw and GIMP are optional)

What the script does:
Color correction with a gray card, either manually with UFRaw and UFRaw-batch or automatically with ImageMagick (or turned off entirely).
Geometry correction and cropping using ppmunwarp as modified by royeven (http://diybookscanner.org/forum/viewtop ... 795#p15975)
Automatic DPI calculation using ppmunwarp and royeven's new -mul option, or manual DPI entry with GIMP
ImageMagick creates final images ready for Scantailor.

In the following example, I show the steps in the automatic modes using jpg inputs, as that is, I expect, a preferred choice for most camera wielders. In my description, I will refer to the "input-format" rather than JPG, since that is a variable that the user should set for their own workflow. Images here are resized and compressed to save bandwidth, but the necessary points should be clear. (Compressed far too much...)

1. Color Correction
The script looks for the gray card calibration image, which is named in relation to user-defined variables in the script (default: color.input-format)
Input gray card image
ImageMagick crops a box, centered in the image, of a user-defined size (default: 500x500) to eliminate all non-gray-card areas:
Gray area used for color calibration
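
The centered crop is plain arithmetic. Here is a minimal sketch in Python, assuming a hypothetical helper name and a made-up 4000x3000 input frame. The script itself may compute this differently, or may leave the centering to ImageMagick.

```python
def centered_crop_geometry(width, height, box_w=500, box_h=500):
    """Return an ImageMagick-style geometry string (WxH+X+Y) for a box
    centered in a width x height image."""
    if box_w > width or box_h > height:
        raise ValueError("crop box is larger than the image")
    # Offsets of the box's top-left corner from the image's top-left corner.
    x = (width - box_w) // 2
    y = (height - box_h) // 2
    return f"{box_w}x{box_h}+{x}+{y}"


# Hypothetical 4000x3000 camera frame with the default 500x500 box.
print(centered_crop_geometry(4000, 3000))  # 500x500+1750+1250
```

If the gray card does not sit in the middle of the frame, the box will pick up background, so a smaller box is the safer choice.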
Imagemagick then determines the average RGB values of this small box and calculates the necessary transformations to reach the target value of the gray card (defined in script variables). These transformations are applied to all files of the user-defined input format contained in the directory. Also, at this stage, the input format is converted into ppm files that ppmunwarp can use.
Gray card after color matrix has been applied
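
The multiplier calculation described above can be sketched as follows. This is an illustration in Python, not the script's own code. The measured values are the ones reported in the log further down; the target values (162, 162, 160) are an assumption, inferred by working backwards from the multipliers in that same log.

```python
def channel_gains(measured_rgb, target_rgb):
    """Per-channel multipliers that map the measured gray-card average
    onto the target gray value."""
    if any(m <= 0 for m in measured_rgb):
        raise ValueError("measured channel averages must be positive")
    return tuple(t / m for t, m in zip(target_rgb, measured_rgb))


measured = (151, 146, 152)  # channel averages reported in the log
target = (162, 162, 160)    # ASSUMPTION: inferred from the log's multipliers

for name, gain in zip("RGB", channel_gains(measured, target)):
    print(f"{name}: {gain:.10f}")
```

With these inputs the gains agree with the multipliers in the log to about nine decimal places. Each channel of every input image is then scaled by its own gain.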
2. Geometry Correction
Using a calibration image for ppmunwarp (also user-defined file name -- default is calibration.input-format), ppmunwarp will determine the lens and perspective distortions. The size of the calibration image provides crop information. (So make your calibration image a bit bigger than the book!)
Calibration Image
Using the calibration data, ppmunwarp will apply the necessary geometry fixes to all ppm files in the directory. During this stage, the images are also cropped.
Calibration image after geometry correction
3. DPI Calculation
In automatic mode, ppmunwarp will recalibrate itself on the now-corrected calibration image. It determines the average distance between calibration points. (For dots that are not detected, ppmunwarp estimates their positions.) Using the -mul option (configurable), ppmunwarp multiplies the average distance by the user-supplied multiplier to determine DPI.
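
The multiplication itself is easy to reproduce. The sketch below is Python, not ppmunwarp's code, and rests on two assumptions: the calibration dots are spaced 5 mm apart (which gives a multiplier of 25.4 / 5 = 5.08 and happens to reproduce the PPI values in the log further down), and the result is rounded to the nearest integer.

```python
MM_PER_INCH = 25.4


def ppi_from_dot_spacing(avg_distance_px, dot_pitch_mm):
    """Estimate pixels per inch from the average pixel distance between
    neighbouring calibration dots and their physical spacing in mm."""
    if dot_pitch_mm <= 0:
        raise ValueError("dot pitch must be positive")
    multiplier = MM_PER_INCH / dot_pitch_mm  # dots per inch on the printed sheet
    return round(avg_distance_px * multiplier)


# Averages taken from the log; the 5 mm pitch is an ASSUMPTION.
print(ppi_from_dot_spacing(64.829646, 5.0))  # 329
print(ppi_from_dot_spacing(64.579329, 5.0))  # 328
```

If your calibration sheet uses a different dot spacing, the multiplier changes accordingly.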

In manual mode, the image will be opened in GIMP, and, using the measuring tool and math, the user will input the correct DPI.
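
The "math" in manual mode is a single division: measure something of known physical length with GIMP's measuring tool and divide the pixel count by the length in inches. A minimal sketch with made-up numbers (the function name is hypothetical):

```python
def dpi_from_measurement(measured_px, known_length_in):
    """DPI from a pixel measurement of an object of known length (inches)."""
    if known_length_in <= 0:
        raise ValueError("known length must be positive")
    return measured_px / known_length_in


# Hypothetical example: a 6-inch ruler segment measures 1968 pixels.
print(dpi_from_measurement(1968, 6))  # 328.0
```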

(Additionally, there is a mode that uses the autodetect, but allows the user to double-check)

4. Final steps
ImageMagick converts the fixed ppm files into the output format of the user's choice (default: LZW-compressed TIFF files). Using the DPI determined in step three, ImageMagick also saves DPI information into the images, which saves having to set it in Scantailor.
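
For illustration, here is one way such a conversion command could be assembled. This Python sketch only builds the argument list and does not run it. The options -units, -density and -compress are standard ImageMagick settings, but the exact command used by the script is not shown in this post, so treat the option choice, the file names and the helper name as assumptions. It also assumes the ImageMagick 6 `convert` binary; ImageMagick 7 installs may call it `magick`.

```python
from pathlib import Path


def build_convert_command(ppm_path, dpi, out_ext="tif"):
    """Build an ImageMagick command line that converts one ppm file to an
    LZW-compressed output file and embeds the resolution."""
    src = Path(ppm_path)
    dst = src.with_suffix("." + out_ext)
    return [
        "convert", str(src),
        "-units", "PixelsPerInch",  # interpret -density as pixels per inch
        "-density", str(dpi),       # resolution stored in the output file
        "-compress", "LZW",
        str(dst),
    ]


# Hypothetical file name; 328 is the PPI from the log below.
print(" ".join(build_convert_command("page_0001.ppm", 328)))
```

The list can be passed to `subprocess.run` once ImageMagick is installed.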

The script deletes intermediate files.



Examples:
Here's the camera photo. There is quite a bit of lens distortion and some obvious keystoning. (Kind of fun to misaim the cameras for an example...)
Example page - input
After color correction, geometry correction, and cropping, the final image looks much better:
Example-page output
After running the image through Scantailor, here is an example of the result:
Example page - after Scantailor
Here is the log file:

Your camera images are now being prepared for Scantailor

Determining RGB values of color.JPG...
Detected RGB values are 151, 146, 152
RGB values to be adjusted by 1.0728476821, 1.1095890410, 1.0526315789

Running Imagemagick to adjust colors and convert JPG to ppm...
done

Calculating geometry calibration data from image: calibration.ppm
Number of detected points: 1938
Average: 64.829646, calculated from 1900 of 1900 data points. PPI: 329
Deskewed picture size: 3358 x 2518   (83.98% x 83.98%)
done

Correcting geometry.  This will take some time...
done

Calculating DPI from image: calibration_corrected.ppm
Number of detected points: 1938
Average: 64.579329, calculated from 1900 of 1900 data points. PPI: 328
Deskewed picture size: 3355 x 2516   (99.91% x 99.93%)
Calculated PPI is: 328

ImageMagick will now prepare images for Scantailor...
done

Deleting temporary files...
done

Your images are ready for Scantailor

And here is the script:
preprocess.txt
Preprocess script
I guess I'll call that RC1 for now.


I apologize for the badly compressed images. A JPEG quality of 75% is obviously not enough... And yes, my lighting is poor and there are a lot of reflections in my old New Standard. It used to be better, but it involved a behemoth black shroud that was just too massive for the living room. I'm trying to save up to get one of Dan's new kits, which would also solve the reflection problems. At least the software workflow is ready for that day!!

Happy bookscanning!
wbest1
Posts: 4
Joined: 25 Apr 2013, 09:38
E-book readers owned: Nexus 7
Number of books owned: 1000
Country: USA

Re: Scantailor preprocess script

Post by wbest1 »

OK. And how does that look after running through something like Tesseract? Or ABBYY? Does the OCRed file translate better than the original?

Thanks,
WTB
abmartin
Posts: 79
Joined: 15 Sep 2010, 15:33
Number of books owned: 2000
Country: USA
Location: Ohio

Re: Scantailor preprocess script

Post by abmartin »

If your goal is text output, fixing color doesn't make a whole lot of difference. Since Scantailor will binarize text, there is no real reason to do it unless you have images. (I just do it all of the time regardless, since I have to run an ImageMagick step anyway to get ppm images for unwarping, and it doesn't take any more time. Fixing the colors also helps with the ppmunwarp step, since it is trying to detect points of a particular color.) With the geometry, it doesn't make a significant difference to OCR if the problems are small. OCR software seems to cope with them pretty well.

With that sample image, there are a couple of odd spots with the original that aren't a problem with the fixed version. (Not a huge deal though -- and I would never actually take an image that bad either)

I prefer to have djvu output with the text layer stored in the background (using djvubind), so, for me, having correct geometry makes reading a lot more seamless. I have attached a 2 page djvu file with the sample page processed without the correction (just scantailor from the camera image without dewarping) and one having been corrected. You can see the slight differences in the text layer where the lines are especially not straight. When running OCR, I used Tesseract.

I had to add a txt extension to upload the file to the forum, so you will have to download the file and delete the extension before viewing it.
book.djvu.txt
Example -- Delete the .txt extension
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Scantailor preprocess script

Post by Misty »

Wow, this looks fantastic! Going to have to borrow some of your techniques for my own postprocessing script.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
tkr
Posts: 35
Joined: 29 Jan 2012, 21:53
Number of books owned: 0

Re: Scantailor preprocess script

Post by tkr »

i) I'm curious to know whether ppmunwarp will handle the case where the book surface is not perfectly flat, but has a curve to it?

ii) Thanks for posting the 11x17 and 11x14 calibration sheets - however, when I printed them out, it appears that the printed area does not change and is the same size as 8.5x11. Could you please check your .eps code?

iii) Also, could somebody please post an exe file for the ppmunwarp - I was able to compile and run it exactly once (on Windows 7), but I made some changes to my environment (uninstalled Visual Studio) and now cannot get back to a working state (Error Msg: "The application was unable to start correctly").

Tks
TKR
pablitoclavito
Posts: 39
Joined: 12 Sep 2012, 16:54
E-book readers owned: Iliad
Number of books owned: 200
Country: Spain

Re: Scantailor preprocess script

Post by pablitoclavito »

tkr wrote: ii) Thanks for posting the 11x17 and 11x14 calibration sheets - however when I printed them out, it appears that the printed area does not change and is the same size as 8.5x11. Could you please check your .eps code ?
Did you print eps files or pdf files? (I ask this because I don't know if you can print eps directly)
I am going to tell you what I did to get correct dimensions in the ones I use.

I converted the eps I modified to pdf in Linux, using epstopdf. Then I printed the pdf, and it had the correct dimensions.
I don't know if there is a binary for windows.
Some converters have problems with keeping the boundaries you indicate in the eps.
abmartin
Posts: 79
Joined: 15 Sep 2010, 15:33
Number of books owned: 2000
Country: USA
Location: Ohio

Re: Scantailor preprocess script

Post by abmartin »

Tkr,

1. ppmunwarp has no problems with curves, provided the calibration image shares that curve. (I haven't tested to see how far it can be pushed.) I believe that the author of the program was intending to build a scanning setup with a curve in it! (I assume for loose sheets rather than books.) From the computer's perspective, there isn't really a difference between a curve and the lens distortion from my sample image. (Again, it is only effective if the calibration image shares the same distortions as the page.)

2. pablitoclavito has it right. Perhaps I should have posted PDF files instead, but they are more than 1000x the size of the EPS text. Convert the EPS files to a printable format, and you will have it. I also like the ghostscript toolset, but you probably don't have access to that on Windows. You can also render the EPS files to image files using GIMP. When opening in GIMP, just set the DPI to match that of the printer. Then save as a massive image file. (And I mean HUGE.) It also might be possible to print from GIMP?

3. Hopefully someone can step up and make a windows binary. I don't have any Windows machines, so I can't do it, unfortunately.
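
To put a rough number on "HUGE" in point 2: the pixel size of a rendered sheet is just inches times DPI. A small Python sketch, assuming a 600 DPI printer and uncompressed 8-bit RGB (both are assumptions for illustration, not values from this thread):

```python
def raster_size(width_in, height_in, dpi, bytes_per_pixel=3):
    """Pixel dimensions and uncompressed size in bytes of a page rendered
    at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# 11x17 inch calibration sheet at an assumed 600 DPI.
w, h, size = raster_size(11, 17, 600)
print(w, h, size)  # 6600 10200 201960000
```

That is roughly 200 MB before compression for a single sheet.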
tkr
Posts: 35
Joined: 29 Jan 2012, 21:53
Number of books owned: 0

Re: Scantailor preprocess script

Post by tkr »

Abmartin,
I used a free online service (forget the name now) to convert the eps file to pdf - it is possible that the error was introduced there.
I'll try pablitoclavito's method of using Linux to generate the pdf.
TKR
andyh2000
Posts: 3
Joined: 04 Jan 2012, 06:21
E-book readers owned: Lots!
Number of books owned: 1000

Re: Scantailor preprocess script

Post by andyh2000 »

GSView and ghostscript toolsets are available for Windows for all your ps, eps and pdf converting needs. Start here: http://pages.cs.wisc.edu/~ghost/gsview/get50.htm

Andrew
lab_rat
Posts: 6
Joined: 04 Jan 2013, 23:55
E-book readers owned: kindle
Number of books owned: 0
Country: USA

Re: Scantailor preprocess script

Post by lab_rat »

sorry for the total n00b question... what is the proper syntax to define the location of the ppmunwrap.cpp file?
I keep getting a ./preprocess.sh: line 118: ppmunwrap: command not found

Thanks