Dr. Cheap's software & workflow

Share your software workflow. Write up your tips and tricks on how to scan, digitize, OCR, and bind ebooks.

Moderator: peterZ

DrCheap
Posts: 48
Joined: 07 Jan 2012, 19:27
E-book readers owned: pdf
Number of books owned: 750

Dr. Cheap's software & workflow

Post by DrCheap »

Since the scanner has been up and running, I have been playing about with software options and Daniel asked if I would post some details on my processing and workflow.

The following is the suite of software I am using at the moment:

CR2 Converter: Freeware. Batch-converts CR2 to TIFF. Quick, super-easy, and produces good quality images. Better image quality than some other free options I tried.
http://www.cr2converter.com/

Bulk Rename Utility: Freeware. Tons of great options, auto-numbers, with padding, moves files, etc. User interface is a bit noisy and congested but it does the job.
http://www.bulkrenameutility.co.uk/Main_Intro.php

GIMP 2.8 with David's Batch Processor plug-in: Freeware. A very robust image manipulation tool. Does good color adjustment and some other tricks in big batches. Can work with TIFF or JPG but cannot take CR2 files directly, unfortunately. Install GIMP first, run it once, reboot, then install David's Batch Processor (be sure it goes into GIMP's plug-ins folder), and from then on you can do some limited batch processes in GIMP.
GIMP: http://www.gimp.org/
David's Batch Processor: http://members.ozemail.com.au/~hodsond/dbp.html

ScanTailor: Freeware. Needs no introduction to these forums, I suppose.
http://scantailor.sourceforge.net/

Finally, the one expensive piece of commercial software: Adobe Acrobat Pro. I'm fortunate that I can buy at educational pricing (~$100) but I do not think there is any software out there that can even come close to replacing Acrobat Pro (though I confess I have not used ABBYY). The OCR is very good in Acrobat Pro, but the ClearScan function is AMAZING. It makes the finished product SO MUCH neater and cleaner.

Do be aware that Adobe is very vigilant and tight about their educational pricing activations, so if you cannot prove you qualify for the discount don't try. They may ask for certification of full-time student status or a pay stub proving employment by a qualifying educational institution. More than one person who tried to sneak through without qualifying has been frustrated by Adobe. Also be aware it may take a week to get software activated on educational pricing due to these procedures.

Next post will be my preferred workflow and settings.
DrCheap
Posts: 48
Joined: 07 Jan 2012, 19:27
E-book readers owned: pdf
Number of books owned: 750

Re: Dr. Cheap's software & workflow

Post by DrCheap »

I'm using a beta version of the standard scanner kit with two A480 cameras running CHDK. I'm using a 2-camera 4.5v battery-powered USB trigger made by Frans van de Kamp. Mine is similar to this one: http://www.flickr.com/photos/fvdk3d/475 ... 151791739/

I have them set to shoot in both CR2 (raw) and JPG so that each time it snaps a picture it saves both versions of the image. The JPGs are ok for quick and dirty work, but the CR2 files are much higher quality.

When I am done scanning I have the following files to process:

SDCard1: Left page images: 2 folders, one with all the CR2 and one with all the JPGs.
SDCard2: Right page images: 2 folders, one with all the CR2 and one with all the JPGs.

So, here is my order of operations for processing these on my workstation (for reference, this is an i7-2760QM cpu with 8gb ram running Win7pro x64).

1A: Convert to TIFF and move off SD Card to HD.
I drop SDCard1 into the card reader.
Boot CR2 converter.
Click Add, navigate to the SDCard, select the .cr2 images off my card, add them, close the add window.
Select Output Format as TIFF.
Select Output Folder as F:BookScans/ProjectName/LeftPages/
My large second hard drive is the F: drive. If you have just 1 drive, yours will be C:... etc... My C: drive is much too small for this kind of work (ssd).
Note: I keep my left and right pages separate until the very end of the process right now, but that might change down the line. I am doing this because ScanTailor's handling of some pages and functions makes it preferable.
Click convert and grade some papers or read or answer email for a bit. It's pretty quick but I am impatient.

1B: Repeat the same process with SDCard2 with the right page images, saving them in F:BookScans/ProjectName/RightPages/

2: Every physical page in the book should now have a corresponding TIFF file, including every blank page (this is important). Within each folder (left/right), the file names should follow the exact sequence of the book, so that Windows sorts them in book order. Left and right won't match each other yet, but we will fix that in a minute.

If one camera fired but the other did not (this only happens to me if I am being a bonehead and did something stupid like only turning on one camera), then I need to delete or replace files at this stage. Since I have all the files as TIFFs on my HD, I can quickly skim them now and see if I missed a page (turned 2 pages instead of 1 when shooting, etc.). If I made such errors, now is a good time to fix them, but I have to be sure that I don't do anything that breaks the match between page order and file numbers. If the camera native names are crw1234, crw1235, etc., any reshot or inserted page needs a name that naturally fits into that sequence: crw1234a will fall between crw1234 and crw1235, etc.
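
The check in step 2 could be partly scripted. What follows is only a rough sketch in Python (standard library only), not part of the workflow described here. It assumes camera names of the form crw1234.tif with an optional letter suffix for inserted reshoots, and the function names are made up for illustration.

```python
import re

# Assumed camera naming: letters, a frame number, then an optional
# letter suffix for inserted reshoots (crw1234.tif, crw1234a.tif).
NAME_RE = re.compile(r"^[A-Za-z_]+(\d+)([A-Za-z]?)\.tiff?$", re.IGNORECASE)


def frame_key(name):
    """Return (frame number, suffix) for one file name."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1)), match.group(2).lower()


def missing_frames(names):
    """Frame numbers absent between the first and last shot in a folder."""
    present = {frame_key(name)[0] for name in names}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def count_mismatch(left_names, right_names):
    """Difference in image count between the folders (0 means they agree)."""
    return len(left_names) - len(right_names)
```

Sorting a folder listing with key=frame_key keeps crw1234a between crw1234 and crw1235. A gap reported by missing_frames only flags a spot worth looking at (a deleted test shot leaves a gap too), so skimming the images is still needed.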

3a. Now I rename the files. Boot Bulk Rename Utility. Select the TIFF files in the LeftPages directory.
Setting:
Numbering mode to Prefix,
Start at 1,
Increment at 2, and
Pad at 3 (note: books over 999 page images will need a pad of 4).
In the main window of the utility be sure you have selected all the TIFFs for the left pages. It should preview the new names and they should read something like 001originalfilename.tif, 003... 005... etc. The padding forces all the numbers to 3 digits, adding the 00 to 1-9, 0 to 10-99, then no padding to 100+. This helps Adobe autosort the pages later in the process. Click Rename. This only takes a few seconds.

3b. Repeat this with the RightPages directory, setting the numbering to Prefix, Start to 2, Increment to 2, and Pad to 3, creating files named 002originalname.tiff, 004..., 006..., etc...
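
For anyone who prefers a script to Bulk Rename Utility, the numbering scheme in 3a/3b can be sketched in a few lines of Python. This is an illustration under assumptions, not a replacement tested on real scans: the function names are hypothetical, and it only builds the list of old and new names rather than touching any files.

```python
def pad_for(last_number):
    """Digits needed for the prefix: 3 by default, more past 999."""
    return max(3, len(str(last_number)))


def numbered_names(names, start, step=2, pad=3):
    """Pair each original name with its new prefixed name.

    Names are taken in plain sorted order, which matches the shooting
    order as long as the camera's frame numbers all have the same width.
    """
    plan = []
    for index, name in enumerate(sorted(names)):
        number = start + index * step
        plan.append((name, f"{number:0{pad}d}{name}"))
    return plan
```

Left pages would use start=1 and right pages start=2, both with step=2, so the two sets interleave as 001, 002, 003, and so on once they are merged.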

4a. GIMP processing. RAW cr2 files have no color correction or post-processing applied. That can be good and bad. Luckily GIMP will help us out here. I load GIMP, go Filters >> Batch Process. In the batch process tool, Add my LeftPages TIFF files that have now been renumbered.
Settings:
- Turn to proper orientation (so ScanTailor won't need to turn pages);
- Colour-Enable, AutolevelsOn;
- Resize-Enable, Relative, X-Scale 2.0, Y-Scale 2.0;
- Rename-Select-Dir-and I set the output directory to a subdirectory of LeftPages I call GBP, just to keep my files distinct and organized. If my original was B&W/greyscale then I will check "Convert Grey" on the Rename tab options (NOT on the Colour options).
- Output-TIFF
Click start and go do something else for a while. This can take some time.
You can play with the settings (such as manual contrast or brightness settings) and see samples with the Test Images button. This plugin likes to spit out error messages if you don't follow its on-screen directions but never fails to work great.

4b. Repeat with RightPages, be sure to change the Turn setting and output directory.

5: ScanTailor time. Run ScanTailor on the left pages then on the right pages. Good news is you can open and run two instances of ScanTailor at the same time and it will speed things up, as ScanTailor is not adept at using more than one core of modern CPUs. There's a whole forum on ScanTailor, so I won't go into details.

6. Merge the two ScanTailor outputs into one folder, which should now have the finished final TIFF images of the whole book in order 001...tiff, 002...tiff, 003...tiff etc...
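
The merge in step 6 is just a copy into one folder, but a quick check that no page number went missing can save trouble in Acrobat. Here is a minimal Python sketch of that idea. The folder arguments and function names are placeholders for illustration, and it assumes every file name starts with the numeric prefix added in step 3.

```python
import shutil
from pathlib import Path


def leading_number(name):
    """Read the numeric prefix added during renaming (e.g. 007 -> 7)."""
    digits = ""
    for char in name:
        if not char.isdigit():
            break
        digits += char
    return int(digits)  # raises ValueError if the name has no prefix


def missing_numbers(names):
    """Page numbers absent between 1 and the highest prefix found."""
    present = {leading_number(name) for name in names}
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]


def merge_outputs(left_out, right_out, merged):
    """Copy the TIFFs from both output folders into one merged folder."""
    merged = Path(merged)
    merged.mkdir(parents=True, exist_ok=True)
    for folder in (left_out, right_out):
        for src in sorted(Path(folder).glob("*.tif*")):
            dest = merged / src.name
            if dest.exists():
                raise FileExistsError(f"name clash: {dest}")
            shutil.copy2(src, dest)
    return sorted(p.name for p in merged.glob("*.tif*"))
```

Calling missing_numbers on the list returned by merge_outputs should give an empty list if left and right interleave cleanly.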

7. Acrobat -- Make PDF from multiple files (merge files into one pdf) -- select the new folder with all the ScanTailor output images. Acrobat should automatically sort them into correct order by the numbered image file names (001, 002, 003).

8. Once it has finished making the PDF, you can go through and delete blank or empty pages as you wish. ClearScan OCR hates blank pages for some reason. Once I delete the blank pages, I run OCR (do NOT optimize or do other changes to the PDF images before OCR -- run OCR before any other changes to the PDF). In the OCR options, select ClearScan. For black and white text books, downsample to 300dpi. For color or detail work, stick to 600dpi.
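
To put the 300 vs 600 dpi choice in numbers, here is a small Python sketch of the arithmetic. The 6 x 9 inch page size used in the note below is an assumed example, not a measurement from any book scanned here.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_count_ratio(dpi_from, dpi_to):
    """How many times fewer pixels a page has after downsampling."""
    return (dpi_from / dpi_to) ** 2
```

A 6 x 9 inch page is 3600 x 5400 pixels at 600dpi and 1800 x 2700 at 300dpi, so downsampling from 600 to 300 leaves a quarter of the pixels per page.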

Only problem I have with this workflow right now is that if I am not careful in how I set the content selection and margins in ScanTailor the Left and Right pages in Adobe end up slightly different sizes. I'm working on finding a way to resolve that, but right now trying to do L and R together in ScanTailor causes me too many headaches.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Dr. Cheap's software & workflow

Post by dtic »

DrCheap wrote: I keep my left and right pages separate until the very end of the process right now, but that might change down the line. I am doing this because ScanTailor's handling of some pages and functions makes it preferable.
...
right now trying to do L and R together in ScanTailor causes me too many headaches.
What problems do you get when you process L & R together?
DrCheap
Posts: 48
Joined: 07 Jan 2012, 19:27
E-book readers owned: pdf
Number of books owned: 750

Re: Dr. Cheap's software & workflow

Post by DrCheap »

A couple quick notes on process.

I have found that for basic B&W text, the JPG files work fine and save me a step. For nice archiving, I will still use the CR2 raw files, but for simple text, I have switched to JPG. I can also mostly process clean simple text with left and right pages together, which saves some effort.

My biggest problem in the past with doing L and R together is that I sometimes need to manually set content area due to either unclean / marked originals (in cases where I do not want previous readers' margin comments) or due to some kind of bad glare or background texture that causes ScanTailor to think the content area is much larger than it is. When doing a 600 page book, being able to set a few score pages content at a time by going "apply to this and all subsequent pages", then jump down 40 or 80 pages, adjust for shift in spine, apply again, etc., has been essential for a few originals.

If I have a reasonably clean B&W original, I do the following:

Bulk rename utility on L camera SD card: 001, 003, 005, etc.
Then repeat same on R camera SD card: 002, 004, 006, etc. into the same directory.

I then run GIMP with the bulk processor to double the size of the image and convert it to TIFF. In my experience thus far, this produces MUCH better end result outputs. These files are saved in a new directory.

Then I run ScanTailor on those resulting TIFF files.

After ST, I merge them into a single PDF in Acrobat Pro X and run OCR with ClearScan enabled.

Results from clean originals look great. Excellent OCR, clean images, very readable, etc.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Dr. Cheap's software & workflow

Post by dtic »

DrCheap wrote: I then run GIMP with the bulk processor to double the size of the image and convert it to TIFF. In my experience thus far, this produces MUCH better end result outputs. These files are saved in a new directory.
I tried that on a bunch of pages but can't notice any difference compared to plugging the jpegs directly into ScanTailor. What improvements are you seeing?
DrCheap
Posts: 48
Joined: 07 Jan 2012, 19:27
E-book readers owned: pdf
Number of books owned: 750

Re: Dr. Cheap's software & workflow

Post by DrCheap »

dtic: I think I get slightly cleaner white space inside characters (like the loop inside the top of an e) and also better ScanTailor performance.

I've also noticed Adobe seems to do a better job of smoothing the characters and running accurate OCR.

It's possible all this is confirmation bias, so if I get the chance this weekend, I will see if I can do some true comparison tests and either disprove my current perceptions or else post something here that demonstrates the difference. It's an easy thing to test -- just need to be processing some texts to have material to work with.
DrCheap
Posts: 48
Joined: 07 Jan 2012, 19:27
E-book readers owned: pdf
Number of books owned: 750

Re: Dr. Cheap's software & workflow

Post by DrCheap »

So, I played around with a few settings and made some comparisons and I think honestly you are right -- there is not a significant gain in quality by enlarging before processing.

ScanTailor also processes the smaller image files much faster.

So, the new fast and easy workflow for simple JPG images is this:

Rename the JPG files using Bulk Rename Utility.

Run through ScanTailor to do color output with white margins and correct illumination.

Run through the batch bitonal converter to make B&W images.

Drop into Adobe Acrobat Pro to OCR with ClearScan.

For simple readable text, this is the fast, clean, and easy way to do it. Obviously, for archival-quality color images this probably would not do, but for fast and easy work with JPGs, this is pretty good.

Oh, I also have been using the CHDK auto-focus lock, which speeds up my scanning time and gets me in-focus images on every shot.

Overall, my time from start to finish has dropped dramatically. Only problem I have now is a tiny bit of glare on one edge of the images that sometimes causes ScanTailor's content area recognition to fail and select far too much, but that's caused by some of the lighting in the room where I scan.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Dr. Cheap's software & workflow

Post by spamsickle »

DrCheap wrote: My biggest problem in the past with doing L and R together is that I sometimes need to manually set content area due to either unclean / marked originals (in cases where I do not want previous readers' margin comments) or due to some kind of bad glare or background texture that causes ScanTailor to think the content area is much larger than it is. When doing a 600 page book, being able to set a few score pages content at a time by going "apply to this and all subsequent pages", then jump down 40 or 80 pages, adjust for shift in spine, apply again, etc., has been essential for a few originals.
I believe Scan Tailor allows you to specify a range of pages to which the content selection will apply, and also has an "apply to every other image" option within that range which will enable you to do what you're doing while still getting Scan Tailor to set all the output images to be the same size.

If I'm mistaken about that (I haven't used Scan Tailor much recently) ImageMagick should allow you to do a batch resize of one folder or both.
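
Whichever tool does the work, the target size has to come from somewhere. The Python sketch below only works out the numbers, and it uses padding to a common canvas rather than the batch resize mentioned above, which is a different approach chosen here because it avoids rescaling the text. The function names and the sizes in the note are hypothetical.

```python
def common_canvas(sizes):
    """Smallest (width, height) that fits every page in the list."""
    return max(w for w, _ in sizes), max(h for _, h in sizes)


def centering_offsets(size, canvas):
    """Top-left (x, y) offset that centers one page on the canvas."""
    return (canvas[0] - size[0]) // 2, (canvas[1] - size[1]) // 2
```

With a left page of 2400 x 3600 and a right page of 2380 x 3610, the common canvas is 2400 x 3610 and the right page would sit 10 pixels in from the left edge.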
DrCheap
Posts: 48
Joined: 07 Jan 2012, 19:27
E-book readers owned: pdf
Number of books owned: 750

Re: Dr. Cheap's software & workflow

Post by DrCheap »

spamsickle wrote: I believe Scan Tailor allows you to specify a range of pages to which the content selection will apply, and also has an "apply to every other image" option within that range which will enable you to do what you're doing while still getting Scan Tailor to set all the output images to be the same size.

If I'm mistaken about that (I haven't used Scan Tailor much recently) ImageMagick should allow you to do a batch resize of one folder or both.
Problem is, if I process L & R together I cannot apply content selection to every other page after this one, which is what I would need. If I apply to every other page it will do so from page 1 on regardless of where I apply it, so there will be slippage between the first and last page that causes either excess white space or text cropping. If I apply to every page after this one, then obviously I cannot process Left and Right together. I have to say that since improving my lighting and the accuracy of my dpi settings, this has been much less of an issue (autoselect works much better), but with some texts I still have to do manual settings for content on every page, and when I do it works much better to process L and R separately.
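
To make the difference concrete, here is a tiny Python sketch of the two page selections. It models only the behavior as described in this post, on the assumption that "every other page" means the pages on the same odd/even side as the current one, and the function names are invented for the illustration.

```python
def wanted_pages(current, total):
    """The current page and every second page after it (1-based)."""
    return list(range(current, total + 1, 2))


def applied_pages(current, total):
    """Every second page of the whole book on the same side as current."""
    first = 1 if current % 2 == 1 else 2
    return list(range(first, total + 1, 2))


def overwritten_pages(current, total):
    """Earlier pages whose content area gets replaced unintentionally."""
    wanted = set(wanted_pages(current, total))
    return [p for p in applied_pages(current, total) if p not in wanted]
```

Applying from page 5 of a 10-page set, the wanted pages are 5, 7, 9, but pages 1 and 3 would also be changed, which is where the earlier hand-set content areas get lost.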
tkr
Posts: 35
Joined: 29 Jan 2012, 21:53
Number of books owned: 0

Re: Dr. Cheap's software & workflow

Post by tkr »

Hello,
Please refer to this post for background: http://www.diybookscanner.org/forum/vie ... =19&t=1282
Since you seem to have experience with running a batch process in GIMP, I wanted to ask whether the program you suggested would be able to automate the running of a plugin (like deskew).

Thanks,
TKR