Most Efficient Workflow / Process Available Currently

mellow-yellow
Posts: 46
Joined: 28 Jun 2010, 13:33
Number of books owned: 1
Country: USA
Location: Portland, OR, USA

Post by mellow-yellow » 02 Jul 2010, 00:34

After spending several hours reading various workflow posts on this forum, I'm looking for the most efficient workflow available today, regardless of software licensing preference (open vs. closed) and assuming acceptable final output. Any suggestions or corrections on the following proposed steps?

Proposal: Most Efficient Workflow (e.g. 300 pages in under 60 minutes)
  1. Assuming you use the New "Standard Scanner," zoom both cameras appropriately, attempting to "crop inside the camera" by removing views of the book's spine/crease seen in Hank's (BTW great build) samples (http://www.diybookscanner.org/forum/vie ... t=70#p3693). Save these settings to your camera's Custom option, if available. Hint: Use 1-bit black and white TIFF for primarily text scans, or grayscale/color JPEG for image-heavy or long (e.g. over 500 pages) scans.

  2. Transfer the images from step 1 into a "left" and a "right" folder on your computer. Launch IrfanView (freeware for personal use, $12 commercial; probably PC exclusive, any Mac alternatives?) and run a batch process: (optional) crop (see step 1), (optional) force DPI, (optional) resize, etc. as needed. Most importantly, run "Batch conversion - Rename result files" twice (once for left, once for right), saving the results of both into a single "book" destination folder, with the left source images renamed as 1,3,5,7,... and the right source images renamed as 2,4,6,8,...

  3. Import/Open the "book" folder in ABBYY FineReader, enable its automatic "image pre-processing" feature, and output the OCR'd results to your desired file format (e.g. PDF, Word, OpenOffice, etc.). FineReader's Version 10 User's Guide says the "pre-processing" feature is capable of automatically "removing noise from digital photos, deskewing, straightening text lines, and correcting trapezium distortions." It also lists removing motion blur, reducing noise, rotating and flipping to standard orientation, changing image resolution, splitting images, reading mathematical symbols (via pattern training), and (non-automatic) cropping. Best of all, it's very fast and very accurate.

Estimated times for a 300-page book, where A = attended (i.e. manual, or "your") processing time and U = unattended processing time:
  1. 15 minutes (A), 0 minutes (U)
  2. 3 minutes (A), 2 minutes (U)
  3. 3 minutes (A), 21 minutes (U)
  TOTALS: 21 minutes (A), 23 minutes (U): 44 minutes combined

If "cropping inside the camera" is not possible in your setup (workarounds, anyone?), step 2 allows cropping in post-processing. I derived this list from, among many others, Spamsickle's Photoshop idea (http://www.diybookscanner.org/forum/vie ... ?f=3&t=416), Spamsickle's Custom Camera Settings post (http://www.diybookscanner.org/forum/vie ... =285#p2530), and JJJM's Workflow (http://www.diybookscanner.org/forum/vie ... =424#p3920).
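
The renaming in step 2 (left pages to odd numbers, right pages to even numbers, both into one "book" folder) can also be scripted, which answers the Mac question as well. The following is a minimal Python sketch, not the IrfanView batch itself. The function name `merge_left_right` and the folder arguments are made up for illustration, and it assumes both cameras produced the same number of images and that sorting by file name gives the shooting order.

```python
import shutil
from pathlib import Path


def merge_left_right(left_dir, right_dir, book_dir, pattern="*.jpg"):
    """Copy left pages to odd numbers and right pages to even numbers."""
    # Assumption: file names sort in shooting order (typical camera numbering).
    # Note: on case-sensitive file systems "*.jpg" will not match "IMG_0001.JPG".
    left = sorted(Path(left_dir).glob(pattern))
    right = sorted(Path(right_dir).glob(pattern))
    if len(left) != len(right):
        raise ValueError(
            f"page count mismatch: {len(left)} left, {len(right)} right"
        )

    book = Path(book_dir)
    book.mkdir(parents=True, exist_ok=True)
    # Zero-pad so a plain alphabetical sort keeps page order (001, 002, ...).
    width = max(3, len(str(2 * len(left))))

    copied = []
    for i, (l_img, r_img) in enumerate(zip(left, right)):
        # Left image i becomes page 2i+1, right image i becomes page 2i+2.
        for src, page in ((l_img, 2 * i + 1), (r_img, 2 * i + 2)):
            dst = book / f"{page:0{width}d}{src.suffix.lower()}"
            shutil.copy2(src, dst)  # copy, so the originals stay untouched
            copied.append(dst.name)
    return copied
```

Calling `merge_left_right("left", "right", "book")` fills the "book" folder with 001.jpg, 002.jpg, and so on. Copying instead of renaming in place keeps the camera originals as a fallback if a page turns out to be missing.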

JJJM
Posts: 26
Joined: 13 May 2010, 01:24

Post by JJJM » 17 Jul 2010, 03:32

For text-only books, running Scan Tailor before FineReader is a must. In my tests I get better results with it.

Also, for text books, a lot of post-processing is needed after OCR to get a usable document, mainly to correct OCR errors and to format paragraphs and characters consistently so that the result is readable. I do not know whether you include this in your estimated times. For 300 pages I think 1.5-2 hours are needed to get a nice RTF or DOC document.

will1384

Post by will1384 » 31 Jul 2010, 00:58

For text only pages with no distortion correction

Have the following installed under Windows:

(1) ImageMagick
(2) IrfanView
(3) JPEGCrops

and a directory in the root of your drive like:

C:\Book\Left
C:\Book\Right

and inside "C:\Book" have MAKEBOOK.BAT

Now let's work with your image files:

(1) Copy your images to the correct directory, either

C:\Book\Left
or
C:\Book\Right

(2) Use JPEGCrops to crop each directory, C:\Book\Left
and C:\Book\Right, and make sure it saves to the correct directory.

HINT: JPEGCrops has advanced options; use them. The
"Destination Folder = Source Folder" option helps a lot.

(3) Use MAKEBOOK.BAT for rename, merge and rotate.

(4) Use IrfanView's "Batch Conversion" with "TIF" output; under "Advanced Options",
set "Change Color Depth" to "2 Colors" and enable "Auto Adjust Colors".


You can find MAKEBOOK.BAT in this post over here
http://www.diybookscanner.org/forum/vie ... w&start=10

That's about as fast and dirty as you can get. I tried a lot of software and found that this
does just about as well as any of it. It's not perfect, but it's fast and the output is readable. It
seems to do an acceptable job for black and white images, such as manga/comics pages. Next
I need to find a way of auto-correcting the distortion.

I would also like to OCR the pages and turn them into a PDF, but most of what I have tried failed
badly, so I am leaving it as images only.
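
Since ImageMagick is already on the install list above, the rotate part of step (3) and the two-color TIF conversion of step (4) could in principle be done in one ImageMagick call per page. The sketch below is not a reproduction of MAKEBOOK.BAT or of the IrfanView settings: the helper names `bilevel_command` and `run_if_available` are hypothetical, and the rotation angle and the 50% threshold are assumptions you would have to tune for your cameras and paper.

```python
import shutil
import subprocess


def bilevel_command(src, dst, rotate_degrees=0, threshold="50%", exe="magick"):
    """Build an ImageMagick command that rotates a page and makes it black-and-white."""
    # "magick" is the ImageMagick 7 executable; ImageMagick 6 uses "convert".
    cmd = [exe, str(src)]
    if rotate_degrees:
        cmd += ["-rotate", str(rotate_degrees)]
    # Convert to grayscale, then apply a fixed cut-off:
    # pixels brighter than the threshold become white, the rest black.
    cmd += ["-colorspace", "Gray", "-threshold", threshold, str(dst)]
    return cmd


def run_if_available(cmd):
    """Run the command only when its executable is found on PATH."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True
```

For example, `run_if_available(bilevel_command("001.jpg", "001.tif", rotate_degrees=90))` processes one page. If you pass `exe="convert"` on Windows, check that it resolves to ImageMagick and not to the system's own convert.exe.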

Post by mellow-yellow » 29 Sep 2010, 17:21

Since writing my post above, I have experimented, corrected, and improved this proposal substantially. Feedback welcome! :)

NOTE: A=Attended ("your" time), U=Unattended ("CPU" time)

Fastest (300 pg book)
1. Scan with SDM using S_FAST* (8 min A)
2. Transfer L and R images to PC (2 min A)
3. Rename L (001.jpg, 003.jpg, etc.) and R (002.jpg, 004.jpg) with IrfanView in Batch (1 min A)
4. Combine results into a single folder, move to ABBYY Hot Folder** and convert to PDFs (1 min A, 20 min. U)
5. Acrobat Standard - Combine Files - to create a single PDF (1 min A, 2 min U)
Total: 13 minutes (A) or 35 minutes (A+U)
Advantages: Speed (a 300-page, OCR'd book in 13 min!), Less time waiting for and returning to the PC (#4 to #5)
Disadvantages: Poor contrast (JJJM is correct), no cropping*** (rig visible; IrfanView can crop, but you'll add 1 min. A and 6 min. U)


Better Quality (300 pg book)
1. Scan with SDM using S_FAST* (8 min A)
2. Transfer L and R images to PC (2 min A)
3. Rename L (001.jpg, 003.jpg, etc.) and R (002.jpg, 004.jpg) with IrfanView in Batch (1 min A)
4. ScanTailor L then ScanTailor R: steps #1-#4 (5 min A, 3 min U)
5. ScanTailor Cropping*** Fix (http://diybookscanner.org/forum/viewtop ... =466#p4791) (2 min. A)
6. ScanTailor L then ScanTailor R: steps #4-#6 with Mixed selected (5 min A, 7 min U)
7. Copy L and R "out" folder to ABBYY Hot Folder** for conversion to PDFs (1 min A, 20 min. U)
8. Acrobat Standard - Combine Files - to create a single PDF (1 min A, 2 min U)
Total: 25 min (A), 57 min (A+U)
Advantages: Black text on white backgrounds, good colors and contrast, cropped images
Disadvantages: Cumbersome, about 63% slower ((57-35)/35 * 100 = 62.857), more time waiting for and returning to the PC (#4 to #5, #6 to #7, #7 to #8)

* S_FAST: http://www.diybookscanner.org/forum/vie ... 5528#p5528
** ABBYY Hot Folder Settings included: PDF/A, Mixed Raster Content (MRC), text under image
*** JPEGCrops was unstable (crashes, slow, probably due to the hundreds of 12MP color images) for me on both Windows 7 and XP SP3.
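
The totals in both lists can be checked with a few lines of Python. This is only a sketch: the (attended, unattended) minute pairs are copied from the steps above, and the names `totals`, `fastest`, and `better` are made up for illustration.

```python
def totals(steps):
    """Return (attended, unattended, combined) minutes for a list of steps."""
    attended = sum(a for a, _ in steps)
    unattended = sum(u for _, u in steps)
    return attended, unattended, attended + unattended


# (attended min, unattended min) per step, in the order listed above
fastest = [(8, 0), (2, 0), (1, 0), (1, 20), (1, 2)]
better = [(8, 0), (2, 0), (1, 0), (5, 3), (2, 0), (5, 7), (1, 20), (1, 2)]

fast_a, fast_u, fast_all = totals(fastest)
better_a, better_u, better_all = totals(better)

# Relative slowdown of the "Better Quality" workflow, in percent
slowdown = (better_all - fast_all) / fast_all * 100

print(f"Fastest: {fast_a} min A, {fast_u} min U, {fast_all} min A+U")
print(f"Better:  {better_a} min A, {better_u} min U, {better_all} min A+U")
print(f"Slowdown: {slowdown:.1f}%")
```

Keeping attended and unattended minutes as separate columns makes it easy to see that most of the extra time in the second workflow is attended ScanTailor work, not CPU time.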

Post by JJJM » 29 Sep 2010, 22:36

mellow-yellow wrote: 1. Scan with SDM using S_FAST* (8 min A)
This is very fast. Are you sure it is only 8 minutes? Can you upload pictures of your scanner? Maybe it is worth building a more complex scanner myself if it saves so much time.
mellow-yellow wrote: 3. Rename L (001.jpg, 003.jpg, etc.) and R (002.jpg, 004.jpg) with IrfanView in Batch (1 min A)
For managing files (batch renaming, copying, etc.) I prefer Total Commander, a very powerful file manager which I use for everything.
mellow-yellow wrote: 8. Acrobat Standard - Combine Files - to create a single PDF (1 min A, 2 min U)
Try ClearScan OCR within Acrobat; you will get amazing PDFs: more contrast, lighter, and more ebook-friendly.

Post by mellow-yellow » 30 Sep 2010, 02:17

Thanks, JJJM, for the Total Commander and Adobe OCR tips. I'll experiment and see what sticks. In the meantime, I'm attaching a video (upload didn't work, see posting below) to show the scanning portion (8 min.) of this workflow. I currently max out at about 22 scans (44 pages) per minute, so the best case for 300 pages is 300/44 = 6.8 min.
Last edited by mellow-yellow on 30 Sep 2010, 02:38, edited 3 times in total.
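
The rate arithmetic above generalizes to other book lengths and scanning speeds. A minimal sketch, assuming two pages per scan (one per camera); the function names are made up for illustration.

```python
def scan_time_minutes(pages, scans_per_minute, pages_per_scan=2):
    """Minutes needed to capture a book at a steady scanning rate."""
    return pages / (scans_per_minute * pages_per_scan)


def seconds_per_scan(scans_per_minute):
    """Time available for each page turn and shutter press, in seconds."""
    return 60 / scans_per_minute


print(f"{scan_time_minutes(300, 22):.1f} min for 300 pages")
print(f"{seconds_per_scan(22):.1f} s per scan")
```

At 22 scans per minute this gives about 6.8 minutes for 300 pages and a little under 3 seconds per scan.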

Post by JJJM » 30 Sep 2010, 02:23

Very stressful! 1 scan every 3 seconds! I prefer a more relaxed pace. I will have a look at your video.

Post by mellow-yellow » 30 Sep 2010, 02:38

Somehow, the system wouldn't accept my video, so I posted it at blip instead: http://blip.tv/file/4184801

Post by JJJM » 30 Sep 2010, 06:21

mellow-yellow wrote: Somehow, the system wouldn't accept my video, so I posted it at blip instead: http://blip.tv/file/4184801
Very good!!

Too fast for me. I prefer to go slower just in case you have problems with sticky pages and to be more relaxed.

As far as I can see, you use the same or a similar scanner design as Daniel's model.

Give ClearScan a try and let me know your opinion, please.

Regards.

daniel_reetz
Posts: 2779
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Post by daniel_reetz » 30 Sep 2010, 13:29

mellow-yellow, I posted your video to the blog. Thanks for posting it!
