(Only Text) Books Digitalization Workflow v.1.0.

Share your software workflow. Write up your tips and tricks on how to scan, digitize, OCR, and bind ebooks.

Moderator: peterZ

JJJM
Posts: 26
Joined: 13 May 2010, 01:24

(Only Text) Books Digitalization Workflow v.1.0.

Post by JJJM »

Hello,
After setting up my low-cost single-page scanner v.1.5
http://img149.imagevenue.com/img.php?im ... _492lo.jpg
I have digitalized several books to read on my family's ebook readers (Kindle 2 and Sony PRS-300).

I have noticed that the procedures for scanning with a camera are very well detailed and optimized but, from my point of view, OCR and text editing are much more time-consuming than snapping pictures, and I have missed feedback on what other people are doing to optimize this very time-consuming task.

Maybe someone is interested in a procedure that I would like to share and that I think will save time. Having a written procedure is important because if you do not follow some precise steps you can end up with total chaos and a disastrously digitalized book.

I have divided the work into numbered phases. The phase number can be used as a code to identify the stage a document has reached, which keeps everything under control.

I hope it helps. I am sure it will be improved in the future, but I consider it an interesting starting point. I also include a download, because writing a procedure in a WYSIWYG way is not easy with this board editor.

http://www.megaupload.com/?d=63JNWC8Y


(ONLY TEXT) BOOK DIGITALIZATION WORKFLOW v.1.0.

PHASE 0
Snap pages
Create 3 folders: even, odd and full. The JPEGs go into even and odd at first; the TIFFs go into the full folder at the end.
Rotate the pictures with right click under Windows
Rename the pictures with Total Commander so that they match the book page numbers
Check that the number of pages and the content match the original
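
For anyone who prefers a script to Total Commander, here is a minimal Python sketch of the folder and renaming steps above. It is not the procedure I used, only an illustration. The function names, the `page_0001.jpg` naming scheme and the idea that camera file names sort in shooting order are all assumptions.

```python
from pathlib import Path

PHOTO_SUFFIXES = (".jpg", ".jpeg")


def prepare_folders(root: Path) -> dict:
    """Create the even, odd and full folders under root and return their paths."""
    folders = {name: root / name for name in ("even", "odd", "full")}
    for folder in folders.values():
        folder.mkdir(parents=True, exist_ok=True)
    return folders


def list_photos(folder: Path) -> list:
    """Return the JPEG files in folder, sorted by file name."""
    return sorted(p for p in folder.iterdir() if p.suffix.lower() in PHOTO_SUFFIXES)


def rename_to_page_numbers(folder: Path, first_page: int, step: int = 2) -> list:
    """Rename photos to page_NNNN.jpg (hypothetical scheme), assuming that
    sorting by file name reproduces the order in which they were shot."""
    new_names = []
    for index, photo in enumerate(list_photos(folder)):
        page = first_page + index * step
        target = folder / f"page_{page:04d}.jpg"
        # Refuse to overwrite a different file that already has this name.
        if target != photo and target.exists():
            raise FileExistsError(f"{target} already exists")
        photo.rename(target)
        new_names.append(target.name)
    return new_names
```

With this sketch the odd folder would be renamed with `first_page=1` and the even folder with `first_page=2`. Comparing `len(list_photos(...))` for both folders against the page count of the paper book covers the last check in the list.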

Tip: If a page has almost no content, the photo will be out of focus. Put a piece of paper with writing on it at the centre of the page to force focus (later you will remove its text from the OCR result).

Tip: If the platen is not properly adjusted to the page and the base of the scanner, the images will be blurry.

Tip: Also snap blank pages so that the number of photos matches the number of real pages.

PHASE 1
Run Scan Tailor on the odd and even folders separately. Manually check the content area Scan Tailor detects and correct any errors
No need to select content on blank pages
Rename the TIFF files, removing the numbers Scan Tailor adds at the beginning of the filename.
Put the odd and even TIFF files in the full folder to proceed with ABBYY FineReader (FR)
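
The renaming and collecting steps above could be scripted roughly like this. It is only a sketch: I assume the prefix is a run of digits followed by "_" or "-" (check your own Scan Tailor output before relying on that), the function names are made up, and the files are copied rather than moved so the Scan Tailor output stays intact.

```python
import re
import shutil
from pathlib import Path

# Assumed prefix shape: digits followed by "_" or "-" at the start of the name.
PREFIX = re.compile(r"^\d+[_-]")


def strip_prefix(name: str) -> str:
    """Remove one leading numeric prefix, if the name has one."""
    return PREFIX.sub("", name, count=1)


def collect_tiffs(sources: list, full: Path) -> list:
    """Copy the TIFFs from each source folder into full without the prefix."""
    full.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sources:
        for tiff in sorted(source.glob("*.tif*")):
            target = full / strip_prefix(tiff.name)
            # Two files ending up with the same name means the page
            # numbering from Phase 0 went wrong somewhere.
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            shutil.copy2(tiff, target)
            copied.append(target.name)
    return sorted(copied)
```

Names without a numeric prefix are left untouched, so running it on files that are already clean does no harm.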

PHASE 2
FR settings: convert to RTF, keep page breaks, omit headers and footers, do not keep line breaks, and set a custom paper size of 25x15 cm with small margins
Select the TIFF files and run OCR with ABBYY
In Options/Style editor, make the styles homogeneous: set body text and all other styles except footnotes to Times New Roman, size 11, no bold, no special spacing between characters. Set the footnote style to size 11, no bold and no special spacing.
Check every page of the document, correcting the areas selected by FR and any images wrongly detected by the OCR
Generate the RTF file

PHASE 3
Using Word:
Set Author and Title in the document properties.
Select everything and set the language to your local language
Select everything except footnotes and apply size 11 and no bold

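Apply these Find and Replace commands to remove page and section breaks (for this tedious and hard task I have created some semi-automated macros). In Word's Find box, ^m is a manual page break, ^b is a section break, ^- is an optional hyphen and ^$ is any letter: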

Find and replace all ("in automatic mode"):
FIND___________REPLACE
^-^m___________(nothing)
^-^b___________(nothing)
-^m____________(nothing)
-^b____________(nothing)
^-_____________(nothing)
:^m____________: (blank space)
:^b____________: (blank space)
,^m____________, (blank space)
,^b____________, (blank space)
;^m____________; (blank space)
;^b____________; (blank space)
This set of replacements applies in every case and is straightforward. All of these substitutions can be recorded in one macro.
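
For those who would rather work outside Word, the automatic list can be sketched in Python. This is not my Word macro. It assumes the text has been exported to plain text, that both page breaks and section breaks show up as a form feed character, and that optional hyphens show up as U+00AD. Those are assumptions about the export, so verify them on your own file. The names are made up.

```python
SOFT_HYPHEN = "\u00ad"  # assumed plain-text form of Word's ^-
BREAK = "\f"            # assumed plain-text form of both ^m and ^b

# Same order as the list above; ^m and ^b collapse into one rule each
# because both are assumed to be exported as a form feed.
AUTOMATIC_RULES = [
    (SOFT_HYPHEN + BREAK, ""),
    ("-" + BREAK, ""),
    (SOFT_HYPHEN, ""),
    (":" + BREAK, ": "),
    ("," + BREAK, ", "),
    (";" + BREAK, "; "),
]


def apply_automatic_rules(text: str) -> str:
    """Apply every automatic replacement to the whole text, in order."""
    for find, replace in AUTOMATIC_RULES:
        text = text.replace(find, replace)
    return text
```

The order matters: the hyphen-plus-break rules must run before the rule that deletes every remaining optional hyphen, the same as in the Word list.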

Find and replace ("in manual mode"):
FIND___________REPLACE
^$^m___________(blank space)
^$^b___________(blank space)
.^m____________Insert paragraph if replacement applies
.^b____________Insert paragraph if replacement applies
?^m____________Insert paragraph if replacement applies
?^b____________Insert paragraph if replacement applies
!^m____________Insert paragraph if replacement applies
!^b____________Insert paragraph if replacement applies
These replacements have to be checked manually because they are not always correct, depending on the surrounding text. I have created one macro for each replacement. All of them could be programmed in one macro, but I am not good at programming, so this is enough for me.
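
A single routine for the punctuation rules of the manual list might look like the Python sketch below. It is an illustration, not one of my macros. It assumes the same plain-text export as before (breaks as form feed), it covers only the ".", "?" and "!" rules, and it assumes that a rejected candidate should become a single space. The `decide` callback stands in for the person checking each case.

```python
import re

BREAK = "\f"  # assumed plain-text form of both ^m and ^b
CANDIDATE = re.compile(r"([.?!])" + BREAK)


def review_breaks(text: str, decide) -> str:
    """Replace each break that follows '.', '?' or '!'.

    decide(before, after) receives up to 30 characters of context on
    each side and returns True to start a new paragraph there.
    """
    def replace(match):
        before = text[max(0, match.start() - 30):match.end() - 1]
        after = text[match.end():match.end() + 30]
        separator = "\n" if decide(before, after) else " "
        return match.group(1) + separator

    return CANDIDATE.sub(replace, text)
```

A real `decide` would print the context and wait for a yes or no. A crude automatic stand-in is `lambda before, after: after[:1].isupper()`, which still needs a human check afterwards.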

Find and replace ("in manual mode")
FIND________________REPLACE
- (blank space)_____(nothing), if the replacement is right
This last replacement distinguishes correct hyphens in compound words from OCR errors that leave a hyphen at the end of a line. No need to write a macro.
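
If you do want some help with this last check, a small Python sketch can list the candidates so that each one is decided by hand. It assumes a plain-text export of the document, and the function names are made up.

```python
import re

# A word, a hyphen, one space, and (as a lookahead) the following word.
CANDIDATE = re.compile(r"(\w+)- (?=(\w+))")


def hyphen_space_candidates(text: str) -> list:
    """Return (left, right) word pairs separated by a hyphen and a space."""
    return [(m.group(1), m.group(2)) for m in CANDIDATE.finditer(text)]


def join_word(text: str, left: str, right: str) -> str:
    """Apply the replacement for one approved pair: remove the '- ' between them."""
    return text.replace(f"{left}- {right}", left + right)
```

Only the pairs you approve are passed to `join_word`; the rest stay as they are.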

These replacements were chosen for Spanish, but I think they apply to many other languages.
If other characters cause problems, they can be added to this list in the same way.


PHASE 4
In Word:
Run the spell checker
Check the final document: format the titles and look for lost numbers, missing footnotes and any other errors.


GO
Your RTF file is ready to be read.
Last edited by JJJM on 07 Jun 2010, 04:21, edited 1 time in total.
StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI

Re: (Only Text) Books Digitalization Workflow v.1.0.

Post by StevePoling »

if you do not follow some precise steps you can end up with a total chaos, and a total desastrously digitazed book.
"digitazed?"

I have this mental image of a book lying on the ground twitching spasmodically. with the other books in your library saying, "don't digitaze me, bro"
JJJM
Posts: 26
Joined: 13 May 2010, 01:24

Re: (Only Text) Books Digitalization Workflow v.1.0.

Post by JJJM »

StevePoling wrote:
if you do not follow some precise steps you can end up with a total chaos, and a total desastrously digitazed book.
"digitazed?"

I have this mental image of a book lying on the ground twitching spasmodically. with the other books in your library saying, "don't digitaze me, bro"
Great joke. English is not my native language, as you can imagine.

Correction done.
starsky

Re: (Only Text) Books Digitalization Workflow v.1.0.

Post by starsky »

Hi JJJM,
Your Megaupload link is dead!!
Can you post to another file sharing service?
Thanks