Basic Guide to Workflow ?

recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Basic Guide to Workflow ?

Post by recaptcha »

Sorry for the noob question, and I apologize if this has been asked before, but I am new to the book scanning world and wanted to ask about the order and steps involved in the basic scanning workflow. I've searched this forum and browsed the web for a basic guideline, yet haven't really found anything that does a good job of explaining it to a neophyte. Is there a basic FAQ on the internet, or some sort of Book Scanning for Dummies guide? :lol:

I have books with plain text and others with lots of photos, graphics and charts, and want to end up with a high quality (archival or close), reasonably small-sized, fully searchable file to be read on an e-reader.

I don't actually have a scanner yet, but I just want to get a better idea of what's involved, what I'm getting myself into. From what I've gleaned so far the process goes something like:

1) Scan the book pages using a scanner or camera. This gets the images onto a computer.
2) The book page images are in image files, and need to be converted to text files.
3) Text files can only (?) be created from image files through an OCR process.
4) These OCR conversions then need to be proofed, formatted on a page (?), and compressed (?).
5) Then the proofed and compressed files can be converted to PDF (?).
6) The PDFs can be converted to an e-reader format like e-pub or .azw, or just left as PDFs.

Anyway, as you can see, except for the beginning and end, I'm not 100% sure about the order or what exactly is involved in each step.

Thanks in advance.
univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Basic Guide to Workflow ?

Post by univurshul »

Goes like this:

Capture. Import. Scan Tailor. PDF (+/- OCR)

That's the basic outline, yet there's tons of room to improve quality at each step. Do you use Mac or PC?
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Basic Guide to Workflow ?

Post by recaptcha »

So you do the OCR after making a PDF? I thought a PDF was a kind of text file.
aslambilal
Posts: 7
Joined: 04 Mar 2014, 00:53

Re: Basic Guide to Workflow ?

Post by aslambilal »

You can OCR before or after PDF.

OmniPage allows you to OCR any type of image, or a PDF itself. I guess for best results you'd do it before the PDF so you don't lose any image quality? Just a guess though, I'm new at this too.
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Basic Guide to Workflow ?

Post by strider1551 »

recaptcha wrote:I thought a PDF was a kind of text file.
It can be, but normally it is not for book scanning, especially if you want "archival or close quality". To preserve the page layout, the fonts, etc., most people here clean up their images with Scan Tailor, compress the images, and put them directly into the PDF - think of the PDF as a container format for a bunch of pictures. While that sounds like a whole book would be a huge file, black and white images compress very well - I get about 15.5 kB per page at ~365 dpi.
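
To put that per-page figure in perspective, here is a rough back-of-the-envelope sketch. Only the 15.5 kB and ~365 dpi numbers come from the post above; the 300-page count, the 6 x 9 inch page size, and the 1 kB = 1000 bytes convention are assumptions chosen purely for illustration.

```python
def estimate_book_size_mb(pages: int, kb_per_page: float = 15.5) -> float:
    """Rough total size in MB for a book of compressed black-and-white pages."""
    # Assumes 1 MB = 1000 kB; real files add some container overhead.
    return pages * kb_per_page / 1000


def raw_bitonal_kb(width_in: float, height_in: float, dpi: int = 365) -> float:
    """Uncompressed size in kB of a 1-bit-per-pixel page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels / 8 / 1000  # 8 pixels per byte, 1000 bytes per kB


# Hypothetical book: 300 pages, each 6 x 9 inches.
book_mb = estimate_book_size_mb(300)
raw_kb = raw_bitonal_kb(6, 9)
print(f"Compressed book: about {book_mb:.2f} MB")
print(f"One raw 1-bit page: about {raw_kb:.0f} kB")
```

Under those assumptions a whole book lands in the single-digit megabyte range, while a single uncompressed 1-bit page is already far larger than its compressed version.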

OCR text is icing on the cake. If you also provide the pdf with OCR information you can search the pdf and it will bring you to the page the word is on, and possibly even highlight where the word is in the image. Typically it is preferred to do the OCR step before compressing the image and putting it in the pdf, since the most efficient compression methods tend to be "lossy", but I don't know if anyone has done a study on whether this has any significant impact on OCR quality.
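
As one possible way to do the "OCR before compression" step, the sketch below runs an OCR engine on each cleaned-up, still-uncompressed TIFF. It assumes the open source Tesseract command-line tool, which is not named in this thread, and its `hocr` output mode, which writes the recognized text plus word positions to a `.hocr` file next to the image. The file names are hypothetical. If Tesseract is not installed, the function only returns the commands it would have run.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(image: str, lang: str = "eng") -> list[str]:
    """Build a Tesseract command that writes <image name without extension>.hocr."""
    output_base = str(Path(image).with_suffix(""))  # Tesseract appends ".hocr"
    return ["tesseract", image, output_base, "-l", lang, "hocr"]


def ocr_pages(images: list[str]) -> list[list[str]]:
    """OCR each uncompressed page image; return the commands that were built."""
    commands = [build_ocr_command(img) for img in images]
    if shutil.which("tesseract") is None:
        return commands  # dry run: Tesseract not found on this machine
    for cmd in commands:
        subprocess.run(cmd, check=True)  # raises if OCR fails on a page
    return commands
```

Any lossy compression of the page images would then happen afterwards, so the OCR engine always sees the best available pixels.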
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Basic Guide to Workflow ?

Post by recaptcha »

Thanks. Does it make any difference to OCR accuracy whether I do the PDF before or after? Is it any easier to correct OCR mistakes before or after PDF ? If I want to end up with an e-pub or .azw file should this be converted from a PDF, or could I make an e-pub file straight from ScanTailor?

Yes, I'd like a fully searchable document.

I forgot to answer univurshul's question: I'm on a Mac with a bootcamp partition, so I guess I could go either way (Windows or Mac). I'm willing to use whatever software makes things faster and easier e.g. Abbyy FineReader, OmniPage, or any of the open source ones mentioned around here. (I read somewhere that you can scan the page image right into AbbyyFineReader, thus eliminating at least one importing step).
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Basic Guide to Workflow ?

Post by daniel_reetz »

recaptcha wrote:could I make an e-pub file straight from ScanTailor?
Scan Tailor takes the camera images and post-processes them into nice "OCR-ready" images. You can apply OCR to those images, yielding searchable text. Depending on your target format, the software you use after Scan Tailor can be anything. Others here can advise better than I can.
Tim

Re: Basic Guide to Workflow ?

Post by Tim »

recaptcha wrote:Thanks. Does it make any difference to OCR accuracy whether I do the PDF before or after? Is it any easier to correct OCR mistakes before or after PDF ?
In general, OCR will be most accurate on the uncompressed image, so that means before the PDF stage. Though anything is possible I suppose, so you could try testing a few pages before and after making a pdf to see what works best.
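
If you do run that before/after test, you need some way to score the results. A minimal sketch, assuming you have hand-proofread the same few pages to use as a reference: it compares each OCR output against the reference with Python's standard `difflib` and reports which one is closer. The function names are made up for this example, and the similarity ratio is only a rough proxy for OCR accuracy.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, reference: str) -> float:
    """Return a 0..1 similarity score between OCR output and a proofread reference."""
    # Collapse whitespace so differences in line wrapping are not counted as errors.
    a = " ".join(ocr_text.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def better_source(before_pdf: str, after_pdf: str, reference: str) -> str:
    """Say which OCR run is closer to the reference: 'before', 'after', or 'tie'."""
    s_before = similarity(before_pdf, reference)
    s_after = similarity(after_pdf, reference)
    if s_before == s_after:
        return "tie"
    return "before" if s_before > s_after else "after"
```

Feed it the text from the OCR run on the uncompressed images, the text from the OCR run on the finished PDF, and your proofread text for the same pages.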
recaptcha wrote:Yes, I'd like a fully searchable document.
That's accomplished by having a PDF that contains both text and image. Basically each page of the PDF is an image of the book page with an unseen layer of text embedded behind it; that text layer is what the search function works on, and it can be exported through the save-as-text feature of most PDF software.
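
A toy model of that idea, for anyone who finds code easier to follow than prose. This is not how a PDF actually stores its data; it only illustrates that what you see (the image) and what gets searched (the hidden text) are two separate things attached to the same page. The class, field names, and file names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    image_file: str   # what the reader sees on screen
    hidden_text: str  # OCR output, never displayed, used only for searching


def search(pages: list[Page], term: str) -> list[int]:
    """Return the page numbers whose hidden text contains the term (case-insensitive)."""
    needle = term.lower()
    return [p.number for p in pages if needle in p.hidden_text.lower()]


book = [
    Page(1, "page_001.tif", "Introduction to book scanning"),
    Page(2, "page_002.tif", "Build a camera Scanner rig"),
]
print(search(book, "scanner"))
```

A search that finds nothing usually means the OCR misread the word on that page, which is why proofing the text layer matters for searchability.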
recaptcha wrote:I read somewhere that you can scan the page image right into AbbyyFineReader, thus eliminating at least one importing step.
Yes, commercial OCR software generally works that way with flatbed scanners, since it can talk to the scanner drivers directly. It can't do the same with cameras, however. But the DIY scanner designs using cameras are so much faster than flatbed scanners that the separate import step is worth it.

To answer your broad original question, we don't really have a perfectly turnkey solution yet that can take scanned images and put them into a searchable PDF, because people on the forum accomplish it in lots of different ways with software for various platforms. You may need to pick the tool that handles each step best for you.
univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Basic Guide to Workflow ?

Post by univurshul »

recaptcha wrote:...I'm on a Mac with a bootcamp partition, so I guess I could go either way (Windows or Mac). I'm willing to use whatever software makes things faster and easier e.g. Abbyy FineReader, OmniPage, or any of the open source ones mentioned around here. (I read somewhere that you can scan the page image right into AbbyyFineReader, thus eliminating at least one importing step).
When you have a book scanner, and if it's a dual camera rig, I have a Mac instructional that will help you import & prepare your images for Scan Tailor: http://www.diybookscanner.org/forum/vie ... ?f=3&t=527

It sounds like you'll want to test your own productions of PDF, ePub and DjVu to see what works best for your target reading device.

Scan Tailor is superb for processing camera images. But if you're simply working with a flatbed scanner for now, then dedicated scan software like Abbyy should do the trick.
univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Basic Guide to Workflow ?

Post by univurshul »

strider1551 wrote:...To preserve the page layout, the fonts, etc., most people here clean up their images with Scan Tailor, compress the images, and put them directly into the PDF - think of the PDF as a container format for a bunch of pictures. While that sounds like a whole book would be a huge file, black and white images compress very well - I get about 15.5 kB per page at ~365 dpi....OCR text is icing on the cake. If you also provide the pdf with OCR information you can search the pdf and it will bring you to the page the word is on, and possibly even highlight where the word is in the image. Typically it is preferred to do the OCR step before compressing the image and putting it in the pdf, since the most efficient compression methods tend to be "lossy", but I don't know if anyone has done a study on whether this has any significant impact on OCR quality.
Strider1551,

Coming from a developer like yourself, I did wonder about compression now that you mention it: what apps are you using to compress your TIFFs post-Scan Tailor? What would you recommend for iPad displays? (I'm keeping my original TIFFs so I can test freely, but a ballpark or any guidelines would be a time saver.)

And I do like the idea of OCR pre-compression as well. ...Looking forward to djvubind once ported to the Mac OS.