Basic Guide to Workflow ?
-
- Posts: 64
- Joined: 03 Sep 2010, 13:23
- Number of books owned: 0
- Location: Calgary, Alberta, Canada
Basic Guide to Workflow ?
Sorry for the noob question, and I apologize if this has been asked before, but I am new to the book scanning world and wanted to ask about the order and steps involved in the basic scanning workflow. I've searched this forum and browsed the web for a basic guideline, yet haven't really found anything that does a good job of explaining it to a neophyte. Is there a basic FAQ on the internet, or some sort of Book Scanning for Dummies guide?
I have books with plain text and others with lots of photos, graphics and charts, and want to end up with a high quality (archival or close), reasonably small-sized, fully searchable file to be read on an e-reader.
I don't actually have a scanner yet, but I just want to get a better idea of what's involved, what I'm getting myself into. From what I've gleaned so far the process goes something like:
1) Scan book pages using a scanner or camera. This is to get the images onto a computer.
2) The book page images are in image files, and need to be converted to text files.
3) Text files can only (?) be created from image files through an OCR process.
4) These OCR conversions then need to be proofed, formatted on a page (?) and compressed (?).
5) Then the proofed and compressed files can be converted to PDF (?).
6) The PDFs can be converted to an e-reader format like e-pub or .azw, or just left as PDFs.
Anyway, as you can see, except for the beginning and end, I'm not 100% sure about the order or what exactly is involved in each step.
Thanks in advance.
-
- Posts: 496
- Joined: 04 Mar 2014, 00:53
Re: Basic Guide to Workflow ?
Goes like this:
Capture. Import. Scan Tailor. PDF (+/- OCR)
That is the basic flow, yet there is tons of room for quality. Do you use Mac or PC?
-
- Posts: 64
- Joined: 03 Sep 2010, 13:23
- Number of books owned: 0
- Location: Calgary, Alberta, Canada
Re: Basic Guide to Workflow ?
So you do the OCR after making a PDF? I thought a PDF was a kind of text file.
-
- Posts: 7
- Joined: 04 Mar 2014, 00:53
Re: Basic Guide to Workflow ?
You can OCR before or after PDF.
OmniPage allows you to OCR any type of image, or a PDF itself. I guess for best results you'd do it before the PDF so you don't lose any image quality? Just a guess though, I'm new at this too.
- strider1551
- Posts: 126
- Joined: 01 Mar 2010, 11:39
- Number of books owned: 0
- Location: Ohio, USA
Re: Basic Guide to Workflow ?
recaptcha wrote: I thought a PDF was a kind of text file.
It can be, but normally it is not for book scanning, especially if you want "archival or close quality". To preserve the page layout, the fonts, etc., most people here clean up their images with Scan Tailor, compress the images, and put them directly into the pdf. Think of the pdf as a container format for a bunch of pictures. While that sounds like a whole book would be a huge file, black and white images compress very well: I get about 15.5 kB per page at ~365 dpi.
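To put the 15.5 kB per page figure in perspective, here is a rough back-of-envelope sketch in Python. The 300-page count and the 6 x 9 inch page size are assumptions made up for illustration, not numbers from this thread, and the function names are hypothetical.

```python
def book_size_mb(pages, kb_per_page=15.5):
    """Estimated total size in MB (1 MB = 1024 kB here) of a book of compressed pages."""
    return pages * kb_per_page / 1024


def raw_bitonal_kb(width_in, height_in, dpi):
    """Size in kB of one uncompressed 1-bit (black and white) page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8  # 8 pixels per byte, rows padded to whole bytes
    return row_bytes * height_px / 1024


if __name__ == "__main__":
    # Assumed values: a 300-page book with 6 x 9 inch pages.
    print(f"compressed book: {book_size_mb(300):.1f} MB")
    print(f"one raw page:    {raw_bitonal_kb(6, 9, 365):.0f} kB")
```

Under those assumptions the whole book comes to roughly 4.5 MB, while a single uncompressed page at that resolution would be a bit under 900 kB on its own.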
OCR text is icing on the cake. If you also provide the pdf with OCR information you can search the pdf and it will bring you to the page the word is on, and possibly even highlight where the word is in the image. Typically it is preferred to do the OCR step before compressing the image and putting it in the pdf, since the most efficient compression methods tend to be "lossy", but I don't know if anyone has done a study on whether this has any significant impact on OCR quality.
-
- Posts: 64
- Joined: 03 Sep 2010, 13:23
- Number of books owned: 0
- Location: Calgary, Alberta, Canada
Re: Basic Guide to Workflow ?
Thanks. Does it make any difference to OCR accuracy whether I do the PDF before or after? Is it any easier to correct OCR mistakes before or after PDF ? If I want to end up with an e-pub or .azw file should this be converted from a PDF, or could I make an e-pub file straight from ScanTailor?
Yes, I'd like a fully searchable document.
I forgot to answer univurshul's question: I'm on a Mac with a bootcamp partition, so I guess I could go either way (Windows or Mac). I'm willing to use whatever software makes things faster and easier e.g. Abbyy FineReader, OmniPage, or any of the open source ones mentioned around here. (I read somewhere that you can scan the page image right into AbbyyFineReader, thus eliminating at least one importing step).
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
Re: Basic Guide to Workflow ?
recaptcha wrote: could I make an e-pub file straight from ScanTailor
Scan Tailor takes the camera images and post-processes them into nice "OCR-ready" images. You can apply OCR to those images, yielding searchable text. Depending on your target format, the software you use after Scan Tailor can be anything. Others here can advise better than I can.
Re: Basic Guide to Workflow ?
recaptcha wrote: Thanks. Does it make any difference to OCR accuracy whether I do the PDF before or after? Is it any easier to correct OCR mistakes before or after PDF ?
In general, OCR will be most accurate on the uncompressed image, so that means before the PDF stage. Though anything is possible I suppose, so you could try testing a few pages before and after making a pdf to see what works best.
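If you do want to test a few pages both ways, one simple way to score the results is to type out a short passage by hand and compare each OCR output against it. Below is a minimal sketch using Python's standard `difflib`; the sample strings and the `similarity` helper are hypothetical stand-ins, not real OCR results.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference_text):
    """Return a 0..1 similarity score between OCR output and a hand-typed reference."""
    return SequenceMatcher(None, ocr_text, reference_text).ratio()


if __name__ == "__main__":
    # Made-up strings for illustration; paste in your own OCR output
    # for the same page from each workflow you want to compare.
    reference = "It was the best of times, it was the worst of times."
    ocr_run_a = "It was the best of times, it was the worst of times."
    ocr_run_b = "It was the bcst of tirnes, it was the worst of tirnes."

    print("run A:", round(similarity(ocr_run_a, reference), 3))
    print("run B:", round(similarity(ocr_run_b, reference), 3))
```

A score of 1.0 means the OCR text matches the reference exactly; whichever workflow scores higher on your own pages is the one to keep.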
recaptcha wrote: Yes, I'd like a fully searchable document.
That's accomplished by having a document that is both text and image in the pdf file. Basically the pdf is an image of the book page and then it has an unseen layer of text embedded in it that works with the search functions and can be output through most pdf software's save to text feature.
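As a purely conceptual illustration of the "image plus unseen text layer" idea, the sketch below models each page as an image file with a list of recognized words and their positions. This is not how a PDF stores things internally; the class names, file names, and coordinates are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    box: tuple  # (left, top, right, bottom) in image pixels


@dataclass
class Page:
    image_file: str
    words: list = field(default_factory=list)  # the hidden text layer


def search(pages, term):
    """Return (page_number, box) for every word matching term, ignoring case."""
    term = term.lower()
    hits = []
    for number, page in enumerate(pages, start=1):
        for word in page.words:
            if word.text.lower() == term:
                hits.append((number, word.box))
    return hits


if __name__ == "__main__":
    # Hypothetical two-page book with invented word positions.
    book = [
        Page("page_001.tif", [Word("Call", (100, 200, 180, 240)),
                              Word("me", (190, 200, 230, 240))]),
        Page("page_002.tif", [Word("whale", (120, 400, 230, 440))]),
    ]
    print(search(book, "whale"))
```

The reader only ever sees the page image; the word list is what lets a viewer jump to the right page and, if it keeps the boxes, highlight the match.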
recaptcha wrote: I read somewhere that you can scan the page image right into AbbyyFineReader, thus eliminating at least one importing step.
Yes, all commercial OCR software will work that way with flatbed scanners, since they can work with their drivers. They can't do the same with cameras however. But the DIY scanner designs using cameras are so much faster than flatbed scanners that it makes the separate step worth it.
To answer your broad original question, we don't really have a perfectly turnkey solution yet that can take scanned images and put them into a searchable pdf, because there are lots of different ways that people on the forum accomplish it with software for various platforms. You may need to pick which tools do each step best for you.
-
- Posts: 496
- Joined: 04 Mar 2014, 00:53
Re: Basic Guide to Workflow ?
recaptcha wrote: ...I'm on a Mac with a bootcamp partition, so I guess I could go either way (Windows or Mac). I'm willing to use whatever software makes things faster and easier e.g. Abbyy FineReader, OmniPage, or any of the open source ones mentioned around here. (I read somewhere that you can scan the page image right into AbbyyFineReader, thus eliminating at least one importing step).
When you have a book scanner, and if it's a dual camera rig, I have a Mac instructional that will help you import & prepare your images for Scan Tailor: http://www.diybookscanner.org/forum/vie ... ?f=3&t=527
It sounds like you'll want to test your own productions of PDF, ePub and djvu to see what works best for your target reading device.
Scan Tailor is superb for processing camera images. But if you're simply working with a flatbed scanner for now, then dedicated scan software like Abbyy should do the trick.
-
- Posts: 496
- Joined: 04 Mar 2014, 00:53
Re: Basic Guide to Workflow ?
strider1551 wrote: ...To preserve the page layout, the fonts, etc., most people here cleanup their images with scantailor, compress the image, and put it directly into the pdf - think of the pdf as a container format for a bunch of pictures. While that sounds like a whole book would be a huge file, black and white images compress very well - I get about 15.5 kB per page at ~365 dpi....OCR text is icing on the cake. If you also provide the pdf with OCR information you can search the pdf and it will bring you to the page the word is on, and possibly even highlight where the word is in the image. Typically it is preferred to do the OCR step before compressing the image and putting it in the pdf, since the most efficient compression methods tend to be "lossy", but I don't know if anyone has done a study on whether this has any significant impact on OCR quality.
Strider1551,
Coming from a developer like yourself, I did wonder about compression now that you mention it: what apps are you using to compress your TIFFs post-Scan Tailor? What would you recommend for iPad displays? (I'm keeping my original TIFFs so I can test freely, but a ballpark or any guidelines would be a time saver.)
And I do like the idea of OCR pre-compression as well. ...Looking forward to djvubind once ported to the Mac OS.