Noob Questions on Scanning Process and E-Reader Formats

recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Post by recaptcha »

Hi, new to bookscanning, but a long-time lurker on this board. Apologies for the noob questions, but I would like to know a couple of things before I plunk down money for a DIY scanner kit:

1. My main purpose for scanning books is to read my own book collection on a portable e-reader. I don't actually have an e-reader yet but it's mostly down to a choice between an iPad and a Kobo. It will depend on what to expect from the scanning process. My books are mostly academic books with a lot of photos, tables, diagrams and illustrations. If I scan these types of books with a digital camera and then process them using the methods recommended on this forum, will they be compatible with the EPUB format, and can they be reflowed on an e-reader? Or would I need to leave them in a .pdf format?

2. Is it possible to preserve the original layout of a book that has photos, tables, illustrations, etc. including page numbers and text descriptions underneath photos? Ideally what I'm after I guess is a faithful electronic version of the look/layout of the original paper book.

Thanks heaps for any feedback.
xorpt
Posts: 42
Joined: 24 Feb 2012, 01:37
E-book readers owned: Sony PRS-T1
Number of books owned: 2000

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by xorpt »

recaptcha wrote: 1. My main purpose for scanning books is to read my own book collection on a portable e-reader. I don't actually have an e-reader yet but it's mostly down to a choice between an iPad and a Kobo. It will depend on what to expect from the scanning process. My books are mostly academic books with a lot of photos, tables, diagrams and illustrations. If I scan these types of books with a digital camera and then process them using the methods recommended on this forum, will they be compatible with the EPUB format, and can they be reflowed on an e-reader? Or would I need to leave them in a .pdf format?

2. Is it possible to preserve the original layout of a book that has photos, tables, illustrations, etc. including page numbers and text descriptions underneath photos? Ideally what I'm after I guess is a faithful electronic version of the look/layout of the original paper book.
EPUB format is basically HTML files encapsulated in a container. HTML file = text format, so you have to convert images to text. That means OCR, proofreading, correcting, formatting to HTML, and creating the EPUB. Very time consuming, especially if the book structure is complex, because even commercial OCR software will not give you a perfect result (far from it...) out of the box. Maintaining the book layout is possible in EPUB, but it will take you a lot of time (especially if your book has footnotes...).

In general, I'd say that EPUB conversion is best suited to novels. For the rest, PDF or DJVU will keep everything in place, and you can have an OCR layer for searchability. Also add a table of contents.

So for your academic books, I strongly recommend buying a 9.7-inch reader such as a Pocketbook 9xx or an Onyx Boox m92, or an iPad. Each has advantages and drawbacks: e-readers are much easier on the eyes, but the navigation sucks. iPads are much better in terms of usability, but they are not comfortable to read on for a long time.
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by recaptcha »

xorpt wrote: EPUB format is basically HTML files encapsulated in a container. HTML file = text format, so you have to convert images to text. That means OCR, proofreading, correcting, formatting to HTML, and creating the EPUB. Very time consuming, especially if the book structure is complex, because even commercial OCR software will not give you a perfect result (far from it...) out of the box. Maintaining the book layout is possible in EPUB, but it will take you a lot of time (especially if your book has footnotes...).
Thanks, so you're saying it's possible to have a mix of graphics (i.e. photos, illustrations, charts) alongside text in an EPUB format, but it's just more work?
xorpt wrote: So for your academic books, I strongly recommend buying a 9.7-inch reader such as a Pocketbook 9xx or an Onyx Boox m92, or an iPad. Each has advantages and drawbacks: e-readers are much easier on the eyes, but the navigation sucks. iPads are much better in terms of usability, but they are not comfortable to read on for a long time.
Thanks, I assume the biggest problem for .pdf and DJVU would be keeping the file size down? I wonder how big of a file a 350 page book with about 20 half page graphics/photos would be?
xorpt
Posts: 42
Joined: 24 Feb 2012, 01:37
E-book readers owned: Sony PRS-T1
Number of books owned: 2000

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by xorpt »

recaptcha wrote:Thanks, so you're saying it's possible to have a mix of graphics (i.e. photos, illustrations, charts) alongside text in an EPUB format, but it's just more work?
Yes
recaptcha wrote:Thanks, I assume the biggest problem for .pdf and DJVU would be keeping the file size down? I wonder how big of a file a 350 page book with about 20 half page graphics/photos would be?
With DJVU, between 5 MB and 30 MB for a 400-page book with about 20-30 B&W illustrations/photos, if you use the correct parameters. With PDF it will be more, sometimes double or more, depending on the image format. The advantage of PDF is that it is better supported on a lot of devices, but DJVU is really better in terms of size (at the same image quality), especially for bitonal documents. I usually end up creating both formats because my Onyx Boox is very slow to turn pages on scanned PDF documents and much faster on DJVU.

The last example I did is a 660-page book with only text (only bitonal pages). It's 19 MB in DJVU and 22 MB in PDF.
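
To turn that last example into a rough guess for a 350-page book, here is a back-of-envelope sketch in Python. It assumes file size scales linearly with page count, which is my simplification and not a measured rule; the function names are made up for illustration. The figures come from a text-only bitonal book, so pages with photos would push the real number higher.

```python
def per_page_kb(total_mb, pages):
    """Average size of one page in KB, given the whole file size in MB."""
    return total_mb * 1024 / pages


def estimate_mb(pages, kb_per_page):
    """Estimated file size in MB, assuming every page costs the same (a simplification)."""
    return pages * kb_per_page / 1024


# Figures from the 660-page, text-only example: 19 MB as DJVU, 22 MB as PDF.
djvu_kb = per_page_kb(19, 660)
pdf_kb = per_page_kb(22, 660)

if __name__ == "__main__":
    print(f"DJVU: about {djvu_kb:.0f} KB per page")
    print(f"PDF:  about {pdf_kb:.0f} KB per page")
    print(f"350 pages as DJVU: about {estimate_mb(350, djvu_kb):.1f} MB")
    print(f"350 pages as PDF:  about {estimate_mb(350, pdf_kb):.1f} MB")
```

Treat the result as a lower bound for a book with half-page photos, since greyscale or colour images compress far less well than bitonal text.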
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by dtic »

Hi recaptcha,
in addition to what xorpt wrote:

See for example this public domain epub text http://archive.org/details/Wizardofozed ... epubreader . You can download the file and unzip it (change the file extension to .zip) to see how the parts fit together. There are of course some software tools to help with such assembly. But I think it would still involve a lot of manual work to get the result close to the original paper text.
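
If you'd rather not rename the file, a minimal Python sketch like the one below can list what is inside an EPUB, since the container is an ordinary ZIP archive. The demo archive it builds is a made-up stand-in (hypothetical file names, not a valid EPUB) so the example runs on its own; point `list_epub_contents` at a real downloaded .epub path to inspect that instead.

```python
import io
import zipfile


def list_epub_contents(epub_file):
    """Return the names of the files stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


def build_demo_epub():
    """Build a tiny EPUB-like archive in memory (hypothetical contents, demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Replace build_demo_epub() with a path such as "book.epub" for a real file.
    for name in list_epub_contents(build_demo_epub()):
        print(name)
```

In a real EPUB you would expect to see the text as (X)HTML files next to the images and a few metadata files, which is why photos and tables have to be placed back into the text by hand.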

For a smaller reading device like the Kindle you can quickly adapt a large page size pdf or djvu for the device screen using http://www.willus.com/k2pdfopt/ . It works great.
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by recaptcha »

Thanks. I'm kind of leaning towards a tablet (like the iPad) since tablets seem a bit more flexible, have better .pdf rendering, come in larger screen sizes, have colour screens and larger storage capacities. But yeah, the e-readers offer a nicer pure reading experience. Still making up my mind though.

So if I understand the workflow correctly, it would be something like:

1. Take pictures of book pages with camera
2. Dump images (.jpegs?) from camera onto computer
3. Clean up and crop images using Scan Tailor or Bookscan Wizard (?)
4. Convert finished images to .pdf or DJVU files (?)
5. Transfer files to tablet or e-reader
(Please correct me if this is wrong).

Forgot to ask: are DJVU files searchable? Searching for keywords is crucial to me.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by dtic »

Yeah, those are the basic steps.

But some use intermediate software, like preprocessing tools to crop and dewarp images before feeding them to Scan Tailor. The motivation for that may be quicker processing, less manual work or better quality/fewer errors. For example, the automatic page split and content boxing steps in Scan Tailor are good but not perfect. Noise in the areas of the photo outside the desired text area can sometimes be picked up by Scan Tailor. Finding and manually correcting such errors takes time. I wrote BookCrop for quick batch cropping before Scan Tailor to lower the risk of such errors. There are several other such tools in the software section - browse and see!

Yes, djvu files can be made searchable. Check out djvubind.
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by recaptcha »

And by leaving the final file in .pdf or djvu format, I wouldn't have to OCR, is this correct?

I didn't read all the way through the djvubind thread link, but in the OP it says it's currently Linux only.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by dtic »

Well, you can output both .pdf and .djvu without OCR but what you basically get then is a packaged set of images i.e. cleaned up versions of the input images. For text search you must do OCR which creates a layer of plaintext that you can search, select, copy and so on.
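
To make the idea of a text layer concrete, here is a small Python sketch of what "searchable" boils down to: each page image gets a matching chunk of plain text, and a keyword search runs over that text, not over the pixels. The sample pages are invented for illustration; in practice the text would come from an OCR tool and be stored inside the .pdf or .djvu file.

```python
def find_pages(page_text, keyword):
    """Return the sorted page numbers whose OCR text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return sorted(page for page, text in page_text.items() if needle in text.lower())


# Hypothetical OCR output: page number -> recognized text for that page image.
sample_text_layer = {
    1: "Chapter 1. The geology of Alberta",
    2: "Figure 1.1 shows a sediment layer",
    3: "Alberta's foothills and their sediment deposits",
}

if __name__ == "__main__":
    print(find_pages(sample_text_layer, "alberta"))
    print(find_pages(sample_text_layer, "sediment"))
```

Without OCR the dictionary above would simply be empty, so a search would find nothing even though the words are visible on the page images.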

Djvubind should work for Windows and Mac too. See here http://code.google.com/p/djvubind/ .

There are some alternatives for djvu creation. Some time ago I made TiffDjvuOcr, a frontend for djvulibre (djvu creation) and tesseract (OCR). I haven't tried it with the latest tesseract version, and it lacks some options.
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Noob Questions on Scanning Process and E-Reader Formats

Post by recaptcha »

OK, I always thought you could text search PDFs within, say, Adobe. I do this at work for documents I scan.

- So where would performing OCR fit within the order of my above listed workflow?
- How does OCR work when there are graphics and images on the page as well? In other words, by doing OCR does this change, or would you need to re-assemble, the layout of the page?