
Actually reading ebooks on an ebook reader

daniel_reetz
Posts: 2776
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Post by daniel_reetz » 02 Oct 2009, 13:04

StevePoling started talking about what is necessary to get books onto an ebook reader for reading. I thought it deserved its own thread, so here it is:
StevePoling wrote:
nalfonso wrote:First, you need to very clearly define your needs. To say that you need a book scanner is not enough. If you have browsed this forum enough, you will notice that the range of designs is wide and deep. Some people just need to have the images for viewing or reading, others need to have PDFs, and others need to convert to text using OCR, to name just a few. Also, by defining precisely what your needs and specific objectives are, the forum members will be in a position to contribute appropriate ideas and suggestions. But you must communicate your needs.
MY specific objective is to put my library onto my Kindle DX and SONY Reader (PRS-505). I also want to scan my aunt's Poling genealogy book that's not in print anyplace.

I'd like to know whether anyone has done any serious analysis of the requirements of electronic readers. Extrapolating from the scanned PDFs I've found on the web, I think that PDF files of scanned images (without OCR) will be:
1) fine if viewed on a laptop,
2) less than fine on the Kindle DX,
3) marginal (if usable at all) on the SONY Reader or the Kindle 1 or 2.

I think that books for readers with small screens (anything but the Kindle DX) will have to be OCRed so the text can reflow on the smaller screen. That's less of a problem with larger displays. I've found my Sony and my Kindle do pretty well at ersatz large-print editions, but only with non-PDF formats.

I believe (and I'm looking for someone to confirm or deny this) that if you're looking to create a PDF of a book that's a trade paperback or smaller, it'll be usable on the Kindle DX. But a SONY Reader will require epub format. I don't know whether this is possible, but I suppose someone needing epub would do this:
1) scan the book
2) OCR the images
3) clean up text and put it into XHTML format
4) put XHTML into epub file.
With epub in hand, you could then use a program like Calibre to shift it into mobi format for the Kindle.
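Step 4 of the list above (putting XHTML into an epub file) can be sketched with nothing but Python's standard library. This is a minimal, single-chapter EPUB 2 layout, not a validated implementation: the function names, the `OEBPS/` folder name, and the `example-book-id` identifier are illustrative assumptions, and the output has not been checked with a validator such as epubcheck.

```python
import zipfile
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def xhtml_chapter(title, paragraphs):
    """Wrap cleaned-up OCR paragraphs in a bare XHTML page."""
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"  <head><title>{escape(title)}</title></head>\n"
        f"  <body>\n    <h1>{escape(title)}</h1>\n{body}\n  </body>\n</html>\n"
    )


def content_opf(title, book_id):
    """Package file: metadata, manifest of files, and reading order."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{escape(book_id)}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="ch1"/>
  </spine>
</package>
"""


def toc_ncx(title, book_id):
    """Table of contents with a single entry pointing at the chapter."""
    uid = escape(book_id, {'"': "&quot;"})
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="{uid}"/></head>
  <docTitle><text>{escape(title)}</text></docTitle>
  <navMap>
    <navPoint id="ch1" playOrder="1">
      <navLabel><text>{escape(title)}</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
  </navMap>
</ncx>
"""


def build_epub(target, title, paragraphs, book_id="example-book-id"):
    """Write a one-chapter EPUB to a path or file-like object."""
    with zipfile.ZipFile(target, "w") as zf:
        # The mimetype entry must come first and be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        deflate = zipfile.ZIP_DEFLATED
        zf.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=deflate)
        zf.writestr("OEBPS/content.opf", content_opf(title, book_id), compress_type=deflate)
        zf.writestr("OEBPS/toc.ncx", toc_ncx(title, book_id), compress_type=deflate)
        zf.writestr("OEBPS/chapter1.xhtml", xhtml_chapter(title, paragraphs),
                    compress_type=deflate)
```

Calling `build_epub("book.epub", "My Book", ["First paragraph.", "Second paragraph."])` would write the file. A real book would need one XHTML file per chapter, with matching manifest, spine, and navMap entries.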

I haven't built a scanner. I did find this scanned book (http://ia311524.us.archive.org/1/items/ ... ucmf_6.pdf) and OCRed it using http://www.cometdocs.com to create an MS Word document. Then I spent a few evenings cleaning it up, and some more time turning the Word document into epub. That's a lot of labor in the post-scanning if you need more than a scanned-image PDF.

Re: Actually reading ebooks on an ebook reader

Post by daniel_reetz » 02 Oct 2009, 13:17

The real question here, IMHO, is one of screen resolution VS image size.

The Kindle and Kindle 2 devices have 6" diagonal screens with 600x800 pixels (167 ppi). That is plenty to display a page image of a low-density paperback book with no OCR.

The Kindle DX has 1,200 x 824 pixels at 150 ppi. That's a bit less density but a lot more to work with. It's my opinion that most PDFs from academic journals can be read on a DX, though there will be issues with graphs, figures, small fonts.
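The density figures quoted above can be checked with a little arithmetic. A rough sketch follows; the 9.7" diagonal for the Kindle DX and the 4.25" paperback page width are assumptions not stated in this thread, and the function names are made up for illustration.

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_in):
    """Screen density from the pixel dimensions and the physical diagonal."""
    return math.hypot(width_px, height_px) / diagonal_in


def page_pixels_per_inch(screen_width_px, page_width_in):
    """Pixels available per inch of the printed page when a page image
    is scaled to fill the screen width."""
    return screen_width_px / page_width_in


kindle_ppi = pixels_per_inch(600, 800, 6.0)   # 6" screen, from the post above
dx_ppi = pixels_per_inch(824, 1200, 9.7)      # 9.7" diagonal is an assumption

print(round(kindle_ppi), round(dx_ppi))       # 167 150

# Assumed 4.25"-wide paperback page shown full-width on a 600-pixel-wide screen.
print(round(page_pixels_per_inch(600, 4.25)))  # 141
```

The second number is the one that matters for unOCRed scans: it is the effective resolution at which the device can show the page, regardless of how finely the page was scanned.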

A lot of this has been hashed out on the MobileRead forums, though I haven't been there in a long time.

Basically, all e-readers have shit resolution in terms of both pixel density and grayscale resolution, though both are improving. As I see it, a properly OCR'd book pretty much guarantees a good reading experience because the reader can use native fonts and reflow as necessary to compensate for crappy resolution.

I sold my Sony PRS-500 because I couldn't stand the low-res screen. I just don't read enough easily-reflowable fiction to make that worthwhile.

jradi

Re: Actually reading ebooks on an ebook reader

Post by jradi » 02 Oct 2009, 19:45

Unless you're scanning non-fiction (things with footnotes, math/sci books, computer texts), I think OCR'ing is the way to go. Part of the problem with ereaders is that the processor isn't very fast, since they're optimized for displaying text. Processing graphics-heavy PDFs would probably tax the processor so much that flipping through pages becomes unbearable.

StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI

Re: Actually reading ebooks on an ebook reader

Post by StevePoling » 03 Oct 2009, 00:43

jradi wrote:Unless you're scanning non-fiction (things with footnotes, math/sci books, computer texts), I think OCR'ing is the way to go. Part of the problem with ereaders is that the processor isn't very fast, since they're optimized for displaying text. Processing graphics-heavy PDFs would probably tax the processor so much that flipping through pages becomes unbearable.
Not probably tax the processor. Surely do so! Processing speed is a consideration I forgot about. I've tried to read PDFs consisting of scanned pages w/o OCR and each page flip was slowwwwww.

The downside of OCRing is the cleanup afterwards. Or maybe I was just working with low-quality input. Ferinstance, do OCR packages know enough to treat the page body separately from the header (with book or chapter title) and footer (with page number)? In my experiment the headers and footers came mixed in with the body text and had to be edited out of each page. Paragraph recognition is another impediment to the nirvana of correct text reflowing.
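The per-page header and footer editing described above can be partly automated. The following is a heuristic sketch, not how any particular OCR package does it: it assumes the OCR output is available as a list of pages, each a list of text lines, and it treats a page's first or last line as a running header or footer if it is a bare page number or if the same text (ignoring digits) recurs in that position on several pages. The function name and the `min_share` threshold are invented for illustration.

```python
import re
from collections import Counter

PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def _key(line):
    # Drop digits so "MOBY DICK 12" and "MOBY DICK 14" compare equal.
    return re.sub(r"\d+", "", line).strip().lower()


def _trim(lines):
    # Remove blank lines at the top and bottom of a page only.
    lines = list(lines)
    while lines and not lines[0].strip():
        lines.pop(0)
    while lines and not lines[-1].strip():
        lines.pop()
    return lines


def strip_headers_and_footers(pages, min_share=0.3):
    pages = [_trim(page) for page in pages]
    first = Counter(_key(p[0]) for p in pages if p)
    last = Counter(_key(p[-1]) for p in pages if p)
    # A line counts as "running" if it recurs on at least this many pages.
    threshold = max(2, int(len(pages) * min_share))

    cleaned = []
    for page in pages:
        body = list(page)
        if body and (PAGE_NUMBER.match(body[0]) or first[_key(body[0])] >= threshold):
            body.pop(0)
        body = _trim(body)
        if body and (PAGE_NUMBER.match(body[-1]) or last[_key(body[-1])] >= threshold):
            body.pop()
        cleaned.append(_trim(body))
    return cleaned


pages = [
    ["MOBY DICK 12", "Call me Ishmael.", "Some years ago", "12"],
    ["MOBY DICK 13", "never mind how long", "13"],
    ["MOBY DICK 14", "precisely, having little", "14"],
]
print(strip_headers_and_footers(pages))
```

A one-off first line such as a chapter heading survives because it does not recur, but a heuristic like this will still misfire on some books, so the result needs a skim.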

OCRing a book with mixed graphics and text (e.g. a Calc text) seems like a complete nightmare. Classical works and fiction seem much more tractable. But I'd better shut up before my ignorance becomes REALLY apparent.

Re: Actually reading ebooks on an ebook reader

Post by StevePoling » 03 Oct 2009, 00:59

daniel_reetz wrote:The real question here, IMHO, is one of screen resolution VS image size.
One thing you'll learn when your personal odometer clicks over the 40 year mark is that the eyesight starts to fail. You need brighter reading light and you can't make your eyes focus unless you push the book further away.

The Kindle DX can handle an 8.5x11 technical journal PDF, but if you're like me, you'll need to get out the reading glasses. That makes it good enough for almost everything. However, even reading glasses won't help you read letter sized PDFs when your screen is smaller than the DX. Conversely, when I was an undergrad I could read amazingly small text. In that case, the smaller screens' resolution would become significant.
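To put rough numbers on the reading-glasses point: when a letter-sized page is shrunk to fit the screen, the type shrinks with it. The sketch below assumes the whole 8.5x11 page (margins included) is scaled to fit, and that the body text is printed at 10 pt; both are illustrative assumptions, as are the function names. The pixel counts and densities are the ones quoted earlier in the thread.

```python
def fitted_scale(screen_px, ppi, page_in):
    """Scale factor when a page of page_in (w, h) inches is fitted whole
    onto a portrait screen of screen_px (w, h) pixels at the given density."""
    screen_w_in = screen_px[0] / ppi
    screen_h_in = screen_px[1] / ppi
    return min(screen_w_in / page_in[0], screen_h_in / page_in[1])


def apparent_point_size(point_size, screen_px, ppi, page_in=(8.5, 11.0)):
    """Size the printed type appears at once the page has been shrunk."""
    return point_size * fitted_scale(screen_px, ppi, page_in)


# 10 pt body text on a letter page (assumed), whole page fitted to screen.
print(round(apparent_point_size(10, (824, 1200), 150), 1))  # Kindle DX: 6.5
print(round(apparent_point_size(10, (600, 800), 167), 1))   # 6" Kindle: 4.2
```

Cropping the margins before building the PDF would raise both figures somewhat, since less of the screen is spent on white space.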

I always chuckled at my grandmother's Large Print Edition books. But it's not funny anymore. It's sorta nice to click a button and make whatever you're reading Large Print.

Sorry, I digress.

jradi

Re: Actually reading ebooks on an ebook reader

Post by jradi » 03 Oct 2009, 12:33

I've had pretty good luck with the newest version of ABBYY (the cheapest version). Headers and footers are excluded, as are page numbers. It does a pretty good job of separating pictures from text and includes the picture in the OCR'd text. It even does a pretty decent job of recognizing paragraphs.

The problems I can think of so far are:
1. If you're OCR'ing a book that has a special font for the first letter of each chapter (usually a very large character), it treats it like a picture instead of text.
2. Paragraphs that spill between pages are treated as two paragraphs.
3. 0casiona11y there is confusion between 1's and l's and O's and 0's.
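Problem 2 in the list above is the easiest to patch after the fact. A minimal sketch, assuming the OCR output can be had as a list of pages, each a list of paragraph strings (the function name is hypothetical): if the last paragraph carried over from the previous page does not end in sentence-final punctuation, the first paragraph of the next page is treated as its continuation.

```python
import re

# Sentence-final punctuation, optionally followed by a closing quote or bracket.
SENTENCE_END = re.compile('[.!?]["\')]*$')


def join_pages(pages):
    """Flatten pages into one paragraph list, merging paragraphs that were
    split by a page break."""
    merged = []
    for page in pages:
        paragraphs = [p.strip() for p in page if p.strip()]
        for i, para in enumerate(paragraphs):
            spills_over = (
                i == 0
                and bool(merged)
                and not SENTENCE_END.search(merged[-1])
            )
            if spills_over:
                merged[-1] += " " + para
            else:
                merged.append(para)
    return merged


pages = [
    ["First paragraph.", "The second paragraph runs"],
    ["over the page break.", "Third paragraph."],
]
print(join_pages(pages))
# ['First paragraph.', 'The second paragraph runs over the page break.', 'Third paragraph.']
```

This misses a split that happens to fall exactly at the end of a sentence, and it does not rejoin words hyphenated across the break, so it reduces the cleanup rather than eliminating it.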

That being said, the output is readable and I don't put any effort into editing the text after it's been OCR'd. I read pretty fast and the small mistakes rarely distract me. I've shared some of my OCR'd text with friends and they've said the same thing. If it's a good book, you disappear into the book and small distractions go unnoticed.

I especially love OCR'd text because I can read it on so many different platforms - I keep a copy in my gmail account, my iphone, laptop, work computer, etc... A good book is always nearby...
