some general beginner questions

Don't know where to start, or stuck on a certain problem? Drop by and tell us about it. Feel like helping others? Start here.

Moderator: peterZ

Post by Veracia »

Hi Forum!

Very cool site and a very useful idea too. I'm Swedish, so forgive my English. :)

I'm interested in building a scanner to scan books in general, but also, more specifically, university course literature. I'm wondering what format it all comes out in. Do you scan it and then run it through software that extracts the information into a _text_ document, or do you create something like a PDF with "pictures" in it? I'm thinking that this whole process works best with white pages and black text. A lot of university literature (economics, for example) has a lot of design features: colours, columns, graphs, etc. Is this usable too? As you can see, I don't know anything about these file formats.

There is a lot of info about the scanners here, naturally. I'm also a bit curious about what types of e-readers can be used. There has been a lot of criticism of most e-book readers because of the limited number of formats they support. This is of course connected to the idea of buying the books from Amazon and similar stores, which limits users to those particular closed formats. Do I need any special type of e-reader for the whole DIY-scanner thing?

Thanks a lot!

Post by spamsickle »

Most folks here use Scan Tailor or Book Scan Wizard to process the JPEG images into something that can be turned into an ebook. I still prefer PDF for that, but I'm told DjVu is also popular.

Scan Tailor is good for separating a book's text from its images, which will result in better compression. In my experience, using it for university textbooks can be a very time-consuming manual process, because of all the tweaks required to the dividing line between text and image. For Scan Tailor, text is strictly a black and white affair, so if text in a chart (for example) is processed as text, it may be converted from a range of colors to all black.

Book Scan Wizard will preserve colored text, but many people find it more difficult to use. You'll probably get less compression with BSW too, because I believe it maintains the whole page as an image rather than trying to partition text.

If your objective is simply to have a way to read your textbook on a computer, you could do what I've started doing with these sorts of books, and simply rotate and collate the JPEG images, then read them with an image viewer. You don't have the advantage of searchable text that comes with OCR, but I've rarely missed it. When you're reading the hard copy, you don't have searchable text either (though to be fair, it's easier to flip to the index and flip back to an indexed page in a book than it is in an image viewer).
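
For the collating step, here is a minimal sketch in Python. It assumes a two-camera rig where one camera shoots all the left-hand pages and the other all the right-hand pages, and that each list is already in shooting order. The function name `collate` and the `page_0001.jpg` naming scheme are made up for illustration. Rotation is not shown; any image tool can do that.

```python
from itertools import zip_longest


def collate(left_pages, right_pages):
    """Interleave left/right page images into reading order.

    Assumption: both lists are already in shooting order, and the first
    left-hand page comes before the first right-hand page.
    Returns (new_name, original_name) pairs; it does not touch any files.
    """
    ordered = []
    for left, right in zip_longest(left_pages, right_pages):
        for page in (left, right):
            if page is not None:  # one camera may have shot one page fewer
                ordered.append(page)

    # Zero-padded names sort correctly in any image viewer.
    width = max(4, len(str(len(ordered))))
    return [
        (f"page_{number:0{width}d}.jpg", source)
        for number, source in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for new_name, source in collate(["L1.jpg", "L2.jpg"], ["R1.jpg", "R2.jpg"]):
        print(source, "->", new_name)
```

The function only produces the rename plan, so you can check it before actually renaming or copying anything.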

If you purchase electronic books from the publisher, charts and graphs are usually vector images. That's the best solution, because then you can zoom in to read tiny annotations layered over a picture, for instance. It would take a rare obsessive-compulsive personality to put in the work to vectorize thousands of scanned illustrations in a typical 1400-page textbook, and you'd have to be working with pretty high-resolution images to start with. These are the most difficult books to scan and process well.

Post by Turtle91 »

The type of reader is strictly your preference - you just need to make sure that it will read the format that YOU create/prefer.

PDF is a pretty common format for putting documents on the internet. A lot of people on this forum talk about converting everything to PDF with searchable text layered underneath. That way you can see what the actual book looked like (from the picture), and if you want to search for specific text, it can do that too. If there are any OCR errors, I don't think you would actually see them; they just wouldn't respond correctly to the search. Some people (mostly students with textbooks, people doing research, or people digitizing historical documents) want to keep the exact layout of the book or document they are scanning, because it is important to have page numbers for references.

PDFs do not take advantage of one primary function of ebooks: the ability to "reflow" the text to fit the size or orientation of your screen. I don't know how often I've been frustrated trying to read technical manuals that have multiple columns in PDF format. I would have to scroll horizontally and vertically multiple times per page. On a desktop display it's not as big a deal, but on smaller displays like a tablet or smartphone it would be crazy. To see the whole page at one time, the letters would have to be so small that my old eyes couldn't see them!

I personally prefer to do the full OCR and store the file as text/HTML. This usually results in a much smaller file size, so I can fit as many books on my reader as possible. Also, most readers can handle those formats. I say MOST, but you need to make sure that the reader isn't limited to a specific proprietary format. Rocket books, the original Kindle, etc. wouldn't read anything but their own format. If the reader can't handle text/HTML, it is pretty easy to convert from text/HTML to whatever format you want (there is usually free software readily available). The downside is that it takes a bit longer to do the full OCR and then go back and check for errors. The good thing is that if you set up your cameras properly and get a good scan to begin with, the software we use here, Scan Tailor or Book Scan Wizard, fixes most common scan errors, making the OCR much more accurate.
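
As a rough illustration of the text-to-HTML step, here is a small Python sketch. It assumes the OCR output is plain text with blank lines between paragraphs, which may not match what your OCR program produces. The function name `text_to_html` is made up for this example. Joining the hard-wrapped lines inside each paragraph is what lets a reader reflow the text later.

```python
import html


def text_to_html(text, title="Untitled"):
    """Wrap OCR'd plain text in a bare-bones HTML page.

    Assumption: paragraphs are separated by blank lines. Line breaks
    inside a paragraph are treated as scan artifacts and collapsed.
    """
    paragraphs = [block for block in text.split("\n\n") if block.strip()]
    body = "\n".join(
        # split()/join collapses line breaks and runs of spaces;
        # html.escape protects characters such as & and <.
        f"<p>{html.escape(' '.join(block.split()))}</p>"
        for block in paragraphs
    )
    return (
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>"
    )


if __name__ == "__main__":
    sample = "Supply and demand\nset the price.\n\nCosts & revenue follow."
    print(text_to_html(sample, title="Chapter 1"))
```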

A format that is pretty much becoming the "standard" for ebooks is .ePub. It is basically just .html, image, font, table of contents, and style files (plus the new ePub 3.0 will allow audio and read-out-loud files) all packed into a .zip file, with the extension renamed to .ePub. If anyone cares to rummage around, just rename the extension to .zip and you can open it up. The main benefit of this format is that "most" newer readers will open it. Even better, there are a few FREE programs out there that will automatically create, or reformat, the ePub for you. http://calibre-ebook.com comes to mind, but there are others. If you want to see how it can be done manually, or just want to spend a few minutes because you have nothing else to do, check out a tutorial that Aaron DeMott put together over on http://www.jedisaber.com/ebooks/formatsource.asp.
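
Since an ePub is a zip archive, you can also peek inside one with a few lines of Python instead of renaming it. This is only a sketch: `list_epub_contents` and `build_demo_epub` are names made up for the example, and the demo archive is a stand-in with placeholder contents, not a valid ePub. The `OEBPS` folder name is just a common convention.

```python
import io
import zipfile


def list_epub_contents(epub_file):
    """Return the names of the files packed inside an ePub (a zip archive).

    epub_file may be a path or a file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


def build_demo_epub():
    """Build a tiny in-memory archive laid out like an ePub (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Real ePubs start with an uncompressed "mimetype" entry.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        # Placeholder contents; a real book has proper XML and HTML here.
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/chapter1.html", "<html><body><p>Hi</p></body></html>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_epub_contents(build_demo_epub()):
        print(name)
```

To look inside a real book, pass its path to `list_epub_contents` instead of the demo buffer.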

One piece of reader software that seems to do a good job with display of pictures, formatting, readability, etc. is "Blio" (http://www.blio.com/). It works on Windows, iThings, Windows 7 mobile, and Android, with OS X still in the works. It reads the ePub and XPS(??) formats. I have only looked at it briefly, so please don't take this as an endorsement.

I know that is a lot of info - but the point was that any "reader" you get needs to read the format that YOU create/have. After that it's just your preference on what bells and whistles the reader supports to make your job - reading the book - more enjoyable.

Cheers!