Re: Creating a Digital Library

Posted: 24 Dec 2020, 11:57
by rkomar
As far as the differences between and go, I believe the former accepts scans of books and then automatically OCRs them to produce the epubs. The latter uses the various distributed proofreading organizations to correct the OCR mistakes before producing the epubs. So, the epubs coming from Project Gutenberg are closer to the original text. Project Gutenberg also checks that it is legal to distribute the epubs, while it appears that does not.

Personally, I started out creating a few epubs from scans, but proofing and correcting the OCR mistakes was so much work that I gave up on that. I spent so much time going over the text that I felt I would never re-read the epub after that, which kind of defeats the purpose. :) Now, like @dpc, I clean up the scans and bundle them into PDF files rather than OCR them to create epubs. It's much less work, and I'm sure that the text is correct.

Re: Creating a Digital Library

Posted: 16 Jan 2021, 02:09
by John_Latta
Response to Questions

epubs are of no interest in my library. I want to read the original text, and the image captured in the scan is what I read. It is a waste of time to try to get the OCR results perfect. For example, the OCR results from the fi-6670 are fair but quick. For most of my needs this is acceptable. Keep in mind that I use OCR so that I can search the text for content. One of the most common OCR errors is to leave out spaces between words, but in my searching I am looking for words and this has no impact. There are a number of OCR tools here, including the best one, ABBYY FineReader, and the OCR built into Foxit PhantomPDF is quite good. However, the output of these programs can significantly reduce the size of the file, which means the image quality declines. For my use this is unacceptable. Keep in mind that focusing on quality page images increases the file size. Some of the large books with many photos are over a GB, but the PDF readers can handle this. There are no limitations on storage in my equipment bay.
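
To illustrate why dropped spaces do not hurt word searches, here is a minimal Python sketch. It is only an illustration and not the search tool used here; the function names and the sample string are made up for the example. A plain substring test still finds a word that OCR has fused to its neighbors. Matching a multi-word phrase needs one extra step, which in this sketch is removing whitespace from both the query and the text before comparing.

```python
import re


def contains_word(ocr_text: str, word: str) -> bool:
    """Case-insensitive substring test.

    Still matches when OCR has fused the word to its neighbors.
    """
    return word.lower() in ocr_text.lower()


def _squash(text: str) -> str:
    """Lowercase the text and remove all whitespace."""
    return re.sub(r"\s+", "", text).lower()


def contains_phrase(ocr_text: str, phrase: str) -> bool:
    """Phrase test that ignores spacing differences on both sides."""
    return _squash(phrase) in _squash(ocr_text)


# Hypothetical OCR output with a missing space between two words.
sample = "Thescannerproduced searchable text"

print(contains_word(sample, "scanner"))
print(contains_phrase(sample, "scanner produced"))
```

The tradeoff is that substring matching also hits inside longer words (searching for "scan" will match "scanner"), which is usually acceptable when the goal is just to locate a page.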

Iam2sam asked these questions:

Did you scan books for which good quality digital copies were freely available? If so, why? Are you satisfied with the results of your investment in time and resources in duplicating those particular digital editions?

I purchased about 100 Kindle books. These are great for reading on phones but in terms of high quality copies they are lacking. Further, my experience is that these are laden with DRM (Digital Rights Management). I will not play this game.

Overall I am very pleased with the system and processes set up to convert books to digital. Keep in mind that my current active digital library is 4,000 books. Thus, when one amortizes the investment over all these books it is small. Further, I buy nothing new. Most items, including the 6 computers connected to the three fi-6670 scanners (Lenovo workstation-class notebooks), are purchased on eBay. These are fast but relatively inexpensive. These, and the rest of the equipment, will be sold when the use is finished. Thus, I consider the investment as a lease cost.

BillGill comments

It generally takes me about 1 to 2 weeks to finish a book. That is working on them several hours a day. I find that the biggest part of the task is editing to correct errors that the OCR software generates. I generally go through the text 3 times in a text editor of some sort, and then once more in the Calibre EPUB editor.

In my process book scanning averages about 1 sec/page. Scanning a 300-page book takes about 5 minutes and the OCR batch job another 5 minutes. On top of this are debinding, front cover scanning, and indexing into the library, which together take about 10 minutes. Thus, a reasonable estimate is 20 minutes to go from a bound book to its cataloging in the library. The OCR provided with the fi-6670 is the least capable but is adequate for my immediate needs. If more accurate OCR is required, I use another OCR program as described above.

I do not digitize individual books but only batches of books. For example, all the debinding is done at one time, likewise the scanning. A batch can be 10 to 200 books.
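
The arithmetic above can be written out as a small Python sketch. The default figures come from this post (1 sec/page, 5 minutes of OCR, 10 minutes of handling); the class and method names are made up for illustration. Two assumptions are mine: OCR time is treated as a fixed 5 minutes per book, since only the 300-page case is given, and the batch total is the sum of per-book times on a single station, so real elapsed time with several scanners running would be lower.

```python
from dataclasses import dataclass


@dataclass
class ScanWorkflow:
    """Per-book time estimate built from the figures quoted in the post."""

    scan_sec_per_page: float = 1.0        # average scanning speed
    ocr_min_per_book: float = 5.0         # assumption: fixed per book
    handling_min_per_book: float = 10.0   # debinding, cover scan, indexing

    def minutes_per_book(self, pages: int) -> float:
        """Total minutes from bound book to cataloged library entry."""
        scan_min = pages * self.scan_sec_per_page / 60
        return scan_min + self.ocr_min_per_book + self.handling_min_per_book

    def hours_for_batch(self, books: int, pages_per_book: int = 300) -> float:
        """Summed working hours for a batch, assuming one station."""
        return books * self.minutes_per_book(pages_per_book) / 60


workflow = ScanWorkflow()
print(f"300-page book: {workflow.minutes_per_book(300):.0f} min")
for batch_size in (10, 200):
    hours = workflow.hours_for_batch(batch_size)
    print(f"batch of {batch_size} books: {hours:.1f} h")
```

Under these assumptions the smallest batch of 10 books is a few hours of work, and a 200-book batch comes to roughly 67 hours.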

dpc comments on portable book readers.

I find GoodReader excellent as described in an earlier post.

iam2sam comments:

I'm going to experiment to see if either Scantailor….

Recently I have needed to create digital books from books that I can only get on loan or borrow through interlibrary loan. One book was priced at $1,000 retail and outside of my purchase range. With the scanners here it was easy to scan these books, and Scantailor was used to create the digital book. This works very well but is much too time consuming. It took about 5 hours to create a book to my quality standards.

If there is a downside to my approach to creating and updating the digital library outlined here, it is keeping up. Right now, I am about 200 books behind. Catching up will take about a week, but I would not have it any other way. The high point comes when the final book folder is deposited into the Digital Library Category folders, my card index.