
Creating a Digital Library

Posted: 13 Dec 2019, 14:55
by John_Latta

Creating a digital library of 3,000 books means scanning about 1 million pages. DIY scanners and their associated software are much too slow for that. I designed and implemented an 80/20 DIY scanner, but in spite of it being “fast,” scanning the books and processing them in software took far too long. Thus the quest for a better solution, based on cutting the books into individual pages. There are 8 scanners of various types here, and an ADF (automatic document feeder) scanner was essential for this task. Yet most consumer ADF scanners are not built for high volume. I found the enterprise-level Fujitsu fi-6670 superb. It has a robust scanner driver that can be tailored to specific scanning needs. The PaperStream Capture software is excellent. A version of Abbyy, which does the OCR, is included with the software.

The production set up included 3 fi-6670’s and 5 computers. 2 of the fi-6670 scanners are on a USB switch so that they can be operated by a computer that is not doing OCR.

IMG_5327 - small.jpg

To lower the overall time, the objective is to keep all the scanners running constantly. Once a scanner completes a book, its computer stops scanning to run an OCR batch. The scanner is then switched to another computer, which starts scanning the next book. By the time that scan is complete, the OCR on the first computer is usually finished. Thus, operationally, the task is to constantly feed the scanners. All scanners run at 600dpi for best OCR. Further, the scanner driver was set up to recognize color and page size automatically. I set up the scanner to accept pages as large as A3, but each page was sized to its actual dimensions as soon as it passed through the scanner.

The fi-6670 has a multifeed sensor that is excellent at detecting multiple pages stuck together. As a result, in creating this library I found no missed pages caused by two pages passing through the scanner as one. It is also very common for glue, especially the glue on the first or last pages, to rub off on the glass plate of the scanner. If not removed, this creates a line in the text. So I checked the image sensor glass after each book and cleaned it as required.

Book cutting requires special care. I purchased, used, a relatively low cost paper cutter.

IMG_5186 - small.jpg
This worked very well. Only once was it necessary to have the blade sharpened. Cutting books takes more than just putting them in the cutter and chopping. Thicker books, typically above 1/2”, curl during the cut. In the extreme, this can cause the blade to cut into the edge of the gutter and cut text off. Further, there is a fine line between cutting at the edge of the binding and cutting deep enough that no glue remains on the pages. Glue, during binding, can seep into the pages. This gives rise to multiple stuck pages. During scanning the scanner then halts and the pages must be separated, slowing the process. Thus, after the cut the pages are fanned to catch any glued pages. I never cut off any text, but it was a constant vigil to balance cut depth, glued pages and the book gutter. For any thick book I had to cut the spine to create 1/2” thick sections of the book and scan all such sections. Making sure that the sections were in order was essential so that the final book scan exactly matched the original book. Scanning books of 1,000 to 1,500 pages was routine.

In general, soft cover books were easier to cut, in that the separate step of cutting the hard cover off was not required. In the end, when cutting the pages, soft and hard were the same.

In the process of creating the digital library I set up a number of steps, one of which was to scan the front and back covers of each book before it was cut. There were two reasons for this: the full extent of the cover was captured (the cover on soft cover books was slightly smaller after the cut), and it was an independent check on the books scanned. That is, sometimes a scanned book was “lost.” I would not know this in the final stage of the process unless there was an independent check, and the book cover scanning provided this.
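The independent check described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the process described here: cover scans are saved as "<name>_front.jpg" and "<name>_back.jpg", and the finished book PDF is saved as "<name>.pdf". The function and folder names are hypothetical.

```python
from pathlib import Path


def book_key(cover_path: Path) -> str:
    """Strip the assumed _front/_back suffix so both covers map to one book."""
    stem = cover_path.stem
    for suffix in ("_front", "_back"):
        if stem.endswith(suffix):
            return stem[: -len(suffix)]
    return stem


def find_lost_books(cover_dir: Path, book_dir: Path) -> list[str]:
    """Return the books that have cover scans but no scanned PDF."""
    # Compare extensions case-insensitively (.jpg and .JPG both count).
    covers = {book_key(p) for p in cover_dir.iterdir() if p.suffix.lower() == ".jpg"}
    books = {p.stem for p in book_dir.iterdir() if p.suffix.lower() == ".pdf"}
    return sorted(covers - books)
```

Calling find_lost_books(Path("covers"), Path("books")) on one bin's folders would list any book whose covers were scanned but whose PDF never arrived.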

Another step was to name the PDF file of each book with its title and author. Since all the books had OCR, they were searchable. Further, a scan of a cut hardbound book would not include the dust jacket; the cover scanning captured the dust jacket, and this was added to the final version of the PDF book. Having the cover as the first page of the PDF was excellent. In Windows, I selected the large-icon file display. Even for PDF files the first page is shown, which is the cover, given the procedures outlined here. When I open the folder with the PDF books it looks like a bookshelf with the covers of the books visible.
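For the naming step, a small helper can build a Windows-safe file name from the title and author. This is a sketch, not the procedure actually used: the "Title, Author.pdf" pattern and the function name are assumptions. The characters removed are the ones Windows does not allow in file names.

```python
import re

# Characters Windows rejects in file names.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*]')


def pdf_filename(title: str, author: str) -> str:
    """Build an assumed 'Title, Author.pdf' name that Windows will accept."""

    def clean(text: str) -> str:
        text = _FORBIDDEN.sub(" ", text)
        return " ".join(text.split())  # collapse runs of whitespace

    return f"{clean(title)}, {clean(author)}.pdf"
```

For example, a title containing a colon, such as "Lincoln: A Life" by a hypothetical "Jane Doe", becomes "Lincoln A Life, Jane Doe.pdf".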

The process steps included the following: place a group of books into a plastic bin, typically 25 – 30. The bin was numbered. Scan the front and back cover of each book. Cut the books. Scan each book. It would typically take 4 – 6 hours per bin from the books to OCRed PDF files. The books, when done, were discarded. Off and on, the 1,000 books were turned into the digital library in 3 months.
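The bin figures above give a rough way to estimate the total effort. The sketch below is plain arithmetic on numbers quoted in this post (25 to 30 books per bin, 4 to 6 hours per bin, 3,000 books); it is an estimate, not a measurement, and the function name is hypothetical.

```python
import math


def estimate_hours(total_books: int, books_per_bin: int, hours_per_bin: int) -> tuple[int, int]:
    """Return (number of bins, total hours) for a given library size."""
    bins = math.ceil(total_books / books_per_bin)
    return bins, bins * hours_per_bin


best = estimate_hours(3000, 30, 4)   # full bins, fast batches: (100, 400)
worst = estimate_hours(3000, 25, 6)  # smaller bins, slow batches: (120, 720)
print(best, worst)
```

On these assumptions a 3,000 book library is somewhere between 400 and 720 hours of bin processing.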

The fi-6670’s and the paper cutter were purchased used. With diligence and care, the quality of all the units was excellent and the price a fraction of new. I will eventually sell everything, and the net result is that all this hardware was basically on “rent.”

For reading I use Acrobat Reader on the PC and GoodReader on iOS devices. This latter app is superb.

Re: Creating a Digital Library

Posted: 31 Dec 2019, 18:59
by daniel_reetz
Thanks for sharing all this experience, John_Latta.

Re: Creating a Digital Library

Posted: 31 Jan 2020, 11:20
by recaptcha
Hi John. Thanks for posting.

Just curious, what kind of cutter is that? And how much did it cost? I think I might need something like that.

Re: Creating a Digital Library

Posted: 03 Feb 2020, 18:25
by John_Latta
This is a low cost Chinese paper cutter. It is mostly listed by its model number, 450VS+ or 480VS+; mine is a 450VS+. It cost about $900. I bought it used.

This is widely available on eBay and other sources. I recall even Amazon has it listed.

But this is not the full story. There are several issues to address with a paper cutter.
(1) Shipping
(2) Weight and placement
(3) Source
(4) Blade sharpening

I have had to contend with all four.

Virtually all paper cutters are shipped via truck. I have not had good experience with freight. In spite of excellent packaging the freight company dropped my unit even when it was on a pallet. It took over a month to finally get it addressed. I was an eBay buyer and the seller was excellent.

Once on location, finding a place for the unit and moving it can be difficult. It takes two people to lift the 450VS+. Do not place the unit outside, as this will rust the blade. Mine is inside. Most commercial paper cutters weigh a lot more.

There are many paper cutters available. I was on the lookout on Craigslist and eBay for months. There are a few brand names of cutters that serve the print industry. Frequently, when print shops close, these come up at reasonable prices. These are industrial grade, but I did not need that quality level, and they are larger and heavier still.

The effectiveness of the unit is based on how sharp the blade is. I have 3 spares and can replace blades easily. Finding a company to sharpen the blades was a challenge - these are specialty companies. That said, I found that a sharp blade lasted me about 2,000 books. Most printers with paper cutters do not think of "debinding."

Overall I am very pleased with the unit. Cutting books is a process and I have this down. The paper cutter is just a tool in that process.

Re: Creating a Digital Library

Posted: 20 Jun 2020, 04:06
by John_Latta
Creating a Digital Library – Update

Having created a 5,500 book digital library, I thought an update to the post might be useful.

The scanning configuration has been expanded to 6 computers so that there are 2 computers per Fujitsu fi-6670. As a result when production is underway the scanners are in near constant operation. Operating 3 scanners and 6 computers is about all a single person can do to keep up with the flow. With an inflow of 10 – 40 books/week this configuration makes scanning easy and rapid. Inventory, debinding and scanning is a batch operation.

Books are stored in two ways: Catalog and Bookshelf.

The catalog is a file structure organized by topic. Here is a portion.
Book Folders.JPG

Opening the Abraham Lincoln folder nets these books.

Books #2.JPG

Books are identified by Title and Author. Here is one book as seen in Windows Extra Large Icons. Note the PDF icon. Windows shows the first page of the file as the file icon.

Lincoln Book.JPG

The name of the file for the book is the same as the actual book PDF file.

The bookshelf is a single folder with all the books in it. To visualize the contents of the folder, the display in Windows is set to Extra Large Icons. Here is a portion of the bookshelf.

In both of these storage techniques having a front cover for the book is essential. I remember books frequently by their covers.
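Keeping the catalog and the bookshelf in step can be scripted. The sketch below is one possible approach, not the method used here: it assumes the catalog is a tree of topic folders containing ".pdf" files, that the bookshelf is a separate folder outside the catalog, and that file names are unique across topics. The function and folder names are hypothetical.

```python
import shutil
from pathlib import Path


def build_bookshelf(catalog: Path, bookshelf: Path) -> list[Path]:
    """Copy every PDF found under the topic folders into one flat folder."""
    bookshelf.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(catalog.rglob("*.pdf")):
        target = bookshelf / pdf.name
        if target.exists():
            continue  # already on the shelf; never overwrite an existing file
        shutil.copy2(pdf, target)  # copy2 also preserves the file timestamps
        copied.append(target)
    return copied
```

Running it again after adding new books to the catalog copies only the new files, so it can be repeated after every batch.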

Thus, a challenge is to make sure each digital book has a cover. This is relatively easy with soft cover books, as the cover is usually scanned as part of the book. For hard cover books the best front page is the dust jacket, if it is still on the book. As outlined in the original post, a quality check is that every book has its covers scanned before debinding. Applying the cover to a scanned book is therefore a post-processing operation: the front and back cover images are added to the PDF file of the scanned book. Multiple programs can do this (the operation is merging files into a PDF), and I have found Nitro Pro to be quick and effective. The result is a complete digital book, front to back, that is searchable.

An objective of creating a PDF book is to make it as close to the original book as possible – front and back covers and all the pages. In some cases even the spine is included.

I use multiple devices to read digital books. There are two iPads here, a mini and Pro. Books can also be put on laptop computers and the newer touch screen devices are excellent. The 12.9” Pro is heavy but with 512GB of storage it is well suited to house the personal library – the large screen makes for easy reading. As outlined before an excellent iOS reader is GoodReader. It will also show the front cover of the books.

GoodReader Icon.png

Currently the digital library has 5,500 books of high quality which can be readily searched and read.

As the size of the library increases, the scanning process becomes a smaller part of the overall library operations. Creating and maintaining the library as outlined above is time consuming. That is what libraries are for. How a book was printed is not of interest to the librarian – it is just the book. The same applies here. However, given that the books in my library started as bound books, a process is required to do the conversion, which has been described in these posts.

Why this effort and scope? I am an author of a book and need access to many books and the ability to quote them. Thus, ready access to the text, ease of reading and the ability to make citations is an essential part of the use of the digital library. What is described here is only a precursor to use.

Re: Creating a Digital Library

Posted: 20 Jun 2020, 15:32
by daniel_reetz
John, thank you so much for this update and for sharing this insightful and impressive work.

Might be a slightly sensitive question for some, but I see in your original post that you discard some or all of the books after debinding and scanning. Two questions:
* Did you proceed this way, and are you happy with that approach?
* Were there any books you chose not to discard, and if so, how did you handle those?

      Re: Creating a Digital Library

      Posted: 21 Jun 2020, 02:55
      by John_Latta
      Good questions.

      Before answering I should admit I am a bibliophile. That is, I love books. They represent a storehouse of knowledge. In the age of the internet and 100+ word postings or web pages, books are something that has been thought about and not smashed into the ethersphere. There is another medium for words in science and these are journal articles but that is another topic.

      Some of the books I seek out are difficult to find and in many cases old – some over 100 years. Yet, in the age of the search engine I can find virtually any book. I have pangs of reservation when I cut these books up. Yet trying to digitize them bound takes too long and can also be destructive. Thus, every book has been cut up for digitization. A key issue here is time – scanning a book with the setup shown here takes about 1 second per page at 600dpi, producing a correctly sized final page. I do not know of any DIY scanner that can do this.

      Yes, I keep some books but very few. These are typically old or physically very large. The purpose is to have a backup in case something was wrong with the digitization. But to date I have had no problems with this. Since the output is a PDF file one area of interest is – can I get photos of the pages? The quality of the digitization is excellent and, yes, page photos have been output and are quite good.

      What is described here is a media transformation, literally from pages to digital bits. In today’s world a physical book is a tangible entity with the power of format, presence and even history. Yet, with 2.2m new titles a year, the physical book business is in turmoil as booksellers disappear and self-published volumes become more prevalent.

      Self-digitization, which the DIY book movement represents, is a result of the differences in ownership between a digital and a physical book. A tangible book has ownership rights and can be resold, which is the essence of the supply chain I use to create my digital library. Electronically created books, by contrast, have severe restrictions on ownership and typically cannot be resold. For example, I was writing a review of a book and using sections of the book in the review. The book was in Kindle form, and a message popped up – the publisher has determined that you are at your limit for the amount of copying from this book!

      Thus, what this post describes is the equipment and processes used to transform the book media, without the controls of the publishers, in a way which fits my personal use. If digital book ownership was the same as physical book ownership this transformation would not be required.

      Re: Creating a Digital Library

      Posted: 21 Jun 2020, 07:12
      by TS Zarathustra
      I have no problem with the physical existence of a book being destroyed by converting it into an electronic copy. What worries me is storage, reliability, and accessibility of the electronic copy. Imagine a scenario where you've passed away and your great grandchildren (hopefully) are going through your stuff. "Mom, what is this?" asks the brightest while holding up a USB drive with the last copy of a certain book. "It's your Grand-dad's hobby. You can throw it away. There is no equipment that can access the information on this anymore." The kid persists, somehow manages to access the data and is trying to decode the format of the files when Grand Take Without Permission Self Moving Vehicle LXXII comes out and he/she throws it all in the recycling bin.
      I hope I'm being too pessimistic. :lol:

      Re: Creating a Digital Library

      Posted: 21 Jun 2020, 10:16
      by daniel_reetz
      John_Latta wrote:
      21 Jun 2020, 02:55
      Thus, what this post describes is the equipment and processes used to transform the book media, without the controls of the publishers, in a way which fits my personal use. If digital book ownership was the same as physical book ownership this transformation would not be required.
      Perfectly said. Quoted for emphasis. It is a sad thing that (commercial) digital books do not enjoy the full benefits of their medium.

      Thank you John.

      Re: Creating a Digital Library

      Posted: 21 Jun 2020, 10:18
      by daniel_reetz
      TS Zarathustra wrote:
      21 Jun 2020, 07:12
      What worries me is storage, reliability, and accessibility of the electronic copy.
      There's a notion in the digitizing community: LOCKSS ("Lots of Copies Keep Stuff Safe"). Libraries like John's are precious and should definitely be backed up in several places. I know mine will be backed up on the Internet Archive one day.
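Several copies only help if they still match the original. One way to check a backup is to compare SHA-256 checksums file by file. The sketch below assumes the backup mirrors the folder layout of the original library and that the books are ".pdf" files; the function names are hypothetical, and this is not a description of how LOCKSS itself works.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large PDFs do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(original: Path, backup: Path) -> list[str]:
    """Return relative paths of PDFs that are missing or differ in the backup."""
    problems = []
    for pdf in sorted(original.rglob("*.pdf")):
        rel = pdf.relative_to(original)
        copy = backup / rel
        if not copy.is_file() or sha256_of(pdf) != sha256_of(copy):
            problems.append(rel.as_posix())
    return problems
```

An empty list from compare_copies(Path("Library"), Path("Backup")) means every book in the library has an identical copy in that backup.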