Creating a Digital Library
Posted: 13 Dec 2019, 14:55
Creating a digital library of 3,000 books means scanning about 1 million pages. Any DIY scanner, with its associated software, is much too slow for that. I designed and built an 80/20 DIY scanner, but in spite of being “fast,” scanning the books and processing them in software took far too long. Thus began the quest for a better solution based on cutting the books into individual pages. I have 8 scanners of various types here, and an ADF (automatic document feeder) scanner was essential for this task. Yet most consumer ADF scanners are not built for high volume. I found the enterprise-level Fujitsu fi-6670 superb. It has a robust scanner driver that can be tailored to specific scanning needs. The PaperStream Capture software is excellent, and it includes a version of ABBYY that does the OCR.
The production setup included 3 fi-6670s and 5 computers. Two of the fi-6670s are on a USB switch so that they can be operated by whichever computer is not doing OCR.
To lower the overall time, the objective is to keep all the scanners running constantly. Once a scanner completes a book, the computer attached to it stops scanning to run an OCR batch. The scanner is then switched to another computer, which starts scanning the next book. By the time that scan is complete, the OCR on the first computer is usually finished. Thus, operationally, the task is to feed the scanners constantly. All scanners ran at 600 dpi for the best OCR. Further, the scanner driver was set up to detect color and page size automatically. I set the scanner to accept pages as large as A3, and each page was sized to its actual dimensions as it passed through the scanner.
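To see why switching the scanner between computers helps, here is a minimal Python sketch of the hand-off. It is only a model: the 10-minute scan and OCR times are made-up numbers, and `total_time` is a hypothetical helper, not anything from PaperStream Capture.

```python
def total_time(num_books, scan_min, ocr_min, num_computers):
    """Minutes until the last OCR batch finishes, for ONE scanner.

    Model (an assumption, not a measurement): the scanner handles one
    book at a time, each book is scanned on the computer that becomes
    free earliest, and that computer is then busy with OCR.
    """
    scanner_free = 0.0
    computer_free = [0.0] * num_computers
    finish = 0.0
    for _ in range(num_books):
        # Pick the computer that finishes its current OCR batch first.
        idx = min(range(num_computers), key=lambda i: computer_free[i])
        start = max(scanner_free, computer_free[idx])
        scan_done = start + scan_min
        ocr_done = scan_done + ocr_min
        scanner_free = scan_done      # scanner can move to another computer
        computer_free[idx] = ocr_done  # this computer is tied up until OCR ends
        finish = max(finish, ocr_done)
    return finish


# Hypothetical timings: 10 minutes to scan a book, 10 minutes to OCR it.
one_pc = total_time(4, 10, 10, num_computers=1)   # 80.0
two_pcs = total_time(4, 10, 10, num_computers=2)  # 50.0
print(one_pc, two_pcs)
```

With one computer the scanner sits idle during every OCR batch. With two, the scanner only waits if OCR takes longer than the next scan.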
The fi-6670 has a multifeed sensor that is excellent at detecting multiple pages stuck together. As a result, in creating this library I found no missed pages caused by two pages passing through the scanner as one. It is also very common for glue, especially from the first or last pages, to rub off onto the glass of the scanner. If not removed, this creates a line through the scanned text. I therefore checked the image sensor glass after each book and cleaned it as required.
Book cutting requires special care. I purchased a used, relatively low-cost paper cutter, and it worked very well. Only once was it necessary to have the blade sharpened. Cutting books takes more than just putting them in the cutter and chopping. Thicker books, typically above 1/2”, curl during the cut. In the extreme, the blade can cut into the edge of the gutter and cut text off. Further, there is a fine line between cutting at the edge of the binding and cutting deep enough that no glue remains on the pages. Glue can seep into the pages during binding, which leaves multiple pages stuck together. When that happens during scanning, the scanner halts and the pages must be separated, which slows the process. Thus, after the cut, I fanned the pages to catch any that were still glued. I never cut off any text, but it was a constant vigil to balance the cut depth, glued pages and the book gutter. For any thick book, I had to cut the spine to divide the book into sections about 1/2” thick and then scan all the sections. Making sure that the sections were in order was essential so that the final scan exactly matched the original book. Scanning books of 1,000 to 1,500 pages was routine.
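If each section of a thick book is scanned to its own file, the order can be checked before the sections are joined. Here is a rough Python sketch. The `_partN.pdf` naming scheme is purely an assumption for illustration, not how I actually named things.

```python
import re


def section_number(filename):
    """Return the section number from a name like 'title_part12.pdf'.

    The '_partN.pdf' suffix is a hypothetical naming convention.
    """
    match = re.search(r"_part(\d+)\.pdf$", filename)
    if match is None:
        raise ValueError(f"no section number in {filename!r}")
    return int(match.group(1))


def ordered_sections(filenames):
    """Sort section files numerically and check that none is missing."""
    ordered = sorted(filenames, key=section_number)
    numbers = [section_number(name) for name in ordered]
    expected = list(range(1, len(ordered) + 1))
    if numbers != expected:
        raise ValueError(f"sections out of sequence: found {numbers}")
    return ordered


print(ordered_sections(["atlas_part2.pdf", "atlas_part3.pdf", "atlas_part1.pdf"]))
```

Sorting on the number matters because a plain alphabetical sort puts `_part10` ahead of `_part2`.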
In general, softcover books were easier to deal with because the separate step of cutting off the hard cover was not required. Once it came to cutting the pages, softcover and hardcover books were the same.
In the process of creating the digital library I set up a number of steps, one of which was to scan the front and back covers of each book before it was cut. There were two reasons for this: the full extent of the cover was captured (the cover of a softcover book is slightly smaller after the cut), and it served as an independent check on the books scanned. That is, sometimes a scanned book was “lost.” I would not know this at the final stage of the process without an independent check, and the cover scans provided it.
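The cover check amounts to comparing two lists. Here is a minimal Python sketch of that comparison. The book identifiers (bin number plus position) and the function name are hypothetical, made up for the example.

```python
def find_lost_books(cover_ids, scanned_ids):
    """Compare the books that have cover scans with the books that have PDFs.

    Both arguments are iterables of book identifiers. How a book is
    identified is an assumption here; any consistent label would do.
    """
    covers = set(cover_ids)
    scanned = set(scanned_ids)
    return {
        # Cover was scanned but no finished book PDF exists: a "lost" book.
        "missing_scan": sorted(covers - scanned),
        # Book PDF exists but the cover step was skipped.
        "missing_cover": sorted(scanned - covers),
    }


covers = ["bin07-01", "bin07-02", "bin07-03"]
books = ["bin07-01", "bin07-03"]
print(find_lost_books(covers, books))
# {'missing_scan': ['bin07-02'], 'missing_cover': []}
```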
Another step was to name each book’s PDF file after its title and author. Since all the books had been through OCR, they were searchable. Further, a scan of a hardbound book’s pages would not include the dust jacket, so the cover scan captured the dust jacket, and this was added to the final version of the PDF. Having the cover as the first page of the PDF was excellent. In Windows, I selected the large icon display for the folder. For PDF files this shows the first page, which is the cover under the procedure outlined here. When I open the folder with the PDF books, it looks like a bookshelf with the covers of the books visible.
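Titles often contain characters that Windows will not accept in a file name, such as colons and question marks. Here is a small Python sketch of building a "title, author" file name. The exact separator and the clean-up rules are assumptions, and `pdf_filename` is a hypothetical helper.

```python
import re

# Printable characters that Windows does not allow in file names.
_FORBIDDEN = r'[<>:"/\\|?*]'


def _clean(text):
    """Replace forbidden characters with spaces and tidy the whitespace."""
    text = re.sub(_FORBIDDEN, " ", text)
    text = re.sub(r"\s+", " ", text)
    # Windows also dislikes names that end in a space or a period.
    return text.strip(" .")


def pdf_filename(title, author):
    """Build a 'Title, Author.pdf' file name (the format is an assumption)."""
    return f"{_clean(title)}, {_clean(author)}.pdf"


print(pdf_filename("Moby-Dick; or, The Whale", "Herman Melville"))
# Moby-Dick; or, The Whale, Herman Melville.pdf
print(pdf_filename('What is "Life"?', "Erwin Schrodinger"))
# What is Life, Erwin Schrodinger.pdf
```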
The process steps were as follows: place a group of books, typically 25 to 30, into a numbered plastic bin. Scan the front and back cover of each book. Cut the books. Scan each book. It typically took 4 to 6 hours per bin to go from books to OCRed PDF files. The books, when done, were discarded. Working off and on, I turned the 1,000 books into the digital library in 3 months.
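The bin figures give a rough idea of the total effort. Here is a back-of-the-envelope Python sketch using only the numbers above. It assumes bins are processed one after another, which overstates the elapsed time when several scanners work in parallel.

```python
import math


def bin_estimate(total_books, books_per_bin, hours_per_bin):
    """Return (number of bins, total bin-hours) for a batch of books."""
    bins = math.ceil(total_books / books_per_bin)
    return bins, bins * hours_per_bin


# Best case: full bins of 30 books at 4 hours each.
print(bin_estimate(1000, 30, 4))  # (34, 136)
# Worst case: bins of 25 books at 6 hours each.
print(bin_estimate(1000, 25, 6))  # (40, 240)
```

So 1,000 books come to roughly 34 to 40 bins and somewhere between 136 and 240 bin-hours, which is consistent with doing the work off and on over 3 months.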
The fi-6670s and the paper cutter were purchased used. With diligence and care in buying, the quality of all the units was excellent and the price a fraction of new. I will eventually sell everything, and the net result is that all this hardware was basically on “rent.”
For reading I use Acrobat Reader on the PC and GoodReader on iOS devices. The latter app is superb.