
OCR Scanning Large Multi-Volume Document

jcobban
Posts: 1
Joined: 07 Nov 2017, 01:21
E-book readers owned: Kobo
Number of books owned: 0
Country: Canada


Post by jcobban » 08 Nov 2017, 01:46

I am trying to convert a large multi-volume document created in the 1980s. It consists of thousands of pages of typewriter-generated text plus a nominal index that, by its appearance, was created using punch cards which were sorted and fed to an IBM line printer. That is, all of the text in the index is upper case and has the fuzzy appearance of old high-speed line-printer output, created by physical type striking through a fabric ribbon. I do not understand why the author used that technology long after it was obsolete rather than entering the data into a spreadsheet, since the index is structurally a spreadsheet. One obvious consequence of that choice, however, is that the data used to generate the original output is in a format that cannot be read by any modern equipment. So, moving forward, all of the information in those thousands of pages must be recreated from scratch if I cannot find a workable OCR solution.

I have tried reading individual pages of this document from photographs that I took. Each of the index pages has two columns, so if I scan the whole page I would have to manually split and reorder every line. To avoid that, I tried processing the image using OCR Feeder, which provides a GUI for identifying the portion of the page to scan. However, I cannot get OCR Feeder to generate any usable output: all of the files it creates are empty.
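One way to avoid splitting and reordering lines by hand might be to crop each photograph into its two columns before running OCR, then join the results left column first. The following is only a minimal sketch of that idea, not something OCR Feeder does. The function names `column_boxes` and `reading_order` are hypothetical, and it assumes the page was photographed straight with the gap between columns near the middle.

```python
def column_boxes(width, height, columns=2, gutter=0):
    """Return one (left, upper, right, lower) crop box per column.

    The boxes use the coordinate order expected by Pillow's Image.crop.
    `gutter` is the width in pixels of the blank strip between columns;
    half of it is trimmed from each side of every inner column edge.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    col_width = width / columns
    half = gutter / 2
    boxes = []
    for i in range(columns):
        # Outer page edges are kept as-is; only inner edges lose the gutter.
        left = round(i * col_width + (half if i > 0 else 0))
        right = round((i + 1) * col_width - (half if i < columns - 1 else 0))
        boxes.append((left, 0, right, height))
    return boxes


def reading_order(column_texts):
    """Join per-column OCR text so earlier columns come first.

    `column_texts` is a list of strings, one per column, in left-to-right
    order. Blank lines are dropped.
    """
    lines = []
    for text in column_texts:
        lines.extend(line for line in text.splitlines() if line.strip())
    return lines
```

With Pillow installed, each box could be passed to `Image.open(path).crop(box)` and the cropped image handed to whichever OCR engine works on this print. The text returned for each column would then go through `reading_order`.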

I am going to cooperate with the copyright holder, but I want to know whether there is a practical technology for digitizing this massive document before I waste any time. I have not paid the hundreds of dollars that it would cost to purchase a copy of the original volumes because they were published in 1985 and the information is consequently 32 years out of date. So I am working from copies held by a library. The original copyright holder does not currently have anyone responsible for updating this document. If I can find a technological solution, I will volunteer to convert the obsolete document into a form usable by modern equipment and to organize the project to apply the 32 years of updates.

BruceG
Posts: 61
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: OCR Scanning Large Multi-Volume Document

Post by BruceG » 08 Nov 2017, 05:58

Hi
Is it possible to post a photo of a normal page and of the index?

Line printers are still being used today. I guess you do not have the original data. dBase was often used in the 1980s for databases. I expect there are still old computers around that can read the data and convert it to something else.
