OCR Scanning Large Multi-Volume Document
Posted: 08 Nov 2017, 01:46
I am trying to convert a large multi-volume document created in the 1980s. It consists of thousands of pages of typewriter-generated text plus a nominal index that, by its appearance, was created using punch cards which were sorted and fed to an IBM line printer. That is, all of the text in the index is upper case and has the fuzzy appearance of old high-speed line-printer output created by physical type striking through a fabric ribbon. I do not understand why the author used that technology long after it was obsolete, rather than entering the data into a spreadsheet, since the index is structurally a spreadsheet. One obvious consequence of choosing this obsolete technology, however, is that the data used to generate the original output is in a format which cannot be read by any modern equipment. So, moving forward, all of the information in those thousands of pages must be recreated from scratch if I cannot find a workable OCR solution.
I have tried reading individual pages of this document from photographs that I took. Each of the index pages has two columns, so I tried processing the image using OCR Feeder, which provides a GUI for identifying the portion of the page to scan. If I scanned the whole page at once, I would have to manually split and reorder every line. However, I cannot get OCR Feeder to generate any usable output: all of the files it creates are empty.
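To make the two-column problem concrete, here is a minimal Python sketch of the kind of split I have in mind. It is only an illustration under my own assumptions, not something OCR Feeder does: the function name column_boxes and its parameters are hypothetical, and it assumes the columns are of equal width, which a hand-held photograph may not satisfy.

```python
def column_boxes(width, height, columns=2, overlap=0):
    """Return one crop box per column, ordered left to right.

    Each box is a (left, upper, right, lower) tuple in pixels.
    `overlap` widens every box by that many pixels on each side
    (clamped to the page edges) so that text sitting close to the
    gutter is not cut off.
    Assumption: the columns have equal width.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    boxes = []
    for i in range(columns):
        left = max(0, (i * width) // columns - overlap)
        right = min(width, ((i + 1) * width) // columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes


# A 2000 x 3000 pixel photograph of a two-column index page.
for box in column_boxes(2000, 3000, columns=2, overlap=20):
    print(box)
```

The tuple order matches the box that Pillow's Image.crop accepts, so each column could be cropped out and passed to an OCR engine separately, which would keep the lines of the left column ahead of those of the right.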
I am going to cooperate with the copyright holder, but I want to know if there is a practical technology for digitizing this massive document before I waste any time. I have not paid the hundreds of dollars that it would cost to purchase a copy of the original volumes because they were published in 1985 and the information is consequently 32 years out of date. So I am working from copies held by a library. The original copyright holder does not currently have anyone responsible for updating this document. If I can find a technological solution I will volunteer to convert the obsolete document into a form usable by modern equipment and to organize the project to apply the 32 years of updates.