OCR Scanning Large Multi-Volume Document

jcobban
Posts: 1
Joined: 07 Nov 2017, 01:21
E-book readers owned: Kobo
Number of books owned: 0
Country: Canada

OCR Scanning Large Multi-Volume Document

Post by jcobban » 08 Nov 2017, 01:46

I am trying to convert a large multi-volume document created in the 1980s. It consists of thousands of pages of typewriter-generated text plus a nominal index that, by its appearance, was created using punch cards which were sorted and fed to an IBM line printer. That is, all of the text in the index is upper case and has the fuzzy appearance of old high-speed line-printer output created by physical type striking through a fabric ribbon. I do not understand why the author used that technology long after it was obsolete, rather than entering the data into a spreadsheet, since the index is structurally a spreadsheet. One obvious consequence of choosing this obsolete technology, however, is that the data used to generate the original output is in a format which cannot be read by any modern equipment. So, moving forward, all of the information in those thousands of pages of the original document must be recreated from scratch if I cannot find a workable OCR solution.

I have tried running OCR on photographs that I took of individual pages of this multi-volume document. Each of the index pages has two columns, so I tried processing the image using OCR Feeder, which provides a GUI for identifying the portion of the page to scan. If I scanned the whole page at once, I would have to manually split and reorder every line. However, I cannot get OCR Feeder to generate any usable output: all of the files it creates are empty.

I am going to cooperate with the copyright holder, but I want to know if there is a practical technology for digitizing this massive document before I waste any time. I have not paid the hundreds of dollars that it would cost to purchase a copy of the original volumes because they were published in 1985 and the information is consequently 32 years out of date. So I am working from copies held by a library. The original copyright holder does not currently have anyone responsible for updating this document. If I can find a technological solution I will volunteer to convert the obsolete document into a form usable by modern equipment and to organize the project to apply the 32 years of updates.

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: OCR Scanning Large Multi-Volume Document

Post by BruceG » 08 Nov 2017, 05:58

Hi
Is it possible to post a photo of a normal page and of the index?

Line printers are still being used today. I guess you do not have the original data. dBase was often used in the 1980s for databases. I expect there are still old computers around today that can read the data and convert it to something else.

L.Willms
Posts: 129
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: OCR Scanning Large Multi-Volume Document

Post by L.Willms » 22 Apr 2018, 09:06

jcobban wrote:
08 Nov 2017, 01:46
thousands of pages of typewriter generated text
Are those loose pages, or is it a bound book?

In the first case, I would suggest using a document scanner.
In the latter case, is it feasible to remove the binding so that you have loose pages?

You could then feed those thousands of pages to a document scanner with an automatic sheet feeder.

Are the pages printed on one side only or on both sides?

In the latter case, use a document scanner which scans both sides of a sheet in one go.
jcobban wrote:
08 Nov 2017, 01:46
plus a nominal index that, by its appearance, was created using punch cards which were sorted and fed to an IBM line printer. That is all of the text in the index is upper case and has the fuzzy appearance of old high-speed line-printer output created by physical type striking through a fabric ribbon.
That may create problems with the OCR, depending on the state of the ribbon at the time of printing...
jcobban wrote:
08 Nov 2017, 01:46
I do not understand why the author used that technology long after it was obsolete, rather than entering the data into a spreadsheet, since the index is structurally a spreadsheet.
First, my suggestion for OCR of this tabular data: tell the OCR program to keep line breaks!
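
If the OCR output keeps line breaks and the spacing between the two columns, the splitting and reordering of each index page does not have to be done by hand. The following is only a minimal Python sketch under that assumption: the function name `reorder_two_columns`, the `gutter` parameter (the character position where the right-hand column starts), and the sample lines are all made up for illustration, and real OCR output will probably need a tolerance around the gutter position.

```python
def reorder_two_columns(page_text: str, gutter: int) -> list[str]:
    """Return the left-column lines followed by the right-column lines.

    Assumes every text line of the page holds one left-column entry and,
    starting at character position `gutter`, one right-column entry.
    """
    left, right = [], []
    for line in page_text.splitlines():
        left_part = line[:gutter].rstrip()
        right_part = line[gutter:].strip()
        if left_part:
            left.append(left_part)
        if right_part:
            right.append(right_part)
    return left + right


# Made-up sample page: the right-hand column starts at character 22.
GUTTER = 22
sample_page = "\n".join([
    "ADAMS JOHN      12".ljust(GUTTER) + "BAKER MARY      40",
    "ALLEN ROSE      15".ljust(GUTTER) + "BROWN PETER     43",
])

for entry in reorder_two_columns(sample_page, GUTTER):
    print(entry)
```

With line breaks preserved, each printed entry corresponds to one index row in reading order (the whole left column first, then the right column), which is the order a spreadsheet import would need.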

While spreadsheet programs were available for PCs in the 1980s (VisiCalc on the Apple II since 1979, Lotus 1-2-3 since 1983), your printout almost certainly came from a mainframe database, printed on a mainframe line printer. The data in that mainframe file or database may well have been kept in upper case; most such printers only had type for upper case anyway.

You will have to pass the tabular data through some processing, e.g. a word processing program with text correction, which would change the all-UPPER-CASE text to proper case.
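
The case conversion could also be scripted instead of being done in a word processor. Here is a rough Python sketch, assuming the OCR result is plain text with one index entry per line; `to_proper_case` is a hypothetical helper name and the sample entries are invented. It capitalizes every run of letters and leaves digits, punctuation and spacing alone, so names such as "McDonald" would come out as "Mcdonald" and still need manual correction.

```python
import re


def to_proper_case(line: str) -> str:
    """Capitalize each run of letters; keep digits, punctuation and spacing."""
    return re.sub(r"[A-Za-z]+", lambda m: m.group(0).capitalize(), line)


# Invented sample entries in the all-upper-case style of a line printer.
entries = [
    "SMITH, JOHN     VOL 3  P 112",
    "O'BRIEN, MARY   VOL 1  P 7",
]

for entry in entries:
    print(to_proper_case(entry))
```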
