
OCR for 78 RPM record labels

Posted: 21 Jan 2019, 15:44
by iam2sam
Hello. I built an Archivist Quill in 2017 and used it very successfully to scan and index (Acrobat) a high school yearbook to distribute for a 50 year reunion. I now want to scan and capture text from the labels for a collection of several hundred 78 RPM records that I inherited a while back. The main goal of the project is to digitize the audio. However, archive.org has quite a large number of 78 RPM recordings uploaded. If there is already a high-quality identical recording available for download, it would be a waste of time to reinvent that wheel. So I want to capture and format text from the labels so that I can quickly search archive.org for an identical recording (this will also be very useful for cataloging.) Fortunately, nearly all of the information of interest is linear across the label, not around the circumference, so I don't need to deal with straightening it. Most of the labels are dark colors with lighter text (most often, white or gold.) I want to do 2-up scanning to conserve time. I suspect that it will be a good idea to provide registration marks on the scanner bed (masking tape perhaps) so that every record label is in the same position in the image. I'm hopeful that may allow me to automate cropping the image to eliminate the "non-label" portions. Has anyone done anything similar to this, and, if so, are there any recommendations on what to do or avoid? I'd also be open to purchase of processing software (reasonably priced, this is a hobby endeavor;) if it would make the job faster/easier. Thoughts? Thanks in advance.

SAM

Re: OCR for 78 RPM record labels

Posted: 21 Jan 2019, 18:11
by cday
Continuing from your previous post, "scanning & processing 78 RPM record labels":

I suggested that the registration problem might be easily overcome by [non-destructively for later book scanning] temporarily inserting some kind of round pin into each side of the book rest. You would have to find something round of about the correct diameter, ideally a ground dowel pin, or more easily and probably adequate for the immediate need, a machine screw that fits in the center hole of the records. You could probably [without being familiar with the detailed design of the Archivist] insert the pins from the back, if necessary mounted in small timber blocks tacked to the back of the book rests. Alternatively, you could probably temporarily tack some small timber strips onto the front of the book rests to provide registration.

With regard to image processing, given accurate registration it should be straightforward to crop the camera images to a constant size, and perform any image enhancement needed using any of the freeware image utilities and batch processing.
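
As a rough illustration of the constant-size crop described above, here is a minimal Python sketch. It assumes the records are registered on a pin, so the label center lands at the same pixel position in every image. The function names, the pixel values in the usage note, and the choice of the third-party Pillow library for the batch step are all assumptions for illustration; any freeware batch utility could do the same job.

```python
from pathlib import Path


def label_crop_box(center_x, center_y, label_diameter_px, margin_px=0):
    """Return (left, upper, right, lower) for a square crop centred on the spindle."""
    half = label_diameter_px // 2 + margin_px
    return (center_x - half, center_y - half, center_x + half, center_y + half)


def clamp_box(box, image_width, image_height):
    """Keep the crop box inside the image so a slightly off-centre pin cannot break the crop."""
    left, upper, right, lower = box
    return (max(0, left), max(0, upper), min(image_width, right), min(image_height, lower))


def crop_folder(src_dir, dst_dir, box):
    """Crop every JPEG in src_dir to the same box (needs the third-party Pillow package)."""
    from PIL import Image  # imported here so the helpers above work without Pillow

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.jpg")):
        with Image.open(path) as img:
            # img.size is (width, height); the same box is reused for every file
            img.crop(clamp_box(box, *img.size)).save(dst / path.name)
```

For example, `crop_folder("raw", "labels", label_crop_box(2000, 1500, 900, margin_px=50))` would write cropped copies into a `labels` folder. The center and diameter would have to be measured once from a test shot.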

With regard to OCR software, the Abbyy Screenshot Reader should as I suggested be sufficient for your limited needs, and is very inexpensive.

Re: OCR for 78 RPM record labels

Posted: 22 Jan 2019, 17:59
by cday
Some further thoughts a day on...

You don't give any indication of the facilities you have available, but a better solution to the registration issue might be to place a piece of some sheet material such as plywood on top of each book rest.

Those pieces could be fitted with either record center registration pins, or location fittings of some kind at the record edges, possibly even two small nails. And if those sheets were sized smaller than the record diameter in at least one dimension, placing the records and later removing them would be facilitated as they could be easily lifted on and off by gripping the edges, and also easily rotated at the edges to align the text to be scanned.

I'm not familiar in any detail with the Archivist or similar scanners, but if anything protrudes above the record top surface, presumably you will need to limit the downward movement of the platen to avoid possible damage to the perspex. Alternatively, if it is practical, you could temporarily remove the perspex, which won't be needed to flatten the pages of a book.

With regard to OCR, I see that you already have Adobe Acrobat which supports OCR, but the advantage of the Abbyy screen reader above is that you would simply have to use the mouse to make a selection around the relevant text on the record label, and then paste the recognised text into wherever you want it. I did a quick test earlier out of interest of reading light text on a dark background, and it worked fine.

Re: OCR for 78 RPM record labels

Posted: 22 Jan 2019, 18:01
by iam2sam
Thanks for the reply. I thought of using a spindle; the catch there is that the Archivist has a "V" shaped tray that gets elevated to the underside of a likewise "V" shaped glass platen (2 camera system,) so I would need to use something of a height and material that would not damage the glass. That said, it should be doable. The most challenging registration issue might prove to be with the rotation of the record around that spindle. I appreciate the software suggestion - I will look into it.

Re: OCR for 78 RPM record labels

Posted: 22 Jan 2019, 18:04
by cday
iam2sam wrote:
22 Jan 2019, 18:01
Thanks for the reply. I thought of using a spindle; the catch there is that the Archivist has a "V" shaped tray that gets elevated to the underside of a likewise "V" shaped glass platen (2 camera system,) so I would need to use something of a height and material that would not damage the glass. That said, it should be doable. The most challenging registration issue might prove to be with the rotation of the record around that spindle. I appreciate the software suggestion - I will look into it.

We crossed, please take a look at my new post above which may be relevant!

Re: OCR for 78 RPM record labels

Posted: 22 Jan 2019, 18:09
by iam2sam
LOL Looks like replies crossed... I should be able to come up with a stubby spindle that is the same height as the records are thick. I'm considering some sort of line-generating laser to project a "horizontal" line on the label(s) for alignment purposes. I'm assuming the text recognition would be slower and possibly less accurate if the line of text was at an angle. Acrobat worked for me on the previous project, but it was somewhat of a PITA. Abbyy sounds much better. I could paste directly into a spreadsheet, which would be ideal. Thanks again.
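
For the spreadsheet and the archive.org lookup mentioned at the start of the thread, a small script could turn the pasted OCR text into catalog rows with a ready-made search link. The sketch below is only one possible approach. It assumes archive.org's advanced search endpoint (`advancedsearch.php`); the `title` and `creator` query fields and the `78rpm` collection identifier are assumptions worth checking against the site, and the column names and helper functions are made up for illustration.

```python
import csv
import io
from urllib.parse import urlencode


def clean_ocr_text(text):
    """Collapse the line breaks and runs of spaces that OCR tends to leave behind."""
    return " ".join(text.split())


def archive_search_url(title, artist, collection="78rpm"):
    """Build an archive.org advanced-search URL for one record side.

    The field names and the default collection id are assumptions, not verified here.
    """
    query = f'title:"{clean_ocr_text(title)}" AND creator:"{clean_ocr_text(artist)}"'
    if collection:
        query += f" AND collection:{collection}"
    params = [("q", query), ("fl[]", "identifier"), ("fl[]", "title"),
              ("rows", "25"), ("output", "json")]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def catalog_csv(records):
    """Return CSV text with one row per record side, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["label", "catalog_no", "title", "artist", "search_url"])
    for rec in records:
        writer.writerow([
            rec["label"], rec["catalog_no"],
            clean_ocr_text(rec["title"]), clean_ocr_text(rec["artist"]),
            archive_search_url(rec["title"], rec["artist"]),
        ])
    return buffer.getvalue()
```

Each record is a plain dictionary with `label`, `catalog_no`, `title` and `artist` keys; saving the returned text to a `.csv` file gives a sheet where every row carries its own search link.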

Re: OCR for 78 RPM record labels

Posted: 23 Jan 2019, 11:08
by cday
Small detail: your laser level marker idea is neat, but I doubt if it is needed, as in another quick test text rotated by 2° was read successfully by the Abbyy screen reader, and text can be eyeballed horizontal to a much closer tolerance than that.

Re: OCR for 78 RPM record labels

Posted: 24 Jan 2019, 13:06
by dpc
If I were going to do this I'd get a 12"x12" piece of 1/4" MDF, put a 1/4" dowel coming out the center of it and stick that to the wall with double-stick tape. I'd put a camera on a tripod about 6' back, pointed at that dowel, and have two clip-on lights to the right and left of the tripod to light the MDF. With a remote trigger on your camera you should be able to photograph 7000 records per hour, assuming you're shooting both sides.

I would be surprised if one of the dedicated OCR packages couldn't process text that is slightly rotated. I'm not sure that's worth worrying about, but a few test images should be able to tell you what will and what won't work.

Re: OCR for 78 RPM record labels

Posted: 24 Jan 2019, 14:25
by cday
dpc wrote:
24 Jan 2019, 13:06
If I were going to do this I'd get a 12"x12" piece of 1/4" MDF, put a 1/4" dowel coming out the center of it and stick that to the wall with double-stick tape. I'd put a camera on a tripod about 6' back, pointed at that dowel, and have two clip-on lights to the right and left of the tripod to light the MDF...

Simpler than making the copy stand I suggested in the original thread, and likely simpler than making minor adjustments to the Archivist. But wouldn't making the MDF or other board slightly narrower than a record assist in placing and removing each record on the spindle, and also assist in rotating the text horizontal, as the record edges could be gripped easily? And also possibly slightly thicker?

Photographing records one at a time would also ensure exact consistency of the record position in the images created, which might simplify post-processing, probably with little if any increase in the overall time required.

dpc wrote:
24 Jan 2019, 13:06
I would be surprised if one of the dedicated OCR packages couldn't process text that is slightly rotated. I'm not sure that's worth worrying about, but a few test images should be able to tell you what will and what won't work.

I would expect accuracy to be similar to the full Abbyy FineReader program although that might not be the case; the screen reader utility is in fact included with the full program. While 2° doesn't sound very much, it is very obviously not horizontal, so rotating a record so as to be read accurately should be easy enough. In my very quick look 3° rotation was also read correctly, at 5° unwanted line breaks were introduced, and somewhere beyond that a character was misread.
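
One rough way to picture the rotation results above: the far end of a rotated line of text drifts up or down by about the line width times the tangent of the angle. Once that drift approaches the line spacing, it seems plausible that an OCR engine could start splitting one line into two. This is only a guess at the cause, not something verified for Abbyy. The sketch below just does the arithmetic; the 70 mm line width is an assumed figure for illustration.

```python
import math


def skew_drift(line_width, skew_degrees):
    """Vertical drift at the far end of a text line rotated by skew_degrees.

    The result is in the same unit as line_width (mm, pixels, ...).
    """
    return line_width * math.tan(math.radians(skew_degrees))


if __name__ == "__main__":
    assumed_line_width_mm = 70  # hypothetical width of a line of label text
    for angle in (2, 3, 5):
        drift = skew_drift(assumed_line_width_mm, angle)
        print(f"{angle} degrees -> {drift:.1f} mm drift across the line")
```

Comparing the printed drift with the actual line spacing on a label gives a feel for how much rotation is likely to be tolerable before testing it for real.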

Re: OCR for 78 RPM record labels

Posted: 03 Mar 2019, 11:05
by L.Willms
dpc wrote:
24 Jan 2019, 13:06
If I were going to do this I'd get a 12"x12" piece of 1/4" MDF, put a 1/4" dowel coming out the center of it and stick that to the wall with double-stick tape. I'd put a camera on a tripod about 6' back, pointed at that dowel, and have two clip-on lights to the right and left of the tripod to light the MDF. With a remote trigger on your camera you should be able to photograph 7000 records per hour, assuming you're shooting both sides.

Hanging the disk by its center hole on a vertical wall risks the disk surface not being vertical; it might dangle slightly skewed from the vertical.

I think it is better to get a stand for the disks, where the disk rests on an inclined surface so that each disk surface has the same angle to the true vertical and true horizontal. The disk might be positioned by two small bars placed slightly slanted in a wide V at the lower end of the flat rest for the disk. It is easier and faster to place a disk this way than to search for the dowel with the disk's center hole. And since the 78 rpm records are all of the same size, as far as I know them (just the same as a 30 cm LP), one can be sure that the label is always in the same place.

The camera may be placed on a tripod; to make sure that it is perpendicular to the plane of the disk, put a mirror on the record stand's surface and check that the camera's reflection is in the center of the image one sees in the camera's viewfinder.
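
If the mirror check is done on a test photo rather than by eye, the remaining tilt can be estimated from how far the lens reflection sits from the image center. The sketch below assumes a simple pinhole camera model, in which the direction from the camera to its own mirror image runs along the mirror's normal, so the angular offset of the reflection equals the tilt. The function names and the example lens and sensor figures are assumptions for illustration.

```python
import math


def focal_length_px(focal_length_mm, sensor_width_mm, image_width_px):
    """Convert a lens focal length to pixels for a given sensor and image width."""
    return focal_length_mm / sensor_width_mm * image_width_px


def tilt_degrees(offset_px, focal_px):
    """Angle between the camera axis and the mirror normal (pinhole model).

    offset_px is the distance of the lens reflection from the image centre.
    """
    return math.degrees(math.atan2(offset_px, focal_px))


if __name__ == "__main__":
    # Hypothetical setup: 50 mm lens, 36 mm wide sensor, 6000 px wide image
    f_px = focal_length_px(50, 36, 6000)
    print(f"reflection 100 px off centre -> about {tilt_degrees(100, f_px):.2f} degrees of tilt")
```

A result near zero means the camera axis is close to perpendicular to the stand; otherwise the tripod head can be nudged and the test shot repeated.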

Another possibility is to use the existing book scanner, but with the glass platen removed, since there are no book pages that need to be flattened and, on the other hand, the glass platen can damage the grooves of the record. I think it is also worth thinking about putting a soft cover on the record stand in order to protect the grooves of the back side.

Make the board on which the record (disk) rests for being photographed two centimeters or so narrower than the diameter of the records, so that you can place the disk more easily, holding the record with two hands on the outer edges and avoiding touching the surface. This also allows you to take the disk again at the outer edges with two hands and flip it over for the B side.

Have a batch of records to the left of the imaging stand, and have a second person standing to the right of the imaging stand to take the record from the board and store it again safely.

Those are my initial thoughts so far.