crowdsourced transcription instead of OCR

mhhelle
Posts: 7
Joined: 01 Jan 2014, 12:25
E-book readers owned: kindle
Number of books owned: 200
Country: USA

Post by mhhelle » 21 Aug 2014, 15:41

Saw an article in the New York Times pointing to the Smithsonian Transcription Center last week:

https://transcription.si.edu/

I got an account and transcribed a few pages myself. It is a really slick system. I was also surprised at how difficult it is to read these old longhand pages, now that I almost never write in cursive and read it even less.

I know there are lots of old books out there that others (including me) would like to scan and convert to text, and this gives a nice interface and workflow for doing so. I'm going to send them an email and see if they can share any details, or if they have thought about opening up the source code. It looks to me like it is built on Drupal. I think it would be really awesome if there were a generic site where you could contribute books to be transcribed by a system like this and get results back from volunteers. I've done a little bit with Amazon Mechanical Turk, transcribing certain information from scanned pages, but that gets relatively expensive depending on how much content there is and what type it is.

rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: crowdsourced transcription instead of OCR

Post by rob » 21 Aug 2014, 23:18

The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.

Re: crowdsourced transcription instead of OCR

Post by mhhelle » 22 Aug 2014, 09:19

Oh, duh, I guess I could have done a Google search before posting. Thanks for letting me know! I looked at their website and FAQ, but one thing was unclear to me: can you contribute a handwritten text? The Smithsonian site is primarily handwritten texts.

daniel_reetz
Posts: 2776
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: crowdsourced transcription instead of OCR

Post by daniel_reetz » 22 Aug 2014, 23:14

I think Wikipedia has a similar tool. I have seen a few such sites, but few as nice as the SI effort. Thanks for bringing it here!

Re: crowdsourced transcription instead of OCR

Post by rob » 24 Aug 2014, 23:37

Ah, actually I think PGDP (Distributed Proofreaders) limits itself to printed texts.

scann
Posts: 77
Joined: 31 Jul 2011, 01:23
Number of books owned: 0
Country: Argentina

Re: crowdsourced transcription instead of OCR

Post by scann » 04 Sep 2014, 16:03

Here you can find a really comprehensive list of projects that are using crowdsourced transcription:
http://melissaterras.blogspot.com.ar/20 ... erial.html

But I don't think it is a question of "transcription instead of OCR". Normally all these projects use OCR as the first layer and then use crowdsourced transcription to (1) correct the text and then (2) validate the final text.
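
To make that OCR-then-correct-then-validate flow concrete, here is a rough Python sketch. Everything in it is hypothetical: the `Page` class, the `submit` and `validated_text` names, and the rule that a page counts as validated once two volunteers submit the same text are my own assumptions for illustration, not how any of the listed projects actually work.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Page:
    """One scanned page moving through a hypothetical crowdsourcing workflow."""
    ocr_text: str  # first layer: raw OCR output shown to volunteers
    corrections: list = field(default_factory=list)  # volunteer submissions

    def submit(self, text: str) -> None:
        # Step 1: a volunteer corrects the OCR text and submits the result.
        self.corrections.append(text.strip())

    def validated_text(self, min_agree: int = 2):
        # Step 2 (assumed rule): accept the most common submission only if
        # at least `min_agree` volunteers submitted exactly that text.
        if not self.corrections:
            return None
        text, votes = Counter(self.corrections).most_common(1)[0]
        return text if votes >= min_agree else None


page = Page(ocr_text="Tbe quick brovvn fox")
page.submit("The quick brown fox")
page.submit("The quick brown fox")
page.submit("The quick brown f0x")
result = page.validated_text()
```

A real system would probably compare submissions more loosely (whitespace, punctuation) or have a reviewer approve the final text rather than rely on exact-match voting.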
