ocr+google books

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

sixtysix
Posts: 34
Joined: 23 Jun 2009, 13:07

ocr+google books

Post by sixtysix »

As I understand it, OCR software reads the scanned image and attempts to correct the resulting text. All Google Books, even 19th-century ones, appear to have been OCR'd, even though in some instances 19th-century spelling differs from today's. In the early 19th century f was used instead of s, in some instances, leading to interesting reading for today's reader, the word 'such' being the most obvious example of unintended confusion.
It is possible to search 19th-century Google Books for a word spelt in that manner. Does this mean that a human is actually correcting the spelling, or is the Google software simply accepting each word as it finds it rather than attempting to correct it?
I use Nuance software, which forces me to correct or accept each word it finds fault with.
Anonymous2
Posts: 97
Joined: 18 Oct 2011, 16:05

Re: ocr+google books

Post by Anonymous2 »

Tesseract, Google's OCR engine (it's open source and free to use), can be trained to recognize virtually any language.

The training isn't something that I have experience with, but it is definitely doable. Here are a few things you can look at:
sixtysix
Posts: 34
Joined: 23 Jun 2009, 13:07

Re: ocr+google books

Post by sixtysix »

Thanks for the reply, much appreciated.
Heelgrasper
Posts: 70
Joined: 19 Feb 2012, 21:04
E-book readers owned: None
Number of books owned: 500
Location: Randers, Denmark

Re: ocr+google books

Post by Heelgrasper »

sixtysix wrote: In the early 19th century f was used instead of s, in some instances, leading to interesting reading for today's reader, the word 'such' being the most obvious example of unintended confusion.
Before the hack led to some posts being deleted, I had a post here, even though it's an old thread. I'll just repost a shorter version:

There seems to be some misunderstanding here. In fraktur typefaces a so-called "long s" was often used (Wikipedia has an article on it). In fraktur typefaces it looks very similar to an "f" (as does the "k"), but it's not the same letter. However, you need good scans if you want OCR to be able to tell the difference.
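When the scan is not good enough and the OCR has read the long s as "f", one possible cleanup is to try swapping each "f" for "s" and see whether a known word comes out. The following is only a rough sketch of that idea, not how Google or any particular OCR package does it. The tiny `VOCAB` set and the function names are made up for illustration, and it only handles lowercase words.

```python
from itertools import product

# Tiny illustrative vocabulary (hypothetical); a real run would load a full word list.
VOCAB = {"such", "first", "himself", "fish", "of", "self", "for"}

def long_s_candidates(word):
    """Yield every variant of `word` with any subset of its 'f's replaced by 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)

def fix_misread_long_s(word, vocab=VOCAB):
    """Return a vocabulary word reachable by swapping f -> s, else the word unchanged."""
    if word in vocab:
        return word  # already a known word, e.g. "fish" stays "fish"
    for candidate in long_s_candidates(word):
        if candidate in vocab:
            return candidate
    return word  # nothing matched; leave the OCR output alone

print(fix_misread_long_s("fuch"))
print(fix_misread_long_s("himfelf"))
```

Words that are valid with either letter (for example "fell" and "sell") cannot be resolved this way, so a check like this can only be a partial fix.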
---
Jakob Øhlenschlæger
Randers, Denmark

The past is a foreign country: they do things differently there
L. P. Hartley
Heelgrasper
Posts: 70
Joined: 19 Feb 2012, 21:04
E-book readers owned: None
Number of books owned: 500
Location: Randers, Denmark

Re: ocr+google books

Post by Heelgrasper »

Before someone corrects me, I'd better do it myself: the long s was not only used in fraktur typefaces. It was also used in Latin typefaces, where it went out of use around 1800. Latin typefaces were so rarely used here before 1800 (only for Latin texts, and I don't read much Latin) that I didn't know that before actually reading the Wikipedia article I mentioned. I also learned that it can actually be represented in Unicode, so OCR output could store it as something different from "s".

It looks like this: ſ. Slightly larger: ſ
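To illustrate the Unicode point: the long s has its own code point, U+017F, so text can keep it distinct from "s" and still be searched with modern spelling if it is folded at search time. This is a minimal sketch in Python using only the standard library. The helper names `fold_long_s` and `matches` are my own, and whether Google Books does anything like this is an assumption on my part.

```python
import unicodedata

LONG_S = "\u017f"  # the character ſ

# The standard library knows the official Unicode name of the character.
print(unicodedata.name(LONG_S))  # LATIN SMALL LETTER LONG S

def fold_long_s(text):
    """Replace every long s with an ordinary 's' (hypothetical helper)."""
    return text.replace(LONG_S, "s")

def matches(query, text):
    """Case-insensitive substring search that treats ſ and s as the same letter."""
    return fold_long_s(query).lower() in fold_long_s(text).lower()

page = "There is no \u017fuch thing."
print(matches("such", page))      # modern spelling finds the old form
print(matches("\u017fuch", page)) # so does the original spelling
```

Unicode compatibility normalization (`unicodedata.normalize("NFKC", text)`) also maps ſ to "s", but it changes other characters as well, so a plain replace is the narrower choice here.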