recognition of common misspellings after OCR

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

lotadad52
Posts: 1
Joined: 31 Aug 2015, 21:07
E-book readers owned: kindle dx, calibre
Number of books owned: 300
Country: USA

recognition of common misspellings after OCR

Post by lotadad52 »

I am seeing lots of documents with common misspellings after OCR of a scanned TIFF file.
For example:
actual word = why, recognition = whv
actual word = mylar, recognition = nylar
actual word = beauty, recognition = beautv
actual words = comes from, recognition = comesfrom
actual word = of, recognition = ol
actual word = fellow, recognition = leilow
Is there a recognition scheme that would realize, for example, that few words end in "v", so that given the preceding letters the character is probably a "y"?
I guess what I am asking is: before spell checking, can the character recognition process be weighted so that each next letter is one that could continue a valid word?
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: recognition of common misspellings after OCR

Post by cday »

Which OCR program are you using?

The commercial software ABBYY FineReader allows the language (or languages) used to be selected, and Nuance OmniPage likely does too, so in principle that should help. OCR is a demanding application, though, and even the best programs still have some limitations. The quality of the original can, of course, also affect the recognition accuracy.
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: recognition of common misspellings after OCR

Post by BruceG »

I use OmniPage. I would say OmniPage does not use the language setting to change words, but to suggest words when the recognized one is not found in its dictionary. The errors y/v, m/n, f/l and missing spaces would be common. Searching on v would help fix the y/v errors. n for m would be more difficult, as n is more common.

I did some typed and duplicated newsletters from 1937-1960, and I was very pleased whenever a new typewriter or a new duplicator came into use, because over time certain letters would wear. Letter spacing was also different from books, so instead of a space going missing, a space was inserted within a word.
Makes life interesting.
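
For anyone who wants to automate that kind of search after OCR, here is a minimal sketch in Python. It is not how OmniPage or FineReader work internally. The confusion pairs come from the examples in this thread, and the names CONFUSIONS, WORDS, candidates, split_merged and correct are made up for illustration. The word list is a tiny stand-in; a real run would load a full dictionary.

```python
# Hypothetical confusion pairs taken from the errors listed in this thread:
# character seen in the OCR output -> characters it may have been misread from.
CONFUSIONS = {"v": ["y"], "n": ["m"], "l": ["f"]}

# Tiny stand-in word list (an assumption); load a real dictionary in practice.
WORDS = {"why", "mylar", "beauty", "comes", "from", "of", "fellow"}


def candidates(word):
    """Yield variants of `word` with exactly one confusable character swapped."""
    for i, ch in enumerate(word):
        for repl in CONFUSIONS.get(ch, []):
            yield word[:i] + repl + word[i + 1:]


def split_merged(word):
    """Return 'left right' if the token splits into two known words, else None."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in WORDS and right in WORDS:
            return left + " " + right
    return None


def correct(token):
    """Correct one lowercase token, leaving it unchanged when nothing fits."""
    if token in WORDS:
        return token
    fixes = sorted({c for c in candidates(token) if c in WORDS})
    if len(fixes) == 1:  # only accept an unambiguous single-character fix
        return fixes[0]
    return split_merged(token) or token


for seen in ["whv", "nylar", "beautv", "ol", "comesfrom", "leilow"]:
    print(seen, "->", correct(seen))
```

With this word list the first five examples come out right, but "leilow" is left alone because it has two misread characters and the sketch only tries one swap at a time. It also only runs on words the dictionary rejects, so a misreading that happens to produce another real word would slip through.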