recognition of common misspellings after OCR

Post by lotadad52 » 24 Sep 2015, 12:59

I am seeing many documents that contain the same common misspellings after OCR of a scanned TIFF file.
For example:
actual word = why, recognition = whv
actual word = mylar, recognition = nylar
actual word = beauty, recognition = beautv
actual words = comes from, recognition = comesfrom
actual word = of, recognition = ol
actual word = fellow, recognition = leilow
Is there a recognition scheme that would realize, for example, that few words end in "v", and that, given the preceding letters, the character is probably a "y"?
I guess what I am asking is whether, before spell checking, the character recognition process can be weighted so that the next letter is expected to be one that continues a valid word.
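
To make the idea concrete, here is a minimal sketch of that kind of scheme, applied to the text after OCR rather than inside the recognition engine. It swaps characters that are commonly confused and keeps a swap only if it yields a valid word. The confusion table and the tiny word list are assumptions built from the examples above, not part of any OCR product; a real run would load a full dictionary.

```python
from itertools import product

# Hypothetical confusion table: OCR output character -> what it may really be.
CONFUSIONS = {
    "v": ["y"],
    "n": ["m"],
    "l": ["f"],
    "i": ["l"],
}

# Tiny stand-in word list (assumption); replace with a real dictionary.
WORDS = {"why", "mylar", "beauty", "of", "fellow", "comes", "from"}


def candidates(token):
    """Yield every spelling obtained by swapping confusable characters."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct(token, words=WORDS):
    """Return the token unchanged if it is a valid word.

    Otherwise return the single dictionary word reachable through the
    confusion table. If there is no match, or more than one, leave the
    token alone so a person can review it.
    """
    if token in words:
        return token
    matches = {c for c in candidates(token) if c in words}
    return matches.pop() if len(matches) == 1 else token


for ocr_word in ["whv", "nylar", "beautv", "ol", "leilow"]:
    print(ocr_word, "->", correct(ocr_word))
```

The sketch only handles lowercase single words, and the number of candidates grows quickly for long words with many confusable letters, so a practical version would probably cap the number of substitutions per word.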

Re: recognition of common misspellings after OCR

Post by cday » 24 Sep 2015, 14:16

Which OCR program are you using?

The commercial software ABBYY FineReader lets you select the language (or languages) used for recognition, and Nuance OmniPage likely does too, so in principle that should help. OCR is a demanding application, though, and even the best programs still have limitations. The quality of the original can also, of course, affect the recognition accuracy.

Re: recognition of common misspellings after OCR

Post by BruceG » 28 Sep 2015, 03:12

I use OmniPage. I would say OmniPage does not use the language setting to change words, but to suggest words when the recognized one is not found in its dictionary. The errors y/v, m/n, f/l and missing spaces are all common. Searching on "v" would help fix the y/v errors. n for m is more difficult, as n is far more common. I processed some typed and duplicated newsletters from 1937 to 1960, and I was very pleased whenever a new typewriter or a new duplicator came into use, because over time certain letters would wear. Letter spacing was also different from books, so instead of a space going missing, a space was inserted within a word.
Makes life interesting.
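
For the two spacing problems mentioned above (a missing space, and a space inserted within a word), a dictionary check can be sketched as follows. This is only an illustration under stated assumptions: the word list is a tiny stand-in, and the function names `split_merged` and `join_split` are made up for this example, not taken from OmniPage or any other product.

```python
# Tiny stand-in word list (assumption); replace with a real dictionary.
WORDS = {"comes", "from", "fellow", "of", "why"}


def split_merged(token, words=WORDS):
    """Return 'left right' if token is two dictionary words run together."""
    if token in words:
        return token
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in words and right in words:
            return left + " " + right
    return token  # no valid split found, leave for manual review


def join_split(tokens, words=WORDS):
    """Merge neighbouring tokens when at least one of them is not a word
    on its own but the two together form a dictionary word."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            either_invalid = tokens[i] not in words or tokens[i + 1] not in words
            if either_invalid and joined in words:
                out.append(joined)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out


print(split_merged("comesfrom"))
print(join_split(["fel", "low", "of"]))
```

Both functions leave anything they cannot resolve untouched, which seems the safer choice for old typed material where worn letters can also produce non-words.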

