
recognition of common misspellings after OCR

lotadad52
Posts: 1
Joined: 31 Aug 2015, 21:07
E-book readers owned: kindle dx, calibre
Number of books owned: 300
Country: USA


Post by lotadad52 » 24 Sep 2015, 12:59

I am seeing lots of documents where the same misrecognitions show up after OCR on a scanned TIFF file.
For example:
actual word = why, recognition = whv
actual word = mylar, recognition = nylar
actual word = beauty, recognition = beautv
actual words = comes from, recognition = comesfrom
actual word = of, recognition = ol
actual word = fellow, recognition = leilow
Is there a recognition scheme that will realize, for example, that few words end in "v", and that given the preceding letters the character is probably a "y"?
I guess what I am asking is: before spell checking, can the character recognition process be weighted to favor letters that would form a valid word?
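
The idea in the question can be approximated as a pass over the OCR output, run before ordinary spell checking. Below is a minimal sketch, not how any particular OCR engine weights characters internally. The word list WORDS and the confusion table CONFUSIONS are hypothetical stand-ins built only from the examples above; a real pass would load a full dictionary and a table tuned to the scans in question. A token is changed only when exactly one dictionary word can be reached by swapping confusable characters.

```python
from itertools import product

# Hypothetical stand-in for a real dictionary (lowercase words only).
WORDS = {"why", "mylar", "beauty", "of", "fellow", "comes", "from"}

# Assumed confusion table: each OCR output character maps to the
# characters it may really have been (itself included).
CONFUSIONS = {"v": "vy", "n": "nm", "l": "lf", "i": "il"}


def correct(token):
    """Return the single dictionary word reachable from token by
    swapping confusable characters, or token unchanged if there is
    no such word or more than one."""
    if token in WORDS:
        return token
    # One string of alternatives per character position.
    options = [CONFUSIONS.get(ch, ch) for ch in token]
    # Try every combination; this grows quickly for long tokens with
    # many confusable characters, so a real pass would cap the length.
    candidates = {"".join(chars) for chars in product(*options)}
    matches = sorted(candidates & WORDS)
    return matches[0] if len(matches) == 1 else token


for seen in ["whv", "nylar", "beautv", "ol", "leilow", "comesfrom"]:
    print(seen, "->", correct(seen))
```

With this table the first five examples are mapped back to their dictionary words, while "comesfrom" is left alone, because a missing space is a different kind of error from a swapped character.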

cday
Posts: 226
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: recognition of common misspellings after OCR

Post by cday » 24 Sep 2015, 14:16

Which OCR program are you using?

The commercial software ABBYY FineReader lets you select the recognition language (or languages), and Nuance OmniPage likely does too, so in principle that should help. OCR is a demanding application, though, and even the best programs still have some limitations. The quality of the original can also, of course, affect the recognition accuracy.

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: recognition of common misspellings after OCR

Post by BruceG » 28 Sep 2015, 03:12

I use OmniPage. I would say OmniPage does not use the language setting to change words; it uses it to suggest words when the recognized one is not found in its dictionary. The errors y/v, m/n, f/l and the missing space are common ones. Searching on v would help fix the y/v errors. n for m would be more difficult, as n is more common.

I did some typed and duplicated newsletters from 1937-1960. Over time certain letters would wear, so I was very pleased whenever a new typewriter or a new duplicator came into use. Letter spacing was also different from books, so instead of a space going missing, a space was inserted within a word.
Makes life interesting.
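
The two spacing errors mentioned in this thread (a dropped space, as in "comesfrom", and a space inserted inside a word) can both be checked against a word list. This is only a rough sketch under assumptions: WORDS is a tiny hypothetical dictionary, and both function names are made up for illustration. With a full dictionary many fragments are themselves valid words, so the results would need review by hand.

```python
# Hypothetical stand-in for a real dictionary (lowercase words only).
WORDS = {"comes", "from", "fellow", "beauty", "of"}


def split_merged(token):
    """Insert a space if token splits into exactly one pair of known words."""
    if token in WORDS:
        return token
    splits = [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in WORDS and token[i:] in WORDS
    ]
    return " ".join(splits[0]) if len(splits) == 1 else token


def join_split(tokens):
    """Rejoin two adjacent tokens when at least one of them is not a
    known word but their concatenation is."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            if (left not in WORDS or right not in WORDS) and left + right in WORDS:
                out.append(left + right)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out


print(split_merged("comesfrom"))
print(join_split(["beau", "ty", "of", "fel", "low"]))
```

Splitting only on a unique match and joining only when a fragment is not itself a word keeps the pass conservative, at the cost of leaving ambiguous cases for manual correction.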
