I was wondering what you would recommend for running an OCR scan on some really old books, let's say French editions from the 17th century that are available for download on gallica.bnf.fr. These books have fonts that an ordinary OCR program wouldn't recognize, I guess.
Maybe such things are only available for research groups or national libraries?
OCR for outdated fonts
Moderator: peterZ
Re: OCR for outdated fonts
knappen wrote: These books have fonts that an ordinary OCR program wouldn't recognize, I guess.
Book fonts haven't changed much in the past few centuries; most of the ones in use today are direct descendants of much older designs. I grabbed a few sample pages from Gallica from the 1600s, and Abbyy Finereader Pro 9 gave quite decent results despite the low resolution. Most of the errors were in the punctuation (italic "!" becoming "/", etc).
-j
Re: OCR for outdated fonts
You can train the Tesseract OCR engine to recognize new or, in your case, old fonts. Here are some pertinent links:
http://code.google.com/p/tesseract-ocr/ ... Tesseract3
http://vietocr.sourceforge.net/training.html
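Once you have a trained data file, running Tesseract over the page images can be scripted. Here is a rough sketch in Python, assuming the tesseract command-line program is installed and on your PATH. The function names and the file names are my own and only for illustration; the only Tesseract feature used is the basic "tesseract image outputbase -l lang" invocation.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="fra"):
    """Return the argument list for a basic Tesseract run.

    Tesseract writes the recognized text to output_base + ".txt".
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_page(image_path, output_base, lang="fra"):
    """Run Tesseract on one page image and return the path of the text file."""
    # Fail early with a clear message if the executable is missing.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"
```

For example, ocr_page("page_001.tif", "page_001", lang="fra17") would use a custom-trained language called "fra17". That name is hypothetical; it stands for whatever you call your own trained data file (fra17.traineddata in the tessdata directory). With the stock French data you would just leave lang as "fra".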