OCR for outdated fonts


Moderator: peterZ

knappen
Posts: 35
Joined: 29 Jul 2010, 20:21

OCR for outdated fonts

Post by knappen »

I was wondering what you would recommend for running an OCR scan on some really old books, let's say French editions from the 17th century that are available for download on gallica.bnf.fr. These books have fonts that an ordinary OCR program wouldn't recognize, I guess.
Maybe such things are only available for research groups or national libraries?
jgreely

Re: OCR for outdated fonts

Post by jgreely »

knappen wrote: These books have fonts that an ordinary OCR program wouldn't recognize, I guess.
Book fonts haven't changed much in the past few centuries; most of the ones in use today are direct descendants of much older designs. I grabbed a few sample pages from the 1600s from Gallica, and ABBYY FineReader Pro 9 gave quite decent results despite the low resolution. Most of the errors were in the punctuation (an italic "!" becoming "/", etc.).
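
A punctuation confusion like the "!" read as "/" mentioned above could probably be cleaned up after the OCR pass. The following is only a rough sketch of that idea, not something FineReader provides: the function name and the rule itself (a "/" right after a letter, optionally separated by one space, and followed by whitespace or the end of the text) are assumptions made for illustration, and the rule will misfire wherever a word-final slash is genuine.

```python
import re

# Heuristic (an assumption, not a FineReader feature): a "/" that follows a
# letter, optionally after one space, and is followed by whitespace or the
# end of the text is treated as a misread "!".
_MISREAD_BANG = re.compile(r"(?<=[^\W\d_])( ?)/(?=\s|$)")


def fix_misread_exclamations(text):
    """Replace word-final "/" with "!", keeping any space that preceded it.

    Slashes between two characters ("et/ou", "1/2") are left untouched
    because they are not followed by whitespace.
    """
    return _MISREAD_BANG.sub(r"\1!", text)


if __name__ == "__main__":
    print(fix_misread_exclamations("Hélas / quel malheur"))
```

Running a rule like this over the whole OCR output and spot-checking the changed lines against the page images would show quickly whether it is too aggressive for a given book.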

-j
quân

Re: OCR for outdated fonts

Post by quân »

You can train the Tesseract OCR engine to recognize new or, in your case, old fonts. Here are some pertinent links:

http://code.google.com/p/tesseract-ocr/ ... Tesseract3
http://vietocr.sourceforge.net/training.html
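
Once a font has been trained, using the result might look roughly like the sketch below. It assumes the `tesseract` executable is on the PATH and that the trained data has been installed under a language code; `fra_old` and the file names are hypothetical placeholders, and the helper functions are illustrative rather than anything taken from the linked guides. The only Tesseract feature relied on is the basic `tesseract imagename outputbase -l lang` invocation, which writes the text to `outputbase.txt`.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang):
    """Run Tesseract on one page image and return the recognized text.

    Tesseract writes its result to <output_base>.txt, which is read back here.
    Raises subprocess.CalledProcessError if the tesseract run fails.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # "page_001.tif" and "fra_old" are hypothetical; substitute a real page
    # image and the language code chosen during training.
    print(build_tesseract_command("page_001.tif", "page_001", "fra_old"))
```

Calling `run_ocr("page_001.tif", "page_001", "fra_old")` would then return the page text, provided the trained data for that code is actually installed.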