OCR for outdated fonts

Post by knappen » 15 May 2011, 20:37

I was wondering what you would recommend for running an OCR scan on some really old books, let's say French editions from the 17th century that are available for download on gallica.bnf.fr. These books have fonts that an ordinary OCR program wouldn't recognize, I guess.
Maybe such things are only available for research groups or national libraries?

Post by jgreely » 15 May 2011, 23:41

knappen wrote: These books have fonts that an ordinary OCR program wouldn't recognize, I guess.
Book fonts haven't changed much in the past few centuries; most of the ones in use today are direct descendants of much older designs. I grabbed a few sample pages from the 1600s from Gallica, and ABBYY FineReader Pro 9 gave quite decent results despite the low resolution. Most of the errors were in the punctuation (italic "!" becoming "/", etc.).

-j

Post by quân » 19 Nov 2011, 22:40

You can train the Tesseract OCR engine to recognize new or, in your case, old fonts. Here are some pertinent links:

http://code.google.com/p/tesseract-ocr/ ... Tesseract3
http://vietocr.sourceforge.net/training.html
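
Once training has produced a traineddata file, using it is just a matter of passing its name to Tesseract with the -l option. Here is a rough sketch in Python of how that call might be wrapped. It assumes the tesseract binary is on your PATH and that the trained data is installed in Tesseract's tessdata directory; the language name "fra_old" and the page file names are made up for illustration and do not refer to any existing traineddata.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang):
    """Return the argument list for a basic Tesseract run.

    `lang` is the name of a traineddata file without its extension.
    It must already be installed where Tesseract looks for language data.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang):
    """Run Tesseract on one page image.

    Tesseract writes the recognized text to output_base + ".txt";
    that path is returned. Raises CalledProcessError if Tesseract fails.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


# Hypothetical file and language names, for illustration only.
# This only prints the command; call run_ocr(...) to actually run it.
print(build_tesseract_command("page_001.tif", "page_001", "fra_old"))
```

Keeping the command construction separate from the call makes it easy to check what will be run before pointing it at a whole book's worth of page images.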