Generating documents with "extracted" symbols

Posted: 09 May 2014, 04:25
by davetweed
Hi,

I'm interested in printed texts where very important information is conveyed both by the precise kind of symbol (e.g., bold and italic convey slightly different information from bold-italic) and by spatial position (e.g., tables, indented lists, equations, etc.). As such, they're probably not a good fit for running Optical Character Recognition on to get a compressed text. On the other hand, because they are printed, each instance of a given symbol is pretty close to identical (modulo tiny variations due to ink smearing, etc.). So one thing I'm wondering about is whether to try per-book "Optical Character Extraction": generate a bitmap for each distinct symbol (modulo the aforementioned noise), and then store the text in a format that essentially holds the bitmap for each extracted symbol plus lots of "at (x,y), symbol z" records.
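
To make the storage idea concrete, here is a minimal Python sketch of a symbol dictionary plus a placement list. It assumes the symbols have already been segmented into small 0/1 bitmaps with known page positions, and it treats two same-sized bitmaps as the same symbol when they differ in at most `tolerance` pixels. The names `encode`, `decode` and `tolerance` are made up for illustration and don't come from any existing tool.

```python
from typing import List, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]   # rows of 0/1 pixels
Instance = Tuple[int, int, Bitmap]     # (x, y, bitmap) of one printed symbol
Placement = Tuple[int, int, int]       # (x, y, index into the symbol list)


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def same_shape(a: Bitmap, b: Bitmap) -> bool:
    return len(a) == len(b) and len(a[0]) == len(b[0])


def encode(instances: List[Instance], tolerance: int = 1):
    """Build a per-book symbol list and an 'at (x,y), symbol z' list."""
    symbols: List[Bitmap] = []
    placements: List[Placement] = []
    for x, y, bmp in instances:
        for idx, sym in enumerate(symbols):
            # Reuse an existing symbol if this one is within the noise tolerance.
            if same_shape(sym, bmp) and hamming(sym, bmp) <= tolerance:
                break
        else:
            # No close match found: register a new distinct symbol.
            symbols.append(bmp)
            idx = len(symbols) - 1
        placements.append((x, y, idx))
    return symbols, placements


def decode(symbols: List[Bitmap], placements: List[Placement],
           width: int, height: int) -> List[List[int]]:
    """Redraw a page by stamping each stored symbol at its position."""
    page = [[0] * width for _ in range(height)]
    for x, y, idx in placements:
        for dy, row in enumerate(symbols[idx]):
            for dx, pixel in enumerate(row):
                if pixel:
                    page[y + dy][x + dx] = 1
    return page


if __name__ == "__main__":
    a = ((1, 1), (1, 0))
    a_smeared = ((1, 1), (1, 1))   # same symbol with one noisy pixel
    b = ((0, 1), (0, 1))
    symbols, placements = encode([(0, 0, a), (3, 0, a_smeared), (6, 0, b)])
    print(len(symbols), placements)
    for row in decode(symbols, placements, width=8, height=2):
        print(row)
```

In this sketch the first instance seen becomes the stored representative, so a smeared copy is redrawn as the clean one. A real version would need a more robust similarity measure than a pixel count, and probably an averaged representative per symbol.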

While this is still at the level of vague thoughts, I'm trying to find out, firstly, whether there's already software that does this. (I'm primarily on Linux, but could use Windows if something already exists there.) Secondly, is there a better format for doing this than PDF? (It looks like it's possible to store your own bitmap font and then use it in a PDF file, although I've never tried writing PDFs programmatically.)

(The actual extraction of symbols is going to be a bit tricky, but since I work in pattern analysis I've got some reasonable ideas about how to handle that problem.)

Many thanks for any help

Re: Generating documents with "extracted" symbols

Posted: 09 May 2014, 09:15
by cday
If you're not already familiar with ABBYY FineReader, you might take a look at the section on training user patterns in the attached user guide, to see if it could have any relevance to your project...

I haven't explored that option myself.