
Generating documents with "extracted" symbols

Posts: 1
Joined: 08 May 2014, 04:12
E-book readers owned: kindle
Number of books owned: 70
Country: England


Post by davetweed » 09 May 2014, 04:25


I'm interested in printed texts where very important information is conveyed both by the precise kind of symbol (e.g., bold and italic convey slightly different information from bold-italic) and by spatial position (e.g., tables, indented lists, equations, etc.). As such, they're probably not a good fit for running Optical Character Recognition on to get a compressed text. On the other hand, because they are printed, each instance of a given symbol is pretty close to identical (modulo small differences due to ink smearing, etc.). So one thing I'm wondering about is whether to try per-book "Optical Character Extraction": generate a bitmap for each distinct symbol (modulo the aforementioned noise), and then store the text in a format that essentially holds the bitmaps for each extracted symbol plus lots of entries of the form "at (x,y), symbol z".
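To make the idea concrete, here is a minimal Python sketch of such a storage format. Everything in it is an assumption for illustration rather than existing software: the `SymbolDocument` class, the `tolerance` threshold, and the use of a plain pixel-difference count to decide that two glyphs are "the same symbol modulo noise" are all hypothetical choices, and a real extractor would need proper segmentation and a more robust matcher.

```python
from dataclasses import dataclass, field

# A glyph bitmap is a tuple of equal-length rows; '#' is ink, '.' is paper.
Bitmap = tuple[str, ...]


def pixel_difference(a: Bitmap, b: Bitmap):
    """Count differing pixels, or return None if the shapes differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return None
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


@dataclass
class SymbolDocument:
    """Hypothetical container: a symbol table plus "at (x,y), symbol z" entries."""
    tolerance: int = 1                               # max differing pixels (assumed)
    symbols: list = field(default_factory=list)      # one bitmap per distinct symbol
    placements: list = field(default_factory=list)   # (x, y, symbol_id) tuples

    def add_glyph(self, x: int, y: int, bitmap: Bitmap) -> int:
        """Record a glyph at (x, y), reusing a stored symbol if one is close enough."""
        for symbol_id, known in enumerate(self.symbols):
            diff = pixel_difference(known, bitmap)
            if diff is not None and diff <= self.tolerance:
                break                                # matched an existing symbol
        else:
            symbol_id = len(self.symbols)            # no match: register a new symbol
            self.symbols.append(bitmap)
        self.placements.append((x, y, symbol_id))
        return symbol_id

    def render(self, width: int, height: int) -> list:
        """Redraw the page from the symbol table and the placements."""
        page = [["."] * width for _ in range(height)]
        for x, y, symbol_id in self.placements:
            for dy, row in enumerate(self.symbols[symbol_id]):
                for dx, pixel in enumerate(row):
                    inside = 0 <= y + dy < height and 0 <= x + dx < width
                    if pixel == "#" and inside:
                        page[y + dy][x + dx] = "#"
        return ["".join(row) for row in page]


# Usage: two noisy copies of one symbol share a table entry; a third differs.
doc = SymbolDocument(tolerance=1)
doc.add_glyph(0, 0, ("###", "#.#", "###"))
doc.add_glyph(4, 0, ("###", "#.#", "##."))   # same symbol with one smeared pixel
doc.add_glyph(8, 0, ("#..", "#..", "###"))
print(len(doc.symbols), doc.placements)
print("\n".join(doc.render(11, 3)))
```

Note that rendering draws the stored representative bitmap, so the smeared copy comes back clean; whether that is desirable depends on how faithful the reproduction needs to be.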

But while this is still at the level of vague thoughts, I'm trying to see, firstly, whether there's already software that does this. (I'm primarily on Linux, but could use Windows if something already exists.) Secondly, is there a better format for doing this than PDF? (It looks like it's possible to store your own bitmap font and then use it in a PDF file, although I've never tried writing PDFs programmatically.)

(The actual extraction of symbols is going to be a bit tricky, but since I work in pattern analysis I've got some reasonable ideas about how to handle that problem.)

Many thanks for any help

Posts: 242
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Generating documents with "extracted" symbols

Post by cday » 09 May 2014, 09:15

If you're not familiar already with Abbyy FineReader, you might take a look at the section on training user patterns in the attached user guide, to see if it could have any relevance to your project...

I haven't explored that option myself.
Abbyy FineReader 12 User Guide.pdf
