OCR of textbook but retain structure information?


Post by o3h1p »

Goal: book -> XML that contains both the text and its structure information (such as section hierarchy, paragraphs, and headings)

I'm not sure such software even exists. In theory the information is there: font size and special ornaments would mark section boundaries, and multiple columns should be parseable. But I'm not aware of any software that does this.

I'm prepared to develop this software myself. I know the Scan Tailor guy has hung up his hat, but I'm just starting out here, so any information from people more knowledgeable would be helpful.

Challenges:
* page numbers
* headers/footers
* special content boxes
* multiple columns
* section ornaments

Rough procedure:
1) PDFs or book scanner -> images
2) Scan Tailor for cleanup
3) OCR
4) Identify position information and font size information from the OCR output
5) Logic to output to XML
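
As a rough idea of what step 5 could look like, here is a minimal sketch that turns a flat list of OCR lines into nested XML, using line height as a stand-in for font size. It is only an illustration under several assumptions: the function name `lines_to_xml`, the element names (`book`, `section`, `heading`, `para`), and the `heading_ratio` threshold are all made up, body text is assumed to dominate so the median height is the body size, and heights are assumed to be already bucketed so that every heading of the same level has exactly the same height.

```python
import xml.etree.ElementTree as ET
from statistics import median


def lines_to_xml(lines, heading_ratio=1.3):
    """Build a nested <section> tree from lines with 'text' and 'height'.

    A line counts as a heading when its height is at least heading_ratio
    times the median height. Larger headings get lower level numbers.
    """
    body = median(line["height"] for line in lines)
    # Distinct heading heights, largest first: index 0 becomes level 1.
    sizes = sorted(
        {l["height"] for l in lines if l["height"] >= heading_ratio * body},
        reverse=True,
    )

    root = ET.Element("book")
    stack = [(0, root)]  # (level, element); the root sits at level 0
    for line in lines:
        if line["height"] in sizes:
            level = sizes.index(line["height"]) + 1
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section", level=str(level))
            ET.SubElement(section, "heading").text = line["text"]
            stack.append((level, section))
        else:
            ET.SubElement(stack[-1][1], "para").text = line["text"]
    return root


lines = [
    {"text": "Chapter One", "height": 50},
    {"text": "First paragraph.", "height": 25},
    {"text": "Second paragraph.", "height": 25},
    {"text": "Section 1.1", "height": 36},
    {"text": "Third paragraph.", "height": 25},
    {"text": "Fourth paragraph.", "height": 25},
    {"text": "Chapter Two", "height": 50},
    {"text": "Fifth paragraph.", "height": 25},
]
book = lines_to_xml(lines)
xml_text = ET.tostring(book, encoding="unicode")
```

Each OCR line becomes its own `para` here; merging lines into real paragraphs (by indentation or vertical gaps) would be a separate step.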

Of course, step 4 is the hard part. I know Tesseract is an open-source OCR engine, but I'm not sure it provides font size or position information. If it doesn't, I would need to do Scan Tailor-style image processing myself (which I'm prepared to do).
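
For what it's worth, Tesseract can write hOCR output (for example `tesseract page.png page hocr`), which is HTML where each line element carries a `title` attribute containing `bbox x0 y0 x1 y1`. The sketch below pulls those boxes out with the Python standard library only and uses the box height as a crude proxy for font size. The class name `LineExtractor`, the helper `extract_lines`, and the sample markup are illustrative assumptions; the set of line-level classes is taken from the hOCR format as I understand it and may need adjusting for a given Tesseract version.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
# Line-level hOCR classes (assumed; check against your OCR engine's output).
LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}


class LineExtractor(HTMLParser):
    """Collect text, bounding box, and height for each line-level element."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._bbox = None   # bbox of the line currently being read, if any
        self._depth = 0     # nesting depth inside that line element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1            # a word span nested inside the line
            return
        classes = set((attrs.get("class") or "").split())
        match = BBOX.search(attrs.get("title") or "")
        if classes & LINE_CLASSES and match:
            self._bbox = tuple(int(g) for g in match.groups())
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:            # the line element itself just closed
            text = " ".join(" ".join(self._chunks).split())
            x0, y0, x1, y1 = self._bbox
            self.lines.append(
                {"text": text, "bbox": self._bbox, "height": y1 - y0}
            )
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)


def extract_lines(hocr):
    parser = LineExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Hand-written sample in the hOCR style, not real Tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
 <span class='ocr_line' title='bbox 100 80 600 130; baseline 0 -5'>
  <span class='ocrx_word' title='bbox 100 80 300 130'>Chapter</span>
  <span class='ocrx_word' title='bbox 320 80 600 130'>One</span>
 </span>
 <span class='ocr_line' title='bbox 100 200 900 225'>
  <span class='ocrx_word' title='bbox 100 200 200 225'>Body</span>
  <span class='ocrx_word' title='bbox 210 200 300 225'>text</span>
 </span>
</div>
"""
ocr_lines = extract_lines(SAMPLE)
```

Box height is only a rough signal: a line with no ascenders or descenders gets a shorter box than one in the same font that has them, so heights probably need smoothing or bucketing before being compared.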

Any software recommendations or advice would be a big help. I have (or rather, am starting to have) a software development background, so I'm prepared to develop this software myself if it doesn't already exist.

Thanks