Goal: book -> XML that contains the text plus structure information (such as section hierarchy, paragraphs, and headings).
I'm not sure such software even exists. In theory the information is there: font size and special ornaments mark section boundaries, and multiple columns should be parseable, but I'm not aware of any software that does this.
I'm prepared to develop this software myself. I know the Scan Tailor guy has hung up his hat, but I'm just starting out here, so any information from people more knowledgeable would be helpful.
Challenges:
* page numbers
* headers/footers
* special content boxes
* multiple columns
* section ornaments
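For the page number and header/footer items, one rough heuristic is to look for lines that recur at the top or bottom of most pages once digits are masked out. This is only a sketch under assumptions: `pages` is a hypothetical list of per-page text lines (already in reading order), and the 0.6 threshold is an arbitrary starting point, not a tuned value.

```python
import re
from collections import Counter

def normalize(line):
    """Mask digits so '12' and '13' (or 'Page 12' and 'Page 13') compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())

def find_running_lines(pages, min_fraction=0.6):
    """Return normalized first/last lines that recur on at least min_fraction of pages."""
    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # A set avoids double counting when a page has a single line.
        counts.update({normalize(lines[0]), normalize(lines[-1])})
    threshold = min_fraction * len(pages)
    return {text for text, n in counts.items() if n >= threshold}

def strip_running_lines(pages, min_fraction=0.6):
    """Drop the first/last line of each page when it looks like a header or footer."""
    running = find_running_lines(pages, min_fraction)
    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and normalize(body[0]) in running:
            body = body[1:]
        if body and normalize(body[-1]) in running:
            body = body[:-1]
        cleaned.append(body)
    return cleaned

# Made-up sample data: a running title on top, a page number at the bottom.
pages = [
    ["Intro to Botany", "Plants make food from light.", "12"],
    ["Intro to Botany", "Roots absorb water.", "13"],
    ["Intro to Botany", "Leaves exchange gases.", "14"],
]
cleaned = strip_running_lines(pages)
```

Books that alternate headers between left and right pages, or that omit them on chapter openers, would need a lower threshold or separate handling for odd and even pages.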
Rough procedure:
1) PDFs or book scanner -> images
2) Scan Tailor for cleanup
3) OCR
4) Identify position information and font size information from OCR
5) logic to output to xml
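A rough sketch of how steps 4 and 5 could fit together, assuming step 4 has already produced a list of `(text, height_px)` pairs per line. The element names (`book`, `section`, `heading`, `p`), the 1.3 ratio, and the sample data are all made up for illustration. It also assumes every heading of a given level has exactly the same measured height, which real OCR output will not guarantee, so heights would need to be bucketed first.

```python
import statistics
import xml.etree.ElementTree as ET

def lines_to_xml(lines, heading_ratio=1.3):
    """Turn (text, height_px) lines into nested <section> elements."""
    # Treat the median line height as the body text size.
    body_height = statistics.median(h for _, h in lines)
    # Distinct heading sizes, largest first: index 0 is the top-level heading.
    heading_sizes = sorted(
        {h for _, h in lines if h >= heading_ratio * body_height}, reverse=True
    )
    root = ET.Element("book")
    stack = [(0, root)]  # (heading level, element); level 0 is the book itself
    for text, height in lines:
        if height in heading_sizes:
            level = heading_sizes.index(height) + 1
            # Close any open sections at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section", level=str(level))
            ET.SubElement(section, "heading").text = text
            stack.append((level, section))
        else:
            ET.SubElement(stack[-1][1], "p").text = text
    return root

# Made-up sample: 40 px chapter titles, 28 px subsection titles, 20 px body text.
lines = [
    ("Chapter 1", 40),
    ("Some intro text.", 20),
    ("1.1 Cells", 28),
    ("Cells are small.", 20),
    ("More on cells.", 20),
    ("Chapter 2", 40),
    ("Another paragraph.", 20),
]
root = lines_to_xml(lines)
xml_text = ET.tostring(root, encoding="unicode")
```

Paragraph detection is skipped here: each OCR line becomes its own `p`, so merging lines into paragraphs (by indentation or vertical gaps) would be a separate step.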
Of course, step 4 is the hard part. I know Tesseract is an open-source OCR engine, but I'm not sure it provides font size or position information. If it doesn't, I would need to do Scan Tailor-style image processing myself (which I'm prepared to do).
Any software recommendations or advice would be a big help. I have (or rather, am starting to have) a software development background, so I'm prepared to develop this software myself if it doesn't already exist.
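For what it's worth, Tesseract can write a TSV file with a pixel bounding box for every word (for example `tesseract page.png out tsv`), so position information is available, and line height can serve as a crude proxy for font size. Below is a minimal sketch of grouping those word rows into lines. The sample rows are hand-written to match the TSV column layout, not real OCR output, and `lines_from_tsv` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]

# Hand-written sample in Tesseract's TSV layout (level 5 rows are words).
SAMPLE_ROWS = [
    HEADER,
    ["4", "1", "1", "1", "1", "0", "100", "50", "220", "40", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "100", "50", "180", "40", "96", "Chapter"],
    ["5", "1", "1", "1", "1", "2", "290", "50", "30", "40", "95", "1"],
    ["5", "1", "2", "1", "1", "1", "100", "120", "60", "20", "93", "Plants"],
    ["5", "1", "2", "1", "1", "2", "170", "120", "50", "20", "91", "make"],
    ["5", "1", "2", "1", "1", "3", "230", "124", "50", "16", "92", "food."],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in SAMPLE_ROWS)

def lines_from_tsv(tsv_text):
    """Group word rows into lines with text, left edge, top edge, and height."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = defaultdict(list)
    for row in reader:
        # Keep only word-level rows that actually carry text.
        if row["level"] != "5" or not (row["text"] or "").strip():
            continue
        key = (int(row["page_num"]), int(row["block_num"]),
               int(row["par_num"]), int(row["line_num"]))
        words[key].append(row)
    lines = []
    for key in sorted(words):
        rows = words[key]
        left = min(int(r["left"]) for r in rows)
        top = min(int(r["top"]) for r in rows)
        bottom = max(int(r["top"]) + int(r["height"]) for r in rows)
        lines.append({
            "text": " ".join(r["text"] for r in rows),
            "left": left,
            "top": top,
            "height": bottom - top,  # crude stand-in for font size
        })
    return lines

ocr_lines = lines_from_tsv(SAMPLE_TSV)
```

Line height depends on which letters appear in the line (ascenders and descenders), so comparing heights across lines is noisy. The `left` values are what a column detector would cluster on.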
Thanks
ocr of textbook but retain structure information?