Indentation - is there any OCR that recognizes that?

alexanderfranca
Posts: 7
Joined: 04 Mar 2014, 00:53

Post by alexanderfranca »

Hi.

Does anyone here know of an OCR program that recognizes first-line paragraph indentation?

I use Linux, so I don't know much about Windows/Mac OCR software, but if there is any, I'm interested anyway.

I'm using Tesseract for OCR, but I can't find any feature that does what I want.

[]s
Alexander
Brazil - Rio de Janeiro
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Indentation - is there any OCR that recognizes that?

Post by spamsickle »

What is it that you want to do?

I'm no expert on OCR, but there are often several ways to accomplish anything you want to do on the computer. You seem to have identified one way to accomplish what you want to do, but by framing your question with the one way you've identified you may be excluding other solutions.

Which is the defensive way of saying, I don't know how to identify indentation from optical character recognition. The "character" in the term suggests that it's not really designed to identify or track white space, but it is certainly true that some applications will map the characters they identify back to the image -- that's what allows the search function to highlight the word (or part of a word) I'm seeking. I'm doing my OCR with Adobe Acrobat currently, and it seems to keep track of whole words, or occasionally pieces of words, by location. I expect that if one got access to that location information, one could figure out where a new paragraph began, but it really comes back to what it is you're trying to do.
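
The idea of using location information can be sketched in a few lines of Python. This is only an illustration, assuming the OCR engine can report one bounding box per text line as `(x0, y0, x1, y1)` pixel coordinates; the function name `find_indented_lines` and the `min_indent_px` threshold are hypothetical, not part of any OCR package.

```python
from statistics import median


def find_indented_lines(line_boxes, min_indent_px=15):
    """Return one boolean per line: True if the line starts right of the body margin.

    line_boxes: list of (x0, y0, x1, y1) tuples, one per text line, in reading order.
    min_indent_px: assumed threshold in pixels; tune it to the scan resolution.
    """
    if not line_boxes:
        return []
    # Most lines of running prose start at the body margin, so the median
    # left edge is used here as an estimate of that margin.
    margin = median(box[0] for box in line_boxes)
    return [box[0] - margin >= min_indent_px for box in line_boxes]


boxes = [
    (140, 100, 900, 130),  # indented first line of a paragraph
    (100, 135, 900, 165),
    (100, 170, 620, 200),  # short last line of the paragraph
    (140, 205, 900, 235),  # indented first line of the next paragraph
    (100, 240, 900, 270),
]
print(find_indented_lines(boxes))  # [True, False, False, True, False]
```

The median is a weak estimate on pages that are mostly short dialogue paragraphs, where indented lines can outnumber the rest; taking the smallest left edge on the page may work better there.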
alexanderfranca
Posts: 7
Joined: 04 Mar 2014, 00:53

Re: Indentation - is there any OCR that recognizes that?

Post by alexanderfranca »

I have a Kindle.

I only want the OCR output to keep the paragraph indentation, the first-line spaces, as in the original physical book.

OK, it's not a major problem, but it would be very nice to have a way to reproduce the original indentation. It's part of how the writer said what he intended to say.

I don't even know how to search for a solution to that...

In fact, I can't imagine why some OCR programs do layout analysis yet don't automatically identify first-line indentation...

I really don't know if I'm asking the wrong question...

[]s
Alexander
Brazil - Rio de Janeiro
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Indentation - is there any OCR that recognizes that?

Post by daniel_reetz »

ABBYY FineReader tries to preserve page formatting... I think those are your magic search words: "preserve page formatting"... I'll try to help more later; I know how hard it can be to find the right search terms in your second language...
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Indentation - is there any OCR that recognizes that?

Post by rob »

Alexander,

I know exactly what you're talking about. When I run OCR on a book, I like to see the formatting kept the same: italics, boldface, justification, first-line indent, and so on. The only program I know of that does this somewhat well is ABBYY FineReader. It has a setting that tells FineReader to keep the formatting. However, I found a serious bug in the formatting and reported it to ABBYY. The FineReader developers replied that formatting isn't guaranteed.

Here is what I sent them (for FineReader 9):
I have a jpg file of a single page from a fiction book. FineReader
performs well in recognition, and very well in formatting the document,
but it looks like sometimes the formatting of the recognized document in
the Text window is not exported properly. In other words, although the
Text window shows the correct formatting, the exported document is not
formatted properly. There is no limiting factor in the export format,
which is why I think there could be a bug in FineReader export. Could
you please investigate the issue? I love FineReader, I think it's better
than any other OCR package out there, but this export issue makes me cry
:)

I'm using exact HTML export with Full CSS. Most of the time the export
is correct, but sometimes, annoyingly, the export is incorrect.
There doesn't seem to be anything about the HTML export that would
prevent the format from being correct.

I have included the original jpg image, along with the exported HTML and
PDF output. I draw your attention to these paragraphs:

(indented properly) "Eighty-five is the best I can do."
(indented properly) "Okay, I'll talk him into eighty-five. But just for
you. I wouldn't do it for anybody else."
(NOT indented properly) "You're a sweetheart."

...

(indented properly) "Not earned out yet? Are you sure?"
(indented properly) "Sad but true."
(indented properly) "Hmm. Well, I guess Sheldon can live with a million
until the next royalty checks come in. In his tax bracket, it isn't so
bad."
(NOT indented properly) "The self-discipline will be good for him."
(NOT indented properly) "But how about making it a two-book deal?"
(rest of page NOT indented properly, except for last paragraph)

I checked the other exports, and found the following (each with Exact
Copy selected):

HTML: Not formatted properly
PDF: Formatted properly
RTF: Not formatted properly
DOC: Not formatted properly
XML: Not formatted properly


Here are my settings:

Document:
    Document languages: English
    Document print type: Autodetect
    (all other options not selected)

Scan/Open:
    Automatically read acquired page images
    Image Processing
        X Correct image skew
        X Detect page orientation
        (all other options not selected)

Read:
    Thorough reading
    Table processing
        (all options not selected)
    Training
        Do not use user patterns

Save:
    HTML:
        Retain Layout: Exact copy
        Save mode: Full (use CSS)
        Text Settings:
            X Use solid line as page break
            X Keep headings and footers
            (all other options not selected)
        Picture Settings: Medium (for screen)
        Character encoding:
            Code page: (Automatic)
            Code page type: Windows

And here was their reply:
Thank you for contacting ABBYY USA Software House Inc.

After further review and testing of the images provided, the software unfortunately is not able to properly indent the certain sentences.
The results you get with FineReader 9 are not going to be a mirror image of the original document. The software only converts the text image to editable text, in regards to the indents you will have to edit them once saved to the format desired.

Please contact me if you have any further questions regarding this issue.
So that means that ABBYY has no plans to fix the issue. I tested FineReader 10, and the issue is still there.

At one point I had ABBYY output HTML with line breaks (or maybe PDF; it was a long time ago), and I wrote a custom postprocessor program, just for that one book, to find the paragraphs, indent their first lines, and join the lines within each paragraph. The hardest part was deciding what to do at page breaks.
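
A postprocessor of that kind might look roughly like the Python sketch below. It is not the original program, only an illustration under stated assumptions: each OCR line arrives as an `(indented, text)` pair, an indented line always starts a new paragraph, and a non-indented line at the top of a page continues the last paragraph of the previous page. The names `join_paragraphs` and `to_html` and the `1.5em` indent are hypothetical.

```python
from html import escape


def join_paragraphs(pages):
    """Merge OCR lines into paragraphs.

    pages: list of pages; each page is a list of (indented, text) pairs,
    where `indented` is True when the line has a first-line indent.
    Paragraphs are allowed to run across page breaks.
    """
    paragraphs = []
    for page in pages:
        for indented, text in page:
            text = text.strip()
            if not text:
                continue  # skip empty lines
            if indented or not paragraphs:
                paragraphs.append(text)       # start a new paragraph
            else:
                paragraphs[-1] += " " + text  # continue the current paragraph
    return paragraphs


def to_html(paragraphs, indent="1.5em"):
    """Wrap each paragraph in a <p> element with a CSS first-line indent."""
    return "\n".join(
        f'<p style="text-indent: {indent}">{escape(p, quote=False)}</p>'
        for p in paragraphs
    )


pages = [
    [(True, '"Eighty-five is the best I can do."'),
     (True, '"Okay, I\'ll talk him into eighty-five. But just for')],
    # Page break: the first line below is not indented, so it continues
    # the paragraph that started on the previous page.
    [(False, 'you. I wouldn\'t do it for anybody else."'),
     (True, '"You\'re a sweetheart."')],
]
print(to_html(join_paragraphs(pages)))
```

The page-break rule is the fragile part: a paragraph that happens to end exactly at the bottom of a page, followed by an unindented line such as a chapter opening, would be joined incorrectly, so results still need proofreading.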

Here's the sample that I sent them. The jpg image is the input, and the PDF and HTML images are screencaps of the PDF and HTML outputs.
[Image: original page scan]
[Image: PDF output (pdf-image.jpg)]
[Image: HTML output (html-image.jpg)]
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
alexanderfranca
Posts: 7
Joined: 04 Mar 2014, 00:53

Re: Indentation - is there any OCR that recognizes that?

Post by alexanderfranca »

ABBYY FineReader seems to be perfect for what I need!!!

Thank you!!!

I'll try paying the US$200 and... maybe... running it on Linux (it probably won't work).

By the way, a guy told me 15 minutes ago that he will put a hack into Tesseract OCR (our Linux bet) to preserve... not the whole page formatting, I think, but at least the paragraph indentation.

Yes, it is a coincidence. :D

I'm pretty surprised ABBYY keeps so many formatting "items"...

Thank you guys! I thought no OCR could do that!

[]s
Alexander
Brazil - Rio de Janeiro
lexicographer

Re: Indentation - is there any OCR that recognizes that?

Post by lexicographer »

Rob,
have you tried saving the PDF (which, as I see, is correctly indented) as HTML? It would be interesting to see if that gives better results. Or you could save the FineReader output as DOC (which is actually RTF masquerading as DOC) and export that as HTML. Depending on the version of Winword or OpenOffice, the HTML could be better or worse.
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Indentation - is there any OCR that recognizes that?

Post by rob »

Well, the problem with the PDF output is that although it seems exact, it is not reflowable, so there is still work to be done after you convert the PDF to HTML.

Alexander -- if you output to PDF, be aware that Kindle's conversion program will convert each page to an image which might not display well on the Kindle screen.
rwreed
Posts: 21
Joined: 23 Jan 2011, 16:15

Re: Indentation - is there any OCR that recognizes that?

Post by rwreed »

Alexander,

I know this is an old thread, but I wondered whether the Tesseract hack you were offered actually recognized and reproduced indentation, and whether you might share it.

thanks
randy
Anonymous1

Re: Indentation - is there any OCR that recognizes that?

Post by Anonymous1 »

You should be able to do it with Ocropus and hocr2pdf. I haven't gotten it to work with Ocropus yet, but Cuneiform works:

cuneiform -f hocr -o test.hocr test.tif && hocr2pdf -i test.tif -n -o test.pdf < test.hocr
The output is a bit rusty, but it's pure text and formatted.
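
The intermediate hOCR file is itself a way to get at line positions. The Python sketch below is an illustration only: it assumes the file marks each text line with the class `ocr_line` and stores its coordinates in the `title` attribute as `bbox x0 y0 x1 y1`, as the hOCR format describes; whether a given Cuneiform or Ocropus version writes exactly this should be checked against its real output. The names `LineBoxParser` and `line_boxes` are hypothetical.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. title="bbox 140 100 900 130"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collect the bounding box of every element whose class includes ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self.boxes.append(tuple(int(n) for n in match.groups()))


def line_boxes(hocr_text):
    """Return a list of (x0, y0, x1, y1) tuples, one per ocr_line element."""
    parser = LineBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


sample = """
<div class="ocr_page" title="bbox 0 0 1000 1400">
  <span class="ocr_line" title="bbox 140 100 900 130">First line of a paragraph</span>
  <span class="ocr_line" title="bbox 100 135 900 165">second line of the same paragraph</span>
</div>
"""
print(line_boxes(sample))  # [(140, 100, 900, 130), (100, 135, 900, 165)]
```

Here the first line starts 40 pixels to the right of the second, which is the kind of offset that marks a first-line indent. For a real file, something like `line_boxes(open("test.hocr", encoding="utf-8").read())` would apply, with the encoding being a guess.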