Indentation - is there any OCR that recognizes that?

gvfarns

Post by gvfarns »

For what it's worth to anyone, I use tesseract V2 (which does not detect paragraphs) to scan books, and then I have a vi script (just a sequence of global search-and-replace commands) that goes through and identifies paragraphs from the output. Since tesseract output has the same content per line as the original page, I take advantage of the fact that most paragraphs begin with a capital letter or a double quotation mark. Most lines that do not begin a paragraph do not begin with capitals or quotes, especially in a novel.

Basically my script goes through and does the following:
  • Change curly double and single quotes to straight. Change long dashes to two short dashes.
  • Remove all blank lines (which are artifacts of concatenating pages)
  • When a line ends in a hyphen and the next line begins with a lower-case letter, remove the hyphen and join that line with the next with no space between (for words that were cut at the end of the line). If the line ended with a long dash, there are (hopefully) two dashes there after the first step, so one is still left at the join. Alternatively, I fix errors caused by this in spellcheck after I'm done.
  • If a line begins with a capital letter or a double quotation mark, insert a blank line above it.
Then I search for the proper names in the book that begin a line (there aren't usually many) and glance to see if they are creating new paragraphs when they shouldn't.

Blank lines are then detected by calibre as the beginning of paragraphs, and you are done.
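
For anyone who would rather not use vi, the steps above can be sketched in Python. This is only a rough illustration, not the original script: the function name mark_paragraphs is made up, and it assumes the OCR output is plain text with one output line per printed line.

```python
def mark_paragraphs(text: str) -> str:
    """Guess paragraph breaks in line-per-line OCR output (hypothetical helper)."""
    # Step 1: straighten curly quotes and turn long dashes into two hyphens.
    replacements = {
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u2014": "--",                 # long dash
    }
    for fancy, plain in replacements.items():
        text = text.replace(fancy, plain)

    # Step 2: drop blank lines (artifacts of concatenating pages).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Step 3: if a line ends in a hyphen and the next line starts with a
    # lower-case letter, remove that hyphen and join the lines with no space.
    joined = []
    for ln in lines:
        if joined and joined[-1].endswith("-") and ln[:1].islower():
            joined[-1] = joined[-1][:-1] + ln
        else:
            joined.append(ln)

    # Step 4: put a blank line above every line that starts with a capital
    # letter or a straight double quote (the paragraph-start heuristic).
    out = []
    for ln in joined:
        if out and (ln[0].isupper() or ln[0] == '"'):
            out.append("")
        out.append(ln)
    return "\n".join(out)
```

As in the description, a line that happens to start with a proper name will be split off as a new paragraph, so the manual check for names at the start of a line is still needed afterwards.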

Of course, cuneiform tries to detect paragraphs, but I'd rather have the errors induced by my method than the ones I find using cuneiform. That is, my script detects all paragraphs, but also splits some paragraphs that should not be split. Cuneiform tends to err on the other side, joining paragraphs that don't belong together (assuming you remove blank lines where the subsequent line begins with a lower-case letter). I'd rather have paragraphs split than have two paragraphs joined that shouldn't be.

I understand ABBYY does a good job, but I only use free software.

Has anyone experimented with tesseract V3? I thought that version had page layout analysis but I haven't played with it. If it works well it could potentially be quite labor-saving.

Anonymous1

Post by Anonymous1 »

OCRopus allegedly has format-recognition, but I haven't seen it work.

text-freak

Post by text-freak »

Hi,

I have had that problem myself and, while looking for a solution, I discovered this thread. For the record: for me https://www.ocrgeek.com/ worked perfectly - I choose to produce searchable PDF or also DjVu files... In principle this should be the same as the hOCR stuff, but I prefer the online interface.

Good Luck!