Basically my script goes through and does the following:
- Change curly double and single quotes to straight. Change long dashes to two short dashes.
- Remove all blank lines (which are artifacts of concatenating pages)
- When a line ends in a hyphen and the next line begins with a lower-case letter, remove the hyphen and join the two lines with no space between them (this repairs words that were split at the end of a line). If the line originally ended with a long dash, it now (hopefully) ends with two short dashes, so one is still left after the join. Alternatively, I fix any errors this causes during spellcheck after I'm done.
- If a line begins with a capital letter or a double parenthesis, insert a blank line above it.
Blank lines are then detected by calibre as the beginning of paragraphs, and you are done.
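The steps above could be sketched in Python roughly like this. It is only an illustration, not the actual script: the function name clean_ocr_text and the constants are made up for the example, and "double parenthesis" is read here as a straight double quote, which is an assumption (change PARA_OPENERS if an opening parenthesis is meant).

```python
# Hypothetical sketch of the cleanup steps; names are illustrative only.

# Curly quotes and long dashes mapped to plain ASCII equivalents.
REPLACEMENTS = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\u2014": "--",                 # long dash -> two short dashes
}

# Assumption: besides a capital letter, a straight double quote may also
# open a paragraph. Adjust this string to taste.
PARA_OPENERS = '"'


def clean_ocr_text(text):
    # Step 1: straighten quotes and replace long dashes.
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)

    # Step 2: drop blank lines; surrounding whitespace is stripped too.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Step 3: if a line ends in a hyphen and the next line starts with a
    # lower-case letter, remove the hyphen and join them with no space.
    joined = []
    for line in lines:
        if joined and joined[-1].endswith("-") and line[0].islower():
            joined[-1] = joined[-1][:-1] + line
        else:
            joined.append(line)

    # Step 4: put a blank line above every line that starts with a capital
    # letter or an opener (skipped for the very first line of the text).
    out = []
    for line in joined:
        if out and (line[0].isupper() or line[0] in PARA_OPENERS):
            out.append("")
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "\u201cHello,\u201d she said, that was re-\n"
    "\n"
    "markable.\n"
    "Then he left\u2014\n"
    "and returned.\n"
)
print(clean_ocr_text(sample))
```

Note that the long-dash case behaves as described: "left" followed by a long dash becomes "left--", and after the join one dash remains ("left-and"), which is the kind of thing that still has to be caught in spellcheck.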
Of course, Cuneiform tries to detect paragraphs itself, but I'd rather have the errors introduced by my method than the ones I get from Cuneiform. That is, my script detects all paragraphs, but it also splits some paragraphs that should not be split. Cuneiform tends to err on the other side, joining paragraphs that don't belong together (assuming you remove blank lines where the following line begins with a lower-case letter). I'd rather have a paragraph split than have two paragraphs joined that shouldn't be.
I understand ABBYY does a good job, but I only use free software.
Has anyone experimented with Tesseract v3? I thought that version had page layout analysis, but I haven't played with it. If it works well, it could save quite a lot of labor.