ABBYY finds too many different styles

L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

ABBYY finds too many different styles

Post by L.Willms »

While I am satisfied with the text recognition performance of ABBYY FineReader (version 11 here; the latest is 12), I have problems with its text output.

The first problem is that ABBYY finds too many different paragraph and text styles, although the styles are the same on every page. ABBYY believes it has found slight differences and marks those paragraphs with a different style. So I have (named in German) "Fließtext", then "Fließtext (1)", "Fließtext(2)" and so on up to "Fließtext (23)" or more, with slight differences in font size or recognized font ("Fließtext" being German for "running text" or body text). BTW, since the text is for publishing on the Web, and all presentation-level formatting is done with CSS, I would prefer that the OCR program not care about the fonts being used...

Isn't there a way to tell ABBYY to recognize only one type of running text paragraph and then maybe an additional footnote paragraph (which is normally set in a smaller type)?
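
Failing a setting inside ABBYY, the numbered variants could in principle be folded back together after export. The following is only a minimal Python sketch of that idea, not something FineReader offers: it assumes the style names have already been extracted as plain strings, and the helper names `base_style` and `merge_styles` as well as the "Fußnote (3)" example are hypothetical.

```python
import re

# Matches a trailing counter such as " (1)" or "(2)" appended to a style name.
_COUNTER = re.compile(r"\s*\(\d+\)\s*$")


def base_style(name: str) -> str:
    """Return the style name without its trailing numeric counter."""
    return _COUNTER.sub("", name).strip()


def merge_styles(names):
    """Map every exported style name to the base style it should collapse into."""
    return {name: base_style(name) for name in names}


# Example names as reported above; "Fußnote (3)" is a made-up footnote style.
styles = ["Fließtext", "Fließtext (1)", "Fließtext(2)", "Fließtext (23)", "Fußnote (3)"]
mapping = merge_styles(styles)
```

With this mapping, all "Fließtext" variants collapse into a single body-text style, while a separate footnote style would survive under its own base name.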
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: ABBYY finds too many different styles

Post by BruceG »

Although I am not an ABBYY user, I would look for OCR Options - Font Matching. I am sure you can choose a single font for output. The default may be all fonts on your system, which is where your problem has come from.
I doubt you can force ABBYY's hand on font size, bold, etc.
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY finds too many different styles

Post by L.Willms »

BruceG wrote: Although I am not an ABBYY user, I would look for OCR Options - Font Matching. I am sure you can choose a single font for output. The default may be all fonts on your system, which is where your problem has come from.
I guess you are right. There is an option "fonts" ("Schriftarten" in my German-language version, described as "die im erkannten Text verwendenden Schriftarten", which I translate as "the fonts to be used in the recognized text").

The problem is that ABBYY FineReader 11 crashes when I click that button.

Or rather, it did crash in the originally released version 11.0.275, as installed three years ago from CD.

After finally installing the update to 11.11.194, which I had already downloaded at the end of June 2014, the error message pointing to a problem in ".\Src\FreeType\TTOs2Table.cpp, 206" appeared as before, but after some clicks a table with all installed fonts finally showed up, in which I could deselect a number of fonts that should not be used in the recognized text. I'll see how far the problem disappears or at least diminishes.

Thanks for pushing me to install the update! Some external events had turned away my attention in the summer of 2014.

Another annoyance is the varying margins and padding of the paragraphs, which result from differences in the scanned images of the printed book pages.

I have settled on a software solution for the next step: I let ABBYY FineReader send the recognized text to MS Word, and write my own VBA script to export plain HTML from the document, without any of the formatting which ABBYY believes it has found and puts into each and every <p> HTML tag as a CSS style. I do not want those styles, since I am interested only in the content of the scanned book, not its appearance. I control the latter with my own CSS style sheet.
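
For anyone who ends up with ABBYY's HTML export instead of going through Word, the same cleanup can be sketched outside VBA. The following is a minimal Python illustration using only the standard library's `html.parser`; it is not the VBA script mentioned above. It assumes the unwanted formatting sits in inline `style` attributes, and the names `StyleStripper` and `strip_styles` are hypothetical. Comments and doctype declarations are dropped in this simplified version.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the presentation-only formatting lives in inline "style" attributes.
DROP_ATTRS = {"style"}


class StyleStripper(HTMLParser):
    """Re-emit HTML while leaving out the attributes listed in DROP_ATTRS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _attrs(self, attrs):
        kept = []
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            if value is None:
                kept.append(f" {name}")  # attribute without a value
            else:
                kept.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(kept)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser, so escape again.
        self.out.append(escape(data, quote=False))


def strip_styles(html_text: str) -> str:
    """Return html_text with all inline style attributes removed."""
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = '<p style="margin-left:3pt; font-size:10.5pt">Text <i>kursiv</i> &amp; mehr</p>'
cleaned = strip_styles(sample)
```

Italic and bold tags pass through untouched, so basic character formatting survives while the per-paragraph CSS is gone and the site's own style sheet takes over.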
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: ABBYY finds too many different styles

Post by BruceG »

Another area to look at is how you save. In Omnipage there are 30+ file types to save in, and each one has its own options. I am sure ABBYY is the same.
Perhaps by looking at the file types and options you will find what you are after.
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY finds too many different styles

Post by L.Willms »

BruceG wrote: Another area to look at is how you save. In Omnipage there are 30+ file types to save in, and each one has its own options. I am sure ABBYY is the same.
Yes, and there are a lot of options, but it is not really clear what each of them means.

As said before, for my primary form of output I intend to let ABBYY send the recognized text directly to MS Word, where I can edit the text, add annotations, etc., and then create an HTML file according to my scheme using a VBA script, i.e. without any STYLE attributes in the text, only a reference to the CSS file used on my Web sites.

In another project, where I have scanned old technical magazines and a book whose authors have not yet been dead long enough for the works to be free of copywrong protection, I am thinking about producing PDFs with searchable text.
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY finds too many different styles

Post by L.Willms »

L.Willms wrote: 10 Oct 2016, 05:29 There is an option "fonts" ("Schriftarten" in my German language version, "die im erkannten Text verwendenden Schriftarten") "the fonts to be used in the recognized text" (in my translation).

The problem is that ABBYY FineReader 11 crashes when I click that button.

Or rather, it did crash in the originally released version 11.0.275, as installed three years ago from CD.
Now using version 12, I found that FineReader shows an exception dialogue at that moment but does not crash, and after I acknowledge the error several times, it continues. Maybe I have too many fonts installed for FineReader to cope with.
L.Willms wrote: 10 Oct 2016, 05:29 I have settled on a software solution for the next step: I let ABBYY FineReader send the recognized text to MS Word, and write my own VBA script to export plain HTML from the document, without any of the formatting which ABBYY believes it has found and puts into each and every <p> HTML tag as a CSS style. I do not want those styles, since I am interested only in the content of the scanned book, not its appearance. I control the latter with my own CSS style sheet.
To clarify: when saving the recognized text as plain text, one can preserve the basic character formatting, i.e. italics and bold. Paragraphs are recognized as such, but the formatting of paragraphs and pages is not saved.

That is as I want it.