ABBYY FineReader

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

ABBYY FineReader

Post by rob »

Hey all,

After I'm done scanning and postprocessing, I run my scans through FineReader. I have found that FineReader gives very good results, except for drop caps/raised caps, and occasionally double quotes. I think FineReader outperforms OmniPage. And besides, every time I tried OmniPage, it crashed.

After FineReader completes its page segmentation and reading, it goes through the text and shows you everything that it wasn't certain about, giving you the opportunity to correct the problem. This is probably the most tedious part of OCR, but it is pretty much unavoidable. After you're done, you have the option of sending the text output to txt, doc, pdf, html, and a few other formats. You also have the choice of exact output (retains margins, sizes, etc., in addition to formatting), formatted output (retains paragraphs and italic, bold, etc.), or plain output (line-by-line, text-only output).

FineReader is fairly smart about hyphenated words at the ends of lines. If it can see the rest of the word at the beginning of the next line, it will usually replace the hyphen with an optional (soft) hyphen, which word processors use to break a word only if it falls at the end of a line. It is not, however, smart enough to put together hyphenated words across page boundaries.
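
If you export plain text page by page, the page-boundary case can be patched up afterwards. Here is a minimal Python sketch, not anything FineReader does itself: the function name `join_pages` and the rule "trailing hyphen after a letter, next page starts with a lowercase letter" are my own assumptions, and it will wrongly merge real compounds like "well-" / "known" that happen to break across a page.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, the "optional hyphen" mentioned above


def join_pages(pages, use_soft_hyphen=False):
    """Concatenate page texts, rejoining a word split by a hyphen at a page end.

    Hypothetical helper for illustration. A split is assumed when the text so
    far ends in letter + '-' and the next page begins with a lowercase letter.
    """
    joined = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue  # skip blank pages
        split_word = (
            len(joined) >= 2
            and joined.endswith("-")
            and joined[-2].isalpha()
            and page[0].islower()
        )
        if split_word:
            # Drop the hard hyphen; optionally leave a soft hyphen in its place.
            joiner = SOFT_HYPHEN if use_soft_hyphen else ""
            joined = joined[:-1] + joiner + page
        elif joined:
            joined += "\n" + page
        else:
            joined = page
    return joined


pages = ["The quick brown fox jum-", "ped over the dog."]
print(join_pages(pages))
```

Passing `use_soft_hyphen=True` keeps an optional break point in the word instead of removing the hyphen outright, which is closer to what FineReader does within a page.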

While FineReader usually gets paragraph justification correct (i.e. right, left, centered, full), it will occasionally fail on centered text (incorrectly classifying it as left or right justified), and almost inevitably fails to correctly classify fully justified text in the last line of the page.

Distributed Proofreaders suggests buying FineReader 5.0 Pro (the current version is 9.0), since apparently 5.0 Pro does what is necessary. But get the latest that you can. Sprint, Express, or any version other than Pro is not recommended.

You can get a trial version of FineReader Pro that will process "a limited number of pages". There is a utility called FRFGrab which will apparently take the intermediate files that even an expired trial version of FineReader generates, and turn them into text (although the text isn't perfect, since it hasn't been run through text checking). So that's also a possibility.
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
james415
Posts: 13
Joined: 04 Mar 2014, 00:52

Re: ABBYY FineReader

Post by james415 »

Has anyone else tried Adobe Acrobat 9? It is really, really good at converting the scanner output. It is a little pricey, but you may be eligible for a student discount or something.

Cheers,
James
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: ABBYY FineReader

Post by rob »

I tried Acrobat 8, but it doesn't seem as accurate as FineReader, and doesn't flag many problems. On my test document, it seemed to like to replace an opening double quote with II, and it replaced "cotton tunic" with "coucoruntc" because of a small speck in the image -- and then didn't flag it as a problem word :/
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: ABBYY FineReader

Post by spamsickle »

I'm curious why people are doing an OCR step.

Is it to make the text searchable? If Google Books has scanned your books, the text is already searchable, and you can search across a collection of books rather than just inside the book you're reading.

Is it to have text-to-voice software read the book to you while you're driving?

Is it so you can throw the images away, and store only the text of the books (really small storage footprint)?

Is it so you can use a program like iText to convert the text to a PDF, with only a slightly larger storage footprint than text?

Or is it something I haven't considered?

Just curious why this is important to people. My hardcopy books aren't searchable except by the fan & scan method, so I probably won't bother with the OCR step unless storage becomes a real problem.
Cabe
Posts: 34
Joined: 04 Mar 2014, 00:52

Re: ABBYY FineReader

Post by Cabe »

If you have a text version, it will reflow and paginate properly on ebook readers.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: ABBYY FineReader

Post by spamsickle »

Will it re-flow and re-paginate on PDAs too? I thought things like Kindle had their own format and didn't even handle PDFs. As you may be able to tell, I'm not an early adopter...
Cabe
Posts: 34
Joined: 04 Mar 2014, 00:52

Re: ABBYY FineReader

Post by Cabe »

Yep, it's "just text".

For PDAs and phones, Mobipocket is well supported. If you have an iPhone, go for Stanza; Android phone owners should look at FBReader or Aldiko.

I am an early adopter :)
you1
Posts: 92
Joined: 04 Mar 2014, 00:53

Re: ABBYY FineReader

Post by you1 »

spamsickle wrote:I'm curious why people are doing an OCR step.
OCR makes it easier to search and find specific content that you have read.
It is especially useful for technical references, when you can't recall details.
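
As a rough illustration of that kind of lookup, here is a minimal Python sketch that searches OCR output held as one string per page. The function name `search_pages` and the page-list layout are assumptions for the example, not part of any OCR package.

```python
def search_pages(pages, phrase):
    """Return (page_number, line) pairs for every line containing the phrase.

    Hypothetical helper: `pages` is a list of page texts, numbered from 1.
    Matching is case-insensitive and does not span line breaks.
    """
    needle = phrase.casefold()
    hits = []
    for number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            if needle in line.casefold():
                hits.append((number, line.strip()))
    return hits


pages = ["He wore a cotton tunic.\nIt was old.", "Nothing here."]
for number, line in search_pages(pages, "Cotton Tunic"):
    print(f"page {number}: {line}")
```

Note that a search like this only finds what the OCR got right, so unflagged misreads will be missed.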

I'm not sure of the legal issues with scanning a copyrighted book and putting it online to be indexed by web crawlers; however, I can tell you that I would not be happy if I were the author.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: ABBYY FineReader

Post by spamsickle »

Cabe wrote:Yep, it's "just text".

For PDAs and phones, Mobipocket is well supported. If you have an iPhone, go for Stanza; Android phone owners should look at FBReader or Aldiko.

I am an early adopter :)
So when you say it's "just text" do you mean it will re-flow and paginate a text-searchable PDF (with illustrations), or that it will re-flow and paginate any document that's literally "just text"?
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: ABBYY FineReader

Post by spamsickle »

you1 wrote:
spamsickle wrote:I'm curious why people are doing an OCR step.
OCR makes it easier to search and find specific content that you have read.
It is especially useful for technical references, when you can't recall details.

I'm not sure of the legal issues with scanning a copyrighted book and putting it online to be indexed by web crawlers; however, I can tell you that I would not be happy if I were the author.
I think Google Books (and Amazon's "look inside" feature) are sensitive to such authors' concerns. I know Amazon chops pages out of what they'll provide, and I've seen instances in which Google Books would only return the two-line snippet that contained the phrase for which I'd searched, rather than provide even a page's worth of context. Amazon is trying to sell books themselves, so their interests and the author's interests should be pretty similar.