ABBYY Fine Reader and I

BillGill
Posts: 74
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

ABBYY Fine Reader and I

Post by BillGill » 18 Sep 2017, 09:53

I switched to Fine Reader when I got my new Win10 computer a while back, and since then I have been having problems with formatting. I use Fine Reader just for the OCR part, then do my final editing in Open Office Writer. Fine Reader can save the document as an OO file (*.odt), but when I do that the formatting of the file winds up in terrible shape. I started by letting it save the file and then launch OO. That turns out to be the worst way as far as I can see. Saving it and reopening it manually in OO doesn't work much better.

Then I tried saving as an RTF file. That is finally working. However, I can't have Fine Reader launch OO when I save the file, because if I do, the Fine Reader formatting overrides the OO default formatting. The biggest thing is that the text winds up in what appears to be the Fine Reader default font, DejaVu Sans, even though I have set Fine Reader to use Times New Roman. It also uses the OO default paragraph format, which I have overridden with my own default paragraph format. However, I have finally worked out a procedure that works for me.

1. Load and recognize the document.
2. Save the document in the RTF format, but don't automatically launch Open Office.
3. Launch Open Office and create a new text document.
4. Using the Insert > File menu item, insert the RTF document into the new document.
5. Save the document as an .odt file.

After that I have a good clean document with my default formatting and no weird formats showing up in it.
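For anyone who wants to script the RTF-to-ODT step instead of clicking through it, here is a minimal sketch. It assumes LibreOffice rather than Open Office, with its `soffice` program on the PATH, and it uses LibreOffice's headless `--convert-to` mode. The file and folder names are made up for illustration. Note that this is not the same as the Insert > File route above: a headless conversion keeps whatever formatting is in the RTF and does not pick up your own default paragraph style, so it may or may not give as clean a result.

```python
import shlex
from pathlib import Path


def build_convert_command(rtf_path, out_dir, soffice="soffice"):
    """Return the argument list for a headless RTF -> ODT conversion.

    Assumption: `soffice` is LibreOffice's launcher and is on the PATH.
    """
    return [
        soffice,
        "--headless",            # run without opening any window
        "--convert-to", "odt",   # target format
        "--outdir", str(Path(out_dir)),
        str(Path(rtf_path)),
    ]


# Hypothetical file names, only to show the resulting command line.
cmd = build_convert_command("book.rtf", "converted")
print(shlex.join(cmd))
```

To actually run the conversion, the list can be handed to `subprocess.run(cmd, check=True)`.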

There may be some tips that would simplify all of this. If anybody has any ideas I would be glad to hear them.

Bill

L.Willms
Posts: 85
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY Fine Reader and I

Post by L.Willms » 02 Jan 2018, 03:24

Which version of FineReader are you using? I am currently using version 12 (I missed the Black Friday/Cyber Monday offer of 100 Euro for version 14).

In earlier versions of FineReader I also had problems with export to OpenOffice, where FineReader set page margins and other formatting which I could not change in Open Office Writer.

Then I found out that FineReader preserves italics and bold when writing as "unformatted text".

So I either output to PDF as "exact copy" (I'll try Dejavu) or as "unformatted text" to MS Word. Since most OCRed scans are intended to be published as HTML on the Web, I am developing in VBA my own procedure to export those texts as simple HTML tuned for my CSS formatting.

BillGill
Posts: 74
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: ABBYY Fine Reader and I

Post by BillGill » 02 Jan 2018, 10:37

I am using version 14.

I am saving the files as unformatted text in RTF format. That seems to give me the best results.

I do my scans to make EPUB books out of books in my library. That is much the same as your creating HTML files, since EPUB is essentially XHTML with all the files in the document compressed into a single file.
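To illustrate the "XHTML files compressed into a single file" point, here is a small sketch using Python's standard `zipfile` module. It builds a ZIP archive laid out like an EPUB container and lists what is inside. The chapter file name and its content are invented for the example, and the result is deliberately not a complete EPUB: a real one also needs a `META-INF/container.xml` and a package (OPF) file, which are left out here.

```python
import io
import zipfile


def build_minimal_container(xhtml):
    """Return the bytes of a ZIP archive laid out like an EPUB container.

    This is an illustration only, not a valid EPUB.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # In an EPUB the mimetype entry comes first and is stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # The book text itself is ordinary XHTML, compressed like any ZIP member.
        archive.writestr("OEBPS/chapter1.xhtml", xhtml,
                         compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def list_members(data):
    """List the file names stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# Hypothetical one-paragraph chapter.
data = build_minimal_container("<html><body><p>Chapter 1</p></body></html>")
print(list_members(data))
```

The same `zipfile.ZipFile(...).namelist()` call works on an existing .epub file if you pass its path instead of an in-memory buffer.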

After I get the files into OpenOffice I still have to do quite a lot of editing. There are many errors. I'm not sure where they take place, but a lot of text winds up as bold or italic when it shouldn't be. There are also a lot of standard errors: "Modem" for "Modern", for example. Any place where there is an "rn" sequence of letters may wind up as "m". And of course '1' for 'I'. And watch for lower case h and lower case b. They frequently get mixed up.
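A rough sketch of how these standard errors could be hunted down automatically: for each word, undo one of the confusions listed above ("m" back to "rn", "1" back to "I", "h" and "b" swapped) and see whether the result is a known word. The word list here is a tiny made-up stand-in; a real run would need a proper dictionary. Since "Modem" is itself a real word, the output is only a list of suspects to check by eye, not a list of corrections to apply blindly.

```python
import re

# Confusion pairs from the post: (what the OCR produced, what was likely printed).
CONFUSIONS = [("m", "rn"), ("1", "I"), ("h", "b"), ("b", "h")]


def candidate_corrections(word, known_words):
    """Return known words reachable from `word` by undoing one confusion."""
    found = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate.lower() in known_words:
                found.add(candidate)
            start = word.find(wrong, start + 1)
    return sorted(found)


def flag_suspects(text, known_words):
    """Map each suspicious word in `text` to its possible intended readings."""
    suspects = {}
    for word in re.findall(r"[A-Za-z0-9]+", text):
        corrections = candidate_corrections(word, known_words)
        if corrections:
            suspects[word] = corrections
    return suspects


# Hypothetical, deliberately tiny word list (lower case).
known = {"modern", "i", "the", "book", "have"}
print(flag_suspects("The Modem hook 1 have", known))
```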

I use Calibre to convert from OpenOffice to EPUB. It can also be used to convert it to HTML. For that matter OpenOffice can output its files directly as HTML, or EPUB, or PDF. You might want to try that; it might do a good enough job for what you are doing.
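The Calibre conversion can also be run from the command line with its `ebook-convert` program, which is handy when there is a batch of files. Below is a minimal sketch that only builds the command; it assumes Calibre is installed with `ebook-convert` on the PATH, and the file name, title, and author are invented placeholders.

```python
def build_ebook_convert_command(odt_path, epub_path, title=None, authors=None):
    """Return the argument list for converting an ODT file to EPUB with Calibre."""
    cmd = ["ebook-convert", str(odt_path), str(epub_path)]
    # Metadata options are optional; add them only when given.
    if title:
        cmd += ["--title", title]
    if authors:
        cmd += ["--authors", authors]
    return cmd


# Hypothetical book, for illustration only.
convert_cmd = build_ebook_convert_command(
    "book.odt", "book.epub", title="Some Title", authors="Some Author"
)
print(convert_cmd)
```

Passing the list to `subprocess.run(convert_cmd, check=True)` would carry out the conversion.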

MS Word does provide capabilities to save your work as HTML. I haven't tried it, because in the past Word was notorious for producing HTML files that were mangled for anything other than Microsoft's reader. And of course I don't have Word now that I have Win10. I don't want to rent my software; I want to own it.

I use Calibre because it has facilities for managing all my books, as well as editing the EPUB files. I find that even after I have done all the editing in OpenOffice I still have to do some more in Calibre. And the formatting can wind up kind of strange.

Bill

dpc
Posts: 264
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: ABBYY Fine Reader and I

Post by dpc » 11 Jan 2018, 14:09

I too began my quest to scan all the books from my personal library as ePub. I soon discovered the same issue with ABBYY that you did: it takes a LOT of hand tweaking to correct all of the OCR and formatting mistakes. I finally gave up and decided to save the books as PDF using Adobe Acrobat. Sure, the storage requirements are more than ePub, but storage is cheap. Every device that I know of has a PDF reader, and the format is well understood, as there are many open source tools available to read/write the files' contents.

You can still purchase Office 2016 for your PC. That's what I have on my Win10 machines and they're still receiving updates from MS.

BillGill
Posts: 74
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: ABBYY Fine Reader and I

Post by BillGill » 12 Jan 2018, 10:29

I prefer epub because I can easily buy books in that format, or at least formats that can easily be converted to epub. I am not scanning all of my books, just the ones that I can't get otherwise. It is pretty slow going doing the ones I am doing, and it isn't really a lot of fun. It does keep me off the streets. I sure don't want to start on all the others; that would be a lifetime job.

I may have to look at getting a one-time purchase of MS Office. Open Office is a nice program, and is of course free, but it does have some quirks. But right now I am working with an author to digitize some of her works, and she uses Open Office, so I can't just throw it out.

Bill

dpc
Posts: 264
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: ABBYY Fine Reader and I

Post by dpc » 19 Jan 2018, 11:45

If you're not aware of mobileread.com, you might want to check out the forums there. Lots of tips and info on the various software tools for creating epub files. More here.

L.Willms
Posts: 85
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY Fine Reader and I

Post by L.Willms » 05 Mar 2018, 03:09

BillGill wrote (02 Jan 2018, 10:37): "I use Calibre to convert from OpenOffice to EPUB."

Do you know "Writer2ePub" by Luca Calcinai? http://writer2epub.it/en/

It installs into OpenOffice or LibreOffice Writer and is called from icons in the toolbar.

I consider this to be a very good solution.
