
Re: How to convert a book to searchable pdf using open source software

Posted: 14 Nov 2018, 20:35
by zbgns
I did a small comparison of the free method against the output of Adobe Acrobat (Pro DC 15). It was much trickier than I had expected. The comparison is based on the same book I previously scanned and processed. This is what I obtained:

1. Speed
(hh:mm:ss)      Adobe Acrobat   Open source
Combining       00:01:50        00:00:30
Optimization    00:00:30        -
OCR             00:06:59        00:19:35
ToC             00:06:27        00:07:18
Metadata        00:02:27        00:02:27
Total           00:18:13        00:29:50

Acrobat's OCR was faster and that made the difference, although it partly results from more powerful hardware (i5-6300U, 16 GB RAM vs. i5-4210U, 8 GB RAM). As regards creating the ToC (15 chapters), two different methods were used. In Acrobat I had to go to each page with a chapter heading, select the heading and add the bookmark, one by one. The free software method required preparing an input text file, so I needed to copy the contents of the page containing the ToC and reprocess it into the required format. The Acrobat method turned out to be faster, but for a bigger ToC I would expect the opposite result.
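In case anyone wants to reproduce the reprocessing step, this is roughly the kind of script I mean. It is only a sketch: it assumes one 'chapter title<TAB>page number' pair per input line and pdftk's update_info bookmark format; the file names are examples and the guide itself may use a different tool.

[code]
import sys

# Difference between the page numbers printed in the book and the physical
# page numbers of the pdf (cover, front matter, etc.).
OFFSET = 0

# Read "title<TAB>page" lines and emit pdftk bookmark records.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    title, page = line.rsplit("\t", 1)
    print("BookmarkBegin")
    print(f"BookmarkTitle: {title}")
    print("BookmarkLevel: 1")
    print(f"BookmarkPageNumber: {int(page) + OFFSET}")
[/code]

The output would then be applied with something like: pdftk book.pdf update_info bookmarks.txt output book_with_toc.pdf (again, only an example invocation).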

2. File sizes

I compared the file sizes in two categories: the efficiency of the jbig2 lossy mode and of the jbig2 lossless mode.
a) jbig2 lossy

Open source method: 5.0 MB (3.2 MB without OCR layer);
Acrobat: 5.9 MB (4.0 MB without OCR layer).

It is worth mentioning that Acrobat also compressed the covers using the MRC method (the front and back covers were each assembled from 3 images of different DPI to increase the level of compression). As a result, the front cover compressed by Acrobat is only 24.7 kB in size vs. 62.3 kB in the case of the free software.

b) jbig2 lossless

Open source method: 13.6 MB
Acrobat: 14.4 MB
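(For context, I believe the lossy figures correspond to jbig2enc's symbol coding; a rough sketch of that kind of invocation below, with made-up file names. The lossless numbers come from a different encoder mode, and the actual pipeline is the one described in the guide, so treat this only as an illustration.)

[code]
import glob
import subprocess

# 1-bit page images from the scanning/cleanup step (example names).
pages = sorted(glob.glob("page_*.pbm"))

# Symbol coding (-s) is the lossy part: visually similar glyphs are merged and
# share a single bitmap, which is where the large size reduction comes from.
# -p writes PDF-ready fragments for jbig2enc's bundled pdf.py script.
subprocess.run(["jbig2", "-s", "-p"] + pages, check=True)

# Wrap the encoder output (default basename "output") into a pdf.
subprocess.run("python pdf.py output > book_lossy.pdf", shell=True, check=True)
[/code]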

Acrobat is also capable of recreating fonts on the basis of the scanned document and replacing the rasterized text with synthesized vector fonts.
In the past this method was called ClearScan; in Acrobat DC Pro 15 it was renamed to "Editable text and images". I applied it to the book as well for comparison. The document I obtained is 13.7 MB, which surprised me, as I had expected a much smaller size. It is due to the large number of different fonts embedded in the file.

3. OCR quality

In order to compare the number of OCR errors, I copied 10 pages of OCR-ed text produced by Acrobat and by Tesseract into a word processor and ran an automatic comparison, so all the differences were directly visible.
The sample was chosen at random (pages 164–174, 3937 words, 26074 characters). I do not know whether it is fully representative of the whole book, but it seems a fairly typical part. Afterwards I counted the errors made by Acrobat and by Tesseract. I classified them into categories, as some kinds are more inconvenient and annoying than others.
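(If anyone wants to repeat the comparison without a word processor, it can be scripted as well; a small sketch using Python's difflib, assuming the two OCR outputs were saved as plain text files with made-up names:)

[code]
import difflib

# The two OCR outputs for the same pages, saved as plain text (example names).
with open("acrobat_sample.txt", encoding="utf-8") as f:
    acrobat = f.read().split()
with open("tesseract_sample.txt", encoding="utf-8") as f:
    tesseract = f.read().split()

# Compare word by word and print only the places where the two engines disagree.
matcher = difflib.SequenceMatcher(None, acrobat, tesseract)
for tag, a0, a1, b0, b1 in matcher.get_opcodes():
    if tag != "equal":
        print(f"{tag:>7}: {' '.join(acrobat[a0:a1])!r}  vs  {' '.join(tesseract[b0:b1])!r}")
[/code]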

a) wrong letters – A: 12; T: 5
b) added or omitted diacritics – A: 2; T: 6
c) misrecognized superscripts – A: 4; T: 9
d) joined words – A: 0; T: 8
I ignored punctuation errors.

My conclusion is that Tesseract made a larger number of mistakes (28 in total vs. 18), but they are less “critical” than those in the result obtained by Acrobat.

As I use books in pdf format as a source for TTS, I also compared how they are read by Moon Reader Pro + the Ivona TTS engine on my Android phone. The main problem was erroneous additional paragraph breaks in random places, which make listening less fluent and comfortable. In the sample part the free software produced 21 such false breaks, whereas Acrobat inserted 48 of them, which is a significant difference. I also checked how the ClearScan document behaves. The false breaks disappeared, but extra spaces in the middle of words showed up instead. I counted 58 such unwanted spaces in the sample part, and they are definitely more annoying than the paragraph breaks mentioned above.
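(Counting the false breaks can be automated too; a crude sketch below, assuming the text layer was exported to a plain text file, e.g. with pdftotext, and treating a paragraph break that does not follow sentence-ending punctuation as suspicious. The heuristic and the file name are mine, not part of the workflow above.)

[code]
import re

# Text layer exported from the pdf (example name), e.g. with pdftotext.
with open("book_text_layer.txt", encoding="utf-8") as f:
    text = f.read()

# A paragraph break (blank line) right after a character that is not
# sentence-ending punctuation is most likely a false break added by OCR.
suspects = re.findall(r"[^.!?:\s]\s*\n\s*\n", text)
print(f"suspect paragraph breaks: {len(suspects)}")
[/code]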

Re: How to convert a book to searchable pdf using open source software

Posted: 18 Mar 2019, 05:11
by babul
Hey, great guide!

I have a question though: is it possible to somehow correct OCR output from Tesseract other than by editing the pdf in Acrobat (I don't feel like paying monthly for that)? I've been trying to do OCR with gImageReader on Linux. It uses Tesseract and you can preview the txt file, but to actually correct anything you need to edit the hOCR file directly, in the application or in some text editor, which is kinda troublesome due to navigating between the html tags, though it can be made easier with Vim or something.
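For simple batch fixes I can script it, something like the sketch below (file name and the example typo are invented), but proofreading word by word is still painful:

[code]
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# hOCR file exported by gImageReader/Tesseract (invented name).
with open("page_0001.hocr", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Each recognized word sits in <span class="ocrx_word" title="bbox ...">word</span>;
# only the text is touched, the bounding boxes and confidences stay as they are.
for span in soup.find_all("span", class_="ocrx_word"):
    if span.string and span.string.strip() == "tbe":   # invented example typo
        span.string.replace_with("the")

with open("page_0001.fixed.hocr", "w", encoding="utf-8") as f:
    f.write(str(soup))
[/code]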

If errors can happen in neatly prepared pdfs, then much more correcting is needed when you try to convert some typewriter stuff. I usually remake those in LaTeX, so copying the plain text and correcting it is fine, but I wonder if you can correct it in some other way than editing the hOCR file.

Re: How to convert a book to searchable pdf using open source software

Posted: 18 Mar 2019, 22:29
by zbgns
babul wrote:
18 Mar 2019, 05:11
I have a question though: is it possible to somehow correct OCR output from Tesseract other than by editing the pdf in Acrobat (I don't feel like paying monthly for that)? I've been trying to do OCR with gImageReader on Linux. It uses Tesseract and you can preview the txt file, but to actually correct anything you need to edit the hOCR file directly, in the application or in some text editor, which is kinda troublesome due to navigating between the html tags, though it can be made easier with Vim or something.
Please note that the method described does not rely on hOCR created by Tesseract (or gImageReader) and incorporated into a pdf as the 'text' layer. It uses the kind of 'generic' text-only pdf that Tesseract is capable of creating. I'm using this mode instead of hOCR because it provides better (almost perfect) positioning of the text: after everything is joined together, the text sits in the right place under the graphics layer. This precision is not available when the hOCR-based method is used. But the most important issue for me is that the method described in this thread gives pdf files which work well as sources for TTS. hOCR-based pdfs are not useful for this at all (correct me if I'm wrong).
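To make it concrete, this is the kind of invocation I mean; a minimal sketch with example file names (just the bare minimum, the full set of options used in the guide may differ):

[code]
import subprocess

# Produce a pdf that contains only the invisible recognized text, positioned
# where the words sit in the scan; textonly_pdf=1 leaves the page images out,
# so the result can later be merged under the graphics layer.
subprocess.run(
    ["tesseract", "scan_pages.tif", "text_layer",   # example input / output base name
     "-l", "eng", "-c", "textonly_pdf=1", "pdf"],
    check=True,
)
# result: text_layer.pdf
[/code]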
babul wrote:
18 Mar 2019, 05:11
If errors can happen in neatly prepared pdfs, then much more correcting is needed when you try to convert some typewriter stuff. I usually remake those in LaTeX, so copying the plain text and correcting it is fine, but I wonder if you can correct it in some other way than editing the hOCR file.
When it comes to proofreading and typo corrections, I would say it is hardly possible to do this without tools like Acrobat. I'm not 100% sure, but it seems that e.g. Master PDF Editor or Qoppa PDF Studio can also do it; they are cheaper than Acrobat and have Linux versions. The hardcore option is to decompress the text-only pdf output from Tesseract and edit the source code of the pdf directly. I tried it only once (with a good result, by the way, as it was a batch change of one sequence of letters that was consistently misrecognized), because it is too extreme for me. I can accept some amount of OCR errors in pdfs, as full proofreading is very time consuming and isn't worth the effort. Tesseract is a very good OCR tool, so if everything is done correctly (some experience is necessary), recognition errors are relatively rare and not very annoying.
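In case somebody wants to try that hardcore route anyway, it goes roughly like the sketch below. The file names and the replacement are invented, and whether the strings in the page streams are plainly readable depends on the fonts and encoding in the particular file.

[code]
import subprocess

# 1. Expand the pdf into QDF form: streams are uncompressed and laid out so the
#    file can be edited in a text editor (object numbers stay stable).
subprocess.run(["qpdf", "--qdf", "--object-streams=disable",
                "book.pdf", "book.qdf.pdf"], check=True)       # example file names

# 2. Batch-replace one consistently misrecognized sequence, but only on lines
#    that show text (Tj/TJ operators), so image data is left alone.
with open("book.qdf.pdf", "rb") as fin, open("book.fixed.pdf", "wb") as fout:
    for line in fin:
        if line.rstrip().endswith((b"Tj", b"TJ")):
            line = line.replace(b"corpse", b"course")          # invented example fix
        fout.write(line)

# 3. fix-qdf (shipped with qpdf) repairs the stream lengths changed by the edit.
subprocess.run("fix-qdf book.fixed.pdf > book.final.pdf", shell=True, check=True)
[/code]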

However, if you need a perfect-looking document, you may try to take the plain text and recreate the layout in e.g. MS Word or LO Writer, correcting the typos on that occasion as well. Afterwards you can save it as a pdf or any other kind of file (e.g. epub). I also tried this, but I find it too laborious under normal circumstances.