Re: How to convert a book to searchable PDF using open source software
Posted: 14 Nov 2018, 20:35
I performed a small comparison of the free method with the output of Adobe Acrobat (Pro DC 15). It was much trickier than I had expected. The comparison is based on the same book I had previously scanned and processed. This is what I obtained:
1. Speed
_____________ Adobe Acrobat ____ Open software
Combining ____ 00:01:50 ________ 00:00:30
Optimization __ 00:00:30 ________ -
OCR _________ 00:06:59 ________ 00:19:35
ToC __________ 00:06:27 ________ 00:07:18
Metadata _____ 00:02:27 ________ 00:02:27
Total ________ 00:18:13 ________ 00:29:50
Acrobat's OCR was faster, and this accounts for the difference in total time; however, it partly results from more powerful hardware (i5-6300U, 16 GB RAM vs. i5-4210U, 8 GB RAM). The ToC (15 chapters) was created with two different methods. In Acrobat I had to go to each page with a chapter heading, select the heading and add the entries one by one. The free software method required an input text file, so I had to copy the contents of the page containing the ToC and reprocess them into the required format. The Acrobat method turned out to be faster, but for a bigger ToC I would expect the opposite result.
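For anyone who wants to re-check the totals or repeat the comparison with their own timings, here is a minimal Python sketch. The durations are copied from the table above; the `timings` dictionary and the `parse_hms` and `total` helpers are just illustrative names, not part of any tool used in the workflow.

```python
from datetime import timedelta

def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# (Acrobat, open source) durations per step, copied from the table.
# None marks a step that was not performed.
timings = {
    "Combining": ("00:01:50", "00:00:30"),
    "Optimization": ("00:00:30", None),
    "OCR": ("00:06:59", "00:19:35"),
    "ToC": ("00:06:27", "00:07:18"),
    "Metadata": ("00:02:27", "00:02:27"),
}

def total(column):
    """Sum one column of the table, skipping steps that were not performed."""
    durations = (parse_hms(row[column]) for row in timings.values()
                 if row[column] is not None)
    return sum(durations, timedelta())

acrobat_total = total(0)
open_source_total = total(1)
ocr_gap = parse_hms(timings["OCR"][1]) - parse_hms(timings["OCR"][0])

print("Acrobat total:    ", acrobat_total)
print("Open source total:", open_source_total)
print("OCR gap:          ", ocr_gap)
```

The OCR gap comes out slightly larger than the gap between the totals, which is consistent with OCR being the step that decides the overall result.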
2. File sizes
I compared file sizes in two categories: jbig2 lossy and jbig2 lossless modes.
a) jbig2 lossy
Open source method: 5.0 MB (3.2 MB without OCR layer);
Acrobat: 5.9 MB (4.0 MB without OCR layer).
It is worth mentioning that Acrobat also compressed the covers using the MRC method (the front and back covers were combined from 3 images of various DPI to increase the level of compression). As a result, the front cover compressed by Acrobat is only 24.7 kB vs. 62.3 kB with the free software.
b) jbig2 lossless
Open source method: 13.6 MB
Acrobat: 14.4 MB
Acrobat is able to recreate fonts from scanned documents and replace raster images with synthesized vector fonts. In the past this method was called ClearScan; in Acrobat Pro DC 15 it was renamed to "Editable text and images". I also applied it to the book for comparison. The resulting document is 13.7 MB, which surprised me, as I expected a much smaller size. This is due to the large number of different fonts embedded in the file.
3. OCR quality
To compare the number of OCR errors, I copied 10 pages of OCR-ed text created by Acrobat and by Tesseract into a word processor and ran an automatic comparison, which makes all differences directly visible. The sample was randomly chosen (pages 164 – 174, 3937 words, 26074 characters). I do not know whether it is fully representative of the whole book, but it seems to be a fairly typical part. Afterwards I counted the errors made by Acrobat (A) and Tesseract (T) and classified them into categories, as some are more inconvenient and annoying than others.
a) wrong letters – A: 12; T: 5
b) added or omitted diacritics – A: 2; T: 6
c) misrecognized superscripts (upper indexes) – A: 4; T: 9
d) joined words – A: 0; T: 8
I ignored punctuation errors.
My conclusion is that Tesseract made more mistakes (28 in total vs. 18), but they are less “critical” than those made by Acrobat.
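I did the comparison in a word processor, but the same kind of side-by-side check could probably be scripted. Below is a rough Python sketch using the standard `difflib` module instead of the word processor; it assumes the two OCR outputs have already been saved as plain text, and `word_differences` is only an illustrative helper name. It lists the places where the two texts disagree; deciding which engine is wrong and which category the error belongs to still has to be done by hand.

```python
import difflib

def word_differences(text_a, text_b):
    """Return (tag, words_from_a, words_from_b) for every span where the texts differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False keeps frequent words from being treated as junk in long texts.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag,
                                " ".join(words_a[a_start:a_end]),
                                " ".join(words_b[b_start:b_end])))
    return differences

# Tiny made-up example: one wrong letter and one pair of joined words.
acrobat_text = "the quick brovvn fox lives in the house"
tesseract_text = "the quick brown fox lives inthe house"

for tag, from_a, from_b in word_differences(acrobat_text, tesseract_text):
    print(f"{tag}: A={from_a!r} T={from_b!r}")
```

Comparing word by word rather than character by character keeps joined words visible as a single difference, which matches category d) above.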
As I use books in PDF format as a source for TTS, I also compared how they are read by Moon Reader Pro + Ivona TTS engine on my Android phone. The main problem was erroneous additional paragraph breaks in random places, which make listening less fluent and comfortable. In the sample, the free software produced 21 such false breaks, whereas Acrobat inserted 48 of them, a significant difference. I also checked how the ClearScan document behaves. The false breaks disappeared, but additional spaces showed up in the middle of words. I counted 58 such unwanted spaces in the sample, and they are definitely more annoying than the paragraph breaks mentioned above.
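I counted the false breaks by listening and reading, but a rough automatic estimate might be possible on the extracted text. The Python sketch below is only a heuristic under my own assumptions: paragraphs are separated by blank lines, and a break is "suspect" if the paragraph before it does not end with sentence-ending punctuation. Headings and list items would be counted as false positives, so the number is a starting point, not a replacement for manual checking. `count_suspect_breaks` is an illustrative name.

```python
import re

# A paragraph "ends properly" if it finishes with sentence punctuation,
# optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r'[.!?:…]["”’)\]]*$')

def count_suspect_breaks(text):
    """Count paragraph breaks that follow a paragraph not ending a sentence."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # The last paragraph has no break after it, so it is not examined.
    return sum(1 for p in paragraphs[:-1] if not SENTENCE_END.search(p))

sample = (
    "First sentence ends here.\n\n"
    "This paragraph is cut\n\n"
    "in the middle of a sentence.\n\n"
    "Last one."
)
print(count_suspect_breaks(sample))
```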