
How to convert a book to searchable pdf using open source software

Share your software workflow. Write up your tips and tricks on how to scan, digitize, OCR, and bind ebooks.
zbgns
Posts: 47
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns » 14 Nov 2018, 20:35

I performed a small comparison of the free method against Adobe Acrobat (Pro DC 15) output. It was much trickier than I had expected. The comparison is based on the same book I previously scanned and processed. This is what I obtained:

1. Speed
               Adobe Acrobat    Open software
Combining      00:01:50         00:00:30
Optimization   00:00:30         -
OCR            00:06:59         00:19:35
ToC            00:06:27         00:07:18
Metadata       00:02:27         00:02:27
Total          00:18:13         00:29:50

Acrobat's OCR was faster, and this made the difference; however, it partly results from more powerful hardware (i5-6300U, 16 GB RAM vs. i5-4210U, 8 GB RAM). As regards creating the ToC (15 chapters), two different methods were used. In Acrobat I had to go to each page with a chapter heading, select it, and add the entries one by one. The free-software method required preparing an input text file, so I copied the contents of the page containing the ToC and reworked it into the required format. The Acrobat method turned out to be faster, but for a bigger ToC I would expect the opposite result.
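For reference, this is roughly what such an input file can look like. As one free possibility (not necessarily the exact tool from my write-up earlier in the thread), pdftk can import bookmarks from a plain text file; the titles, page numbers and file names below are only placeholders:

BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 9
BookmarkBegin
BookmarkTitle: Chapter 2
BookmarkLevel: 1
BookmarkPageNumber: 27

pdftk book.pdf update_info toc.txt output book_with_toc.pdf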

2. File sizes

I compared the file sizes in two categories: the efficiency of jbig2 lossy mode and of jbig2 lossless mode.
a) jbig2 lossy

Open source method: 5.0 MB (3.2 MB without OCR layer);
Acrobat: 5.9 MB (4.0 MB without OCR layer).

It is worth mentioning that Acrobat also compressed the covers using the MRC method (the front and back covers were each combined from 3 images of various DPI to increase the level of compression). As a result, the front cover compressed by Acrobat is only 24.7 kB in size vs. 62.3 kB in the case of the free software.

b) jbig2 lossless

Open source method: 13.6 MB
Acrobat: 14.4 MB
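For anyone who wants to reproduce the open-source numbers, here is a minimal sketch of both modes, assuming agl's jbig2enc (the jbig2 binary plus its bundled pdf.py script); file names are placeholders, and the jbig2enc README has the exact details:

# lossy: -s enables symbol coding, -p produces PDF-ready output
jbig2 -s -p -b lossy page_*.tif
python pdf.py lossy > book_lossy.pdf

# lossless: without -s the generic coder is used, which reproduces pages exactly
jbig2 -p -b lossless page_*.tif
python pdf.py lossless > book_lossless.pdf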

Acrobat is also capable of recreating fonts from a scanned document and replacing the raster glyphs with synthesized vector fonts.
In the past this method was called ClearScan; in Acrobat Pro DC 15 it was renamed to "Editable text and images". I applied it to the book as well for comparison. The document I obtained is 13.7 MB, which surprised me, as I expected a much smaller size. The size is due to the large number of different fonts embedded in the file.

3. OCR quality

In order to compare the number of OCR errors, I copied 10 pages of OCR-ed text created by Acrobat and by Tesseract into a word processor and ran an automatic comparison, which makes all differences directly visible.
The sample was chosen at random (pages 164 – 174, 3937 words, 26074 characters). I do not know whether it is fully representative of the whole book, but it seems a quite typical part. Afterwards I counted the errors made by Acrobat and by Tesseract and classified them into categories, since some are more and some less inconvenient and annoying.

a) wrong letters – A: 12; T: 5
b) added or omitted diacritics – A: 2; T: 6
c) misrecognized superscripts – A: 4; T: 9
d) joined words – A: 0; T: 8
I ignored punctuation errors.

My conclusion is that Tesseract made a larger number of mistakes (28 in total vs. 18), but they are less "critical" than those in the result obtained from Acrobat.
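If somebody wants to repeat such a count without a word processor, GNU wdiff can mark and tally word-level differences between the two text dumps (file names are placeholders):

# show every differing word between the two OCR outputs
wdiff acrobat.txt tesseract.txt | less
# or just print match statistics for a rough error-rate estimate
wdiff --statistics acrobat.txt tesseract.txt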

As I use books in pdf format as a source for TTS, I also compared how they are read by Moon Reader Pro with the Ivona TTS engine on my Android phone. The main problem is erroneous additional paragraph breaks in random places, which make listening less fluent and comfortable. In the sample part, the free software produced 21 such false breaks, whereas Acrobat inserted 48 of them; a significant difference. I also checked how the ClearScan document behaves. The false breaks disappeared, but additional spaces in the middle of words showed up. I counted 58 such unwanted spaces in the sample part, and they are definitely more annoying than the paragraph breaks mentioned above.

babul
Posts: 3
Joined: 14 Jul 2018, 20:17
Number of books owned: 0
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by babul » 18 Mar 2019, 05:11

Hey, great guide!

I have a question though: is it possible to somehow correct OCR output from Tesseract other than by editing the pdf in Acrobat (I don't feel like paying monthly for that)? I've been trying to do OCR with gImageReader on Linux. It uses Tesseract and you can preview the txt file, but to actually correct it you need to edit the hOCR file directly in the application or in some text editor, which is rather troublesome due to navigating between the HTML tags, though it can be made easier with Vim or something similar.

If errors can happen in neatly prepared pdfs, then much more correcting is needed when you try to convert typewriter material. I usually remake those in LaTeX, so copying the plain text and correcting it is fine, but I wonder whether it can be corrected in some way other than editing the hOCR file.

zbgns
Posts: 47
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns » 18 Mar 2019, 22:29

babul wrote:
18 Mar 2019, 05:11
I have a question though: is it possible to somehow correct OCR output from Tesseract other than by editing the pdf in Acrobat (I don't feel like paying monthly for that)? I've been trying to do OCR with gImageReader on Linux. It uses Tesseract and you can preview the txt file, but to actually correct it you need to edit the hOCR file directly in the application or in some text editor, which is rather troublesome due to navigating between the HTML tags, though it can be made easier with Vim or something similar.
Please note that the method described does not rely on hOCR created by Tesseract (or gImageReader) and incorporated into a pdf as the 'text' layer. It uses a kind of 'generic' text-only pdf that Tesseract is capable of creating. I'm using this mode instead of hOCR because it provides better (almost perfect) positioning of the text: after joining everything together, the text sits in the right place under the graphics layer. This precision is not available with the hOCR-based method. But the most important point for me is that the method described in this thread gives pdf files which work well as sources for TTS; hOCR-based pdfs are not useful for this at all (correct me if I'm wrong).
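A minimal sketch of that text-only mode, assuming Tesseract 4+ and qpdf (file names are placeholders; the full workflow is described earlier in the thread):

# render an invisible, text-only pdf layer (no image data)
tesseract pages.tif textlayer -c textonly_pdf=1 pdf
# combine it with the graphics-only pdf; the text stays invisible but selectable
qpdf graphics.pdf --overlay textlayer.pdf -- searchable.pdf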
babul wrote:
18 Mar 2019, 05:11
If errors can happen in neatly prepared pdfs, then much more correcting is needed when you try to convert typewriter material. I usually remake those in LaTeX, so copying the plain text and correcting it is fine, but I wonder whether it can be corrected in some way other than editing the hOCR file.
When it comes to proofreading and typo corrections, I would say it is hardly possible to do this without tools like Acrobat. I'm not 100% sure, but it seems that e.g. Master PDF Editor or Qoppa PDF Studio can also do it; they are cheaper than Acrobat and have Linux versions. The hardcore option is to decompress the text-only Tesseract output and edit the source code of the pdf itself. I tried it only once (with a good result, by the way, as it was a batch change of one sequence of letters that had been consistently misrecognized), but it was too extreme for me to do regularly. I'm able to accept some amount of OCR errors in pdfs, as full proofreading is very time-consuming and not worth the effort. Tesseract is a very good OCR tool, so if everything is done correctly (some experience is necessary), recognition errors are relatively rare and not very annoying.
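If you want to try that hardcore route, qpdf can decompress a pdf into an editable form and repair it afterwards. This is only a sketch, and a blind batch replacement can obviously also hit non-text data, so keep a backup:

# rewrite the pdf with uncompressed, human-readable streams
qpdf --qdf --object-streams=disable book.pdf book_qdf.pdf
# example batch fix of one consistently misrecognized sequence
sed -i 's/cornputer/computer/g' book_qdf.pdf
# repair the stream lengths and offsets that the edit may have broken
fix-qdf book_qdf.pdf > book_fixed.pdf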

However, if you need a perfect-looking document, you may try taking the plain text and recreating the layout in e.g. MS Word or LO Writer, correcting the typos on the same occasion. Afterwards you may save it as pdf or any other kind of file (e.g. epub). I tried this as well, but find it too laborious under normal conditions.

i8dcI32QWRfiwVj
Posts: 5
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Re: How to convert a book to searchable pdf using open source software

Post by i8dcI32QWRfiwVj » 05 Aug 2019, 11:57

dtic wrote:
26 Oct 2018, 12:34
zbgns wrote:
25 Oct 2018, 11:24
dtic wrote:
25 Oct 2018, 04:28
First crop all images to include only the book page. My https://github.com/nod5/BookCrop is a quick but rough Windows tool for that, but there are other methods out there.
Scan Tailor Advanced also offers this crop function: 'Page Box' at the 'Select Content' stage. Unfortunately it is not able to fully solve the problem. First, the position of the pages changes as I try to keep the book in the middle (phone camera lenses produce less distortion in the middle than at the edges; it is also due to the construction of my scanner). The second issue is that there are usually very small mistakes, such as an omitted page number or a selected content area slightly bigger than necessary. It is difficult to eliminate this problem by cropping pages.
Yes, BookCrop only works well if all pages are at roughly the same position in the photos. One solution is to have some part of the scanner setup that holds the book cover/spine in a fixed position when shooting. Otherwise some other, smarter tool that reliably detects the page edges in the photo is needed. I use a python/OpenCV script for that situation which produces around one error per 200 pages, but it is very sensitive to the specific lighting, background and scanner setup so I haven't released it. I hope someone much better at OpenCV than me will one day release a general tool that can very reliably crop whole pages from a wide range of book page photos.
zbgns wrote:
25 Oct 2018, 11:24
I'm aware of the Scan Tailor CLI mode, but I haven't even tested it. The main reason is that I like to have control over the process and to check that there are no errors, even if it takes some time.
If you do find a way to first successfully crop all the whole pages, then in my experience there won't ever be any errors at all when you then use Scan Tailor Enhanced CLI processing simply to binarize images with only text or black-and-white graphs. Besides, you can just wait for all pages to finish, then review the thumbnails and redo only those images that have errors, if any.
Does anyone know of useful software for cropping scans that runs on macOS?

zbgns
Posts: 47
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns » 05 Aug 2019, 15:42

i8dcI32QWRfiwVj wrote:
05 Aug 2019, 11:57
Does anyone know of useful software for cropping scans that runs on macOS?
I think this thread more or less answers your question: viewtopic.php?f=24&p=21785&sid=0ab34fa7 ... f21#p21785

In short, I would recommend Scan Tailor Advanced. It is able to crop scans and runs on macOS.

i8dcI32QWRfiwVj
Posts: 5
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Re: How to convert a book to searchable pdf using open source software

Post by i8dcI32QWRfiwVj » 06 Aug 2019, 07:41

zbgns wrote:
05 Aug 2019, 15:42
i8dcI32QWRfiwVj wrote:
05 Aug 2019, 11:57
Does anyone know of useful software for cropping scans that runs on macOS?
I think this thread more or less answers your question: viewtopic.php?f=24&p=21785&sid=0ab34fa7 ... f21#p21785

In short, I would recommend Scan Tailor Advanced. It is able to crop scans and runs on macOS.
Thanks! I am already using Scantailor for cropping, but I am looking for software that would get rid of excess white space around the text, and of fingers, so I can clean up my scans before I start working with Scantailor.

zbgns
Posts: 47
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns » 06 Aug 2019, 12:38

i8dcI32QWRfiwVj wrote:
06 Aug 2019, 07:41
Thanks! I am already using Scantailor for cropping, but I am looking for software that would get rid of excess white space around the text, and of fingers, so I can clean up my scans before I start working with Scantailor.
I use Scan Tailor Advanced for this, i.e. I manually select the area in which STA automatically looks for the desired content. It works very well provided the pages are in the same position. I also find it more efficient to use one tool for this instead of several. But it depends on your specific needs and your workflow. If you need to crop pages before proceeding with Scan Tailor, you may give ImageMagick a try, especially if the pages are in a stable position in all the pictures. Otherwise it may be necessary to look for something more advanced, for example this: viewtopic.php?f=24&p=21785#p21785
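A minimal ImageMagick sketch for that stable-position case; the crop geometry (width x height + x-offset + y-offset, in pixels) is just a placeholder you would measure on one sample image:

mkdir cropped
# crop every photo to the same fixed page box and discard the virtual canvas
mogrify -path cropped -crop 2400x3300+450+300 +repage *.jpg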

i8dcI32QWRfiwVj
Posts: 5
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Re: How to convert a book to searchable pdf using open source software

Post by i8dcI32QWRfiwVj » 07 Aug 2019, 03:59

zbgns wrote:
06 Aug 2019, 12:38
i8dcI32QWRfiwVj wrote:
06 Aug 2019, 07:41
Thanks! I am already using Scantailor for cropping, but I am looking for software that would get rid of excess white space around the text, and of fingers, so I can clean up my scans before I start working with Scantailor.
I use Scan Tailor Advanced for this, i.e. I manually select the area in which STA automatically looks for the desired content. It works very well provided the pages are in the same position. I also find it more efficient to use one tool for this instead of several. But it depends on your specific needs and your workflow. If you need to crop pages before proceeding with Scan Tailor, you may give ImageMagick a try, especially if the pages are in a stable position in all the pictures. Otherwise it may be necessary to look for something more advanced, for example this: viewtopic.php?f=24&p=21785#p21785
True, STA is not too bad, and I wasn't aware of the 'sort pages by ...' option you mentioned in your first post. It really makes it easier to find pages where text hasn't been properly recognised, thanks for that! I shall also give the AI option mentioned in the post you refer to a try, should I find myself with some spare time on my hands.
