
How to convert a book to searchable pdf using open source software

zbgns
Posts: 54
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns » 09 Jan 2020, 21:17

As you already wrote, it was a sorting problem caused by inconsistent file naming. BTW, this is not a Tesseract issue: Tesseract cannot process a batch of separate files directly, so a workaround is needed, namely a list of files in the right order for Tesseract to follow. This list was created with the 'ls' command, by listing all 'tif' files in the working directory in name order and saving the result to a text file ('output.txt' in this case). You might avoid renaming the files by changing the sort order, e.g. sorting by creation date (assuming they were created in the correct sequence). Tesseract relies entirely on this list and processes the files one by one in the given order.
Please also note that the scripts in this thread are very basic and prone to problems like this. They definitely do not follow good shell-scripting practices, and I presented them not as a complete working solution but as examples of my individual approach. A more sophisticated implementation would be required before offering this to others as a kind of 'universal' software.
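
A minimal sketch of how the file list could be built in an order that survives inconsistent numbering, assuming GNU coreutils (for `sort -V`, which compares embedded numbers naturally, so page2.tif comes before page10.tif). The function name `make_file_list` is made up for illustration; 'output.txt' is the list name mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: write the names of all .tif files in a directory to a
# list file in "version" (natural) order, so page2.tif sorts before page10.tif.
# Assumes GNU sort (-V) and file names without newlines. If the directory
# contains no .tif files, the list will hold the literal pattern "*.tif".
make_file_list() {
    dir=$1
    list=$2
    # The subshell keeps the cd local; the redirection is resolved
    # relative to the caller's working directory.
    ( cd "$dir" && printf '%s\n' *.tif | sort -V ) > "$list"
}
```

Calling `make_file_list . output.txt` in the working directory would produce the list, which can then be given to Tesseract in place of a single image name. Sorting by modification time instead would be `ls -tr *.tif > output.txt`, which only helps if the files were written in page order.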

cosinus
Posts: 2
Joined: 15 Apr 2020, 12:12
E-book readers owned: kindle pw4
Number of books owned: 0
Country: Norway

Re: How to convert a book to searchable pdf using open source software

Post by cosinus » 05 May 2020, 09:34

Thank you so much for posting this workflow.
It was really helpful.

I have a few modifications. :-)
First, I think the cover should be the same size as the other pages when scrolling through the pdf file. I had some problems since I scanned the covers at a higher resolution.

After creating the jbig2lossyocr.pdf file, I checked the resolution of the text pages and created the cover with the same width and ppi:

Code: Select all

$ pdfimages -list -l 2 jbig2lossyocr.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1908  3110  gray    1   1  jbig2  no      1316  0   400   401 3899B 0.5%
   2     1 image    1908  3110  gray    1   1  jbig2  no      1319  0   400   401 4926B 0.7%

$ mogrify -units PixelsPerInch -density 400 -resize 1908x cover.tif
 
I have only created two books so far, and I was lucky with the chapter titles. They are named "Kapittel 1" and count up from there. I was able to create the toc this way, with the help of http://manpages.ubuntu.com/manpages/tru ... ine.1.html

Code: Select all

$ echo "0 1 Book Title" > book.toc
$ pdftotext book.pdf - | awk -v RS=$'\f' -v NAME="Kapittel" 'index($0, NAME) { printf "1 %d %s\n", NR, NAME }' | grep -n '^' | awk -F':' '{ print $2 " " $1 }' >> book.toc

$ head book.toc 
0 1 Book Title
1 2 Kapittel 1
1 15 Kapittel 2
1 23 Kapittel 3
1 32 Kapittel 4
1 40 Kapittel 5

pdfoutline book.pdf book.toc  book-toc.pdf 
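
The pipeline above can probably be condensed into a single awk call. This is only a sketch under the same assumptions: `pdftotext` separates pages with a form feed, and every page that contains the word "Kapittel" starts a new chapter. The function name `make_toc` is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: read page text separated by form feeds on stdin
# (as produced by `pdftotext book.pdf -`) and print one "level page title"
# line for each page that contains the chapter keyword given as $1.
# The chapter counter n starts at 1 and increases with every match.
make_toc() {
    awk -v RS='\f' -v name="$1" '
        index($0, name) { printf "1 %d %s %d\n", NR, name, ++n }
    '
}

# Possible usage, with the file names from the post above:
#   echo "0 1 Book Title" > book.toc
#   pdftotext book.pdf - | make_toc "Kapittel" >> book.toc
```

Like the original, this would produce false entries if the keyword also appears in running headers or in the body text of other pages, so the resulting book.toc is worth checking by eye before running pdfoutline.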
I also think the page numbers in the pdf file should match those printed on the scanned pages.
I solved this with http://jpdftweak.sourceforge.net/
I made the following modifications. They set the first text page to 7 and make the pdf open full page.

Code: Select all

Page number tab
1; i,ii; cover 1
2; 1,2,3;;7
Interaction tab
x Set viewer preferences
x Fit window to pdf

Metadata. I found a source for bib files here in Norway. Internationally, I think this is a good starting point: https://davetang.org/muse/2014/06/30/co ... to-bibtex/ and especially the OttoBib link.
I download the bib file and, in Nautilus, open it with JabRef. In JabRef, I right-click the bib entry and attach the pdf file. Then, in JabRef:
Tools - Write xmp metadata to pdf’s
I don't use jabref for anything else.

I also found a way to create PDF/A-2B from the created file with toc and metadata. It's online.
https://www.pdftron.com/pdf-tools/pdfa-converter/
It worked well, and the jbig2 and jpx images are kept, so the file size is about the same. It did not work when jpdftweak was the last step, so there may be some bugs in jpdftweak. I also think it was necessary to have the xmp metadata from JabRef attached to the pdf file.


Re: How to convert a book to searchable pdf using open source software

Post by zbgns » 06 May 2020, 09:34

Thank you for your comments and for sharing the details of your workflow. Nice to see that someone found the thread I wrote useful.
cosinus wrote:
05 May 2020, 09:34
First, I think the cover should be the same size as the other pages when scrolling through the pdf file. I had some problems since I scanned the covers at a higher resolution.
You are right. The scripts are adapted to work with 300 DPI images. For a higher resolution they would have to be adjusted accordingly. It is also possible to use a higher resolution for the covers than for the remaining contents and still have the same page sizes in the pdf file. What matters is the combination of DPI and pixel count, which together give the "physical" size (measured in centimeters or inches): physical size = pixels / DPI.
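
The relation between DPI and pixel count can be sketched as a small calculation. The 1908 px width and 400 ppi come from the pdfimages listing earlier in this thread; the 600 ppi cover resolution and the function name `same_width_px` are assumptions made only for this example.

```shell
#!/bin/sh
# Hypothetical helper: how many pixels wide must an image be at target_dpi
# to have the same physical width as an image that is src_px wide at src_dpi?
# Since physical width (inches) = pixels / DPI:
#   target_px = src_px * target_dpi / src_dpi
same_width_px() {
    src_px=$1
    src_dpi=$2
    target_dpi=$3
    # Integer arithmetic, rounded to the nearest pixel.
    echo $(( (src_px * target_dpi + src_dpi / 2) / src_dpi ))
}

# Text pages are 1908 px wide at 400 ppi; a cover kept at 600 ppi
# needs this many pixels to show up at the same width:
same_width_px 1908 400 600
```

This prints 2862, so a cover kept at 600 ppi could be resized along the lines of `mogrify -units PixelsPerInch -density 600 -resize 2862x cover.tif` and would still appear as wide as the text pages.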
cosinus wrote:
05 May 2020, 09:34
I was able to create the toc these way, with help of http://manpages.ubuntu.com/manpages/tru ... ine.1.html
Thanks for pointing out this tool. I was not aware that it exists.
cosinus wrote:
05 May 2020, 09:34
I also think the page numbers in the pdf file should match those printed on the scanned pages.
I solved this with http://jpdftweak.sourceforge.net/
Apart from jpdftweak, I have also used pagelabels-py (https://github.com/lovasoa/pagelabels-py) for this. But usually the simpler the better: it may be sufficient to remove some blank pages at the beginning so that the numbers printed on the pages match the sequential page numbers in the pdf file.
cosinus wrote:
05 May 2020, 09:34
Metadata. I found a source for bib files here in Norway. Internationally, I think this is a good starting point: https://davetang.org/muse/2014/06/30/co ... to-bibtex/ and especially the OttoBib link.
I didn't even know about this possibility. I will try it, as there are some BibTeX files that match my books.
cosinus wrote:
05 May 2020, 09:34
I also found a way to create PDF/A-2B from the created file with toc and metadata. It's online.
OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) seems to be able to convert to PDF/A format if you want to avoid online tools.


Re: How to convert a book to searchable pdf using open source software

Post by cosinus » 13 May 2020, 03:51

zbgns wrote:
06 May 2020, 09:34
OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) seems to be able to convert to PDF/A format if you want to avoid online tools.
Thanks.
Yes, I have looked at OCRmyPDF, but it doesn't support JBIG2 or JPEG 2000 images in PDF/A.
I think it's Ghostscript that lacks that functionality.

Quote from the OCRmyPDF web page:

Code: Select all

PDFs containing JBIG2-encoded content will be converted to CCITT Group4 encoding, which has lower compression ratios, if Ghostscript PDF/A is enabled.
PDFs containing JPEG 2000-encoded content will be converted to JPEG encoding, which may introduce compression artifacts, if Ghostscript PDF/A is enabled.


Re: How to convert a book to searchable pdf using open source software

Post by zbgns » 14 May 2020, 14:21

My bad. I was convinced that OCRmyPDF supports jbig2, but apparently this applies only to regular pdfs, not to PDF/A.
