How to convert a book to searchable pdf using open source software
Moderator: peterZ
-
- Posts: 61
- Joined: 22 Dec 2016, 06:07
- E-book readers owned: Tolino, Kindle
- Number of books owned: 600
- Country: Poland
Re: How to convert a book to serchable pdf using open source software
As you already wrote, it was a sorting problem caused by inconsistent file naming. BTW, this is not a Tesseract issue: Tesseract cannot process a batch of separate files directly, so the workaround is to create a list of files in the right order for Tesseract to follow. This list was created with the 'ls' command, namely by listing all 'tif' files in the working directory in name order and saving the result to a text file ('output.txt' in this case). You might be able to avoid renaming files by changing the sort order, e.g. sorting by creation date (assuming the files were created in the correct sequence). Tesseract relies fully on this list and processes the files one by one in the given order.
Please also note that the scripts in this thread are very basic and prone to problems like this. They definitely do not follow good shell-scripting practices, and I have not presented them as a complete working solution but rather as examples of my individual approach. A more sophisticated implementation would be required before offering this to others as a kind of 'universal' software.
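A minimal sketch of building the file list in natural order, so that page2.tif sorts before page10.tif without renaming anything. It assumes a `sort` that supports `-V` (GNU coreutils and recent BSD versions do); the function name `list_pages` is made up for illustration, and file names containing newlines are not handled.

```shell
# list_pages: print the *.tif files of the current directory, one per line,
# in natural ("version") order, so page2.tif comes before page10.tif.
# Assumption: sort supports -V. This helper is not part of the original scripts.
list_pages() {
    printf '%s\n' *.tif | sort -V
}
```

It could be used as `list_pages > output.txt` before calling Tesseract on output.txt. If the scans were written to disk in reading order, `ls -tr *.tif > output.txt` (oldest modification time first) is another option.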
-
- Posts: 2
- Joined: 15 Apr 2020, 12:12
- E-book readers owned: kindle pw4
- Number of books owned: 0
- Country: Norway
Re: How to convert a book to serchable pdf using open source software
Thank you so much for posting this workflow.
It was really helpful.
I have a few modifications.
First, I think the cover should be the same size as the other pages when scrolling through the PDF file. I had some problems because I scanned the covers at a higher resolution.
After creating the jbig2lossyocr.pdf file, I checked the resolution of the text pages and gave the cover the same width and ppi:
Code: Select all
$ pdfimages -list -l 2 jbig2lossyocr.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1908  3110  gray    1   1  jbig2  no      1316  0   400   401 3899B 0.5%
   2     1 image    1908  3110  gray    1   1  jbig2  no      1319  0   400   401 4926B 0.7%
$ mogrify -units PixelsPerInch -density 400 -resize 1908x cover.tif
I have only created two books so far, and I was lucky with the chapter titles: they are named "Kapittel 1" and count up from there. I was able to create the TOC this way, with the help of http://manpages.ubuntu.com/manpages/tru ... ine.1.html
Code: Select all
$ echo "0 1 Book Title" > book.toc
$ pdftotext book.pdf - | awk -v RS=$'\f' -v NAME="Kapittel" 'index($0,NAME){printf "1 %d %s\n", NR, NAME;}' | grep -n '^' | awk -F':' '{print $2" "$1}' >> book.toc
$ head book.toc
0 1 Book Title
1 2 Kapittel 1
1 15 Kapittel 2
1 23 Kapittel 3
1 32 Kapittel 4
1 40 Kapittel 5
$ pdfoutline book.pdf book.toc book-toc.pdf
I also think the page numbers should match between the scanned pages and the PDF file.
I solved this with http://jpdftweak.sourceforge.net/
I made the following modification. It sets the first text page to 7 and makes the PDF open full page.
Code: Select all
Page number tab
1; i,ii; cover 1
2; 1,2,3;;7
Interaction tab
x Set viewer preferences
x Fit window to pdf
Metadata. I found a source for bib files here in Norway. More internationally, I think this is a good starting point: https://davetang.org/muse/2014/06/30/co ... to-bibtex/ and especially the OttoBib link.
I download the bib file and open it with JabRef from Nautilus. In JabRef, I right-click the bib entry and attach the PDF file, then choose
Tools - Write XMP metadata to PDFs
I don't use JabRef for anything else.
I also found a way to create PDF/A-2B from the created file with TOC and metadata. It's online.
https://www.pdftron.com/pdf-tools/pdfa-converter/
It worked well, and the JBIG2 and JPX images are kept, so the file size is about the same. It did not work when jpdftweak was the last step, so there may be some bugs in jpdftweak. I also think it was necessary to have the XMP metadata from JabRef attached to the PDF file.
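The table-of-contents pipeline in this post can probably be condensed into a single awk call. The sketch below is one way to do it, not the poster's exact method: `make_toc` is a hypothetical helper name, it expects `pdftotext` output on stdin (pages separated by form feeds), and it numbers chapters in order of appearance.

```shell
# make_toc NAME: read form-feed separated pages on stdin and print one
# "1 <page> <NAME> <n>" line for every page that contains NAME.
# Hypothetical helper; <n> simply counts the matching pages.
make_toc() {
    awk -v RS='\f' -v name="$1" \
        'index($0, name) { printf "1 %d %s %d\n", NR, name, ++n }'
}
```

Possible usage: `{ echo "0 1 Book Title"; pdftotext book.pdf - | make_toc Kapittel; } > book.toc`. Like the original pipeline, it also matches any other page that happens to contain the word, such as a printed contents page or running headers, so the result should be checked by eye.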
-
- Posts: 61
- Joined: 22 Dec 2016, 06:07
- E-book readers owned: Tolino, Kindle
- Number of books owned: 600
- Country: Poland
Re: How to convert a book to serchable pdf using open source software
Thank you for your comments and for sharing the details of your workflow. Nice to see that someone found the thread I wrote useful.
You are right. The scripts are adapted to work with 300 DPI images. For a higher resolution it would be necessary to adjust them accordingly. It is also possible to use a higher resolution for the covers than for the remaining contents and still have the same page sizes in the PDF file. What matters is that the combination of DPI and number of pixels gives the correct "physical" size (measured in centimeters or inches).
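The relation is simply physical size = pixels / DPI. A small sketch of that arithmetic, using the 1908 px at 400 ppi from the pdfimages listing earlier in the thread; the helper names `inches` and `pixels_at` are invented for this example.

```shell
# inches PIXELS DPI: physical length in inches, two decimals.
inches() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px / dpi }'
}

# pixels_at PIXELS FROM_DPI TO_DPI: pixel count that covers the same
# physical length at TO_DPI, rounded to the nearest integer.
pixels_at() {
    awk -v px="$1" -v from="$2" -v to="$3" \
        'BEGIN { printf "%d\n", px * to / from + 0.5 }'
}
```

For instance, `inches 1908 400` prints 4.77, and `pixels_at 1908 400 600` prints 2862, so a 600 DPI cover would need to be 2862 px wide to match text pages that are 1908 px wide at 400 DPI.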
cosinus wrote: ↑05 May 2020, 09:34
I was able to create the toc this way, with help of http://manpages.ubuntu.com/manpages/tru ... ine.1.html
Thanks for pointing out this tool. I was not aware that it exists.
cosinus wrote: ↑05 May 2020, 09:34
I also think the page numbers should match between the scanned pages and the pdf file.
I solved this with http://jpdftweak.sourceforge.net/
Apart from jpdftweak, I have also used pagelabels-py (https://github.com/lovasoa/pagelabels-py) for this. But usually the simpler the better. It may be sufficient to remove some blank pages at the beginning so that the numbers printed on the pages match the sequential page numbers in the PDF file.
cosinus wrote: ↑05 May 2020, 09:34
Metadata. I found a source for bib files here in Norway. More internationally I think this is a good starting point. https://davetang.org/muse/2014/06/30/co ... to-bibtex/ and especially the OttoBib link.
I didn't even know about this possibility. I will try it, as there are some BibTeX files that match my books.
OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) seems to be able to convert to PDF/A format if you want to avoid online tools.
-
- Posts: 2
- Joined: 15 Apr 2020, 12:12
- E-book readers owned: kindle pw4
- Number of books owned: 0
- Country: Norway
Re: How to convert a book to serchable pdf using open source software
zbgns wrote: ↑06 May 2020, 09:34
OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) seems to be able to convert to PDF/A format if you want to avoid online tools.
Thanks.
Yes, I have looked at OCRmyPDF, but it doesn't support JBIG2 or JPEG 2000 images in PDF/A.
I think it's Ghostscript that lacks that functionality.
Quote from the OCRmyPDF web page:
Code: Select all
PDFs containing JBIG2-encoded content will be converted to CCITT Group4 encoding, which has lower compression ratios, if Ghostscript PDF/A is enabled.
PDFs containing JPEG 2000-encoded content will be converted to JPEG encoding, which may introduce compression artifacts, if Ghostscript PDF/A is enabled.
-
- Posts: 61
- Joined: 22 Dec 2016, 06:07
- E-book readers owned: Tolino, Kindle
- Number of books owned: 600
- Country: Poland
Re: How to convert a book to serchable pdf using open source software
My bad. I was convinced that OCRmyPDF supports jbig2 but apparently this applies only to regular pdfs.
-
- Posts: 5
- Joined: 02 Jun 2020, 13:29
- Number of books owned: 0
- Country: Rather
Re: How to convert a book to serchable pdf using open source software
I'm not sure about JPEG2000, but OCRmyPDF seems to support JBIG2 if you first install jbig2enc: Installing the JBIG2 encoder.
-
- Posts: 12
- Joined: 26 Jul 2018, 09:28
- Number of books owned: 0
- Country: Germany
Re: How to convert a book to serchable pdf using open source software
To my knowledge, when you feed the script posted by zbgns (thanks again for sharing, by the way) with coloured *.tiff files, e.g. colour scans of book covers, these files will turn black in the conversion process. Is there any handy solution for adding coloured book covers of the same size as the text pages to the final book *.pdf?
-
- Posts: 61
- Joined: 22 Dec 2016, 06:07
- E-book readers owned: Tolino, Kindle
- Number of books owned: 600
- Country: Poland
Re: How to convert a book to serchable pdf using open source software
Actually, each book I have created using the described method has a color front cover and back cover. The contents between the covers are binarized (B&W). Color pictures may be added as well, but they would have to be manually converted to an appropriate format, turned into PDF, and then inserted into the final PDF file.
My workflow is as follows:
1. Save all images to a directory. The first and the last image are in color. The remaining ones may already be B&W or not, but they will be binarized anyway.
2. Create a subfolder:
Code: Select all
mkdir -p pdf
3. Move the first and the last image to the subfolder and rename them appropriately:
Code: Select all
mv "$(ls *.tif | head -n 1)" pdf/fcover.tif
mv "$(ls *.tif | tail -n 1)" pdf/bcover.tif
4. If the areas for OCR should be restricted (in order to omit headers and footers), I copy the uzn file once for every image (it must be prepared earlier):
Code: Select all
ls *.tif | cut -d "." -f 1 > list && while read -r line; do cp 1.uzn "$line".uzn; done < list
5. Tesseract OCR (invisible text layer):
Code: Select all
ls *.tif > output.txt && tesseract -l pol+fra --psm 4 -c textonly_pdf=1 output.txt text pdf
6. Binarization and jbig2 compression of the images (visible layer of the PDF file):
Code: Select all
jbig2 -s -p -v *.tif && pdf.py output > lossy.pdf
7. Join the layers in order to have one PDF file with the images and the text layer underneath:
Code: Select all
pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf
qpdf may be used for that instead of pdftk (of course the syntax of the command is then completely different).
Now we have the B&W contents of the book and can move on to the cover(s), where color needs to be preserved.
8. Go to the subfolder where the covers were moved (step 3):
Code: Select all
cd pdf
9. Apply jpeg2000 compression:
Code: Select all
opj_compress -r 200 -i fcover.tif -o fcover.jp2
opj_compress -r 200 -i bcover.tif -o bcover.jp2
Jpeg compression may be applied instead of jpeg2000 (then ImageMagick may be useful).
10. Wrap the color images into a PDF container (no matter whether jpeg2000 or jpeg, the tool and the method may be the same):
Code: Select all
img2pdf -o fcover.pdf --imgsize 300dpix300dpi fcover.jp2
img2pdf -o bcover.pdf --imgsize 300dpix300dpi bcover.jp2
The density, i.e. the DPI value, must be indicated ('--imgsize 300dpix300dpi' in the example above). If your DPI is different, e.g. a cover has 600 DPI, it must be adjusted accordingly. It is possible to have e.g. 300 DPI for the contents of a book and 600 DPI for the cover(s) at the same time, but the dimensions in pixels divided by the DPI must give the same 'physical' size. Otherwise the PDF will contain pages of different sizes, which looks terrible.
11. Move the contents of the book (created in step 7) to the subfolder where the covers are located:
Code: Select all
mv ../jbig2lossyocr.pdf ./
12. Join front cover + contents of the book + back cover into one PDF file:
Code: Select all
pdftk fcover.pdf jbig2lossyocr.pdf bcover.pdf cat output book.pdf
qpdf can do that as well if one wants a replacement for pdftk.
As a result, the output PDF file is an almost complete book with a color front cover and back cover, and with contents that are binarized (black letters on a white background) and OCRed. The next step would be to add indexes (TOC) and metadata.
I gave examples of how to do this with CLI tools, as it is then possible to put all commands into a script and run all the steps in one pass. But of course other tools may be used instead, including GUI ones.
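For readers who want the qpdf replacement mentioned for steps 7 and 12, the following is a sketch of what the equivalent calls might look like. It assumes a reasonably recent qpdf that has the `--underlay` option; the helper names `run`, `merge_layers` and `join_book` and the `DRY_RUN` switch are invented for this example, and the output has not been compared against pdftk's.

```shell
# run: execute a command, or only print it when DRY_RUN is set.
# Illustrative helper for trying the script without touching any files.
run() {
    if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# merge_layers IMAGES TEXT OUT: step 7 with qpdf. Each page of TEXT is
# placed underneath the corresponding page of IMAGES.
merge_layers() {
    run qpdf "$1" --underlay "$2" -- "$3"
}

# join_book OUT PART...: step 12 with qpdf. Concatenates the PART files
# in the given order into OUT.
join_book() {
    out=$1
    shift
    run qpdf --empty --pages "$@" -- "$out"
}
```

Example calls: `merge_layers lossy.pdf text.pdf jbig2lossyocr.pdf` and `join_book book.pdf fcover.pdf jbig2lossyocr.pdf bcover.pdf`. Setting `DRY_RUN=1` first only prints the qpdf command lines.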
-
- Posts: 12
- Joined: 26 Jul 2018, 09:28
- Number of books owned: 0
- Country: Germany
Re: How to convert a book to serchable pdf using open source software
This is really impressive. Thank you very much!
Re: How to convert a book to serchable pdf using open source software
I hope this post is not too little, too late, but I wanted to mention that over the past year I have written tooling that does exactly this. It takes as input a stack of images and an hOCR file for the OCR (generated by Tesseract), and produces a PDF compressed with JPEG2000 images (with separate foreground and background images) and JBIG2 (or CCITT) compression for the foreground mask. It can easily lead to a 10x size reduction if the input files are also JPEG2000 files, and more otherwise. You can tweak the quality parameters if the quality is not acceptable.
There are some examples on how it does MRC, here: https://archive.org/~merlijn/projects/a ... c-examples
It's AGPLv3 and you can find it here: https://git.archive.org/merlijn/archive-pdf-tools (to create the combined hocr file, use `hocr-combine-stream` from https://git.archive.org/merlijn/archive-hocr-tools / https://archive.org/~merlijn/archive-ho ... ine-stream)
Cheers,
Merlijn
E: The PDFs should also pass PDF/A 3b and most of PDF/UA (checked with VeraPDF)