Actually, every book I create with the described method has a color front cover and back cover. The contents between the covers are binarized (B&W). Color pictures can be added inside the book, but each one has to be converted manually to a suitable format, wrapped in a pdf, and then inserted into the final pdf file.
My workflow is as follows:
1. Save all images to a directory. The first and the last image are in color. The remaining images may or may not be B&W already; they will be binarized anyway.
2. Create a subfolder named pdf (the name used in step 3) to hold the covers.
3. Move the first and the last image to the subfolder and rename them:
Code:
mv "$(ls *.tif | head -n 1)" pdf/fcover.tif
mv "$(ls *.tif | tail -n 1)" pdf/bcover.tif
4. If OCR should be limited to specific zones (for example, to skip headers and footers), I copy a uzn file, which must be prepared beforehand, once for every image:
Code:
for f in *.tif; do cp 1.uzn "${f%.tif}.uzn"; done
5. OCR with Tesseract (produces the invisible text layer):
Code:
ls *.tif > output.txt && tesseract -l pol+fra --psm 4 -c textonly_pdf=1 output.txt text pdf
6. Binarization and jbig2 compression of images (visible layer of the pdf file):
Code:
jbig2 -s -p -v *.tif && pdf.py output > lossy.pdf
7. Join the two layers to get one pdf file with the images on top and the text layer underneath:
Code:
pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf
qpdf may be used for that instead of pdftk (the command syntax is then completely different, of course).
Now we have the B&W contents of the book and can move on to the cover(s), where color needs to be preserved.
8. Go to the subfolder that the covers were moved to in step 3 (pdf).
9. Apply jpeg2000 compression:
Code:
opj_compress -r 200 -i fcover.tif -o fcover.jp2
opj_compress -r 200 -i bcover.tif -o bcover.jp2
Jpeg compression may be applied instead of jpeg2000 (ImageMagick may be useful for that).
10. Wrap the color images in a pdf container (the same tool and method work for both jpeg2000 and jpeg):
Code:
img2pdf -o fcover.pdf --imgsize 300dpix300dpi fcover.jp2
img2pdf -o bcover.pdf --imgsize 300dpix300dpi bcover.jp2
The density (DPI) must be given explicitly ('--imgsize 300dpix300dpi' in the example above). If your DPI is different, e.g. a cover has 600 DPI, adjust the value accordingly. The contents and the cover(s) may use different densities, e.g. 300 DPI for the contents of a book and 600 DPI for the cover(s), but the dimensions in pixels divided by the DPI must give the same 'physical' size for every page. Otherwise the pdf will contain pages of different sizes, which looks terrible.
11. Copy the contents of the book (jbig2lossyocr.pdf, created in step 7) to the subfolder where the covers are located.
12. Join front cover + contents of the book + back cover into one pdf file:
Code:
pdftk fcover.pdf jbig2lossyocr.pdf bcover.pdf cat output book.pdf
qpdf can do that as well if one wants a replacement for pdftk.
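For anyone who prefers qpdf, here is a minimal sketch of what the two pdftk calls (steps 7 and 12) might look like. The function names merge_layers and join_pdfs are made up for this example; only the qpdf options --underlay, --empty and --pages come from qpdf itself. I assume the default page matching of --underlay (first page to first page, and so on), which should correspond to pdftk's multibackground.

```shell
#!/bin/sh
# Sketch only: merge_layers and join_pdfs are illustrative names, not qpdf commands.

# Step 7 equivalent: place each page of the text layer ($2) beneath the
# matching page of the image layer ($1) and write the result to $3.
merge_layers() {
    qpdf "$1" --underlay "$2" -- "$3"
}

# Step 12 equivalent: the first argument is the output file, the remaining
# arguments are the input pdf files, concatenated in the order given.
join_pdfs() {
    out=$1
    shift
    qpdf --empty --pages "$@" -- "$out"
}

# Usage:
#   merge_layers lossy.pdf text.pdf jbig2lossyocr.pdf
#   join_pdfs book.pdf fcover.pdf jbig2lossyocr.pdf bcover.pdf
```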
As a result, the output pdf file is an almost complete book with a color front and back cover and contents that are binarized (black letters on a white background) and OCRed. The next step would be to add indexes (toc) and metadata.
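The page-size rule from step 10 can be checked with a little arithmetic before running img2pdf: the physical size in inches is the pixel dimension divided by the DPI. A small sketch, where the function name inches and the sample pixel counts are invented for illustration:

```shell
#!/bin/sh
# Sketch only: physical size in inches = pixels / DPI.
# The function name and the sample numbers below are hypothetical.
inches() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px / dpi }'
}

# A 2480 px wide page at 300 DPI and a 4960 px wide cover at 600 DPI
# should come out with the same width, so both fit one pdf:
#   inches 2480 300
#   inches 4960 600
```

If the two results differ, either rescale the cover or pass a different DPI to --imgsize so that the sizes match.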
I gave CLI examples because all the commands can be put into a script that runs every step in one pass. Of course, other tools may be used instead, including GUI ones.
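As a rough sketch of such a script, the steps above could be collected into three functions. The function names, the OCR language argument and the DPI argument are my own choices for the example; the commands themselves are the ones from the steps (the optional uzn step 4 is left out). It assumes the tools are installed and that the script is used in the directory holding the scans.

```shell
#!/bin/sh
# Sketch only: function names and arguments are illustrative.

# Steps 2-3: move the first and last .tif (in glob sort order) into pdf/.
split_covers() (
    set -- *.tif
    [ "$#" -ge 3 ] || { echo "need at least 3 .tif files" >&2; exit 1; }
    first=$1
    for last in "$@"; do :; done    # after the loop, $last is the final file
    mkdir -p pdf &&
    mv -- "$first" pdf/fcover.tif &&
    mv -- "$last" pdf/bcover.tif
)

# Steps 5-7: OCR text layer, JBIG2 image layer, then merge both.
# $1 is the Tesseract language string, e.g. pol+fra.
build_contents() (
    ls *.tif > output.txt &&
    tesseract -l "$1" --psm 4 -c textonly_pdf=1 output.txt text pdf &&
    jbig2 -s -p -v *.tif &&
    pdf.py output > lossy.pdf &&
    pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf
)

# Steps 8-12: compress and wrap the covers, then assemble the book.
# $1 is the cover DPI, e.g. 300.
build_book() (
    dpi=$1
    cp jbig2lossyocr.pdf pdf/ && cd pdf || exit 1
    for c in fcover bcover; do
        opj_compress -r 200 -i "$c.tif" -o "$c.jp2" &&
        img2pdf -o "$c.pdf" --imgsize "${dpi}dpix${dpi}dpi" "$c.jp2" || exit 1
    done
    pdftk fcover.pdf jbig2lossyocr.pdf bcover.pdf cat output book.pdf
)

# Usage:
#   split_covers && build_contents pol+fra && build_book 300
```

Each function body runs in a subshell, so the cd in build_book does not change the directory of the calling shell.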