Combining Split output into PDF


Moderator: peterZ

jaffamuffin
Posts: 22
Joined: 21 Oct 2011, 09:51
Number of books owned: 0

Combining Split output into PDF

Post by jaffamuffin »

Hi all

I need to create an optimised PDF file, and I wonder what software can take the split output (binary + colour) and make a PDF with a smaller file size than compressing the whole page, including the binary content, as a JPG?

My standard approach to PDF creation is a Python script that uses img2pdf.
jaffamuffin
Posts: 22
Joined: 21 Oct 2011, 09:51
Number of books owned: 0

Re: Combining Split output into PDF

Post by jaffamuffin »

Hmm, after some tests it appears that FineReader, if the (undocumented) 'Mixed Raster' option is ticked, will produce a PDF with binary and colour components on the same page, reducing the file size significantly.

However, it takes the processed single image from Scan Tailor as its input.

Is there any known software that will take the split streams from Scan Tailor and combine them into a PDF (for example, if I didn't want to or couldn't use ABBYY FineReader, or didn't actually need OCR performed)?

Many thanks
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: Combining Split output into PDF

Post by zbgns »

Adobe Acrobat in more recent versions does a good job with Scan Tailor output and creates such 'optimized' PDF files, where color and b&w content are segmented and compressed separately. The result seems to me to be similar to the ABBYY FineReader MRC compression model.
dpc
Posts: 379
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: Combining Split output into PDF

Post by dpc »

@jaffamuffin

If you don't mind, will you post the file size savings of your work so that we can get a rough idea of what one could expect by going this route? Thanks!
jaffamuffin
Posts: 22
Joined: 21 Oct 2011, 09:51
Number of books owned: 0

Re: Combining Split output into PDF

Post by jaffamuffin »

I haven't found anything that can take the multi-stream output of Scan Tailor, but if you process the files properly (i.e. binary text and picture zones), this is what I found on 5 sample files. Sorry, I can't post the files, but here are thumbnails: https://imgur.com/a/DROftlI

I scan the files at 300dpi as uncompressed TIF in 24bpp colour, then do 600dpi output in Scan Tailor for both colour/mixed and B&W. Scan Tailor's binary output is really, really, really nice.

(note, every book has a colour cover in the output)

(Note also that ST does LZW on the colour output, so the file size after upscaling doesn't change much. I do wish we could have ST output JPEG at a preferred compression level; it would save a lot of time, as 90% of the time I convert back to JPEG anyway.)

The figures below are, in order (IIRC): scanned size / Scan Tailor output size / PDF size with no scaling, 50% JPEG and G4 binary / PDF at 150dpi scaling / PDF with MRC in FineReader (full res!). I think MRC uses deflate or JBIG (?), as the compression on the binary files is improved; see the notes below.

1. 29 pages black and white only (571MB/ 53.8MB / 3MB / 1.3MB / 640KB )
2. 26 pages black and white (474 MB / 31.9MB / 1MB / 900KB / 535KB )
3. 25 pages colour illustrations, 12"x12" book (1.1 GB / 988MB / 40MB / 5.5MB / 4.4MB )
4. 38 pages, colour another big book 10x10" (1.34 GB / 1.1 GB / 52 MB / 5MB / 4.8MB )
5. 23 pages, 2 illustrations (489 MB / 69MB / 3.5MB / 1MB / 528KB)

Note that with MRC I get a file size SMALLER than with any of the other methods while still retaining full resolution (600dpi); 150 is too low IMO.
Note also that if I take the files and run them through 'Reduce File Size' in Acrobat 9 (it's all I have), it only processes the binary files properly, and there I get a file size almost the same as the ABBYY mixed raster output. But it fails on the larger files.
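To put a rough number on the saving, here is a minimal Python sketch that takes the five samples listed above and computes how much smaller the MRC PDF is than the full-resolution 50% JPEG / G4 PDF. The unit conversion (1 MB = 1024 KB, 1 GB = 1024 MB) and the function name are my assumptions; the sizes are just the rounded figures from the list.

```python
# Sizes in KB, in the same column order as the list above:
# scanned / Scan Tailor output / PDF 50% JPEG + G4 / PDF at 150dpi / MRC PDF
MB = 1024          # assumption: 1 MB = 1024 KB
GB = 1024 * MB     # assumption: 1 GB = 1024 MB

samples = {
    1: (571 * MB, 53.8 * MB, 3 * MB, 1.3 * MB, 640),
    2: (474 * MB, 31.9 * MB, 1 * MB, 900, 535),
    3: (1.1 * GB, 988 * MB, 40 * MB, 5.5 * MB, 4.4 * MB),
    4: (1.34 * GB, 1.1 * GB, 52 * MB, 5 * MB, 4.8 * MB),
    5: (489 * MB, 69 * MB, 3.5 * MB, 1 * MB, 528),
}


def mrc_saving(sizes):
    """Ratio of the full-resolution JPEG/G4 PDF size to the MRC PDF size."""
    return sizes[2] / sizes[4]


for number, sizes in samples.items():
    print(f"sample {number}: MRC PDF is {mrc_saving(sizes):.1f}x smaller")
```

Since the inputs are rounded forum figures, the ratios are only indicative.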

The other thing is that the actual PDF, when looked at closely, has binary-encoded data and colour on the same page, so text is nice and crisp alongside the colour illustrations. Not using MRC produces 'colour' black and white, which makes the visual quality look worse because the binary text is encoded with JPEG, so you get the fuzzies. Here is an extract from a page that has a colour picture on it: https://imgur.com/a/KF7uG4N

hope that helps
emmerkar
Posts: 1
Joined: 30 Jan 2021, 05:10
Number of books owned: 0
Country: Belgium

Re: Combining Split output into PDF

Post by emmerkar »

Hi, javamuffin

Sorry to bother you, but I am new to Scan Tailor and am trying to get the most efficient PDF using this very good software. My problem is this: I have a scanned 1000-page dictionary of about 1.7 GB (pdfimages says it is a 300 DPI scan). I split it into 1000 PNG images at 600x600 DPI. Then I ran Scan Tailor to obtain the TIF images, and the resulting PDF is about 350 MB. I repeated the procedure with 300 DPI TIFF images (24nc) and the result after Scan Tailor is about the same. So I guess I obtained the best result with Scan Tailor. What I don't know is how to carry out the two steps after Scan Tailor: 'PDF size 50% JPEG -G4' and 'PDF at 150dpi scaling' (I am on Linux). Can you explain what I have to do? Also, is MRC possible on Linux?
Thanks for your attention

emmerkar
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: Combining Split output into PDF

Post by zbgns »

emmerkar wrote: 30 Jan 2021, 16:06 What I don't know is how to carry out the two steps after Scan Tailor: 'PDF size 50% JPEG -G4' and 'PDF at 150dpi scaling' (I am on Linux). Can you explain what I have to do?
What do you mean by 'PDF size 50% JPEG -G4'? Do you want JPEG compression applied to the color elements and CCITT Group 4 ('G4') compression applied to the bitonal content (text) on the same page (in one picture)? This is not possible with general-purpose graphics formats (JPEG, PNG etc.); it requires special techniques like MRC, where a picture is segmented into a foreground (letters) and a background (images), or even a larger number of layers. Each layer is then compressed separately, and afterwards all layers are combined to obtain one picture (in containers like PDF or DjVu). Scan Tailor Advanced can provide such split output (background and foreground in separate files), but compressing them and combining them back is another story.
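The segmentation idea described above can be illustrated with a toy Python sketch. This is not how any real MRC encoder works internally; it only shows the principle of splitting a page into a bitonal mask and a picture layer and putting them back together. The page is a tiny list of grey values, and the threshold of 64 and the function names are assumptions made for illustration.

```python
# Toy illustration of the MRC principle on a page given as rows of grey
# values (0 = black, 255 = white). The threshold is an arbitrary assumption.

def split_layers(page, threshold=64, fill=255):
    """Return (mask, background): a 1-bit text mask and a picture layer."""
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    # Text pixels are blanked in the background so that it holds only the
    # picture content and can be compressed lossily without hurting the text.
    background = [
        [fill if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(page, mask)
    ]
    return mask, background


def combine_layers(mask, background, ink=0):
    """Paint the mask in solid ink on top of the background layer."""
    return [
        [ink if m else px for px, m in zip(bg_row, mask_row)]
        for bg_row, mask_row in zip(background, mask)
    ]


page = [
    [255, 255, 10, 255],   # one dark "text" pixel on white paper
    [200, 180, 160, 140],  # part of a mid-tone illustration
]
mask, background = split_layers(page)
print(mask)
print(background)
print(combine_layers(mask, background))
```

In a real MRC file the mask would be stored with a bitonal codec such as G4 and the background with JPEG; here both layers stay as plain lists.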

When it comes to a 50% reduction in size, that is an easy task. Plenty of programs offer bulk resizing. E.g. ImageMagick does it with the command below (note that mogrify overwrites the files in place, so work on copies):

Code: Select all

mogrify -resize 50% *.tif
But it seems that 150 DPI is too low for text. It may be enough for color pictures only if you do not care about details and general quality. I would recommend 300 DPI instead.
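For reference, the relation between a percentage resize and the effective resolution is simple arithmetic. A small Python sketch, where the 6 x 9 inch page size is a made-up example and the function names are mine:

```python
def resized_dpi(source_dpi, scale_percent):
    """Effective DPI after scaling the pixel dimensions by scale_percent."""
    return source_dpi * scale_percent / 100


def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given physical size at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical 6 x 9 inch page output from Scan Tailor at 600 DPI
print(pixel_size(6, 9, 600))
print(resized_dpi(600, 50))
print(pixel_size(6, 9, resized_dpi(600, 50)))
```

So a 50% resize of 600 DPI output gives 300 DPI, and the same resize applied to a 300 DPI image gives 150 DPI.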
emmerkar wrote: 30 Jan 2021, 16:06 Also is it possible MRC in linux?
I am not aware of any free software for Linux that is able to create PDF files with MRC compression. Apparently, it is possible with DjVu. Maybe you should give DjVu a try?
jaffamuffin
Posts: 22
Joined: 21 Oct 2011, 09:51
Number of books owned: 0

Re: Combining Split output into PDF

Post by jaffamuffin »

emmerkar wrote: 30 Jan 2021, 16:06 Hi, javamuffin

Sorry to bother you, but I am new to Scan Tailor and am trying to get the most efficient PDF using this very good software. My problem is this: I have a scanned 1000-page dictionary of about 1.7 GB (pdfimages says it is a 300 DPI scan). I split it into 1000 PNG images at 600x600 DPI. Then I ran Scan Tailor to obtain the TIF images, and the resulting PDF is about 350 MB. I repeated the procedure with 300 DPI TIFF images (24nc) and the result after Scan Tailor is about the same. So I guess I obtained the best result with Scan Tailor. What I don't know is how to carry out the two steps after Scan Tailor: 'PDF size 50% JPEG -G4' and 'PDF at 150dpi scaling' (I am on Linux). Can you explain what I have to do? Also, is MRC possible on Linux?
Thanks for your attention

emmerkar
Scan Tailor will always output either G4-compressed TIF (for binary files) or LZW-compressed TIF for colour (and grey as well, I think).

So you need to compress them outside of Scan Tailor. I was using ABBYY FineReader to compress them, with JPEG compression at 50% on the colour pages/content, plus MRC. MRC compresses as colour only the parts of the document that actually are colour, not the whole page. That way I gain efficiency by binarising ALL text to G4 compression, even on pages that have colour content, instead of having JPEG on the whole page. Depending on the make-up of your books, that can be a significant file size difference.

In your case you are dealing with a dictionary, so I doubt there's any colour. You should be able to get a file size of approx. 50-100KB per page (depending on content, size etc.), so 1000 pages should be about 50-100MB. Note as well, when dealing with these things, that a 150dpi colour image has about as much detail as a 300dpi binary image, since the colour can in some way make up for the lack of resolution.
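The estimate is just pages times size per page. As a quick Python check, using round decimal units (1 MB = 1000 KB, my assumption) and function names of my own:

```python
def estimate_pdf_mb(pages, kb_per_page):
    """Expected PDF size in MB for a given average page size in KB."""
    return pages * kb_per_page / 1000


def kb_per_page(total_mb, pages):
    """Average page size in KB for an existing PDF."""
    return total_mb * 1000 / pages


print(estimate_pdf_mb(1000, 50), estimate_pdf_mb(1000, 100))
# The 350 MB, 1000-page PDF mentioned in the question, for comparison:
print(kb_per_page(350, 1000))
```

That 350 MB file works out at about 350KB per page, which suggests the pages in it are not stored as G4 binary images.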

You should import your 300dpi scans, binarise them to 600dpi and output them from Scan Tailor. Then you have a couple of options on Linux; I will list them below.

1. Use ImageMagick to compress, and then pdftk to concatenate.
convert 0001.tif 0001.pdf
ImageMagick will carry the source format over into the output, so if your TIF is already compressed as G4 it will save the PDF as G4.
Also for JPEG:
convert 0001.jpg -quality 50 0001.pdf
will make it smaller.
It's trivial to write a script/loop for batch processing; that's an exercise for the reader. On Windows it's something like:

FOR /F "delims=" %%A IN ('dir c:\myfiles\*.tif /b /s') DO (
    convert "%%A" "c:\myfiles\out\%%~nA.pdf"
)

If you have a lot of images, ImageMagick's built-in globbing can fill your memory, as it reads in all the files before writing any output; hence doing it one by one in a loop.

Then you have to use pdftk to concatenate them: https://www.pdflabs.com
pdftk c:\myfiles\out\*.pdf cat output combined.pdf

But I don't favour this approach any more, since A) ImageMagick always recompresses your files and, because of A, it's B) slow. I prefer to split the compression and the PDF stages.
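Since the question was about Linux, the same one-file-at-a-time loop can be sketched in Python, calling the same two tools through subprocess. This is only a sketch: it assumes `convert` and `pdftk` are on the PATH, and the directory and file names in the usage line are hypothetical.

```python
import subprocess
from pathlib import Path


def convert_cmd(src, out_dir):
    """ImageMagick call that wraps one image in a one-page PDF."""
    src = Path(src)
    return ["convert", str(src), str(Path(out_dir) / (src.stem + ".pdf"))]


def concat_cmd(pdfs, output):
    """pdftk call that concatenates the one-page PDFs in the given order."""
    return ["pdftk", *map(str, pdfs), "cat", "output", str(output)]


def build_pdf(image_dir, out_dir, output):
    """Convert every .tif in image_dir one by one, then join the results."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pages = []
    for src in sorted(Path(image_dir).glob("*.tif")):
        cmd = convert_cmd(src, out_dir)
        subprocess.run(cmd, check=True)  # one image per call keeps memory low
        pages.append(cmd[-1])
    subprocess.run(concat_cmd(pages, output), check=True)
```

Usage would be something like `build_pdf("scans", "out", "combined.pdf")`.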

2. The way I have been doing it for some time now is to prepare the files using ImageMagick if required (creating compressed JPEGs, resized JPEGs (300->150) etc.) and then run a small Python script to join them.

This means I can use any process to create my images and then PDF them separately.

The PDF library is here:
https://pypi.org/project/img2pdf/

Here's the meat:

import os
import img2pdf

# convert all files ending in .jpg inside a directory
dirname = "/path/to/images"
with open("name.pdf", "wb") as f:
    imgs = []
    # sorted() keeps the pages in filename order; os.listdir order is arbitrary
    for fname in sorted(os.listdir(dirname)):
        if not fname.endswith(".jpg"):
            continue
        path = os.path.join(dirname, fname)
        if os.path.isdir(path):
            continue
        imgs.append(path)
    f.write(img2pdf.convert(imgs))

This just containerises the files in the PDF, so there is no conversion and it's super fast. If you want TIF and JPEG, change the line to
if not fname.endswith((".jpg", ".tif")):
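One thing to watch with a script like this is page order: os.listdir returns names in arbitrary order, and a plain alphabetical sort puts page10 before page2 unless the names are zero-padded. A small sketch of a natural sort key that could be applied to the file list before handing it to img2pdf (the helper name and the sample file names are made up):

```python
import re


def natural_key(name):
    """Split 'page10.tif' into ['page', 10, '.tif'] so numbers sort as numbers."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


names = ["page10.tif", "page2.tif", "page1.jpg"]
print(sorted(names))                   # plain sort: page10 comes before page2
print(sorted(names, key=natural_key))  # natural sort: 1, 2, 10
```

With zero-padded names like 0001.tif a plain sorted() is enough.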


3. Another thing worth mentioning: install and use Ghostscript, which is the de facto PDF software; ImageMagick actually uses it behind the scenes. There is documentation and options galore with this software. I have had good success converting PDFs to smaller, usable files using this command or variants thereof (this one resizes to 150dpi):

"C:\Program Files\gs\gs9.52\bin\gswin64c.exe" -q -dNOPAUSE -dBATCH -dSAFER -dSimulateOverprint=true -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dEmbedAllFonts=true -dSubsetFonts=true -dAutoRotatePages=/None -dColorImageDownsampleType=/Bicubic -dColorImageResolution=150 -dGrayImageDownsampleType=/Bicubic -dGrayImageResolution=150 -sOutputFile=outfile.pdf inputfile.pdf

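On Linux the Ghostscript binary is normally just `gs`. As a sketch, the downsampling part of the command above can be built as an argument list in Python and passed to subprocess. Only flags that appear in the command above are used, the font and overprint options are left out for brevity, and the function name and file names are hypothetical.

```python
import shlex


def gs_downsample_cmd(src, dst, dpi=150, gs_binary="gs"):
    """Argument list for a Ghostscript run that downsamples images to dpi."""
    return [
        gs_binary, "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
        "-dAutoRotatePages=/None",
        "-dColorImageDownsampleType=/Bicubic",
        f"-dColorImageResolution={dpi}",
        "-dGrayImageDownsampleType=/Bicubic",
        f"-dGrayImageResolution={dpi}",
        f"-sOutputFile={dst}",
        src,
    ]


cmd = gs_downsample_cmd("inputfile.pdf", "outfile.pdf")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```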

Here's a Windows script for making a PDF without using Python:

Code: Select all

@echo off
SETLOCAL
REM drop a folder on me
REM quote the whole assignment so outdir itself holds no quote characters
SET "outdir=%~dpn1\pdf"
IF NOT EXIST "%outdir%" MKDIR "%outdir%"
REM "delims=" keeps file names that contain spaces in one piece
FOR /F "delims=" %%A IN ('dir "%~1\*.jpg" /b /a-d') DO (
	echo "%outdir%\%%~nA.pdf"
	REM magick is the command for IM ver 7+, otherwise use convert
	REM convert "%~1\%%A" "%outdir%\%%~nA.pdf"
	magick "%~1\%%A" "%outdir%\%%~nA.pdf"
)

REM next line puts pdf above source folder
pdftk "%outdir%\*.pdf" cat output "%~dp1%~n1.pdf"

Python script (close enough):

Code: Select all

import img2pdf
import os
import sys
from sys import argv


    
    ## MAIN PROG TURN ON

##           script        path to images               output subdir
## call e.g. pdf.py u:\path\to\BOX_1234\2018


if len(argv) < 2: exit(1)
#supplied_filename = sys.argv[-1]
supplied_filepath = sys.argv[1]

print (supplied_filepath)


# expecting a path to a docid within a box, like this
# U:\myfiles\SCANS\BOX_2273\1002

filePath, imgid = os.path.split(supplied_filepath)
filebase, docid = os.path.split(filePath)
filebase2, box = os.path.split(filebase)



print (filePath, box, docid, imgid)
targetdir = "U:" + "\\" + "myfiles" + "\\" + "pdf" + "\\" + box + "\\" + docid

print (targetdir)
if not os.path.exists(targetdir):
    os.makedirs(targetdir)


# convert all files ending in .jpg inside a directory
dirname = supplied_filepath
with open(targetdir + "\\" + docid + ".pdf","wb") as f:
    imgs = []
    for fname in sorted(os.listdir(dirname)):
        if not fname.endswith((".jpg", ".tif")):
            continue
        path = os.path.join(dirname, fname)
        if os.path.isdir(path):
            continue
        imgs.append(path)
    f.write(img2pdf.convert(imgs))
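A note on the path handling in the script above: the three os.path.split calls peel off the last three path components, so which folder ends up as box and docid depends on how deep the supplied path is. A small sketch using ntpath (the Windows flavour of os.path, which works on any platform) shows what the splits return. The first path is the one from the script's comment; the extra `tif` subfolder in the second one is a hypothetical name.

```python
import ntpath  # Windows path rules, importable on any platform


def split_scan_path(path):
    """Return (box, docid, imgid) the same way the script above splits them."""
    rest, imgid = ntpath.split(path)
    rest, docid = ntpath.split(rest)
    _, box = ntpath.split(rest)
    return box, docid, imgid


print(split_scan_path(r"U:\myfiles\SCANS\BOX_2273\1002"))
print(split_scan_path(r"U:\myfiles\SCANS\BOX_2273\1002\tif"))
```

With the path from the comment, box comes out as SCANS and docid as BOX_2273; with one more folder level, box is BOX_2273 and docid is 1002. So adjust the number of splits to your own folder layout.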
    
    
jaffamuffin
Posts: 22
Joined: 21 Oct 2011, 09:51
Number of books owned: 0

Re: Combining Split output into PDF

Post by jaffamuffin »

What do you mean by 'PDF size 50% JPEG -G4'?

By this I mean that I store the JPEG at 50% compression, which is fine/good enough quality for colour, and G4 is CCITT Group 4 for the binary image. You can see in my previous post an example of 50% JPEG on text up close: the fuzzies. Higher quality doesn't have this (as much), but the file size increases a lot. Note that, as I said, a higher JPEG quality at a lower resolution can in many cases give subjectively better-looking colour text pages, especially if the PDF will be read on screen only.

See the post below (figures are dpi@JPEG quality):

150@50 = 4KB, 150@90=9KB
but
400@50 = 13KB and 400@90 = 33KB ...
All are readable, so there is a bigger saving from resizing than from higher compression. And as a bonus, if you want better quality you can get it for a given file size if you resample down to e.g. 150 (150@90 = 9KB vs 400@50 = 13KB).
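The reason resizing saves so much is that the pixel count goes with the square of the resolution. A quick Python sketch comparing that with the four measured sizes above (the dictionary layout and function name are mine):

```python
# Measured sizes in KB for the 1 inch x 1 inch sample, keyed by (dpi, quality)
sizes_kb = {(150, 50): 4, (150, 90): 9, (400, 50): 13, (400, 90): 33}


def pixel_ratio(low_dpi, high_dpi):
    """Fraction of the pixels that remain after resampling to low_dpi."""
    return (low_dpi / high_dpi) ** 2


print(f"pixels kept at 150 vs 400 dpi: {pixel_ratio(150, 400):.1%}")
for quality in (50, 90):
    ratio = sizes_kb[(150, quality)] / sizes_kb[(400, quality)]
    print(f"file size kept at quality {quality}: {ratio:.1%}")
```

The files shrink less than the pixel count does, but 150@90 still comes out smaller than 400@50.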
Last edited by jaffamuffin on 31 Jan 2021, 15:25, edited 1 time in total.
jaffamuffin
Posts: 22
Joined: 21 Oct 2011, 09:51
Number of books owned: 0

Re: Combining Split output into PDF

Post by jaffamuffin »

2021-01-31 19_06_33-U__examples_ - XnView MP.png
I have attached some examples: a 1 inch x 1 inch area scanned at 400 dpi, saved at different compression levels and scaled to 150/600.
Attachments
examples.zip
(502.19 KiB) Downloaded 269 times