HELP - Scan Tailor Project --> .pdf

Scan Tailor specific announcements, releases, workflows, tips, etc. NO FEATURE REQUESTS IN THIS FORUM, please.

Moderator: peterZ

dingodog
Posts: 110
Joined: 22 Jul 2010, 18:19
Number of books owned: 1000
Country: on the net
Location: on the net

Re: HELP - Scan Tailor Project --> .pdf

Post by dingodog »

So it is confirmed that even the latest pdf.py does not produce PDFs whose MEDIABOX equals the REAL size of the image as calculated from its DPI.

I have seen that jbig2enc.cc
http://github.com/agl/jbig2enc/blob/c70 ... big2enc.cc
http://github.com/agl/jbig2enc

now includes the DPI patch (before it was merged into the main trunk, I applied the DPI fix patch myself), which lets jbig2enc encode the DPI correctly. So the problem lies in the program, or the way, that produces PDFs from the resulting jbig2 files.

It is a matter worth investigating.

Meanwhile, if you know the physical size of the scanned book, you can apply this workaround

with the Impose tool included in Multivalent
- http://www.ziddu.com/download/1794145/M ... ar.gz.html (old version with tools, newer has only the viewer)

Code: Select all

java -cp /path...to/multivalent.jar tool.pdf.Impose -dim 1x1 -paper widthxheightin file.pdf

Here width and height stand for the page dimensions in inches. In our case, since my scan measures 8.5x11 inches:

Code: Select all

java -cp /path...to/multivalent.jar tool.pdf.Impose -dim 1x1 -paper 8.5x11in file.pdf

This sets the PDF MEDIABOX to the REAL size.

In my experience, OCR quality is not affected by this little problem: even when the DPI is not set, what matters for good OCR is a CLEAR, BIG (in size) image.
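
If the physical size is not known offhand, it can be worked out from the pixel dimensions of the scan and its DPI. Below is a minimal sketch in Python, assuming you know the DPI the page was output at. The helper name paper_arg is hypothetical (it is not part of Multivalent or pdf.py), and the output format simply mirrors the 8.5x11in value used in the command above.

```python
def paper_arg(width_px, height_px, dpi):
    """Build a 'WIDTHxHEIGHTin' string from pixel dimensions and DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # Physical size in inches = pixels / dots per inch
    width_in = width_px / float(dpi)
    height_in = height_px / float(dpi)
    # %g drops trailing zeros, so 11.0 is written as 11
    return "%gx%gin" % (width_in, height_in)


# A 5100x6600 pixel page at 600 DPI is a letter-size page
print(paper_arg(5100, 6600, 600))  # prints 8.5x11in
```

The resulting string is what would go after -paper in the Impose command.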
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: HELP - Scan Tailor Project --> .pdf

Post by Misty »

That's good to know, thanks! The incorrect value isn't just in /MediaBox. It's also in the q ... Q section of the page contents (the cm scaling matrix). Both of those need to be right to display the page correctly.

Another workaround is to hardcode the DPI value inside the pdf.py script. In lines 127 and 131, change the width and height values to (width*72)/600 and (height*72)/600, which converts pixel counts at 600 DPI into PDF points (1/72 inch).

Since most people will always have 600dpi Scan Tailor files, this bit of hardcoding will work despite being an ugly hack. I'd prefer a real solution in the near future though. ;)
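
To make the arithmetic concrete, here is a small Python sketch of the same pixel-to-point conversion with the DPI as a parameter instead of a hardcoded 600. The function name px_to_pt and the default value are assumptions for illustration only; this is not code from pdf.py.

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


def px_to_pt(pixels, dpi=600):
    """Convert a pixel count at the given DPI into PDF points."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * POINTS_PER_INCH / dpi


# A 5100x6600 pixel page at 600 DPI comes out as 612 x 792 points,
# i.e. 8.5 x 11 inches.
print("%g x %g pt" % (px_to_pt(5100), px_to_pt(6600)))  # prints 612 x 792 pt
```

The same two numbers would have to go into both /MediaBox and the q ... Q scaling matrix.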
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
La_Tristesse
Posts: 11
Joined: 18 Jun 2011, 21:47

Re: HELP - Scan Tailor Project --> .pdf

Post by La_Tristesse »

Like this?

Code: Select all

# Hardcoded 600 DPI: convert pixel dimensions to PDF points (1/72 inch)
contents = Obj({}, 'q %f 0 0 %f 0 0 cm /Im1 Do Q' % (float(width * 72) / 600, float(height * 72) / 600))
resources = Obj({'ProcSet': '[/PDF /ImageB]',
    'XObject': '<< /Im1 %d 0 R >>' % xobj.id})
page = Obj({'Type': '/Page', 'Parent': '3 0 R',
    'MediaBox': '[ 0 0 %f %f ]' % (float(width * 72) / 600, float(height * 72) / 600),
    'Contents': ref(contents.id),
    'Resources': ref(resources.id)})
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: HELP - Scan Tailor Project --> .pdf

Post by Misty »

Yup, pretty much.

Actually, my post was back in October. Since then, the author of jbig2enc has added a hack to pdf.py that makes it always assume 600 DPI. If you use the version of pdf.py from his GitHub page, it's done for you.