Toolchain for good PDF output

openbuddha
Posts: 13
Joined: 23 May 2011, 12:39
E-book readers owned: kindle 3, kindle 2 DX, Nook Color, iPad2
Number of books owned: 14000
Location: Oakland, CA
Contact:

Post by openbuddha »

Hi,

I'm new here but have been working with Myles and Daniel Reetz on laser cut kit versions of the DIY Book Scanner recently.

On a not-quite-DIY-Book-Scanner note, I've been using a new Fujitsu ScanSnap S1500M on my Mac for books that I don't mind destroying, but I'm having trouble getting good-quality PDFs that aren't huge for final output. I've read through a bunch of threads here, but Daniel suggested that I ask since I'm not finding a good answer.

My toolchain right now is:
1) Scan the book in the ScanSnap
a) This outputs 300 DPI grayscale JPG files. (The scanning wizard only offers JPG or PDF.) I turn off compression on these before scanning. I could scan at 600 DPI, but the process slows down dramatically.
2) Run the directory of JPGs through Scan Tailor for deskewing, centering, cropping, etc.
a) Output is 300 DPI grayscale TIFFs
3) Create the PDF in the current Adobe Acrobat Pro
a) Initial PDF size without optimization is something like 90 MB for a 220-page book, which is HUGE.
b) I don't want to OCR, as a lot of my books have non-Roman characters in them (Chinese or Japanese)

I've tried saving the PDFs using various "Optimized Scanned Document" options. If I do any image resampling there, I often wind up with 10 MB files that show visible speckling and a clear loss of quality. By contrast, my original PDF at 90 MB or so is beautiful and crisp at 300% zoom.

My goal is a 10 - 20 MB sized image-based PDF that looks crisp. This is all text with the occasional chart or line drawing but no photos or color.

I've tried playing with the various resampling options in Adobe but I seem to either get 50+ MB fairly good books or 4 - 20 MB really crap books. I cannot seem to find the sweet spot for working with the files I'm pulling in.

Daniel said that a few people here have some good voodoo for working with Adobe Acrobat and processing these images into fairly good quality books so I'm hoping for suggestions on settings or sequences or processing to output some better quality image based PDFs.

Any suggestions?
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Toolchain for good PDF output

Post by Misty »

I've got a couple questions. When you say you output greyscale images, do you mean you're outputting 8-bit images that have the page texture and all? Or are you processing the images into pure black/white scans in Scan Tailor?

The problem with full greyscale images is that they take up a lot of space compressed. 90MB is not unreasonable for a PDF of compressed greyscale images at a reasonable resolution, unfortunately. You might be able to get a smaller size by reducing the resolution of your pages, but that also limits the amount you can zoom into them. Unfortunately, I'm not sure how much better you can get while retaining an image that keeps the page texture and the like.
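
As a rough illustration of that arithmetic, here is a minimal Python sketch. The 6 x 9 inch page size is an assumption made up for the example; the page count, resolutions and the 90 MB figure come from the thread. It only computes uncompressed sizes, so it says nothing about what a particular compressor will achieve.

```python
# Rough size arithmetic for a scanned book.
# ASSUMPTION: 6 x 9 inch pages (not stated in the thread).

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one page image in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

PAGES = 220  # page count mentioned in the thread

grey_300 = raw_page_bytes(6, 9, 300, 8) * PAGES  # 8-bit greyscale, 300 DPI
mono_600 = raw_page_bytes(6, 9, 600, 1) * PAGES  # 1-bit black/white, 600 DPI

print(f"greyscale 300 DPI, uncompressed: {grey_300 / 1e6:.0f} MB")
print(f"bilevel   600 DPI, uncompressed: {mono_600 / 1e6:.0f} MB")
print(f"compression implied by a 90 MB PDF: {grey_300 / 90e6:.1f}:1")
```

Under these assumptions a 90 MB greyscale PDF is already compressed by roughly an order of magnitude, and a 1-bit page at twice the resolution starts out at half the raw size before any bilevel compression is applied.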

Processing them into pure black/white in Scan Tailor will dramatically decrease the filesize you get. Make sure Scan Tailor is operating in "black and white" mode; otherwise it will output in 8-bit mode and Acrobat will end up getting a worse compression ratio. If you're using a version of Acrobat pre-X, then make sure you set the "convert to PDF --> TIFF" option for monochrome compression to JBIG2 (lossless) or JBIG2 (lossy).
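
To make "pure black/white" concrete, here is a minimal sketch of turning 8-bit greyscale pixels into two-level pixels. It uses a plain global Otsu threshold, which is a stand-in chosen for illustration and is not claimed to be what Scan Tailor does internally. The function names and the tiny sample "page" are hypothetical.

```python
# Illustration only: global Otsu thresholding in pure Python.
# ASSUMPTION: this is a stand-in, not Scan Tailor's own binarisation.

def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their grey values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels):
    """Map every pixel to 0 (ink) or 255 (paper)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]

# Hypothetical sample: four ink pixels among six paper pixels.
page = [25, 30, 35, 210, 220, 215, 225, 205, 30, 218]
print(binarize(page))
```

After this step each pixel carries one bit of information instead of eight, and the paper texture is gone, which is what lets a bilevel codec such as JBIG2 shrink text pages so far.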

If the charts have more than just line drawings and don't look good in black and white mode, it gets a little more complicated, but you can still get good filesizes. Process the pages with images in Scan Tailor as "mixed mode" and ensure that the images are properly selected, then process your book using PDFBeads. It can separate the images from the black/white text so that each gets compressed separately, and you can still get a smaller output PDF.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

Post by openbuddha »

Misty wrote: I've got a couple questions. When you say you output greyscale images, do you mean you're outputting 8-bit images that have the page texture and all? Or are you processing the images into pure black/white scans in Scan Tailor?
I'm saying that the Fujitsu ScanSnap has three options: black and white, grayscale, and color. Of those, black and white outputs directly to PDF, while the latter two can save the individual pages as JPEGs with no (or minimal) compression.

I've been outputting the images from Scan Tailor as greyscale as well because when I've switched over to black and white images in the "output" section at the end of ST processing at 300 DPI, they didn't look very good inside ST during previewing.

Should I be doing all of the initial scanning in my scanner as 600 DPI or 1200 DPI B&W instead of doing 300 or 600 DPI greyscale there? Since I do have the occasional images, I've been wanting to input with more finesse than black and white seemed to offer.
Misty wrote: Processing them into pure black/white in Scan Tailor will dramatically decrease the filesize you get. Make sure Scan Tailor is operating in "black and white" mode; otherwise it will output in 8-bit mode and Acrobat will end up getting a worse compression ratio. If you're using a version of Acrobat pre-X, then make sure you set the "convert to PDF --> TIFF" option for monochrome compression to JBIG2 (lossless) or JBIG2 (lossy).
Is operating Scan Tailor in "black and white" mode different than choosing "black and white" along with DPI (though I chose 300 there since that was what I scanned at) during the "output" stage at the end?

Post by Misty »

openbuddha wrote: I'm saying that the Fujitsu ScanSnap has three options: black and white, grayscale, and color. Of those, black and white outputs directly to PDF, while the latter two can save the individual pages as JPEGs with no (or minimal) compression.
You should definitely be scanning greyscale or colour in the scanner itself. Black and white processing happens within Scan Tailor.
openbuddha wrote: I've been outputting the images from Scan Tailor as greyscale as well because when I've switched over to black and white images in the "output" section at the end of ST processing at 300 DPI, they didn't look very good inside ST during previewing.
What didn't you like about the black and white output from Scan Tailor? There may be a way to fix that.

By the way, though it may sound strange, you can output higher resolution from Scan Tailor than your input scan and get good results. 600 dpi black and white scans are the norm for Scan Tailor output, even though the input is typically well below 600 dpi.
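
One way to see why the extra output resolution matters for black and white pages: a two-level image has no grey values to soften edges, so every stroke has to be drawn in whole pixels. The sketch below just converts a stroke width to pixels at each resolution. The 0.2 mm stroke is a made-up example value, not a measurement from any book in this thread.

```python
# How many pixels wide is a thin pen or type stroke at a given resolution?
# ASSUMPTION: the 0.2 mm stroke width is a hypothetical example value.

MM_PER_INCH = 25.4

def stroke_width_px(stroke_mm, dpi):
    """Width of a stroke in pixels at the given dots-per-inch."""
    return stroke_mm / MM_PER_INCH * dpi

for dpi in (300, 600):
    width = stroke_width_px(0.2, dpi)
    print(f"{dpi} dpi: a 0.2 mm stroke spans about {width:.1f} px")
```

At 300 dpi such a stroke is only two or three pixels wide, so a one-pixel rounding error is a large fraction of its width. At 600 dpi the same error is half as significant, which fits the advice to output 600 dpi black and white even from lower-resolution input.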
openbuddha wrote: Is operating Scan Tailor in "black and white" mode different than choosing "black and white" along with DPI (though I chose 300 there since that was what I scanned at) during the "output" stage at the end?
You would be using the "black and white" option in the output stage in Scan Tailor.

Post by openbuddha »

Misty wrote: What didn't you like about the black and white output from Scan Tailor? There may be a way to fix that.

By the way, though it may sound strange, you can output higher resolution from Scan Tailor than your input scan and get good results. 600 dpi black and white scans are the norm for Scan Tailor output, even though the input is typically well below 600 dpi.
This is probably the problem. I was selecting "300 dpi" there as I knew that's what the input was.

I'm also upping my input to 600 dpi, because why not? That's what the scanner can do in greyscale, and it's just disk space.

Since I do have Acrobat X, I am playing with ClearScan for final output. As I mentioned, I generally dislike OCR because I have so many non-standard (non-Roman) characters in my works, and that plays holy hell with OCR.

Post by Misty »

Let me know if that helps. The extra resolution may fix things up. Varying line width might help too.

I'm curious to know how ClearScan compares for filesize. I've typically used JBIG2 lossy, which produces appropriately tiny filesizes.