
My workflow and tip for reducing pdf file size

Ricardo
Posts: 8
Joined: 07 Feb 2014, 20:55
E-book readers owned: Samsung tablet
Number of books owned: 23
Country: Australia

My workflow and tip for reducing pdf file size

Post by Ricardo » 09 Feb 2014, 19:52

I will admit that at the moment, since my camera passed away, my only book scanner is a commercial flatbed scanner.

However, it scans to TIFF at up to 600 dpi.

My workflow consists of scanning the book to 400 dpi greyscale TIFF, then running ScanTailor to get the output. As I only have black-and-white books, in ScanTailor I choose the mixed output mode to keep the pictures as greyscale.

I then combine the TIFFs into a PDF through Acrobat, and run OCR with ClearScan next...

Then save....

To reduce the file size even more, I then print it out through Adobe print. This seems to yield far better results with the pictures than leaving it as is, and it reduces the file size to boot. As well, all pages will be made the same size.

Here is an example of an out-of-copyright book I did. The book is only a small 4 x 6 book; the PDF is A4 sized, I believe...

https://dl.dropboxusercontent.com/u/462 ... inding.pdf

The only thing I would like to do is move away from having to use Acrobat entirely and use all open source... But that ClearScan OCR is very good...
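
For anyone wanting to try an all-open-source route, here is a minimal sketch in Python. It only builds the command lines for two tools that are not part of the workflow above: img2pdf (to wrap the TIFFs in a PDF) and OCRmyPDF (to add a searchable text layer). The choice of tools, the folder and file names, and the helper `build_commands` are all assumptions for illustration, and the output will not necessarily match what Acrobat's OCR produces.

```python
from pathlib import Path


def build_commands(tiff_dir, combined_pdf, searchable_pdf):
    """Return two command lines (as argument lists) without running them."""
    # Sort by file name so the pages are combined in order.
    tiffs = sorted(str(p) for p in Path(tiff_dir).glob("*.tif"))
    # Step 1: wrap the TIFF pages in a single PDF.
    combine = ["img2pdf", *tiffs, "-o", combined_pdf]
    # Step 2: add an OCR text layer to that PDF.
    ocr = ["ocrmypdf", combined_pdf, searchable_pdf]
    return [combine, ocr]


if __name__ == "__main__":
    # "out" is assumed to be the folder holding the processed TIFFs.
    for cmd in build_commands("out", "book.pdf", "book_ocr.pdf"):
        print(" ".join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)`, provided both tools are installed.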

daniel_reetz
Posts: 2776
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: My workflow and tip for reducing pdf file size

Post by daniel_reetz » 09 Feb 2014, 19:59

Very impressive scan! What's the overall size of the book, roughly speaking?

Ricardo
Posts: 8
Joined: 07 Feb 2014, 20:55
E-book readers owned: Samsung tablet
Number of books owned: 23
Country: Australia

Re: My workflow and tip for reducing pdf file size

Post by Ricardo » 09 Feb 2014, 20:14

The book pages are 4.5" X 7"..
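
As a rough worked example of what those dimensions mean at the 400 dpi greyscale setting mentioned earlier, the sketch below computes the pixel dimensions of one page and its uncompressed size. The function names are made up for illustration, and 8 bits per pixel is assumed for greyscale. Real TIFF and PDF files are compressed, so actual sizes will be smaller.

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_bytes(width_px, height_px, bits_per_pixel=8):
    """Uncompressed image size in bytes (8 bits per pixel assumed for greyscale)."""
    return width_px * height_px * bits_per_pixel // 8


w, h = pixel_size(4.5, 7, 400)
print(w, h)             # 1800 2800
print(raw_bytes(w, h))  # 5040000, i.e. about 5 MB per page before compression
```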

One thing I found was that printing to a PDF file also makes all the pages the same size. After processing you can end up with some pages of different sizes, as I encountered when I had to go back, rescan a missing or unclear page, and insert it into the finished PDF...

I could have improved the end result by going through and manually selecting all the pictures in ScanTailor. As it is, it is an amazing piece of software, but it does have trouble selecting all of a black-and-white picture or drawn diagram...

cday
Posts: 226
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: My workflow and tip for reducing pdf file size

Post by cday » 12 Feb 2014, 14:45

Really excellent text, really excellent photos and a very acceptable file size...

But do you know that the final output PDF is not searchable? That may not be important with this book, but it would be an important consideration for many people. Perhaps I can make some comments.

First a question: the photos reproduced very well, with no sign of any moiré patterns, which can be a problem when scanning halftone images. Did you use a descreening setting in your scanner software, and possibly experiment to find the optimum setting, or were artefacts simply not present when the photos were scanned at 400 DPI?

And out of interest, what model of scanner did you use?

Looking at the workflow:
Ricardo wrote: I ... combine the TIFFs into a PDF through Acrobat, and run OCR with ClearScan next... Then save...
The ClearScan stage has clearly worked well, producing very acceptable text at 400 DPI without producing an excessive number of fonts, as has sometimes been reported. At a quick count, 22 fonts are used in total (File > Properties...), and looking at the pages, the book itself uses rather more fonts than one might expect, so there is probably little duplication.
Ricardo wrote: ... I then print it out through Adobe print. This seems to yield far better results with the pictures than leaving it as is...
That looks like a question of the settings used for each form of output: if it is possible to obtain high quality images using 'print', then it should be possible to obtain the same quality images earlier, immediately after the ClearScan stage. It seems likely that the images at the earlier stage are either being downsampled, or compressed differently, or both.

The Adobe Acrobat interface is fairly complex and seems to change regularly, but if you can find the relevant settings you should be able to obtain the same high quality after the ClearScan stage. I see from the file properties that you are using Acrobat 9.
Ricardo wrote:... I then print it out through adobe print ... all pages will be made the same size...
That seems a novel and interesting way of equalising page sizes, a necessary step when producing a high quality scan of a book. The A4 page size that resulted is probably of no practical consequence when viewing the book on screen, or even when printing with a 'fit to page' option. But again, there could be a setting somewhere to change that, although possibly not to set a custom page size to exactly match the original.

But your scans have already been processed by ScanTailor, which has a facility to equalise page sizes in the 'Margins' step... I'm not very familiar with ScanTailor, but did you have a reason for not using that option?
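
To make the idea of equalising page sizes concrete, here is a minimal sketch in Python. It is not ScanTailor's or Acrobat's actual algorithm, just one simple assumed approach: pick a canvas as large as the widest and tallest page, then centre every page on it. The function name and the sample page sizes are hypothetical.

```python
def equalise(page_sizes):
    """Given (width, height) pixel sizes, return a common canvas size
    and the (x, y) offset that centres each page on that canvas."""
    canvas_w = max(w for w, _ in page_sizes)
    canvas_h = max(h for _, h in page_sizes)
    offsets = [((canvas_w - w) // 2, (canvas_h - h) // 2) for w, h in page_sizes]
    return (canvas_w, canvas_h), offsets


# Three pages that came out slightly different after processing.
pages = [(1800, 2800), (1790, 2812), (1806, 2796)]
canvas, offsets = equalise(pages)
print(canvas)   # (1806, 2812)
print(offsets)  # [(3, 6), (8, 0), (0, 8)]
```

Padding to the largest page, rather than scaling to a fixed size such as A4, keeps every scan at its original resolution.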
Ricardo wrote:I then print it out through adobe print... .... and reduces the file size to boot
It seems likely that running the output through 'print' is the cause of the loss of searchability, and might also explain the reduction in file size, even though the images are in fact higher quality. When ClearScan is used it is probably necessary to store the position of each word, as the PDF file format works differently from a word processor, where text flows automatically between the margins.

So, really excellent quality output and a small file size, but not searchable, and the workflow to obtain the same high quality result could possibly be simplified. If the 'print' stage can be eliminated, searchability should be maintained. Alternatively, if a 'print' stage is the optimum way to equalise page sizes, performing the ClearScan stage after equalising the page sizes would also produce searchable text.
Ricardo wrote: The only thing I would like to do is move away from having to use Acrobat entirely and use all open source...
Yes!!

dpc
Posts: 272
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: My workflow and tip for reducing pdf file size

Post by dpc » 12 Feb 2014, 19:24

Curious about what happened with the artifact on page 15? Was that actually on the book's page or something that was picked up along the way?

[Image: the artifact on page 15]

Ricardo
Posts: 8
Joined: 07 Feb 2014, 20:55
E-book readers owned: Samsung tablet
Number of books owned: 23
Country: Australia

Re: My workflow and tip for reducing pdf file size

Post by Ricardo » 13 Feb 2014, 07:25

cday wrote:Really excellent text, really excellent photos and a very acceptable file size...

But do you know that the final output PDF is not searchable? That may not be important with this book, but it would be an important consideration for many people. Perhaps I can make some comments.

First a question: the photos reproduced very well, with no sign of any moiré patterns, which can be a problem when scanning halftone images. Did you use a descreening setting in your scanner software, and possibly experiment to find the optimum setting, or were artefacts simply not present when the photos were scanned at 400 DPI?

And out of interest, what model of scanner did you use?

Looking at the workflow:
Thanks, the scanner is a Plustek 3800... I have not played with any of the settings, other than selecting the dpi and greyscale. There are options in the software, but I have not gone into them in any depth at this point; I have only had the scanner a couple of weeks.

In the original scans the pictures look like they do in the finished PDF, except that you can really notice they are just made up of a large number of dots... The post-processing has smoothed out the dots to resemble a better image, maybe not quite as sharp, but the difference is negligible.
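
One plausible explanation for the smoothing is that a later processing step averaged neighbouring pixels (for example when downsampling), which blends halftone dots into intermediate greys. The toy sketch below is an assumption for illustration, not what Acrobat or ScanTailor actually does: a 3x3 mean filter applied to a checkerboard of black and white "dots".

```python
def mean_filter(image):
    """3x3 mean filter over a 2D list of grey values (0-255).
    Pixels at the edges average only their in-bounds neighbours."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            row.append(sum(neighbours) // len(neighbours))
        out.append(row)
    return out


# A crude stand-in for a halftone: alternating white (255) and black (0) pixels.
dots = [[255 if (r + c) % 2 == 0 else 0 for c in range(6)] for r in range(6)]
smooth = mean_filter(dots)
print(smooth[2][2], smooth[2][3])  # 141 113
```

The hard 0/255 pattern becomes a nearly uniform mid-grey, at the cost of some sharpness, which matches the trade-off described above.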

I used print to file to resize the PDF pages because I had to rescan some pages that were not to my liking... After running them through ScanTailor the same as the others and inserting them into the PDF, the inserted pages ended up a different size... Print to PDF seemed like a way to get them all the same size (A4).

I know the software I have can yield better results, but it is a matter of time available and wanting to learn the software as well... I still keep all the raw scans as TIFFs, so down the track better results may be achievable.

I was not aware the searchable text had been removed.... I agree it was probably the printing to file that did this...

I need to play with the scanner software and the other software some more... But I am pretty happy with the results I am currently getting and it is fairly automated...

dpc wrote:Curious about what happened with the artifact on page 15? Was that actually on the book's page or something that was picked up along the way?

That is an ink smudge on the book..

Ricardo
Posts: 8
Joined: 07 Feb 2014, 20:55
E-book readers owned: Samsung tablet
Number of books owned: 23
Country: Australia

Re: My workflow and tip for reducing pdf file size

Post by Ricardo » 22 Feb 2014, 22:17

I have done some playing around with the scanner, and it seems to have software-based scan settings that it must have set automatically: I have not touched them, but they are not at their defaults.

The settings are just for the normal basic image manipulation that most software has... Contrast, brightness, saturation, descreen... That sort of thing...

There was no descreening done in the example scanned book above...

It is interesting playing around trying to balance quality with pdf final size... There does not seem to be a one rule to rule them all sort of thing... Every book has to be done differently depending on what you want your final result to be....

For my own use, a huge file size for a top-quality image is not an issue, and you can probably stick with greyscale rather than the more compact but sometimes visually inferior black and white.

For internet usage, you need smaller size and will have to have some trade offs...
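
As a back-of-the-envelope illustration of those trade-offs, the sketch below estimates how the uncompressed image data shrinks when the resolution or bit depth is reduced. The function name is hypothetical and the figures ignore compression entirely, so treat them only as a rough guide: pixel count scales with the square of the dpi, and 1-bit black and white needs one eighth of the data of 8-bit greyscale.

```python
def relative_raw_size(old_dpi, new_dpi, old_bits, new_bits):
    """Uncompressed size of the new settings as a fraction of the old ones."""
    return (new_dpi / old_dpi) ** 2 * (new_bits / old_bits)


# Downsampling 400 dpi greyscale to 300 dpi greyscale.
print(relative_raw_size(400, 300, 8, 8))  # 0.5625
# Keeping 400 dpi but converting greyscale to black and white.
print(relative_raw_size(400, 400, 8, 1))  # 0.125
```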

Although I imagine the newest expensive software probably gives you the best results.
