Scanned 670 pages --> 73 MB, how can I downsize?

airwolfke
Posts: 6
Joined: 31 Jul 2013, 13:08
E-book readers owned: iPad
Number of books owned: 45
Country: Belgium

Post by airwolfke » 31 Jul 2013, 14:01

Hello,

I'm new to scanning books and want to get more into it.
I scanned a bible (about 670 pages) at 600 DPI. The file was about 130 MB, and I ran ClearScan on it to clean up the file and make it searchable via OCR.
That brought the size down to about 73 MB.

I chose 600 DPI because I want good resolution, and I can probably always downsample after scanning.

The book is plain text in black and white. There is an occasional small map, but no pictures.
I tried saving the file as a Reduced Size PDF (Adobe Acrobat Pro), which gave a file size of 63 MB.

What am I doing wrong? I would like a smaller file with good quality. Does anybody have a suggestion for reducing the file further, to something like 30 MB or even less? Ideally there would be no loss in quality, of course.

I ran Audit Space Usage (in Adobe Acrobat Pro), which showed that 74% of the file (55 MB) is used for fonts. On the Adobe forum I was told this:
The problem is your choice of using Clearscan as your OCR method. The fonts created are taking up more than 74% of your documents storage. You can re-try using a different OCR method.


Anyway, I don't know exactly what that person means, and I'm still looking for a better way or a solution to my problem.

Help is much appreciated!
thank you so much,

Ruben
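
As a rough check on the figures in the post above, the sizes can be turned into a per-page budget. This is only a sketch of the arithmetic: it assumes 1 MB = 1000 KB and takes the 74% font share at face value. It suggests that the ClearScan fonts alone (about 54 MB, close to the 55 MB reported) already exceed the 30 MB target, so image optimization by itself is unlikely to reach it while those fonts remain.

```python
# Figures quoted in the post above.
total_mb = 73       # file size after ClearScan
pages = 670
font_share = 0.74   # share reported by Audit Space Usage
target_mb = 30      # size the poster would like to reach

# Assumption: 1 MB = 1000 KB, purely to keep the numbers readable.
kb_per_page = total_mb * 1000 / pages
target_kb_per_page = target_mb * 1000 / pages

font_mb = total_mb * font_share
other_mb = total_mb - font_mb

print(f"current:  {kb_per_page:.0f} KB per page")
print(f"target:   {target_kb_per_page:.0f} KB per page")
print(f"fonts:    {font_mb:.0f} MB, everything else: {other_mb:.0f} MB")
```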

dtic
Posts: 438
Joined: 06 Mar 2010, 18:03

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by dtic » 31 Jul 2013, 16:23

Do you import the scanned images directly to Acrobat? If so, try preprocessing the images in Scan Tailor http://scantailor.sourceforge.net/ and then use Acrobat Clearscan on the tif images that Scan Tailor produces.

cday
Posts: 226
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by cday » 01 Aug 2013, 16:58

It might be worth trying saving the original images with CCITT Group 4 (fax) compression which at 600dpi should produce good quality black and white text and line illustrations with a reasonable file size. JBIG2 compression, if available, might also be worth trying and could potentially be considerably smaller. That is a lot of pages, though, and if the file size isn't suitably reduced you could try downsampling the images to 300dpi.

If the text needs to be searchable, you would have to OCR the images after conversion to one of the above formats, in Acrobat or using a dedicated OCR program. An advantage of viewing the original images rather than ClearScan text is that you see the original text itself: any recognition errors introduced by OCR would then only affect the searchability of the text, not what is displayed. That might be important for the bible!

This does look like a case of ClearScan, for all its advantages or at least potential advantages, generating a large number of fonts unnecessarily, which has been mentioned before in posts. It's interesting that it is an issue even at 600dpi.
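
For readers wondering why fax-style compression suits black and white text pages: such pages consist mostly of long runs of white pixels broken by short runs of black. The sketch below shows plain run-length encoding on a single made-up scan line. It is not an implementation of CCITT Group 4 or JBIG2, which are considerably more elaborate; it only illustrates the kind of redundancy those codecs exploit. The function names and the sample row are hypothetical.

```python
from itertools import groupby


def encode_runs(row):
    """Collapse a row of 0/1 pixels into (pixel, run_length) pairs."""
    return [(pixel, sum(1 for _ in group)) for pixel, group in groupby(row)]


def decode_runs(runs):
    """Expand (pixel, run_length) pairs back into the original row."""
    row = []
    for pixel, length in runs:
        row.extend([pixel] * length)
    return row


# Hypothetical scan line: white margin, two short black strokes, white margin.
row = [0] * 40 + [1] * 5 + [0] * 10 + [1] * 5 + [0] * 40
runs = encode_runs(row)

print(len(row), "pixels ->", len(runs), "runs")
print(runs)
```

Decoding the runs gives back exactly the same pixels, which is what makes this kind of scheme lossless.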

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by airwolfke » 02 Aug 2013, 07:11

dtic wrote: Do you import the scanned images directly to Acrobat? If so, try preprocessing the images in Scan Tailor http://scantailor.sourceforge.net/ and then use Acrobat Clearscan on the tif images that Scan Tailor produces.
Thank you for your advice. I scanned the images with PaperPort 14 in PDF format, and I see that Scan Tailor imports TIFF files.
What is the best method to scan a book? Scan to TIFF?

Thanks!

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by airwolfke » 02 Aug 2013, 07:18

Thanks for the hint, cday! I should mention that I had already tried JBIG2 (lossy) compression. That gave a file size of about 30 MB, which is acceptable to me, but the quality went down. ClearScan provides much better quality.
I also gave CCITT a try, and the size was still 115 MB, so that doesn't help...
Any other suggestions?

Or should I just stick with the ClearScan file?
Thanks,

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by airwolfke » 02 Aug 2013, 08:22

Optimizing the file as JBIG2 (lossless), with the slider handle at about the middle, brings me to 92 MB, without OCR.

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by cday » 02 Aug 2013, 09:02

Sorry that hasn't helped but you are processing a large number of pages and 600dpi is quite a high resolution.

Have you tried downsampling the images to 300dpi (or maybe 400dpi) and then compressing with CCITT Group 4 or JBIG2 lossless?

If you only require the text, and don't need to preserve the page layout, another possibility would be to extract the text in its multitude of fonts from the ClearScan file and then paste it into a word processor. You could then select the whole text and set it to any font you wish. That should produce the smallest possible file.

You can copy the text from a PDF file by highlighting it and copying, or using Ctrl + A then Ctrl + C for the whole document. Given the number of fonts used it might be necessary to do it in sections, though, to avoid crashing Acrobat. You might also lose the illustrations, although I think you could select, copy and paste them individually into Word if necessary.

Note for others: When an image is black and white with only line illustrations, CCITT Group 4 or JBIG2 do produce much smaller files than the same images as colour or grayscale, but that doesn't help here.
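
To put rough numbers on the effect of resolution and colour depth, the sketch below computes the uncompressed size of one page. The 6 x 9 inch page size is an assumption made for illustration (the thread does not give the book's dimensions), and real files are much smaller once compressed. What carries over is the ratio: halving the resolution from 600dpi to 300dpi quarters the pixel count, and 1-bit black and white starts at one eighth the size of 8-bit grayscale.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at the given resolution."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumption: a 6 x 9 inch page; the actual book size is not stated.
WIDTH_IN, HEIGHT_IN = 6, 9

for dpi in (600, 400, 300):
    for bits, label in ((1, "black and white"), (8, "grayscale"), (24, "colour")):
        size_mb = raw_bytes(WIDTH_IN, HEIGHT_IN, dpi, bits) / 1_000_000
        print(f"{dpi} dpi, {label:15s}: {size_mb:6.2f} MB per page")

ratio = raw_bytes(WIDTH_IN, HEIGHT_IN, 600, 1) / raw_bytes(WIDTH_IN, HEIGHT_IN, 300, 1)
print("600dpi vs 300dpi:", ratio)
```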

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by airwolfke » 02 Aug 2013, 09:18

Hey cday, thanks,
I would like to keep the page layout... otherwise it would have been a good solution.
Saving as an optimized PDF brought the file down to 65 MB.
I guess that is not bad for a 670-page book at good quality? :)

thanks,

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by cday » 02 Aug 2013, 09:23

airwolfke wrote:

Optimizing the file as JBIG2 (lossless), handle in about the middle, brings me to 92mb, without OCR
You might try again using another handle setting at whichever end gives the smallest file size. As you are using JBIG2 (lossless) it can't affect the quality of the images, and I suspect that the effect will be to increase compression at the expense of it taking slightly longer for each image to be displayed. That seems to be the case with PNG, which is also lossless and also has a selectable setting.

But as you say, maybe 65MB isn't bad for a 670 page book with excellent quality text, and it will soon seem smaller in size as memory technology advances.
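
The PNG comparison can be illustrated with DEFLATE, the lossless method PNG uses, through Python's standard zlib module. This is only an analogy: whether the Acrobat slider for JBIG2 (lossless) behaves the same way is not confirmed here. The synthetic "page" is a made-up repeating row of mostly white bytes with a short black stroke. The point is that a lossless compression level trades encoding effort for size, and every level decompresses to exactly the same data.

```python
import zlib

# Hypothetical page data: one row of mostly white bytes with a short black
# stroke, repeated 2000 times.
row = b"\x00" * 60 + b"\xff" * 8 + b"\x00" * 32
page = row * 2000

fast = zlib.compress(page, 1)    # least effort
small = zlib.compress(page, 9)   # most effort

# Both settings are lossless: decompressing gives back the identical bytes.
identical = zlib.decompress(fast) == page and zlib.decompress(small) == page

print("original:", len(page), "bytes")
print("level 1: ", len(fast), "bytes")
print("level 9: ", len(small), "bytes")
print("round trip identical:", identical)
```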

Re: Scanned 670 pages --> 73 MB, how can I downsize?

Post by dtic » 02 Aug 2013, 12:04

airwolfke wrote:
dtic wrote:Do you import the scanned images directly to Acrobat? If so, try preprocessing the images in Scan Tailor http://scantailor.sourceforge.net/ and then use Acrobat Clearscan on the tif images that Scan Tailor produces.
thank you for your advice, I have scanned the images with PaperPort 14 in PDF format, I see that Scan Tailor is for TIFF format importing.
What is the best method to scan a book? Scan in TIFF?
Scan Tailor works well with grayscale input images, for example jpg images. Scan Tailor can output black and white tif images that you then run through Acrobat or some other pdf tool.

If your scanned images are saved directly to a pdf format then I think you can extract them as jpg files using Acrobat, and then run those jpg files through Scan Tailor.
