Reduce pdf file size

korben1
Posts: 7
Joined: 04 Mar 2015, 17:28
Number of books owned: 0
Country: Switzerland

Reduce pdf file size

Post by korben1 » 04 Jun 2015, 10:16

Hello,

I would like to reduce the file size of my PDF documents... like a lot of people here ;). I'm on Linux (but people on Windows can certainly help me too), and my problem is that I want good-quality scans. I have noticed that a scan looks better on screen at 600 dpi than at 300 dpi, so I want to scan at 600 dpi. Also, a lot of my scans are color scans. If they were black and white, there would be no problem.

With these constraints it's not easy to reduce the file size. At the moment I get 2-3 MB per page. I sometimes need to scan 30-40 pages, so the resulting file is too large. After searching, I found this link:

http://en.flossmanuals.net/e-book-enlightenment/index/

They explain that the file size can be reduced with a multi-layer technique: the text layer is stored in high quality, the images are downsampled, and the background is highly compressed. Unfortunately I can't find out how to do this or which software to use. Does anybody have an idea?

For the moment I have found an alternative, but I still don't understand why it works. I scan at 600 dpi, then downsample to 300 dpi, and then resize the image to 50%. If I print my scan, the printed size remains the same, and on my computer the quality looks the same to me as the scan at 600 dpi. So I can reduce the file size a lot.

But I would like to compare this with the first technique. If you know how to do it on a system other than Linux, you can also describe the steps needed and I will try to find the necessary software.

Thank you!

dtic
Posts: 431
Joined: 06 Mar 2010, 18:03

Re: Reduce pdf file size

Post by dtic » 04 Jun 2015, 12:45

Since you don't mention it in your post I'd better ask: have you tried Scan Tailor, http://scantailor.org/ ? With it you can process color images into mixed-mode TIFF images: black and white in areas where there is only text, and color only in areas of the page where there are images or illustrations. Those TIFF images can then be converted into PDF files. There are many posts about Scan Tailor in these forums.

Re: Reduce pdf file size

Post by korben1 » 06 Jun 2015, 08:36

Thank you for your answer, now I understand better the mixed mode of scantailor!

I have tested it, but the advantage of using it is not so big, so I think I'll keep my solution. I have to correct one
thing. If I scan at 600 dpi and after that I downsample to 300 dpi, the quality is the same as scanning directly to
scanning at 600 dpi.

If someone is interested: after scanning my pictures I adjust the levels with ImageMagick. I can do it on the command line, so it's possible to integrate it into a script. I do something like:

 mogrify -normalize -level 0%,91% -gamma 0.6 scan.tif

Then I resize the picture to 50% and convert it to JPEG. On my screen, the picture is still big enough to be read, and if I need to print it, I can resize it to 200% and the quality is still good. With this method I get about 1 MB per page, a little less if I resize to less than 50%.
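The steps above (level adjustment, 50% resize, JPEG conversion) could be combined into one scripted call. The sketch below only builds the argument list for ImageMagick's `convert`, reusing the options from the `mogrify` line; the file names, the `-quality 85` value and the helper names are illustrative assumptions, not settings from this post. On ImageMagick 7 the program is normally invoked as `magick` instead.

```python
import subprocess


def build_convert_command(src, dst, scale_percent=50, jpeg_quality=85):
    """Build an ImageMagick argument list: adjust levels, resize, write a JPEG."""
    return [
        "convert", src,
        # Level adjustment copied from the mogrify command in the post.
        "-normalize", "-level", "0%,91%", "-gamma", "0.6",
        # A 50% resize halves both width and height.
        "-resize", f"{scale_percent}%",
        # Assumed JPEG quality; the post does not state one.
        "-quality", str(jpeg_quality),
        dst,
    ]


def process_scan(src, dst, **options):
    """Run the command. Requires ImageMagick to be installed."""
    subprocess.run(build_convert_command(src, dst, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(build_convert_command("scan.tif", "scan.jpg")))
```

Unlike `mogrify`, which overwrites `scan.tif` in place, `convert` writes a separate output file, so the original scan is kept.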

cday
Posts: 216
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Reduce pdf file size

Post by cday » 06 Jun 2015, 13:08

korben1 wrote:I would like to reduce the pdf file size of my documents... a lot of scans are color scans... at the moment I have between 2-3 M for a page.

I found this link:

http://en.flossmanuals.net/e-book-enlightenment/index/

They explain it's possible to reduce the file size by using a multi-layer technique. The text layer is stored in high quality, the images are downsampled and the background is highly compressed. But unfortunately I'm unable to find how I can make it, which software I have to use. Anybody has an idea ?

The multi-layer technique is commonly referred to as MRC for Mixed Raster Content. An image of a colour or grayscale page normally has a much larger file size than that of a black and white page, both because it requires a 24-bit or 8-bit depth image rather than a 1-bit depth image, and because black and white images can be stored particularly efficiently if an optimum compression method is used.
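The effect of bit depth alone can be checked with a little arithmetic. The sketch below computes uncompressed sizes only; the A4 page size is an assumption chosen for illustration, and real files will be smaller once any compression is applied.

```python
# Assumed page size for illustration: A4 expressed in inches.
A4_INCHES = (8.27, 11.69)


def raw_page_bytes(dpi, bits_per_pixel, page_inches=A4_INCHES):
    """Uncompressed size in bytes of one page image at the given resolution."""
    width_px = round(page_inches[0] * dpi)
    height_px = round(page_inches[1] * dpi)
    return width_px * height_px * bits_per_pixel / 8


if __name__ == "__main__":
    depths = (("24-bit colour", 24), ("8-bit grayscale", 8), ("1-bit black and white", 1))
    for dpi in (600, 300):
        for label, bits in depths:
            size_mb = raw_page_bytes(dpi, bits) / 1e6
            print(f"{dpi} dpi, {label}: {size_mb:.1f} MB uncompressed")
```

Before any compression, a 1-bit page is 24 times smaller than the same page in 24-bit colour, and halving the resolution from 600 to 300 dpi leaves a quarter of the pixels.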

When a page contains both colour or grayscale content and black and white content, it is normally necessary to store the whole page as a colour (or grayscale) image even when much of the page area is actually black and white, with a resulting large file size. With MRC, the different types of content are separated and stored as separate layers, with the result that only areas of the page that need to be colour or grayscale are stored as such. There is therefore a double benefit, in that the area of the page that needs to be stored in colour or grayscale is reduced, while the remainder of the page can be stored very efficiently in black and white.
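The separation idea can be illustrated with a toy sketch. This is not how any particular MRC encoder works: real encoders use far more careful segmentation and usually keep a foreground colour layer as well. The threshold of 128 and the 2x2 averaging are arbitrary assumptions, and the page is just a small grid of grayscale values.

```python
def split_layers(page, threshold=128, block=2):
    """Split a grayscale page (rows of 0-255 ints) into a full-resolution
    1-bit text mask and a downsampled background layer."""
    height, width = len(page), len(page[0])
    # Mask layer: 1 where the pixel is dark enough to be treated as text.
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    # Background layer: average the non-text pixels in each block x block tile.
    background = []
    for top in range(0, height, block):
        out_row = []
        for left in range(0, width, block):
            values = [
                page[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
                if not mask[y][x]
            ]
            # A tile made entirely of text has no background sample: use white.
            out_row.append(sum(values) // len(values) if values else 255)
        background.append(out_row)
    return mask, background


if __name__ == "__main__":
    page = [
        [255, 255, 0, 0],
        [255, 200, 0, 0],
        [250, 250, 250, 250],
        [250, 250, 250, 10],
    ]
    mask, background = split_layers(page)
    print(mask)
    print(background)
```

The mask keeps sharp text edges at one bit per pixel, while the background holds a quarter as many samples and can tolerate heavy lossy compression.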

The common file types that support multi-layer content are PDF and DjVu, the latter much less widely used but sometimes considered to produce smaller file sizes.

Note that Scan Tailor's 'mixed' mode (from memory...) converts text to pure black while leaving other image content unchanged, but the output file produced is still a single-layer image at a colour or grayscale bit depth, so there is no direct file size reduction. It is in effect an image enhancement process rather than a compression process.

korben1 wrote:For the moment I discovered an alternative, but I still don't understand why it works. I can scan at 600dpi. Then I downsample to 300dpi and then I resize the image of the file to 50%. If I print my scan, the size remain the same. And on my computer the quality I get is the same for me as the scan at 600 dpi. So I can reduce the file size a lot.

It isn't clear why you are obtaining that result. When file size is important, colour and grayscale images are normally best stored as JPEG files, although in principle the format isn't well suited to images containing sharp edges, such as text. The compression level or 'Quality' setting used each time the file is saved can have a considerable effect on the resulting file size, and it seems likely that the reduction you are seeing is either the result of the file being resaved at a lower setting, or some other unidentified effect.

When file size is important, JPEG images even of text can often be quite heavily compressed with little visible change in appearance; the strategy would normally be to try increasing the compression until visible deterioration is seen, and to then back off a bit. Over-compression typically produces compression artefacts in the form of faint gray smudges between the characters. It is as well to leave something in reserve, though, remembering that the resolution of screens is increasing and may increase further in the future.
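That strategy can be written down as a small search. In the sketch below, `looks_ok` is a hypothetical callback standing in for the visual check (for example, saving the page at that quality and inspecting it by eye); the list of quality steps and the one-step reserve are assumptions, not recommended values.

```python
def choose_quality(looks_ok, qualities=(90, 80, 70, 60, 50, 40, 30, 20, 10), reserve_steps=1):
    """Lower the JPEG quality step by step until the result is rejected,
    then back off by `reserve_steps` toward higher quality."""
    accepted = []
    for quality in qualities:  # ordered from light to heavy compression
        if not looks_ok(quality):
            break
        accepted.append(quality)
    if not accepted:
        raise ValueError("even the highest quality tried was rejected")
    # Step back from the lowest acceptable setting to leave something in reserve.
    index = max(len(accepted) - 1 - reserve_steps, 0)
    return accepted[index]


if __name__ == "__main__":
    # Pretend that artefacts become visible below quality 40.
    print(choose_quality(lambda quality: quality >= 40))
```

With the pretend check above, the lowest acceptable setting is 40 and the function returns the next step up.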

Re: Reduce pdf file size

Post by korben1 » 09 Jun 2015, 10:33

cday wrote:
korben1 wrote:I would like to reduce the pdf file size of my documents... a lot of scans are color scans... at the moment I have between 2-3 M for a page.

I found this link:

http://en.flossmanuals.net/e-book-enlightenment/index/

They explain it's possible to reduce the file size by using a multi-layer technique. The text layer is stored in high quality, the images are downsampled and the background is highly compressed. But unfortunately I'm unable to find how I can make it, which software I have to use. Anybody has an idea ?
The multi-layer technique is commonly referred to as MRC for Mixed Raster Content.

Thank you for the information, now I know I have to search for MRC. I didn't find any software with MRC support on Linux except ABBYY, but as I remember the quality was poor, so I don't know whether I missed some configuration option. Do you know of any software on Linux that can produce PDFs with MRC?

cday wrote:
For the moment I discovered an alternative, but I still don't understand why it works. I can scan at 600dpi. Then I downsample to 300dpi and then I resize the image of the file to 50%. If I print my scan, the size remain the same. And on my computer the quality I get is the same for me as the scan at 600 dpi. So I can reduce the file size a lot.
It isn't clear why you are obtaining that result.

Just have a look at my second post. There is a problem: when I convert to 300 dpi, the quality is the same as scanning at 300 dpi. So it doesn't work.

Re: Reduce pdf file size

Post by cday » 09 Jun 2015, 12:30

korben1 wrote:Thank you for the information, now I know I have to search for MRC. I didn't find a software with MRC on linux excepted ABBYY. But as I remember the quality was poor. So I don't know if I forgot some configurations. Do you know some software on Linux which can produce pdf with MRC ?

I'm a Windows user so someone who uses Linux should be able to give better advice; you may also find DjVu better supported and more widely used on Linux. A search on the forum should produce relevant information.

ABBYY FineReader and Nuance OmniPage are both quite complex programs in terms of the processing and saving options available, so there is quite a lot to understand in order to get the best out of them. And from my experience a while back some of the options aren't too well described in the documentation, so some experimentation is required. If you can obtain satisfactory results in terms of quality and file size saving to image formats (JPG for colour or grayscale, or TIFF with CCITT G4 'Fax' compression or JBIG2 for black and white) it would probably be the simplest option.

korben1 wrote:... Just have a look at my second post. There is a problem. When I convert to 300 dpi, the quality is the same as scanning to 300 dpi. So it doesn't work.

I'm sorry but I'm not sure that I follow what you are saying regarding image quality at 300 and 600 dpi, and I'm wondering whether there is a typo in your second post:

korben1 wrote:... If I scan at 600 dpi and after that I downsample to 300 dpi, the quality is the same as scanning directly ... at 600 dpi.

What I am suggesting is that if you experiment with increasing the JPG compression level (saving at a lower 'Quality' setting) you may well find that you can obtain 600 dpi quality at a significantly reduced file size, which was your original aim.

Re: Reduce pdf file size

Post by korben1 » 09 Jun 2015, 14:11

cday wrote: I'm a Windows user so someone who uses Linux should be able to give better advice; you may also find DjVu better supported and more widely used on Linux. A search on the forum should produce relevant information.

I think I'll stick with PDF. I have done a lot of research, so I don't think I'll search again. Two or three months ago I only found ABBYY. This time I found commercial solutions, but they were either too expensive or impossible to test.

cday wrote: ABBYY FineReader and Nuance OmniPage are both quite complex programs in terms of the processing and saving options available, so there is quite a lot to understand in order to get the best out of them. And from my experience a while back some of the options aren't too well described in the documentation, so some experimentation is required. If you can obtain satisfactory results in terms of quality and file size saving to image formats (JPG for colour or grayscale, or TIFF with CCITT G4 'Fax' compression or JBIG2 for black and white) it would probably be the simplest option.

I think it's the best option for the moment. So I'll run some tests again and ask on the mailing list if necessary. Perhaps I missed something the first time.

cday wrote: I'm sorry but I'm not sure that I follow what you are saying regarding image quality at 300 and 600dpi, and I'm wondering whether is a typo in your second post:
... If I scan at 600 dpi and after that I downsample to 300 dpi, the quality is the same as scanning directly ... at 600 dpi.

Interesting, I discovered it's not possible to edit a post. You are right, sorry... It should read: "If I scan at 600 dpi and after that I downsample to 300 dpi, the quality is the same as scanning directly ... at 300 dpi".

cday wrote: What I am suggesting is that if you experiment with increasing the JPG compression level (saving at a lower 'Quality' setting) you may well find that you can obtain 600dpi quality at a significantly reduced file size, which was your original aim.

Hmm, you are right. I don't know why, but I tried once again and it works. I had in mind that the file size could be reduced by increasing the compression, but that beyond a certain level, which I don't remember, the PDF file size started increasing again. I have to test it again. Perhaps it was because I was using JPEG 2000 instead of JPEG, or because I was adding a text layer to my PDF. I will check every possibility again.

Re: Reduce pdf file size

Post by cday » 09 Jun 2015, 14:34

korben1 wrote:I discovered it's not possible to edit a post.

Posts can only be edited for a few hours after posting (possibly up to 12 hours, although it might be less...) but not after there has been a reply to the post, I think.

If you intend to continue with ABBYY FineReader (I didn't know it was available for Linux, that's interesting...) I would suggest concentrating on producing PDFs with JPEG or JPEG2000 compression (which should produce slightly smaller file sizes) and understanding the settings for that first. Then, if you wish, look at MRC, but it's not too clearly documented, if indeed it's documented at all. The problem even with ABBYY is the number of possible combinations of settings that can be selected, with limited guidance on how to use them.

BruceG
Posts: 65
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Reduce pdf file size

Post by BruceG » 10 Jun 2015, 06:04

Hi,
I had a go at reducing file size by scanning at 300 and 600 dpi, saving both as PDF and JPEG. Then, with an OCR program, they were all OCR'ed and saved as PDF files, both as text and as an image. The page was from a book, with 2 colour photos, each with a title and credit. All were saved with the file options shown in the attachment. Image mode and Maximum Resolution: as is.
[Attachment: options image pdf.gif]
The difference between 'text' and 'image' is that the text in 'text' is an actual font, not just part of the image. It can be copied and pasted.

Results were:

300 dpi jpeg 2,898 kb
300 dpi jpeg OCR text 109 kb
300 dpi jpeg OCR image 39 kb

300 dpi pdf 1,019 kb
300 dpi pdf OCR text 492 kb
300 dpi pdf OCR image 67 kb

600 dpi jpeg 13,820 kb
600 dpi jpeg OCR text 1,463 kb
600 dpi jpeg OCR image 223 kb

600 dpi pdf 4,635 kb
600 dpi pdf OCR text 670 kb
600 dpi pdf OCR image 73 kb

Saving as an image produced the smallest files, but they cannot be searched. At 300 dpi, scanning as JPEG did better, but at 600 dpi scanning as PDF produced the smallest files.
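The comparison can be double-checked by turning the figures into ratios. The sketch below simply re-enters the numbers listed above (in kb, as posted); the dictionary layout and function names are illustrative.

```python
# Sizes in kb, copied from the results listed in this post.
results = {
    ("300 dpi", "jpeg"): {"scan": 2898, "ocr_text": 109, "ocr_image": 39},
    ("300 dpi", "pdf"): {"scan": 1019, "ocr_text": 492, "ocr_image": 67},
    ("600 dpi", "jpeg"): {"scan": 13820, "ocr_text": 1463, "ocr_image": 223},
    ("600 dpi", "pdf"): {"scan": 4635, "ocr_text": 670, "ocr_image": 73},
}


def reduction_factor(sizes, output):
    """How many times smaller the OCR output is than the original scan."""
    return sizes["scan"] / sizes[output]


def smallest(resolution, output):
    """Scan format that gave the smallest file for a resolution and output type."""
    candidates = {
        fmt: sizes[output]
        for (res, fmt), sizes in results.items()
        if res == resolution
    }
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    for (resolution, fmt), sizes in results.items():
        for output in ("ocr_text", "ocr_image"):
            factor = reduction_factor(sizes, output)
            print(f"{resolution} {fmt} -> {output}: {factor:.1f}x smaller")
```

Calling `smallest("300 dpi", "ocr_image")` and `smallest("600 dpi", "ocr_image")` picks out the scan format with the smaller output at each resolution.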

As others have mentioned, there are many options when saving all file types, not just those in the attachment.

I scanned a 10-volume book set with around 500 to 700 pages each, including colour pages. I played around with different options; the smallest PDF file was 7,354 kb and the largest 436,286 kb.
Scanning text and B&W photo pages as greyscale first, then the colour pages, certainly helps to keep files smaller.

0kelvin
Posts: 20
Joined: 10 Nov 2012, 17:14
Number of books owned: 0
Country: Brazil

Re: Reduce pdf file size

Post by 0kelvin » 03 Jul 2015, 22:17

How much quality do I lose by using maximum compression on black and white pages? I saved one book in FineReader using 10% quality for the background, and I saw virtually no quality loss. No artifacts, nothing; all graphs, equations and text remained intact.

Now I see how many of the books I've found on ebook sharing sites can be PDFs with black and white pages at 600 dpi, yet have a file size of 10 MB or less.
