Newbie Questions about Reducing File Size...

teaguetod
Posts: 2
Joined: 07 Oct 2016, 12:08
E-book readers owned: Sony Reader, Aluratek Libre, Kindle
Number of books owned: 1000
Country: USA

Newbie Questions about Reducing File Size...

Post by teaguetod »

Hello All:

I have been using the Cardboard Box DIY Scanner Method, and I've 'scanned' about 5 or 6 books so far. My question is: how best to proceed with post-processing to create a reasonably small (say, 25 MB or less) DJVU or PDF? And should I care whether the thing gets OCRed or not? Will performing OCR help to reduce the file size in the end?

Let me give you some background, if it matters. I happened to own a Canon A590 IS, which I hacked using CHDK to take time-lapse photos as fast as I can turn the pages. I can 'scan' about 7 or 8 pages per minute this way (one side at a time). I'm using ISO-80, which works well with 2 bright desk lamps on each side of the page I'm capturing. I leave the size of the JPEG image at the largest setting (3264 x 2448) to take advantage of the full 8 megapixels on the camera. But even this way, when I check the image properties, it still says only 180 DPI. (?)
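For what it's worth, here's the arithmetic as I understand it (a quick Python sanity check; the 6-inch page width is just a guess for illustration, and the 180 DPI in the properties seems to be only a metadata tag):

Code:

    # Rough sanity check on the resolution my camera images actually give me.
    # The "180 DPI" in the image properties appears to be just an EXIF tag;
    # the real resolution depends on how large the page is in the frame.

    pixels_wide = 3264          # Canon A590 IS at its largest JPEG setting
    page_width_inches = 6.0     # a guess: a smallish page filling the frame

    effective_dpi = pixels_wide / page_width_inches
    print(f"Effective resolution: {effective_dpi:.0f} DPI")  # about 544 DPI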

So I've tried some different post-processing programs (YASW, Scan Tailor), but the TIF output images I get are gigantic, with pixel dimensions like 6000 x 9000 or 10,000, because I leave the output resolution set to 600 DPI...

Anyway, that's where I'm at right now... Trying to figure out the best way to compress / shrink / whatever the images into a manageable format without compromising too much on the quality of the words or characters that will appear on the page...

Any advice would be greatly appreciated!

(I hope I'm posting this in the right place. Please let me know if I'm not.)
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: Newbie Questions about Reducing File Size...

Post by L.Willms »

teaguetod wrote: So I've tried some different post-processing programs (YASW, Scan Tailor), but the TIF output images I get are gigantic, with pixel dimensions like 6000 x 9000 or 10,000, because I leave the output resolution set to 600 DPI...

Anyway, that's where I'm at right now... Trying to figure out the best way to compress / shrink / whatever the images into a manageable format without compromising too much on the quality of the words or characters that will appear on the page...
One possibility is saving the images as PNG, i.e. with lossless compression. With black-and-white images, this should drastically reduce the file sizes.

I use PMView, but you might prefer the free alternative IrfanView. Both support batch conversions.
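
If you would rather script it, something like this with the free Pillow library should do the same batch conversion (a sketch only; the folder names are placeholders):

Code:

    # Batch-convert scans to 1-bit PNG (lossless) to shrink black-and-white pages.
    # Sketch using the Pillow library; adjust the paths to your own folders.
    from pathlib import Path
    from PIL import Image

    src = Path("scans")        # placeholder: folder with the large TIFFs
    dst = Path("scans_png")    # placeholder: output folder
    dst.mkdir(exist_ok=True)

    for tif in src.glob("*.tif"):
        # hard threshold (no dithering) suits text pages better
        img = Image.open(tif).convert("1", dither=Image.Dither.NONE)
        img.save(dst / (tif.stem + ".png"), optimize=True)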
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Newbie Questions about Reducing File Size...

Post by duerig »

Usually the compression happens in the last step when creating a PDF or DJVU or EPUB file from the ScanTailor images. Don't worry too much that the ScanTailor images are large. Here are the three basic paths that you can follow:

(1) Create an epub file by OCRing the text. You typically lose formatting information and there will be a small number of mistakes in the book, but this is the smallest you can get. Storing a single letter from the book will always be smaller than storing an image of that letter.

(2) Create a DJVU file. To get the most out of DJVU, binarize with ScanTailor or similar: set every page (or every page without pictures) to black and white, and create color zones for any picture you want to preserve in color or grayscale. Then use djvubind (or another DJVU file creator) to make the DJVU file. It achieves a remarkable amount of compression, especially on black and white pages. One of its tricks is to compress many pages as a single unit, which gains even more than treating each page individually. It also creates a multi-layer structure: the 'main' layer is a black and white bitmap, which is why it is so small, and for any color regions it creates a secondary layer overlaid onto the black and white one. Finally, it can use OCR to create a hidden search layer. In this case the OCR doesn't reduce the size; instead, it lets you search for keywords and highlights where they appear on the page, at the cost of a slight increase in file size.

(3) Create a PDF file. PDF creators all have compression built in; IrfanView, Adobe Acrobat, and PDFbeads are all good options. Adobe Acrobat and PDFbeads can offer OCR options to add a search layer, as before. Adobe Acrobat has a 'ClearScan' feature which tries to combine OCR with custom fonts to give you an almost identical-looking document that is searchable, with custom fonts based on the typefaces it detects. It does a good job, but if there is an error in the OCR, the error will still appear while you are reading. PDFbeads tries to make a PDF document with multiple layers, similar to the DJVU files above. In any of these cases, binarizing with ScanTailor or another program first will help generate even smaller documents (see the sketch at the end of this post).

Overall, the smallest books will be pure text epubs. Of the other two, DJVU files tend to be smaller than equivalent PDFs from scans.
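
Here is the sketch I promised for path (3): a minimal example of bundling the binarized ScanTailor TIFFs into a single PDF with the img2pdf Python library. This is just one way to do it; the folder and file names are placeholders, and if I remember right, recent img2pdf versions store 1-bit pages in compact CCITT Group 4 form rather than recompressing them.

Code:

    # Sketch: bundle binarized ScanTailor output pages into one PDF.
    # Uses the img2pdf library; folder and file names are placeholders.
    from pathlib import Path
    import img2pdf

    pages = sorted(Path("scantailor_out").glob("*.tif"))
    with open("book.pdf", "wb") as f:
        f.write(img2pdf.convert([str(p) for p in pages]))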

-Jonathon Duerig
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Newbie Questions about Reducing File Size...

Post by cday »

May I make two comments on your comprehensive introduction, Jonathon:

1. While binarization (conversion to a black and white, 1-bit depth image) generally produces very substantial reductions in file size when suitable compression is used, it can also result in a significant loss of text quality on screen unless the DPI is reasonably high in relation to the size of the text characters on the page. Acceptable text quality is subjective, but at the resolutions typically produced by current cameras there may be a noticeable loss of quality. In that case saving as grayscale would normally give much better quality, but inevitably produces a large increase in file size; when creating a PDF file, JPEG compression may then be the best option to obtain an acceptable file size (a small size-comparison sketch follows at the end of this post).

2. The statement that with Adobe ClearScan OCR recognition errors will appear in the text displayed on screen didn't seem to be the case in the tests I did using Adobe Acrobat XI, which I reported on a while back. ClearScan in modern versions of Acrobat seemed to be very good at determining when it had confidently identified a character, and it vectorised only those characters; when it had any doubt about a character, it seemed to represent it as a bitmap image, so visual fidelity on screen was maintained. Searchability would be expected to be lost in that case, although in principle it might still use its best guess at the character in the searchable text; I didn't test that.

The statement that OCR recognition errors would appear in the text on screen was one I made in a number of posts before my tests, based on the assumption that when ClearScan was used all text on the page was displayed as vector font characters, so that recognition errors were inevitably displayed. That assumption rested on the little information Adobe provided about the technology, and may have been true of the original implementation. The technology was introduced in Acrobat 9, if my memory is correct, and has likely been refined in successive Acrobat versions into what is now a sophisticated tool that works very well on high quality images, although with lower DPI images there can be a significant increase in file size and some loss of visual quality in the displayed text.
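
The size-comparison sketch mentioned under point 1, using the Pillow library (hypothetical file names, and the JPEG quality setting is just an example): save the same page as a 1-bit Group 4 TIFF and as a grayscale JPEG, then compare the results.

Code:

    # Compare file sizes: 1-bit Group 4 TIFF versus grayscale JPEG for one page.
    # Sketch using the Pillow library; "page.tif" is a placeholder scan.
    import os
    from PIL import Image

    page = Image.open("page.tif")

    page.convert("1").save("bw.tif", compression="group4")   # binarized, lossless
    page.convert("L").save("gray.jpg", quality=60)           # grayscale, lossy JPEG

    print("1-bit G4 TIFF :", os.path.getsize("bw.tif"), "bytes")
    print("grayscale JPEG:", os.path.getsize("gray.jpg"), "bytes")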
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Newbie Questions about Reducing File Size...

Post by duerig »

Thanks for those clarifications, cday. There are a lot of trade-offs between quality and compression. And binarization can cut both ways: on the one hand, it can 'sharpen' an image by making the text blacker and the background whiter, but it can also make things much harder to read if it doesn't go well. At one point I was very interested in the idea of binarizing every scanned book, but lately I've decided that I am OK with the larger grayscale file sizes to avoid the potential quality issues of binarizing.

For Adobe ClearScan, I think my understanding must have been based on your original statement/assumptions rather than on the later tests you did. I am glad that the theoretical difficulty doesn't cause a practical problem.

-Jonathon Duerig
teaguetod
Posts: 2
Joined: 07 Oct 2016, 12:08
E-book readers owned: Sony Reader, Aluratek Libre, Kindle
Number of books owned: 1000
Country: USA

Re: Newbie Questions about Reducing File Size...

Post by teaguetod »

Wow! Thank you all so much for these prompt replies! This is great.

Before posting here I had done some initial experiments creating PDF or DJVU files using sample (black & white) output pages from Scan Tailor. But I don't have Adobe Acrobat Pro, and I wasn't sure if I wanted to invest in it. For creating the DJVU files, I was using DjVu Solo 3.1, only because that's the first program I came across.

I wasn't that worried about OCR because I had expected the final output files to be too large to work on my dedicated e-readers, so I planned to read these scanned books on a tablet. But ideally, of course, it would be great if I could also add OCR to these files and make them searchable...

I'm leaning towards DJVU because in my (limited) experience it can create smaller files. I will check out djvubind.
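
From a quick look at its documentation, the basic usage seems to be just pointing it at a folder of processed images, so I plan to drive it with a little script like this (untested on my part; the folder name is a placeholder):

Code:

    # Sketch: run djvubind over a folder of ScanTailor output images.
    # Untested; the djvubind docs suggest the basic call is just the directory.
    import subprocess

    subprocess.run(["djvubind", "scantailor_out"], check=True)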

I will do some experiments bundling all the images together into DJVU and PDF and report the results.
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Newbie Questions about Reducing File Size...

Post by cday »

In relation to the possibility of incorrect text being displayed on the screen, and to avoid any possible confusion in the future:

JBIG2 compression in its lossy form, which can produce much smaller file sizes for black and white images than even the highly efficient CCITT G4 'fax' compression, may indeed display incorrect characters on screen. The compression relies on recognising similar bitmap patterns in the image and storing only a single copy; since there can be slight variation in the bitmaps of a particular character, a misidentification may result in a different character being displayed.

A company scanning engineering drawings on an office multifunction device, which unknowingly used the lossy form of the compression, discovered this when it was found that some dimensions in the scans had been changed: potentially very expensive!
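
As a toy illustration of the underlying idea (not real JBIG2, just the pattern-matching principle): an encoder that merges 'similar enough' symbols will store one bitmap for both and draw it everywhere either occurred.

Code:

    # Toy illustration of the lossy JBIG2 failure mode (not real JBIG2):
    # symbol-based compression stores one bitmap for all 'similar' glyphs,
    # so two distinct characters can be merged and one drawn in place of the other.

    glyph_a = [  # hypothetical 5x4 bitmap of one character
        "0110",
        "1000",
        "1110",
        "1010",
        "1110",
    ]
    glyph_b = [  # a different character, differing by a single pixel
        "0110",
        "1010",
        "1110",
        "1010",
        "1110",
    ]

    # Hamming distance: count the differing pixels between the two bitmaps.
    diff = sum(a != b for ra, rb in zip(glyph_a, glyph_b) for a, b in zip(ra, rb))
    print("Differing pixels:", diff)

    THRESHOLD = 3  # an aggressive 'similar enough' setting
    if diff <= THRESHOLD:
        print("Merged: one glyph is now drawn with the other's bitmap everywhere!")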