Re: Learning to Create Tiny DJVU files
Posted: 13 Mar 2014, 21:58
I continue to mass-convert my scanned books to djvu. These are not the ugly low-quality scans that I have kind of made a hobby out of dressing up. These are all 600dpi sheet-fed scans of books which have had their spines chopped off. Recall that they were Adobe ClearScan files before, and I was generally pleased with their small size. Converting these scans to djvu has proven DjVu's superiority to me once and for all, though.
Many of the books are half the size once converted with the process outlined here (minidjvu for the black-and-white portions, c44 for any background images at 1/4th size, manually mashed together with djvumake and scripts). I have even seen some that are 1/8th the size of the PDF. Then some are about the same size, and two or three books have been slightly larger. I can't predict which outcome I'll get before the conversion, as there doesn't appear to be any rhyme or reason to it. I have to imagine that sometimes ClearScan builds up lots of redundant font images, which DjVu's JB2 manages to share between pages.
Example: Plato's Republic, 397 pages, all black and white except the cover image.
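To make the "mashed together with djvumake" step a bit more concrete, here is a rough sketch of how one page's command line could be assembled. This is not the actual script used for these books. The file names and the helper function are made up for illustration, and it assumes the JB2 mask and the IW44 background have already been pulled out into separate chunk files (for instance with djvuextract). It only builds the argument list and does not run anything.

```python
def djvumake_command(out_page, width, height, dpi, mask_jb2, background_iw4=None):
    """Build an argv list for djvumake (hypothetical helper, nothing is executed).

    width, height and dpi describe the full-resolution page (the INFO chunk).
    mask_jb2 is a file holding the bitonal JB2 data (Sjbz chunk).
    background_iw4 is an optional file holding the IW44 background (BG44 chunk).
    """
    cmd = ["djvumake", out_page, f"INFO={width},{height},{dpi}", f"Sjbz={mask_jb2}"]
    if background_iw4 is not None:
        # Only pages with a picture (e.g. the cover) get a background layer.
        cmd.append(f"BG44={background_iw4}")
    return cmd


# Hypothetical file names: a cover with a background image, and a plain text page.
cover = djvumake_command("p0001.djvu", 5100, 6600, 600, "p0001.jb2", "p0001.iw4")
text_page = djvumake_command("p0002.djvu", 5100, 6600, 600, "p0002.jb2")
print(" ".join(cover))
print(" ".join(text_page))
```

The resulting list could be handed to subprocess.run once the chunk files exist. Keeping the background optional reflects that most pages here are purely black and white.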
- PDF: 14MB ClearScan
- DJVU with OCR: 6MB
- DJVU without OCR: 4MB
- PDF: 19MB ClearScan
- DJVU with OCR: 11MB
- DJVU without OCR: 8MB
I've generally been dropping the OCR data in the conversion, because when I look at it... it's not that great. ClearScan OCR has tons of mistakes in it, and I don't search in my PDFs that often in the first place. For large books, leaving it out saves 2 or 3 megabytes, so I've been doing that. In the numbers above I included the with-OCR sizes to make the comparison to the OCR'ed PDFs fair.
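For anyone who wants the ratios rather than the raw sizes, here is a small sketch that works them out from the two sets of numbers listed above. The sizes are the rounded megabyte figures from this post, so the results are only approximate. The dictionary keys and the function name are just labels chosen for this example.

```python
# Rounded sizes in MB, copied from the two listings in this post.
sizes_mb = [
    {"pdf": 14, "djvu_ocr": 6, "djvu_no_ocr": 4},
    {"pdf": 19, "djvu_ocr": 11, "djvu_no_ocr": 8},
]


def ratios(entry):
    """Return DjVu size as a fraction of the PDF size, plus the OCR overhead in MB."""
    return {
        "with_ocr": entry["djvu_ocr"] / entry["pdf"],
        "without_ocr": entry["djvu_no_ocr"] / entry["pdf"],
        "ocr_cost_mb": entry["djvu_ocr"] - entry["djvu_no_ocr"],
    }


for entry in sizes_mb:
    r = ratios(entry)
    print(
        f"PDF {entry['pdf']}MB: DjVu with OCR is {r['with_ocr']:.0%} of the PDF, "
        f"without OCR {r['without_ocr']:.0%}, OCR layer costs {r['ocr_cost_mb']}MB"
    )
```

The OCR overhead comes out at 2MB and 3MB for the two listings, which matches the "2 or 3 megabytes" mentioned above.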
[EDIT: I should say all the ClearScan files were created with Acrobat X, and saved as "Reduced Size PDF" afterward. Don't know if Acrobat XI does a better job or not.]