
Scan Tailor

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Scan Tailor

Post by Tulon » 07 Apr 2010, 14:18

spamsickle wrote: Could someone post some "upstream" numbers for these wonders?
Here you go:
DJVU 319 pages, 600 DPI - 2.95 MB
PDF 318 pages, DPI unknown - 6.28 MB

I haven't actually downloaded those, but I was one click away.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Scan Tailor

Post by spamsickle » 07 Apr 2010, 22:28

Sorry, I wasn't clear. By "upstream" I mean the sizes of the JPGs and TIFFs that went into building the PDFs or DJVUs.

I can accept that other people are getting these incredibly small files, I'm just trying to figure out what they're doing differently. I understood your description of DJVU encoding, and how it can make smaller files. The examples you just gave had the PDF file about twice as large as the DJVU file for a similar page count, but as I say, my own PDF for that page count would likely be 15-20 times the 6 MB you report that somebody got. I'll try making a DJVU one of these days, I promise, but other people are getting PDF sizes that I don't even get from ScanSnap.

Apparently, I'm doing something terribly, terribly wrong. While I don't mind the huge files per se (I'm not doing this for download, I'm doing it to cut down on storage space), I'd prefer to understand what everybody else is doing that I should probably be doing myself.

phaedrus
Posts: 56
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by phaedrus » 08 Apr 2010, 07:38

Spamsickle: In my text-only example the TIFF output file sizes are typically 80-100 KB. The camera in use was a 5 MP Canon, Scan Tailor's input was set to 300 DPI and its output to 600 DPI, and the page size of the book was approximately A5 with reasonable line spacing (i.e. it wasn't tight).

In another example my text-only pages are around 110-130 KB, with the photo pages at 4-6 MB. The dimensions etc. are similar to the above, but there's a reasonable amount more text on the pages, which probably accounts for the increased file sizes.

Incidentally, the image size directly from the camera varied from around 700 KB to 1.2 MB for all pages.

Tulon's observation that he used mixed mode may well be why he was striking issues and producing larger (PDF) output files - possibly you're doing the same? It's been my practice to output in B&W and specifically select colour mode for the pages with photos I need to have in good quality. My recollection is that mixed mode would make a bit of a hash of some text pages (i.e. they'd be partly B&W and partly greyscale) as well as a few messes with photos, so it's suited my purpose to do it this way.

Cheers, P.

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Scan Tailor

Post by spamsickle » 08 Apr 2010, 08:10

Thanks. It seems like the TIFFs are part of the difference -- I'm seeing TIFFs that are more like 120K than 80K in one of my mostly-text books. Even if they were 80K, though, 300 such files is still 24MB, and converting to PDF makes them bigger rather than smaller. The mixed-mode is also very significant -- those pages, for me, are cranking out 3MB TIFFs. I haven't noticed that it makes a hash of the text pages -- more often, I still get binary images or bits of images, and have to manually add image boundaries. I don't think I'd ever want to choose color output for a page that contained text -- getting rid of the "greywash" is the main reason I'm using Scan Tailor in the first place. Previously, I was just going straight from JPG to PDF.
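
The back-of-the-envelope figure above can be reproduced in a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes 1 MB = 1000 KB, as the 24MB estimate does.

```python
def total_size_mb(pages, kb_per_page):
    """Estimate the combined size of per-page files, in MB (1 MB = 1000 KB)."""
    return pages * kb_per_page / 1000


# 300 text pages at the two per-page TIFF sizes mentioned in this thread.
for kb in (80, 120):
    print(f"300 pages at {kb} KB each: about {total_size_mb(300, kb):.0f} MB")
```

Mixed-mode pages at around 3MB each would dominate such a total very quickly, which fits the much larger PDFs described earlier.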

I appreciate your providing these numbers. If anyone else would like to chime in with their results, I'd appreciate that too.

strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Scan Tailor

Post by strider1551 » 08 Apr 2010, 08:43

spamsickle wrote: I appreciate your providing these numbers. If anyone else would like to chime in with their results, I'd appreciate that too.
Statistics from my last book (545 pages)
original jpg's: 1.4GB (estimated 300dpi)
scantailor tiff's: 58.2MB (B&W, 300dpi)
DJVU file (without ocr): 11.1MB
DJVU file (with ocr): 12.8MB
spamsickle wrote: converting to PDF makes them bigger rather than smaller
What kind of compression is being used to make the pdf? I use an application called gscan2pdf, and when exporting the tiff images as a pdf it gives me a list of several compression methods to choose from. In my experience G4 (or Group4) produces the smallest pdf.

Edit:
I just made a PDF with G4 compression from the book above. Without ocr it came to 37.5MB. I also had to remove the cover - it was the only color page, so perhaps G4 doesn't do color?
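
To compare these figures with the per-page numbers posted earlier in the thread, a short Python sketch can convert them to KB per page and to a ratio against the DJVU file. The sizes are the ones reported above; treating 1.4GB as 1400 MB and using 545 pages for every entry (the G4 PDF actually lacks the cover) are simplifying assumptions.

```python
PAGES = 545

# Sizes in MB as reported above; 1.4GB is taken as 1400 MB (an assumption).
sizes_mb = {
    "original JPGs": 1400.0,
    "Scan Tailor TIFFs": 58.2,
    "PDF (G4, no OCR)": 37.5,  # made without the cover page
    "DJVU (with OCR)": 12.8,
    "DJVU (no OCR)": 11.1,
}


def kb_per_page(size_mb, pages=PAGES):
    """Average size per page in KB (1 MB = 1000 KB)."""
    return size_mb * 1000 / pages


def ratio_to_djvu(size_mb):
    """How many times larger a file is than the DJVU without OCR."""
    return size_mb / sizes_mb["DJVU (no OCR)"]


for name, size in sizes_mb.items():
    print(f"{name:20s} {kb_per_page(size):8.1f} KB/page  {ratio_to_djvu(size):6.1f}x DJVU")
```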

jhitchcock
Posts: 11
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by jhitchcock » 09 Apr 2010, 01:43

OK, I'm going to go ahead and post my stupid question. I'm trying to use ST, but when I go to create a new project I can't seem to select any images. When I browse to a directory containing JPEG files, aren't they supposed to show up in the boxes labeled "Files in Project" or "Files Not in Project"? Could this have anything to do with the fact that I am using a Mac running Parallels? I used my virtual machine to import the photos and all, but would there be any reason why ST can't seem to find the images?

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Scan Tailor

Post by Tulon » 09 Apr 2010, 04:02

jhitchcock wrote: OK, I'm going to go ahead and post my stupid question. I'm trying to use ST, but when I go to create a new project I can't seem to select any images. When I browse to a directory containing JPEG files, aren't they supposed to show up in the boxes labeled "Files in Project" or "Files Not in Project"? Could this have anything to do with the fact that I am using a Mac running Parallels? I used my virtual machine to import the photos and all, but would there be any reason why ST can't seem to find the images?
Do those files appear greyed out or not appear at all?
What is the directory containing the images? Is it a local directory or a network share in the form of \\host\share\path?

jhitchcock
Posts: 11
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by jhitchcock » 09 Apr 2010, 09:42

Tulon, many thanks for taking the time to answer what must have been a very simple question for you. Putting the files in a local directory fixed it. When they were on a network share directory they would not show up.

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Scan Tailor

Post by spamsickle » 09 Apr 2010, 11:53

Strider1551, thank you for posting your results. It sounds like gscan2pdf is something I should try.

I've just done some reading about PDF compression. I think I'm getting JPEG compression, because the PDF file contains the string "DCTDecode". I have just used ImageMagick and its default compression; I don't know if G4 can be specified or not. You are correct that G3 and G4 are for black-and-white images, and cannot encode greyscale or color.
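
The "DCTDecode" check described above can be automated with a rough Python sketch that counts the standard PDF stream filter names in a file. The function name and command-line usage are illustrative, not part of any tool mentioned in this thread. It is a plain byte search, so it will miss names written in unusual ways (for example, abbreviated inline-image filters), and FlateDecode is also used for non-image streams.

```python
import re
import sys
from collections import Counter

# Standard PDF stream filter names and the compression each one indicates.
FILTERS = {
    b"DCTDecode": "JPEG",
    b"CCITTFaxDecode": "CCITT Group 3/4 fax (black-and-white only)",
    b"JBIG2Decode": "JBIG2 (black-and-white only)",
    b"JPXDecode": "JPEG 2000",
    b"FlateDecode": "zlib/deflate (not only images)",
    b"LZWDecode": "LZW",
}


def count_filters(data):
    """Count occurrences of each known filter name in raw PDF bytes."""
    counts = Counter()
    for name in FILTERS:
        hits = re.findall(rb"/" + re.escape(name) + rb"\b", data)
        if hits:
            counts[name.decode()] = len(hits)
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        found = count_filters(f.read())
    for name, n in found.most_common():
        print(f"{name:15s} {n:5d}  {FILTERS[name.encode()]}")
```

A PDF built from black-and-white pages that reports mostly DCTDecode is being stored as JPEG, which would be consistent with the large file sizes discussed earlier.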

I assume that PDF does not have the option of using different compression options to encode binary text and greyscale/color images on the same page, as DJVU can, so if a page has both text and image, G4 is probably not an option.

I'm feeling a little more comfortable about using DJVU, after seeing the open-source and commercial tools that are available, so I'll probably give it a try in the near future.

strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Scan Tailor

Post by strider1551 » 09 Apr 2010, 13:33

spamsickle wrote: I have just used ImageMagick and its default compression; I don't know if G4 can be specified or not.
Yep, just use the -compress option, e.g. "convert -compress Group4 in.tiff out.tiff"
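
For a whole book, the same command can be driven from a small Python script. This is a sketch under assumptions: the helper names are hypothetical, it expects ImageMagick's convert to be on the PATH, and it reuses exactly the arguments quoted above. Whether the same option produces the desired result when the output is a PDF rather than a TIFF is not confirmed in this thread.

```python
import shutil
import subprocess


def group4_command(src, dst):
    """Build the ImageMagick command quoted above as an argument list."""
    return ["convert", "-compress", "Group4", src, dst]


def run_group4(src, dst):
    """Run the conversion, failing early if ImageMagick is not installed."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    subprocess.run(group4_command(src, dst), check=True)


# Show the command that would be run for the file names used in the post.
print(" ".join(group4_command("in.tiff", "out.tiff")))
```

Since Group4 only handles black-and-white images, colour or greyscale pages (such as a cover) would need to be converted separately.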
