
Re: Scan Tailor

Posted: 20 Mar 2010, 18:00
by spamsickle
Hmmm, stupidity on my part too, since Tulon intuited that the URL needed to be lopped, while I just noted it looked odd and didn't even try to fix it.

I've run your images through my program, and it failed on virtually all of them, so don't waste your time. I think the images are confusing YAPP as they confused ST, though for different reasons, but even if you were feeding it standard text pages that bright area under the spine would probably cause my simplistic algorithm to fail. I need darkness! I see a pale cradle and I want to paint it black...

In fact, looking at your images, I realize that YAPP relies on a quirk of my setup that probably doesn't apply to most: the anti-reflective glass I've used means the entire trapezoid in the "facing page" on my images is bright, whereas the reflection of the target page is overwhelming on your image.
[Attachment: 0004.jpg]
Probably explains why there have been 93 downloads of YAPP and only one "works like a charm". Don't be afraid to criticize, people. You won't hurt my feelings.

Steve, I think the reason your suggestion is currently on Tulon's back burner is related to mine. Our DIY-style images don't really have a problem with warping, per se, as the page images are pretty flat. Images coming from a flatbed scanner, on the other hand, need full de-warping; simply de-skewing them as you describe will not be adequate.

There may be shortcuts we can take on the DIY side (like de-skewing) that will not make ST more useful to the flatbed crowd for which it was designed. If we can bootstrap ourselves enough non-Tulon coding talent, would it make sense to have a DIY-optimized version of the application?

Re: Scan Tailor

Posted: 20 Mar 2010, 18:56
by Tulon
spamsickle wrote: There may be shortcuts we can take on the DIY side (like de-skewing) that will not make ST more useful to the flatbed crowd for which it was designed. If we can bootstrap ourselves enough non-Tulon coding talent, would it make sense to have a DIY-optimized version of the application?
There is more to it than that:
1. The proposed extension to deskewing would be completely manual. I don't like that.
2. ST's transformation pipeline can't currently handle perspective transforms, which suggests such functionality should be at the Output stage.
3. I kind of like working on challenging stuff. I also think a proper dewarping is something doable.

As for a DIY-specific page splitter, I think it's quite possible to make the generic one handle DIY-specific cases. I actually think my latest idea with the Fourier transform should work really well both for flatbed scanners and DIY setups. If Rob or someone else could come up with proof-of-concept code, it would only take me a few days to turn it into a production solution.

Re: Scan Tailor

Posted: 20 Mar 2010, 19:06
by spamsickle
I don't like completely manual either, but I think the de-skewing Steve describes could implement the usual "apply to this image and the following images" or "every other image" automation, since the same camera/book geometry probably applies to an entire volume.

I'm not interested in the de-skewing because keystoning still isn't something that bothers me in the images I've generated, so since my page-split "solution" doesn't look like it will be helpful, I think I'll go back to concentrating on adding the aforementioned propagation to content selection.
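As a rough illustration of the "apply to this image and the following images" and "every other image" propagation mentioned above, here is a minimal Python sketch. The function name and mode strings are hypothetical and are not taken from Scan Tailor's code; it only shows which page indices each mode would touch.

```python
def propagation_targets(current: int, page_count: int, mode: str) -> list[int]:
    """Return the page indices a setting would be copied to (hypothetical helper).

    current    -- zero-based index of the page whose setting is propagated
    page_count -- total number of pages in the project
    mode       -- "following" (this page and all later ones) or
                  "every_other" (this page and every second page after it,
                  e.g. all left-hand pages of a two-camera setup)
    """
    if not 0 <= current < page_count:
        raise ValueError("current page is out of range")
    if mode == "following":
        return list(range(current, page_count))
    if mode == "every_other":
        return list(range(current, page_count, 2))
    raise ValueError(f"unknown propagation mode: {mode}")
```

"Every other image" matters for DIY rigs because left and right pages usually come from different cameras, so each side tends to share its own geometry.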

Re: Scan Tailor

Posted: 30 Mar 2010, 10:16
by Misty
Tulon wrote:
SuperNibbler wrote: However, is there a way to turn off the LZW compression?
There is no such option. What would be the point? It's a lossless compression and patents on it have expired.
Correct me if I'm wrong, SuperNibbler, but my guess would be that it's for long-term preservation reasons. Compressed data is a little more liable to bitrot than uncompressed, because the loss of a few bytes will ruin larger portions of the image. Despite the larger size, a lot of institutions keeping files for long-term purposes will prefer to keep uncompressed TIFFs.
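The fragility argument can be illustrated with a small Python experiment. This is only a sketch under stated assumptions: it uses zlib (DEFLATE) from the standard library as a stand-in for TIFF's LZW, and generated text as a stand-in for image data, so it shows the general effect rather than the behaviour of any particular TIFF reader.

```python
import zlib


def flip_byte(data: bytes, index: int) -> bytes:
    """Return a copy of data with every bit of one byte inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 0xFF
    return bytes(damaged)


def salvage(stream: bytes) -> bytes:
    """Decompress as much of a zlib stream as possible before an error."""
    decoder = zlib.decompressobj()
    recovered = bytearray()
    try:
        for i in range(len(stream)):
            recovered += decoder.decompress(stream[i:i + 1])
    except zlib.error:
        pass  # keep whatever was decoded before the stream broke
    return bytes(recovered)


def damaged_bytes(original: bytes, recovered: bytes) -> int:
    """Count bytes that differ from the original, plus missing or extra bytes."""
    differing = sum(a != b for a, b in zip(original, recovered))
    return differing + abs(len(original) - len(recovered))


# Stand-in for image data: repetitive enough to compress well.
original = "\n".join(
    f"row {i}: values {i * i % 97} {i % 13}" for i in range(400)
).encode()
compressed = zlib.compress(original)

# Corrupt one byte in the middle of each representation.
raw_damage = damaged_bytes(original, flip_byte(original, len(original) // 2))
packed_damage = damaged_bytes(
    original, salvage(flip_byte(compressed, len(compressed) // 2))
)
print("uncompressed:", raw_damage, "byte(s) damaged")
print("compressed:  ", packed_damage, "byte(s) damaged or lost")
```

In the uncompressed copy the damage stays confined to the single corrupted byte, while in the compressed copy everything decoded after the corrupted position is unreliable or lost.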

Re: Scan Tailor

Posted: 30 Mar 2010, 11:32
by spamsickle
Good God, do you mean the TIFFs ST is generating now are compressed?

Last night I scanned a textbook that's roughly 900 pages. I'm processing it today with Scan Tailor. The raw input JPEGs consist of 920 files consuming 1,700,000,000 bytes (call it 1.7 GB). It's a molecular biology text, and in addition to lots of text, there are several color illustrations on every page -- metabolic cycles, pictures of molecules, etc.

I just output a sample page to see what I'll be looking at, size-wise.

The input JPEG is 1,800,000 bytes (call it 1.8 MB). If I output a full-color TIFF, it's 35 megabytes. A "mixed" TIFF is 6.5 MB.

This page looks fairly representative, so if I go with "mixed" throughout, the total size of my TIFFs could come in at about 6 GIGABYTES. Since conversion to PDF adds a bit, there's no way I'll be able to store the whole book on a single 4.7-gig capacity DVD.
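The estimate above is simple multiplication, sketched here in Python. The helper names are made up for illustration, sizes are in decimal megabytes, and the result is only as good as the assumption that one sample page is representative of all 920.

```python
DVD_CAPACITY_MB = 4700  # single-layer DVD, decimal megabytes


def estimate_total_mb(page_count: int, sample_page_mb: float) -> float:
    """Extrapolate the size of a whole book from one sample page."""
    return page_count * sample_page_mb


def fits_on_disc(total_mb: float, capacity_mb: float = DVD_CAPACITY_MB) -> bool:
    """True if the estimated total fits on one disc."""
    return total_mb <= capacity_mb


mixed_total = estimate_total_mb(920, 6.5)  # "mixed" TIFF sample size from the post
print(f"{mixed_total / 1000:.2f} GB, fits on one DVD: {fits_on_disc(mixed_total)}")
```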

I assume Scan Tailor goes with output TIFFs to make it easier to get accurate OCR, though in my (admittedly limited) experience with ABBYY OCR, it's possible to get good OCR recognition from JPEG files. In fact, if I were going to do OCR (which, at the moment, I'm not), I'd use the original JPEGs rather than ST's output anyway.

I think instead of simply converting from TIFFs to PDFs, I'm going to add an intermediate step, to convert ST's TIFF output to JPEGs before converting those to PDFs. ImageMagick makes the 6.5 MB TIFF into a 2.3 MB JPEG, which makes a PDF only slightly larger. That should get me a final product just a little over 2 GB -- larger than the raw originals, but still small enough to fit on a single DVD (along with the originals, the Scan Tailor project file, and if I'm lucky, Scan Tailor's cache info too).

The quality looks comparable to me, though I'm sure there's been some loss in going from JPEG to TIFF to JPEG to PDF rather than JPEG to TIFF to PDF.
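A minimal sketch of that intermediate step, as a Python wrapper around ImageMagick. Several things here are assumptions rather than details from the post: the directory layout, the `.tif` extension, the quality setting of 85, and the use of the classic `convert` command (ImageMagick 7 installs usually call it `magick`).

```python
import subprocess
from pathlib import Path


def jpeg_command(tiff_path: Path, out_dir: Path, quality: int = 85) -> list[str]:
    """Build the ImageMagick command that converts one TIFF to a JPEG."""
    jpeg_path = out_dir / (tiff_path.stem + ".jpg")
    return ["convert", str(tiff_path), "-quality", str(quality), str(jpeg_path)]


def convert_all(tiff_dir: Path, out_dir: Path, quality: int = 85) -> None:
    """Convert every .tif file in tiff_dir, stopping at the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for tiff_path in sorted(tiff_dir.glob("*.tif")):
        subprocess.run(jpeg_command(tiff_path, out_dir, quality), check=True)


# Example call (hypothetical directory names):
# convert_all(Path("out"), Path("out_jpeg"))
```

Lower quality values shrink the files further at the cost of more visible JPEG artifacts, so it is worth checking a few pages by eye before converting the whole book.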

Re: Scan Tailor

Posted: 30 Mar 2010, 15:12
by Tulon

Let me introduce you to the workflow that is common in the Russian book-scanning community.

First of all, for technical literature, and also for non-technical books with lots of pictures, DJVU rather than PDF is the way to go. DJVU compresses the background (pictures) and foreground (text) using different algorithms. The text compression algorithm is called JB2 and works by finding (nearly) identical symbols across the whole book and reusing a single picture of that symbol over and over. Pictures are downscaled and encoded by a wavelet-based IW44 algorithm that is better than JPEG and almost as good as JPEG2000. Most of this technology is open source, although a critical piece is missing. That piece is an automatic background/foreground segmenter.
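To make the symbol-reuse idea concrete, here is a toy Python sketch. It is not the real JB2 algorithm: glyphs are tiny hand-written bitmaps, "nearly identical" is assumed to mean "same size and at most `max_diff` differing pixels", and all names are invented for the example.

```python
def pixel_difference(a, b):
    """Count differing pixels between two same-sized glyph bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def same_shape(a, b):
    """True if both bitmaps have the same number of rows and row widths."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


def encode_glyphs(glyphs, max_diff=1):
    """Replace each glyph with an index into a shared symbol dictionary."""
    dictionary = []  # one representative bitmap per symbol
    indices = []     # for every input glyph, the dictionary entry it reuses
    for glyph in glyphs:
        for i, proto in enumerate(dictionary):
            if same_shape(proto, glyph) and pixel_difference(proto, glyph) <= max_diff:
                indices.append(i)
                break
        else:
            # No close match found: this glyph becomes a new dictionary symbol.
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


letter_h = ("#.#", "###", "#.#")
noisy_h = ("#.#", "###", "#..")  # one pixel lost in scanning
letter_t = ("###", ".#.", ".#.")

dictionary, indices = encode_glyphs([letter_h, letter_t, noisy_h, letter_t])
print(len(dictionary), indices)
```

Four glyphs on the "page" are stored as two bitmaps plus four small indices, which is where the savings come from when the same letters recur thousands of times across a book.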

Scan Tailor to the rescue! Its Mixed output is already effectively segmented. In recent builds, Scan Tailor makes sure that pure black and pure white colors are reserved for text - they won't be used in pictures. This allowed people to write scripts that would separate Scan Tailor's Mixed TIFFs into pairs of "text only" and "pictures only" images and feed them to command-line JB2 and IW44 encoders. A DJVU file is then composed out of them.
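The separation step those scripts perform can be sketched roughly as follows. This is a simplified Python illustration, not any of the actual community scripts: it works on a plain grid of RGB tuples instead of a real TIFF, and it assumes pure black marks text while pure white is shared background.

```python
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)


def split_mixed(pixels):
    """Split a Mixed-mode page into a text-only and a pictures-only layer.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    """
    # Text layer: keep pure black, turn everything else white (bitonal result).
    text_layer = [[BLACK if p == BLACK else WHITE for p in row] for row in pixels]
    # Picture layer: erase the text by painting pure black pixels white.
    picture_layer = [[WHITE if p == BLACK else p for p in row] for row in pixels]
    return text_layer, picture_layer


page = [
    [BLACK, WHITE, (200, 30, 30)],
    [WHITE, BLACK, (10, 120, 240)],
]
text_only, pictures_only = split_mixed(page)
```

The text layer would then go to a JB2 encoder and the picture layer to an IW44 encoder; how the erased text pixels are best filled in the picture layer is a tuning question this sketch ignores.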

With this workflow, the final DJVU file of your 900-page book with pictures should be under 100MB. For typical books, say 300 pages with occasional pictures, people manage to bring the final DJVU file under 10MB. They might be using a commercial DJVU encoder though.

Re: Scan Tailor

Posted: 30 Mar 2010, 16:04
by spamsickle
Tulon, thank you for that introduction. I'll confess, I've never looked seriously at DJVU, because the only eBook reader I have is a personal computer. While I trust that PDF will be a format that computers will be able to read for the handful of decades I expect to live (though there are no guarantees), DJVU doesn't seem to have the same base of support. As a result, I don't trust it so much.

I'll take a closer look at it, though. My main concern is that I can get the books on external media without splitting them up. Even if I can only store a single book on a DVD (and this is the first time I've had to consider extraordinary measures to do so), a stack of 100 DVDs is less than 1/8 of a cubic foot and weighs less than 10 pounds; 100 books occupy several shelves and weigh hundreds of pounds.

Re: Scan Tailor

Posted: 30 Mar 2010, 16:29
by daphnis
Hi all. I'm looking to use Scan Tailor to selectively process 1-bit images of music scores. One problem I can't seem to find a solution to is how to deactivate certain "modules" within the software. For example, most often I (let me say "we", speaking also for those who scan music) only need to run a few transforms on the image, commonly de-speckle and, most importantly, deskew. I'd like to use Scan Tailor only for these two operations, thereby deactivating all other features. Is there any way this can be accomplished?

Re: Scan Tailor

Posted: 30 Mar 2010, 17:01
by spamsickle
There isn't really a way to turn off filters, though most of them will permit you to set them one time and apply the settings to all the images in the batch you are processing. What problems are you encountering?

Re: Scan Tailor

Posted: 30 Mar 2010, 17:09
by daphnis
For example, in the "Select Content" filter, I really don't need to select any content except the whole page I feed it, and I don't see how to apply the whole selection box (i.e. the whole image) to all images. The same goes for page layout, in that I may not want it to change anything in the layout. I'm also finding that with some images (again, music scanned at 600 dpi in 1-bit), the deskew algorithm is either not detecting any skew or correcting it in the opposite direction. This may be a problem since my workflow entails around 1,000 pages per week, and having to adjust skew and other factors on a per-image basis, on top of the manual post-processing, could become extremely time-consuming.