Scan Tailor

spamsickle · Post by **spamsickle** » 11 Apr 2010, 08:58

There was a bug in the version of ImageMagick I was using (6.5.x) which caused it to fail with the -compress option ("No space to read TIFF" message). I'm running version 6.6.1-2 now, and for many of the images, -compress Group4 results in PDFs that are about 1/3 the size of those defaulting to JPEG compression. Now I am noticing the pages on which text is erroneously tagged as image: they cannot use Group4 compression, so ImageMagick reverts to the default. Instead of specifying everything as mixed mode output in Scan Tailor, I may need to take the extra time to pick and choose the pages containing images, to avoid this problem and get the benefit of G4 compression in my PDFs.
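A minimal sketch of picking the compression page by page, rather than relying on the fallback. It assumes single-page TIFFs and treats any page with at most two unique colors as bitonal; `choose_compression` and `convert_page` are hypothetical helper names, not part of ImageMagick or Scan Tailor.

```shell
#!/bin/sh
# Sketch: choose a PDF compression per page from its unique-color count.
# choose_compression and convert_page are hypothetical helper names.

# Echo Group4 for pages with at most two colors (assumed bitonal), JPEG otherwise.
choose_compression() {
    if [ "$1" -le 2 ]; then
        echo Group4
    else
        echo JPEG
    fi
}

# Convert one single-page TIFF to a PDF next to it, using the chosen compression.
# %k is ImageMagick's "number of unique colors" format escape.
convert_page() {
    colors=$(identify -format '%k' "$1") || return 1
    convert "$1" -compress "$(choose_compression "$colors")" "${1%.*}.pdf"
}
```

Something like `for f in out/*.tif; do convert_page "$f"; done` would then process a whole output folder (the `out` folder name is an assumption). Pages that end up as JPEG are the ones worth re-checking in Scan Tailor for text wrongly tagged as image.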

Thanks to everyone who took the time to help me over this little hump.

kitashi · Post by **kitashi** » 18 Apr 2010, 11:48

Hello everyone. This is my first post in this forum.

spamsickle wrote: I assume that PDF does not have the option of using different compression options to encode binary text and greyscale/color images on the same page, as DJVU can, so if a page has both text and image, G4 is probably not an option.

In fact, a single PDF page can contain multiple images that use different compression methods. It is also possible to overlay images as layers with transparency.
I made a simple sample file. The attached file consists of two layers: the bitonal text layer is G4-compressed, while the figure of the Hatter is DCT (JPEG).

However, since ST currently creates a single-layered tiff file from a mixed-mode page, there is no easy way to convert the tiff file to a PDF page of multiple images/layers. Possible code modifications might be: (1) to output the picture zone and the text content separately, then combine them; or (2) to create a single PDF directly, instead of tiffs.

Tulon · Post by **Tulon** » 19 Apr 2010, 03:00

kitashi wrote: However, since ST currently creates a single-layered tiff file from a mixed-mode page, there is no easy way to convert the tiff file to a PDF page of multiple images/layers.

There is an easy way to separate ST's output tiffs into Text / Pictures pairs. ST makes sure pure black and pure white are reserved for text areas, which makes such separation an easy thing to do. There exists a utility for this purpose called "ST Separator". It's currently in Russian only, though I think its author would be willing to make an English version.

The problem with generating PDFs is that no good open-source encoder exists. For example, PDF supports the JBIG2 compression method for B/W content, which is a close relative of DJVU's JB2. Unfortunately, as far as I am aware, no open-source JBIG2 encoder can merge nearly identical characters, as opposed to pixel-by-pixel identical ones. For DJVU, there exists an open-source encoder that can do that. It's called minidjvu, and I am trying to get it relicensed from "GPL2 only" to "GPL2 or above" to be able to incorporate parts of it in ST.

dtic · Post by **dtic** » 20 Apr 2010, 17:37

I've recently used Scan Tailor on multiple documents with only a few (~10) pages each. Each batch of output TIFFs is then used to make an OCR'ed djvu file.

I speed up the work process by using two instances of Scan Tailor simultaneously, each processing a separate document. Here are some changes that could increase the speed further by reducing the number of manual actions. Note: I get that Tulon isn't focusing on these kinds of small GUI tweaks ATM. I'm posting anyway in case anyone else has the capacity and interest to contribute code for these things.

1. allow drag and drop of files into the "files in project" box
2. option to immediately go to the next step after such drag&drop (i.e. no manual press of OK needed. Compare to the five manual steps currently needed: copy folder path, paste path in ST, click select all, click arrow, click ok)
3. hotkeys for batch operations. Suggestion: F1 to F6 to start batch operation on steps 1 to 6.
4. option to begin batch operation (at some user set step) immediately when the main window opens, without any click/keypress. Example: on window open, do steps 1-4 on all files.
5. option to disable warning on "new project" (ctrl+N or ctrl+w)
6. option to disable warning on "remove from project..." (in thumbnail right click menu)
7. let "delete" key execute "remove from project..." on selected thumbnails
8. allow changing default value for thinner/thicker setting in output step (maybe through ctrl+drag of the slider?)
9. let ctrl+click on "apply to..." do the apply to all action immediately.

edit:
10. option to autorun a command line after processing on the last step ends (plus a parameter for the "out" folder for the currently processed files)

kitashi · Post by **kitashi** » 21 Apr 2010, 10:12

Tulon wrote: ST makes sure pure black and pure white are reserved for text areas, which makes such separation an easy thing to do. There exists a utility for this purpose called "ST Separator". It's currently in Russian only, though I think its author would be willing to make an English version.

Tulon, thanks for the information. Reading through this forum, I noticed that you had already mentioned that (31 Mar 2010, 05:12). Sorry to have overlooked it.

ST Separator seems to be interesting. I'm looking forward to the English version (or a Japanese one!).
However, ST Separator works only on Windows, and I mainly use *nix these days, so I tried ImageMagick and pdftk instead. It works fine for me.

Code:

convert MIXED.tiff -threshold 1 -compress Group4 TEXT.pdf                          # bitonal text layer: only pure black stays black
convert MIXED.tiff -transparent white -transparent black -compress Zip PICT.pdf    # picture zone on a transparent background
pdftk TEXT.pdf stamp PICT.pdf output RESULT.pdf                                    # overlay the picture layer on the text layer

In most cases, Zip compression makes the files smaller than LZW does. I suppose that is because the files tend to contain large continuous zones of transparent pixels; IIRC, I read somewhere that Zip works better than LZW in such situations.
Alternatively, we can "-trim" the margins of the images; if the trimmed image can then be placed at the correct position in the PDF, that would be a smarter way. I'm trying this now.
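To place a trimmed picture at the right spot, the offset of the picture zone within the page is needed. Here is a small sketch that reports it without cropping the file. It assumes the page margin is pure white, and it first paints the pure-black text white so that only picture pixels remain; `parse_geometry` and `picture_bbox` are hypothetical helper names. It does not cover the placement in the PDF itself.

```shell
#!/bin/sh
# Sketch: find the bounding box of the picture zone in a mixed-mode page.
# parse_geometry and picture_bbox are hypothetical helper names.

# Split a geometry string "WxH+X+Y" into four space-separated numbers,
# e.g. "800x600+120+340" becomes "800 600 120 340".
parse_geometry() {
    echo "$1" | tr 'x+' '  '
}

# Print "width height x_offset y_offset" of the non-white area of a page.
# Pure-black text pixels are painted white first, so only picture pixels count.
# %@ is ImageMagick's "trim bounding box (without actually trimming)" escape.
picture_bbox() {
    geometry=$(convert "$1" -fill white -opaque black -format '%@' info:) || return 1
    parse_geometry "$geometry"
}
```

For example, `picture_bbox MIXED.tiff` would print the size and top-left offset, in pixels, of the area that "-trim" would keep. Pages without any picture may report an empty or degenerate box, so that case would need separate handling.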

Tulon, may I ask you a question about DJVU's JB2 compression? The "merge nearly identical characters" method is only for lossy compression, right? If not, I wonder how it is possible to compress losslessly with that...

# I've tried the command line tools of jbig2enc. Although the resulting PDF was quite small, its page order was messed up. I have no idea why.

Tulon · Post by **Tulon** » 23 Apr 2010, 02:39

kitashi wrote: Tulon, may I ask you a question about DJVU's JB2 compression? The "merge nearly identical characters" method is only for lossy compression, right? If not, I wonder how it is possible to compress losslessly with that...

That's lossy by definition. In lossless mode it would only merge pixel-by-pixel identical characters.

Misty · Post by **Misty** » 23 Apr 2010, 11:44

I'm pretty late to the party here, but I got the chance to try out the version of Scan Tailor with Rob's dewarping algorithm. It seems to have somewhat... wacky results, at least with the book I tested. Could this be the result of something I did wrong? It's set to output DPI 600, black and white, default thickness. I know the algorithm isn't complete yet, so I may just be running into its current limitations.

Original:

[Attachment: 2010DW001.131 normal.png]

Dewarped:

[Attachment: 2010DW001.131 dewarped.png]

Tulon · Post by **Tulon** » 24 Apr 2010, 04:30

Misty,

You are not doing anything wrong; it's just that the dewarping code needs more work. Why do you think I disabled it in the release?

intermediatic · Post by **intermediatic** » 24 Apr 2010, 10:35

Hi, I love Scan Tailor!

I tried the Mac OS version but couldn't get it to run. I'm running Scan Tailor in Crossover on the Mac, which uses WINE to run Windows apps.

Anyway, I've set out to take some books out of DJVU and into PDF and, while doing so, I like to clean up the pages. I didn't scan them myself, so they are often spreads, with nasty edges and so on.

I ran two books with no problems. Now a third, which was particularly nasty, has proven difficult. It has a high degree of gray in the background and really shouldn't be used, but it's what I have.

I've run this twice and both times hit the same dead end. I go through, set up splits (many manual), tweak select content, OK. All is well so far. Then I get to page layout, and about half the images are proper, but in the rest, the pages get shrunken to mini versions in a massive field of white.

The second time through, I ran "fix dpis." All the manual work means this took an hour while watching TV.

Would love to have your input on how to make this work next time.

BTW, regardless of why these images look like they are different sizes, they actually are the same size.

Tulon · Post by **Tulon** » 24 Apr 2010, 11:30

intermediatic wrote: The second time through, I ran "fix dpis." All the manual work means this took an hour while watching TV.

Would love to have your input on how to make this work next time.

Wrong DPIs would be my guess as well. Did fixing them help or did it not?
If not, then chances are you didn't fix all of them. Don't count on all wrong DPIs appearing in the "Needs Fixing" tab. In most cases all pages have the same DPI, so go to the "All Pages" tab, select the "All Pages" node and apply the correct DPI there. Don't try to guess the correct DPI - estimate it instead.
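One way to estimate rather than guess, sketched under the assumption that you can measure the physical page (or spread) with a ruler: divide a pixel dimension of the scan by the same dimension of the paper in inches. `estimate_dpi` is a hypothetical helper name, not something Scan Tailor provides.

```shell
#!/bin/sh
# Sketch: estimate scan DPI from a pixel dimension and the measured paper dimension.
# estimate_dpi is a hypothetical helper name.

# Usage: estimate_dpi PIXELS MILLIMETRES
# DPI = pixels / inches, with inches = millimetres / 25.4.
# Integer arithmetic, rounded to the nearest whole number.
estimate_dpi() {
    echo $(( ($1 * 254 + $2 * 5) / ($2 * 10) ))
}
```

For instance, an image 3300 pixels tall showing a page that measures 279 mm in height gives `estimate_dpi 3300 279`, which prints 300. If the image is a two-page spread, measure the width of the whole spread, not of a single page.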
