HELP - Scan Tailor Project --> .pdf

dingodog · Post by **dingodog** » 14 Oct 2010, 21:13

spamsickle wrote: I see that sam2p has Windows binaries as well as Linux. The author claims that it's better than ImageMagick for creating PDFs, and the reasons he gives seem reasonable.

My workflow (images to PDF) is:

- scan the pages
- pass them through Scan Tailor to convert them to black and white
- further compress these B/W TIFFs with jbig2enc, which needs Python and pdf.py from
http://github.com/agl/jbig2enc

I use a version I compiled myself for Puppy Linux:

http://dokupuppylinux.co.cc/programs:encoders

jbig2 -s -p -v *.tiff ; pdf.py output > out.pdf

JBIG2 is the same compression that Google Books uses for its scans. Its lossless compression is remarkably powerful.
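The two-step command above can also be driven from a script. The following is only a rough sketch, assuming the encoder binary is on the PATH under the name jbig2 (the name used later in this thread), that pdf.py sits in the current directory, and that the interpreter is called python. The -b basename flag is taken from a later post; the function names and the plain alphabetical page sort are illustrative assumptions.

```python
import glob
import subprocess


def build_jbig2_command(tiff_paths, basename="output"):
    """Return the argv list for the encoder: symbol mode, PDF-ready output."""
    # -s: symbol mode, -p: PDF-ready data, -v: verbose, -b: output basename
    return ["jbig2", "-s", "-p", "-v", "-b", basename] + sorted(tiff_paths)


def make_pdf(tiff_glob="*.tif", basename="output", pdf_path="out.pdf"):
    """Encode the matching TIFFs, then let pdf.py assemble the PDF."""
    pages = glob.glob(tiff_glob)
    if not pages:
        raise FileNotFoundError("no TIFF files match " + tiff_glob)
    # check=True stops the script if the encoder reports a failure.
    subprocess.run(build_jbig2_command(pages, basename), check=True)
    # pdf.py writes the PDF to stdout, so send it to a file opened in binary mode.
    with open(pdf_path, "wb") as pdf_file:
        subprocess.run(["python", "pdf.py", basename], stdout=pdf_file, check=True)
```

Sorting by filename only gives the right page order if the page numbers are zero-padded (page_001.tif, page_002.tif, and so on).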

For color scans, I encode to DjVu format (the "scanned" option) with the freeware version of
*DjVu Solo*

Misty · Post by **Misty** » 15 Oct 2010, 09:14

I was going to suggest JBIG2enc too. It's much, much more efficient than Group4 or similar compression. My text-only pages are typically in the region of 6-20KB each, not 400KB. It's not quite as efficient as DjVu, which uses a very similar compression scheme, but it's definitely what I'd call "good enough."

JBIG2 supports a lossy profile as well as lossless compression, and I've found that the results are usually quite good. It creates an index of the characters on a page and uses average representations for a given letter for every occurrence. So, if you have the letter "e" which differs by a few pixels in many places, it gets replaced with a single version of the letter "e" everywhere. I had some problems with the open-source cjb2 encoder from DjVu Tools being overzealous, but JBIG2enc has worked well for me.
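To make the substitution idea concrete, here is a toy sketch. It is not jbig2enc's actual algorithm: glyphs are assumed to be equal-sized bitmaps written as rows of "0" and "1", the similarity measure is a simple pixel-agreement ratio, and each class is represented by its first member rather than by an average. The 0.85 default merely echoes the classification threshold mentioned for jbig2enc's -t switch.

```python
def similarity(a, b):
    """Fraction of pixels on which two equal-sized glyph bitmaps agree."""
    pixels_a = "".join(a)
    pixels_b = "".join(b)
    matches = sum(1 for x, y in zip(pixels_a, pixels_b) if x == y)
    return matches / len(pixels_a)


def build_symbol_index(glyphs, threshold=0.85):
    """Greedily map each glyph to the first stored symbol that is similar enough."""
    symbols = []      # one representative bitmap per class
    assignment = []   # symbol index chosen for each input glyph
    for glyph in glyphs:
        for index, symbol in enumerate(symbols):
            if similarity(glyph, symbol) >= threshold:
                assignment.append(index)
                break
        else:
            # No stored symbol is close enough, so this glyph starts a new class.
            symbols.append(glyph)
            assignment.append(len(symbols) - 1)
    return symbols, assignment


# Two scans of an "e" that differ by one pixel, plus an "o".
e1 = ("0110", "1001", "1110", "0111")
e2 = ("0110", "1001", "1110", "0110")
o = ("0110", "1001", "1001", "0110")
symbols, assignment = build_symbol_index([e1, o, e2])
```

Here both "e" bitmaps end up pointing at the same stored symbol, so the page only needs to store two shapes for three glyphs. An overzealous encoder corresponds to a threshold set so low that different letters get merged.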

Unfortunately, the current release version of JBIG2enc discards all DPI information from the input TIFF. That's my only complaint right now. That's been fixed in the Git version, but I've been having trouble getting it to compile in Windows.

Edit: Spamsickle, since you're using Acrobat, it compresses to JBIG2 without needing external software. In the "convert to PDF" settings for TIFF, you can choose from a few options for monochrome compression including both lossless and lossy profiles of JBIG2. Acrobat's file sizes seem fine to me. I think it's in the same region as JBIG2enc, but I can't remember if it's smaller or larger. Just open a bitonal TIFF in Acrobat and it will convert it to the compression format you selected automatically.

dingodog · Post by **dingodog** » 15 Oct 2010, 14:06

Misty wrote: I was going to suggest JBIG2enc too. It's much, much more efficient than Group4 or similar compression. My text-only pages are typically in the region of 6-20KB each, not 400KB. It's not quite as efficient as DjVu, which uses a very similar compression scheme, but it's definitely what I'd call "good enough."

However, despite the publicly available benchmarks, I have personally found certain cases where JBIG2 compression surpasses DjVu compression.

Misty wrote: Unfortunately, the current release version of JBIG2enc discards all DPI information from the input TIFF. That's my only complaint right now. That's been fixed in the Git version, but I've been having trouble getting it to compile in Windows.

I compiled a Linux version (including the patch for the resolution fix), and I prefer using jbig2enc on Linux.

I burned an ISO of the Puppy Linux live CD
- http://dokupuppylinux.co.cc/

then installed Python 2.5 (PET package)
- http://dokupuppylinux.co.cc/programs:python

and jbig2enc (if I remember correctly, the version available here is already compiled with the patch)
- http://dokupuppylinux.co.cc/programs:encoders

Anyway, I will soon add a further patched version of jbig2enc to the same page. This new patch (by akryukov) adds a new switch, -P, which sets the number of pages per dictionary (for long books, a single dictionary can make browsing the pages very slow). A modified version of pdf.py is also needed (I commented out line 27 so that it works without PIL).

-d --duplicate-line-removal: use TPGD in generic region coder
-p --pdf: produce PDF ready data
-P <number> --pages-per-dict <number>: pages per dictionary (default 15)
-s --symbol-mode: use text region, not generic coder
-t <threshold>: set classification threshold for symbol coder (def: 0.85)
-T <bw threshold>: set 1 bpp threshold (def: 188)
-r --refine: use refinement (requires -s: lossless)
-O <outfile>: dump thresholded image as PNG
-2: upsample 2x before thresholding
-4: upsample 4x before thresholding
-S: remove images from mixed input and save separately
-j --jpeg-output: write images from mixed input as JPEG
-v: be verbose
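As a rough illustration of what a pages-per-dictionary setting implies, the sketch below splits a book into consecutive groups of pages, each of which would get its own symbol dictionary. How the patched encoder actually forms its groups is not shown in this thread, so the consecutive grouping and the function name are assumptions; only the default of 15 comes from the option list above.

```python
def group_pages(pages, pages_per_dict=15):
    """Split a page list into consecutive groups that would each share one dictionary."""
    if pages_per_dict < 1:
        raise ValueError("pages_per_dict must be at least 1")
    return [pages[i:i + pages_per_dict]
            for i in range(0, len(pages), pages_per_dict)]


# A 40-page book with the default setting.
pages = ["page_%03d.tif" % n for n in range(1, 41)]
groups = group_pages(pages)
```

With 40 pages and the default of 15, this yields groups of 15, 15 and 10 pages, so a viewer jumping to a page only has to decode the dictionary for that page's group rather than one built from the whole book.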

Misty · Post by **Misty** » 18 Oct 2010, 12:46

While I wouldn't mind using Linux for this (actually, my *nix of choice is OS X), I'm using a Windows machine at work so that's what I need the software for. I'm also hoping to include it in my PDF making script, which will be for Windows to start out. Thanks for the suggestion, though!

clemd973 · Post by **clemd973** » 20 Oct 2010, 22:43

I'm trying to keep this part as low-effort as possible. Using my Mac, I've followed Univershul's "pre-processing" suggestion of using iPhoto to combine the images from the Left and Right cameras in sequential page order and export them ready for importing to Scan Tailor. If you're using a Mac, Univershul's process is all you need to prepare the images for Scan Tailor.

Currently I'm trying out Image to PDF Converter 3.1 for the final PDF.

Any suggestions on OCR (when and how to do it, and which program is best for it) would be appreciated. Yes, I'm a Newbie; so far my workflow is as follows:

Scan > Import to iPhoto > Put in sequential page order (thanks Univershul) > Export in sequential page order > Scan Tailor post processing > Save TIFFs > Convert to PDF.

Misty · Post by **Misty** » 22 Oct 2010, 14:41

Back on the topic of jbig2enc. Does anyone have experience with it? I'm having an extraordinary amount of trouble getting it to produce readable output. What I've been doing is

jbig2 -b basename -s -p page.tif
python pdf.py basename > basename.pdf

The result is a PDF that can't be opened in any software I've tried. I've tried updating to the latest pdf.py from the jbig2enc website, incorporating the binary stdout fixes for Windows, but no dice. Anyone had any luck in this?

Edit: Never mind, that was premature. I have a PDFMaker alpha successfully producing PDFs now! jbig2enc's output is indeed smaller than cjb2 for the same pages, though a little bigger than the proprietary DjVu Solo output.
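For context on the "binary stdout" problem mentioned above: on Windows, a stream opened in text mode rewrites every newline byte as a carriage return plus newline, which corrupts binary data such as the compressed streams inside a PDF. The thread does not show the actual fix, so the sketch below is only an assumption about the general approach: it switches stdout to binary mode with msvcrt.setmode, and a second helper simulates the damage text mode would cause.

```python
import os
import sys


def make_stdout_binary():
    """Switch stdout to binary mode on Windows; do nothing on other systems."""
    if sys.platform == "win32":
        import msvcrt  # Windows-only standard library module
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)


def text_mode_damage(data):
    """Simulate what Windows text-mode output does to a byte stream."""
    return data.replace(b"\n", b"\r\n")


make_stdout_binary()
sample = b"\x00\n\x01"
damaged = text_mode_damage(sample)
```

The damaged copy is one byte longer than the original, which is enough to throw off the byte offsets a PDF reader relies on.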

dingodog · Post by **dingodog** » 22 Oct 2010, 15:07

my syntax is

jbig2 -s -p -v *.tif ; pdf.py output > out.pdf

In bash on Linux, ";" means "run the next command after the previous one finishes", so I combine the two commands on one line.
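One detail worth knowing: with ";" the second command runs even if the first one failed, whereas "&&" runs it only after a success. The sketch below imitates both behaviours in Python. The function name is made up for illustration, and the two stand-in commands simply exit with status 1 and 0 in place of the real encoder and pdf.py.

```python
import subprocess
import sys


def run_sequence(commands, stop_on_failure=False):
    """Run commands in order and return their exit codes.

    stop_on_failure=False imitates "a ; b"; True imitates "a && b".
    """
    codes = []
    for argv in commands:
        code = subprocess.run(argv).returncode
        codes.append(code)
        if stop_on_failure and code != 0:
            break
    return codes


# Stand-ins for a failing first step and a succeeding second step.
FAIL = [sys.executable, "-c", "import sys; sys.exit(1)"]
OK = [sys.executable, "-c", "pass"]
```

With the ";" behaviour, a failed encoding run would still be followed by pdf.py, which may then build a PDF from stale or missing files; the "&&" behaviour avoids that.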

Have you used the Windows binaries provided by RubyPDF?

http://blog.rubypdf.com/2009/11/03/how- ... r-windows/

Anyway, using jbig2enc on Linux is easier than on Windows.

http://dokupuppylinux.co.cc/programs:encoders

Misty · Post by **Misty** » 22 Oct 2010, 15:09

Yes, it turns out RubyPDF had an updated version of the binaries that fixed the bug.

Like I mentioned before, while I like using Linux, Windows is what I have on my work computer and I need to work with what I have.

clemd973 · Post by **clemd973** » 22 Oct 2010, 20:39

OK, I'm stuck in the middle. I've prepared the images in my project, and I'm ready to "export" them from Scan Tailor to make a PDF. My question is: How do I export from Scan Tailor? Under "File", I see "New Project", "Open Project", "Save Project", "Save Project As", and "Close Project"; Under "Tools", I see "Debug"; and there is only "Help" remaining. Thanks.

Tulon · Post by **Tulon** » 23 Oct 2010, 03:24

Running batch processing on the Output stage generates the output files and puts them in the "out" subdirectory under your input directory (unless another output directory was explicitly specified).
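Once batch processing has finished, the generated pages can be gathered for a PDF tool with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and it assumes the output files carry a .tif or .tiff extension and that sorting by filename matches the page order.

```python
from pathlib import Path


def scan_tailor_outputs(input_dir, output_dir=None):
    """List the TIFFs written by the Output stage, in filename order."""
    # Default location: the "out" subdirectory under the input directory.
    out_dir = Path(output_dir) if output_dir else Path(input_dir) / "out"
    return sorted(p for p in out_dir.iterdir()
                  if p.suffix.lower() in (".tif", ".tiff"))
```

The resulting list can be handed to whichever converter builds the final PDF, for example the jbig2 and pdf.py commands discussed earlier in this thread.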

DIY Book Scanner
