
Scan Tailor

Scan Tailor specific announcements, releases, workflows, tips, etc. NO FEATURE REQUESTS IN THIS FORUM, please.
Amjad
Posts: 8
Joined: 31 Mar 2010, 01:49

Re: Scan Tailor

Post by Amjad » 04 Apr 2010, 12:55

daniel_reetz wrote:Amjad, I'm sure you saw the changelog?
I wasn't aware of that, actually. Thanks, that's exactly what I was looking for.
I used to read this, which seemed incomplete, but Tulon's link is the full version of that.

phaedrus
Posts: 56
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by phaedrus » 05 Apr 2010, 05:01

Tulon wrote: You would end up with huge PDFs then, no? What would be the size of a 300 page book with no pictures?
Tulon: Fair comment. FYI, using the method I mentioned, a 225-page book I recently did, complete with colour endpapers and a number of photos throughout, came out as a 23 MB PDF. I would expect a text-only book to be considerably smaller, possibly under the magic 10 MB figure. If I can find something to try it on I'll let you know.

Update: a 215-page text-only PDF from 600 DPI scans (IIRC) comes to 7.9 MB.

Cheers, P.

dtic
Posts: 463
Joined: 06 Mar 2010, 18:03

Re: Scan Tailor

Post by dtic » 05 Apr 2010, 07:33

Tulon wrote:
dtic wrote:For djvu and OCR there is already fine FOSS available it seems to me. See http://www.diybookscanner.org/forum/vie ... ?f=3&t=319. I'm no programmer but I made a GUI frontend that transforms Scan Tailor output into OCR'ed djvu in one manual step. I will post it very shortly.
I'd certainly like to see that. I expect serious shortcomings though, like inability to handle halftone pictures.
I was thinking of use on plain text book scans and I haven't tried it on books with photos or color. I'm sure there's much room for improvement and I'm thrilled by the idea of adding support for djvu or pdf output to Scan Tailor or some sibling tool. I've now posted my small scripted frontend, info here:
http://www.diybookscanner.org/forum/vie ... 9&start=10 . The output files are small. One example: a 500-page book turned into a 5 MB .djvu file (including the text layer from OCR).

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Scan Tailor

Post by spamsickle » 05 Apr 2010, 10:33

I am frankly amazed at these numbers. I have a 400-page book with lots of color illustrations that comes in at 2 GB for the PDF. An 800-page book, mostly text, with a handful of black and white illustrations, resulted in a 150 MB PDF. I'll try running some of these through the DJVU script that dtic posted and see what the results are, as long as it will run on ActiveState Perl and not require me to install Strawberry. I don't think I've ever had a book that started as JPEGs and went through Scan Tailor that came in at less than 100 MB for the final PDF.

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Scan Tailor

Post by Tulon » 06 Apr 2010, 18:06

phaedrus,

I was trying to reproduce your results with IrfanView + CutePDF, and unfortunately I wasn't able to do it. It looks like I hit some kind of limitation of the Windows printing system. After a certain number of pages (in my case 129) it just stops printing and the print job silently disappears. If you watch the print queue, it displays the raw data size sent to the "printer". It seems to stop printing when this value approaches 2 GB. As a result, I ended up with a 129-page PDF file 33 MB in size, which is not a great result by itself, not to mention I couldn't finish the job.

A 5 MB DJVU file for a 500-page, 600 DPI book produced by dtic is easier to believe, although I would expect a bit more, maybe 7 or 8 MB. Unfortunately his solution is quite complex and can only be recommended to a geeky crowd. I believe it also wouldn't be able to handle halftone illustrations.

So, I am probably going forward with my solution. Not that I want to - it's kind of a boring task, but it needs to be done.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.

dtic
Posts: 463
Joined: 06 Mar 2010, 18:03

Re: Scan Tailor

Post by dtic » 07 Apr 2010, 05:48

Tulon wrote: A 5 MB DJVU file for a 500-page, 600 DPI book produced by dtic is easier to believe, although I would expect a bit more, maybe 7 or 8 MB. Unfortunately his solution is quite complex and can only be recommended to a geeky crowd. I believe it also wouldn't be able to handle halftone illustrations.
Geeky setup, agreed. It could be simplified though. The Perl step just manipulates text, which any language can do (strider1551 did a Python analog). I gave it a shot for ten minutes but was too tired and so opted for a shortcut with Strawberry Perl.

So format.pl and (Strawberry) Perl can be made redundant.
ImageMagick only uncompresses the TIFFs for tesseract. It can probably also be made redundant.
That only leaves djvulibre (and tesseract if you aim for OCR).
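
For illustration only, the djvulibre-only route mentioned above might look roughly like the sketch below. This is not dtic's script: it is a minimal sketch that assumes the Scan Tailor output is a folder of bitonal TIFFs, that the djvulibre tools cjb2 and djvm are on the PATH, and that no OCR text layer is wanted. The function name djvu_commands and the default file name book.djvu are hypothetical. Since cjb2 only encodes bitonal images, a pipeline like this would not cope with halftone pictures.

```python
from pathlib import Path


def djvu_commands(tiff_dir, book="book.djvu", dpi=600):
    """Build the command lines for a bitonal DjVu book (hypothetical helper).

    Nothing is executed here. Each returned item is an argument list that
    could be passed to subprocess.run(cmd, check=True).
    """
    pages = sorted(Path(tiff_dir).glob("*.tif"))
    commands = []
    page_files = []
    for tif in pages:
        out = tif.with_suffix(".djvu")
        # cjb2 encodes one bitonal image into a single-page DjVu file.
        commands.append(["cjb2", "-dpi", str(dpi), str(tif), str(out)])
        page_files.append(str(out))
    if page_files:
        # djvm -c bundles the single-page files into one document.
        commands.append(["djvm", "-c", book, *page_files])
    return commands
```

Calling djvu_commands("out") returns one cjb2 command per page followed by a single djvm command. Adding the OCR text layer would still need tesseract plus a text-formatting step, which is left out here.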

phaedrus
Posts: 56
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by phaedrus » 07 Apr 2010, 06:34

Tulon wrote:I was trying to reproduce your results with IrfanView + CutePDF, and unfortunately I wasn't able to do it.
Hmmm, interesting, I'm reasonably sure there's nothing special with my system! FWIW I'm using an elderly P4 with Win2k & 512mb RAM and it does take a moment or three to produce a PDF (initially you'd think it's locked up) but I don't recall it ever flatly refusing to produce the file in its entirety.

If I get a moment I'll copy some of the ST output files from one of the books and try it on a laptop that's running XP, just in case that makes a difference for some reason. I'll also confirm the versions of Irfanview and CutePDF. In this instance Irfanview is simply a means to an end in allowing me to print the files to CutePDF - if ST were able to print its output then Irfanview wouldn't be needed for the task. In other words I'm not asking it to do any image manipulation or anything else that would affect the resultant PDF.

I could post the text-only PDF I mentioned (written in the 1800's) but I'm not sure what use that would really be, it most certainly is 7.9mb as quoted :-)

Cheers, P.

phaedrus
Posts: 56
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by phaedrus » 07 Apr 2010, 07:05

phaedrus wrote:If I get a moment I'll copy some of the ST output files from one of the books and try it on a laptop that's running XP, just in case that makes a difference for some reason. I'll also confirm the versions of Irfanview and CutePDF.
IrfanView 4.25, Ghostscript converter 8.15, CutePDF 2.8 as downloaded just now, older XP laptop also with 512 MB RAM. Fortunately I was able to log in and copy the ST images for the 215-page book (original was A5 format or thereabouts) to this machine and have just done the print to PDF, giving me a file size of 7.942 MB. FYI, in case you want to duplicate exactly what I do: in IrfanView I hit 'T' to get the thumbnail view, Ctrl-A to select all images, then right-click and tell it to batch-print (print selected files as single images). It's been a while since I did the first PDF on the Win2k machine but I'd say it was somewhat quicker on this laptop (which I'd expect) and it didn't really appear to baulk at all. Incidentally, the images were all 600 DPI and I printed 'best fit to page' to an A4 page size.

I also remembered I did a 323-page book and have checked that, it was a 19mb PDF with some full-size photographs scattered throughout the pages.

HTH, P.

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Scan Tailor

Post by spamsickle » 07 Apr 2010, 12:56

Could someone post some "upstream" numbers for these wonders? For example, I have a book with nothing but black-and-white text plus color cover, which generated 312 3264x2448 JPEGs totaling 512 MB. Since I discard the first and last image, Scan Tailor converted that into 310 TIFF files totaling 120 MB. ImageMagick turned that into 310 PDFs totaling 135 MB, which pdftk concatenated into a final PDF book that was also 135 MB.

Maybe I'm just starting with bigger pictures or something.

I could see getting a 20 MB file out of it if I was doing OCR and creating a new PDF by printing a text file, but I'm keeping an image of the original page with its original formatting.
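
To put the figures quoted so far in this thread on the same footing, here is a rough per-page comparison. The totals and page counts are taken from the posts above; the only assumptions are that "mb" means megabytes and that 1 MB is counted as 1000 KB. The helper name kb_per_page is made up for this sketch.

```python
# (label, total size in MB, page count), all as quoted in this thread.
REPORTS = [
    ("phaedrus, text-only PDF via IrfanView + CutePDF", 7.9, 215),
    ("dtic, OCR'ed DjVu", 5.0, 500),
    ("spamsickle, Scan Tailor TIFFs", 120.0, 310),
    ("spamsickle, PDF via ImageMagick + pdftk", 135.0, 310),
]


def kb_per_page(total_mb, pages):
    """Average size of one page in KB (assumes 1 MB = 1000 KB)."""
    return total_mb * 1000.0 / pages


for label, total_mb, pages in REPORTS:
    print(f"{label}: {kb_per_page(total_mb, pages):.1f} KB/page")
```

The averages say nothing about why the files differ (bit depth, compression and pixel dimensions all play a part), only by how much.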

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Scan Tailor

Post by Tulon » 07 Apr 2010, 14:01

phaedrus,

I think the reason it works for you but not for me is because I tried it with Mixed output. I've just checked, and it turns out in this mode Scan Tailor only chooses between 8 bit grayscale and 24 bit RGB output, even for completely B/W pages. Even though this hardly affects the size of LZW compressed TIFFs, it does increase the logical data size 8 times. That would be the reason I hit the apparent 2GB limit and you don't.

In addition, this method doesn't allow me to get exactly the page size I want. It would instead fit the pages into the standard size (like A4) of your choice.
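
The 8x difference in logical data size between 1-bit and 8-bit output can be checked with a little arithmetic. The sketch below is only an estimate: the page size (roughly A5, 5.8 x 8.3 inches) and the 600 DPI resolution are assumed values for illustration, the function name raw_page_bytes is hypothetical, and what the Windows print spooler really sends per page has not been verified.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one page image in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


LIMIT = 2 * 1024**3  # the apparent 2 GB limit, taken here as 2 GiB

# Assumed page: roughly A5 (5.8 x 8.3 in) at 600 DPI.
for label, bpp in [("1-bit B/W", 1), ("8-bit grayscale", 8), ("24-bit RGB", 24)]:
    size = raw_page_bytes(5.8, 8.3, 600, bpp)
    pages = LIMIT // size
    print(f"{label}: {size / 1e6:.1f} MB/page, about {pages} pages before 2 GB")
```

With these assumed dimensions, 8-bit grayscale pages reach the limit after eight times fewer pages than 1-bit pages do.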
