invalid tif output

lexicographer

invalid tif output

Post by lexicographer » 10 Nov 2010, 16:21

I used to write the output of ST as black and white and have recently switched to Color/Greyscale with white margins and equalized illumination. I use the output TIFFs for OCR with ABBYY FineReader 8.0. The problem is that a few files, randomly distributed (maybe one in fifty), are always reported by FineReader as corrupted and unreadable. The absurd thing is that I can then open them with IrfanView without any trouble, convert them to PNG, and OCR the PNG files. This happens on all my machines, so it is not specific to one machine, although they all run the same system (Win XP SP2).
This is of course not a big problem, since I have a perfectly good workaround; still, one wishes it would not happen (and it never happened with the black-and-white files).
I have put two files from my last project online: ftp://ftp.lrz-muenchen.de/transfer/dir- ... rupted.tif and ftp://ftp.lrz-muenchen.de/transfer/dir-nlw/252_ok.tif
Thanks for any suggestions.

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK
Contact:

Re: invalid tif output

Post by Tulon » 10 Nov 2010, 16:38

There is nothing special about your files. GIMP opens them just fine, without a single warning. It looks like a bug in FineReader.
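One way to check that the two files do not differ structurally is to dump a few header tags from each and compare them. The sketch below is only an illustration and was not used in this thread: it assumes classic (non-BigTIFF) files, reads only the first image directory, and decodes only single-valued SHORT, LONG and RATIONAL fields. The name read_tags and the selection of tags are hypothetical choices made for this example.

```python
import struct

# A few baseline tag numbers from the TIFF 6.0 specification.
TAG_NAMES = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
    282: "XResolution",
    283: "YResolution",
    296: "ResolutionUnit",
}


def _decode(data, endian, typ, count, raw):
    """Decode a single-valued SHORT, LONG or RATIONAL field; None otherwise."""
    if count != 1:
        return None
    if typ == 3:  # SHORT: stored in the first two bytes of the value field
        return struct.unpack(endian + "H", raw[:2])[0]
    if typ == 4:  # LONG: stored directly in the value field
        return struct.unpack(endian + "I", raw)[0]
    if typ == 5:  # RATIONAL: value field holds an offset to numerator/denominator
        (offset,) = struct.unpack(endian + "I", raw)
        num, den = struct.unpack_from(endian + "II", data, offset)
        return num / den if den else None
    return None


def read_tags(data: bytes) -> dict:
    """Return {tag name: decoded value} for known tags in the first IFD."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file: bad magic number")
    (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(n_entries):
        # Each directory entry is 12 bytes: tag, type, count, value/offset.
        tag, typ, count, raw = struct.unpack_from(
            endian + "HHI4s", data, ifd_offset + 2 + 12 * i
        )
        if tag in TAG_NAMES:
            tags[TAG_NAMES[tag]] = _decode(data, endian, typ, count, raw)
    return tags
```

Calling read_tags on the bytes of the 'corrupted' file and of the 'ok' file and comparing the two dictionaries would show whether they differ in compression, bit depth or resolution. It says nothing about the pixel data itself.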
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.

lexicographer

Re: invalid tif output

Post by lexicographer » 10 Nov 2010, 16:59

Thanks for taking the trouble to look at them.

daniel_reetz
Posts: 2797
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States
Contact:

Re: invalid tif output

Post by daniel_reetz » 10 Nov 2010, 17:53

Are they the same files every time? Windows Explorer can sometimes lock TIFF files in a really unpleasant way. I ran into this recently with MATLAB -- see the post here: http://blogs.mathworks.com/steve/2010/1 ... pend-loop/

lexicographer

Re: invalid tif output

Post by lexicographer » 11 Nov 2010, 05:10

They are the same files, insofar as a file that FineReader considers corrupt once will be considered corrupt every time and on every machine I have. So this seems to be a characteristic of the files. At the same time I can open the same files, whether 'corrupt' or not, with another program, in my case IrfanView, and they appear perfectly fine (this is consistent with what Tulon tested), so this is a pretty esoteric phenomenon (I won't even call it a bug any more). I have not tested whether ST would produce identical 'corrupted' files if I processed the original JPGs again.
After reading the entry on MATLAB I suspect that in my case it may simply be that my computers don't have enough memory (since the color/greyscale output files are about ten times bigger than the b/w files I produced earlier), that another program interferes with the ST output, or ...
In any case, the greyscale output of ST is simply superb, so I gladly trade this slight hitch for more useful files, and, as I said in my first post, converting the TIFFs to PNG solves the problem.

lexicographer

Re: invalid tif output

Post by lexicographer » 23 Nov 2010, 10:47

I have now run a further test. I start with greyscale JPGs, and the ST output is set to Color/Greyscale with equalized illumination. This produces smaller output than the color images I used formerly (which are pointless anyway, since I photograph books without pictures), and greyscale images seem to produce better OCR results than bitonal ones. Same result: out of about 800 pages, about 15 TIFFs cannot be read by FineReader (these are, as said earlier, perfectly fine when opened with other graphics programs). If I now repeat the ST run with the originals of some of the 'defect' TIFFs, the second series of TIFFs is perfectly legible by FineReader. I suppose that means that this is neither ST's nor FineReader's fault, but that my computers are simply not powerful enough.

Tulon

Re: invalid tif output

Post by Tulon » 23 Nov 2010, 11:12

lexicographer wrote: If I now repeat the ST run with the originals of some of the 'defect' tifs, the second series of tifs is perfectly legible by Finereader. I suppose that means that this is neither ST's nor Finereader's fault, but my computers are simply not powerful enough.
Wrong conclusion. You obviously produced different files on the two runs. Some files do trigger a bug in FineReader while others don't.
Now why were the files different? Two possibilities:
1. You didn't re-use the same project file.
2. You did, but floating point values get rounded when saved to the project file, so you still can get minor differences.
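Whether two runs really produced different files can be checked directly by hashing the outputs. The following is a minimal sketch, assuming the two runs were written to two separate directories with matching file names; the function names and the directory layout are hypothetical and not part of Scan Tailor.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large TIFFs need not fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(dir_a, dir_b, pattern="*.tif"):
    """Return names present in both directories whose contents differ."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    names = sorted(
        p.name for p in dir_a.glob(pattern) if (dir_b / p.name).is_file()
    )
    return [n for n in names if sha256_of(dir_a / n) != sha256_of(dir_b / n)]
```

An empty list means the second run reproduced the first byte for byte; any names returned are pages whose output changed between runs, which would be consistent with either of the two possibilities above.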

lexicographer

Re: invalid tif output

Post by lexicographer » 23 Nov 2010, 11:53

You are right, of course. I used the same project files, but naturally the output would have had different sizes the second time.
