I used to write the output of ST as black and white and have recently switched to Color/Greyscale with white margins and equalized illumination. I use the output TIFFs for OCR with ABBYY FineReader 8.0. The problem is that there are always a few files, randomly distributed (maybe one in fifty), which FineReader claims are corrupted and cannot be read. The absurd thing is that I can then open them with IrfanView without any trouble, convert them to PNG, and OCR the PNG files. This happens on all my machines, so it is not machine-specific, although they all run the same system (Win XP SP2).
This is of course not a big problem, since I have a perfectly good workaround; still, one wishes it would not happen (and it never happened with the black/white files).
I have put two files from my last project online: ftp://ftp.lrz-muenchen.de/transfer/dir- ... rupted.tif and ftp://ftp.lrz-muenchen.de/transfer/dir-nlw/252_ok.tif
Thanks for any suggestions.
invalid tif output
Moderator: peterZ
Re: invalid tif output
There is nothing special about your files. GIMP opens them just fine, without a single warning. Looks like it's a bug in FineReader.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
- daniel_reetz
Re: invalid tif output
Are they the same files every time? Windows Explorer can sometimes lock TIFF files in a really unpleasant way. I ran into this recently with MATLAB -- see the post here: http://blogs.mathworks.com/steve/2010/1 ... pend-loop/
Re: invalid tif output
They are the same files, insofar as a file which FineReader considers corrupt once will be considered corrupt by FineReader every time and on every machine I have. So this seems to be a characteristic of the files. At the same time I can open the same files, whether corrupt or not, with another program, in my case IrfanView, and they appear perfectly fine (this is consistent with what tulon tested), so this is a pretty esoteric phenomenon (I won't even call it a bug any more). I have not tested whether ST would produce identical 'corrupted' files if I processed the original JPGs again.
After having read the entry on MATLAB I suspect that in my case it may simply be that my computers don't have enough memory (since the color/greyscale output files are about ten times bigger than the b/w files I produced earlier), that another program interferes with the ST output, or ...
In any case, the greyscale output of ST is simply superb, so I gladly trade this slight hitch for more useful files, and, as I said in my first post, converting the TIFFs to PNG solves the problem.
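If the problem really is a characteristic of the files, one way to look for it is to dump the header tags of a rejected TIFF and of an accepted one and compare them. The following is only a rough sketch using the Python standard library. It assumes classic (non-BigTIFF) files, reads only the first image directory, and decodes only single SHORT/LONG values stored inline. The function name `read_first_ifd` is made up for illustration, and nothing here says which tag, if any, FineReader actually objects to.

```python
import struct

# A few baseline tag numbers from the TIFF 6.0 specification.
TAG_NAMES = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
    296: "ResolutionUnit",
}

def read_first_ifd(data):
    """Return {tag: value} for the first IFD of a classic TIFF.

    The value is an int for single SHORT/LONG entries and None for
    anything stored elsewhere in the file (not followed in this sketch).
    """
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF: bad byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF: magic is %d" % magic)
    (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(n_entries):
        # Each IFD entry is 12 bytes: tag, field type, count, value/offset.
        tag, ftype, count, raw = struct.unpack_from(
            endian + "HHI4s", data, ifd_offset + 2 + 12 * i)
        if count == 1 and ftype == 3:      # SHORT
            tags[tag] = struct.unpack(endian + "H", raw[:2])[0]
        elif count == 1 and ftype == 4:    # LONG
            tags[tag] = struct.unpack(endian + "I", raw)[0]
        else:
            tags[tag] = None
    return tags

def describe(tags):
    """Format the tag dictionary as readable 'Name = value' lines."""
    return ["%s = %r" % (TAG_NAMES.get(t, "Tag%d" % t), v)
            for t, v in sorted(tags.items())]
```

Reading both files with `open(path, "rb").read()`, passing the bytes to `read_first_ifd`, and comparing the two dictionaries would show whether the rejected file differs in something like compression or bit depth. If the tags are identical, the difference is more likely in the image data itself.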
Re: invalid tif output
I have now run a further test. I start with greyscale JPGs, and the ST output is set to Color/Greyscale with equalized illumination. This produces smaller output than the color images I used formerly (which are pointless anyway, since I photograph books without pictures), and greyscale images seem to produce better OCR results than bitonal ones. Same result: out of ca. 800 pages, ca. 15 TIFFs cannot be read by FineReader (these are, as said earlier, perfectly fine when opened with other graphics programs). If I now repeat the ST run with the originals of some of the 'defect' tifs, the second series of tifs is perfectly legible by FineReader. I suppose that means that this is neither ST's nor FineReader's fault, but my computers are simply not powerful enough.
Re: invalid tif output
lexicographer wrote: If I now repeat the ST run with the originals of some of the 'defect' tifs, the second series of tifs is perfectly legible by FineReader. I suppose that means that this is neither ST's nor FineReader's fault, but my computers are simply not powerful enough.
Wrong conclusion. You obviously produced different files on the two runs. Some files do trigger a bug in FineReader while others don't.
Now why were the files different? Two possibilities:
1. You didn't re-use the same project file.
2. You did, but floating-point values get rounded when saved to the project file, so you can still get minor differences.
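Whether the two runs really produced different files can be checked directly by comparing checksums. A minimal sketch, assuming the first and second run were written to two separate output directories with matching file names; the function names `sha256_of` and `changed_files` are made up for illustration and are not part of Scan Tailor:

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large greyscale TIFFs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(dir_a, dir_b, pattern="*.tif"):
    """Return names present in both directories whose bytes differ."""
    changed = []
    for a in sorted(Path(dir_a).glob(pattern)):
        b = Path(dir_b) / a.name
        if b.exists() and sha256_of(a) != sha256_of(b):
            changed.append(a.name)
    return changed
```

Any file name this returns was not reproduced byte for byte on the second run, which would be consistent with the rejected and accepted versions of a page simply being different files.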
Re: invalid tif output
You are right, of course. I used the same project files, but naturally the output would have had different sizes the second time.