Tim wrote: I think one of the reasons scantailor does help sometimes is the amount of "smoothing" it does do, though less than clearscan it seems. That can take a letter which is a bit hard to recognize and make it more clear, both to the eye and to the OCR process. I can't recall the technical term for the smoothing process, there's some standard Photoshop/GIMP name for it.
This clearscan option sounds really interesting. Acrobat is fairly expensive, no?
If it's something that's common to Photoshop and GIMP, you may be thinking of anti-aliasing. I don't think that's technically what Scan Tailor is doing; I think it's just being smart about what it chooses as foreground (font) and background. With binary output such as Scan Tailor is generating for text, the only anti-aliasing available is dithering, and it's clearly not doing that.
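To make the difference concrete, here is a rough Python sketch of what binary output means. This is not Scan Tailor's actual algorithm (I don't know what it uses internally); it is just a plain fixed threshold, and the pixel values and the `binarize` helper are made up for illustration. The point is that every in-between gray along the edge of a stroke gets forced to either ink or paper, so there is nothing left for anti-aliasing to work with.

```python
# Hypothetical strip of grayscale pixels (0 = black ink, 255 = white paper)
# running across the blurry edge of one letter stroke.
row = [250, 240, 200, 120, 40, 10, 15, 90, 180, 235, 248]


def binarize(pixels, threshold=128):
    """Fixed-threshold binarization: 0 for ink, 1 for paper."""
    return [0 if p < threshold else 1 for p in pixels]


def count_levels(pixels):
    """Number of distinct pixel values present in the strip."""
    return len(set(pixels))


binary_row = binarize(row)
print(binary_row)
print(count_levels(row), "levels in,", count_levels(binary_row), "levels out")
```

Anti-aliasing would keep those intermediate grays along the edge; binarization throws them away by design.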
I think Acrobat goes for about $250 on the Adobe site, and more like $150 in the real world. I see that my version 5.0 doesn't qualify for upgrade pricing (missed it by one version), and my next choice would be to shamelessly exploit my school-age kids to get the academic pricing. I could shamefully go the keygen / registry edit route, but I have a low shame threshold. Actually, I see that I have a Version 8.0 Acrobat that came with my ScanSnap, which I never installed, so maybe I can get the upgrade pricing after all...
Tim wrote: JJJM wrote: The interesting thing about clearscan is that you get a lighter pdf which is not made of images but vectors, but it is not an OCR till you export as text. For example, I could see the word "life" on the screen when it is pdf, but if I copy and paste that word into a wordprocessor it becomes "lije" which was an OCR error. Funny!
This just means there is an OCR layer and an image layer. The image layer is shown to you and the OCR layer is only displayed/output if the text is selected to be copied/pasted or exported. That means the OCR errors are always there, it just doesn't matter if your needs involve only/primarily viewing the document (the image layer). The OCR errors can still be a problem if the text is needed for other purposes, say accessibility. I'm curious if a document processed with clearscan can be reprocessed with another OCR package to improve the OCR results if needed.
After playing around with this a bit this morning, I agree with Tim: you already have the OCR in the ClearScan version; you just may not have realized it. The tip-off is that the ClearScan version is text searchable.
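Here is a toy model of that two-layer idea in Python, using JJJM's "life" / "lije" example. It is only a sketch under my own assumptions: the `Glyph` class and the three helper functions are hypothetical and have nothing to do with how Acrobat actually stores a page. Each glyph carries the shape you see and, separately, the character the OCR engine assigned to it.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    drawn_as: str  # what the shape on screen looks like to a reader
    ocr_text: str  # what the OCR engine decided the shape means


# Hypothetical word from a page: the "f" shape was misread as "j".
word = [Glyph("l", "l"), Glyph("i", "i"), Glyph("f", "j"), Glyph("e", "e")]


def displayed(glyphs):
    """What a viewer shows: only the drawn shapes."""
    return "".join(g.drawn_as for g in glyphs)


def copied(glyphs):
    """What copy/paste or text export returns: only the OCR text."""
    return "".join(g.ocr_text for g in glyphs)


def is_found(glyphs, query):
    """Searching runs against the OCR text, not the shapes."""
    return query in copied(glyphs)


print(displayed(word))         # life
print(copied(word))            # lije
print(is_found(word, "life"))  # False
```

So the page reads correctly on screen, but a search for "life" misses it, which is the same mismatch JJJM ran into when pasting.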
And you CAN reprocess a ClearScan PDF, at least sometimes. Actually, probably most of the time, if all you're dealing with is vanilla text. I doubt that I'll be doing any of that, because the ClearScan output is already better than what I was settling for before, I don't need accessibility (yet, anyway), and 99% of what I'd search for is already clean.
The exceptions range from easy enough to fix to crashing your OCR software (at least, they can crash my ABBYY FineReader 9.0). The problem is that ClearScan creates a custom font. An example will illustrate:
On the left is the ClearScan output; on the right is the same text after it was reprocessed by ABBYY. It may look like ABBYY choked, but in fact most of those errors were already in the ClearScan version; they were just masked because the custom font made the wrong text look right. If you look at the actual text underlying that 5A delta that ABBYY identifies, it's in the ClearScan text as .:l delta.
ABBYY processes this bit of the file okay, though the result will be a pain in the butt to clean up. Since I'm not the kind of uber-geek that knows the Unicode for getting an umlaut in my uber, much less the code for delta, I'm not going to be searching on those Greek symbols. It's not really worth my time to clean it up, even though the native font (Roman) chosen by ABBYY looks a lot better than the custom font generated by ClearScan. If you can't live with that level of quality, or you do need to make the results accessible, maybe it will be worth it to you to get things cleaned up.
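For anyone who does want to search on those symbols, the code points are easy to look up. A small Python sketch using the standard `unicodedata` module (the `describe` helper is just a name I made up for illustration):

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# u with umlaut, capital delta, small delta
for ch in ["\u00fc", "\u0394", "\u03b4"]:
    print(ch, describe(ch))
```

This prints U+00FC for the u with umlaut (Unicode calls it a diaeresis), U+0394 for capital delta, and U+03B4 for small delta. How you type a code point into a search box depends on the program and operating system, so I won't guess at that part.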
Now, in this particular book, a few pages later there are some hand-drawn characters, for which ClearScan generated custom fonts that look pretty much like the originals. I think it appropriated some upper-range Unicode characters to do it, though, because if you look at the text in that area it's complete hieroglyphics. When I try to read that page with ABBYY, the program just crashes. Maybe if I segregated this section as graphics before I asked ABBYY to read it, I could get it to work, but once again, I'm not going to be searching on this stuff, and it's not worth it to me to spend the time tweaking my way toward perfection when good enough is more than good enough for me.
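If you wanted to find pages like that before feeding them to another OCR program, one possible check is to look at the copied-out text for Private Use Area characters. This is a sketch under a big assumption: I'm only guessing that the "upper-range" characters are in Unicode's Private Use Area, and I haven't verified what ClearScan actually does. The sample strings and the `private_use_ratio` helper are made up for illustration.

```python
import unicodedata


def private_use_ratio(text):
    """Fraction of non-space characters in the Unicode category Co (private use)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    private = sum(1 for c in chars if unicodedata.category(c) == "Co")
    return private / len(chars)


# Hypothetical text copied from two pages of a PDF.
normal_page = "delta function"
suspect_page = "\ue001\ue002\ue003 delta"

print(private_use_ratio(normal_page))
print(private_use_ratio(suspect_page))
```

A page with a high ratio would be a candidate for marking as graphics, or skipping, before reprocessing. Whether that would actually keep ABBYY from crashing is something I haven't tested.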