Basic Guide to Workflow ?

JJJM · Post by **JJJM** » 19 Sep 2010, 02:36

spamsickle wrote:I didn't know about the Clearscan option in Acrobat either, but that's not surprising: the last version of Acrobat I bought was 5.0, and this option was apparently introduced in version 9. Finally, maybe I have a reason to upgrade...

I'm curious -- a couple of people in this thread have mentioned that they think pre-processing with Scan Tailor gives them better OCR results. I haven't tended to do OCR on my scans, so I don't have any results of my own, but I'd think if one was using the Clearscan option, one would want the original images rather than the Scan Tailor binarizations, as I would expect those to produce better vector fonts. Is it possible to pipe the post-Scan Tailor text into a Clearscan Acrobat step, and get the best of both worlds (post Scan Tailor OCR, plus pre Scan Tailor vector fonts)?

I don't follow you. Which order are you asking about?

1. Scan Tailor first and then Clearscan. If so, this is the one I have tried, and it works perfectly: when you apply Scan Tailor you get a cleaner and more homogeneous scan, which suits Clearscan well.
2. Clearscan first and then Scan Tailor.

The interesting thing about Clearscan is that you get a lighter PDF which is made not of images but of vectors, but it is not OCR until you export it as text. For example, I could see the word "life" on the screen in the PDF, but if I copy and paste that word into a word processor it becomes "lije", which was an OCR error. Funny!

The most interesting thing for me is that this PDF works very well on e-readers (Kindle and Sony): page turns are very fast, and the text crops and adapts to the e-reader screen very well. What is more, you are seeing the original picture of the book, with no errors due to OCR conversion.

I am very happy with this clearscan option as it saves me hours of OCR postprocessing.

From now on, my workflow will just be taking pictures, good Scan Tailor pre-processing (have a look at the video tutorial posted previously), and a final Clearscan pass.

spamsickle · Post by **spamsickle** » 19 Sep 2010, 04:18

JJJM wrote: The interesting thing about Clearscan is that you get a lighter PDF which is made not of images but of vectors, but it is not OCR until you export it as text. For example, I could see the word "life" on the screen in the PDF, but if I copy and paste that word into a word processor it becomes "lije", which was an OCR error.

I may be misunderstanding how Clearscan works. From the description here http://acrobatusers.com/blogs/duffjohns ... -clearscan
I got the impression that OCR must be done before Clearscan's fonts are created. If that were the case, though, I'd expect you'd either see "lije" in your PDF, or a substantial number of the "F"s in your document would be misidentified as "J"s. As you describe it, I'd sort of expect the latter to be the case -- that the OCR process had identified large numbers of "F"s as "J"s (maybe only italic lowercase Fs, for example), but then the process of vectorizing them had still made them look like what is in the original image.

The reason I expect that you'd want to vectorize the raw image rather than the Scan Tailor image is that the individual letters would be more true to the original. The example PDF from the blog I just linked is here http://www.scientificamerican.com/media ... -story.pdf. The output font is remarkably clean:

Attachment: shot2.jpg (61.76 KiB)

A similar shot from my own Scan Tailor output for a different document has a much coarser appearance:

Attachment: shot1.jpg (55.59 KiB)

It's possible that Scan Tailor is capable of producing better quality than this, and I have done something wrong (like choosing the wrong DPI, for instance, or having a lower-resolution scan to begin with) which has caused my results to be less than they should be. Could you post an example of the results you have obtained from Clearscanning Scan Tailor output?

JJJM · Post by **JJJM** » 19 Sep 2010, 07:23

spamsickle wrote:It's possible that Scan Tailor is capable of producing better quality than this, and I have done something wrong (like choosing the wrong DPI, for instance, or having a lower-resolution scan to begin with) which has caused my results to be less than they should be. Could you post an example of the results you have obtained from Clearscanning Scan Tailor output?

Here you go. For me, it is more than enough for an e-reader.

Attachment: ScreenShot001.jpg (173.47 KiB)

spamsickle · Post by **spamsickle** » 19 Sep 2010, 07:38

I discovered that Adobe makes Acrobat 9 available in a free 30-day trial version, and downloaded it. It looks like the Clearscan option does some smoothing of the "coarse" output; after processing, the sample I posted before looks like this:

Attachment: Shot3.jpg (60.68 KiB)

Going back and processing the original image actually resulted in worse quality; I got the graywashed background, and the foreground fonts weren't appreciably better.

The great news is that processing with Clearscan not only improved the appearance, but took my filesize from 255 MB down to 16 MB. I keep color front and back covers; I expect if I was only saving the text inside, the reduction would have been even more dramatic.
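To put those two file sizes in perspective, here is a quick back-of-the-envelope calculation. The function name is made up for illustration; the only inputs are the 255 MB and 16 MB figures quoted above.

```python
def size_reduction(before_mb, after_mb):
    """Return (shrink factor, percent saved) for a before/after file size."""
    factor = before_mb / after_mb
    percent_saved = (1 - after_mb / before_mb) * 100
    return factor, percent_saved


# Figures from the post: 255 MB before Clearscan, 16 MB after.
factor, saved = size_reduction(255, 16)
print(f"{factor:.1f}x smaller, {saved:.1f}% saved")  # 15.9x smaller, 93.7% saved
```

So this particular file ended up at roughly one sixteenth of its original size, color covers included.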

I'll play around with it for the next couple of weeks to see if there are any unexpected gotchas, but it looks like this Clearscan capability is going to be worth a purchase.

Tim · Post by **Tim** » 19 Sep 2010, 09:08

spamsickle wrote:I'm curious -- a couple of people in this thread have mentioned that they think pre-processing with Scan Tailor gives them better OCR results. I haven't tended to do OCR on my scans, so I don't have any results of my own, but I'd think if one was using the Clearscan option, one would want the original images rather than the Scan Tailor binarizations, as I would expect those to produce better vector fonts. Is it possible to pipe the post-Scan Tailor text into a Clearscan Acrobat step, and get the best of both worlds (post Scan Tailor OCR, plus pre Scan Tailor vector fonts)?

Theoretically, OCR should give better results when it works on the original images, because the original has the most information to work with. But to take advantage of that, the binarization, despeckling, and other pre-processing that OCR packages do automatically must be done optimally, or at least better than another process would do it. What happens is that for some image workflows and scanner setups, the processing that Scan Tailor does provides a better input to the OCR program than the OCR program's own automatic processing does. It doesn't always happen, and that makes sense too. The best OCR comes when the processing algorithms are well suited to the input images and the parameters are tuned as well as possible, which may mean by hand. I think one of the reasons scantailor does help sometimes is the amount of "smoothing" it does do, though less than clearscan it seems. That can take a letter which is a bit hard to recognize and make it more clear, both to the eye and to the OCR process. I can't recall the technical term for the smoothing process, there's some standard Photoshop/GIMP name for it.
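For readers who have not met the term, despeckling (one of the pre-processing steps named above) can be sketched in a few lines. This is a deliberately naive version that assumes an already binarized page stored as a list of rows, with 1 for ink and 0 for paper. It is not the algorithm Scan Tailor or any OCR package actually uses; it only shows the idea of dropping ink pixels that stand completely alone.

```python
def despeckle(binary):
    """Return a copy of a 0/1 image with isolated ink pixels removed.

    A pixel counts as a speck if none of its (up to) 8 neighbours is ink.
    """
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]  # work on a copy, leave the input alone
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1:
                continue
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


# A vertical stroke (part of a letter) plus one stray speck in the corner.
page = [
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
cleaned = despeckle(page)
for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

The stroke survives because each of its pixels touches another ink pixel; the corner speck does not. Real tools work with connected-component sizes rather than single pixels, which is where the hand tuning mentioned above comes in.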

This clearscan option sounds really interesting. Acrobat is fairly expensive, no?

Tim · Post by **Tim** » 19 Sep 2010, 09:14

JJJM wrote:The interesting thing about Clearscan is that you get a lighter PDF which is made not of images but of vectors, but it is not OCR until you export it as text. For example, I could see the word "life" on the screen in the PDF, but if I copy and paste that word into a word processor it becomes "lije", which was an OCR error. Funny!

This just means there is an OCR layer and an image layer. The image layer is shown to you and the OCR layer is only displayed/output if the text is selected to be copied/pasted or exported. That means the OCR errors are always there, it just doesn't matter if your needs involve only/primarily viewing the document (the image layer). The OCR errors can still be a problem if the text is needed for other purposes, say accessibility. I'm curious if a document processed with clearscan can be reprocessed with another OCR package to improve the OCR results if needed.
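The two-layer idea can be shown with a toy model. This is not how a PDF is actually structured internally; the `Word` class and both helper functions are invented for illustration. Each word simply carries what the page shows and, separately, what the OCR layer stored.

```python
from dataclasses import dataclass


@dataclass
class Word:
    shown: str   # what you read on screen (the rendered shapes)
    hidden: str  # what the OCR layer holds (used for copy, search, export)


def copy_text(words):
    """Copy/paste and export read the hidden layer, not the visible one."""
    return " ".join(w.hidden for w in words)


def search(words, query):
    """Searching also runs against the hidden layer; returns matching indexes."""
    return [i for i, w in enumerate(words) if query in w.hidden]


page = [Word("real", "real"), Word("life", "lije")]

print(copy_text(page))       # real lije
print(search(page, "life"))  # []
print(search(page, "lije"))  # [1]
```

The reader sees "life", but a search for "life" finds nothing, which is exactly why the errors only matter once the text itself is needed.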

spamsickle · Post by **spamsickle** » 19 Sep 2010, 16:05

Tim wrote:I think one of the reasons scantailor does help sometimes is the amount of "smoothing" it does do, though less than clearscan it seems. That can take a letter which is a bit hard to recognize and make it more clear, both to the eye and to the OCR process. I can't recall the technical term for the smoothing process, there's some standard Photoshop/GIMP name for it.

This clearscan option sounds really interesting. Acrobat is fairly expensive, no?

If it's something that's common to Photoshop and GIMP, you may be thinking of anti-aliasing. I don't think that's technically what Scan Tailor is doing; I think it's just being smart about what it chooses as foreground (font) and background. With binary output such as Scan Tailor is generating for text, the only anti-aliasing available is dithering, and it's clearly not doing that.
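To make the distinction concrete, here is a small sketch comparing a plain threshold with Floyd-Steinberg error-diffusion dithering on a flat gray patch. Both functions are generic textbook versions written for this example, and the cutoff of 128 is an arbitrary choice; nothing here is taken from Scan Tailor's code.

```python
def threshold(gray, cutoff=128):
    """Plain binarization: every pixel becomes pure black (0) or white (255)."""
    return [[255 if px >= cutoff else 0 for px in row] for row in gray]


def floyd_steinberg(gray):
    """Binarize, but push each pixel's rounding error onto its neighbours."""
    h, w = len(gray), len(gray[0])
    work = [[float(px) for px in row] for row in gray]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = work[y][x]
            new = 255 if old >= 128 else 0
            out[y][x] = new
            err = old - new
            if x + 1 < w:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# A uniform dark-gray patch, e.g. the soft edge of a letter.
patch = [[100] * 8 for _ in range(8)]
flat = threshold(patch)
dithered = floyd_steinberg(patch)

for row in dithered:
    print("".join("." if px else "#" for px in row))
```

The threshold turns the whole patch solid black, while the dithered version is a scatter of black and white pixels that only looks gray from a distance. Scatter like that is not visible in the Scan Tailor output shown earlier, which fits the observation that it is not dithering.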

I think Acrobat goes for about $250 on the Adobe site, and more like $150 in the real world. I see that my version 5.0 doesn't qualify for upgrade pricing (missed it by 1), and my next choice would be to shamelessly exploit my school-age kids to get the academic pricing. I could shamefully go the keygen / registry edit route, but I have a low shame threshold. Actually, I see that I have a Version 8.0 Acrobat that came with my ScanSnap, which I never installed, so maybe I can get the upgrade pricing after all...

Tim wrote:
JJJM wrote:The interesting thing about Clearscan is that you get a lighter PDF which is made not of images but of vectors, but it is not OCR until you export it as text. For example, I could see the word "life" on the screen in the PDF, but if I copy and paste that word into a word processor it becomes "lije", which was an OCR error. Funny!
This just means there is an OCR layer and an image layer. The image layer is shown to you and the OCR layer is only displayed/output if the text is selected to be copied/pasted or exported. That means the OCR errors are always there, it just doesn't matter if your needs involve only/primarily viewing the document (the image layer). The OCR errors can still be a problem if the text is needed for other purposes, say accessibility. I'm curious if a document processed with clearscan can be reprocessed with another OCR package to improve the OCR results if needed.

After playing around with this a bit this morning, I agree with Tim: You already have the OCR in the Clearscan version, you just maybe didn't realize it. The tip-off is that the Clearscan version is text searchable.

And you CAN reprocess a Clearscan PDF, at least sometimes. Actually, probably most of the time, if all you're doing is vanilla text. I doubt that I'll be doing any of that, because the Clearscan output is more than I was settling for before, I don't need accessibility (yet, anyway), and 99% of what I'd search for is already clean.

The exceptions can range from easy enough to fix to crashing your OCR (at least, it can crash my ABBYY FineReader 9.0). The problem is that Clearscan creates a custom font. An example will illustrate:

Attachment: Greek.jpg (88.37 KiB)

On the left is the Clearscan output; on the right is the same text after it was reprocessed by Abbyy. It may look like Abbyy choked, but in fact most of those errors were in the Clearscan version, they were just masked because the custom font made the wrong text look right. If you look at the actual text underlying that 5A delta that Abbyy identifies, it's in the Clearscan text as .:l delta.

Abbyy processes this bit of the file okay, though it will be a pain in the butt to clean up, and since I'm not the kind of uber-geek that knows the Unicode for getting an umlaut in my uber, much less the code for delta, I'm not going to be doing a search on those Greek symbols. It's not really worth my time to clean it up, even though the native font (Roman) chosen by Abbyy looks a lot better than the custom font generated by Clearscan. If you can't live with that level of quality, or you do need to make the results accessible, maybe it will be worth it to you to get things cleaned up.
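For anyone who does want to search on those symbols, the code points are easy to look up. A short sketch using Python's standard `unicodedata` module (the choice of Python is just for illustration):

```python
import unicodedata

# u-umlaut, lowercase delta, uppercase delta
for ch in ("\u00fc", "\u03b4", "\u0394"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

This prints U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), U+03B4 (GREEK SMALL LETTER DELTA), and U+0394 (GREEK CAPITAL LETTER DELTA).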

Now, in this particular book, a few pages later there are some hand-drawn characters, for which ClearScan generated custom fonts that look pretty much like the originals. I think it appropriated some upper-range Unicode characters to do it, though, because if you look at the text in that area it's complete hieroglyphics. When I try to read that page with Abbyy, the program just crashes. Maybe if I segregated this section as graphics before I asked Abbyy to read it, I could get it to work, but once again, I'm not going to be searching on this stuff, and it's not worth it to me to spend the time tweaking my way toward perfection when good enough is more than good enough for me.
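If the guess about "upper-range Unicode" is right, one way to spot such pages before feeding them to another OCR program would be to check how much of the extracted text falls in Unicode's private-use ranges. The sketch below assumes you have already pulled the page text out as a plain string by whatever means; that Clearscan really uses private-use code points is only the suspicion voiced above, and the function name is made up.

```python
import unicodedata


def private_use_ratio(text):
    """Fraction of non-whitespace characters in category 'Co' (private use)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    private = sum(1 for ch in chars if unicodedata.category(ch) == "Co")
    return private / len(chars)


normal = "plain body text"
suspect = "fig \ue000\ue001\ue002\ue003"  # hypothetical page with custom glyphs

print(private_use_ratio(normal))             # 0.0
print(round(private_use_ratio(suspect), 2))  # 0.57
```

A page scoring well above zero would be a candidate for marking as graphics, as suggested above, rather than text.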

JJJM · Post by **JJJM** » 19 Sep 2010, 16:29

After your feedback, I am more convinced that the combination of Scan Tailor plus Clearscan is good enough for my purposes. With this approach, getting good-quality pictures and doing good post-processing with Scan Tailor is the key to better final documents.

OCR is very time-consuming and, at the end of the day, what I want is to be reading more than OCRing.

I am surprised nobody came across this option until it was suggested in this thread.

Thanks for sharing your opinions.

spamsickle · Post by **spamsickle** » 19 Sep 2010, 17:19

I'm surprised too. Thanks, umpausewhat, for that little hidden gem, and thanks JJJM for noticing it. When I read the original post, I just rolled past the mention of Clearscan, and things would probably have stayed that way.

Now I think the Clearscan option is the greatest thing since Scan Tailor, and they complement each other very well.

I'll still try djvu one of these days just to see, but I'm really happy about being able to take all the bloated PDFs I've generated over the past year, drop them into Acrobat, and cut them down to 1/10th of the size while improving the quality (and getting easy OCR). The only way I think I could be happier is to learn that Acrobat also has a batch option...

JJJM · Post by **JJJM** » 19 Sep 2010, 17:28

spamsickle wrote:I'm surprised too. Thanks, umpausewhat, for that little hidden gem, and thanks JJJM for noticing it. When I read the original post, I just rolled past the mention of Clearscan, and things would probably have stayed that way.

Now I think the Clearscan option is the greatest thing since Scan Tailor, and they complement each other very well.

I'll still try djvu one of these days just to see, but I'm really happy about being able to take all the bloated PDFs I've generated over the past year, drop them into Acrobat, and cut them down to 1/10th of the size while improving the quality (and getting easy OCR). The only way I think I could be happier is to learn that Acrobat also has a batch option...

I have read somewhere that Clearscan does funny things like erasing parts of the text. I do not know much about this; maybe it happens with complex layouts, or maybe it is a bug that has already been fixed. I will dig deeper and try to find the weak points Clearscan could have but, as you say, it is a very interesting discovery.

DIY Book Scanner
