I’ve now seen Adobe ClearScan in operation for the first time and I have certainly been impressed with what I have seen. I ran the Fisher... PDF file used in my Abbyy FineReader tests in the previous post through Acrobat XI and the file size was reduced from 13.87MB to 3.76MB. The output quality looks very good in the sample pages I’ve examined, although it hasn’t of course been practical to check for all possible errors or defects in the book.
[Image: CS_1.png]
In the above image the upper text is from the original scan file and the lower text is from the resulting ClearScan file: notice that the curved baseline at the left side of the original text has been corrected in the ClearScan text. That seems an unexpected and rather impressive enhancement.
[Image: CS_2.png]
In the above text the baseline curvature is slightly greater, but look carefully at the resulting ClearScan text: the upper three lines have been straightened, but the left-most side of the lower two lines is still curved. That’s interesting...
[Image: CS_3.png]
In the upper image note that the stray pixels on the outline of the two a’s and two r’s are in different places, as is normal in a scanned image at moderate resolution. The lower image shows the resulting ClearScan text and illustrates the dramatic improvement in visual quality that is one of ClearScan’s advantages. ClearScan synthesises scalable vector fonts to match the original text, so that quality will be maintained when zooming in, just as with text in a word processor document.
In the lower image note that the two a’s look identical to each other, as do the two r’s, so it is reasonable to assume that ClearScan has identified each pair as being the same character. Fine, but that inevitably means that ClearScan, like JBIG2 Lossy in the previous post, could in principle display an incorrect character if a character in the scan is misidentified, for example if it is poorly formed.
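To make that risk concrete, here is a minimal sketch of the general idea of pattern matching and substitution. It is not Adobe's algorithm, which has not been published: the function names, the tiny 5x5 bitmaps and the `max_diff` tolerance are all hypothetical, chosen only for illustration. Each glyph is compared with the representatives seen so far, and is replaced by the first one that differs by no more than a few pixels.

```python
from typing import List, Tuple

Bitmap = Tuple[str, ...]  # rows of '#' (ink) and '.' (paper)


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def assign_classes(glyphs: List[Bitmap], max_diff: int) -> List[int]:
    """Map each glyph to the first representative within max_diff pixels.

    Hypothetical matching rule for illustration only.
    """
    representatives: List[Bitmap] = []
    classes: List[int] = []
    for glyph in glyphs:
        for idx, rep in enumerate(representatives):
            if hamming(glyph, rep) <= max_diff:
                classes.append(idx)  # displayed using the representative's shape
                break
        else:
            representatives.append(glyph)  # no match: becomes a new representative
            classes.append(len(representatives) - 1)
    return classes


# Made-up sample glyphs: a 'c', a 'c' with one stray pixel,
# a well-formed 'e', and an 'e' whose crossbar has faded away.
C = (".###.", "#....", "#....", "#....", ".###.")
NOISY_C = (".###.", "#....", "#....", "#...#", ".###.")
E = (".###.", "#...#", "####.", "#....", ".###.")
FADED_E = (".###.", "#...#", "#....", "#....", ".###.")

SAMPLE = [C, NOISY_C, E, FADED_E]
print(assign_classes(SAMPLE, max_diff=2))  # [0, 0, 1, 0]
```

With a tolerance of two pixels the stray pixel on the second 'c' is cleaned up as intended, but the faded 'e' is also merged into the 'c' class and would be displayed as a 'c'. Tightening the tolerance avoids the substitution at the cost of keeping every noisy variant as a separate shape.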
When Adobe introduced ClearScan in Acrobat 9 it seemed a massive step forward, offering the possibility of replacing a large file consisting of scanned images that would not scale well with pages of smooth, scalable vector text, and at the same time greatly reducing file size.
In its early days Adobe was a prominent font company, and ClearScan is a proprietary technology that exploits that background. Adobe has said very little about the technology, basically only revealing that it replaces the characters in a scan with synthesised Type 3 fonts that closely approximate fonts in the original scan, and preserves the page background [when there is a non-white background] using a low-resolution copy. Adobe recommends scanning at a high resolution, ideally 600 DPI, for the best results.
When ClearScan was introduced I assumed, possibly incorrectly, that the text displayed so elegantly on the page was the output from the normal OCR process. As Acrobat’s OCR results at the time on any except good quality scans were often less than perfect, I assumed that it could easily introduce errors into the displayed text when used on lower-quality images, for example from a camera.
Acrobat X reportedly introduced a greatly improved version of ClearScan. Looking at the above sample images, it looks as if it uses a combination of methods, displaying the OCR output text when it has confidence in the output, and a synthesised vector image of the text when it is less confident, as in some of the curved text above. On close inspection the reproduction of the page image even extends to the reproduction of black specks from dust in the scanned image, so it looks as if the displayed page can normally be taken as an accurate reproduction of the original.
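The combination of methods guessed at above can be sketched as a simple confidence threshold. This only illustrates my guess, not anything Adobe has documented: the `Word` record, the 0 to 1 confidence score, the 0.9 threshold and the two rendering labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str          # the OCR engine's best guess for this word
    confidence: float  # hypothetical score from 0.0 (no idea) to 1.0 (certain)


def choose_rendering(words: List[Word], threshold: float = 0.9) -> List[str]:
    """Pick a display method per word based on OCR confidence.

    'ocr_font'     : draw the recognised characters with a synthesised font
    'vector_trace' : draw a vector outline traced from the scanned pixels
    """
    return [
        "ocr_font" if word.confidence >= threshold else "vector_trace"
        for word in words
    ]


# Made-up example: the middle word sits on a curved baseline and scores low.
PAGE = [Word("the", 0.99), Word("baseline", 0.62), Word("curves", 0.93)]
print(choose_rendering(PAGE))  # ['ocr_font', 'vector_trace', 'ocr_font']
```

Under this scheme a low-confidence word is still shown faithfully because it is traced from the scan, which would be consistent with the curved lines and dust specks surviving in the output.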
ClearScan has been criticised in some earlier posts for producing a very large number of fonts which inevitably increase the file size, but this is probably mainly a problem with lower-resolution page images. Rather than the fonts being duplicates because it has failed to recognise portions of text as being in the same font, I suspect that every time it finds a bit pattern it needs to vectorise in order to produce a good quality output, it simply adds it to the current font until it is full, and then starts a new font.
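That suspicion can be sketched as a simple fill-and-overflow loop. This is a guess at the behaviour, not Adobe's documented method. The 256-glyph capacity is my assumption, based on Type 3 fonts being simple PDF fonts addressed by single-byte character codes; the function name and the example shape counts are hypothetical.

```python
from typing import Hashable, List, Sequence


def pack_into_fonts(shapes: Sequence[Hashable], capacity: int = 256) -> List[List[Hashable]]:
    """Add each new distinct shape to the current font, starting a new font when full."""
    fonts: List[List[Hashable]] = []
    seen = set()
    for shape in shapes:
        if shape in seen:
            continue  # already vectorised: reuse the existing glyph
        seen.add(shape)
        if not fonts or len(fonts[-1]) == capacity:
            fonts.append([])  # current font is full (or none exists yet)
        fonts[-1].append(shape)
    return fonts


# Made-up numbers: a clean scan where matching collapses the text to 80 shapes,
# and a noisy scan where 1000 slightly different shapes must each be kept.
clean = pack_into_fonts(range(80))
noisy = pack_into_fonts(range(1000))
print(len(clean), [len(f) for f in noisy])  # 1 [256, 256, 256, 232]
```

On this model the number of fonts depends on how many distinct shapes survive matching rather than on how many typefaces the book uses, which would explain why lower-resolution scans end up with so many fonts.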
The OCR results seem to be generally very good on these high-quality scan pages, with even words that are displayed as vector images often recognised correctly, and therefore searchable. The OCR results are not, however, entirely free of errors, with some difficult characters missed and some Spanish accents not reproduced, although I did inadvertently run ClearScan with the English language selected, rather than Spanish, or English and Spanish if that is selectable in Acrobat.
So ClearScan does have a lot to offer and I’m duly impressed. With good-quality images it should suit most users, but it is as well to keep in mind that it does have at least a theoretical possibility of displaying an incorrect character in adverse conditions. That would probably not usually be an issue, but in the worst case, in an academic work, a name or date in the displayed text could theoretically be changed.
At the end of the day it is necessary to balance the benefits of greatly improved text quality and a greatly reduced file size against a possible need to spend more time checking the output produced. A simple page image scan compressed using a lossless method can be assumed to be accurate, and can always be referred to in the event of a query later.
Edit:
An incidental advantage of the reduced file size of a ClearScan file is that pages are displayed more quickly when stepping through the pages of a long document on a slower computer, but like the file size reduction, that should rapidly become less significant if computing technology continues to progress at the present pace.