rkomar wrote:I reread the preceding post and saw that OCR is used to make the documents searchable. However, it was said that there were errors in the results. I'm curious to know if the words that are not recognized correctly using OCR still look correct after the ClearScan process?
All OCR must necessarily be based on the original scanned bitmap image, I think, but the information derived can then be used in different ways.
In ClearScan’s case, it seems that the page image displayed in the output file is composed of scalable vector font characters where ClearScan is confident it has identified the characters correctly, and of arbitrary scalable vector shapes where it is less sure or cannot identify a bitmap area as a character at all.
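As a rough illustration of that per-character choice, here is a toy Python sketch. It is not Adobe's algorithm, whose criteria are not public: the `Region` type, the `render_choice` function and the 0.90 threshold are all hypothetical names and values invented for this example.

```python
from dataclasses import dataclass

# Hypothetical cut-off; the real ClearScan criteria are not public.
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class Region:
    """One bitmap area found on the scanned page (illustrative only)."""
    guess: str         # best-guess character from the OCR pass
    confidence: float  # 0.0 (no idea) to 1.0 (certain)


def render_choice(region: Region) -> str:
    """Return how this toy model would draw the region on the output page."""
    if region.confidence >= CONFIDENCE_THRESHOLD:
        # Confident match: draw a character from the synthesised font.
        return f"font glyph '{region.guess}'"
    # Unsure or unrecognised: keep the outline traced from the bitmap.
    return "traced vector shape"


# Three regions from an imaginary page, the middle one poorly recognised.
page = [Region("c", 0.99), Region("l", 0.42), Region("e", 0.95)]
for region in page:
    print(region.guess, "->", render_choice(region))
```

In this model a poorly recognised character keeps the shape traced from the scan, so it would still look like the original on the page even though its best-guess character is wrong.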
BruceG wrote:I downloaded 'Bristol' and OCRed it with OmniPage, which produced a file of 2080 KB (the original was 38,040 KB). As with all OCR there are mistakes. Getting the same font as the book and training it to select the right character helps. Editing within OmniPage produces the smallest file size. ... For me I am after searchable PDFs that are 100% correct (well, the best I can do) and look like the original. Then I index (with Adobe) all the material on the same subject so they can all be searched together. A book of this size may take a day to OCR, edit and put back together.
The usual form of OCR, as the term is used on the forum, results in the addition of a hidden, searchable text layer in the output file, referred to as a 'searchable image' or sometimes as 'text under the page image'.
You appear to be using another form of output, available in Nuance OmniPage and Abbyy FineReader, in which the page image in the output file consists of scalable vector text set in standard fonts, rather than the synthesised fonts ClearScan creates to match the text in the scanned document. The result is very similar to a page created using a word processor, and the text can then be searched with 100% accuracy. That form of output is sometimes referred to, rather opaquely, as 'text over the page image' or something similar, although the page image isn't present in the output file.
That form of output should normally produce the smallest output file size, as only one copy of each font used is needed, and it can also produce very attractive output text. The practical problem, as you already know, often lies in matching the original font with an available system font and, if required, in maintaining the layout, for example the line breaks.
Be aware, though, that the standard fonts used may not be automatically embedded in the final output file. The document may then not reproduce correctly on another computer that lacks those fonts and substitutes others. Font licensing issues may also mean that the fonts used can’t be embedded, so it is best to use open source fonts when possible.
I am preparing to leave for a short vacation and so may be unable to respond promptly to any further posts in this thread for a week.