Post-processing tweaks needed!

Share your software workflow. Write up your tips and tricks on how to scan, digitize, OCR, and bind ebooks.

Moderator: peterZ

rkomar
Posts: 98
Joined: 12 May 2013, 16:36
E-book readers owned: PRS-505, PocketBook 902, PRS-T1, PocketBook 623, PocketBook 840
Number of books owned: 3000
Country: Canada

Re: Post-processing tweaks needed!

Post by rkomar »

I'm not sure if ClearScan really does OCR, or if it just deals with shapes (without any meaning attached to those shapes). When you do the conversions, do you have to specify the source languages anywhere?

Edit: Oops! I reread the preceding post and saw that OCR is used to make the documents searchable. However, it was said that there were errors in the results. I'm curious to know if the words that are not recognized correctly using OCR still look correct after the ClearScan process? Maybe the two operations (cleaning the image and OCR) are completely separate.
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Post-processing tweaks needed!

Post by BruceG »

I downloaded 'Bristol' and OCRed it with OmniPage, which produced a file of 2,080 KB (the original was 38,040 KB). As with all OCR, there are mistakes. Getting the same font as the book and training it to select the right character helps. Editing within OmniPage produces the smallest file size. I normally use a PDF editor called 'Infix', which increases file size. So to keep books small, I split the file into single pages, edit only those pages that need fixing, then join the pages together again. I am after searchable PDFs that are 100% correct (well, the best I can do) and look like the original. Then I index (with Adobe) all the material on the same subject so it can all be searched together. A book of this size may take a day to OCR, edit, and put back together.
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Post-processing tweaks needed!

Post by cday »

rkomar wrote:I reread the preceding post and saw that OCR is used to make the documents searchable. However, it was said that there were errors in the results. I'm curious to know if the words that are not recognized correctly using OCR still look correct after the ClearScan process?
All OCR must necessarily be based on the original scanned bitmap image, I think, but the information derived can then be used in different ways.

In ClearScan’s case it seems that the page image displayed in the output file is likely composed of scalable vector font characters when it is confident it has identified the characters correctly, and arbitrary scalable vector shapes when it is less sure, or unable to identify a bitmap area as a character.
BruceG wrote:I downloaded 'Bristol' and OCRed it with OmniPage, which produced a file of 2,080 KB (the original was 38,040 KB). As with all OCR, there are mistakes. Getting the same font as the book and training it to select the right character helps. Editing within OmniPage produces the smallest file size. ... I am after searchable PDFs that are 100% correct (well, the best I can do) and look like the original. Then I index (with Adobe) all the material on the same subject so it can all be searched together. A book of this size may take a day to OCR, edit, and put back together.
The usual form of OCR as the term is used on the forum results in the addition of a hidden, searchable text layer in the output file, referred to as a 'searchable image' or sometimes as ‘text under the page image’.

You appear to be using another form of output, available in Nuance OmniPage and Abbyy FineReader, in which the page image in the output file consists of scalable vector text using standard fonts, rather than the synthesised fonts ClearScan creates to match the text in the scanned document. The result is very similar to a page created in a word processor, and the text can then be searched with 100% accuracy. That form of output is sometimes referred to, rather opaquely, as 'text over the page image' or something similar, although the page image isn't actually present in the output file.

That form of output should normally produce the smallest output file size, since only one copy of each font is needed, and it can also produce very attractive output text. The practical problem, as you already know, is often in matching the original font with an available system font, and in maintaining the layout (line breaks, for example) if that is required.

Be aware, though, that the standard fonts used may not be automatically embedded in the final output file, meaning the document may not reproduce correctly on another computer that lacks those fonts and substitutes others, and that font licensing issues may mean the fonts used can't be embedded: best to use open-source fonts when possible.

Edit:

I am preparing to leave for a short vacation and so may be unable to respond promptly to any further posts in this thread for a week.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Post-processing tweaks needed!

Post by dtic »

Just a tip for rkomar and perhaps others: Adobe offers a free 30-day trial of Acrobat XI that includes the ClearScan method, so if you're curious, go try it. Ncraun, who now maintains ScanTailor, has a project for a FOSS ClearScan alternative, smoothscan, but no new version has come out since the first one last year. An alternative that matches ClearScan in quality would be terrific, especially since Acrobat doesn't support command-line processing for ClearScan.
rkomar
Posts: 98
Joined: 12 May 2013, 16:36
E-book readers owned: PRS-505, PocketBook 902, PRS-T1, PocketBook 623, PocketBook 840
Number of books owned: 3000
Country: Canada

Re: Post-processing tweaks needed!

Post by rkomar »

Thanks for the tip, dtic. My only Windows "computer" is a netbook with Windows 7 Starter and 1GB of RAM. It barely runs with no apps, and even the browser brings it to its knees. I'm not sure that I'm curious enough about ClearScan to put up with the aggravation of temporarily installing and running Acrobat on that crappy device.
whitepage
Posts: 16
Joined: 11 Mar 2015, 18:12
E-book readers owned: Onyx Note Pro, KDX
Number of books owned: 0
Country: New Zealand

Re: Post-processing tweaks needed!

Post by whitepage »

To remove background images after ClearScan, this post explains how to use a commercial Acrobat plugin. That approach has limitations (apart from the hefty price, or the free trial expiring): you have to select the page ranges to target, so I'm currently looking for a free program that performs this task in a more intuitive way.