Cleaning up Background after ClearScan

cday · Post by **cday** » 18 Mar 2015, 08:28

whitepage wrote: When you use the layer separation preflight, you get three layers (if all types are present): text, images, and vector objects. I assume that if you run this preflight on the sample you mentioned, the vectors you obtained (non-recognized text) would show up on the vector layer. In contrast, the non-recognized text I mentioned went to the Images layer, just as genuine images in the file did.

Adobe ClearScan is a proprietary technology and Adobe have released few details of how it is implemented, other than to say that it synthesises fonts that closely approximate the original, and preserves the page background using a low-resolution copy. It is therefore interesting to learn of your experience of separating pages into layers using the Acrobat Pro ‘Preflight’ option, not available in my Acrobat Standard version.

From your description it seems clear now that text in your scans is sometimes being placed on the background layer, which must be presumed to be a bitmap image given Adobe’s reference to ‘low-resolution’. A simple way to check directly whether characters are on the background layer, without separating the layers, would be to zoom well in: if the character outlines remain smooth, the characters are vector characters, and if the outlines become pixelated, the characters are on the bitmap background layer.

If you are planning on scanning more books, you might investigate why some characters in your previous scans have been placed on the background layer. Unless the pages you are scanning are particularly difficult, some enhancement of the images before or after processing them in ScanTailor, or even using different scanner settings, could ensure that all text is recognised as such.

... you are right that a sample would help. Will follow up if I find something suitable (maybe by simply smudging a magazine page to hamper recognition of some of the text).

What would be useful is a representative page image that converts to a ClearScan file in which some text is placed on the background layer, to see if the source image can be enhanced to produce better ClearScan output. You could then whiten the background of your pages, reduce the total file size, and improve the OCR recognition results, all without the need to view and selectively delete background images.

whitepage · Post by **whitepage** » 21 Mar 2015, 02:59

Hi cday,

To show you the sample you requested, instead of scanning multiple pages at random, I thought I'd scan a selection of type. Sure enough, ClearScan produced the mixture of text / image we were after.
The scan is from Just My Type by Simon Garfield, and I'm sure uploading a single page in this context constitutes fair use.

1. just-my-type.tif is the scan.
2. just-my-type_out.tif is the ScanTailor output.
3. just-my-type.pdf is the ClearScan. I have separated the Text and Images layer so you can toggle them if your software allows that.

Just in case you can't toggle the layers,
4. just-my-type_text.pdf is just the Text layer.
5. just-my-type_image.pdf is just the Images layer.

These results are as expected: ClearScan succeeds in places and fails in others. Where it fails, the image is not substituted with type, so we are able to extract it to a different layer. These are the kinds of images we want to keep.

In this case there is no dirty background. However, if I drew a squiggle somewhere on a page of 100% recognized type, we would have a dirty background behind the type.

Hope this brings some light to the situation. In my view, there is no guarantee that ClearScan will succeed 100% of the time. For this reason, my main concern is not tweaking settings in ST. Knowing that after ClearScan there will be three kinds of images (actual images, missed type, and dirty backgrounds), my main concern is in quickly selecting images to remove from the final file.

Wishing you a terrific weekend,

wp

cday · Post by **cday** » 22 Mar 2015, 13:06

Thanks for the files you uploaded, which I have been examining carefully and have used for some limited tests of my own.

Your original post stated that you were looking for software that would display all the image assets inside a PDF in a GUI folder-style view for easy selection and removal. You later explained that some of the text in your scans was being placed on the images layer in your ClearScan PDFs, which meant that you needed to retain those images, whereas the background images on other pages only contained unwanted ‘noise’ and could be deleted, which would result in whiter images for those pages and reduce the total file size.

The software you need for that would as you recognise be highly specialised, and quite possibly so specialised that there might not be any tool available at an affordable price. Please note that the link I included to the Planet PDF Forum, which looks like it could be a useful resource, contained a typo. The correct link is:

http://forums.planetpdf.com/forums

whitepage wrote:ClearScan succeeds in places and fails in others. Where it fails, the image is not substituted with type, so we are able to extract it to a different layer. These are the kinds of images we want to keep.

When ClearScan in its latest version fails it normally seems to provide an accurate representation of the page viewed on the screen, whether as scalable vector characters or sometimes as bitmap text. And that is in fact the case for the example file you uploaded.

Failure would normally be considered to be failure of the OCR process to recognise some text, possibly difficult or poorly formed characters in a scan, limiting searchability. Or, probably very rarely now, the substitution of an incorrect character in the text displayed on the screen, and also necessarily in the resulting OCR result. On good quality scans, ClearScan generally seems to meet users’ needs both in terms of providing excellent appearance on screen and also good OCR accuracy, with the added bonus of much smaller file sizes than searchable image PDF files.

Your need to whiten page backgrounds would normally be best achieved as a preprocessing operation, the practical consideration being to achieve a good result without causing unacceptable collateral damage to any grayscale or colour images on the page. That should normally be achievable, and would in turn result in the file size reduction you seek in the final PDF file, so both your needs would be met.

THE FILES YOU UPLOADED

The files you uploaded are for a page that has many unusually large characters, and in some cases very unusual, stylised fonts. It is therefore not a typical scanned book page, and possibly in the present state of the art beyond what it is reasonable to expect ClearScan to recognise as being intended to be text, given that pages will more typically consist of a mixture of smaller text and illustrations, and that it needs to be able to distinguish between the two.

To test whether the unusually large size of some characters might be a factor in their failure to be recognised as text, I tried downsampling the source image to a lower resolution before using ClearScan but there was no clear benefit, and some of the oversize characters in the original image were in any case recognised correctly as text.

Although the source grayscale image had a near-white background I also tried enhancing it using a levels correction, and then formed the ClearScan output using that image. That did produce some limited improvement in that some large characters that had previously been bitmap images became vector characters with a smooth outline. As it is not of immediate relevance I’ll append the result at the end.
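For anyone wanting to experiment, a levels correction of the kind mentioned above can be sketched in a few lines of Python. This is only an illustration, not the exact enhancement used for the test: the function names `levels` and `whiten`, the black point of 30, the white point of 220, and the tiny sample "page" are all assumptions chosen for the example. In practice you would apply the same mapping to real scan data in an image editor or imaging library.

```python
def levels(pixel, black=30, white=220):
    """Map one 8-bit grayscale value through a linear levels correction.

    Values at or below `black` become 0, values at or above `white`
    become 255, and values in between are stretched linearly.
    The default black/white points are illustrative assumptions.
    """
    if white <= black:
        raise ValueError("white point must be greater than black point")
    if pixel <= black:
        return 0
    if pixel >= white:
        return 255
    return round((pixel - black) * 255 / (white - black))


def whiten(image, black=30, white=220):
    """Apply the levels correction to a grayscale image given as rows of ints."""
    return [[levels(p, black, white) for p in row] for row in image]


# Tiny made-up "page": near-white paper (235-245) with dark text (20-40)
page = [
    [240, 235, 25, 245],
    [238, 40, 20, 242],
]
cleaned = whiten(page)
# Paper pixels are pushed to pure white, text pixels to (near) black.
```

The white point matters most: set just below the paper tone, it clears the background while leaving most mid-tones in any grayscale illustrations only mildly stretched. Set too low, it will start to wash out those illustrations.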

In summary, some of the text in your PDFs is indeed on the image layer, but normally in moderation that wouldn’t be considered a serious problem, although it would marginally degrade the appearance of the page and slightly increase the PDF file size as you say.

If you no longer have your original scans or ScanTailor output, you clearly have no alternative to continuing to selectively extract background images from the resulting ClearScan PDF files by whatever means you can, unless you decide to rescan the pages.

If you are intending to scan more books, it would be worth considering a modification to your workflow, either to enhance the page scan images before creating the ClearScan file or to experiment with alternative scanner settings, to produce whiter page backgrounds and a minimum file size directly.

MY TESTS

My detailed look at Adobe ClearScan is described in this post, and the information in it might need to be refined as more detailed insights into ClearScan become available.

The following two files are pages extracted from the PDF file examined in the above thread where the text seems to have vectorised well despite less than optimum original scans. You might like to examine the layers in those files and report back.

Fisher... p16.pdf (588.5 KiB)

Fisher... p94.pdf (486.12 KiB)

The following ClearScan file was created from an enhanced version of the ‘just-my-type’ grayscale image you uploaded, appended at the end of the post. I created the ClearScan file directly from the enhanced grayscale image, which may have in part assisted the vectorisation, although I’m not sure. You did originally refer to the need to scan pages with images that had to remain grayscale, so in the context the increase in file size compared with the ClearScan file you uploaded shouldn’t be material.

just-my-type_enhanced.pdf (117.3 KiB)

Perhaps you could split the above file into layers and report back on any improvement.

This is a screenshot showing how some text that was bitmap text in the file you uploaded has been vectorised (the image posted is actually from an earlier ClearScan version I created).

[Screenshot] The lower text is vectorised, unlike the text from the original file.

I hope that all this is of some use to you!

You might note that if you produced the files you uploaded recently, your Adobe Acrobat XI software is version 11.0.03 (Help > About... or Edit > Properties... for the files you uploaded) whereas the current version is 11.0.10, which includes a number of recent security updates.

whitepage · Post by **whitepage** » 23 Mar 2015, 16:01

Hi cday,
Thank you for your extremely thorough and detailed message.
You have probably written the most detailed investigation of ClearScan available on the web today. It will probably become a reference post.

For obvious reasons I am still interested in the kind of software I have mentioned — being able to quickly examine and act upon multiple images in a PDF file would be useful to me in many circumstances.

But I will also investigate, as you advise, how to produce better images to feed ClearScan in the first place, both with ScanTailor and at scan time.

With many thanks,
wp

dpc · Post by **dpc** » 24 Mar 2015, 00:33

whitepage,
Try asking for advice in mobileread.com's PDF forum. This post describes something similar to what you're attempting to do. Mutool extract will dump all of the images from a PDF into a subdirectory but I'm not sure how you'd strip them. You might just be able to look at the image dimensions and determine if it's full page size and can be culled.
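A minimal sketch of that dimension check, in Python, might look like the following. It assumes you have already obtained each extracted image's width and height by some other means and that they are in the same units as the page size. The function names, the 0.9 threshold, the file names, and the page size are all hypothetical values chosen for illustration.

```python
def is_full_page(img_w, img_h, page_w, page_h, threshold=0.9):
    """Return True if the image spans at least `threshold` of the page
    in both width and height (threshold is an illustrative assumption)."""
    return img_w >= threshold * page_w and img_h >= threshold * page_h


def cull_candidates(images, page_w, page_h, threshold=0.9):
    """Return the names of images that are roughly page-sized."""
    return [
        name
        for name, w, h in images
        if is_full_page(w, h, page_w, page_h, threshold)
    ]


# Hypothetical image list: (name, width, height) in pixels,
# for a page scanned at 2480 x 3508 pixels
images = [
    ("img-0001.png", 2480, 3508),  # full-page background
    ("img-0002.png", 600, 180),    # small fragment, e.g. unrecognised type
    ("img-0003.png", 2400, 3400),  # nearly full page
]
candidates = cull_candidates(images, page_w=2480, page_h=3508)
```

This only produces a shortlist. A page-sized image could still contain unrecognised type that you want to keep, so the candidates would need a visual check before anything is deleted.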

whitepage · Post by **whitepage** » 24 Mar 2015, 05:00

@dpc, thank you for suggesting MobileRead, I haven't been there in ages.

You mentioned Mutool. I believe its extraction feature is the latest version of the pdfextract.exe utility I mentioned in this post (pdfextract was included in MuPDF until version 0.9). Extraction is not a problem; what I'd like to do is select multiple images in some kind of folder-like view and act on them directly inside the file.
But I'm giving up on this idea (for this year anyway) as it looks like this feature doesn't exist anywhere.

Thank you all for your kind suggestions on this question.

muscleriot · Post by **muscleriot** » 11 Aug 2016, 07:43

I used to use something like your steps with Acrobat Pro and ClearScan. The trouble is that it simply embedded all the non-mapping text into the PDF.
I found the output was simply not good enough, and the PDFs I made were just huge: 50 MB or over. I tried all sorts to optimise the file size, but it didn't fall much lower than 30 MB.

A few days ago I downloaded Omnipage Ultimate (on a 15-day trial).
I had a pile of books in scanned images which I had just given up on processing, because the files were 50 MB+ and the results were rubbish due to bent pages while scanning, and the Scan Tailor skew is just not good enough.

I fed these in 'raw' and was utterly amazed at Omnipage's output.

Jaw-droppingly good OCR, even with curved page edges, and no messing around with Scan Tailor skew needed.

The best thing is that it uses something called 'TrueType' fonts in PDFs and simply maps the OCR to a fair representation of the 5 or 6 built-in fonts (Times Roman, Sans serif, etc.) which are always in a PDF document (rather than a bunch of verbatim font images embedded later).
This means the output is top quality and, thanks to there being no need to embed fonts, the file size of the PDFs is minimal: typically 2-5 MB, even with images.
Amazing. And less of a learning curve than Scan Tailor. It even outputs ePubs. Give it a go and get back to me....!
