Thanks for the files you uploaded, which I have been examining carefully and have used for some limited tests of my own.
Your original post stated that you were looking for software that would display all the image assets inside a PDF in a GUI folder-style view for easy selection and removal. You later explained that some of the text in your scans was being placed on the image layer in your ClearScan PDFs, which meant that you needed to retain those images, whereas the background images on other pages contained only unwanted ‘noise’ and could be deleted, which would result in whiter backgrounds for those pages and reduce the total file size.
The software you need for that would, as you recognise, be highly specialised, quite possibly so specialised that no tool is available at an affordable price. Please note that the link I included to the Planet PDF Forum, which looks like it could be a useful resource, contained a typo. The correct link is:
http://forums.planetpdf.com/forums
whitepage wrote: ClearScan succeeds in places and fails in others. Where it fails, the image is not substituted with type, so we are able to extract it to a different layer. These are the kinds of images we want to keep.
Even when ClearScan in its latest version fails, it normally seems to provide an accurate representation of the page as viewed on the screen, whether as scalable vector characters or sometimes as bitmap text. That is in fact the case for the example file you uploaded.
‘Failure’ would normally mean failure of the OCR process to recognise some text, for example difficult or poorly formed characters in a scan, which limits searchability. It could also mean, probably very rarely now, the substitution of an incorrect character in the text displayed on the screen, and therefore also in the resulting OCR text. On good quality scans, ClearScan generally seems to meet users’ needs both in terms of providing excellent appearance on screen and also good OCR accuracy, with the added bonus of much smaller file sizes than searchable image PDF files.
Your need to whiten page backgrounds would normally be best achieved as a preprocessing operation, the practical consideration being to achieve a good result without causing unacceptable collateral damage to any grayscale or colour images on the page. That should normally be achievable, and would in turn result in the file size reduction you seek in the final PDF file, so both your needs would be met.
THE FILES YOU UPLOADED
The files you uploaded are for a page that has many unusually large characters, some of them in very unusual, stylised fonts. It is therefore not a typical scanned book page. In the present state of the art it may be beyond what ClearScan can reasonably be expected to recognise as text, given that pages more typically consist of a mixture of smaller text and illustrations, and that ClearScan needs to be able to distinguish between the two.
To test whether the unusually large size of some characters might be a factor in their failure to be recognised as text, I tried downsampling the source image to a lower resolution before using ClearScan but there was no clear benefit, and some of the oversize characters in the original image were in any case recognised correctly as text.
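For anyone wanting to repeat the downsampling test, the idea can be sketched as follows. This is only an illustration in plain Python, assuming the page is already available as rows of 8-bit grayscale values; the function name `downsample` and the box-averaging method are my own choices for the sketch, not what an image editor necessarily does internally.

```python
def downsample(rows, factor):
    """Reduce resolution by averaging each factor x factor block of pixels.

    rows   -- list of equal-length lists of grayscale values (0-255)
    factor -- integer reduction factor, e.g. 2 turns 600 dpi into 300 dpi
    Pixels in a partial block at the right or bottom edge are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if not rows:
        return []
    height = len(rows) // factor * factor
    width = len(rows[0]) // factor * factor
    result = []
    for y in range(0, height, factor):
        out_row = []
        for x in range(0, width, factor):
            # Collect every pixel in the current block and average them.
            block = [rows[y + dy][x + dx]
                     for dy in range(factor)
                     for dx in range(factor)]
            out_row.append(round(sum(block) / len(block)))
        result.append(out_row)
    return result
```

Averaging blocks rather than simply skipping pixels keeps the edges of characters smoother, which seemed the fairer way to test whether character size alone was the issue.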
Although the source grayscale image had a near-white background I also tried enhancing it using a levels correction, and then formed the ClearScan output using that image. That did produce some limited improvement in that some large characters that had previously been bitmap images became vector characters with a smooth outline. As it is not of immediate relevance I’ll append the result at the end.
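To make clear what I mean by a levels correction, here is a minimal sketch in plain Python. It assumes 8-bit grayscale values, and the function names, the example black and white points, and the gamma convention (values above 1 lighten the midtones) are illustrative assumptions rather than the exact settings I used.

```python
def build_levels_lut(black_point, white_point, gamma=1.0):
    """Build a 256-entry lookup table for a levels correction.

    Values at or below black_point map to 0, values at or above
    white_point map to 255, and values in between are stretched.
    """
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    span = white_point - black_point
    lut = []
    for value in range(256):
        # Normalise to 0..1 and clip anything outside the chosen range.
        t = (value - black_point) / span
        t = min(1.0, max(0.0, t))
        lut.append(round(255 * t ** (1.0 / gamma)))
    return lut


def apply_levels(pixels, lut):
    """Map a flat sequence of grayscale values through the lookup table."""
    return [lut[p] for p in pixels]
```

With a white point of, say, 225, a near-white background of 230 to 250 becomes pure white, while darker tones in any grayscale illustration are only stretched, not clipped. Setting the white point too low is what causes the collateral damage to images mentioned earlier.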
In summary, some of the text in your PDFs is indeed on the image layer, but normally in moderation that wouldn’t be considered a serious problem, although it would marginally degrade the appearance of the page and slightly increase the PDF file size as you say.
If you no longer have your original scans or ScanTailor output, you clearly have no alternative to continuing to selectively extract background images from the resulting ClearScan PDF files by whatever means you can, unless you decide to rescan the pages.
If you are intending to scan more books, it would be worth considering a modification to your workflow, either to enhance the page scan images before creating the ClearScan file or to experiment with alternative scanner settings, so as to produce whiter page backgrounds and minimum file size directly.
MY TESTS
My detailed look at Adobe ClearScan is described in this post, and the information in it might need to be refined as more detailed insights into ClearScan become available.
The following two files are pages extracted from the PDF file examined in the above thread where the text seems to have vectorised well despite less than optimum original scans. You might like to examine the layers in those files and report back.
The following ClearScan file was created from an enhanced version of the ‘just-my-type’ grayscale image you uploaded, appended at the end of the post. I created the ClearScan file directly from the enhanced grayscale image, which may in part have assisted the vectorisation, although I’m not sure. You did originally refer to the need to scan pages with images that had to remain grayscale, so in that context the increase in file size compared with the ClearScan file you uploaded shouldn’t be material.
Perhaps you could split the above file into layers and report back on any improvement.
This is a screenshot showing how some text that was bitmap text in the file you uploaded has been vectorised (the image posted is actually from an earlier ClearScan version I created).
- The lower text is vectorised, unlike the text from the original file: click on the image to zoom in.
I hope that all this is of some use to you!
You might note that if you produced the files you uploaded recently, your Adobe Acrobat XI software is version 11.0.03 (Help > About... or Edit > Properties... for the files you uploaded), whereas the current version is 11.0.10, which includes a number of recent security updates.