Cleaning up Background after ClearScan

whitepage
Posts: 16
Joined: 11 Mar 2015, 18:12
E-book readers owned: Onyx Note Pro, KDX
Number of books owned: 0
Country: New Zealand

Cleaning up Background after ClearScan

Post by whitepage »

Most of the steps in my workflow are absolutely classic, but I thought I'd share it as I haven't seen the PitStop step mentioned. The typical kind of book I scan has images, so I run ScanTailor in grayscale.
1. Clean up in Scan Tailor (grayscale because of the images)
2. Assemble
3. ClearScan
4. Separate to Layers
5. PitStop (see below)
6. Optimize

I like how ClearScan leaves the document looking close to the original (acknowledging the risk of mis-recognized characters). One thing I don't like so much is that it leaves images from all the page backgrounds, which makes the file heavier.

I haven't yet found the perfect way to deal with these BG images, but one thing has helped a lot. I'm lucky to have access to an expensive Acrobat plugin at a friend's shop (PitStop Pro).
1. After ClearScan, I separate the PDF to layers (Tools / Print Production / Preflight / Create Separate Layers for vectors, text and images). Then I hide the text layer in order to inspect what images are left.
2. At that stage, I go through the whole document and write down the page ranges (or individual pages) where all images can be zapped (typically, an almost white background).
3. On my friend's machine, I fire PitStop and create an Action List with the following actions:
- Select Layers by Name (equals Images)
- Select Page Range (the ranges identified in step 2)
- AND
- Select Images
- AND
- Remove Selection
Running the Action List zaps all the images from the page ranges.
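The page ranges written down in step 2 have to be turned into concrete page numbers at some point, whatever tool ends up removing the images. Here is a minimal sketch of that bookkeeping, assuming the ranges are noted as a string such as "1-12, 15, 20-22". The function name and the notation are illustrative assumptions; they are not part of PitStop or Acrobat.

```python
def parse_page_ranges(spec):
    """Expand a note such as "1-12, 15, 20-22" into a sorted list of page numbers.

    The comma/hyphen notation is an assumption made for this sketch.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_ranges("1-3, 7, 10-11"))
```

Overlapping entries are merged, so the same page is never listed twice.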

Although this works to an extent, I am not fully satisfied with this solution because (a) it's not sustainable (I don't own PitStop Pro), (b) pages that have both BG and valid images must be addressed manually, and (c) there has to be a better way of mass-selecting image assets to be zapped. As mentioned in this other post, I am looking for software that shows all the image assets inside a PDF in a GUI folder-style view for easy selection and removal.
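For point (c), one conceivable way to pre-screen candidates is to test whether an image is "almost white" before deciding to remove it. The sketch below assumes the image has already been decoded to 8-bit grayscale pixel values by some other tool; that decoding is not shown. The function name and both thresholds are hypothetical and would need tuning on real scans.

```python
def is_nearly_white(gray_pixels, white_level=240, min_fraction=0.99):
    """Return True if at least min_fraction of the pixels are >= white_level.

    gray_pixels: bytes or any iterable of ints in 0..255 (8-bit grayscale).
    white_level and min_fraction are illustrative defaults, not measured values.
    """
    total = 0
    bright = 0
    for value in gray_pixels:
        total += 1
        if value >= white_level:
            bright += 1
    if total == 0:
        return False  # an empty image is not treated as a background
    return bright / total >= min_fraction


# A mostly white image with a few dark specks counts as background;
# one with a tenth of its pixels dark does not.
print(is_nearly_white(bytes([255] * 995 + [0] * 5)))
print(is_nearly_white(bytes([255] * 900 + [30] * 100)))
```

A check like this would only flag candidates. Backgrounds that still carry unrecognized text would need a visual check before removal.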
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Cleaning up Background after ClearScan

Post by cday »

Could you upload a sample grayscale page with an image, as output from ScanTailor, to better understand the problem?

And, if possible but less important, the same page as output from Acrobat ClearScan?

I think I follow what you are saying, and don't right now have any direct solution to suggest or particularly expect to, but sometimes it is useful to stand back from a problem to see if there may be an alternative way of looking at it.
dpc
Posts: 379
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: Cleaning up Background after ClearScan

Post by dpc »

What's the file size difference before and after your PitStop process? You'd think that those "almost all white" backgrounds would compress fairly well.

I also use ScanTailor/Adobe Acrobat Pro and output ClearScan PDFs. I've never worried too much about file sizes because storage is cheap compared to my free time.

The biggest problem that I have is ScanTailor's auto-selection of images that I have to almost always tweak manually. That can become a huge time sink for me.
whitepage
Posts: 16
Joined: 11 Mar 2015, 18:12
E-book readers owned: Onyx Note Pro, KDX
Number of books owned: 0
Country: New Zealand

Re: Cleaning up Background after ClearScan

Post by whitepage »

dpc wrote:What's the file size difference before and after your PitStop process? You'd think that those "almost all white" backgrounds would compress fairly well.
@dpc You're quite right that the almost white backgrounds compress well, but with one per page, they add up.
On a recent scan of a 330-page book (something like 9x6"), the file went from 8.3MB to 4.4MB. That's an average saving of about 12kB per page.
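The per-page figure follows directly from the two file sizes and the page count. A small sketch of the arithmetic, assuming 1 MB = 1000 kB (the function name is illustrative):

```python
def saving_per_page_kb(before_mb, after_mb, pages, kb_per_mb=1000):
    """Average size reduction per page, in kB.

    kb_per_mb=1000 assumes decimal megabytes; pass 1024 for binary units.
    """
    return (before_mb - after_mb) * kb_per_mb / pages


# Figures quoted in the post: 8.3 MB before, 4.4 MB after, 330 pages.
print(round(saving_per_page_kb(8.3, 4.4, 330), 1))  # prints 11.8
```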

Apart from the file size, there's a definite benefit when viewing the book on an e-ink Kindle (old-style KDX): not just the cleaner background, but also the faster opening and processing.
cday wrote:Could you upload a sample grayscale page with an image, as output from ScanTailor, to better understand the problem?
@cday I'm away from my files, but will try to remember to scan a page when I get back. That step is not so much about a "problem" as about wanting to complete the post-processing by polishing an already great ClearScan output: removing the extra weight of the BG images created by Acrobat, with the benefits of a lighter file, a cleaner page, and easier work for the Kindle.
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Cleaning up Background after ClearScan

Post by cday »

The requirement to remove ClearScan’s low-resolution background images from an existing file to enhance the displayed image quality and reduce the file size seems, as you realise, quite unusual. And while PDF, Acrobat and ClearScan are frequently referred to in posts on this forum, the deeper technicalities of PDF are rarely mentioned. There might therefore be more relevant forums on which you could ask for support.

The Adobe support forum http://www.forums.adobe.com might be one possibility, although your requirement would be out of the mainstream. The website http://www.PlanetPDF.com is an interesting site for PDF tips and news, but I believe is related to a commercial PDF software product, although the associated forum http://www.forum.planetpdf.com looks like it might be a useful resource. I am sending you by PM contact details for a member of that forum who posts regularly who also develops low-priced PDF tools, who might possibly at least be worth contacting for advice.

It would probably be easy for Adobe to add an option to omit background images from generated ClearScan files if a need were identified, but not surprisingly I don’t see one, or even any reference to ClearScan, in Acrobat’s comprehensive Edit > Preferences... options. ;)
cday wrote:Could you upload a sample grayscale page with an image, as output from ScanTailor, to better understand the problem?
I am thinking that there might be a simpler way to enhance the displayed image quality and shrink the size of your files, although if you have discarded your ScanTailor output files, it would only help with any future scans you produce...
whitepage
Posts: 16
Joined: 11 Mar 2015, 18:12
E-book readers owned: Onyx Note Pro, KDX
Number of books owned: 0
Country: New Zealand

Re: Cleaning up Background after ClearScan

Post by whitepage »

@cday Thank you for your kind message and sharing your thoughts.
cday wrote:It would probably be easy for Adobe to add an option to omit background images from generated ClearScan files if a need were identified
The thing is that ClearScan sometimes fails to recognize text, so text elements may still be part of the backgrounds left after ClearScan. For this reason, in my view it's good that Acrobat leaves the backgrounds; and it would be even better if there were a way to browse the images and selectively delete them.

Kind regards,

wp
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Cleaning up Background after ClearScan

Post by cday »

whitepage wrote:The thing is that ClearScan sometimes fails to recognize text, so text elements may still be part of the backgrounds left after ClearScan. For this reason, in my view it's good that Acrobat leaves the backgrounds...
I'm not sure about that: in my [fairly quick] tests reported in another thread, it looked as if even characters that weren't recognised for OCR, and stray marks that were 'noise', were vectorised, whereas I take the 'low-resolution background image' that Adobe refers to as being a bitmap image...
whitepage wrote:... it would be even better if there was a way to browse the images, and selectively delete them.
Yes...
whitepage
Posts: 16
Joined: 11 Mar 2015, 18:12
E-book readers owned: Onyx Note Pro, KDX
Number of books owned: 0
Country: New Zealand

Re: Cleaning up Background after ClearScan

Post by whitepage »

cday wrote:I'm not sure about that
But I am sure (reporting my experience). For instance if I extract all the images of a ClearScanned document using pdfextract, some images can be text, and that text is not on the Text layer (after separation). Likewise, that text will appear on the Images layer if you separate the layers.

Would upload a sample but uncomfortable about uploading copyrighted work.
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Cleaning up Background after ClearScan

Post by cday »

whitepage wrote:... if I extract all the images of a ClearScanned document using pdfextract, some images can be text, and that text is not on the Text layer (after separation). Likewise, that text will appear on the Images layer if you separate the layers.
In my tests there was text which was successfully recognised for OCR purposes and was therefore searchable, and some text which was reproduced on the screen as a scalable vector outline and was not searchable, which I assume would not appear in the text layer you are referring to...

The PDF format can support vector and bitmap images, and also a searchable text layer, so it is difficult to comment further without seeing the output you obtained.

Edit:

To further complicate things, a vector image may be rasterised when extracted from a PDF file in order that it can be displayed in an image viewing program...
whitepage
Posts: 16
Joined: 11 Mar 2015, 18:12
E-book readers owned: Onyx Note Pro, KDX
Number of books owned: 0
Country: New Zealand

Re: Cleaning up Background after ClearScan

Post by whitepage »

cday wrote:... text which was reproduced on the screen as a scalable vector outline which was not searchable, which I assume would not appear in the text layer you are referring to...
When you use the layer separation preflight, you get three layers (if all types are present): text, images, and vector objects. I assume that if you run this preflight on the sample you mentioned, the vectors you obtained (non-recognized text) would show up on the vector layer. In contrast, the non-recognized text I mentioned went to the Images layer, just as genuine images in the file did. Note that a picture is not vectorized, so it makes sense that some non-recognized text would also be treated as an image and not get vectorized.

Yes, you are right that a sample would help. Will follow up if I find something suitable (maybe by simply smudging a magazine page to hamper recognition of some of the text).