Adobe Acrobat deleting parts of page during OCR


elhyam
Posts: 10
Joined: 28 Mar 2012, 17:51
E-book readers owned: Samsung Galaxy Tab 7, 8.9
Number of books owned: 10000
Country: USA

Adobe Acrobat deleting parts of page during OCR

Post by elhyam »

I was playing around with a few sample pages to get the hang of using Scan Tailor and Acrobat to go from original images to final PDF. When I ran OCR on the pages using ClearScan, Acrobat did something pretty scary: it deleted chunks of text from the PDF. Attached is a sample PDF page before (Sample1.pdf) and after (Sample2.pdf) ClearScan OCR. As you can see, pieces of text are missing in Sample2.pdf. Any idea how this could happen?
Sample1.pdf
(86.87 KiB) Downloaded 407 times
Sample2.pdf
(88.84 KiB) Downloaded 411 times
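
One way to pin down exactly which text went missing is to compare the text layer of the two files word by word. The following is only a minimal sketch: it assumes the text of each PDF has already been dumped to a plain string by some extraction tool of your choice (that step is not shown), and the function name `missing_words` is made up for illustration.

```python
import difflib


def missing_words(before_text: str, after_text: str) -> list[str]:
    """Return runs of words present in before_text but absent from after_text.

    Assumption: both arguments are plain text already extracted from the
    'before' and 'after' PDFs; the extraction itself is not handled here.
    """
    before = before_text.split()
    after = after_text.split()
    # autojunk=False so frequent words are not silently ignored on long pages.
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'delete' = words dropped outright; 'replace' = words changed or garbled.
        if tag in ("delete", "replace"):
            missing.append(" ".join(before[i1:i2]))
    return missing


if __name__ == "__main__":
    before = "the quick brown fox jumps over the lazy dog"
    after = "the quick fox jumps over the dog"
    for run in missing_words(before, after):
        print("missing or changed:", run)
```

Note that this only compares the searchable text. If ClearScan removed the visible glyphs but left the text layer alone (or the other way round), a visual comparison of the pages is still needed.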
stearn
Posts: 18
Joined: 22 Dec 2011, 20:00
E-book readers owned: kindle
Number of books owned: 4000
Location: Nr. London, UK

Re: Adobe Acrobat deleting parts of page during OCR

Post by stearn »

I'm not sure what the problem is. I downloaded both files and reprocessed the first one. Initially I did a straight OCR in Acrobat X Pro and got good recognition. Then I went into the settings, changed to ClearScan, and got reasonable recognition (the text is all there), but ran into a problem I have had before: weird spacing issues.

This is the first OCR:
Sample1a.pdf
Straight OCR
(92.61 KiB) Downloaded 371 times
This is the second, with ClearScan turned on:
Sample1b.pdf
ClearScan OCR
(60.18 KiB) Downloaded 377 times
Personally, I give ClearScan a very wide berth, as the output text just isn't up to scratch for what I am doing. (I don't think it is up to scratch for anything really, since keyword searching is a joke when extra spaces are thrown in randomly.) What I don't get is your missing text.

What version of Acrobat are you using?
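
To illustrate the spacing problem mentioned above: if the OCR text has already been extracted to a plain string, a search can be made tolerant of stray spaces by allowing optional whitespace between every letter of the keyword. This is a rough sketch rather than a fix for the PDF itself; the names `spaced_pattern` and `find_keyword` are hypothetical, and the approach can produce false positives because it will also match across genuine word boundaries.

```python
import re


def spaced_pattern(keyword: str) -> re.Pattern:
    """Build a regex that matches keyword even with spaces inserted inside it."""
    chars = [re.escape(c) for c in keyword if not c.isspace()]
    if not chars:
        raise ValueError("keyword must contain at least one non-space character")
    # Allow any amount of whitespace between consecutive characters.
    return re.compile(r"\s*".join(chars), re.IGNORECASE)


def find_keyword(keyword: str, text: str) -> list[str]:
    """Return every stretch of text that matches keyword, ignoring stray spaces."""
    return [m.group(0) for m in spaced_pattern(keyword).finditer(text)]


if __name__ == "__main__":
    ocr_text = "Initially I got good recog nition, then tried Clear S can."
    print(find_keyword("recognition", ocr_text))
    print(find_keyword("ClearScan", ocr_text))
```

A plain substring search for "recognition" would miss the first hit in that sample string, which is the kind of failure that makes keyword searching unreliable on text with random extra spaces.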