Vectorization

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Vectorization

Post by univurshul »

spamsickle wrote: I've run into several books which Adobe's ClearScan chokes on...
Spam, can you upload these images or send me a PM so I can test on them?
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Vectorization

Post by spamsickle »

Okay, dirty little secret time. Since the first mention I noticed of ClearScan here, I've been running with the free evaluation version of Acrobat Pro, because I discovered that it only asked for the serial number at startup time. Since I let my computers run 24/7, I just kept the application open, and it hasn't nagged me for a serial number though the free trial expired weeks ago.

This has meant that I also didn't "install the updates" which I was periodically informed were available.

My latest book that choked ClearScan is almost 700 pages, and it wasn't getting past page 50 before telling me that some page couldn't be converted. Even though it gave me a check box saying I had the option of ignoring subsequent errors, it didn't seem to go on with converting once that dialogue box was dismissed.

The original book-length PDF is over a gigabyte, and I wasn't about to upload that whole thing, so I thought I'd just start converting pages individually, find one that Acrobat couldn't handle, and upload that.

The first page I tried converted just fine. The second page crashed the program. No popup error message this time, just "encountered an error, can't continue, I'm going to die now, do you want to tell Microsoft?"

Sooo.. since my "keep it running" strategy was now toast, I let it install the upgrades. Only I didn't let it reboot my computer. I hate hate HATE software upgrades that force me to stop not only the application being upgraded but EVERYTHING ELSE I'M DOING, and usually it's not really necessary to reboot the computer anyway...

Not this time. Just asking Acrobat to do OCR crashed it now, even on the page that worked just minutes before. So, reboot.

Now, it asked me for the serial number, and I used the serial number that came with my Design Studio CS5 which I've had for months but never installed. And all the individual pages I tried converting converted okay, so I figured just run that 700-page book through again, and see where it balks. It's up to about page 250 now, and doesn't seem to be having a problem so far.

I like software upgrades that actually fix problems I'm having, and if that's what's happened in this case (still 450 pages to go, so we'll see) I guess it will keep the like hate hate HATE relationship between me and Adobe percolating along as it has been. Keeping my fingers crossed...

Update: Blew up on page 381, but processing each individual page in that area ran to completion. I suspect some kind of memory leak or wild pointer, rather than the source being converted, may explain the problem (and why it was more common in an application that had been running for several weeks). I've broken the book into 100-page pieces (thanks, pdftk), and will see if I can get a complete book out of it that way. Crossing my fingers again...
Last edited by spamsickle on 07 Nov 2010, 19:15, edited 1 time in total.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Vectorization

Post by spamsickle »

Well, taking the book in 100-page chunks through an updated Acrobat that hadn't been running for weeks got every page ClearScanned. My gigabyte file is now 80 MB, which means it will fit on a CD with half a dozen siblings. Thanks for offering to take a look, but I guess it's probably not necessary now.
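
For anyone who wants to script the chunking, here is a minimal Python sketch that only builds the pdftk command lines for splitting a book into fixed-size pieces. The file names (book.pdf, part01.pdf, and so on) and the helper names are invented for illustration; the `cat <range> output <file>` form is ordinary pdftk usage. Nothing is executed here.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return inclusive, 1-based (first, last) page ranges covering the book."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def split_commands(source_pdf, total_pages, chunk_size=100):
    """Build one pdftk argument list per chunk (hypothetical output names)."""
    commands = []
    for index, (first, last) in enumerate(chunk_ranges(total_pages, chunk_size), start=1):
        out_name = f"part{index:02d}.pdf"  # zero-padded so the pieces sort in page order
        commands.append(["pdftk", source_pdf, "cat", f"{first}-{last}", "output", out_name])
    return commands


if __name__ == "__main__":
    # Print the commands for a 700-page book in 100-page pieces.
    for cmd in split_commands("book.pdf", 700):
        print(" ".join(cmd))
```

To actually run the split, each list could be handed to `subprocess.run(cmd, check=True)` on a machine where pdftk is installed.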
univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Vectorization

Post by univurshul »

...time to buy a Mac. :ugeek:
univurshul
Posts: 496
Joined: 04 Mar 2014, 00:53

Re: Vectorization

Post by univurshul »

Hey, Spam, you know I do notice that Acrobat Pro is a memory hog, and should really run alone on the desktop. No browsers open etc. I run the ClearScan package overnight; and now that you mention the instability of its batch processing of PDFs larger than 50 pages, I can reaffirm your issues with the software.

Acrobat X (v10) should hopefully autosave while it runs a 5 hour OCR package on a document. It's a no-brainer. Because if it fails: square one, hours of CPU time lost. And because Acro doesn't autosave, I specifically break up the OCR package into 200-page chunks; save; continue.

Although it doesn't happen much to me, I hear you loud and clear.

One interesting thing--and this may be machine-related--is that batch OCR across multiple separate files appears to run more stably than OCR on a single document that was opened and assembled beforehand.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Vectorization

Post by spamsickle »

univurshul wrote: Hey, Spam, you know I do notice that Acrobat Pro is a memory hog, and should really run alone on the desktop. No browsers open etc.
When it crapped out on page 381 of 700, it was running alone on the desktop, by virtue of having required me to reboot my computer to install the upgrades and get it running again.
univurshul wrote: Acrobat X (v10) should hopefully autosave while it runs a 5 hour OCR package on a document. It's a no-brainer. Because if it fails: square one, hours of CPU time lost. And because Acro doesn't autosave, I specifically break up the OCR package into 200-page chunks; save; continue.
It often happens that my idea of how things should work is at odds with the ideas of their designers. I dislike car doors that lock themselves, copiers that refuse to copy because they think they've detected a page that's not straight, "screensavers", and DVRs that turn themselves off after dubbing, to name just a few.

I agree that "autosave" is a sensible option for a conversion program like this, but I have no illusion that one will appear in the next release. Software can be simultaneously very smart about some things, and very stupid about others, and this product seems to fit that mold. For example, if I saved the PDF after Acrobat partially converted it, then asked Acrobat to convert the whole thing again, it threw up its hands altogether. Rather than recognizing its own handiwork on the initial pages as something that didn't need to be repeated, it complained "Can't convert this; it contains renderable text" or some such message. So even if autosave was in place, it seems one would have to guess about where it had left off last time, and specify a complementary range on the next try.

For simple books, it seems that 200-300 page chunks (and even more) are safe. For "busier" books, 100-page chunks seem to be safe too. Pdftk makes it easy to assemble PDFs from pieces, so I can work within the limitations of Acrobat to get the functions I want until something better comes along.
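
Reassembling the pieces is a single pdftk call. A small sketch, assuming the ClearScanned pieces were saved with zero-padded names (hypothetical: part01.pdf, part02.pdf, and so on) so that a plain sort puts them in page order; the function name is made up, and the command is only built, not run:

```python
def merge_command(part_files, output_pdf):
    """Build the pdftk argument list that concatenates the parts in sorted order."""
    if not part_files:
        raise ValueError("no part files given")
    ordered = sorted(part_files)  # relies on zero-padded names: part01 < part02 < ... < part10
    return ["pdftk", *ordered, "cat", "output", output_pdf]


if __name__ == "__main__":
    parts = [f"part{n:02d}.pdf" for n in range(1, 8)]
    print(" ".join(merge_command(parts, "book_clearscan.pdf")))
```

Without the zero padding, part10.pdf would sort ahead of part2.pdf and the pages would come out in the wrong order, which is the main thing the sort is guarding against.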
Shaknum
Posts: 91
Joined: 16 Aug 2010, 13:10

Re: Vectorization

Post by Shaknum »

spamsickle wrote: Pdftk makes it easy to assemble PDFs from pieces,
You probably already know this, but you can set it to OCR pages 1-300, then 301-600, etc., and then save the whole file in Acrobat. You don't need to OCR a chunk, save it to an independent PDF file, and then merge the files later. Frankly I find the Acrobat limitation frustrating to no end, but I live with it since the end result is so good. Perhaps Acrobat X will fix this (after all, God made hard drives so we don't need infinite amounts of RAM).

I wonder if I would get even better file sizes if I could ClearScan all the pages at once. Does it create a new "font" map for each chunk you OCR? Perhaps the waste is negligible. I guess we will all have to wait and hope that a real OCR package like ABBYY or OmniPage can implement this type of vectorization in the future. Perhaps we should send in feature requests.
ibr4him
Posts: 102
Joined: 18 Oct 2010, 10:36

Re: Vectorization

Post by ibr4him »

potrace looks great. Is something similar available for colored text?
dingodog
Posts: 110
Joined: 22 Jul 2010, 18:19
Number of books owned: 1000
Country: on the net
Location: on the net

Re: Vectorization

Post by dingodog »

There is an old utility for Windows:

*cr2v*
- http://goo.gl/rJfZ1 (latest very hard to find freeware version)

It can also vectorize colored images from the command line. It was free for personal use; the company has since disappeared, and I think the program is now abandonware.

It has been superseded by the commercial vector-eye (GUI only, not command-line; 59 US $):
- http://www.scale-a-vector.de/svg-test4-e.htm#cr2v

Usage:

cr2v image.ext settings.set

It also works on Linux under Wine.
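
To run it over a whole folder of page images, something like the following Python sketch could build the command lines. It follows the `cr2v image.ext settings.set` form above; everything else is an assumption for illustration: the helper name, the executable name cr2v.exe under Wine, and the extension filter (which image formats cr2v accepts is not stated here, so adjust it).

```python
from pathlib import Path


def cr2v_commands(image_dir, settings_file, extensions=(".bmp", ".png"), use_wine=False):
    """Build one cr2v argument list per image found in image_dir.

    The extension filter and the Wine prefix are assumptions, not documented cr2v behavior.
    """
    prefix = ["wine", "cr2v.exe"] if use_wine else ["cr2v"]
    images = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    # Same argument order as the usage line: image first, then the settings file.
    return [prefix + [str(p), settings_file] for p in images]


if __name__ == "__main__":
    for cmd in cr2v_commands(".", "settings.set", use_wine=True):
        print(" ".join(cmd))
```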
Last edited by dingodog on 12 Nov 2010, 10:00, edited 1 time in total.
Anonymous1

Re: Vectorization

Post by Anonymous1 »

univurshul wrote: ...time to buy a Mac. :ugeek:
Sir, gather yourself; overpriced hardware isn't going to solve your problems; Linux is!

This seems like a great idea, but what is the purpose? Vectorization increases size (if you make a tarball out of the image, it gets REALLY small, but that is really redundant), and it doesn't seem to help with OCR. I do admit that it looks a lot better than the original...