web based access to one of Google's OCR engines

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

web based access to one of Google's OCR engines

Post by daniel_reetz »

rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: web based access to one of Google's OCR engines

Post by rob »

Interesting, but I can't recommend it! Doesn't know the difference between em-dashes and hyphens, fails on detecting small images, naively treats each line as separate from every other line, doesn't get line spacing correct, doesn't detect text style. Correctly identifies the characters without any help, though, which is more than I can say about FineReader 10 ;)
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: web based access to one of Google's OCR engines

Post by clemd973 »

rob wrote: Interesting, but I can't recommend it! Doesn't know the difference between em-dashes and hyphens, fails on detecting small images, naively treats each line as separate from every other line, doesn't get line spacing correct, doesn't detect text style. Correctly identifies the characters without any help, though, which is more than I can say about FineReader 10 ;)
Rob, don't you use Adobe Acrobat for your OCR tasks (think I remember that from another post)??? If so, how does it work for you? I've got an opportunity to buy Acrobat Professional 9 for Mac at a discounted price. What does it do for file size before and after OCR?
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: web based access to one of Google's OCR engines

Post by rob »

Oh, I pretty much tried everything. I've settled on using ABBYY FineReader for my OCR work. Again, it's not perfect, and the developers are shielded by a crack management team, but you won't really get anything better.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: web based access to one of Google's OCR engines

Post by spamsickle »

clemd973 wrote: I've got an opportunity to buy Acrobat Professional 9 for Mac at a discounted price. What does it do for file size before and after OCR?
If you're using the "Clearscan" option (a new feature in Acrobat 9), my experience has been that file sizes after OCR are often 1/10th of what they were before. "Clearscan" vectorizes the image and creates custom fonts, which accounts for the space savings over raster. So far, I haven't encountered any problems with other software using the custom fonts, but I think it will limit you in some ways. I'm told you can't convert a "Clearscan" PDF back to raster, for instance. Since I save all my originals anyway, that isn't a big deal for me, but it may be for you.
dingodog
Posts: 110
Joined: 22 Jul 2010, 18:19
Number of books owned: 1000
Country: on the net
Location: on the net

Re: web based access to one of Google's OCR engines

Post by dingodog »

All vector layers can be rasterized.

On Linux (and on Windows too) you can use pdftoppm, which is included in the xpdf utils - http://www.foolabs.com/xpdf/

pdftoppm -mono -r 300 input.pdf ppm-root

(-mono is for black/white text)
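
For anyone who wants to script the call above, here is a minimal Python sketch. It assumes pdftoppm is on the PATH. The function names are made up for illustration, and -mono and -r are the same two options used in the post; nothing else about pdftoppm is assumed.

```python
import shutil
import subprocess


def pdftoppm_command(pdf_path, ppm_root, dpi=300, mono=True):
    """Build the pdftoppm argument list; nothing is executed here."""
    cmd = ["pdftoppm"]
    if mono:
        cmd.append("-mono")  # 1-bit output, meant for black/white text pages
    cmd += ["-r", str(dpi), pdf_path, ppm_root]
    return cmd


def rasterize(pdf_path, ppm_root, dpi=300, mono=True):
    """Run pdftoppm if it is installed; raise a clear error otherwise."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found on PATH")
    subprocess.run(pdftoppm_command(pdf_path, ppm_root, dpi, mono), check=True)


if __name__ == "__main__":
    # File names as in the post above; swap in your own.
    print(" ".join(pdftoppm_command("input.pdf", "ppm-root")))
```

Keeping the command builder separate from the function that runs it makes it easy to check the arguments before touching a real file.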
reggilbert
Posts: 49
Joined: 28 Sep 2010, 19:57
Number of books owned: 3000
Location: Buffalo, New York

Re: web based access to one of Google's OCR engines

Post by reggilbert »

clemd973 wrote: Rob, don't you use Adobe Acrobat for your OCR tasks (think I remember that from another post)??? If so, how does it work for you? I've got an opportunity to buy Acrobat Professional 9 for Mac at a discounted price. What does it do for file size before and after OCR?
I've used Acrobat OCR exclusively for several years and it does a very decent job, within a couple percent of the dedicated OCR packages and just as fast or faster (this assessment is based on the standalone OCR packages of two years ago, however).

But whether one OCR process "works" or not depends on what its purpose is. For finding most instances of a word or phrase in a file, Acrobat's 97 percent (or whatever it is) OCR accuracy is good enough, for my personal purposes, at least. But for outputting to a text file that the user plans to correct (say, to create an ebook that can be repurposed in an e-reader), an accuracy rate a couple percent higher would make a big difference in the post-processing time, so a standalone package might be a better idea for that.
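
To put rough numbers on that, here is a back-of-the-envelope sketch. The 300-page book with 2,000 characters per page is purely an assumption for illustration, and 97 and 99 percent are example accuracy figures, not measurements of any particular package.

```python
def expected_errors(total_chars, accuracy_percent):
    """Expected count of misrecognized characters at a given accuracy."""
    return total_chars * (100 - accuracy_percent) / 100


# Hypothetical book: 300 pages at about 2,000 characters per page.
BOOK_CHARS = 300 * 2000

for accuracy in (97, 99):
    errors = expected_errors(BOOK_CHARS, accuracy)
    print(f"{accuracy}% accurate: about {errors:.0f} characters to fix")
```

Under those assumptions, going from 97 to 99 percent cuts the corrections to a third, which is why a couple of percent matters so much for proofreading and so little for search.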

Spamsickle's 1/10 compression after Acrobat OCR may have something to do with his/her source files. I notice that when I combine JPGs in Acrobat, the resulting file is much larger than when I use BMP (in both cases generated by flatbed scanners). I never OCR'd a resulting file; perhaps OCR of JPG-based Acrobat files yields a much smaller size. There is virtually no change in the size of my BMP-based files, but it wouldn't really matter if there were -- somehow Acrobat takes gig-plus folders of source scans (say, 150 9-meg greyscale BMPs) and mashes them down into 20-meg PDFs that look good even at 400% magnification. Amazing technology.

Still, this Acrobat / JPG issue might make a difference if that is your preferred (or only allowed) DIY scanner / camera output format. I don't know if Acrobat can handle RAW or any other formats that different cameras may support. I looked at a manual for a Canon A480 and JPG seemed to be the only option.
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: web based access to one of Google's OCR engines

Post by clemd973 »

spamsickle wrote: If you're using the "Clearscan" option (a new feature in Acrobat 9), my experience has been that file sizes after OCR are often 1/10th of what they were before. "Clearscan" vectorizes the image and creates custom fonts, which accounts for the space savings over raster. So far, I haven't encountered any problems with other software using the custom fonts, but I think it will limit you in some ways. I'm told you can't convert a "Clearscan" PDF back to raster, for instance. Since I save all my originals anyway, that isn't a big deal for me, but it may be for you.
Spam, have you found any changes in the info you submitted above??? I eventually purchased Acrobat X, and when I run OCR/ClearScan it seems to inflate the file size incredibly. I'm assuming it's because of all the newly created fonts - sometimes I can't even count the number of newly created fonts...it would take forever. See some of the posts on my thread Acrobat Tips. I'd love to find a way to decrease the file size - for example, the latest book I did this past weekend was 236 pages and came out to be ~15MB. Sounds like a lot to me, but I believe those created fonts blow up the file size. I'd appreciate any of your observations.

Also, we communicated once on a thread regarding book placement/positioning and how it changes through the scanning process, and you mentioned using a PVC pipe, if I remember correctly. What was that about? I'm running into some problems when I scan hardcover books, as they are less flexible. At various points I ended up having to put something under the spine to raise the book a bit, pushing it more flush against the glass. Was that your intention with the PVC? (I hope I remember this correctly.) Thanks
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: web based access to one of Google's OCR engines

Post by Misty »

Clemd, an alternative to vectorizing would be to use the lossy JBIG2 compression offered in Acrobat. It detects similar characters and merges them, which substantially reduces the amount of space an image uses. It could potentially be smaller than vectorizing.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: web based access to one of Google's OCR engines

Post by spamsickle »

clemd973 wrote: Spam, have you found any changes in the info you submitted above??? I eventually purchased Acrobat X, and when I run OCR/ClearScan it seems to inflate the file size incredibly. I'm assuming it's because of all the newly created fonts - sometimes I can't even count the number of newly created fonts...it would take forever. See some of the posts on my thread Acrobat Tips. I'd love to find a way to decrease the file size - for example, the latest book I did this past weekend was 236 pages and came out to be ~15MB. Sounds like a lot to me, but I believe those created fonts blow up the file size. I'd appreciate any of your observations.
I'm happy to share, but it sounds like what you're doing and what I'm doing are like apples and oranges. I've never seen Clearscan inflate my filesize, but I'm not starting with filesizes anything close to what you're seeing. To take a recent example:

I start with a 430-page book (including front and back covers) which is 1 GB of JPEGs. It's mostly text, with a few greyscale images and a few more b/w images. The front and back covers are in color.

After it's been through Scan Tailor, it's 572 MB of TIFs. Everything is mixed mode except the covers, which are color.

ImageMagick's mogrify converts from TIF to 532 MB of PDFs, which PDFTK stitches into a 532 MB book.
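
The post does not give the exact options used for this step, so the following is only a sketch of how the two tools could be chained from Python. It assumes mogrify and pdftk are installed and that the page files have zero-padded names that sort in page order. The function names and the default book.pdf output name are made up for illustration.

```python
import subprocess


def mogrify_command(tif_files):
    """ImageMagick: write a one-page PDF next to each TIF."""
    return ["mogrify", "-format", "pdf", *tif_files]


def pdftk_command(pdf_files, output_pdf):
    """pdftk: concatenate the per-page PDFs in the order given."""
    return ["pdftk", *pdf_files, "cat", "output", output_pdf]


def build_book(tif_files, output_pdf="book.pdf"):
    """Run both steps; needs mogrify and pdftk installed."""
    tifs = sorted(tif_files)  # assumes zero-padded names sort in page order
    pdfs = [name.rsplit(".", 1)[0] + ".pdf" for name in tifs]
    subprocess.run(mogrify_command(tifs), check=True)
    subprocess.run(pdftk_command(pdfs, output_pdf), check=True)
```

A call such as build_book(glob.glob("out/*.tif")) would cover a Scan Tailor output folder, with the folder name again being an assumption.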

Running that through Clearscan gives me a 19 MB book.

Now, granted that's more than 236 pages, but if 15 MB seems large to you, I don't think we're on the same page. I'd be interested in hearing what your process and data sizes are down the line too; I may be doing something stupid. One thing I'm doing that you may not be is preserving a completely full-color image of the front and back cover. My front cover accounts for 53 MB, and back cover 31 MB of my uncompressed book.
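
For comparison, the figures quoted above can be turned into ratios with a few lines of Python. This is only arithmetic on the numbers in this post; treating "1 GB" as roughly 1000 MB is an assumption.

```python
# Sizes in MB, as quoted in the post ("1 GB" taken as roughly 1000 MB).
STAGES = [
    ("JPEG scans", 1000),
    ("Scan Tailor TIFs", 572),
    ("stitched PDF", 532),
    ("Clearscan PDF", 19),
]


def step_ratios(stages):
    """Size of each stage divided by the size of the stage before it."""
    return [
        (name, size / prev_size)
        for (_, prev_size), (name, size) in zip(stages, stages[1:])
    ]


def cover_share(front_mb=53, back_mb=31, book_mb=532):
    """Fraction of the uncompressed book taken up by the two color covers."""
    return (front_mb + back_mb) / book_mb


if __name__ == "__main__":
    for name, ratio in step_ratios(STAGES):
        print(f"{name}: {ratio:.3f} of the previous stage")
    print(f"covers: {cover_share():.1%} of the uncompressed book")
```

Running the same calculation on your own stage sizes should show quickly where the two workflows diverge.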

I know PDFs created from the ground up, with a single font and (I assume) some professional compression tweaking on the images, can come in under 5 MB. While that would be nice, I'm not trying to carry a library on my phone, so 20 or even 50 MB is satisfactory for me. All my books are on a hard drive or a DVD, and even at 50 MB that means I can put 80 books on a single disk.