web based access to one of Google's OCR engines
Moderator: peterZ
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
- rob
- Posts: 773
- Joined: 03 Jun 2009, 13:50
- E-book readers owned: iRex iLiad, Kindle 2
- Number of books owned: 4000
- Country: United States
- Location: Maryland, United States
Re: web based access to one of Google's OCR engines
Interesting, but I can't recommend it! Doesn't know the difference between em-dashes and hyphens, fails on detecting small images, naively treats each line as separate from every other line, doesn't get line spacing correct, doesn't detect text style. Correctly identifies the characters without any help, though, which is more than I can say about FineReader 10.
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
Re: web based access to one of Google's OCR engines
rob wrote: Interesting, but I can't recommend it! Doesn't know the difference between em-dashes and hyphens, fails on detecting small images, naively treats each line as separate from every other line, doesn't get line spacing correct, doesn't detect text style. Correctly identifies the characters without any help, though, which is more than I can say about FineReader 10
Rob, don't you use Adobe Acrobat for your OCR tasks (think I remember that from another post)??? If so, how does it work for you? I've got an opportunity to buy Acrobat Professional 9 for Mac at a discounted price. What does it do for file size before and after OCR?
- rob
- Posts: 773
- Joined: 03 Jun 2009, 13:50
- E-book readers owned: iRex iLiad, Kindle 2
- Number of books owned: 4000
- Country: United States
- Location: Maryland, United States
Re: web based access to one of Google's OCR engines
Oh, I pretty much tried everything. I've settled on using ABBYY FineReader for my OCR work. Again, it's not perfect, and the developers are shielded by a crack management team, but you won't really get anything better.
- spamsickle
- Posts: 596
- Joined: 06 Jun 2009, 23:57
Re: web based access to one of Google's OCR engines
clemd973 wrote: I've got an opportunity to buy Acrobat Professional 9 for Mac at a discounted price. What does it do for file size before and after OCR?
If you're using the "Clearscan" option (a new feature in Acrobat 9), my experience has been that file sizes after OCR are often 1/10th of what they were before. "Clearscan" vectorizes the image and creates custom fonts, which accounts for the space savings over raster. So far, I haven't encountered any problems with other software using the custom fonts, but I think it will limit you in some ways. I'm told you can't convert a "Clearscan" PDF back to raster, for instance. Since I save all my originals anyway, that isn't a big deal for me, but it may be for you.
- dingodog
- Posts: 110
- Joined: 22 Jul 2010, 18:19
- Number of books owned: 1000
- Country: on the net
- Location: on the net
Re: web based access to one of Google's OCR engines
All vector layers can be rasterized.
On both Linux and Windows you can use pdftoppm (included in the xpdf utils - http://www.foolabs.com/xpdf/):
pdftoppm -mono -r 300 input.pdf ppm-root
(-mono is for black/white text; -r 300 sets the resolution to 300 DPI.)
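If you want to script that step, here is a minimal Python sketch that just wraps the same pdftoppm call. The function names (pdftoppm_command, rasterize) and the file names are made up for illustration; only the -mono and -r flags come from the command above, and it assumes pdftoppm is installed and on your PATH.

```python
import shutil
import subprocess


def pdftoppm_command(pdf_path, out_root, dpi=300, mono=True):
    """Build the argument list for the pdftoppm call described above."""
    cmd = ["pdftoppm"]
    if mono:
        cmd.append("-mono")  # 1-bit output, suited to black/white text pages
    cmd += ["-r", str(dpi), pdf_path, out_root]
    return cmd


def rasterize(pdf_path, out_root, dpi=300, mono=True):
    """Run pdftoppm; raises if the tool is missing or exits with an error."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found on PATH")
    subprocess.run(pdftoppm_command(pdf_path, out_root, dpi, mono), check=True)


# Show the command that would be run, without running it.
print(" ".join(pdftoppm_command("input.pdf", "ppm-root")))
```

Calling rasterize("input.pdf", "ppm-root") should then write one image per page using ppm-root as the file name prefix.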
- reggilbert
- Posts: 49
- Joined: 28 Sep 2010, 19:57
- Number of books owned: 3000
- Location: Buffalo, New York
Re: web based access to one of Google's OCR engines
clemd973 wrote: Rob, don't you use Adobe Acrobat for your OCR tasks (think I remember that from another post)??? If so, how does it work for you? I've got an opportunity to buy Acrobat Professional 9 for Mac at a discounted price. What does it do for file size before and after OCR?
I've used Acrobat OCR exclusively for several years and it does a very decent job, within a couple percent of the dedicated OCR packages and just as fast or faster (this assessment is based on the standalone OCR packages of two years ago, however).
But whether one OCR process "works" or not depends on what its purpose is. For finding most instances of a word or phrase in a file, Acrobat's 97 percent (or whatever it is) OCR accuracy is good enough, for my personal purposes, at least. But for outputting to a text file that the user plans to correct (say, to create an ebook that can be repurposed in an e-reader), an accuracy rate a couple percent higher would make a big difference in the post-processing time, so a standalone package might be a better idea for that.
Spamsickle's 1/10 compression after Acrobat OCR may have something to do with his/her source files. I notice that when I combine JPGs in Acrobat, the resulting file is much larger than when I use BMP (in both cases generated by flatbed scanners). I never OCR'd a resulting JPG-based file; perhaps OCR of JPG-based Acrobat files yields a much smaller size. There is virtually no change in the size of my BMP-based files, but it wouldn't really matter if there were -- somehow Acrobat takes gig-plus folders of source scans (say, 150 9-meg greyscale BMPs) and mashes them down into 20-meg PDFs that look good even at 400% magnification. Amazing technology.
Still, this Acrobat / JPG issue might make a difference if that is your preferred (or only allowed) DIY scanner / camera output format. I don't know if Acrobat can handle RAW or any other formats that different cameras may support. I looked at a manual for a Canon A480 and JPG seemed to be the only option.
Re: web based access to one of Google's OCR engines
spamsickle wrote: If you're using the "Clearscan" option (a new feature in Acrobat 9), my experience has been that file sizes after OCR are often 1/10th of what they were before. "Clearscan" vectorizes the image, and creates custom fonts, which accounts for the space savings over raster. So far, I haven't encountered any problems with other software using the custom font, but I think it will limit you in some ways. I'm told you can't convert a "Clearscan" PDF back to raster, for instance. Since I save all my originals anyway, that isn't a big deal for me, but it may be for you.
Spam, have you found any changes in the info you submitted above??? I eventually purchased Acrobat X, and when I run OCR/ClearScan it seems to inflate the file size incredibly. I'm assuming it's because of all the newly created fonts - sometimes I can't even count the number of newly created fonts...it would take forever. See some of the posts on my thread Acrobat Tips. I'd love to find a way to decrease the file size - for example, the latest book I did this past weekend was 236 pages and came out to be ~15MB. Sounds like a lot to me, but I believe those created fonts blow up the file size. I'd appreciate any of your observations.
Also, we communicated once on a thread regarding book placement/positioning and how it changes through the scanning process, and you mentioned using a PVC pipe, if I remember correctly. What was that about? I'm running into some problems when I scan hardcover books, as they are less flexible. At various points I ended up putting something under the spine to raise the book a bit, pushing it more flush against the glass. Was that your intention with the PVC? (I hope I remember this correctly.) Thanks
Re: web based access to one of Google's OCR engines
Clemd, an alternative to vectorizing would be the lossy JBIG2 compression offered in Acrobat. It detects similar characters and merges them to substantially reduce the amount of space an image uses. It could potentially be smaller than vectorizing.
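To give a feel for the "detects similar characters and merges them" idea, here is a toy Python sketch. It is not a JBIG2 encoder and not what Acrobat actually runs; the tiny glyph bitmaps, the function names and the max_diff threshold are all invented for illustration. Each glyph is stored once in a dictionary, and every later glyph that differs by only a pixel or so is replaced with a reference to the stored copy.

```python
def pixel_difference(a, b):
    """Count differing pixels between two same-sized glyph bitmaps."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def same_shape(a, b):
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


def build_symbol_dictionary(glyphs, max_diff=1):
    """Greedy matching: reuse the first stored glyph within max_diff pixels."""
    dictionary = []  # one representative bitmap per distinct symbol
    indices = []     # for each input glyph, the index of its representative
    for glyph in glyphs:
        for i, rep in enumerate(dictionary):
            if same_shape(glyph, rep) and pixel_difference(glyph, rep) <= max_diff:
                indices.append(i)
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


# Hypothetical 3x3 glyphs: a box, the same box with one noisy pixel, and an "L".
box = ("###", "#.#", "###")
noisy_box = ("###", "#.#", "##.")
ell = ("#..", "#..", "###")

dictionary, indices = build_symbol_dictionary([box, noisy_box, ell, box])
print(len(dictionary), indices)
```

The merge is where the "lossy" part comes from: the noisy box is drawn with the clean box's bitmap, so the page no longer matches the scan pixel for pixel, and too loose a threshold can merge characters that are actually different.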
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
- spamsickle
- Posts: 596
- Joined: 06 Jun 2009, 23:57
Re: web based access to one of Google's OCR engines
clemd973 wrote: Spam, have you found any changes in the info you submitted above??? I eventually purchased Acrobat X, and when I run OCR/ClearScan it seems to inflate the file size incredibly. I'm assuming it's because of all the newly created fonts - sometimes I can't even count the number of newly created fonts...it would take forever. See some of the posts on my thread Acrobat Tips. I'd love to find a way to decrease the file size - for example, the latest book I did this past weekend was 236 pages and came out to be ~15MB. Sounds like a lot to me, but I believe those created fonts blow up the file size. I'd appreciate any of your observations.
I'm happy to share, but it sounds like what you're doing and what I'm doing are like apples and oranges. I've never seen Clearscan inflate my filesize, but I'm not starting with filesizes anything close to what you're seeing. To take a recent example:
I start with a 430-page book (including front and back covers) which is 1 GB of JPEGs. It's mostly text, with a few greyscale images and a few more b/w images. The front and back covers are in color.
After it's been through Scan Tailor, it's 572 MB of TIFs. Everything is mixed mode except the covers, which are color.
ImageMagick's mogrify converts from TIF to 532 MB of PDFs, which PDFTK stitches into a 532 MB book.
Running that through Clearscan gives me a 19 MB book.
Now, granted that's more than 236 pages, but if 15 MB seems large to you, I don't think we're on the same page. I'd be interested in hearing what your process and data sizes are down the line too; I may be doing something stupid. One thing I'm doing that you may not be is preserving a completely full-color image of the front and back cover. My front cover accounts for 53 MB, and back cover 31 MB of my uncompressed book.
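For anyone who wants to line the two workflows up, here is a quick Python sketch of the arithmetic, using only the sizes quoted in this thread. It treats "1 GB" as 1000 MB and 1 MB as 1000 KB, which is an assumption, and the function names are just for illustration.

```python
# Sizes in MB as quoted above; "1 GB" taken as 1000 MB (assumption).
stages = [
    ("camera JPEGs", 1000),
    ("Scan Tailor TIFs", 572),
    ("stitched PDF", 532),
    ("Clearscan PDF", 19),
]


def reduction_factor(before_mb, after_mb):
    """How many times smaller the output is than the input."""
    return before_mb / after_mb


def kb_per_page(size_mb, pages):
    """Average size per page, using 1 MB = 1000 KB."""
    return size_mb * 1000 / pages


# Step-by-step reduction through the pipeline.
for (prev_name, prev_mb), (name, mb) in zip(stages, stages[1:]):
    print(f"{prev_name} -> {name}: {reduction_factor(prev_mb, mb):.1f}x smaller")

# Per-page comparison of the two finished books.
print(f"430-page book at 19 MB: {kb_per_page(19, 430):.0f} KB/page")
print(f"236-page book at 15 MB: {kb_per_page(15, 236):.0f} KB/page")
```

On those numbers the Clearscan step alone is a 28x reduction, and the finished files work out to about 44 KB per page for the 430-page book against about 64 KB per page for the 236-page one, so the two results are closer than the raw totals suggest.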
I know PDFs created from the ground up, with a single font and (I assume) some professional compression tweaking on the images, can come in under 5 MB. While that would be nice, I'm not trying to carry a library on my phone, so 20 or even 50 MB is satisfactory for me. All my books are on a hard drive or a DVD, and even at 50 MB that means I can put 80 books on a single disk.
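As a rough check on the books-per-disk figure, here is a small Python sketch. It assumes a single-layer DVD with a nominal 4.7 GB (4700 MB) capacity, which the post does not specify, and it ignores filesystem overhead.

```python
DVD_MB = 4700  # nominal single-layer DVD capacity in MB (assumption)


def books_per_disc(book_mb, disc_mb=DVD_MB):
    """Whole books of a given size that fit on one disc."""
    return disc_mb // book_mb


for size_mb in (20, 50):
    print(f"{size_mb} MB per book: {books_per_disc(size_mb)} books per disc")
```

Under that assumption the 80-books figure at 50 MB each fits with some room to spare.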