web based access to one of Google's OCR engines


clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: web based access to one of Google's OCR engines

Post by clemd973 »

spamsickle wrote:I'm happy to share, but it sounds like what you're doing and what I'm doing are like apples and oranges. I've never seen Clearscan inflate my filesize, but I'm not starting with filesizes anything close to what you're seeing.
By inflating the file size, I suppose I'm referring to what the file size would be if all the text were in one font, e.g. Times New Roman. In that case there would be only one embedded font, which would mean a substantially smaller file. Running the document through Acrobat X's OCR/ClearScan results in the creation of MANY fonts, which results in a much larger file size. After you ClearScan your document, look in the document properties under Fonts and you'll see what I'm talking about. In the example below, from my most recent book, I show the font properties of the 84-page book I did today. Putting it through Acrobat X's OCR/ClearScan resulted in 255 created fonts, apparently because ClearScan didn't recognize the font or because of some anomalies introduced by the post-processing.
FontPic.jpg (434.12 KiB)
Now, don't get me wrong, I am very happy with ClearScan's results, especially since it renders the text very accurately; I haven't run across any misspellings yet. However, I believe that if the font were uniform, the 5.6MB file would be even smaller. In fact, I tried editing the text to change the created fonts to something like Times New Roman, and it resulted in lots of misspellings as well as some format changes. If that's my only other option, I'll live with (what I consider) a larger file size. I'm just trying to streamline as much as possible since I'd like to carry many of my books on my 32GB iPad.
spamsickle wrote:To take a recent example: I start with a 430-page book (including front and back covers) which is 1 GB of JPEGs. It's mostly text, with a few greyscale images and a few more b/w images. The front and back covers are in color.

After it's been through Scan Tailor, it's 572 MB of TIFs. Everything is mixed mode except the covers, which are color.

ImageMagick's mogrify converts from TIF to 532 MB of PDFs, which PDFTK stitches into a 532 MB book.

Running that through Clearscan gives me a 19 MB book.
My most recent example: I start with an 84-page book (including a full-color front cover; I usually don't worry about the back cover) which is 225MB of JPEGs. It's mostly text, with two greyscale images, no b/w images. The front cover is in full color.

1) I first take the 225MB of JPGs and pre-process them in Adobe Lightroom 3 and export as TIFs, which yields 831MB of TIFs.
2) I import that 831MB of TIFs into ScanTailor and output 46MB of TIFs
3) I import that 46MB of TIFs from ScanTailor into Acrobat X, ClearScan it, and add the 2.5MB JPG color cover (48.5MB of input in total) to yield a 5.6MB PDF.

Granted...considering those numbers, a 5.6MB PDF is a drop in the bucket, but I know on the other hand that if those fonts were uniform, the file size would be even lower. (No way to get there yet, though.) So I'm happy with what I have for the moment...just trying to make it better and smaller.
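
To put those numbers side by side, here is a minimal Python sketch that works out the overall reduction and the size per page. The figures are the ones quoted in this post; the function names are just illustrative, not part of any tool mentioned here.

```python
def reduction_factor(before_mb, after_mb):
    """How many times smaller the output is than the input."""
    return before_mb / after_mb


def mb_per_page(total_mb, pages):
    """Average size of one page in the finished PDF."""
    return total_mb / pages


# 84-page book: 225MB of camera JPEGs in, 5.6MB PDF out.
print(f"JPEGs -> PDF: {reduction_factor(225, 5.6):.1f}x smaller")
# Same book, measured from the 48.5MB that actually went into Acrobat.
print(f"Acrobat input -> PDF: {reduction_factor(48.5, 5.6):.1f}x smaller")

# Per-page comparison of the two finished books.
print(f"84-page book:  {mb_per_page(5.6, 84):.3f} MB/page")
print(f"430-page book: {mb_per_page(19, 430):.3f} MB/page")
```

On these figures the 84-page book works out to roughly 0.067 MB per page and the 430-page book to roughly 0.044 MB per page, so per page the two results are in the same ballpark.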
spamsickle wrote:Now, granted that's more than 236 pages, but if 15 MB seems large to you, I don't think we're on the same page. I'd be interested in hearing what your process and data sizes are down the line too; I may be doing something stupid. One thing I'm doing that you may not be is preserving a completely full-color image of the front and back cover. My front cover accounts for 53 MB, and back cover 31 MB of my uncompressed book.
That full-color front and back cover sure does seem large. Is it a TIF or a JPG (surely not JPG, huh)? I tweak my full-color cover in LR3 and export it as a high-quality JPG, which usually results in an image between 2 and 5MB. I then insert that as the cover page of the PDF at the very end of my workflow, save everything to PDF once again, and I'm done. What makes your images so large? If it is a TIF, you might try JPG. Seems to work fine for me. You can download my latest book here if you'd like to get a look at the final product.

Someone should mention to Daniel that there should be a place where we can upload completed works just to download and get a look at so we can ask each other questions about processing, workflow, etc.
spamsickle wrote:I know PDFs created from the ground up, with a single font and I assume some professional compression tweaking on the images can come in under 5 MB. While that would be nice, I'm not trying to carry a library on my phone, so 20 or even 50 MB is satisfactory for me. All my books are on a hard drive or a DVD, and even at 50 MB that means I can put 80 books on a single disk.
For those purposes, you've got all you need. However, as I continue to convert my actual library into a virtual library on my iPad, space/filesize becomes a concern.
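
For a rough sense of scale, here is a small Python sketch of how many books fit at a given size. It assumes 1 GB = 1000 MB and, by default, that the whole device is free for books, which it won't be in practice; `usable_fraction` is a made-up knob to account for that.

```python
def books_that_fit(capacity_gb, book_mb, usable_fraction=1.0):
    """Whole number of books of size book_mb that fit in capacity_gb.

    Assumes 1 GB = 1000 MB; usable_fraction is a hypothetical allowance
    for space taken up by everything else on the device.
    """
    usable_mb = capacity_gb * 1000 * usable_fraction
    return int(usable_mb // book_mb)


# 32GB iPad at the two book sizes mentioned in this exchange.
for size_mb in (5.6, 50):
    print(f"{size_mb} MB per book: {books_that_fit(32, size_mb)} books")
```

Halving the average book size doubles the count, which is why the extra fonts matter more for a tablet library than for books kept on a hard drive or DVD.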
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: web based access to one of Google's OCR engines

Post by clemd973 »

Yep, don't know if this is the correct thread to continue this discussion...any suggestions, or should we start another?
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: web based access to one of Google's OCR engines

Post by spamsickle »

To keep our conversation somewhat on topic, I tried the Google OCR engine on a page. I was pretty happy with the OCR -- it had about the same number of errors (<5 for the page) as Clearscan, though they were different errors. Both seemed prone to "spacing" errors, inserting spaces where there were none, or merging two words into one. Both capitalized a word or two which were not capitalized in the original. The Google engine misidentified one "-" as "»", which caused it to fail to join a word that was split across two lines, and misidentified "fl" as "H", converting "flow" to "How," so it was slightly inferior, but I would be satisfied with the OCR results of either for my purposes.
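
If anyone wants to compare engines less by eye, a word-level diff against a hand-corrected reference is one way to do it. This is only a sketch using Python's standard `difflib`; the sample sentences are made up to mimic the kinds of errors described above ("flow" read as "How", split and merged words), not actual output from either engine.

```python
import difflib


def ocr_differences(reference, ocr_output):
    """List (tag, reference_words, ocr_words) for every span that differs."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, ref_words[i1:i2], ocr_words[j1:j2]))
    return diffs


# Hypothetical sample text, not real OCR output.
reference = "the flow of water was not interrupted"
engine_a = "the How of water was not inter rupted"
engine_b = "the flow ofwater was not interrupted"

for name, text in (("engine A", engine_a), ("engine B", engine_b)):
    for tag, expected, got in ocr_differences(reference, text):
        print(f"{name}: {tag} {expected} -> {got}")
```

Counting the returned spans per page gives an error tally that can be compared between engines, with the caveat that a single span can cover more than one wrong word.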

I wouldn't use the Google system because it's unwieldy -- each page must be uploaded and converted individually, as far as I can tell, and then downloaded individually. For my 400-page book, this seems like it would require several hours in front of the computer shepherding each step. With Clearscan, I can set it running on the whole PDF, then go do something else.

I did check the number of fonts; my 400-page book used 89 custom fonts, so I guess I'm getting better results than you are in that regard. I don't think these take up too much room individually, but I don't really know for sure.

I saw an option under Acrobat's Document menu to "Reduce File Size", and invoking that with "Make compatible with Acrobat 9 or later" took my 19 MB file down to 8.5 MB, though still with 89 fonts. The conversion took about a minute, though even I can see a loss of quality. It looks to me like it's probably doing some kind of compression like JPEG does -- fast fourier transform, or discrete cosine transform, or something along those lines -- because I see "ringing" around some of the graphics elements. The text itself seems to have about the same quality, so if you don't have any images you may not see a significant reduction in file size, but if size is your main concern, it's probably worth a try.
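
For what it's worth, here is a small pure-Python sketch of why transform-based compression can produce that kind of "ringing". It takes a sharp edge, runs a discrete cosine transform, throws away the high-frequency coefficients, and reconstructs. This is an illustration of the general effect only: I don't know what Acrobat actually does, and real JPEG quantizes coefficients in 8x8 blocks rather than simply dropping them.

```python
import math


def dct(signal):
    """Unnormalized DCT-II of a list of samples."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + (2.0 / n)
        * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k) for k in range(1, n))
        for i in range(n)
    ]


def lossy_roundtrip(signal, keep):
    """Keep only the `keep` lowest-frequency coefficients, then reconstruct."""
    coeffs = dct(signal)
    truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
    return idct(truncated)


# A sharp black-to-white edge: 16 samples at 0.0, then 16 samples at 1.0.
edge = [0.0] * 16 + [1.0] * 16
degraded = lossy_roundtrip(edge, keep=8)

# Values below 0 and above 1 are the overshoot that shows up as ringing.
print(f"min {min(degraded):.3f}, max {max(degraded):.3f}")
```

The reconstructed edge dips below 0 on the dark side and overshoots above 1 on the light side, which is the halo you see next to hard edges in graphics after heavy compression.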
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: web based access to one of Google's OCR engines

Post by clemd973 »

spamsickle wrote:I did check the number of fonts; my 400-page book used 89 custom fonts, so I guess I'm getting better results than you are in that regard. I don't think these take up too much room individually, but I don't really know for sure. I saw an option under Acrobat's Document menu to "Reduce File Size", and invoking that with "Make compatible with Acrobat 9 or later" took my 19 MB file down to 8.5 MB, though still with 89 fonts. The conversion took about a minute, though even I can see a loss of quality. It looks to me like it's probably doing some kind of compression like JPEG does -- fast fourier transform, or discrete cosine transform, or something along those lines -- because I see "ringing" around some of the graphics elements. The text itself seems to have about the same quality, so if you don't have any images you may not see a significant reduction in file size, but if size is your main concern, it's probably worth a try.
I'll check out the Google engine, but as you said, the results aren't good enough to warrant sitting in front of the computer for a few hours doing one page at a time. I am wondering, though, how your 400-page book got away with only 89 custom fonts. Perhaps it's the quality of the page input to ScanTailor. Would you mind posting a sample page? (I'm trying to fine-tune my preprocessing as best I can with the tools I have in order to start with as sharp an image as possible. In my case, I think the lighting is an issue.)
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: web based access to one of Google's OCR engines

Post by spamsickle »

I kind of doubt that the quality input to Scan Tailor is the explanation in this case, because I managed to shoot the book without the proper white balance setting on one of the cameras. Fortunately, Scan Tailor didn't seem to mind, and the output FROM Scan Tailor seems consistent. I've uploaded three sequential mostly-text pages to give you an idea of what they look like, JPGs, TIFs, and PDFs for all 3.
http://www.4shared.com/dir/sONbITty/_online.html.
For some reason, the site isn't working for me under Firefox, but Internet Explorer still has access. Probably some mess with cookies or something; let me know if you can't get them, and I'll try putting them somewhere else. ETA: I cleared my cookies for the site, and it's accessible for me now under Firefox too.
lexicographer

Re: web based access to one of Google's OCR engines

Post by lexicographer »

clemd973: I would be interested to hear whether you can actually search through the text across the fonts. You mention that changing the fonts to Times New Roman results in misspellings; this suggests to me that you would not be able to search for a word in one of those newly created fonts, since they do not seem to have the same sequence of letters (the 'a' in a new font might actually be an 'f' in Times) - but maybe I have misunderstood you?
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: web based access to one of Google's OCR engines

Post by clemd973 »

spamsickle wrote:I kind of doubt that the quality input to Scan Tailor is the explanation in this case, because I managed to shoot the book without the proper white balance setting on one of the cameras.
Well, your image straight from the camera looks much sharper than mine usually come out. I'm still tinkering with my lighting and trying to find the best settings on the cameras (PSA480's). I'm also using glare-free acrylic instead of glass, and I believe that's detracting from the clarity of the shot...see below.
IMG_0012.jpg
However, after the post processing and OCR/ClearScan, I get this...
Page3b.pdf (33.51 KiB)
I'd like to hear your workflow.
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: web based access to one of Google's OCR engines

Post by clemd973 »

lexicographer wrote:clemd973: I would be interested to hear whether you can actually search through the text across the fonts. You mention that changing the fonts to Times New Roman results in misspellings; this suggests to me that you would not be able to search for a word in one of those newly created fonts, since they do not seem to have the same sequence of letters (the 'a' in a new font might actually be an 'f' in Times) - but maybe I have misunderstood you?
My books are fully searchable both on my computer and my iPad. I tested it once again before posting this to make sure, and I searched for various words with success...even the word "humanitarianism," which isn't all that common, and it found it right away. I again tried to change the font, which resulted in format changes as well as a few misspellings. I can't figure that one out yet. You can try it on my file posted above in a reply to spamsickle earlier on this page (first post on this page...linked to the word "here" under the third quote). I'd be interested in hearing your thoughts.
lexicographer

Re: web based access to one of Google's OCR engines

Post by lexicographer »

I looked at your scan just briefly: exported the whole book to WinWord, so I could see the actual text (not the scanned pictures). The number of font names is astonishing. The actual quality of the OCR is very good; I found only three words (actually letters) which had not been read by the OCR program and were marked with a question mark. Then I formatted the whole text with Times New Roman. As far as I could tell, there was no increase in misspellings, but I only clicked through the text cursorily, so some may have escaped me. Certainly my theory that different fonts might be mapped differently missed the mark completely. Great result!
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: web based access to one of Google's OCR engines

Post by daniel_reetz »

Yep, that's a focus issue or a glare-free acrylic issue. Your book images should be coming out much sharper than that. It's not too underexposed, and the lighting looks reasonably even.