
Acrobat Tips

DamnedOwl

Re: Acrobat Tips

Post by DamnedOwl » 17 Mar 2011, 06:51

clemd973 wrote:Yeah, the sheer number of created fonts would take forever to go through. I'll take a look at those programs, though. You never know what can be gleaned from different sources
PitStop Pro especially is a really good plug-in for Acrobat - I recommend it for anyone who works a lot with PDFs.
clemd973 wrote:I do have a dual boot system - running XP - but it seems that incorporating Clear Scan into the initial process has all but eliminated the original problem for me. Thanks for the suggestion. Still working on the workaround for file size.
Still the massive file size? That sucks! I spent ages trying to find a solution for this - in the end I was fortunate in that the solution kind of found me.

But still, I can only hope that the next advance in ClearScan (which I do otherwise think is excellent) will do something about reducing the number of fonts that are created through the process.

In this respect, it would be good if future versions could have some way of controlling the ClearScan process.

For example, it seems to me that ClearScan will only tolerate a small amount of difference between characters before it produces an entirely new font. This is great in some respects, because the end result looks very similar to the original. In other respects (file size, for example) it's crazy, because a book that uses only seven or eight different fonts could be rendered with maybe 20 times that number once it goes through the ClearScan process. Maybe if you could change the tolerance level for differences between characters, it would reduce the number of generated fonts and hence the file size.
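
To illustrate the tolerance idea, here is a rough Python sketch. It is not how ClearScan works internally (that isn't documented in this thread); it just assumes glyphs are small black-and-white bitmaps and groups them greedily by the number of differing pixels. The names `hamming`, `group_glyphs` and the tiny 3x3 sample glyphs are made up for the example.

```python
from typing import List, Sequence, Tuple

Glyph = Tuple[int, ...]  # flattened bitmap, 1 = ink, 0 = paper (assumed representation)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def group_glyphs(glyphs: List[Glyph], tolerance: int) -> List[List[Glyph]]:
    """Greedy grouping: a glyph joins the first group whose representative
    differs from it by at most `tolerance` pixels, else it starts a new group."""
    representatives: List[Glyph] = []
    groups: List[List[Glyph]] = []
    for glyph in glyphs:
        for i, rep in enumerate(representatives):
            if hamming(glyph, rep) <= tolerance:
                groups[i].append(glyph)
                break
        else:
            representatives.append(glyph)
            groups.append([glyph])
    return groups


# Hypothetical 3x3 glyphs: a box, the same box with one pixel of scan noise,
# and an "L" shape that differs from the box by three pixels.
box = (1, 1, 1, 1, 0, 1, 1, 1, 1)
box_noisy = (1, 1, 1, 1, 0, 1, 1, 1, 0)
ell = (1, 0, 0, 1, 0, 0, 1, 1, 1)
glyphs = [box, box_noisy, ell]

for tolerance in (0, 1, 3):
    print(tolerance, len(group_glyphs(glyphs, tolerance)))
```

With zero tolerance every scanned variant becomes its own shape; a looser tolerance merges the noisy duplicate, and a tolerance that is too loose starts merging shapes that are genuinely different, which is the judgment call a user-facing setting would expose.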

rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Acrobat Tips

Post by rob » 18 Mar 2011, 09:31

I converted a 280-page book using ClearScan, and ended up with 83 fonts, and a 6.9 MB file. I used Adobe Acrobat 9 to generate a preflight inventory, which prints out the fonts. You can see it makes a valiant effort to group characters that sort of look alike, but it does stumble with characters that should be identical but aren't. Also, it completely fails to recognize certain bits, treating them as images for whatever reason. In the end, someone had to make a judgment call as to how much variation to allow, and sadly that someone isn't you!

(Aside: minidjvu allows you to set the aggressiveness of character matching, but even at the highest setting, I've seen it fail to match what should be fairly obvious duplicate letters. Supposedly minidjvu has a threshold for the number of different pixels, and that threshold varies with document dpi -- the higher the dpi, the more pixels may be different -- but it still doesn't work quite well enough for me. I should look at the source code. One wonders if they forgot that the number of different pixels should go up as the square of the dpi...)
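
To make the scaling argument concrete: a character's area in pixels grows with the square of the dpi, so a threshold meant to allow a fixed fraction of mismatched pixels would have to grow the same way. A small Python sketch follows; the function names and the base values (10 pixels at 300 dpi) are invented for illustration and are not taken from minidjvu's source.

```python
def quadratic_threshold(dpi: float, base_dpi: float = 300, base_threshold: float = 10) -> int:
    """Allowed differing pixels if the threshold tracks glyph area (dpi squared)."""
    return round(base_threshold * (dpi / base_dpi) ** 2)


def linear_threshold(dpi: float, base_dpi: float = 300, base_threshold: float = 10) -> int:
    """Allowed differing pixels if the threshold only grows linearly with dpi."""
    return round(base_threshold * dpi / base_dpi)


for dpi in (300, 600):
    print(dpi, linear_threshold(dpi), quadratic_threshold(dpi))
```

Under these assumed base values, doubling the resolution doubles the linear threshold but quadruples the area-based one, so a linear rule would get relatively stricter as dpi goes up.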

What I was impressed with was ClearScan's ability to sometimes group broken and accented characters correctly. Apparently there must be an algorithm inside which tries to group bits of characters together coherently, and it does a pretty good job with it. Unfortunately, sometimes it misses, and treats the accent itself as a separate character.

The OCR part of ClearScan mostly recognizes each character, at least for plain old unaccented alphanumeric characters. It often mistakes a long thin character, such as an italic f, for an exclamation point. The OCR is unable to recognize characters composed of more than one letter.

With 83 fonts, and an average of about (guessing) 70 characters in each font, that's on the order of 6,000 separate characters, but I'm guessing that maybe 4,000 of those characters were OCR'd correctly. (Update: this is wrong. There are many more characters than this -- some fonts have upwards of 1,500 characters!)

Here are some samples from the inventory.
Attachments: preflight1.jpg, preflight2.jpg

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 18 Mar 2011, 14:41

Rob, thanks for the analysis. While Clear Scan inflates the file size, I'm happy with the results, and the fine tuning of the font appearance, in my opinion, adds a lot to the finished product. I'm wondering out loud, until I can research it, if there might be a way to "Print to File" that will simply capture the image of the page without needing the embedded fonts? (But I suppose that would render it an unsearchable document and defeat the purpose.)

jem

Re: Acrobat Tips

Post by jem » 18 Mar 2011, 20:55

daniel_reetz wrote:So what's the ultimate solution here - if they won't bugfix - a script to submit a few hundred pages at a time?
clemd973 wrote:Other than living with it, I'm sure an Acrobat "Action" can be compiled to process incrementally. I plan to pursue that as well.
Following this, I've found a way to work around the error. In my case, at least, Acrobat would usually scan the whole document and only fail at the very end, while "Finalizing ClearScan file". The "Optimize Scanned PDF" tool in "Document Processing" can be set to use ClearScan OCR, and it processes a single page at a time, finalizing and generating output before moving to the next. This prevents the error in most cases, and even if an error does happen, the intermediate progress is not lost: you can simply resume from the failed page rather than lose 7 hours of scanning a 3,800-page document.

Put it in an Action for batch processing and you can almost forget how much you paid Adobe for a half-baked OCR solution. However, if it's run from an Action you won't be able to recover from any "single-page error". Depending on what you're working with, this might not be a problem; I've only seen it a couple of times, on pages with complex line drawings.

Hopefully this is somewhat helpful and please make sure you contact Adobe and let them know you expect better.
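
For anyone scripting something similar outside Acrobat, the "one page at a time, resume from the failed page" idea can be sketched in Python as below. This is only an illustration of the pattern: `process_pages`, the `process_page` callback and the JSON checkpoint file are hypothetical names, and nothing here calls Acrobat or any real OCR API.

```python
import json
from pathlib import Path
from typing import Callable


def process_pages(total_pages: int,
                  process_page: Callable[[int], None],
                  checkpoint: Path) -> int:
    """Process pages one at a time, saving progress after each page.

    Returns the index of the next page still to be processed
    (equal to total_pages when everything is done).
    """
    start = 0
    if checkpoint.exists():
        # Resume where the previous run stopped.
        start = json.loads(checkpoint.read_text())["next_page"]

    for page in range(start, total_pages):
        try:
            process_page(page)  # hypothetical per-page OCR step
        except Exception:
            # Broad catch is fine for a sketch: stop here, keep earlier work,
            # and leave the checkpoint pointing at the failed page.
            return page
        checkpoint.write_text(json.dumps({"next_page": page + 1}))

    return total_pages
```

Calling `process_pages` again with the same checkpoint file picks up at the failed page, so a late failure costs one page of work rather than the whole run.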

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 25 Mar 2011, 12:13

In considering your workaround and doing a little digging into the "Preferences" section, I believe I've worked out an effective and even more time-efficient method. Under "Preferences" there is an option to tailor "Convert to PDF" to your specific needs. In my workflow, I select "Combine Files into PDF" from the opening menu as I prepare to use the exported .tiff files from Scan Tailor. Prior to this, you might do the following IN ACROBAT X PRO:

Acrobat > Preferences > Convert to PDF (Under "Categories") > TIFF (Under "Converting to PDF") > Edit Settings > Settings (next to "Scan Optimization and OCR") > Edit (under OCR options) > Select "Clear Scan" from the PDF Output Style drop down box > Click "OK" at the notification "Compression settings will not be used when Clearscan is selected."

Yes, that's a long trail to follow, but what it does is assign OCR/Clear Scan to the import process at the beginning of the Acrobat workflow. This causes OCR/Clear Scan to be run on each individual .tiff during the import process. Since it is done one page at a time, it doesn't become a memory hog and actually goes by pretty quickly. I think it took about 25-35 minutes to incorporate 1200 pages. After the import is complete, you're ready to save as PDF. Voila, you're done! Now, it may have taken a little more than 25-35 minutes...can't remember, but it certainly didn't take as long as it did when I was running OCR/Clear Scan AFTER import when I was getting the original crashes.

All in all, I still prefer the look of the final product with Clear Scan, even though it explodes the file size. Guess we'll just have to accept that for the time being. I've got an Acrobat guru I'm going to present this problem to, to see if she can make any suggestions. Stay tuned...

DamnedOwl

Re: Acrobat Tips

Post by DamnedOwl » 25 Mar 2011, 12:29

What was the PDF file size of those 1200 pages once you'd imported them using this method?

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 25 Mar 2011, 20:51

DamnedOwl wrote:What was the PDF file size of those 1200 pages once you'd imported them using this method?
Sadly, 95 MB.
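
For a rough sense of scale, the two file sizes quoted in this thread can be put on a per-page basis. The quick Python calculation below assumes 1 MB = 1000 KB and ignores that these are different books, possibly scanned at different resolutions, so treat the ratio as a ballpark only. `kb_per_page` is just a helper name for this example.

```python
def kb_per_page(size_mb: float, pages: int) -> float:
    """Average size per page in KB, assuming 1 MB = 1000 KB."""
    return size_mb * 1000 / pages


rob_book = kb_per_page(6.9, 280)     # rob's 280-page ClearScan file
clemd_book = kb_per_page(95, 1200)   # clemd973's 1200-page import

print(f"rob: {rob_book:.1f} KB/page")
print(f"clemd973: {clemd_book:.1f} KB/page")
print(f"ratio: {clemd_book / rob_book:.1f}x")
```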
