Acrobat Tips

Discussions, questions, comments, ideas, and your projects having to do with DIY Book Scanner software. This includes the Stereo Data Maker software for the cameras, post-processing software, utilities, OCR packages, and so on.
clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Acrobat Tips

Post by clemd973 » 28 Dec 2010, 11:33

I thought I'd start this thread for closed-source junkies like myself who use Adobe Acrobat for part of their post-processing. I'm actually a new user and have pulled out some of my hair over some of the issues I've run into. I wish I had run across a resource like this, but I never found one, so why not start one now? Even if you don't use programs like Acrobat, some of the ideas and problems presented here might help you work out scripts and code to use in your own post-processing. I've just finished my first multicolored-text scan and am very satisfied with the results, although the final presentation still needs some tweaking. PLEASE ADD YOUR OWN TIPS/TRICKS, ETC.
Last edited by clemd973 on 30 Dec 2010, 08:52, edited 1 time in total.

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 28 Dec 2010, 11:41

Problem: With pages containing multi-colored text, you can't process in Scan Tailor using the "black and white" setting, for obvious reasons. ;) Therefore, you must use the "color/grayscale" setting, but the background ends up looking very blotchy. I found that even after adjusting the colors and settings in Lightroom 3 (the program I use for pre-processing the images), the background still came out slightly blotchy once output from Scan Tailor in "color/grayscale" mode, and not really clean like I wanted it.

Solution: I researched what could be done in Acrobat and found that I could use Edit>Preflight to separate the pages into different layers: Image layer, Text layer, and Vector Object layer. Then, in the layer command, I could hide the image layer, which effectively removes the blotchy background from view. That layer can even be locked out, i.e., made to never be visible, whether viewing in Acrobat, exporting, or printing. 8-)

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 18 Jan 2011, 10:32

clemd973 wrote: I researched what could possibly be done in Acrobat and I found that I could use Edit>Preflight to separate the pages into different layers: Image layer, Text layer, and Vector Object layer. Then, in the layer command, I could hide the image layer, which effectively removes/hides the image layer.
It's been a while since my last update, and I've made some changes to my workflow that address the issue in my first post. I had to rework my original colored-text book because of a problem I ran into with separating the document into different layers. I wanted to separate all text from the background image, but that wasn't always what happened. At times the Acrobat OCR did not recognize certain words, mostly at random, so they remained on the image layer. Because of this, when I hid the background/image layer, some of the text went with it. I've got an idea for how to resolve this and still use the "layer option," but until I can look into that further, I've found another workaround.

I process my images in Lightroom 3 before sending them to Scan Tailor, and in becoming more familiar with the program, I was able to effectively whiten the background and darken both the black and the colored text. Then, in Scan Tailor, using "color/grayscale" mode to keep the colored text, I could clean up the background further by selecting "white margins" and "equalize illumination". Some minor blotches remain visible in the final product, but I'm OK with that. It comes up clean and easily readable on my iPad. Moreover, I was able to save the settings in Lightroom 3 as a preset so I can use them again in the future.

What also adds to the ease of processing is the two LCD monitors I mounted on my scanner, which let me see what the camera is seeing as I work through the scanning process, so I can make any needed adjustments along the way.

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 23 Jan 2011, 18:41

OCR and RAM: I'm using a MacBook Pro with 4GB RAM. When performing OCR in Acrobat, I've found it's better to go in increments of about 100 pages at a time. The OCR process seems to hold all the pages in memory at once rather than processing one page at a time and releasing it. Going over about 100 pages may therefore use up all your RAM, which results in an error message and cancels the process. This really sucks if you were trying to OCR 500 pages, which takes a lot of time, only to get the error message halfway or more through and have to start from the beginning again.

daniel_reetz
Posts: 2797
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Acrobat Tips

Post by daniel_reetz » 23 Jan 2011, 23:57

I know I've seen similar reports around the forums -- a lot of people doing 100 or 200 pages at a time, and then binding the results. Does the Mac platform have some kind of profiler that would confirm this behavior?

Seems a shame that such nice software has problems like this in the year 2011, but since we're really pushing the limits of technology, it's in a way unsurprising.

Thanks for keeping track of this stuff, Clemd973.

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 25 Jan 2011, 06:36

daniel_reetz wrote: Does the Mac platform have some kind of profiler that would confirm this behavior? Seems a shame that such nice software has problems like this in the year 2011, but since we're really pushing the limits of technology, it's in a way unsurprising.
To be honest, I'm not really sure. I'll look into it and post back.

umpausewhat
Posts: 22
Joined: 04 Mar 2014, 00:55

Re: Acrobat Tips

Post by umpausewhat » 26 Jan 2011, 21:34

I don't know exactly how Acrobat uses RAM, but I've found its OCR performance depends a lot on how clean the text is. This isn't just an image quality issue, but sometimes an issue with the printed material itself. I have mass market paperbacks in which the ink of a printed letter commonly touches the next letter. Acrobat OCR does not like this. I've crashed the OCR a few times on these types of texts. To make matters more frustrating, in these scenarios Acrobat's "ClearScan" OCR doesn't end up making the file smaller (sometimes the process increases the file size). I take it this is because every time it runs across multiple touching letters, it has to come up with a new custom vectorized font, and as those fonts multiply, the OCR process gets cumbersome and the file size bigger. Maybe this is where RAM comes in if you are using ClearScan: too many custom fonts multiplying.

When a printed text is clean and all the letters are separate, Acrobat's OCR seems to be able to handle big books without much trouble. I don't think RAM is too much of an issue here: when I look at my memory usage during the OCR process (using Task Manager), it tends to remain pretty constant. I've got the same amount of RAM mentioned above (4 GB). But I'm not a computer expert, so forgive any oversights in the above.

clemd973
Posts: 121
Joined: 22 Aug 2010, 21:20

Re: Acrobat Tips

Post by clemd973 » 27 Jan 2011, 01:14

Dan, I hope this is what you were looking for. I used Activity Monitor to assess RAM usage during the OCR process. I've got 4GB installed, and the amount of free memory seemed to fall as the process went on. I'm in no way a Mac or Adobe expert, but I know that when I include too many pages at one time to OCR with ClearScan, it tells me I've run out of memory. When I decrease that number to about 100 - 125 pages at a time, it works fine. I think the images below help illustrate what's going on. Pay attention to what's circled in red. I've got a friend writing a script to document this more accurately. I'll show those results when I get them. Eating the RAM really sucks, but as long as there's a workaround, I'm OK with it. So far, I'm on book #5. I'll post some pics of my work soon...might have to start a new thread for that.
OCRbeginning.jpg
The beginning of the 127-page OCR process with ClearScan.
OCRp45of127.jpg
Page 45 of 127.
OCRp78of127.jpg
Page 78 of 127.
OCRp114of127.jpg
Page 114 of 127.
OCRp126of127.jpg
Page 126 of 127.
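
For anyone who wants numbers instead of screenshots, here is a minimal sketch of how free memory could be logged while Acrobat runs OCR. It is not the script mentioned above. It assumes macOS's `vm_stat` command-line tool, and the names `parse_vm_stat` and `log_free_memory` are made up for illustration. Note that "Pages free" tends to understate what is really available, since macOS can also reclaim inactive pages, so treat the trend as more meaningful than the absolute value.

```python
import re
import subprocess
import time


def parse_vm_stat(text):
    """Return free memory in bytes, parsed from the text printed by `vm_stat`."""
    size_match = re.search(r"page size of (\d+) bytes", text)
    free_match = re.search(r"^Pages free:\s+(\d+)\.", text, re.MULTILINE)
    if not size_match or not free_match:
        raise ValueError("unrecognized vm_stat output")
    # vm_stat reports counts of pages, so multiply by the page size.
    return int(free_match.group(1)) * int(size_match.group(1))


def log_free_memory(interval_s=5.0, samples=10):
    """Print a timestamped free-memory reading every `interval_s` seconds (macOS only)."""
    for _ in range(samples):
        result = subprocess.run(
            ["vm_stat"], capture_output=True, text=True, check=True
        )
        free_mib = parse_vm_stat(result.stdout) / 2**20
        print(f"{time.strftime('%H:%M:%S')}  free: {free_mib:.0f} MiB")
        time.sleep(interval_s)
```

Start the OCR in Acrobat, then call `log_free_memory()` from a terminal session with a sample count long enough to cover the whole run. If the readings keep falling as the page count climbs, that supports the idea that memory is not being released page by page.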

daniel_reetz
Posts: 2797
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Acrobat Tips

Post by daniel_reetz » 27 Jan 2011, 11:06

That's pretty damning in and of itself. You might consider sending it in to Adobe as a bug report.

So what's the ultimate solution here - if they won't bugfix - a script to submit a few hundred pages at a time?
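
One possible shape for such a script, as a rough sketch only: split the scanned PDF into files of about 100 pages each, OCR every file in Acrobat, then combine the results. Nothing here drives Acrobat itself. The splitting step assumes the third-party pypdf library, and `page_batches` and `split_pdf` are hypothetical names chosen for this example.

```python
from pathlib import Path


def page_batches(total_pages, batch_size=100):
    """Return 1-based inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


def split_pdf(src, out_dir, batch_size=100):
    """Write one PDF per batch and return the paths written (needs pypdf)."""
    # Imported here so page_batches() works even without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for first, last in page_batches(len(reader.pages), batch_size):
        writer = PdfWriter()
        # reader.pages is 0-based, while the batch ranges are 1-based.
        for index in range(first - 1, last):
            writer.add_page(reader.pages[index])
        target = out_dir / f"{Path(src).stem}_p{first:04d}-{last:04d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

For a 500-page book, `page_batches(500)` gives five ranges of 100 pages. The zero-padded page numbers in the file names keep the pieces in order when they are recombined after OCR.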

rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Acrobat Tips

Post by rob » 27 Jan 2011, 12:04

ClearScan is an incredible memory hog. I have 2 GB on my Mac, and converting to ClearScan died after a few hundred pages (I don't remember the exact figure...).
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
