an initial effort...

General discussion about software packages and releases, new software you've found, and threads by programmers and script writers.

Moderator: peterZ

cfmorrill
Posts: 56
Joined: 17 Apr 2011, 21:20
Number of books owned: 0
Location: Charlottesville, Virginia

an initial effort...

Post by cfmorrill »

O.k. so I fired up the hackerspace scanner for the first time and photographed two pages.

Next, I imported the pages to iPhoto (this is an iMac), then parked them on the desktop, fired up Scan Tailor, deskewed, margined, etc. to an output file. Next I opened up Acrobat 9 and OCR'd my first page. Damn, got just about every word.

Some comments:

1) This is way, way cool.

2) When Acrobat OCR'd the page, the image became fuzzier and I don't know why. Text is searchable, but I'm curious as to why it would look a little worse. Strange. Acrobat (this is the full Pro version) must have some way of cleaning up a page. If it's OCR'd, seems to me it would be able to replace the existing document with a whole new typeface. One problem would be all the punctuation marks, I guess. Any suggestions?

Charles
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: an initial effort...

Post by rob »

What happens if you zoom in on a page? Do the letters become clearer with smooth edges? If so, then you've run into an artefact of poor anti-aliasing. A page reduced by factors of two (50%, 25%, and so on) generally looks better than a page reduced by an odd factor (e.g. 33%) because of aliasing. I'm not sure that can be fixed, since it's a function of the reader program, not the source file.
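
To illustrate the kind of artefact described above, here is a minimal Python sketch. It assumes a viewer that shrinks a page by plain point sampling with no anti-aliasing filter, which may or may not match what any particular reader program does. The function names and the striped test row are made up for the illustration. In this example the factor-2 reduction samples every stroke the same way, while the factor-1.5 reduction lets the sample positions drift relative to the strokes, so equal strokes come out with unequal widths.

```python
from itertools import groupby


def nearest_downscale(row, factor):
    """Shrink a row of pixels by point sampling (no filtering at all)."""
    out_len = int(len(row) / factor)
    return [row[int(i * factor)] for i in range(out_len)]


def run_lengths(row):
    """Widths of consecutive runs of identical pixels (strokes and gaps)."""
    return [len(list(group)) for _, group in groupby(row)]


# Hypothetical test row: strokes two pixels wide separated by two-pixel gaps
# (1 = ink, 0 = paper).
row = [1, 1, 0, 0] * 6

even = nearest_downscale(row, 2)      # whole-number reduction
uneven = nearest_downscale(row, 1.5)  # fractional reduction

print(run_lengths(even))    # every stroke and gap has the same width
print(run_lengths(uneven))  # widths alternate between one and two pixels
```

A viewer that filters properly before resampling would turn the uneven widths into shades of grey instead, which is why the same page can look cleaner at some zoom levels than at others.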
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
cfmorrill
Posts: 56
Joined: 17 Apr 2011, 21:20
Number of books owned: 0
Location: Charlottesville, Virginia

Re: an initial effort...

Post by cfmorrill »

Hi Rob,
Hmmm....that's interesting. Maybe I need an Acrobat expert (dare I say an Acrobat "acrobat"?)
Clearly I've got to work the whole thing through from beginning to end starting with better camera placement so I can capture a page at better resolution...

Charles

PS...I keep drawing up grids of wires in an effort to help out your logic module...
cfmorrill
Posts: 56
Joined: 17 Apr 2011, 21:20
Number of books owned: 0
Location: Charlottesville, Virginia

Re: an initial effort...

Post by cfmorrill »

Still hammering away at this. It's kind of funny, because I read about different hardware builds and the problems folks have seem so simple. Then I get into software issues and the simplest things for everyone else can take me several days. I guess it's all what you're used to.

I had thought a while back that a software program called Prizmo might help me dewarp and deskew pages, but after using it for a while I just couldn't get the hang of how to dewarp a slightly warped page. Paradoxically, Prizmo seems to work best with a page that's really skewed. I think it's designed to flatten something like a magazine page or a poster that you've photographed at a weird angle. In such a case, it works quite well. But the images that come off the hackerspace scanner are skewed and keystoned only to a subtle degree, and I find myself spending a half hour on each page to correctly place the Bezier page margins so everything comes out just so. Consequently, it doesn't help that much. But, hey, that's not what it's designed to do. It seems to have been designed so you could take your point-and-shoot digital camera around Paris, snapping the posters on the Metro at all angles. Then back home Prizmo lets you reproduce the image flat. It's cool, but not exactly what I need.

So, I went back to Scan Tailor and spent some more time with it, working the idea of the assembly line from the video. It took me a number of days to discover that Scan Tailor does in fact create its own little "out" folder inside the source folder that holds the JPEGs. I know Rob had written this, but it took me some time to figure out exactly what was meant.

Then I discovered I had an early build of Scan Tailor without dewarping. Then I got the latest, and the dewarping seemed to make things worse, but I'm not sure why. So I ended up being very careful with shots on the scanner, letting Scan Tailor do its thing without dewarping, then OCR'ing them in Acrobat and living with only slightly less than perfect results.

Interesting to me why I want them to be so perfect. People seldom look at a book without the pages being skewed or keystoned to some degree as it sits in your lap. Yet, we think nothing of it. When we see skewing and keystoning on a computer screen though, it looks terrible. Why? I dunno...

Charles (Working away at this, but knowing the only way to come up with a decent workflow is to hack away at it.)

Note to people new at this:

Best thing you can do is keep a journal. Get a big sketchbook and write down what you did each time.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: an initial effort...

Post by daniel_reetz »

charles, can you post a pair of example pages that we can experiment with? the hackerspace scanner *should* be producing pages with no keystoning but some barrel distortion (inherent in the camera optics)... in any case we should be able to figure out why things aren't working yet.
dpc
Posts: 379
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: an initial effort...

Post by dpc »

I was wondering how you were getting on with your post-processing, Charles. Thanks for the updates.

One of these days I'm going to be in the same boat and I'm interested in hearing about your trials and tribulations of developing an efficient workflow using the new "HS" scanner. I was hoping that the new scanner design would produce page images that wouldn't require a lot of post-processing - certainly not the keystone problems you're seeing since the camera mount is in lock-step with the platen surface. That's gotta be a camera position/direction problem?

Dan, regarding the "barrel distortion", would moving the camera farther away from the platen surface and zooming in to compensate mitigate this effect? What are our options here?
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: an initial effort...

Post by daniel_reetz »

It's possible to minimize lens distortion by setting the camera somewhere in the middle of its total zoom range, but for these models we are counting on getting the software correction right so the cameras can sit wherever they mechanically need to sit.

Scan Tailor or not, fixing lens distortion is a solved problem; it's just one we need to integrate more carefully and completely into the workflow.
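
As a rough sketch of what software correction of lens distortion involves, the Python below uses a one-coefficient radial model. This is a common simplification and not necessarily what Scan Tailor or any other tool in this workflow does. Coordinates are assumed to be normalized and measured from the image centre, and the coefficient `k1` is a hypothetical value that in practice would come from calibrating the camera. A negative `k1` gives barrel distortion.

```python
import math


def distort(x, y, k1):
    """Apply the radial model: scale a point by (1 + k1 * r^2)."""
    scale = 1 + k1 * (x * x + y * y)
    return x * scale, y * scale


def undistort(xd, yd, k1, iterations=20):
    """Invert the radial model by fixed-point iteration.

    Starts from the distorted point and repeatedly divides by the scale
    factor evaluated at the current estimate. This converges for the mild
    distortion assumed here, but is not guaranteed to for large |k1|.
    """
    xu, yu = xd, yd
    for _ in range(iterations):
        scale = 1 + k1 * (xu * xu + yu * yu)
        xu, yu = xd / scale, yd / scale
    return xu, yu


K1 = -0.1  # hypothetical barrel-distortion coefficient

true_point = (0.8, 0.6)
seen = distort(*true_point, K1)   # where the lens places the point
recovered = undistort(*seen, K1)  # where correction puts it back

print(seen)
print(recovered)
```

In a real pipeline the same inverse mapping would be applied to every pixel position, followed by interpolation to build the corrected image.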
cfmorrill
Posts: 56
Joined: 17 Apr 2011, 21:20
Number of books owned: 0
Location: Charlottesville, Virginia

Re: an initial effort...

Post by cfmorrill »

Sure Daniel,
I might need a couple of days because my two jobs are quite busy right now. Also, I'm not exactly sure how to attach a .pdf file to one of these posts...hmmm. Is there a tutorial about?

Regards, Charles
Alan Shutko

Re: an initial effort...

Post by Alan Shutko »

Acrobat, by default, will resample the image when doing OCR. That can make it less sharp.

To prevent this, change the OCR settings to "Searchable Image (Exact)" which means Acrobat won't touch the image and will just OCR it.