OK, so I fired up the hackerspace scanner for the first time and photographed two pages.
Next, I imported the pages into iPhoto (this is an iMac), parked them on the desktop, fired up Scan Tailor, and deskewed, set margins, etc., to an output file. Then I opened Acrobat 9 and OCR'd my first page. Damn, it got just about every word.
Some comments:
1) This is way, way cool.
2) When Acrobat OCR'd the page, the image became fuzzier and I don't know why. The text is searchable, but I'm curious as to why it would look a little worse. Strange. Acrobat (this is the full Pro version) must have some way of cleaning up a page. If the page is OCR'd, it seems to me Acrobat should be able to replace the existing image with a whole new typeface. One problem would be all the punctuation marks, I guess. Any suggestions?
Charles
an initial effort...
Moderator: peterZ
- rob
- Posts: 773
- Joined: 03 Jun 2009, 13:50
- E-book readers owned: iRex iLiad, Kindle 2
- Number of books owned: 4000
- Country: United States
- Location: Maryland, United States
- Contact:
Re: an initial effort...
What happens if you zoom in on a page? Do the letters become clearer, with smooth edges? If so, then you've run into an artefact of poor anti-aliasing. A page reduced by a power of two (50%, 25%, and so on) generally looks better than a page reduced by an odd factor (e.g. 33%), because the viewer's resampling aliases badly at those sizes. I'm not sure that can be fixed, since it's a function of the reader program, not the source file.
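As a rough illustration of why the shrinking method matters, here is a minimal Python sketch. It assumes a viewer that shrinks a page either by plain point sampling or by averaging neighbouring pixels; both functions are hypothetical stand-ins and do not model any particular reader program, nor the power-of-two effect specifically.

```python
def shrink_nearest(row, factor):
    """Keep every `factor`-th pixel: point sampling, no anti-aliasing."""
    return row[::factor]


def shrink_box(row, factor):
    """Average each run of `factor` pixels: a simple anti-aliasing filter."""
    usable = len(row) - len(row) % factor
    return [sum(row[i:i + factor]) / factor for i in range(0, usable, factor)]


# One row of white paper (1.0) crossed by one-pixel-wide black strokes (0.0).
row = [1.0] * 12
for stroke in (1, 4, 6, 11):
    row[stroke] = 0.0

nearest = shrink_nearest(row, 3)  # some strokes vanish, one stays solid black
box = shrink_box(row, 3)          # every stroke survives as the same grey

print("point sampled:", nearest)
print("box averaged: ", [round(v, 3) for v in box])
```

With point sampling, whether a thin stroke survives depends on where it happens to fall, which is what makes small text look ragged. Averaging keeps every stroke at an even weight, at the cost of some softness.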
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
-
- Posts: 56
- Joined: 17 Apr 2011, 21:20
- Number of books owned: 0
- Location: Charlottesville, Virginia
Re: an initial effort...
Hi Rob,
Hmmm....that's interesting. Maybe I need an acrobat expert (dare I say an acrobat "acrobat"?)
Clearly I've got to work the whole thing through from beginning to end starting with better camera placement so I can capture a page at better resolution...
Charles
PS...I keep drawing up grids of wires in an effort to help out your logic module...
-
- Posts: 56
- Joined: 17 Apr 2011, 21:20
- Number of books owned: 0
- Location: Charlottesville, Virginia
Re: an initial effort...
Still hammering away at this. It's kind of funny, because I read about different hardware builds and the problems folks have seem so simple. Then I get into software issues and the simplest things for everyone else can take me several days. I guess it's all what you're used to.
I had thought a while back that a program called Prizmo might help me dewarp and deskew pages, but after using it for a while I just couldn't get the hang of how to dewarp a slightly warped page. Paradoxically, Prizmo seems to work best with a page that's really skewed. I think it's designed to flatten something like a magazine page or a poster that you've photographed at a weird angle. In such a case, it works quite well. But the images that come off the hackerspace scanner are skewed and keystoned only to a subtle degree, and I find myself spending half an hour on each page placing the Bezier page margins so everything comes out just so. Consequently, it doesn't help that much. But, hey, that's not what it's designed to do. It seems to have been designed so you could take your point-and-shoot digital camera around Paris, snapping the posters on the Metro at all angles. Then back home Prizmo lets you reproduce the image flat. It's cool, but not exactly what I need.
So I went back to Scan Tailor and spent some more time with it, working the idea of the assembly line from the video. It took me a number of days to discover that Scan Tailor does in fact create its own little "out" folder inside the source folder that holds the jpegs. I know Rob had written this, but it took me some time to figure out exactly what was meant.
Then I discovered I had an early build of Scan Tailor without dewarping. I got the latest, but the dewarping seemed to make things worse, and I'm not sure why. So I ended up being very careful with shots on the scanner, letting Scan Tailor do its thing without dewarping, then OCR'ing the pages in Acrobat and living with slightly less than perfect results.
Interesting to me why I want them to be so perfect. People seldom look at a book without the pages being skewed or keystoned to some degree as it sits in your lap. Yet, we think nothing of it. When we see skewing and keystoning on a computer screen though, it looks terrible. Why? I dunno...
Charles (Working away at this, but knowing the only way to come up with a decent workflow is to hack away at it.)
Note to people new at this:
Best thing you can do is keep a journal. Get a big sketchbook and write down what you did each time.
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
- Contact:
Re: an initial effort...
charles, can you post a pair of example pages that we can experiment with? the hackerspace scanner *should* be producing pages with no keystoning but some barrel distortion (inherent in the camera optics)... in any case we should be able to figure out why things aren't working yet.
Re: an initial effort...
I was wondering how you were getting on with your post-processing, Charles. Thanks for the updates.
One of these days I'm going to be in the same boat and I'm interested in hearing about your trials and tribulations of developing an efficient workflow using the new "HS" scanner. I was hoping that the new scanner design would produce page images that wouldn't require a lot of post-processing - certainly not the keystone problems you're seeing since the camera mount is in lock-step with the platen surface. That's gotta be a camera position/direction problem?
Dan, regarding the "barrel distortion", would moving the camera farther away from the platen surface and zooming in to compensate mitigate this effect? What are our options here?
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
- Contact:
Re: an initial effort...
It's possible to minimize lens distortion by setting the camera somewhere in the middle of its total zoom range, but for these models we are counting on getting the software correction right so the cameras can sit wherever they mechanically need to sit.
Scan Tailor or not, fixing lens distortion is a solved problem; it is just one we need to integrate more carefully and completely into the workflow.
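For readers wondering what a software correction looks like, here is a minimal Python sketch of one common approach: a single-coefficient radial distortion model, inverted by fixed-point iteration. It is an illustration only, not what Scan Tailor or any tool mentioned in this thread actually does. The function names and the coefficient `K1` are hypothetical; a real value would have to be measured for each camera and zoom setting.

```python
def distort(x, y, k1):
    """Map an ideal point to where the lens actually images it.

    Coordinates are normalised so that (0, 0) is the image centre.
    """
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2
    return x * scale, y * scale


def undistort(xd, yd, k1, iterations=20):
    """Invert distort() by fixed-point iteration (fine for mild distortion)."""
    x, y = xd, yd  # first guess: the distorted point itself
    for _ in range(iterations):
        r2 = x * x + y * y
        scale = 1.0 + k1 * r2
        x, y = xd / scale, yd / scale
    return x, y


K1 = -0.05  # hypothetical value; negative k1 pulls corners inward (barrel)

corner = (0.8, 0.6)               # a point near the edge of the page
seen = distort(*corner, K1)       # where the camera records it
recovered = undistort(*seen, K1)  # where the correction puts it back

print("ideal:    ", corner)
print("distorted:", seen)
print("recovered:", recovered)
```

A full correction would apply `undistort` to every pixel coordinate and resample the image. The point is that once the coefficient is known, the mapping is fixed and can run automatically on every page.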
-
- Posts: 56
- Joined: 17 Apr 2011, 21:20
- Number of books owned: 0
- Location: Charlottesville, Virginia
Re: an initial effort...
Sure Daniel,
I might need a couple of days because my two jobs are quite busy right now. Also, I'm not exactly sure how to attach a .pdf file to one of these posts...hmmm. Is there a tutorial about?
Regards, Charles
Re: an initial effort...
Acrobat, by default, will resample the image when doing OCR. That can make it less sharp.
To prevent this, change the OCR settings to "Searchable Image (Exact)", which means Acrobat won't touch the image and will just OCR it.