Simple "looking" one camera book scanner


gsloop

Re: Simple "looking" one camera book scanner

Post by gsloop »

I like this idea a lot. It makes building the rig less time-consuming, and the rig perhaps more portable.

The big question is what DPI is reasonable?

I'd guess somewhere between 200 and 300 DPI for non-archival purposes. (Keeping scans of pictures, though they won't be archival quality [read: super high quality], and having text that OCRs reasonably well...)

So I'd guess something like 12 Mpix would do, and that a good lens is a big deal, right?

Would one of the PowerShots with a high-ish MP sensor do reasonably well?

I'm scanning a wide variety of materials, but mostly books/magazines etc - some illustrations. Figure fiction [pleasure reading], university style textbooks, reference materials etc.

I'm not trying to archive comix or Picassos... :)

-Greg
bnz

Re: Simple "looking" one camera book scanner

Post by bnz »

For many non-archival cases, 200-300 dpi should be sufficient, I guess, but I would aim at 300 rather than 200. As always, whether a 12-megapixel camera is enough for both pages depends on the size of the book: the bigger the page, the lower the DPI you get per page. I would measure the biggest books and magazines you want to scan, then check how many pixels the camera you intend to buy produces per picture. Then make a rough estimate of how many pixels one page will occupy (allow for lots of unused space!) and see whether the resulting dpi is above 300.
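
A rough Python sketch of that estimate follows. The 4000x3000 pixel dimensions are an assumed figure for a generic 12-megapixel camera, the 6x9 inch page size is a made-up example, and `fill` (the fraction of each frame dimension the open book actually covers) is a guess to be replaced with your own measurement.

```python
def effective_dpi(px_long, px_short, spread_long_in, spread_short_in, fill=0.85):
    """Estimate the DPI on the page when one frame covers a whole two-page spread.

    fill is the assumed fraction of each frame dimension occupied by the book;
    the rest of the frame is unused border.
    """
    dpi_long = px_long * fill / spread_long_in
    dpi_short = px_short * fill / spread_short_in
    # The weaker axis limits the usable resolution.
    return min(dpi_long, dpi_short)


# Hypothetical example: 12 MP camera (assumed 4000x3000 pixels),
# two 6x9 inch pages side by side, i.e. a 12x9 inch spread.
dpi = effective_dpi(4000, 3000, 12.0, 9.0)
print(round(dpi))
```

If the printed figure comes out below 300, either the book is too big for the camera or the framing needs to be tighter.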

In my tests with a Panasonic pocket cam (with only 9 megapixels, I think), it worked well enough (at least for my purposes) with a rather small book. I am glad, though, that I have my repaired Canon 550D back now. There really is a difference.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Simple "looking" one camera book scanner

Post by daniel_reetz »

bnz has a great answer, and I just want to comment that if you're "not trying to archive comix or Picassos... :)", then shoot for a good, readable image and you'll be alright. The best way to find out is just to try it and see if you need more. Borrow a cam from a friend if you don't have one yet...
gsloop

Re: Simple "looking" one camera book scanner

Post by gsloop »

Yes, I understand that the DPI level is dependent on size.

Given, say, 250 DPI, a 12 Mpix image would get us 192 in^2, which is about 11x17 inches.
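
The arithmetic behind that figure, as a small Python sketch. It treats the sensor as an ideal pixel budget and ignores both the frame's aspect ratio and any unused border, so it is a best case; the function name is just for illustration.

```python
def max_area_sq_in(megapixels, dpi):
    """Area in square inches covered at the given DPI if every pixel lands on the page."""
    return megapixels * 1_000_000 / dpi ** 2


area = max_area_sq_in(12, 250)   # pixel budget of a 12 Mpix image at 250 DPI
tabloid_spread = 11 * 17         # square inches in an 11x17 inch spread
print(area, tabloid_spread, area >= tabloid_spread)
```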

This was about where I was going - but you're right, there's going to be white-space - probably quite a lot.

So, if I want to scan a full tabloid spread, it's going to be touch and go...

---
I've got a Canon 350D, which is 8 Mpix, but the image quality is going to be good, obviously. So I should try a few things and see how it goes.

---
I just like the simplicity of a single shot for the whole spread.

Any follow-up on how well this works? Is it really worth building a scanner based on this?

---
I'm doing my best to future-proof this as much as possible. I don't really relish building a half dozen of these over the next five years. I'd like to do a basic design and perhaps upgrade a camera, or lights, but leave most everything else alone.

So, that's the basis of my queries.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Simple "looking" one camera book scanner

Post by spamsickle »

It's not just MP and DPI. You say you're shooting textbooks, and for me, those are the most challenging. Big pages, small fonts, and more often than not, subscripts, superscripts, italics, etc.

Personally, I would probably find 12 MP to be adequate for two pages, even with these requirements, but I haven't actually tried it. I've run into situations where I had difficulty reading tiny subscripts here and there shooting one page with 8 MP, but so far I've been able to puzzle out what Scan Tailor drops out when doing its binarization without needing to consult the original JPEGs.

If you have the 12 MP camera, pick a difficult page or two and shoot some tests.

As you can see from my test images, you do get some gutter curl in your images this way, and at this point I'm not aware of an easy automatic way to eliminate it. If that's going to bother you, you might consider building a more standard DIY design until a solution becomes available. I don't think you'll be able to eliminate it entirely for all books at the shooting stage with this setup.
reggilbert
Posts: 49
Joined: 28 Sep 2010, 19:57
Number of books owned: 3000
Location: Buffalo, New York

Re: Simple "looking" one camera book scanner

Post by reggilbert »

spamsickle wrote: I've run into situations where I had difficulty reading tiny subscripts here and there shooting one page with 8 MP, but so far I've been able to puzzle out what Scan Tailor drops out when doing its binarization without needing to consult the original JPEGs.
I contend that a good scanning process should make even the smallest type fully legible and OCR of said text nearly perfect.

The following assumes that binarization means production of b&w images, vs. leaving images as greyscale or color.

Thus, one question -- must Scan Tailor binarize to function properly?

-- and one long FYI, if it makes any sense: in my experience using Acrobat OCR (perhaps dedicated OCR would be different), binarization of originally greyscale or color images yields poorer OCR.

This opinion is based on just a handful of tests I ran on book material I am currently scanning with a Plustek Opticbook flatbed scanner (until I settle on and build a DIY machine based on the profusion of material available on this site). Since it was just a handful of tests, in the end this observation about OCR of b&w images may not be a scientific fact, but . . .

I used to scan all pure text pages in b&w, and only book pages with images in greyscale or color, since the former took up just 1.5 MB per image, the latter a great deal more at 8 MB. This system was necessary a few years ago when hard disk space was a factor, but it also made intuitive sense -- b&w images just look more OCR-able. The follow-on OCR was most conveniently done in Acrobat. It was not that great, but I accepted that as worth the convenience of having OCR embedded in the page images. It never occurred to me that the dirtier greyscale images could possibly be easier to OCR.

Anyway, a year or so ago I happened to both accidentally scan some of a book's pure text pages in greyscale and output the resulting Acrobat OCR to a Word file (usually I am content to leave the OCR content embedded in the Acrobat file). Some parts of the export had obviously fewer errors and on investigating the source images I discovered that the greyscale-scanned pages were substantially less prone to errors than the b&w-scanned pages. Two more contrasting scanning efforts confirmed this observation.

A look at the different kinds of images in extreme closeup (I think I blew them up to 400%), both in an imaging program and in Acrobat, shows a possible reason why. (The difference between those two views is almost undetectable, which is amazing given the reduction in aggregate file size achieved by the Acrobat combining utility.) Binarization results in little bits being cut out of or added to the letters: bits of the letter or of the background, respectively, that the binarization software cannot classify correctly. The human eye adds those bits in or removes them -- or never deals with them to begin with -- when looking at greyscale versions of the same letters embedded in their dirty greyscale "white" background.

Now, it does not really make sense that the binarization elements of the scanner software (or of Scan Tailor) would have difficulty figuring out what is and is not part of a letter in a given source when, working on the same or even slightly degraded material, simple Acrobat OCR software has less difficulty doing so. Both processes must distinguish letter from non-pure-white background. I would assume that the underlying algorithms for determining what is and is not part of a letter against a noisy background are long established and more or less common to all this software -- the great OCR programs like Omnipage haven't significantly increased their recognition rate in many years, and only Google seems to be able to maintain a software technology edge for the long term. But as far as I can tell, it is exactly the case that binarization software does not do as good a job as even run-of-the-mill OCR software (the OCR engine in Acrobat can't be the best) in deciding what is a letter.

I feel like my observations must be in error since the logic of the binarizing / OCRing software in the two scenarios says there shouldn't be much significant difference. But for now I am going with the observations of divergent OCR accuracy and I both scan all books in greyscale (for DIY scanning I assume color or greyscale outputs are default) and leave the final images in the finished ebook product in that state both for image readability and superior embedded OCR. When I move to a DIY scanner I will want to do the same.

And just to reiterate the starting question -- must Scan Tailor binarize to function properly? It does not seem to have any setting to allow it to function otherwise.
bnz

Re: Simple "looking" one camera book scanner

Post by bnz »

Hi reggilbert,

Interesting post! I will definitely binarize my scans, as the final size does matter for me - I want to put the books onto my iPad. I'm not sure about the OCR engines. I am pretty sure that the recognition engines binarize internally at some point anyway, even if you give them greyscale images. I have read a couple of times that OCR works on b/w. Possibly, their own binarization is tailored more towards recognition than towards looking nice in the end (as is the case with Scan Tailor). Therefore, I can imagine that there could be some truth to your observations.
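
A toy Python illustration of how a fixed threshold can drop thin strokes. This is NOT what Scan Tailor, Acrobat or any OCR engine actually does internally; it is a plain global threshold applied to made-up pixel values, just to show the effect being discussed.

```python
def binarize(row, threshold=128):
    """Global threshold: values below threshold become ink (0), the rest paper (255)."""
    return [0 if value < threshold else 255 for value in row]


# One made-up scan line (0 = black, 255 = white): a bold stroke at
# indices 2-3 and a faint hairline (gray, not black) at index 5.
row = [230, 225, 40, 35, 228, 150, 232, 226]

bw = binarize(row)
print(bw)  # the bold stroke survives, the hairline at index 5 turns white

# A threshold tuned for recognition rather than looks could keep the hairline.
print(binarize(row, threshold=160))
```

In the greyscale row the hairline is still visibly darker than the paper, which is the information a fixed cutoff throws away.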

Regarding your question with Scan Tailor: in step 6, you can select to have greyscale images. There is no necessity to binarize.
reggilbert

Re: Simple "looking" one camera book scanner

Post by reggilbert »

bnz wrote: I will definitely binarize my scans as the final size does matter for me - I want to put the books onto my iPad.
The thing that always amazes me is how little difference the size of the source images seems to make to Acrobat. The most recent book that I scanned yielded 134 greyscale images at 8.7 MB each and 2 color images (the covers) at 13.8 MB each. These are all perfectly cropped in the scan, so this is all nominally important data. The 136 images take up 1.2 GB in total, yet they were crushed down to roughly 1 percent of that, 14.3 MB, in the Acrobat file of combined images. And every page is still very good resolution at high magnification.
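
For anyone who wants to check the numbers, a quick Python calculation using only the figures quoted in this post:

```python
grey_mb = 134 * 8.7      # greyscale page images
colour_mb = 2 * 13.8     # the two color cover images
source_mb = grey_mb + colour_mb
pdf_mb = 14.3            # combined Acrobat file

ratio = pdf_mb / source_mb
print(f"source: {source_mb:.1f} MB, PDF: {pdf_mb} MB, ratio: {ratio:.1%}")
```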

Thus you may not need to binarize to have a small final file size for your iPad, bnz. However, if you keep the source images on some computer hard drive (which I would recommend - you never know what future software will be able to do with them should you want it to), that will add up eventually, though for the last couple of years hard drive prices seem to be falling faster than processing power.
bnz

Re: Simple "looking" one camera book scanner

Post by bnz »

Today, I again made a setup like Spamsickle's, but now with my Canon 550D/T2i back (it is more or less exactly the same, so I won't post any pictures). I used a cheap Canon 50mm lens, which I manually focussed on the book, and shot everything tethered using the Canon EOS Utility. Luckily I had a PS3 keypad around that I don't otherwise use: I could just pair it with my PC and use its space bar as the shutter trigger. This setup is actually pretty nice, as I always get a quick preview on the screen of the shots that I take. With that, I scanned a complete 360-page book today. It was pretty quick to use (quick enough for me anyway), but a few things were really annoying:

- I didn't take care of the lighting as well as Spamsickle did, but just used the lighting I already had in my room. As a result, I got a lot of nasty shadows when processing these pictures in Scan Tailor. I could get rid of this problem to some degree by preprocessing the pictures in Lightroom, using a combination of increased contrast and brightness, sharpening and denoising. The shadows are gone now, but sometimes the characters are a little thin. That's not so important for my first complete scan, but I definitely want to do better on my next one.

- The glass platen from my fridge could be a little heavier. Especially the pages in the middle of the book are really hard to make flat.

- The glass platen reflects like hell. I have to get something different here.

- With this kind of loose glass platen, I noticed that I need to move it around so that it flattens the book properly and distributes the weight evenly, because each open page seems to have its own unique way of pushing against the platen. Of course, the difference is more noticeable when you make bigger steps, e.g., between the beginning and the end of a book, where the force directions are obviously switched. Does this make sense? I am wondering how this would be handled with a helper construction like the one at the beginning of this thread. I would imagine that the platen either has to be very heavy or somehow needs to be forced closed with magnets or something like that.

Preprocessing the pictures with Lightroom seems to be very useful in the case of non-perfect scans. Also, I feared that Scan Tailor might have problems finding the separation line between the book pages and that this might be a lot of additional work. It turns out that Scan Tailor is actually pretty good at that. I didn't have to readjust the page split for a single page in my first tests. So this really is a non-issue. If there is interest, I can maybe post one of those ugly original pictures and a processed picture tomorrow.
bnz

Re: Simple "looking" one camera book scanner

Post by bnz »

I think I have to revise my opinion on this type of book scanner. It works reasonably well for a small book with a big inner margin, but now I have a larger book (7x9 inches) with an extremely small inner margin, and this one is really a mess to scan with just a single glass platen and a single camera for both pages.

1) The warping gets really extreme if the book has a lot of pages.
2) The unevenness of the pages means that some parts of the pages are sharper than others because of the focusing.
3) I get imperfect results even if I adjust the tripod to capture little extra space. I am getting something like 317x341 dpi then with my Canon 550D/T2i, and I get lots of "incomplete" characters, i.e., parts of the characters are missing after processing with Scan Tailor.

I am lighting with two 40 W lamps, one desk lamp + some indirect room lighting. I am wondering to what degree I can fix these incomplete characters with even better lighting. However, I have tried scanning single pages, and these look (obviously) considerably better and don't have these problems at something above 400 dpi. So my guess is that 400 dpi per page is indeed the resolution to aim at for good results. I'll probably build a v-shaped scanner now, because I fear that I will be unhappy with the results otherwise...
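
As a rough sanity check on that 400 dpi target, here is a Python sketch of how many megapixels a single shot would need. The `fill` value (fraction of each frame dimension the book occupies) is an assumption, and the sketch ignores the sensor's aspect ratio, so treat the result as a ballpark only.

```python
def required_megapixels(width_in, height_in, target_dpi, fill=0.85):
    """Megapixels needed for a subject filling `fill` of each frame dimension at target_dpi."""
    px_w = width_in * target_dpi / fill
    px_h = height_in * target_dpi / fill
    return px_w * px_h / 1_000_000


spread = required_megapixels(14, 9, 400)   # two 7x9 inch pages in one shot
single = required_megapixels(7, 9, 400)    # one 7x9 inch page per shot
print(f"{spread:.1f} MP for the spread, {single:.1f} MP for a single page")
```

Under these assumptions the whole spread asks for noticeably more than the 550D's 18 MP, while a single page fits comfortably, which matches the experience above.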