
Scanning 5x7 Index Cards - Help?!

PossiblyCanadian
Posts: 3
Joined: 22 Aug 2013, 15:47
E-book readers owned: B&N Nook Simple Touch
Number of books owned: 300
Country: United States

Post by PossiblyCanadian » 03 Sep 2013, 14:48

Hi all,
I've lurked here off and on for quite some time. I love watching this community grow, get better, and share skills - which is why I've now decided to come for help!

I work at the Great Falls Public Library, and part of our job is to do research requests, often out of the local paper. Now, if someone asks us to look for a particular event, sans date, we used to be able to call the paper and have their archivist look it up in their index.

There's no longer an archivist.

What we've been working on is digitizing that index so that we can keep it here in the library for those situations. The index, however, is on 5"x7" type-written index cards - about 150-200k of them. Until this week, we'd started running them through the auto-feeder on our existing copier/scanner with decent results, but it's simply not built to feed and scan that many cards. It's failing constantly now, so we're exploring other options.

What we want, ideally, are PDFs of each card. Even more ideally, we'd have them OCR'd and searchable, but that seems like a wholly separate issue, based on the quality/consistency of the type-writing (inconsistent formatting, words struck-out, faded print, etc.). I know that commercial scanners exist that could theoretically handle this project, but the limitation there is cost. It's a bit of a pet project, so it'd be tough to justify a ~$10k investment, especially if we already wrecked a copier.

I've been in contact with the Internet Archive, but the $.10 per scanned page they charge would, again, be unfeasible cost-wise. I've been looking at this - http://www.diybookscanner.org/forum/vie ... 6&start=10 - and it seems fairly analogous to what we want to do, but I'm not certain it's the single best option available.

Thus, I've come to you. My questions to you are:
1.) Are there projects you've seen on these forums that have attempted this sort of thing?
2.) If I were to adapt parts of the book scanner, what sorts of considerations should be made? Steps we can skip?
3.) Do you have any thoughts/ideas on how we can accomplish this project?

My creativity in figuring this out is running low, which, combined with pressure to either figure it out or drop it, has put me in a box. So, any help, thoughts, links, etc. that you could send my way would be a huge help I'm sure.

victoriaaustralia
Posts: 55
Joined: 07 Nov 2011, 16:22
E-book readers owned: newton
Number of books owned: 2
Country: Australia
Location: Castlemaine, Victoria, Australia

Re: Scanning 5x7 Index Cards - Help?!

Post by victoriaaustralia » 03 Sep 2013, 20:50

If there is any budget for it, I would still suggest a sheet-feeding scanner as the fastest option. The Fujitsu ScanSnap works well at our work and will handle odd-shaped documents.

Although I love my bookscanner it is best for non-destructive scanning of books. If I was happy to band saw books then a feeding scanner would be my first choice, not a bookscanner. The sheet feeding scanner is ready to go, OCR is de-bugged and anyone in the office can use it.

This is their top of the line, $USD2200 on Amazon, about 1/5th of the $10,000 you mentioned.
http://www.fujitsu.com/us/services/comp ... 6010n.html

Or this is the cheapie, which is still listed as scanning items as small as business cards and takes 50 cards at a time:
http://www.amazon.com/Fujitsu-ScanSnap- ... B00ATZ9QMO for $USD419. Even if you broke a couple, you would still be ahead for this one job compared to two cameras and ancillaries for a bookscanner. A built Hackerspace scanner recently sold on here for $USD850.

However, at 50 cards per loading and 25 ppm scanning, this still equates to 4,000 loadings of 50 cards and 133 hours of scanning time, going with your higher estimate of 200,000 cards. Using the bookscanner would be like me scanning 333 600-page books. It takes me about 45 minutes to photograph a 600-page book (I have an old, simple scanner based on the Book Liberator) and at least an hour of post-processing, done between other jobs. So a very rough estimate would be 666 hours for me to scan and process 333 600-page books.
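The arithmetic behind those two estimates can be sketched in a few lines of Python. This is only a back-of-the-envelope check: the function names are made up for illustration, and the 2 hours per book is an assumption that rounds the 45 minutes of photographing plus "at least an hour" of post-processing up to a round figure.

```python
def feeder_estimate(cards, cards_per_load, pages_per_minute):
    """Return (number of loadings, scanning hours) for a sheet-fed scanner."""
    loadings = -(-cards // cards_per_load)   # ceiling division
    hours = cards / pages_per_minute / 60    # minutes -> hours
    return loadings, hours


def book_scanner_estimate(cards, pages_per_book, hours_per_book):
    """Return (equivalent number of books, total hours) for a camera rig."""
    books = cards / pages_per_book
    return books, books * hours_per_book


# Figures from the post: 200,000 cards, 50 per load, 25 ppm.
loadings, feed_hours = feeder_estimate(200_000, 50, 25)

# Assumption: about 2 hours per 600-page book (photographing + processing).
books, book_hours = book_scanner_estimate(200_000, 600, 2.0)

print(f"Feeder: {loadings} loadings, about {feed_hours:.0f} hours")
print(f"Book scanner: about {books:.0f} books, about {book_hours:.0f} hours")
```

Without rounding the book count down to 333, the book scanner total comes out nearer 667 hours than 666, which makes no difference at this level of precision.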

vitorio
Posts: 138
Joined: 30 Oct 2010, 23:56
Number of books owned: 0
Location: Austin, Texas, USA

Re: Scanning 5x7 Index Cards - Help?!

Post by vitorio » 04 Sep 2013, 02:24

Yes, given the quantity, it seems like even burning out a few document scanners like the Doxie, or the Canon P215 (if there's content on both sides) is the smartest way to go.

That said, if you have more time than money, I don't think you need a "book scanner" at all, you just need a copy stand and any sort of digital camera or modern phone. Place a card down, take a picture, set it aside, repeat. Five seconds a card? Ten, tops? You'll probably break the camera's shutter before you finish all 150k.
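For a sense of scale, the per-card timing above can be turned into total hours. This is a rough sketch only: the function name is hypothetical, and it assumes a steady pace with no breaks, refocusing, or retakes.

```python
def camera_hours(cards, seconds_per_card):
    """Total hands-on hours to photograph every card at a constant pace."""
    return cards * seconds_per_card / 3600


# 150,000 cards at the optimistic (5 s) and pessimistic (10 s) pace.
fast = camera_hours(150_000, 5)
slow = camera_hours(150_000, 10)

print(f"About {fast:.0f} to {slow:.0f} hours of photographing")
```

That works out to somewhere around 208 to 417 hours for one person at the lower 150k card count.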

If you put the images of the cards in folders that match the archivist's organization scheme, that'll duplicate your current setup as-is. It won't be any worse than the paper version.

To get better than the paper version, that's a little harder.

You can run each card through OCR software to get you started on searching and recognizing things faster than that, but as you said, the quality is going to be hit-or-miss. You need a way to correct the OCR. Many libraries have had success with crowdsourcing OCR and post-correction, given a motivated audience and an IT budget: University of Iowa, University College London, George Mason University, National Library of Australia, and the sites that make up 18thConnect. There's now even commercial software, designed after the UI of the Australian libraries project: Veridian.

Some of these are built on Mediawiki, which is what Wikipedia is built on, and it is free, and has a "proofreading extension" available, but I'll be honest, I tried to get a Mediawiki install going and the extension running, and I couldn't figure it out. It is not plug-and-play.

This is unfortunate, because a straight Mediawiki setup is probably the easiest of these systems to get going. All the ones I've linked to are pretty well entrenched in their parent digital collection management software. There's no standalone system you can just load a bunch of pictures and OCR results into, put online for a while for patrons and staff to visit and correct, and take down when you're done. :|

daniel_reetz
Posts: 2797
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Scanning 5x7 Index Cards - Help?!

Post by daniel_reetz » 04 Sep 2013, 09:16

This is all great advice. If you persist in going the camera-based route, you may find some things in this post useful:

http://www.diybookscanner.org/forum/vie ... ?f=1&t=228

I think a Fujitsu or one of the many feed-through scanners is your best bet. That, and lots of free labor. If that absolutely won't work, I think we could come up with something camera-based, but it's going to involve really awful boring work.

Abarbour
Posts: 13
Joined: 30 Aug 2012, 17:03
Number of books owned: 2000
Country: Canada

Re: Scanning 5x7 Index Cards - Help?!

Post by Abarbour » 04 Sep 2013, 14:00

I would echo the suggestions above re a scanner - not unlike the Fujitsu ScanSnap. I have a half dozen of the ix500s and they are great! I did a quick test with ten 5x7 index cards. The scanner/SW completed the entire operation in 1 minute give or take. You could probably have one operator keep three of these machines going. You should be able to get it done in 3 weeks:

200,000 cards
/ 10 cards per minute
/ 60 minutes per hour
/ 3 machines
/ 37.5 hours per week
= 3 (ish) weeks
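The same calculation as a small Python sketch, in case you want to plug in different numbers. The function name is made up for illustration, and it assumes every machine runs at the tested rate for the whole working week, with no time lost to jams or reloading.

```python
def weeks_to_finish(cards, cards_per_minute, machines, hours_per_week):
    """Weeks needed when several scanners run in parallel at a fixed rate."""
    machine_hours = cards / cards_per_minute / 60   # hours on one machine
    return machine_hours / machines / hours_per_week


# Figures from the estimate: 10 cards/min, 3 machines, 37.5-hour week.
high = weeks_to_finish(200_000, 10, 3, 37.5)
low = weeks_to_finish(150_000, 10, 3, 37.5)

print(f"200k cards: about {high:.2f} weeks")
print(f"150k cards: about {low:.2f} weeks")
```

With the lower count of 150,000 cards, the same setup comes in at a little over two weeks.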

Good luck!
Andrew

PossiblyCanadian
Posts: 3
Joined: 22 Aug 2013, 15:47
E-book readers owned: B&N Nook Simple Touch
Number of books owned: 300
Country: United States

Re: Scanning 5x7 Index Cards - Help?!

Post by PossiblyCanadian » 04 Sep 2013, 18:15

Wow! I didn't check back on this for a few days, and I'm absolutely dazzled at all of this great advice! Really, thank you.

I think that if we keep this project in-house a "real" scanner will be something we pursue, if for no other reason than it's boring, horrible work. (Thanks library volunteers!)

We're exploring the option of shipping them out if some grant funding comes through. We've been working to assure influential folks around the state that this is a worthwhile project, so we might have money to just transfer the headache to the professionals. Any suggestions for off-site archivists?

The OCR is the more complicated issue, I think. I'll spend some time looking through those links - Thanks vitorio.

Back to work on this. Thanks all - I'll keep you up to date on what I figure out.

daniel_reetz
Posts: 2797
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Scanning 5x7 Index Cards - Help?!

Post by daniel_reetz » 04 Sep 2013, 22:20

I think OCR is easy, actually. They're all typed with the same typeface, right? Just run ABBYY FineReader over the entire lot.

PossiblyCanadian
Posts: 3
Joined: 22 Aug 2013, 15:47
E-book readers owned: B&N Nook Simple Touch
Number of books owned: 300
Country: United States

Re: Scanning 5x7 Index Cards - Help?!

Post by PossiblyCanadian » 05 Sep 2013, 14:15

Actually, the type is the ugliest part. These things date back to the '60s, and over 50 years they have been through a lot of smudging, striking, fading, X-ing out, and different typewriters.

As a test, I uploaded a handful to the Internet Archive, which runs FineReader over everything. The results were, at best, readable, but impossible to make searchable without a ton of hands-on postprocessing. At worst, they were completely garbled. IA, in trying to recognize the text language, guessed Turkish, if that's any indication.

Meeting with the Library Director today - I'll update what the feelings are about this whole thing soon.

jbaiter
Posts: 98
Joined: 17 Jun 2013, 16:42
E-book readers owned: 2
Number of books owned: 0
Country: Germany
Location: Munich, Germany

Re: Scanning 5x7 Index Cards - Help?!

Post by jbaiter » 05 Sep 2013, 15:42

I think the Internet Archive uses a rather old version of the FineReader engine (7, I think; the most recent version is 11). Maybe try to get a demo of the most recent desktop version and give that a try. I've had great results with it, even with rather "dirty" input. Do you have some examples that you could upload?

cday
Posts: 251
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Scanning 5x7 Index Cards - Help?!

Post by cday » 06 Sep 2013, 12:08

It is usually possible to see which software was used to create a PDF by checking the document properties (File > Properties... or Ctrl+D in Adobe Reader 11), so it should be possible to see immediately which OCR program and version the Internet Archive used to create the test PDFs.

It would still be interesting to post one or more representative 5x7 scan images to illustrate the problem and to see, for example, if the scans can be enhanced to improve the OCR results.
