
scanning to search

cfmorrill
Posts: 56
Joined: 17 Apr 2011, 21:20
Number of books owned: 0
Location: Charlottesville, Virginia

Post by cfmorrill » 02 Dec 2011, 20:13

I'm just starting out and am trying to understand how my scanning process might work. I'm going to scan books of letters by 18th- and early 19th-century Americans so I can search them for items of interest. Sometimes a letter will occupy one page, half a scanned page, or several pages. Ultimately, I want to use a search engine of some type to show me all letters that mention "orrery", for example, or a name like "David Rittenhouse."

Am I better off leaving the book as a single .pdf, or is there an easy way to break the different letters into separate .pdfs during the post-scanning process? What sort of search engine would I then use to ask the computer to show me all letters that mention "Rittenhouse + orrery"?

Also, how long does a computer generally take to search an entire book for a single name like "Rittenhouse"?

I suspect I may be backing into database-type questions, but I'm not sure. If any of you have a sense of this sort of thing, it would help a good bit.

Many thanks, Charles Morrill

strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: scanning to search

Post by strider1551 » 02 Dec 2011, 21:36

cfmorrill wrote: Am I better off leaving the book as a single .pdf, or is there an easy way to break the different letters into separate .pdfs during the post-scanning process?
Given your application, I would keep them as separate PDF files (mind you, I'm thinking one PDF per letter, not per page). If they are all in one, then most search programs will likely only tell you the page that the word is on, not something more useful like the author, date, etc. that could be in the filename of each one.
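
A minimal sketch of how metadata kept in each filename could be read back by a script. The naming scheme (date, sender, recipient separated by underscores) and the example filename are assumptions made up for illustration; nothing in this thread prescribes them.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical naming scheme (an assumption, not taken from this thread):
#   YYYY-MM-DD_Sender_to_Recipient.pdf
FILENAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<sender>.+?)_to_(?P<recipient>.+)\.pdf$"
)


class LetterInfo(NamedTuple):
    date: str
    sender: str
    recipient: str


def parse_letter_filename(name: str) -> Optional[LetterInfo]:
    """Return the metadata encoded in a filename, or None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return LetterInfo(match["date"], match["sender"], match["recipient"])


# Invented example filename, for illustration only.
info = parse_letter_filename("1790-01-15_Smith_to_Jones.pdf")
print(info)
```

With one PDF per letter named this way, any search tool that reports the matching file also reports, in effect, the date and the correspondents.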

As for everything else you are asking, what operating system are you using? Also, what level of experience do you have? This would do well with a database component, if it's worth your time/ability/usage to create one... otherwise there are desktop search applications that can fit your purposes.

cfmorrill
Posts: 56
Joined: 17 Apr 2011, 21:20
Number of books owned: 0
Location: Charlottesville, Virginia

Re: scanning to search

Post by cfmorrill » 02 Dec 2011, 22:33

Thanks for answering, Strider. I am using an iMac with a recent version of OS X; I forget which cat it's named after. I also have a recent version of Linux running on an older PC, and I downloaded open base. I am trying to understand exactly what a form and a table are, and how I might get a book's worth of letters into a database. Maybe the answer is to ante up and just buy FileMaker for the Mac. I'm not sure.

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: scanning to search

Post by spamsickle » 04 Dec 2011, 03:04

I'm not at all familiar with Mac software, but I think it would help if you clarify how you're planning to use your database. If you're hoping to find instances of a single word or phrase to use for research, keeping an entire book in a PDF will be most convenient: the PDF reader will do the search for you, and allow you to skip from one instance to the next. It won't handle the more complex search you listed though - "Rittenhouse" + "orrery" in the same letter. For that, you'll probably want to keep each letter distinct, and put it in your database with metadata - the date the letter was written, from whom, to whom, as well as the text of the letter.
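
As a rough sketch of the database approach, the following uses SQLite through Python's standard `sqlite3` module, which is available on both OS X and Linux. The table layout, the column names, the helper `letters_mentioning`, and the three sample letters are all invented for illustration; a real collection would load the OCR text of each letter instead.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a real project
# would pass a file path instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE letters (
           id INTEGER PRIMARY KEY,
           written_on TEXT,   -- date the letter was written
           sender TEXT,       -- from whom
           recipient TEXT,    -- to whom
           body TEXT          -- full text of the letter
       )"""
)

# Invented sample rows, not real historical letters.
sample = [
    ("1790-01-15", "Smith", "Jones", "Mr. Rittenhouse showed me his orrery today."),
    ("1790-02-03", "Jones", "Smith", "I have not yet heard from Rittenhouse."),
    ("1790-03-20", "Smith", "Brown", "The orrery at the college needs repair."),
]
conn.executemany(
    "INSERT INTO letters (written_on, sender, recipient, body) VALUES (?, ?, ?, ?)",
    sample,
)


def letters_mentioning(conn, *terms):
    """Return (date, sender, recipient) for letters whose body contains every term."""
    if not terms:
        raise ValueError("at least one search term is required")
    # One LIKE clause per term, joined with AND, so all terms must appear
    # in the same letter. SQLite's LIKE ignores case for ASCII letters.
    where = " AND ".join("body LIKE ?" for _ in terms)
    params = [f"%{term}%" for term in terms]
    query = (
        "SELECT written_on, sender, recipient FROM letters "
        f"WHERE {where} ORDER BY written_on"
    )
    return conn.execute(query, params).fetchall()


matches = letters_mentioning(conn, "Rittenhouse", "orrery")
print(matches)
```

Because each letter is its own row, the combined search returns only the letter that contains both words, along with its metadata, which a search across a single whole-book PDF cannot do.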

Given your location, I'm also interested in what materials you'll be scanning. I own, but have not scanned, most of the volumes produced so far of "The Papers of Thomas Jefferson." Those particular books are copyrighted, but many of the same letters (sometimes a bit censored) can be found in "The Writings of Thomas Jefferson" and other 19th and early 20th-century compilations. Is your interest confined to letters, or are you planning to include other materials? What uses are you envisioning for the data once it's scanned?

cfmorrill
Posts: 56
Joined: 17 Apr 2011, 21:20
Number of books owned: 0
Location: Charlottesville, Virginia

Re: scanning to search

Post by cfmorrill » 04 Dec 2011, 10:54

I'm trying to assemble a compendium of letters from many different places so I can search them. I'm particularly interested in the development of early U.S. scientific instruments and gadgets.
