
100-page handwritten guestbook workflow?

Ben Armstrong
Posts: 1
Joined: 17 Jun 2018, 07:20
E-book readers owned: Moon+Reader Pro (Android); Kindle 7
Number of books owned: 100
Country: Canada
Contact:

100-page handwritten guestbook workflow?

Post by Ben Armstrong » 17 Jun 2018, 12:30

My first book scanning project this year is a very post-processing-heavy job (and therefore no special hardware needed): take a wilderness trail's 100-page guestbook, scan it, catalog the data, and pull it all together into an e-book.

Help!

Do you have a similar post-processing-heavy small-batch workflow? In the plan outlined below, have I underestimated how much work will be required, and/or am I in danger of burdening my organization with a project no one will want to continue after me? What cross-platform, open-source software would best suit the project described below?

The guestbook:

The guestbook is a waterproof, top-bound booklet housed with writing implements in an enclosure at a loop junction & popular resting place near the start of the trail system. We provide trail users with suggestions for things to put in their entries that we might find helpful, but the rest is up to them. So, in among the more mundane and workaday tallies of number of hikers encountered, off-leash dogs, trash, and so forth, there are some really creative, intriguing, or just bizarre entries. More than just a conduit of information back to us from our users, it is a creative outlet our users seem to enjoy that is qualitatively different from the way they relate to us online. Because of this, I think it has the potential to be turned into something really great to share with the world, which I intend to explore through this project.

Figure 1:
IMG_20180617_101407_v1.jpg
The logbook front cover. Note that it is written on. It is desirable to preserve this in the dataset and e-book, too.

The e-book:

Earlier this year, we had even considered dropping the physical guestbook entirely, as it requires manpower to maintain & process it, and it was argued that people can submit comments more efficiently online. But I strongly defended keeping it, arguing successfully that without it we would almost certainly never hear from certain people who barely go near a computer, and that the actual work required to keep it up wouldn't be too burdensome. Besides which, these handwritten entries, poems, sketches, and the like put an appealingly human face on our efforts. Making it all into an e-book is something nice I'd like to do in order to connect with this diverse, eager, & observant community centered around our beloved trail. If it is well received, I'd like to make the process painless enough that I can hand it off to others.

Figure 2:
IMG_20180617_101514_v2.jpg
A representative page from the book, showing typical orientation and tidy, chronologically sequenced entries

Extracting the data:

I have just finished the first of the following intermediate work products, and am seeking advice from this community about tackling the rest using all open source software (on Linux, but other platforms are desirable to support):
  • The raw scans of page pairs: 51 scans including cover. Note that page orientation flips vertically on alternating pages. Finished! See fig. 3.
  • The scans sliced into pages with orientation corrected. See fig. 2 & 3.
  • The pages sliced into individual cropped fragments belonging to an entry, anticipating such problems as: non-rectangular entries, overlapping, continued on next page. See fig. 3 for a particularly challenging page.
  • A data set with all of the raw data, tabulated into one row per entry, with:
    • One or more cropped image fragments per entry.
    • Full text of entry. I anticipate no OCR will be possible, so this will all be hand entered from the scans.
    • Metadata, e.g. name(s) & # of people in party, date/time, hometown, loops/distance hiked, counts of people, dogs, trash, and whatever tags we feel are helpful, such as: 'poem', 'sketch', 'thanks', 'report', etc.
Figure 3:
page0003.png
A particularly challenging page, illustrating some of my anticipated problems.
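
For what it's worth, the "one row per entry" data set described above could be sketched roughly like this. This is only a minimal sketch using the Python standard library; the field names, the ';' separator for multi-valued fields, and the sample file names are all assumptions made up for illustration, not anything taken from the actual guestbook.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    """One guestbook entry. All field names here are illustrative assumptions."""
    entry_id: int
    fragments: List[str]              # one or more cropped image files
    text: str                         # hand-typed transcription
    date: str = ""                    # ISO date if legible, else empty
    party_size: Optional[int] = None  # number of people in the party, if given
    tags: List[str] = field(default_factory=list)


def write_entries(entries, out):
    """Write entries as CSV, one row per entry.

    Multi-valued fields (fragments, tags) are joined with ';' so that each
    entry stays on a single row that LibreOffice Calc can open directly.
    """
    writer = csv.writer(out)
    writer.writerow(["entry_id", "fragments", "text", "date", "party_size", "tags"])
    for e in entries:
        writer.writerow([
            e.entry_id,
            ";".join(e.fragments),
            e.text,
            e.date,
            "" if e.party_size is None else e.party_size,
            ";".join(e.tags),
        ])


if __name__ == "__main__":
    # Made-up sample entry spanning two pages (hypothetical file names).
    sample = [
        Entry(1, ["page0003_a.png", "page0004_a.png"], "Lovely morning on the loop.",
              date="2018-05-12", party_size=2, tags=["thanks"]),
    ]
    buf = io.StringIO()
    write_entries(sample, buf)
    print(buf.getvalue())
```

Keeping the fragments as a list means an entry that is continued on the next page still maps to a single row.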

Putting it all together:

Ultimately, I aim to assemble all of this into these final products:
  • A spreadsheet document containing the data set that can be used by our organization to help produce reports, etc.
  • An e-book.
The spreadsheet is pretty straightforward. The e-book will require some more thought as to layout & features. Here are some early ideas about what I'd like to see:
  • Preserve all, or substantially all of the image content from the guestbook.
  • Follow some reasonable ordering, which is probably not the original order of the entries, since they tend to get jumbled up as the book fills and people look for partial pages to write new entries on. Chronological ordering seems like a natural choice.
  • Suitable for viewing on the web & on mobile devices.
  • Searchable.
  • Accessible (i.e. screen-reader-friendly).
  • Preserve look-and-feel of the original guestbook.
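
To make a few of these ideas concrete (chronological ordering, web viewing, searchable text next to the image), here is a rough sketch in plain Python. The dict keys 'date', 'image' and 'text' and the HTML layout are assumptions chosen for illustration; a real e-book would need more than this.

```python
from html import escape


def render_entries(entries):
    """Return an HTML fragment with the entries in chronological order.

    Each entry is a dict with 'date' (ISO string, may be empty), 'image'
    (path to the cropped scan) and 'text' (the hand-typed transcription).
    Undated entries sort last and keep their original relative order.
    """
    ordered = sorted(entries, key=lambda e: (e["date"] == "", e["date"]))
    blocks = []
    for e in ordered:
        src = escape(e["image"], quote=True)
        label = escape(e["date"] or "undated", quote=True)
        text = escape(e["text"])
        blocks.append(
            "<figure>\n"
            f'  <img src="{src}" alt="Scan of handwritten entry, {label}">\n'
            f"  <figcaption>{text}</figcaption>\n"
            "</figure>"
        )
    return "\n".join(blocks)
```

The scan keeps the look of the original, while the transcription in the caption is what makes the page searchable and readable by a screen reader.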

My hardware:

Figure 3 is a flatbed-scanner image of that page, produced with xsane. Once I got the hang of making small adjustments to the position of the slightly-too-big-for-the-scanning-area book without losing anything valuable off the top/bottom margins, it took about 8 minutes per 10 pages. Given the infrequency of this job & small number of pages to scan, as well as having zero dollars budget for the project, I'm not really interested in optimizing this at this time, so I'll stick with the hardware I have and want to focus mainly on the post-processing workflow.

My software:

So far, software I'm using, or have used in the past for this kind of work:
  • Debian 9 "stretch".
  • xsane - the most efficient way to do a batch of numbered page images with exactly the level of control I require
  • digikam - my photo management software (using this more for trail photo documentation than this project, but I also have an album of the scans in digikam)
  • gimp - in the past, I've used this for photo & scan editing, but these days I'm doing more with digikam's image editor, which is a bit simpler for routine photo editing jobs
  • gocr / tesseract - not very useful for this job, as noted above (though I've used it before to scan a book for a blind friend)
  • calibre - I've used this before for certain kinds of scripted epub processing; no idea if it will be any help on this project
  • evince - my preferred PDF viewer
  • libreoffice - Calc for the spreadsheet
Notably absent from this list is any software that would help assemble the images into a PDF e-book, along with the full text of the entries, an index by tag, etc. (Other formats are possible, but considering how image-rich the book is, I have reservations about anything but PDF.) I'm also keeping in mind that not everyone runs Debian on their desktop, and also that I may have personal biases that won't be shared by people who follow in my footsteps, so it's good to have alternatives catering to those differing needs & desires.
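
The "index by tag" part, at least, needs no special software. As a rough sketch (the input shape, a list of (entry id, tags) pairs, is an assumption for illustration), something like this would do:

```python
from collections import defaultdict


def build_tag_index(entries):
    """Map each tag to the sorted list of entry ids that carry it.

    `entries` is an iterable of (entry_id, tags) pairs; a tag repeated
    within one entry is only counted once.
    """
    index = defaultdict(list)
    for entry_id, tags in entries:
        for tag in set(tags):
            index[tag].append(entry_id)
    # Sort tags alphabetically so the printed index is stable between runs.
    return {tag: sorted(ids) for tag, ids in sorted(index.items())}
```

The resulting dict could then be written out as the last page of the e-book or as a second sheet in the spreadsheet.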

Outcomes:

I'm after more than just a single work product as the outcome, here. I want a repeatable process. I want something that will not just help me to produce this tiny book, but to guide "future me" and others in my organization in making next year's edition, and for others undertaking similar work to revise & improve upon. I did a cursory scan of the web to try to find something like that I could follow, and after an hour or so of effort, the most promising lead was this website.

I spent some time browsing the forums, especially the HOWTOs, and after satisfying myself that nothing quite matched, I started drafting this post. In spite of not having found any similar stories in the forums (yet), it looks like many of you may have already covered at least some of the ground I'll need to cover soon. I'd like to pull together from your collective experiences whatever I can that would aid in my success. Any pointers to articles / threads here I may have overlooked, guides here or elsewhere that may have already been written, or comments on any of the ideas outlined above would be greatly appreciated!

BillGill
Posts: 83
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: 100-page handwritten guestbook workflow?

Post by BillGill » 19 Jun 2018, 09:25

I don't have any real help for you. I just wanted to say that it sounds like a great project. As far as I can see you have the process well planned and you should be able to get a good output.

The biggest problem that I can see will be keeping the momentum going after your first book. Things will always come up that need to be done and you (or somebody) will have to put it off 'until I get caught up.'

Bill

dpc
Posts: 272
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: 100-page handwritten guestbook workflow?

Post by dpc » 19 Jun 2018, 13:58

If I was doing this I'd just save it as a PDF. Almost every device has a PDF viewer and it will allow the finished product to accurately represent the original.

Have you looked through the forums at Mobileread.com? This site is more about the hardware side of book scanning and producing a collection of quality images that can be used for creating a variety of digital documents, while the folks at Mobileread seem to focus on a variety of tools and processes that turn those images into ebooks (and others).

L.Willms
Posts: 127
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: 100-page handwritten guestbook workflow?

Post by L.Willms » 25 Jun 2018, 03:33

Scan Tailor is missing from your list of software. See the section on Scan Tailor in the "Software and processing" group of this forum.

This is very useful for de-skewing the images and finding the actual borders of the paper.

As to the form of publication, I allow myself to list some thoughts of mine:

1. Avoid Scan Tailor's offer to reduce the images to binary black and white, and keep the color. In gray scale images the shadows look awful, as one can see in the examples shown above. Color is quite probably a feature of various entries.

2. Keep the order of the pages, but add a secondary chain of links in chronological order, which might jump back and forth through the physical pages. This can be done in PDF files, at least when using Adobe's Acrobat Pro.
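
Independent of the PDF tool, the chronological chain itself is easy to compute. A minimal sketch, assuming each entry is recorded as an (entry id, page number, ISO date) tuple in physical order (that tuple layout is my assumption for illustration):

```python
def chronological_links(entries):
    """Return {entry_id: (prev_id, next_id)} following date order.

    `entries` is a list of (entry_id, page, iso_date) tuples in physical
    order. Entries with the same date keep their physical order, because
    sorted() is stable. The first entry has no predecessor and the last
    has no successor (both None).
    """
    ordered = sorted(entries, key=lambda e: e[2])
    ids = [e[0] for e in ordered]
    links = {}
    for i, entry_id in enumerate(ids):
        prev_id = ids[i - 1] if i > 0 else None
        next_id = ids[i + 1] if i < len(ids) - 1 else None
        links[entry_id] = (prev_id, next_id)
    return links
```

Each pair can then be turned into "earlier" and "later" links next to the entry, while the pages themselves stay in their physical order.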
