Proposed 100% Linux Workflow: Capture-Process-OCR


benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Post by benjamin »

Hey folks, here is my rough outline for a complete book scanning software system relying entirely on open-source tools. Hoping for as much feedback as you can provide. I've been authorized to release the process and the finished scripts under the Creative Commons - Attribution 3.0 license; http://creativecommons.org/licenses/by/3.0/
Credit: Benjamin C. Varadi on behalf of the Tulane University Center for Intellectual Property Law & Culture; URL: http://iplc.tulane.edu

Each individual step has been tested and works OK; implementation through two shell scripts (capture & process) is at about 50%. A GUI is worth the extra work and will hopefully be forthcoming. I don't know how to make one, so I'm hoping to build a team that includes folks who do. (June 2010 edit: I had proposed a browser-based interface, but it's far too slow and clumsy. I now suspect the only reason to web-enable a physical device sitting in front of you is to have an excuse to talk about AJAX.)

While they're described together, we may move things around so that the scan computer is dedicated to capture and other workstations/staff members can handle processing.

Fun fact: while to my knowledge there aren't Windows/cygwin ports of some of the key components, everything runs just dandy in VirtualBox. So is a scanning LiveCD in the works? I think so...

Some factors to make sense of this:
1) We're not using the USB trigger most of you know & love; instead we're controlling two Canon SX100 cameras directly from a desktop and using a hacked USB joystick as a trigger. Instructions here: http://www.diybookscanner.org/forum/vie ... 252&p=2249
2) We're only dealing with two fonts, so we should be able to train Tesseract for better-than-average results. Initial tests are good.
3) We're scanning what is essentially a huge printed database with the goal of turning it into a huge digital database (for integration into our software and to be released separately)

I've added some notes in italics.

Platform: Xubuntu Linux. Really you just need an X interface and a basic LAMP environment, but apt and automatic updates make this a nice option if you don't mind the overhead. XFCE is really nice, too. Also I like the startup screen with the mouse and the glowing motes.

CAPTURE
0. In an ideal world, some automatic camera calibration would happen using gPhoto and reference images. We're just doing that by hand: live preview on the camera, or shoot & check the picture on screen (described below).
1. Script launches, user inputs book information. Script creates a MySQL database and disk folders for the book. (Note: if we weren't so specialized, we'd also collect extended metadata, which would be packaged in MARC format & later used by the ebook.)
2. User presses foot switch; joystick signal is captured by QJoypad. (Note: I'd like to eventually replace this with a lower-level daemon that supports nonstandard characters, but I spent 1.5 hrs fighting with joy2key and decided to move on.)
3. Currently keylaunch takes the QJoypad keystroke and sends it to the script. Eventually the script will capture the keypress itself.
4. Script launches gphoto2 for camera 1, captures the image, renames it with the page number and moves it to the book-specific folder.
5. (No step 5; it's now 8b. Didn't feel like renumbering.)
6. Script launches gphoto2 for camera 2 and repeats the process.
7. Script loads a Firefox window where the two images are displayed using PHP, Apache, and ImageMagick.
8. Web page prompts user to continue to next pageset; tag either current page as requiring manual postprocessing; re-shoot either page; or close the book. (Currently via web page, soon via manual controls.)
8a. Reshoot deletes the image and waits for the foot pedal to trigger gphoto2 for the appropriate camera & overwrite the image; GOTO 7.
8b. Write page data to MySQL.
8c. If present, manual postprocessing flag is entered into the MySQL database. GOTO 3.
8d. If close book, GOTO 9; else GOTO 3.
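
Steps 4 and 6 above could be sketched roughly as follows. This is only an illustration in Python, not the actual capture script (which is written in shell). The helper names `page_path`, `capture_command` and `pageset_commands`, the four-digit file naming, and the example port strings are all assumptions made for the sketch; `--port`, `--capture-image-and-download`, `--filename` and `--force-overwrite` are standard gphoto2 options.

```python
import os


def page_path(book_dir, page_number):
    """Destination file for one page image (hypothetical naming scheme).

    Zero-padding keeps the files in shooting order when sorted by name.
    """
    return os.path.join(book_dir, f"{page_number:04d}.jpg")


def capture_command(port, dest):
    """Build the gphoto2 command line that shoots one camera and saves to dest."""
    return [
        "gphoto2",
        "--port", port,                  # e.g. "usb:001,004" (example value only)
        "--capture-image-and-download",
        "--filename", dest,
        "--force-overwrite",             # lets a re-shoot (step 8a) replace the file
    ]


def pageset_commands(book_dir, first_page, port_cam1, port_cam2):
    """Commands for one page spread: camera 1 (step 4), then camera 2 (step 6)."""
    return [
        capture_command(port_cam1, page_path(book_dir, first_page)),
        capture_command(port_cam2, page_path(book_dir, first_page + 1)),
    ]
```

Each returned list can be passed to `subprocess.run` once the foot switch fires; building the commands separately from running them makes the naming logic easy to check without cameras attached.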

MANUAL PROCESSING
9. Script opens trouble pages (tagged in step 8) in GIMP for manual image postprocessing. (Note: I hate GIMP. Is there a better option for dodging/burning/tweaking?)
10. Book opened in Scan Tailor for semi-automatic image postprocessing. (Note: can this be automated? Preset for a given book format and run from the command line?)

AUTOMATIC PROCESSING
11. Script sends Scan Tailor pages to OCRopus/Tesseract for character recognition, using our custom shapes database (which doesn't exactly exist yet, but we have a team at the ready).
12. OCR'ed text goes to script for text postprocessing (mostly just removing hyphens at ends of lines & other issues specific to our database formatting goal).
13. Text goes to Project Gutenberg scripts for cleanup.
14. Script inserts clean text into MySQL database for book & saves plaintext copy.
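
The hyphen removal mentioned in step 12 might look something like the sketch below. It is a guess at the general technique, not the project's own postprocessing script; the function name `dehyphenate` and the rule "only join when the next line starts with a lowercase letter" are assumptions.

```python
import re

# A word character, a hyphen at the very end of a line, optional indentation,
# then a lowercase letter starting the next line.
_LINE_END_HYPHEN = re.compile(r"(\w)-\n[ \t]*([a-z])")


def dehyphenate(text):
    """Join words that were split across a line break by a trailing hyphen.

    The two lines are merged at the break. Caveat: a genuine compound that
    happens to break at its hyphen (e.g. "well-" / "known") also loses the
    hyphen, so the output still needs a human look.
    """
    return _LINE_END_HYPHEN.sub(r"\1\2", text)
```

Requiring a lowercase continuation leaves breaks such as "Anglo-" followed by "Saxon" untouched, which avoids at least some false joins.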

REVIEW & STORAGE
15. User opens text review interface to conduct a "reality check" confirming OCR accuracy against the physical book. Manual edits possible. (June 2010 edit: this was Firefox & LAMP code, which is still a possibility here since we don't care about camera response time during this phase, but it also deserves a dedicated tool.)
16. On approval, MySQL data gets integrated into unified database.
17. Complete book archive (originals, processed images, and individual database) is transferred via Samba to network storage or burned to disc w/ Brasero (or maybe something from the command line).
Last edited by Anonymous on 25 Jun 2010, 04:52, edited 6 times in total.
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by rob »

Wowzers, this would be awesome if it were on a LiveCD. Or a virtual image that I could load up in VMware. I realize that there's a good deal of work still to be done, but it's a great plan!

One change I would make -- to step 13: Text goes to Distributed Proofreading for cleanup. This requires the image filenames to be renamed after their page numbers.
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

Sorry, to clarify: it's not actually going to Project Gutenberg as an organization, just to gutcheck. Still, that's good to know. When the script names the images it does so sequentially, and these are the page numbers that get entered into MySQL, but adding some custom numbering options makes sense.
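
One possible shape for such a custom numbering option, sketched in Python. Everything here is hypothetical: the function names, the idea of numbering front matter in roman numerals, and the file name pattern are illustrations only, and this is not a naming scheme required by Distributed Proofreading.

```python
def to_roman(n):
    """Lowercase roman numeral for a positive integer (for front matter)."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(seq, front_matter_pages=0):
    """Map a 1-based sequential image number to a printed page label."""
    if seq <= front_matter_pages:
        return to_roman(seq)
    return str(seq - front_matter_pages)


def image_name(seq, front_matter_pages=0, ext=".jpg"):
    """Keep the sequential prefix for sort order and append the printed label."""
    return f"{seq:04d}_p{page_label(seq, front_matter_pages)}{ext}"
```

For a book with four pages of front matter, `image_name(5, 4)` gives `0005_p1.jpg`: the fifth image shot is printed page 1.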
Last edited by Anonymous on 24 Feb 2010, 14:01, edited 1 time in total.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by daniel_reetz »

I'm with Rob, this is Part Of The Dream. I would LOVE an integrated LiveCD that does the work for me.

Another user (Tim, maybe?) said in his intro that he's handy with this stuff; might pay to PM the guy. I know Spamsickle and others have done a lot of shell-scripting sorts of things, and I think member Scrivener also had some Linux experience.

I'll comment on the substance of your post later, just wanted to say that there's probably already help here with some of the trickier parts. Also, I don't think you've been much interested in Scan Tailor, but it does run on Linux...
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

Scan Tailor is #10! Key to the whole operation! ;)

I'm hoping to have a beta running by the end of next week.
Last edited by Anonymous on 24 Feb 2010, 14:08, edited 1 time in total.
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by rob »

Oh, gutcheck. I know DP uses GUIprep which fixes common OCR issues like tli = th at the beginning of the word, and so on. Does gutcheck do that?
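
To make the "tli = th" example concrete, a single correction of that kind can be written as one regular expression. This is only a sketch of the general idea in Python; it is not taken from GUIprep or gutcheck, and the function name `fix_tli` is made up.

```python
import re

# "tli" or "Tli" at the start of a word, followed by another lowercase letter.
_TLI = re.compile(r"\b([Tt])li(?=[a-z])")


def fix_tli(text):
    """Replace a word-initial 'tli' misread with 'th', preserving the capital."""
    return _TLI.sub(lambda m: m.group(1) + "h", text)
```

So "tlie" becomes "the" and "Tlien" becomes "Then", while a "tli" in the middle of a word is left alone.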
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

EDITED: It's supposed to do some of that. I haven't gotten guiprep to work yet, but that was using the Windows frontend. Gonna try both with & without it & see what happens.
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

Just a brief little update, progress is moving along nicely on this. Right now we're using gtkam for initial camera setup (making sure the USB assignments haven't changed, checking alignment and zoom).

I'm a little unhappy with our current image preview situation. I don't know how to send keystrokes to a bash script unless it's running in the foreground, so while the Firefox thing could work great in theory, it's more distracting than anything right now. Keylaunch can call a script but apparently not pipe info to one, so I think what I have to do is create one script that sets up the initial book data, and another that can be called repeatedly to take the shots, reading/writing the current book and page data from the database. That seems like a really inelegant way to get around something that I feel we should be able to correct for elsewhere. Maybe the answer will be to find some guru who can migrate the thing into a GTK app with built-in preview windows & camera calibration... someday...
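
The two-script idea (one script sets up the book, the other is called once per trigger and keeps its place in the database) could be sketched as below. This is an illustration only: it uses Python's built-in sqlite3 as a stand-in for MySQL so the example is self-contained, and the table layout and the function names `init_book` and `claim_next_pages` are invented for the sketch.

```python
import sqlite3


def init_book(conn, title):
    """Setup script: record the book and start its page counter at 1."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book ("
        "id INTEGER PRIMARY KEY, title TEXT, next_page INTEGER)"
    )
    cur = conn.execute(
        "INSERT INTO book (title, next_page) VALUES (?, 1)", (title,)
    )
    conn.commit()
    return cur.lastrowid


def claim_next_pages(conn, book_id, count=2):
    """Shot script: return the page numbers for this trigger, then advance.

    Called once per foot-switch press; all state lives in the database, so
    the script itself does not need to stay running between shots.
    """
    (next_page,) = conn.execute(
        "SELECT next_page FROM book WHERE id = ?", (book_id,)
    ).fetchone()
    conn.execute(
        "UPDATE book SET next_page = ? WHERE id = ?",
        (next_page + count, book_id),
    )
    conn.commit()
    return list(range(next_page, next_page + count))
```

With this split, keylaunch only has to launch the shot script; it never needs to pipe anything into a running process.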

Speed is good; we're firing the cameras near-simultaneously, so we can shoot around once every couplafew seconds.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by daniel_reetz »

Sorry for the dumb question, but you've got more than one Powershot hooked up to a single computer? I've never tried more than one on any operating system, now that I think of it, but I remember back in the day when I was testing AHDRIA you could only hook up one.
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

Yes, this is why we purchased the SX100s. Triggering is more or less simultaneous: the script calls two instances of gphoto2 from bash. Here's the official list of supported cameras; it's pretty weird, broader than just those supported by the Canon SDK, but not all-inclusive. http://www.gphoto.org/doc/remote/
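
The near-simultaneous triggering can be illustrated with a short Python sketch (the real script does the equivalent from bash). The helper name `run_in_parallel` is made up, and the commands passed in are assumed to be complete gphoto2 command lines, one per camera.

```python
import subprocess


def run_in_parallel(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as the commands. Because both
    processes are started before either is waited on, the two downloads
    overlap instead of running back to back.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

A call might look like `run_in_parallel([cam1_cmd, cam2_cmd])`, where each argument is a list such as `["gphoto2", "--port", ..., "--capture-image-and-download"]` with the port filled in for that camera.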

So far only a few issues:

1) because we're not using memory cards, it takes around 3 seconds per capture to download the image to the hard drive (they're running in parallel, though, so that's 3 seconds total for both cameras). This isn't a big deal since it typically takes about that much time to turn the page and reseat the platen, so I just had to put in a brief pause to accommodate the phenomenon, and wrote in a little error-checking routine that automatically re-shoots the image if it gets hung up. A better way to manage this would probably be to shoot to a memory card and then download the image and delete it from the card in the background. That'll probably be a v2 feature (we're still at v0).

2) control of camera parameters doesn't seem to be possible from the commandline interface, although it is supported by the library. Right now we just start the process by opening gtkam and using the preview there to ensure the cameras are properly aligned and zoomed.

3) the cameras don't seem to remember settings after getting shut off, which may be related to the fact that we're using $10 power adapters instead of batteries.

4) Xubuntu seems to arbitrarily assign USB ports each time the cameras get turned on, so right now I have to run gphoto2 --auto-detect and then plug the port info into the script. By v1 this will be handled automatically by the script.
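
Issues 1 and 4 lend themselves to small helpers; a rough Python sketch follows. It is not the project's script. The parser assumes that `gphoto2 --auto-detect` prints a two-column table with the port (for example `usb:001,004`) as the last field on each camera line, and the function names, retry count and timeout are hypothetical values chosen for illustration.

```python
import subprocess


def parse_auto_detect(output):
    """Extract (model, port) pairs from `gphoto2 --auto-detect` output.

    Assumes the port is the last whitespace-separated field on each camera
    line; the header and separator lines do not match and are skipped.
    """
    cameras = []
    for line in output.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].startswith("usb:"):
            cameras.append((parts[0].strip(), parts[1]))
    return cameras


def detect_cameras():
    """Run gphoto2 and return the attached cameras with their current ports."""
    result = subprocess.run(
        ["gphoto2", "--auto-detect"],
        capture_output=True, text=True, check=True,
    )
    return parse_auto_detect(result.stdout)


def capture_with_retry(command, attempts=3, timeout=15, run=subprocess.run):
    """Run a capture command, re-shooting if it hangs or fails.

    Returns True as soon as one attempt exits with status 0, False if every
    attempt timed out or returned a nonzero status.
    """
    for _ in range(attempts):
        try:
            if run(command, timeout=timeout).returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # the capture hung; fall through and try again
    return False
```

Calling `detect_cameras()` at startup would replace the manual step of pasting port info into the script, and `capture_with_retry` covers the automatic re-shoot when a download gets hung up.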