Proposed 100% Linux Workflow: Capture-Process-OCR
Posted: 24 Feb 2010, 12:28
Hey folks, here is my rough outline for a complete book scanning software system relying entirely on open-source tools. Hoping for as much feedback as you can provide. I've been authorized to release the process and the finished scripts under the Creative Commons Attribution 3.0 license: http://creativecommons.org/licenses/by/3.0/
Credit: Benjamin C. Varadi on behalf of the Tulane University Center for Intellectual Property Law & Culture; URL: http://iplc.tulane.edu
Each individual step has been tested and works OK; implementation through two shell scripts (capture & process) is at about 50%. A GUI is worth the extra work and will hopefully be forthcoming. I don't know how to make one, so I'm hoping to build a team that includes folks who do. (June 2010 edit: I had proposed a browser-based interface, but it's far too slow and clumsy. I now suspect the only reason to web-enable a physical device sitting in front of you is to have an excuse to talk about AJAX.)
While they're described together, we may move things around so that the scan computer is dedicated to capture and other workstations/staff members can handle processing.
Fun fact: while to my knowledge there aren't Windows/Cygwin ports of some of the key components, everything runs just dandy in VirtualBox. So is a scanning LiveCD in the works? I think so...
Some factors to make sense of this:
1) We're not using the USB trigger most of you know & love. Instead, we're controlling two Canon SX100 cameras directly from a desktop and using a hacked USB joystick as a trigger. Instructions here: http://www.diybookscanner.org/forum/vie ... 252&p=2249
2) We're only dealing with two fonts, so we should be able to train Tesseract for better-than-average results. Initial tests are good.
3) We're scanning what is essentially a huge printed database with the goal of turning it into a huge digital database (for integration into our software and to be released separately).
I've added some notes in italics.
Platform: Xubuntu Linux. Really you just need an X interface and a basic LAMP environment, but apt and automatic updates make this a nice option if you don't mind the overhead. XFCE is really nice, too. Also I like the startup screen with the mouse and the glowing motes.
CAPTURE
0. In an ideal world, some automatic camera calibration stuff would happen using GPhoto and reference images. We're just doing that by hand. Live preview on the camera, or shoot & check out pic on screen (described below).
1. Script launches, user inputs book information. Script creates a MySQL database and disk folders for the book. (Note: if we weren't so specialized, we'd also collect extended metadata, which would be packaged in MARC format & later used by the ebook.)
2. User presses foot switch, joystick signal captured by QJoypad. (Note: I'd like to eventually replace this with a lower-level daemon that supports nonstandard characters, but I spent 1.5 hrs fighting with joy2key and decided to move on.)
3. Currently keylaunch takes the QJoypad keystroke and sends it to the script. Eventually the script will capture the keypress itself.
4. Script launches GPhoto2 for camera 1, captures the image, renames it with the page number, and moves it to the book-specific folder.
5. (No step 5; it's now 8b. Didn't feel like renumbering.)
6. Script launches GPhoto2 for camera 2 and repeats the process.
7. Script loads a Firefox window where the two images are displayed using PHP, Apache, and ImageMagick.
8. Web page prompts user to continue to next pageset; tag either current page as requiring manual postprocessing; re-shoot either page; or close the book. (currently via web page, soon via manual controls)
8a. Reshoot deletes image and waits for foot pedal to trigger GPhoto for appropriate camera & image overwrite; GOTO 7.
8b. Write page data to MySQL.
8c. If present, manual postprocessing flag entered into MySQL database.
8d. If close book, GOTO 9; else GOTO 3.
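The capture steps above (4 through 8) can be sketched roughly as follows. This is a minimal illustration in Python, not the actual capture script: the function names, the zero-padded file naming, the assumption that camera 1 shoots the lower-numbered page, and the port strings are all hypothetical. The gphoto2 options used (`--port`, `--capture-image-and-download`, `--filename`, `--force-overwrite`) are real, but the commands are only built here, not executed.

```python
from pathlib import Path


def page_filename(book_dir, page_number):
    """Hypothetical naming scheme: zero-padded page number inside the book folder."""
    return Path(book_dir) / f"{page_number:04d}.jpg"


def gphoto2_command(port, target):
    """Build (but do not run) a gphoto2 command line for one camera."""
    return [
        "gphoto2",
        "--port", port,                    # which camera to talk to
        "--capture-image-and-download",    # shoot and pull the file off the camera
        "--filename", str(target),         # save directly under the page name
        "--force-overwrite",               # lets a reshoot replace the old image
    ]


def pageset_commands(book_dir, first_page, ports):
    """Steps 4 and 6: camera 1 takes first_page, camera 2 takes the next page."""
    return [
        gphoto2_command(port, page_filename(book_dir, first_page + offset))
        for offset, port in enumerate(ports)
    ]


def next_step(choice):
    """Map the operator's choice in step 8 to the step that follows it."""
    if choice == "reshoot":
        return 7   # 8a: after the reshoot, display the pair again
    if choice == "close":
        return 9   # book is done, move on to manual processing
    return 3       # "continue" or "flag": wait for the next pedal press


if __name__ == "__main__":
    # The port strings are placeholders; `gphoto2 --auto-detect` lists the real ones.
    for cmd in pageset_commands("books/demo", 12, ["usb:001,004", "usb:001,005"]):
        print(" ".join(cmd))
```

Addressing each camera by its own port keeps the two captures independent, so a reshoot of one page only needs the command for that camera to be run again.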
MANUAL PROCESSING
9. Script opens trouble pages (flagged in 8c) in GIMP for user image postprocessing. (Note: I hate GIMP. Is there a better option for dodging/burning/tweaking?)
10. Book opened in ScanTailor for semi-automatic image postprocessing. (Note: can this be automated? Preset for given book format and run from the command line?)
AUTOMATIC PROCESSING
11. Script sends ScanTailor pages to OCRopus/Tesseract for character recognition, using our custom shapes database (which doesn't exactly exist yet, but we have a team at the ready).
12. OCR'ed text goes to script for text postprocessing (mostly just removing hyphens at ends of lines & other issues specific to our database formatting goal).
13. Text goes to Project Gutenberg scripts for cleanup.
14. Script inserts clean text into MySQL database for book & saves plaintext copy.
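The hyphen removal in step 12 might look something like this. It's only a sketch under a simple assumption, namely that a hyphen at the very end of a line followed by a lowercase letter on the next line is a split word; the function name `dehyphenate` is made up for illustration. Genuinely hyphenated compounds that happen to break at the hyphen (e.g. "well-" / "known") would be joined incorrectly, so a real script would probably need a word list or book-specific rules.

```python
import re

# A word character, a hyphen at the very end of a line, optional indentation,
# then a lowercase letter starting the next line.
LINE_END_HYPHEN = re.compile(r"(\w)-\n[ \t]*([a-z])")


def dehyphenate(text):
    """Join words split across lines by a trailing hyphen."""
    return LINE_END_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    print(dehyphenate("a huge printed data-\nbase entry"))
```

Requiring a lowercase continuation leaves line breaks like "Smith-" / "Jones" alone, which seems the safer default for database-style text full of names.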
REVIEW & STORAGE
15. User opens text review interface to conduct a "reality check" confirming OCR accuracy against the physical book. Manual edits possible. (June 2010 edit: this was Firefox & LAMP code, which is still a possibility here since we don't care about camera response time during this phase, but it also deserves a dedicated tool.)
16. On approval, MySQL data gets integrated into unified database.
17. Complete book archive (originals, processed images, and individual database) is transferred via Samba to network storage or burned to disc w/ Brasero (or maybe something from the command line).