Proposed 100% Linux Workflow: Capture-Process-OCR

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

A quick follow-up: I'm assuming the ability to shoot from a single machine is fairly new, as several of the systems I've seen (including Archive.org's and, I think, some of the older Atiz stuff) require two networked computers, one for each camera. I've seen multiple capture applications using gphoto dating back to 2007. I think the key is that the software doesn't rely on system-wide device drivers.
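For anyone wondering what driving two cameras from one machine looks like, here is a minimal sketch that builds one gphoto2 command per camera. The port strings and the filename pattern are made-up placeholders, and `capture_command` is a hypothetical helper, not part of any released script. `--port`, `--capture-image-and-download`, and `--filename` are standard gphoto2 options, but whether a particular camera supports remote capture has to be checked separately.

```python
def capture_command(port, side, page):
    """Build the gphoto2 argument list for one camera.

    port -- a port string such as "usb:001,004" (placeholder value)
    side -- "left" or "right", used only to name the output file
    page -- page number, zero-padded in the filename
    """
    filename = f"page_{page:04d}_{side}.jpg"
    return [
        "gphoto2",
        "--port", port,
        "--capture-image-and-download",
        "--filename", filename,
    ]


if __name__ == "__main__":
    # Hypothetical ports; real values come from `gphoto2 --auto-detect`.
    for side, port in (("left", "usb:001,004"), ("right", "usb:001,005")):
        print(" ".join(capture_command(port, side, 1)))
```

Each list can be handed to `subprocess.run` as-is, so both cameras are addressed by port from the same machine.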

Note that, to my knowledge, the only software available to do this on Windows is PSRemote Multi, at a license fee of $95 per camera, with only Canon SDK cameras supported. I didn't try it in Cygwin and went straight to VirtualBox, but if feasible this may be an interesting option for non-Linux folk.
Tim

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by Tim »

Just a couple notes. I really like the idea of working towards a smoothed, 100% Linux workflow. Have you considered CHDK http://chdk.wikia.com/wiki/CHDK for the camera settings issue? You would have to use flash cards to have the settings saved on, but that should really help.

Second, have you tried cuneiform? If Ocropus/Tesseract is working well for you no problem, but it may be worth a try. https://launchpad.net/cuneiform-linux

Third, for GTK: I can't solve that for you, but have you looked at Glade? Some people don't like it, but if you have someone who can figure it out, it can save time. http://glade.gnome.org/ Google for tutorials.

Finally, you may be interested in looking at the Decapod project http://sites.google.com/site/decapodproject/ and particularly their wiki at http://wiki.fluidproject.org/display/fluid/Decapod that has a lot more information. They are aiming for much the same goals.
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

Thanks for the feedback. Haven't messed w/ CHDK yet, but will get there eventually. Think I've figured out how to control zoom, and plan to put index marks on the platen and camera mounts- so rather than messing with trial and error, the software can just ask, "what size is your book?" and then say, "set height to notch 5, I'm adjusting zoom level to 4" or whatever.
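The "what size is your book?" idea boils down to a lookup table. A minimal sketch, assuming the rig has been calibrated by hand; every number in `CALIBRATION` is a placeholder, and `settings_for` is a hypothetical name:

```python
# Hypothetical calibration table: (maximum page height in cm, platen notch, zoom step).
# The numbers are placeholders and would have to be measured on the real rig.
CALIBRATION = [
    (18.0, 3, 6),
    (24.0, 5, 4),
    (30.0, 7, 2),
]


def settings_for(page_height_cm):
    """Return (notch, zoom) for the smallest calibrated size that fits the book."""
    for max_height, notch, zoom in CALIBRATION:
        if page_height_cm <= max_height:
            return notch, zoom
    raise ValueError(f"no calibration entry for a {page_height_cm} cm page")


if __name__ == "__main__":
    notch, zoom = settings_for(22.5)
    print(f"Set height to notch {notch}; adjusting zoom level to {zoom}.")
```

Keeping the table sorted by size means the first match is always the tightest fit, and an oversized book fails loudly instead of silently picking the wrong notch.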

Our results weren't great w/ Cuneiform and I don't think it's trainable. I think we can get what we want with Tesseract, it's just a question of having monkeys identify a giant pool of character shapes. Fortunately we have a nearly unlimited supply of both.

Tried watching a video tutorial on Glade and my brain started to melt- I'm a former web developer, but worked more frontend/design than anything else. Think I may actually use that to my advantage, and see if I can't do a PHP/AJAX interface... it's just gonna depend on whether there's any weirdness with the exec() family of functions. Unfortunately at the law school I'm the "someone," and the priority right now is ugly-but-working.

Been following Decapod development for a little while and actually lurk on their mailing list, but as far as I can tell, right now they're even further from building a UI than I am (though they certainly have more qualified folks working on it and a more professional approach than my "hmm... maybe it would be nice if it did this now" interface). I do like that flowchart, though...
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by daniel_reetz »

benjamin wrote: and a more professional approach than my "hmm... maybe it would be nice if it did this now" interface. I do like that flowchart, though...
Ben, just want to say I deeply appreciate the work you're doing here, and that I'm certain that the "nice if it did this now" approach will go a hell of a lot further than wikis and mailing lists and so on, just because, yes, it does this now. This summer, I am going to dive headfirst into Linux; I have been whetting my appetite with my Nokia N900 Linux phone. I think that a LiveCD would be such a perfect complement to the scanner, and a great way to bundle up tools. Thanks again for taking the initiative.
cathal_magus
Posts: 2
Joined: 04 Mar 2014, 00:52

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by cathal_magus »

I have some experience building GUIs with GTK+ and Glade, using the Ruby programming language (for Alexandria). I could fairly easily throw together a working prototype, given a couple of pencil-and-paper sketches of what you'd like the interface to look like. It would probably be a throw-away prototype, but should at least help in getting feedback from yourself and others as to what you'd like the eventual user-interface to look and behave like. And considering I'll eventually be writing a GUI for my own scan-processing software, it might be helpful for me too.
Tim

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by Tim »

benjamin wrote: Been following Decapod development for a little while and actually lurk on their mailing list, but as far as I can tell, right now they're even further from building a UI than I am (though they certainly have more qualified folks working on it and a more professional approach than my "hmm... maybe it would be nice if it did this now" interface). I do like that flowchart, though...
Well, I'm glad I could point out at least one thing (CHDK) that you hadn't tried yet. :) As far as I could tell from Decapod, they actually have beta software and it runs on Linux (Ubuntu, to be specific). I agree their process is a little odd, and I can't even find whether they have a Sourceforge project page or another place where the beta software can be downloaded. Or perhaps they are going to offer something for download when the .3 or later releases are ready. See http://wiki.fluidproject.org/display/fl ... .3+Release
That's way later than their originally planned deadline of a 1.0 release in the first year, but it seems like they are going to get something working soonish.

Either way, I didn't point out other options to discourage your efforts. Precisely the opposite. I'd rather you got a chance to take advantage of the work others had already done. I feel like we're close to the vision you have mentioned and I'd like to see it happen as well.
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

Thanks all for the support. I feel pretty lucky that this actually gets to be my job (at least, my current assignment). On that note, some job complications have slowed progress momentarily, but I did just find some nice reusable code for solving an annoying issue where the cameras randomly get assigned USB identifiers. Progress is being made: our hope is that staff will be using the system starting very soon, and I'll post code up once we've done a few actual runs.
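To show the shape of the USB identifier problem (this is not the reusable code mentioned above, just a sketch): `gphoto2 --auto-detect` lists each camera with whatever port it happened to get on this boot, so the ports have to be re-read every session. The `SAMPLE` text is an assumed example of that listing, and `parse_auto_detect` is a hypothetical helper.

```python
import re

# Matches a line such as "Canon PowerShot SX100 IS       usb:001,004".
_LINE = re.compile(r"^(?P<model>.+?)\s+(?P<port>usb:\d+,\d+)\s*$")


def parse_auto_detect(text):
    """Return a list of (model, port) pairs from `gphoto2 --auto-detect` output."""
    cameras = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            cameras.append((match.group("model"), match.group("port")))
    return cameras


# Assumed example of the listing; the port numbers change between sessions.
SAMPLE = """\
Model                          Port
----------------------------------------------------------
Canon PowerShot SX100 IS       usb:001,004
Canon PowerShot SX100 IS       usb:001,005
"""

if __name__ == "__main__":
    for model, port in parse_auto_detect(SAMPLE):
        print(port, model)
```

Note that two identical cameras cannot be told apart by model name alone, so deciding which port is the left camera and which is the right still needs some other identifier read from each camera.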

I didn't mean to denigrate Decapod, their goals are just really different- they're committed to straightening pages in new and interesting ways, and are working within an established code base/community; where my job is basically to get a bunch of existing text scanned ASAP, however quick/dirty the process may be. Turning it into something more is totally just because I'm excited about the possibilities, too...

I am really stoked to continue on with this project and move towards a LiveCD. I think as a format generally they're totally underutilized... back in '05 I was part of a group that was working on a LiveCD for webcasting/microradio; though we never made it past an early version: http://www.auppix.org still has the background info, it's something I'd still like to revisit someday (though obviously some more sophisticated projects have emerged since then).

Dan- if you haven't spent much time playing with Linux recently, you're going to be pleasantly surprised. It's an actual grown-up OS now, rather than what it was just a couple of years ago. It's at least as intuitive and more accessible than Win or Mac except, unfortunately, for certain specialized applications (particularly those where one or two software companies dominate, like video or audio production). Oh, and I've been completely unsuccessful in getting a PVR system to work even a little bit (though watching TV is noprob). :)
The major issue, as you may have already discovered, is that while the tech is open, documentation is lacking throughout.

Hey cathal_magus, can you PM me an email address?

Tim, I'm looking forward to exploring CHDK/SDM options... if either is compatible with gphoto2, this would open up both a ton of new control options AND mean a whole slew of new cameras are supported! If it's not yet compatible, perhaps this is something we as a community can help facilitate...
Niashi
Posts: 1
Joined: 04 Mar 2014, 00:52

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by Niashi »

HA, wonderful. This is actually nearly exactly what I was looking for. I came here looking to see if anyone was trying something similar to what I am attempting to do.

The only difference for me is that I need to build an interface to retrieve the information from, such as a webpage or an application. My intention is to build a database of my cookbooks and install touchscreens in my kitchen, and I only use Linux at home, except for my laptop. Eh, what can I say, I want a techy kitchen, and it saves the husband from having to read off recipes to me due to my extremely limited space in the kitchen, or sending them to my Blackberry and having to fiddle with that.

More and more my ideas are coming into reality, woohoo! Also, I'm not sure what you have against GIMP, but yes, it is your best bet in Linux. You may want to give the CVS/SVN version a shot; I primarily use the version in CVS/SVN as it's generally way ahead of what's in the stable/released tree.

I guess the next step is to attempt a run at this with premade documents, as I cannot build the scanner yet (I need to wait until I move into my house next year before I build it), so I can see how I want to retrieve the data. I was planning on building something close to this, except I think you have sped up what I originally thought of, with GIMP launching for problematic pages.

I may have missed it, but how does this handle images from books?
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by daniel_reetz »

benjamin wrote: Dan- if you haven't spent much time playing with Linux recently, you're going to be pleasantly surprised. It's an actual grown-up OS now, rather than what it was just a couple of years ago. It's at least as intuitive and more accessible than Win or Mac except, unfortunately, for certain specialized applications (particularly those where one or two software companies dominate, like video or audio production). Oh, and I've been completely unsuccessful in getting a PVR system to work even a little bit (though watching TV is noprob). :)
The major issue, as you may have already discovered, is that while the tech is open, documentation is lacking throughout.
Ben, I met some good people in Boston at this O'Reilly conference who are working on some of that. And a lot of other stuff you'd care about. We should really chat on the phone soon.

Just FYI, I got a new PC and have Ubuntu and Win 7 on it. So far, loving Ubuntu... can't wait to start playing with some of the stuff you've built.
benjamin
Posts: 58
Joined: 04 Mar 2014, 00:53

Re: Proposed 100% Linux Workflow: Capture-Process-OCR

Post by benjamin »

Just a quick update: a "proof of concept" version of the scanner/software has now been implemented, and another Tulane project is spending the summer using the system and providing feedback as they go. So far it's been very helpful, from noting obvious stuff that just hadn't occurred to me (like handling book titles/folder names with spaces in the title) to minor feature requests that are easy to implement and make a big difference (like an audio prompt when ready to scan the next page). We've also talked about adding hardware controls to flag or reshoot pages, which it looks like we may accomplish using an old dictaphone pedal.
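On the spaces-in-titles issue, a minimal sketch of one way to sidestep it, assuming the folder name is built from the book title. `book_dir` and `shell_safe` are hypothetical helpers and "My Book Title" is a made-up title. `shlex.quote` is the standard-library way to make a string safe for a shell command line; inside a bash script the equivalent is simply double-quoting the variable.

```python
import shlex
from pathlib import Path


def book_dir(base, title):
    """Return the output folder for a book, keeping spaces in the title as-is."""
    return Path(base) / title


def shell_safe(path):
    """Quote a path so it survives being pasted into a bash command line."""
    return shlex.quote(str(path))


if __name__ == "__main__":
    folder = book_dir("/tmp/scans", "My Book Title")
    print(shell_safe(folder))
```

Passing arguments to `subprocess.run` as a list avoids the shell entirely, in which case no quoting is needed at all.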

One thing I'm still struggling with is an effective image preview mechanism. The problem I was having was that ImageMagick was taking up to 10 seconds to rotate and scale each page, making post-capture "preview" far too slow. I've also yet to find a way to effectively close a program from bash other than pkill, and for some reason that leads to system instability over time. I think I may have that beat now that I've gotten a handle on gphoto's ability to access rotate options on the camera itself, but I need to test this, and before any release we'd need to ensure it works on cameras other than the SX100s. I suspect there may be a way to capture a still image from the viewfinder, which would solve the problem, but I haven't looked into this yet.
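On closing the preview program without pkill: one alternative is to keep a handle on the exact process that was started and signal only that one, instead of matching by name. A minimal sketch, assuming the viewer is launched as a child process; `open_viewer` and `close_viewer` are hypothetical names, and the sleeping Python process is just a stand-in for a real image viewer.

```python
import subprocess
import sys


def open_viewer(command):
    """Start the preview program and return its process handle."""
    return subprocess.Popen(command)


def close_viewer(proc, timeout=5.0):
    """Ask the viewer to exit (SIGTERM), escalating to SIGKILL if it ignores us."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for a real image viewer: a Python process that just sleeps.
    viewer = open_viewer([sys.executable, "-c", "import time; time.sleep(60)"])
    close_viewer(viewer)
    print("viewer closed:", viewer.poll() is not None)
```

The same idea works in plain bash by saving `$!` right after launching the viewer in the background and later calling `kill` on that PID. Whether this helps with the instability described above is untested.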

One annoyance here is that gphoto actually supports live video preview, but the only option available (that I've found) for displaying that feed from the command line is their built-in support for "aa", which converts video to ASCII art, like so: http://www.google.com/search?q=aalib&tbs=vid:1 Interestingly, the library itself has support for other formats (it's documented, and gtkam has no problem showing not-weirdified previews), so that's one more reason a GUI is needed in the future. I've got most of a flowchart and the required libraries, most of which are pretty well documented, and the bash script demonstrates the logic, so it should be a simple project.

Anyway, the real point of posting was to note that my fellowship officially ended May 21. I'm continuing on with the project in a volunteer capacity and am enthusiastic about that, but I'm working a new job and studying for the Bar, so the timeline moving forward is likely just collecting feedback until August, then implementing changes and releasing a public version in late August/early September. If anyone has contacts who might be interested in contributing or helping with ongoing development, please shoot me a PM and I'll touch base.