Voice Control

Posted: 26 Sep 2014, 11:30
by duerig
Voice recognition has come a long way in recent years, and it is very accurate when the vocabulary is small. There are open source packages for voice-controlled projects on the Raspberry Pi. This would free your hands for page flipping, and you could either avoid a foot pedal altogether or repurpose your feet for platen movement on some scanners.

The vocabulary would be very simple:


Anything more complicated like changing settings would be accomplished through normal interfaces.

Ideally, we could have the equivalent words in other languages as well, and this would be part of setup.

Now I need to do some research to find the best library to use.

Re: Voice Control

Posted: 28 Sep 2014, 07:52
by Gerard ... Speech-API

Maybe you could use an Android phone as a remote control.

Re: Voice Control

Posted: 30 Sep 2014, 09:59
by duerig
Thanks, gerard. I will want to look into that when I get this more integrated into Spreads. Since I am running on a Raspberry Pi, it is always best to offload the work onto the client if possible. :)

I've also been working with Pocketsphinx to see how well that works. The nice thing is that it is very accurate when I say one of the keywords. It almost always detects when I say 'scan' or 'retake' and can distinguish between them very well.

The downside is that it is also very good at taking other noises and sounds and interpreting them as 'scan' or 'retake'. This problem of out-of-vocabulary words doesn't seem to have any good solution in Pocketsphinx, especially when you have a very small vocabulary. I have a few other possible solutions I want to explore, though.

Oddly enough, Pocketsphinx interprets the click of the camera shutter as the word 'scan'. Which of course makes it take another picture and generate another click. When I hooked it up to my prototype scanning workflow, it was in an infinite loop until I killed it.
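
One possible way to break that kind of feedback loop (a sketch only, not something tested in this thread) is to ignore any recognized command that arrives within a short window after the last accepted one, so the shutter click cannot retrigger a capture. The class name `CommandGate` and the 1.5 second cooldown are assumptions for illustration; the recognizer itself is left out.

```python
import time


class CommandGate:
    """Drop recognized commands that arrive too soon after the last accepted one."""

    def __init__(self, cooldown_s=1.5, clock=time.monotonic):
        # cooldown_s is a guess; it should be longer than the shutter noise lasts.
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last_accepted = None

    def accept(self, word):
        """Return True if `word` should be acted on, False if it is in the cooldown."""
        now = self._clock()
        if (self._last_accepted is not None
                and now - self._last_accepted < self.cooldown_s):
            return False  # probably the shutter click or an echo of the command
        self._last_accepted = now
        return True
```

The capture loop would call `gate.accept(word)` on every detection and only trigger the camera when it returns True. This does nothing about other stray noises, it only stops the camera from triggering itself.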

Re: Voice Control

Posted: 12 Sep 2017, 09:21
by jesu_krist
In the broader context of voice controlling the camera(s) during acquisition, I've found this simple solution:

This is my scanning rig (viewtopic.php?p=20885#p20885). I use the TwoCamControl AutoHotKey script to trigger the camera(s), which in turn is activated by keystrokes or other custom actions performed on peripherals. My idea was to voice-trigger the camera(s) through some speech recognition software capable of simulating keystrokes: enter Vocola.

Vocola 3 uses the built-in Windows Speech Recognition as input for local and global hotkeys and shortcuts. I just use the internal mic of my laptop to trigger the camera(s), with no significant latency; Vocola interprets my commands as keystrokes. Now I scan hands-free, much faster, and with as little effort as possible. The easiest way to prevent ambient noise from triggering the camera(s) is to lower the mic sensitivity and to choose commands that sound unique. Because I just need to "shoot" the cameras, I have only one voice command: I say "now!" every time I need the camera(s) to shoot, which is interpreted as "F8" on the keyboard. The software recognizes "now!" much more easily than "shoot!" and it never fails (at least in my case: I'm not a native English speaker).
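
The setup above boils down to a small lookup from a spoken phrase to a keystroke. Here is a rough Python sketch of that idea; it is not Vocola syntax, and the names `COMMAND_KEYS` and `key_for` are made up for illustration. Only the "now" to F8 pairing comes from the description above.

```python
# Spoken phrase -> key name. Only "now" -> "F8" is taken from the post.
COMMAND_KEYS = {"now": "F8"}


def key_for(phrase):
    """Return the key name for a recognized phrase, or None if it is not a command."""
    # Normalize whitespace, trailing punctuation, and case before the lookup.
    cleaned = phrase.strip().strip("!?.,").lower()
    return COMMAND_KEYS.get(cleaned)
```

Anything that is not in the table returns None and is simply ignored, which is part of why a single, distinctive command word works well.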

Re: Voice Control

Posted: 14 Sep 2017, 15:02
by dtic
jesu_krist wrote:
12 Sep 2017, 09:21
Vocola 3 uses the built-in Windows Speech Recognition as input for local and global hotkeys and shortcuts.
Nice solution! I'm not familiar with Vocola, how long is the delay between the phrase you say and the action?

Re: Voice Control

Posted: 14 Sep 2017, 15:20
by jesu_krist
In my case, with a one word command and using the internal mic of my laptop, the delay is under 2 seconds; but with a little bit of practice and for such a repetitive task I'm able to voice the command as I'm still turning the pages (or positioning the book), so that in the end there is almost no latency.