I recently started experimenting with text-to-speech (TTS) and automatic speech recognition (ASR) functionality after participating in an AT&T co-sponsored hackathon in San Francisco several weeks ago. Further emboldened by a re-watching of 2001: A Space Odyssey, I set out to make a program that I could drive with simple voice commands: “tell me the weather,” “what’s the market doing?” and so on. Unsurprisingly, my results were mixed: the voice recording component sometimes truncated phrases, the web-based ASR took 1–2 seconds to process, and the results were sometimes wrong in unexpected ways (possibly attributable more to my microphone than to AT&T’s API). I was sure that someone a little more experienced could really knock speech functionality out of the park.
Consequently, I was excited to play with an Xbox 360 while staying with a friend for a few days recently. It’s an old platform at this point, and I’m eager to see the Xbox One, but I was surprised by how limited and clunky the console’s speech capabilities were. Firstly, even in a quiet living room we found ourselves practically yelling at the console (“Xbox, Netflix!”) and laughing at how many tries it took before our command was recognized.
Secondly, the speech command tree is needlessly hierarchical. When using an Xbox controller to drive through menus by hand, you’re limited to a few distinct options on any screen: the ‘A’ button means “Go to the video screen” in one menu state and “Activate Netflix” in a deeper layer of the menu progression. The voice command feature makes you recreate these steps even though it should be able to skip the intermediate “Go to the video screen” step entirely. “Netflix” means Netflix regardless of which branch or level of the tree I’m currently in.
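To illustrate the flat-command idea, here’s a minimal Python sketch (the command names, actions, and the `dispatch` helper are all hypothetical, not anything from the Xbox’s actual software): a single global keyword table maps a recognized phrase straight to an action, with no notion of which menu the user happens to be in.

```python
# Hypothetical flat voice-command dispatcher: any recognized keyword
# triggers its action directly, regardless of current menu state.

ACTIONS = {
    "netflix": "launch_netflix",
    "weather": "show_weather",
    "market": "show_market_summary",
}

def dispatch(utterance):
    """Return the action for the first known keyword in the utterance,
    or None if nothing matches."""
    for word in utterance.lower().split():
        # Strip trailing punctuation so "Netflix!" still matches "netflix".
        word = word.strip(",.!?")
        if word in ACTIONS:
            return ACTIONS[word]
    return None

# "Netflix" means Netflix no matter where we are in the menu tree:
print(dispatch("Xbox, Netflix!"))          # launch_netflix
print(dispatch("what's the market doing?")) # show_market_summary
```

In a hierarchical design, by contrast, the table of valid keywords would change with every screen, which is exactly the extra bookkeeping a voice interface has no reason to impose.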
Finally, while playing a video the Xbox would often interpret a show’s dialogue as a command. Why the developers didn’t subtract the audio output from the voice-command input, I will never know. Regardless, both obvious and inexplicable phrases would at times pause, stop, or rewind a show.
The state of voice control today is analogous to that of soft keyboards. On a bit of a spree, I also recently watched 2010: The Year We Make Contact and noted how the set was crammed with old-looking keyboards. If the film were made today, each of those physical keyboards would be replaced by a virtual keyboard on a screen or glass surface; the control center in the recent film Oblivion comes to mind. We have the technology today, but how many of us use soft keyboards for real work? My first iPad accessory was a Bluetooth keyboard; I can’t imagine typing anything substantial on a screen; and where are all the awesome laser-projected virtual keyboards that have floated around the internet for years?
Soft keyboards are largely relegated to movies because they simply aren’t as effective as traditional keyboards in most applications. They are perceptibly laggy and tactilely ambiguous. Soft keyboards’ sole domain is contexts where alternatives aren’t practical (e.g., a small phone form factor, a wall-mounted interactive display, etc.). Voice control is in the same position. I recently heard a radio ad for Nuance’s Dragon NaturallySpeaking. The accessibility use case (for instance, for people missing fingers or otherwise unable to type) is so prominent that it merited a mention in the scarce time of a 15-second spot, even though those individuals comprise a small minority of all computer users. Yes, the ad also mentioned ease of use and other arguments of futuristic simplicity, but these selling points aren’t compelling enough for most consumers.
While I’m certain that voice control will improve dramatically in the coming years, I’ll refrain from yelling at my TV and stick to regular remote controls.