Web-Standard Speech Recognition: W3C Report and Unofficial Spec
Nothing says 'truly modern technology' like speech recognition -- unless it's gesture recognition, but that requires more specialized, less common hardware (unless Kinect goes as far as Microsoft hopes).
If HTML5 is going to make the web truly modern, then it needs to go beyond 'making web development easier, faster, and better'. In particular, user experience lies squarely in HTML5's sights -- and yet, whenever I babble about some awesome new API, my non-developer friends will (reasonably) raise a skeptical eyebrow and snort, 'Until I can talk to Google, I won't be impressed.'
The W3C HTML Speech Incubator Group has taken the first steps to meeting this non-developer UX demand, aiming its ambitious eye at a truly voice-enabled web. (Incubator Groups define a need as precisely as possible; Working Groups try to meet the need as practically as possible.)
Last month the Group published its final report, which addresses the following general requirement (specified in the group's charter):
The mission of the HTML Speech Incubator Group, part of the Incubator Activity, is to determine the feasibility of integrating speech technology in HTML5 in a way that leverages the capabilities of both speech and HTML (e.g., DOM) to provide a high-quality, browser-independent speech/multimodal experience while avoiding unnecessary standards fragmentation or overlap.
Following HTML5's 'use-cases first' approach, the report listed these as the desired use-cases, in roughly prioritized order:
- Voice Web Search
- Speech Command Interface
- Domain Specific Grammars Contingent on Earlier Inputs
- Continuous Recognition of Open Dialog
- Domain Specific Grammars Filling Multiple Input Fields
- Speech UI present when no visible UI need be present
- Rerecognition
- Voice Activity Detection
- Temporal Structure of Synthesis to Provide Visual Feedback
- Hello World
- Speech Translation
- Speech Enabled Email Client
- Dialog Systems
- Multimodal Interaction
- Speech Driving Directions
- Multimodal Video Game
- Multimodal Search
Reaching deeper than the use-cases, a series of technical requirements was discussed and divided into 'strong interest', 'moderate interest', and 'mild interest'. The full list gets rather long, so I won't reproduce it in this article (see the requirements section of the report), but briefly:
- The 'strong interest' requirements focus on making speech recognition as smart, unobtrusive, and concurrent as possible: letting web apps specify domain-specific grammars, allowing audio processing before capture completes, and returning useful recognition results -- or informative no-match errors -- to the web app.
- The 'moderate interest' requirements include transport-specific technical matters and a few pie-in-the-sky items (the spec must include a mandatory codec free of IP issues; recognition without a specified grammar should be possible).
- The 'mild interest' requirements received less direct attention, in part because several are more or less implied by the 'strong interest' list.
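To make those 'strong interest' items a bit more concrete, here is a rough TypeScript sketch of the kind of result handling a web app might do once a recognizer can be handed a domain-specific grammar, stream partial results before capture finishes, and report no-match explicitly. Every name in it (RecognitionResult, RecognitionRequest, the grammar URL) is my own illustration, not taken from the report.

```typescript
// Illustrative shapes only: none of these names come from the report.
interface RecognitionResult {
  transcript: string;   // best hypothesis for what was said
  confidence: number;   // recognizer confidence, 0..1
  final: boolean;       // false for partial results delivered mid-capture
}

interface RecognitionRequest {
  grammarUrl: string;                     // domain-specific grammar, e.g. an SRGS document
  onResult(r: RecognitionResult): void;   // a grammar match (partial or final)
  onNoMatch(): void;                      // audio was heard, but nothing in the grammar matched
  onError(message: string): void;         // capture or network failure
}

// How a pizza-ordering page might wire itself up against such an API.
const request: RecognitionRequest = {
  grammarUrl: "/grammars/pizza-order.grxml",
  onResult(r) {
    const field = document.querySelector<HTMLInputElement>("#order");
    if (field) field.value = r.transcript;
    if (r.final) console.log(`Done (confidence ${r.confidence.toFixed(2)})`);
  },
  onNoMatch() {
    console.log("Didn't catch that -- please try again.");
  },
  onError(message) {
    console.error("Speech input failed:", message);
  },
};
```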
The final report is quite long, and proposes (rudiments of) both a JavaScript API and a specialized speech protocol. The API now has its own unofficial draft in standard W3C spec format (which is a little easier to read as a separate document); the speech protocol is defined as a sub-protocol of WebSockets, chiefly because WebSockets has already done a lot of the low-level duplexing work (which several of the 'strong interest' requirements demand).
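For a feel of the transport side, here is a minimal sketch of a page opening a duplex speech channel over WebSockets, with audio flowing up and recognition results flowing back down the same connection. The endpoint, the sub-protocol token, and the JSON result framing below are assumptions made for illustration; the report defines its own sub-protocol and message format.

```typescript
// Sketch only: the endpoint, "speech-1.0" token, and result framing are
// illustrative assumptions, not the sub-protocol defined in the report.
const socket = new WebSocket("wss://speech.example.com/reco", "speech-1.0");
socket.binaryType = "arraybuffer";

socket.addEventListener("open", () => {
  // Once negotiated, the same connection carries audio upstream and
  // recognition results downstream, which is the duplexing the report wants.
  console.log("Speech channel negotiated:", socket.protocol);
});

socket.addEventListener("message", (event) => {
  if (typeof event.data === "string") {
    // Hypothetical JSON result frame: { transcript, confidence, final }
    const result = JSON.parse(event.data);
    console.log("Recognition result:", result);
  }
});

function sendAudioChunk(chunk: ArrayBuffer): void {
  // Chunks can be pushed before capture completes, matching the
  // 'strong interest' requirement for early audio processing.
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(chunk);
  }
}
```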
So take a look at the report and see what you think. The API and protocol are a long way from their respective final states, no doubt. But if it seems to you that a legitimately bidirectional, voice-enabled web is a good thing, this first month after the report's publication is a great time to start contributing your own ideas.