Web-Standard Speech Recognition: W3C Report and Unofficial Spec

By John Esposito · Jan. 06, 2012

Nothing says 'truly modern technology' like speech recognition -- unless it's gesture recognition, but that requires more specialized, less common hardware (unless Kinect goes as far as Microsoft hopes).

If HTML5 is going to make the web truly modern, then it needs to go beyond 'making web development easier, faster, and better'. In particular, user experience lies squarely in HTML5's sights -- and yet, whenever I babble about some awesome new API, my non-developer friends will (reasonably) raise a skeptical eyebrow and snort, 'Until I can talk to Google, I won't be impressed.'

The W3C HTML Speech Incubator Group has taken the first steps toward meeting this non-developer UX demand, aiming its ambitious eye at a truly voice-enabled web. (Incubator Groups define a need as precisely as possible; Working Groups then try to meet the need as practically as possible.)

Last month the Group published its final report, which satisfied the following general requirement (specified in the group's charter):

The mission of the HTML Speech Incubator Group, part of the Incubator Activity, is to determine the feasibility of integrating speech technology in HTML5 in a way that leverages the capabilities of both speech and HTML (e.g., DOM) to provide a high-quality, browser-independent speech/multimodal experience while avoiding unnecessary standards fragmentation or overlap.


Following HTML5's 'use-cases first' approach, the report listed these as the desired use-cases, in roughly prioritized order:

  • Voice Web Search
  • Speech Command Interface
  • Domain Specific Grammars Contingent on Earlier Inputs
  • Continuous Recognition of Open Dialog
  • Domain Specific Grammars Filling Multiple Input Fields
  • Speech UI present when no visible UI need be present
  • Rerecognition
  • Voice Activity Detection
  • Temporal Structure of Synthesis to Provide Visual Feedback
  • Hello World
  • Speech Translation
  • Speech Enabled Email Client
  • Dialog Systems
  • Multimodal Interaction
  • Speech Driving Directions
  • Multimodal Video Game
  • Multimodal Search

 
Reaching deeper than use-cases, the report discusses a series of technical requirements, divided into 'strong interest', 'moderate interest', and 'mild interest'. The full list gets rather long, so I won't reproduce it in this article (click here for the appropriate section of the report), but in brief:

  • The 'strong interest' requirements focus on making speech recognition as smart, unobtrusive, and concurrent as possible: letting web apps specify domain-specific grammars, allowing audio processing before capture completes, and returning useful recognition results or no-match errors to the web app.
  • The 'moderate interest' requirements include transport-specific technical matters and a few pie-in-the-sky impracticalities: the spec must include a mandatory codec free of IP issues, and recognition without a specified grammar should be possible.
  • The 'mild interest' requirements received less direct attention, in part because several were more or less implied by the 'strong interest' list.
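
To make the 'strong interest' items a little more concrete, here's a minimal sketch of what grammar-driven recognition with proper error reporting might look like from a web app's point of view. Everything below (SpeechInputRequest, addGrammar, the event handler names, the grammar URI) is an illustrative placeholder, not an interface taken verbatim from the report:

```javascript
// Hypothetical sketch only: SpeechInputRequest, addGrammar, and the
// handler names are illustrative placeholders, not the report's API.

var request = new SpeechInputRequest();

// 'Strong interest': the web app supplies a small domain-specific grammar
// (here an SRGS document by URI) so the recognizer only has to choose
// among a handful of expected utterances.
request.addGrammar('https://example.com/grammars/pizza-order.grxml');

// Useful results should come back to the web app as structured data...
request.onresult = function (event) {
  console.log('Recognized: ' + event.result.utterance);
};

// ...and so should useful errors: a no-match (the user said something
// outside the grammar) is distinct from a capture or transport failure.
request.onnomatch = function () {
  console.log('Speech detected, but nothing in the grammar matched.');
};
request.onerror = function (event) {
  console.log('Recognition error: ' + event.code);
};

// 'Allow audio processing before capture completion': starting recognition
// shouldn't block until the user stops speaking; interim results may
// arrive while audio is still being captured.
request.start();
```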

The final report is quite long, and proposes (rudiments of) both a JavaScript API and a specialized speech protocol. The API now has its own unofficial draft in standard W3C spec format (which is a little easier to read as a separate document); the speech protocol is defined as a sub-protocol of WebSockets, chiefly because WebSockets has already done much of the low-level duplexing work that several of the 'strong interest' requirements demand.
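
The transport side is easy to sketch with the standard browser WebSocket API, which negotiates sub-protocols via the constructor's second argument. Note that the sub-protocol token and service URL below are placeholders, and the report defines its own message framing that this sketch doesn't reproduce:

```javascript
// Sketch of the transport side using the standard WebSocket API.
// 'html-speech' and the service URL are placeholders; the report defines
// its own sub-protocol name and message framing.

var socket = new WebSocket('wss://speech.example.com/reco', 'html-speech');
socket.binaryType = 'arraybuffer'; // audio travels up as binary frames

socket.onopen = function () {
  // The full-duplex channel is the whole point: the page streams captured
  // audio up while the service streams interim results back down.
  // A hypothetical capture pipeline would feed frames in here, e.g.:
  // socket.send(captureNextChunk());
};

socket.onmessage = function (event) {
  if (typeof event.data === 'string') {
    // Text frames carry control messages and recognition results.
    console.log('From speech service: ' + event.data);
  }
};
```

Riding on WebSockets means the handshake, framing, and duplexing problems are already solved; the report only has to define the messages that flow over the channel.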

So take a look at the report and see what you think. The API and protocol are no doubt a long way from their final states. But if a genuinely bi-directional, voice-enabled web sounds like a good thing to you, this first month after the report's publication is a great time to start contributing your own ideas.
