
An Introduction to Speech Synthesis Markup Language

From the advent of Microsoft Sam, we've marveled at how computers have talked with us. Now it's time to teach them to talk to us better. SSML can make that happen.

by Chris Ward · Apr. 21, 17 · Tutorial

Speech synthesis is not a new technology (computers have been attempting to speak to us for decades), but with the recent rise of voice-activated appliances, it is undergoing a renaissance. At more than one meetup I heard Speech Synthesis Markup Language (SSML) mentioned as a way to model computerized speech and thought it warranted further investigation.

The W3C introduced SSML in September 2004, basing it on the JSML and JSGF specifications, which are owned by Sun. It’s an XML-based markup language that defines passages of text and a voice to use to speak them, and allows for ‘prosody’, or the tone or accent of words.

The structure of a passage of spoken text consists of XML elements. A parent speak element declares the XML namespace and a default language. Optional p (paragraph) and s (sentence) elements then let you define the structure of the text, and on each of them you can also change the language spoken.

Inside these elements are voice elements that let you use a predefined voice that affects the way text is spoken. You can change the gender, age, variant, and language; all are optional, and the spec lists the available values. You can combine languages between the structural and voice elements to create spoken text with an accent, for example, English that sounds like it’s spoken by a Spanish person.

Inside these elements, you can add a variety of elements to change the way certain passages are spoken. For example, emphasis elements add ‘stress’ to sections of text:

<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       version="1.1" xml:lang="en">
  <metadata>
    <dc:title xml:lang="en">Hello readers</dc:title>
  </metadata>

  <p>
    <s xml:lang="en-GB">
      <voice name="David" gender="male" age="25">
        Good day, is it <emphasis>tea time?</emphasis>
      </voice>
    </s>
    <s xml:lang="en-US">
      <voice name="David" gender="male" age="25">
        Hey there, want some <emphasis>pie</emphasis>?
      </voice>
    </s>
  </p>
</speak>
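
The same speak, p, s, and voice nesting can also be generated in code rather than written by hand. What follows is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module; `build_ssml` and its parameters are hypothetical names chosen for illustration, not part of any SSML library, and the metadata and emphasis elements are left out to keep it short.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace behind the xml: prefix


def build_ssml(sentences, voice_name="David", default_lang="en"):
    """Build speak > p > s > voice from a list of (language_tag, text) pairs.

    Hypothetical helper for illustration only.
    """
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.1", f"{{{XML_NS}}}lang": default_lang},
    )
    paragraph = ET.SubElement(speak, f"{{{SSML_NS}}}p")
    for lang, text in sentences:
        # Each sentence carries its own language, overriding the default.
        sentence = ET.SubElement(
            paragraph, f"{{{SSML_NS}}}s", {f"{{{XML_NS}}}lang": lang}
        )
        voice = ET.SubElement(sentence, f"{{{SSML_NS}}}voice", {"name": voice_name})
        voice.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml([
        ("en-GB", "Good day, is it tea time?"),
        ("en-US", "Hey there, want some pie?"),
    ]))
```

Building the tree this way means the serializer takes care of escaping and closing tags, which is one less source of malformed markup.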


In addition to basic emphasis, you can use the prosody element and its attributes to control:

  • pitch
  • contour
  • pitch range
  • rate
  • duration
  • volume

For example:

…
<s xml:lang="en-US">
  <voice name="David" gender="male" age="25">
    Hey there, want some <prosody pitch="high" rate="slow">pie</prosody>?
  </voice>
</s>
…


Or add pauses for dramatic effect:

…
<s xml:lang="en-US">
  <voice name="David" gender="male" age="25">
    Hey there, want some <break time="3s" /> pie?
  </voice>
</s>
…
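
Fragments like the prosody and break examples above mix plain text with inline elements, which is easy to get wrong when assembling markup from strings. The sketch below, assuming Python's standard xml.etree.ElementTree, shows one way to do it; the helper names are hypothetical, and the elements are left without a namespace because these are fragments rather than full documents. Note that the SSML attribute for "pitch range" is named `range`.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace behind the xml: prefix

# Attribute names of the prosody element ("pitch range" is spelled `range`).
PROSODY_ATTRIBUTES = {"pitch", "contour", "range", "rate", "duration", "volume"}


def prosody_sentence(before, target, after, lang="en-US", **prosody):
    """Return an s element (as a string) with `target` wrapped in prosody."""
    unknown = set(prosody) - PROSODY_ATTRIBUTES
    if unknown:
        raise ValueError(f"not a prosody attribute: {sorted(unknown)}")
    sentence = ET.Element("s", {f"{{{XML_NS}}}lang": lang})
    sentence.text = before            # text before the inline element
    wrapped = ET.SubElement(sentence, "prosody", prosody)
    wrapped.text = target
    wrapped.tail = after              # text after the inline element's closing tag
    return ET.tostring(sentence, encoding="unicode")


def break_sentence(before, after, time, lang="en-US"):
    """Return an s element (as a string) with a pause between two pieces of text."""
    sentence = ET.Element("s", {f"{{{XML_NS}}}lang": lang})
    sentence.text = before
    pause = ET.SubElement(sentence, "break", {"time": time})
    pause.tail = after
    return ET.tostring(sentence, encoding="unicode")


if __name__ == "__main__":
    print(prosody_sentence("Hey there, want some ", "pie", "?", pitch="high", rate="slow"))
    print(break_sentence("Hey there, want some ", " pie?", "3s"))
```

Rejecting unknown attribute names up front catches typos such as `speed` before the markup ever reaches a synthesizer.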


And there’s much more you can control with other elements and attributes; read the full W3C specification to find out more.

Testing SSML

Great, now you know how to create SSML. As it’s XML-based, you can use a plethora of existing tools to validate the file, but this is audio, so you want to hear how it sounds.
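
As a first pass before listening, a few lines of code can confirm that a file is at least well-formed XML with the expected root element. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree; `check_ssml` is a hypothetical helper, and it does not validate against the SSML schema, so it only catches basic mistakes.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def check_ssml(text):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        # Covers unclosed tags, undeclared namespace prefixes, and similar errors.
        return [f"not well-formed XML: {err}"]
    problems = []
    if root.tag != f"{{{SSML_NS}}}speak":
        problems.append(
            f"root element is {root.tag!r}, expected speak in the SSML namespace"
        )
    if "version" not in root.attrib:
        problems.append("speak element has no version attribute")
    return problems


if __name__ == "__main__":
    sample = (
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.1" '
        'xml:lang="en"><p><s>Hello readers</s></p></speak>'
    )
    print(check_ssml(sample))
```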

Considering the pedigree of the standard, the options for hearing the result are surprisingly limited unless you have access to hardware to perform real tests. I ended up using eSpeak, a CLI tool. GUIs are available for it, but I couldn’t get them to work.
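
For reference, eSpeak interprets SSML when run with its `-m` option, reads input from a file with `-f`, and writes a WAV file with `-w`. The sketch below wraps that in Python's standard subprocess module; the function names and the `hello.ssml` and `hello.wav` file names are hypothetical, and it assumes the `espeak` binary is on your PATH. How many SSML elements eSpeak actually honors is something to check against its own documentation.

```python
import shutil
import subprocess


def espeak_command(ssml_path, wav_path=None, binary="espeak"):
    """Build the argument list for speaking an SSML file with eSpeak."""
    # -m: interpret SSML markup, -f: read the text from a file
    command = [binary, "-m", "-f", ssml_path]
    if wav_path is not None:
        # -w: write the audio to a WAV file instead of playing it
        command += ["-w", wav_path]
    return command


def speak(ssml_path, wav_path=None):
    """Run eSpeak if it is installed; return False when the binary is missing."""
    command = espeak_command(ssml_path, wav_path)
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    print(" ".join(espeak_command("hello.ssml", "hello.wav")))
```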

IBM Watson supports SSML via its speech API, so in theory you can test some SSML features on its demo page, but I couldn’t figure out which elements you are able to use.
