Speech synthesis is not a new technology (computers have been attempting to speak to us for decades), but with the recent rise of voice-activated appliances, it is undergoing a renaissance. At more than one meetup I heard Speech Synthesis Markup Language (SSML) mentioned as a way of modeling computerized speech and thought it warranted further investigation.
The W3C introduced SSML in September 2004, basing it on the JSML and JSGF specifications, which are owned by Sun. It’s an XML-based markup language that defines passages of text and a voice to use to speak them, and allows for ‘prosody’, or the tone or accent of words.
The structure of a passage of spoken text consists of XML elements. A parent speak element declares the XML namespace and a default language. Inside it, optional p (paragraph) and s (sentence) elements let you define the structure of the text and, where needed, change the language spoken.
Inside these elements are voice elements that let you select a predefined voice, which affects the way text is spoken. You can set attributes such as the name, gender, age, and language; all are optional, and you can find out what values are available in the spec. You can combine languages between the structural and voice elements to create spoken text with an accent, e.g. English that sounds as if it’s spoken by a Spanish person.
Inside these elements, you can add a variety of further elements to change the way certain passages are spoken. For example, emphasis elements add ‘stress’ to sections of text:
<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       version="1.1" xml:lang="en-US">
  <metadata>
    <dc:title xml:lang="en">Hello readers</dc:title>
  </metadata>
  <p>
    <s xml:lang="en-GB">
      <voice name="David" gender="male" age="25">
        Good day, is it <emphasis>tea time?</emphasis>
      </voice>
    </s>
    <s xml:lang="en-US">
      <voice name="David" gender="male" age="25">
        Hey there, want some <emphasis>pie</emphasis>?
      </voice>
    </s>
  </p>
</speak>
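If you would rather generate this kind of document than write it by hand, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name build_ssml and its parameters are my own invention for illustration, and the voice attributes are simply copied from the example above; treat it as one possible approach rather than a recommended tool.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
# ElementTree serializes this well-known namespace with the "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_ssml(sentences, default_lang="en-US", voice_attrs=None):
    """Build a <speak> document from (language, text) pairs.

    Each pair becomes one <s> element inside a single <p>, and the text
    is wrapped in a <voice> element. The function name and parameters are
    illustrative assumptions, not part of any SSML library.
    """
    if voice_attrs is None:
        voice_attrs = {"name": "David", "gender": "male", "age": "25"}
    # Emit SSML as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak", {"version": "1.1", XML_LANG: default_lang}
    )
    paragraph = ET.SubElement(speak, f"{{{SSML_NS}}}p")
    for lang, text in sentences:
        sentence = ET.SubElement(paragraph, f"{{{SSML_NS}}}s", {XML_LANG: lang})
        voice = ET.SubElement(sentence, f"{{{SSML_NS}}}voice", dict(voice_attrs))
        voice.text = text
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(
    [
        ("en-GB", "Good day, is it tea time?"),
        ("en-US", "Hey there, want some pie?"),
    ]
)
print(ssml)
```

Building the tree programmatically means the output is always well-formed XML, which removes one class of error before you get anywhere near a speech engine.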
In addition to basic emphasis, you can use the prosody element and its attributes to control properties such as pitch, pitch range, and speaking rate:
…
<s xml:lang="en-US">
  <voice name="David" gender="male" age="25">
    Hey there, want some <prosody pitch="high" rate="slow">pie</prosody>?
  </voice>
</s>
…
Or add pauses for dramatic effect:
…
<s xml:lang="en-US">
  <voice name="David" gender="male" age="25">
    Hey there, want some <break time="3s" /> pie?
  </voice>
</s>
…
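Both snippets mix plain text with child elements, which is slightly awkward to produce from code. The sketch below shows one way to do it with Python's xml.etree.ElementTree, where text that follows a child element is stored in that child's tail. The helper names sentence_with_pause and sentence_with_prosody are hypothetical, and I have left out the SSML namespace and the voice wrapper to keep the example short.

```python
import xml.etree.ElementTree as ET


def sentence_with_pause(before, after, pause="3s"):
    """Return an <s> element: text, a <break> of the given length, more text."""
    sentence = ET.Element("s")
    sentence.text = before
    pause_el = ET.SubElement(sentence, "break", {"time": pause})
    # Text that comes after a child element is stored in the child's tail.
    pause_el.tail = after
    return sentence


def sentence_with_prosody(before, word, after, **attrs):
    """Return an <s> element with one word wrapped in <prosody>."""
    sentence = ET.Element("s")
    sentence.text = before
    prosody = ET.SubElement(sentence, "prosody", attrs)
    prosody.text = word
    prosody.tail = after
    return sentence


pause_example = sentence_with_pause("Hey there, want some ", " pie?")
prosody_example = sentence_with_prosody(
    "Hey there, want some ", "pie", "?", pitch="high", rate="slow"
)
print(ET.tostring(pause_example, encoding="unicode"))
print(ET.tostring(prosody_example, encoding="unicode"))
```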
There’s much more you can control with other elements and attributes; read the full W3C specification to find out more.
Great, now you know how to create SSML. As it’s XML-based, you can use a plethora of existing tools to validate the file, but this is audio, so you want to hear how it sounds.
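As a rough illustration of the validation step, the sketch below uses Python's standard xml.etree.ElementTree module to check that a string is well-formed XML and that its root is a speak element in the SSML namespace with a version attribute. The function name check_ssml is my own, and this is only a basic sanity check under those assumptions, not validation against the official SSML schema.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def check_ssml(text):
    """Return a list of problems found; an empty list means the checks passed.

    Only well-formedness and two properties of the root element are
    checked. This is not full schema validation.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    problems = []
    if root.tag != f"{{{SSML_NS}}}speak":
        problems.append(
            f"root element is {root.tag!r}, expected speak in the SSML namespace"
        )
    if "version" not in root.attrib:
        problems.append("speak element has no version attribute")
    return problems


good = (
    '<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.1" '
    'xml:lang="en-US"><p><s>Hello readers</s></p></speak>'
)
bad = "<speak><p>Unclosed paragraph</speak>"

print(check_ssml(good))
print(check_ssml(bad))
```

The first call should report no problems, while the second should report a parse error because the p element is never closed.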
Considering the pedigree of the standard, the options for testing are limited unless you have access to hardware that supports it. I ended up using eSpeak, a CLI tool. GUIs are available for it, but I couldn’t get them to work.
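For reference, here is a small sketch of driving eSpeak from Python. It relies on three espeak options: -m to interpret SSML markup, -f to read the text from a file, and -w to write a WAV file instead of playing the audio. The file names and the helper names espeak_command and speak_file are hypothetical, and how many SSML elements eSpeak actually honors is something to check for yourself.

```python
import shutil
import subprocess


def espeak_command(ssml_path, wav_path=None):
    """Build an espeak command line that reads SSML from a file."""
    # -m: interpret SSML markup; -f: read the text to speak from a file.
    cmd = ["espeak", "-m", "-f", ssml_path]
    if wav_path:
        # -w: write the audio to a WAV file instead of playing it.
        cmd += ["-w", wav_path]
    return cmd


def speak_file(ssml_path, wav_path=None):
    """Run espeak if it is installed; return True if it was run."""
    if shutil.which("espeak") is None:
        print("espeak not found on PATH; skipping")
        return False
    subprocess.run(espeak_command(ssml_path, wav_path), check=True)
    return True


# Hypothetical file names, shown only to illustrate the resulting command.
print(" ".join(espeak_command("hello.ssml", "hello.wav")))
```

Writing to a WAV file makes it easier to compare how different versions of the same markup sound.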
IBM Watson also supports SSML via its speech API, so in theory you can test some SSML features on the demo page, but I couldn’t figure out which elements you are able to use.