Transition of Siri’s Voice From Robotic to Human: Note the Difference
AI and deep learning are strengthening their roots and being used more to develop virtual personal assistants. Learn how they've been used to improve Siri's voice.
If you use iOS, how many times a day do you talk to Siri? Probably quite a few. If you are a keen observer, you know that Siri’s voice sounds much more human in iOS 11 than it did before. This is because Apple is digging deeper into artificial intelligence, machine learning, and deep learning to offer its users the best personal assistant experience.
From its introduction with the iPhone 4S through iOS 11, this personal assistant has evolved to sound and feel closer to a human. To reply to users’ voice commands, Siri uses speech synthesis combined with deep learning.
Speech Synthesis: An Integral Part of Siri’s Functioning
Speech synthesis is the artificial production of human speech. The technology is essential in several domains, including virtual personal assistants, games, and entertainment. Its two basic models, unit selection and parametric synthesis, have both seen several advancements, and deep learning has now made significant inroads into the field.
The integration of deep learning into speech synthesis has given rise to a new model known as direct waveform modeling, which has the potential to combine the high quality of unit selection synthesis with the flexibility of parametric synthesis.
Apple utilizes the power of deep learning in hybrid unit selection systems in order to get the highest-quality voice output for Siri.
How the Text-to-Speech System (TTS) Works
Building the TTS system involves three steps: recording human speech that covers as many kinds of responses as possible, splitting the recordings into speech units, and applying machine learning.
Recording the Voices of Humans for Possible Instances
The first major task in building a text-to-speech system for a virtual personal assistant is to record a human voice. The voice should be pleasant to hear and also easy for everyone to understand.
To cover the variety of human speech, approximately 20 hours of speech must be recorded in a professional studio. The recordings include almost all types of responses, such as narrating instructions, dictating weather reports, and telling jokes. These audio clips cannot be used as they are, because there is no limit to the kinds of questions a user may ask the personal assistant. Instead, the recorded responses are processed so that the system can learn from them.
Bifurcation of Speech Units
The recorded speech is divided into small components, which are later joined together according to the input text to create the response. Choosing segments that fit together is not easy: the acoustic characteristics of each phone (here, a unit of speech sound rather than a device) depend on its neighboring phones and on the prosody of the speech, so units often turn out to be incompatible with each other.
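The divide-and-rejoin step described above can be sketched as a search for the cheapest sequence of recorded units. The following is a minimal illustration, not Apple's implementation: the `Unit` fields, the pitch-only cost functions, and the tiny database are all assumptions made for the example. It scores each candidate by how well it fits the desired sound (target cost) and how smoothly it joins its neighbor (join cost), then picks the best path with dynamic programming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded snippet of a speech sound (all fields are illustrative)."""
    phone: str     # the speech sound this snippet contains
    pitch: float   # average pitch in Hz
    clip_id: str   # hypothetical identifier of the source recording


def target_cost(unit, wanted_pitch):
    """How far the unit is from the pitch wanted at this position."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    """How audible the seam between two neighboring units would be."""
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Pick one unit per target so that total target + join cost is minimal.

    targets:  list of (phone, wanted_pitch) pairs
    database: dict mapping phone -> non-empty list of candidate Unit objects
    """
    if not targets:
        return []

    # best[i][j] = (cheapest total cost of a path ending in candidate j
    #               at position i, index of the previous candidate)
    best = []
    for i, (phone, wanted) in enumerate(targets):
        row = []
        for unit in database[phone]:
            cost = target_cost(unit, wanted)
            if i == 0:
                row.append((cost, None))
            else:
                prev_units = database[targets[i - 1][0]]
                prev_cost, prev_idx = min(
                    (best[i - 1][k][0] + join_cost(prev_units[k], unit), k)
                    for k in range(len(prev_units))
                )
                row.append((cost + prev_cost, prev_idx))
        best.append(row)

    # Walk the back-pointers from the cheapest final candidate.
    idx = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(database[targets[i][0]][idx])
        idx = best[i][idx][1]
    path.reverse()
    return path


# Made-up two-phone example: two recorded candidates per sound.
database = {
    "h": [Unit("h", 120.0, "a"), Unit("h", 180.0, "b")],
    "ay": [Unit("ay", 125.0, "c"), Unit("ay", 200.0, "d")],
}
chosen = select_units([("h", 120.0), ("ay", 130.0)], database)
print([u.clip_id for u in chosen])
```

A real system would compare far richer acoustic features than a single pitch value, and the number of candidates per sound is what makes the search expensive on a phone.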
Use of Machine Learning
Though it sounds like a routine process, getting the pattern of stress and intonation (prosody) right is difficult for developers. Further, stringing units together this way is computationally too heavy for a mobile phone.
Machine learning addresses these challenges to an extent. Given enough training data, the text-to-speech system can learn these patterns and how to segment the audio so that the output sounds natural and human-like.
Apple’s Efforts in Improving Siri’s Voice
Once they decided to work rigorously to improve Siri’s voice, engineers at Apple worked with a female voice actor to record 20 hours of speech in US-accented English. The recordings were split into 1-2 million audio segments, which were then used to train the deep learning system.
Next, they tested the output by asking test subjects to choose between the previous and new voices of Siri. The majority preferred the new, more natural and human-like voice. They noticed a clear shift from a robotic to a natural voice when Siri responded to trivia questions, announced "request completed" notifications, and gave navigation instructions.
The following graph shows the result of AB pairwise subjective listening tests:
Moreover, the test subjects felt that this voice perfectly matches the "personality" of Siri. iOS app development service providers are studying this technology to learn how they can use it to build more innovative apps.
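For readers unfamiliar with AB pairwise listening tests, the tally behind such a result is simple: each listener hears the same sentence in both voices and votes for one, or for neither. The sketch below shows that arithmetic only. The function name `ab_preference`, the vote labels, and the sample votes are all made up for illustration and are not Apple's data.

```python
from collections import Counter

CHOICES = ("A", "B", "none")  # "none" means the listener had no preference


def ab_preference(votes):
    """Return the share of comparisons won by each option.

    votes: iterable of "A", "B", or "none", one entry per comparison.
    """
    counts = Counter(votes)
    unknown = set(counts) - set(CHOICES)
    if unknown:
        raise ValueError(f"unexpected vote labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes to tally")
    return {choice: counts[choice] / total for choice in CHOICES}


# Hypothetical votes: A = previous voice, B = new voice.
sample_votes = ["B", "B", "A", "none"]
print(ab_preference(sample_votes))
```

Reporting the "no preference" share separately matters, because folding it into either side would overstate how decisive the result is.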
When Will Users Get to Experience the New Voice of Siri?
iPhone 8 will be the first Apple phone to ship with iOS 11 and the new voice of Siri. The latest iPad release will also feature the new personal assistant voice. Apple never stops experimenting with technology to discover new possibilities. Now that Siri’s voice has improved, Apple is watching how end users react.
Artificial intelligence and deep learning are strengthening their roots in virtual personal assistants and other applications. The future seems quite bright for these technologies, as people are reacting positively to them.