
Voice Synthesis: Evolution, Ethics, and Law

Roman Garin, Senior Vice President @ Innovation, Sportradar

This article traces the evolution of voice synthesis and explores its far-reaching legal implications.

By Roman Garin · Nov. 22, 2023 · 2.8K Views

Voice synthesis technology has progressed remarkably from early mechanical experiments to present-day AI systems capable of natural, human-like speech. Modern applications span accessibility, education, entertainment, communication, and information retrieval, enhancing user experiences on platforms such as smart speakers and chatbots. This article traces the evolution of voice synthesis and explores its far-reaching legal implications as the technology continues advancing.

A Long History Leading to Recent Advances

The history of artificially generating human speech can be split into three main eras: mechanical, electronic, and digital. The mechanical era involved physical devices like bellows and keyboards that manipulated sounds to mimic speech, such as von Kempelen's 1769 acoustic-mechanical machine. The electronic era used electricity and components like filters and amplifiers to generate more lifelike vocal sounds, like Bell Labs' 1939 Voder. The digital era, enabled by computers, revolutionized synthesis through software algorithms and datasets. Early digital systems like the Parametric Artificial Talker (PAT) used mathematical models and parameters to control synthetic speech. Later systems like MIT's 1980 Klatt synthesizer used linguistic rules and tables.

Within the digital era, two main approaches emerged: concatenative and statistical parametric. Concatenative systems stitch together snippets of real human voices, while parametric systems use models and parameters to mathematically generate speech. Concatenative systems can sound more natural but require more data, while parametric systems are more flexible but may sound robotic.
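To make the contrast concrete, here is a toy Python sketch of the concatenative idea: pre-recorded unit waveforms are looked up by name and stitched together with a short crossfade at each joint. The unit inventory, the unit names, and the sine-wave "recordings" are all invented stand-ins for illustration, not a real system.

```python
import numpy as np

SR = 16_000  # sample rate (Hz)

def fake_unit(freq, dur=0.1):
    """Stand-in for a recorded speech unit: a short sine burst."""
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t).astype(np.float32)

# Hypothetical unit inventory; real systems store thousands of diphones.
UNITS = {"h-e": fake_unit(220), "e-l": fake_unit(330), "l-o": fake_unit(440)}

def concatenate(unit_names, fade=160):
    """Stitch units together, crossfading `fade` samples at each joint."""
    out = UNITS[unit_names[0]].copy()
    ramp = np.linspace(0.0, 1.0, fade, dtype=np.float32)
    for name in unit_names[1:]:
        nxt = UNITS[name]
        # Convex blend of the old tail and the new head avoids clicks.
        out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out

audio = concatenate(["h-e", "e-l", "l-o"])
print(len(audio))  # three 1600-sample units minus two 160-sample overlaps
```

The quality of real concatenative systems hinges on how well unit boundaries match, which is why they need large, carefully segmented voice databases.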

Recently, AI and deep learning have achieved major advances in voice synthesis, like Google DeepMind's 2016 WaveNet using neural networks to model speech waveforms directly. Other innovations include Tacotron, Transformer-TTS, and FastSpeech neural architectures from Google, Baidu, and Microsoft, as well as generative flow models like Glow-TTS. These systems can produce increasingly human-like, natural, and expressive synthetic speech in different languages and voices.
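One concrete detail from this line of work: the original WaveNet paper models audio as a sequence of 8-bit samples, compressing each value with a mu-law companding transform into 256 bins before the network predicts the next one. A minimal sketch of that encode/decode step (the neural network itself is omitted):

```python
import numpy as np

MU = 255  # 8-bit mu-law, as used in the WaveNet paper

def mu_law_encode(x, mu=MU):
    """Map samples in [-1, 1] to integer bins 0..mu."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, mu=MU):
    """Invert the companding back to floats in [-1, 1]."""
    y = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1, 1, 5)
bins = mu_law_encode(x)
print(bins)
print(np.round(mu_law_decode(bins), 3))
```

The companding allocates more bins to quiet samples, matching how hearing perceives loudness, which keeps 256 levels sufficient for intelligible speech.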

Here are some notable real-world examples of voice cloning and speech synthesis applications (as of late 2023):

  • Descript is a platform founded in 2017 that uses AI to let users edit audio and video files like text. It can also generate synthetic voices from user recordings to correct mistakes, add new content, or alter speech style and tone. 
  • ElevenLabs, founded in 2022, creates personalized, expressive synthetic voices for gaming, education, entertainment, and healthcare. It uses deep learning to clone and customize voices from minutes of speech, with controls for emotion, pitch, speed, etc.
  • Coqui.ai is a startup founded in 2021 dedicated to developing open-source tools for text-to-speech and speech-to-text. It aims to make voice technology affordable and accessible, especially for underrepresented languages. Coqui.ai was founded by former Microsoft and Mozilla researchers and has support from Mozilla, Google, GitHub, and others.

AI Unlocks New Capabilities

AI has enabled major advancements in speech synthesis, making computer-generated voices sound far more human-like and expressive. Key innovations include:

  • Neural voice cloning: This uses deep learning to clone a person's voice from just a small sample of their speech. It allows the creation of personalized voices for digital assistants, the revival of fictional characters, and the preservation of voices that might otherwise be lost.
  • Neural voice conversion: This transforms the voice of one speaker into another while keeping the content unchanged. It enables applications like voice style transfer, voice enhancement, and cross-gender/cross-language voice conversion.
  • Neural voice synthesis: This uses AI to generate lifelike synthetic speech from text input. Systems like Google's WaveNet and Amazon's Polly can synthesize natural voices in different languages, accents, and tones, with nuanced emotions and prosody.

Together, these advancements in neural voice modeling are enabling more human-sounding text-to-speech, new forms of audio creation, and preserving voices for future generations. The rapid progress shows the transformative impact AI is having on the naturalness and creativity of synthesized speech.
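As a concrete (and hedged) illustration of using such a system: Amazon Polly exposes neural synthesis through its synthesize_speech API, which takes a text, a voice, an engine, and an output format. The helper below only assembles the request arguments, so no AWS account is needed to run it; the commented-out call shows roughly how it would be used with boto3 and valid credentials. "Joanna" is simply one of Polly's documented example voices.

```python
def build_polly_request(text, voice_id="Joanna", engine="neural",
                        output_format="mp3"):
    """Assemble keyword arguments for Polly's synthesize_speech call."""
    return {
        "Text": text,
        "VoiceId": voice_id,          # one of Polly's documented voices
        "Engine": engine,             # "neural" selects the neural TTS engine
        "OutputFormat": output_format, # "mp3", "ogg_vorbis", or "pcm"
    }

req = build_polly_request("Voice synthesis has come a long way.")
print(sorted(req))

# With AWS credentials configured, the actual call would be roughly:
#   import boto3
#   polly = boto3.client("polly")
#   audio = polly.synthesize_speech(**req)["AudioStream"].read()
```

Google Cloud Text-to-Speech and Microsoft Azure Speech offer similar request shapes, differing mainly in how voices and audio encodings are named.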

Confronting Social and Ethical Challenges

Voice synthesis technology has many potential benefits, like improving accessibility, education, entertainment, and communication. However, it also raises ethical issues we must thoughtfully address. Synthesized voices could spread misinformation by impersonating real people or manipulating emotions. Deepfakes of public figures could damage reputations or sway elections. Voice phishing could trick people into revealing private details or transferring money.

We must also consider how synthesis impacts privacy and identity. Voices could be collected or cloned without consent to infringe on privacy or steal identities. Users might alter their voice in ways that affect self-perception and social connections.

Additionally, synthesis challenges our ability to trust and evaluate information. It may become difficult to confirm if speech is real or synthetic, authenticate the source, or detect edits. The technology could generate misleading content that lacks the nuance of human interaction.

As voice synthesis advances, we need open discussions on responsible development and use that respects human dignity. With care, we can maximize benefits and mitigate risks. But we must thoughtfully consider the technology's implications for truth, trust, and our shared humanity.

Updating Laws and Regulations

Voice synthesis technology is rapidly improving, raising new legal and regulatory issues. For example, who owns the intellectual property rights to synthesized voices? If a company creates a synthesized version of a celebrity's voice for a commercial, who controls the rights — the celebrity or the company? There are also consent issues to consider. Can a company synthesize a person's voice without their permission? And who is liable if synthesized voices are misused, like for fraud or defamation?

Current laws weren't designed for synthesized voices. They're outdated, inconsistent across jurisdictions, or inadequate. New legal frameworks are needed to balance the interests of those affected. For example, intellectual property laws could be updated to address synthesized voices. New laws specific to voice synthesis could be created, like voice cloning laws. Regulatory bodies overseeing voice synthesis could be established to create standards.

Self-regulation and best practices are other options. Companies could voluntarily adopt codes of conduct for ethically synthesizing voices. They could implement transparency measures, like disclosing when a voice is synthesized. As voice synthesis advances, balancing the interests of companies, individuals, and society will require proactive, collaborative solutions.

Advancing Voice Authentication

Voice authentication and verification refer to the processes of confirming a speaker's identity and authenticity using voice biometrics and other techniques. These are important for securing communication and information involving speech. Some key methods and applications include:

  • Speaker recognition identifies speakers by analyzing vocal characteristics like pitch and accent. This can be used for access control, ID verification, and forensics. Technologies like Microsoft's Speaker Recognition API allow the integration of speaker recognition into apps.
  • Speech recognition transcribes speech into text by examining words, phrases, grammar, etc. This enables transcription, translation, captioning, and verifying content and context. Google's Speech-to-Text API converts audio to text using deep learning. Amazon Transcribe provides high-accuracy, low-latency speech-to-text.
  • Speech synthesis detection distinguishes synthetic speech from real speech by looking at spectral, prosodic, and articulatory cues. This helps assess quality, moderate content, and prevent fraud. It can also identify the source and type of synthetic speech and compare it to real speech. For instance, the ASVspoof challenge datasets aid anti-spoofing research in speaker verification. Another example is Resemblyzer, which measures voice similarity using neural networks.
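To show what a "spectral cue" might look like in practice, here is a toy detector based on spectral flatness: strongly harmonic (voiced-like) signals score near 0, noise-like signals near 1. Real anti-spoofing systems use far richer features and learned classifiers; the 0.5 threshold and the test signals here are arbitrary and purely illustrative.

```python
import numpy as np

def spectral_flatness(x, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum (0..1)."""
    p = np.abs(np.fft.rfft(x)) ** 2 + eps
    return np.exp(np.mean(np.log(p))) / np.mean(p)

def looks_noise_like(x, threshold=0.5):
    """Crude cue: flatness near 1 is noise-like, near 0 is tonal/voiced."""
    return spectral_flatness(x) > threshold

sr = 16_000
t = np.arange(sr) / sr
# Three harmonics of 120 Hz mimic a voiced sound; white noise is the foil.
harmonic = sum(np.sin(2 * np.pi * f * t) for f in (120, 240, 360))
noise = np.random.default_rng(0).standard_normal(sr)

print(looks_noise_like(harmonic), looks_noise_like(noise))
```

A single scalar like this is far too weak to catch modern neural synthesis, which is why deployed detectors combine many cues with trained models.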

Enabling Responsible Innovation

Voice synthesis technology crosses borders and jurisdictions, so international cooperation and regulation are needed to address shared challenges and opportunities. Examples include:

  • Developing international standards so systems are compatible and reliable worldwide
  • Promoting research collaboration and knowledge exchange among developers globally
  • Ensuring ethical development that respects human rights and dignity
  • Fostering innovation through initiatives that bring together stakeholders across sectors and regions

Global organizations like the UN, ISO, and IEEE can facilitate standards development. Funding programs like the EU's Horizon 2020 can enable international innovation. Advocacy groups like AI4People can champion ethical principles for the technology. With coordinated efforts across nations, voice synthesis can progress responsibly and benefit people equitably around the world.

Conclusion

Voice synthesis technology has advanced impressively from its early beginnings to today's AI-powered systems that can simulate, manipulate, and personalize speech in incredible ways. This opens exciting possibilities but also raises concerns about misuse and erosion of trust in a world where perfect vocal fakes are possible. As this fascinating tech continues evolving rapidly, we find ourselves at an ethical crossroads — will we use its power responsibly when anyone can sound like a celebrity? The future remains unclear, but one thing is certain: voice synthesis is about to make our lives far more interesting if we can develop the laws and ethics to keep pace.

Additional Sources

  • Guidelines for responsible deployment of synthetic voice technology — Azure AI services | Microsoft Learn
  • AI Voice Cloning and Synthetic Voices: Tools with Ethical Dilemmas? | Podcasting Insights (medium.com)
  • China’s New Legislation on Deepfakes: Should the Rest of Asia Follow Suit? — The Diplomat
  • AI Deepfakes Bill Pushes Publicity Rights, Spurs Speech Concerns (bloomberglaw.com)
  • Speech synthesis — Wikipedia

Opinions expressed by DZone contributors are their own.
