Voice User Interfaces (VUI) — The Ultimate UX Guide
Hey, Siri, read me this article from DZone.com.
— “Ok, calling Selma Martin”
“No! Set my alarm to 7:15am”
— “I’m sorry. I can’t help you with that.”
“Sigh” *Manually sets alarm*
Our voices are diverse, complex, and variable. Voice commands are even more daunting to process — even between people, let alone computers. The way we frame our thoughts, the way we culturally communicate, the way we use slang and infer meaning…all of these nuances influence the interpretation and comprehensibility of our words.
So, how are designers and engineers tackling this challenge? How can we cultivate trust between the user and AI? This is where VUIs come into play.
Voice User Interfaces (VUI) are the primary or supplementary visual, auditory, and tactile interfaces that enable voice interaction between people and devices. Simply stated, a VUI can be anything from a light that blinks when it hears your voice to an automobile’s entertainment console. Keep in mind, a VUI does not need to have a visual interface — it can be completely auditory or tactile (ex. a vibration).
While there’s a vast spectrum of VUI, they all share a set of common UX fundamentals that drive usability. We’ll explore those fundamentals so, as a user, you can dissect your everyday VUI interactions — and, as a designer, you can build better experiences.
Discovery — Constraints, Dependencies, Use Cases
The way we interact with our world is highly shaped by our technological, environmental, and sociological constraints: the speed at which we can process information, the accuracy with which we can translate that data into action, the language or dialect we use to communicate that data, and the recipient of that action (whether it’s ourselves or someone else).
Before we dive into our interactive design, we must first identify the environmental context that frames the voice interaction.
Determine the Device Genre
The device type influences the modes and inputs that underlie the spectrum and scope of the voice interaction.
Smartphones
- iPhones, Pixels, Galaxies
- Connectivity — Cellular networks, Wi-Fi, paired devices
- Environmental context has a substantial impact on voice interactivity
- Users are accustomed to using voice interaction
- Allows interaction through visual, auditory, and tactile feedback
- Interaction methods are fairly standardized across models
Wearables
- Typically geared towards specific use cases, like a watch, fitness band, or smart shoes
- Connectivity — Cellular networks, Wi-Fi, paired devices
- Users may be accustomed to using voice interaction, but this interaction is unstandardized across devices
- Some wearables allow for interaction through visual, auditory, and tactile feedback, though some are more passive with no explicit user interaction
- Typically dependent on connected devices for user interaction and data consumption
Stationary Connected Devices
- Desktop computers, appliances with screens, thermostats, smart home hubs, sound systems, TVs
- Connectivity — Wired networks, Wi-Fi, paired devices
- Users are accustomed to using these devices in the same location and setting on a habitual basis
- Quasi-standardized methods of voice interaction among similar device genres (desktop computers vs. connected hubs like Google Home/Amazon Alexa vs. smart thermostats)
Non-Stationary Computing Devices (Non-Phones)
- Laptops, tablets, transponders, automobile infotainment systems
- Connectivity — Cellular networks, Wi-Fi, wired networks (uncommon), paired devices
- Primary input mode is typically not voice
- Environmental context has a substantial impact on voice interactivity
- Typically have unstandardized voice interaction methods between device genres
Create a Use Case Matrix
What are the primary, secondary, and tertiary use cases for the voice interaction? Does the device have one primary use case (like a fitness tracker)? Or does it have an eclectic mix of use cases (like a smartphone)?
Creating a use case matrix helps you identify why users are interacting with the device. What is their primary mode of interaction? What is secondary? Which interaction modes are nice-to-haves, and which are essential?
You can create a use case matrix for each mode of interaction. When applied to voice interaction, the matrix will help you understand how your users currently use or want to use voice to interact with the product — including where they would use the voice assistant.
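As a minimal sketch of such a matrix (the device, use cases, and scores below are hypothetical, chosen purely for illustration), each row is a use case and each column scores one mode of interaction:

```python
# Hypothetical use case matrix for a smart TV (illustrative scores, 1-5).
# Rows are use cases; columns are modes of interaction.
use_case_matrix = {
    "change channel": {"remote": 5, "voice": 3, "phone app": 2},
    "search content": {"remote": 2, "voice": 5, "phone app": 4},
    "adjust volume":  {"remote": 5, "voice": 2, "phone app": 1},
}

def rank_modes(matrix):
    """Rank interaction modes by their total score across all use cases."""
    totals = {}
    for scores in matrix.values():
        for mode, score in scores.items():
            totals[mode] = totals.get(mode, 0) + score
    return sorted(totals, key=totals.get, reverse=True)

print(rank_modes(use_case_matrix))  # → ['remote', 'voice', 'phone app']
```

A matrix like this makes the next step — rank ordering the modes of interaction — a matter of reading off the totals rather than guessing.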
Rank Order the Modes of Interaction
If you’re using user research to inform your use cases (either usage data or raw quantitative/qualitative research), then it is important to qualify your analysis by rank ordering the prospective modes of interaction.
If someone tells you: “OMG that would be so cool if I could talk to my TV and tell it to change the channel,” then you really need to dig deeper. Would they really use it? Do they understand the constraints? Do they truly understand their own propensity to use that feature?
As a designer, you must understand your users better than they understand themselves.
You must question the likelihood that they will use a particular mode of interaction given their access to the alternatives.
For instance, let’s say we’re examining whether a user is likely to use voice commands to interact with their TV. In this case, it is safe to assume that voice interaction is one of many possible types of interaction.
The user has access to multiple alternative interaction implements: a remote, a paired smartphone, a gaming controller, or a connected IoT device. Voice, therefore, does not necessarily become the default mode of interaction. It is one of many.
So the question becomes: what is the likelihood that a user will rely on voice interaction as the primary means of interaction? If not primary, then would it be secondary? Tertiary? This will qualify your assumptions and UX hypotheses moving forward.
Enumerate Technological Constraints
Translating our words into actions is an extremely difficult technological challenge. With unlimited time, connectivity, and training, a well-tuned computational engine could expediently ingest our speech and trigger the appropriate action.
Unfortunately, we live in a world where we don’t have unlimited connectivity (i.e., omnipresent gigabit internet), nor do we have unlimited time. We want our voice interactions to be as immediate as the traditional alternatives, visual and touch — even though voice engines require complex processing and predictive modeling.
Consider the sample flow of what has to happen for our speech to be recognized: the device detects a trigger, converts speech to text, parses intent, and executes an action.
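That flow can be sketched as a pipeline where each stage stands in for a trained model (the stage names and stub functions below are simplified assumptions, not a real recognition engine):

```python
# Simplified, illustrative speech-recognition pipeline (not a real engine).
def recognize(audio_frames):
    """Each stage below stands in for a continuously trained model."""
    if not detect_wake_word(audio_frames):  # acoustic trigger model
        return None
    text = transcribe(audio_frames)         # speech-to-text model
    intent, slots = parse_intent(text)      # natural-language model
    return execute(intent, slots)           # map intent to a device action

# Stub stages so the sketch runs end to end.
def detect_wake_word(frames):
    return bool(frames) and frames[0] == "ok device"

def transcribe(frames):
    return " ".join(frames[1:])

def parse_intent(text):
    if "alarm" in text:
        return "set_alarm", {"time": text.split()[-1]}
    return "unknown", {}

def execute(intent, slots):
    return (intent, slots)

print(recognize(["ok device", "set", "my", "alarm", "to", "7:15am"]))
# → ('set_alarm', {'time': '7:15am'})
```

Real engines replace every stub with a statistical model, which is exactly why each stage needs ongoing training.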
As we can see, there are numerous models that need to be continuously trained to work with our lexicon, accents, variable tones, and more.
Every voice recognition platform has a unique set of technological constraints. It is imperative that you embrace these constraints when architecting a voice interaction UX.
Analyze the following categories:
- Connectivity level — will the device always be connected to the internet?
- Processing speed — will the user need their speech to be processed in real time?
- Processing accuracy — what will the trade-off be between accuracy and speed?
- Speech models — how well-trained are our current models? Will they be able to accurately process full sentences or just short words?
- Fallbacks — what are the technological fallbacks if the speech cannot be recognized? Can the user harness another mode of interaction?
- Consequence of inaccuracy — will an incorrectly processed command result in an irreversible action? Is our voice recognition engine mature enough to avoid severe errors?
- Environmental testing — has the voice engine been tested in multiple environmental contexts? For instance, if I am building a car infotainment system, then I will be anticipating much more background disturbance than a smart thermostat.
Furthermore, we should also consider that users can interact with the device in a non-linear way. For example, if I want to book a plane ticket on a website, then I am forced to follow the website’s progressive information flow: select destination, select date, select number of tickets, look at options, etc…
But VUIs have a bigger challenge. The user can say, “We want to fly to San Francisco on business class.” Now, the VUI has to extract all of the relevant information from the user in order to harness existing flight booking APIs. The logical ordering may be skewed, so it is the VUI’s responsibility to elicit the remaining details (either by voice or visual supplements) from the user.
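To make the non-linear flight example concrete, a slot-filling sketch (the slot names and prompts below are hypothetical) tracks which details are still missing and asks only for those, in any order the user chooses:

```python
# Hypothetical slot-filling for the flight-booking example: the user can
# supply details in any order, and the VUI prompts only for what's missing.
REQUIRED_SLOTS = ["destination", "date", "cabin_class", "passengers"]
PROMPTS = {
    "destination": "Where would you like to fly?",
    "date": "What date would you like to travel?",
    "cabin_class": "Which cabin class would you prefer?",
    "passengers": "How many passengers?",
}

def next_prompt(filled_slots):
    """Return the follow-up question for the first unfilled slot, or None."""
    for slot in REQUIRED_SLOTS:
        if slot not in filled_slots:
            return PROMPTS[slot]
    return None  # all slots filled -> ready to call the booking API

# "We want to fly to San Francisco on business class" fills two slots:
filled = {"destination": "San Francisco", "cabin_class": "business"}
print(next_prompt(filled))  # → "What date would you like to travel?"
```

The user led with destination and cabin class, so the VUI skips those and asks for the travel date — the logical ordering adapts to the user, not the other way around.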
Voice Input UX
Now that we’ve explored our constraints, dependencies, and use cases, we can start to dive a little deeper into the actual voice UX. First, we’ll explore how devices know when to listen to us.
For some added context, a basic voice UX flow moves from an input trigger, to active listening, to processing, to feedback and a resulting action.
There are four types of voice input triggers:
- Voice trigger — the user will utter a phrase that will prompt the device to begin processing the speech (“Ok Google”)
- Tactile trigger — pressing a button (physical or digital) or toggling a control (ex. a microphone icon)
- Motion trigger — waving your hand in front of a sensor
- Device self-trigger — an event or pre-determined setting will trigger the device (a car accident or a task reminder that prompts for your confirmation)
As a designer, you must understand which triggers will be relevant to your use cases, and rank order those triggers from most to least relevant.
Typically, when a device is triggered to listen, there will be an auditory, visual, or tactile cue.
These cues should adhere to the following usability principles:
- Immediate — after an appropriate trigger, the cue should prompt as quickly as possible, even if it means interrupting a current action (so long as interrupting that action is not destructive).
- Brief and transitory — the cue should be nearly instantaneous, especially for habitually used devices. For example, two affirmative beeps are more effective than "Ok Justin, what would you like me to do now?" The longer the leading cue, the more likely your user’s words will conflict with the device prompt. This principle also applies to visual cues. The screen should immediately transform into a state of listening.
- Clear beginning — the user should know exactly when their voice is starting to be recorded.
- Consistent — the cue should always be the same. Differences in sound or visual feedback will confuse users.
- Distinct — the cue should be distinct from the device’s normal sounds and visuals — and should never be used or repeated in any other context.
- Supplementary cues — when possible, harness multiple interactive mediums to surface the cue (ex. two beeps, a light change, and a screen dialogue).
- Initial prompt — for first-time users, or when a user seems stuck, you can display an initial prompt or suggestions to facilitate voice communication.
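The supplementary-cue principle can be sketched as one consistent listening cue fanned out across every medium the device supports (the device capabilities and cue names below are hypothetical stand-ins, not real device APIs):

```python
# Hypothetical multi-modal listening cue: the same brief, consistent cue
# is surfaced on every medium the device supports.
def emit_listening_cue(device):
    cues = []
    if device.get("speaker"):
        cues.append("two short beeps")       # brief, distinct sound
    if device.get("led"):
        cues.append("pulse LED blue")        # consistent light pattern
    if device.get("screen"):
        cues.append("show listening state")  # screen transforms immediately
    return cues

# A screenless smart speaker still gets two reinforcing cues:
print(emit_listening_cue({"speaker": True, "led": True, "screen": False}))
# → ['two short beeps', 'pulse LED blue']
```

Because every path emits the same fixed cue, the consistency and distinctness principles hold no matter which subset of mediums a given device has.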
Feedback is critical to successful voice interface UX. It allows users to get consistent and immediate confirmation that their words are being ingested and processed by the device. Feedback also lets users take corrective or affirmative action.
- Real-time, responsive visuals — most common in native voice devices (ex. phones). Visuals respond to multiple dimensions of the user’s speech — pitch, timbre, intensity, and duration — by changing colors and patterns in real time, creating immediate cognitive feedback.
- Audio playback — a simple playback to confirm the interpretation of speech
- Real-time text — textual feedback will progressively appear in real-time as the user speaks
- Output text — textual feedback that is transformed and amended after the user has finished speaking. Think of this as the first layer of corrective processing before the audio is confirmed or translated into an action.
- Non-screen visual cues (lights, light patterns) — the responsive visuals mentioned above are not just confined to digital screens. These responsive patterns can also manifest in simple LED lights or light patterns.
The ending cue signals that the device has stopped listening to the user’s voice and will begin processing the command. Many of the same ‘leading cue’ principles apply to the end cue (immediate, brief, clear, consistent, and distinct). However, a few additional principles apply:
- Adequate time — ensure that adequate time has been allowed for the user to complete their command.
- Adaptive time — the time allotted should adapt to the use case and expected response. For instance, if the user was asked a “Yes” or “No” question, then the ending cue should expect a reasonable pause after one syllable.
- Reasonable pause — has a reasonable time elapsed since the last sound was recorded? This is very tricky to calculate and is also contextually dependent on the use case of the interaction.
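The adaptive-time principle can be sketched as a silence threshold that varies with the expected response (the threshold values below are illustrative guesses, not measured figures):

```python
# Illustrative end-of-speech detection: the silence threshold adapts to
# the kind of response the device expects (values are made-up examples).
THRESHOLDS_SEC = {
    "yes_no": 0.6,         # one syllable -> end soon after a short pause
    "short_command": 1.0,
    "open_dialogue": 2.0,  # free-form speech -> tolerate longer pauses
}

def should_stop_listening(expected_response, silence_sec):
    """Stop once silence exceeds the threshold for this response type."""
    return silence_sec >= THRESHOLDS_SEC[expected_response]

print(should_stop_listening("yes_no", 0.7))         # → True
print(should_stop_listening("open_dialogue", 0.7))  # → False
```

The same 0.7-second pause ends a yes/no exchange but is treated as mid-sentence thinking during open dialogue, which is exactly the adaptive behavior the principle calls for.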
Simple commands like “Turn on my alarm” don’t necessarily require a lengthy conversation, but more complex commands do. Unlike traditional human-to-human interaction, human-to-AI interaction requires additional layers of confirmation, redundancy, and rectification.
More complex commands or iterative conversations typically require multiple layers of speech/option verification to assure accuracy. Complicating matters even more, the user is often not sure what to ask or how to ask for it. So, it becomes the VUI’s job to decipher the message and allow the user to provide additional context.
- Affirmative — When the AI does understand the speech, it should respond with an affirmative message that also confirms the speech. For example, instead of saying “Sure”, the AI could say “Sure, I’ll turn the lights off” — or “Are you sure you’d like me to turn off the lights?”
- Corrective — When the AI is unable to decipher the user’s intent, it should respond with a corrective option. This allows the user to select another option or restart the conversation entirely.
- Empathetic — When the AI is unable to fulfill a user’s request, it should take ownership of the lack of understanding — and then provide the user with corrective actions. Empathy is important to establishing a more personable relationship.
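One plausible way to choose among these three response types (the confidence bands and phrasings below are assumptions for illustration, not any platform’s actual logic) is to branch on the recognizer’s confidence and on whether the request can be fulfilled at all:

```python
# Illustrative response selection: affirmative when confident, corrective
# when unsure, empathetic when the request can't be fulfilled at all.
def respond(intent, confidence, supported):
    if not supported:
        # Empathetic: own the failure and offer a corrective action.
        return f"Sorry, I can't {intent} yet. Want me to open settings instead?"
    if confidence >= 0.8:
        # Affirmative: confirm the interpreted speech, not just "Sure".
        return f"Sure, I'll {intent}."
    if confidence >= 0.4:
        # Corrective: surface the best guess and let the user restart.
        return f"Did you mean: {intent}? Say yes, or try again."
    return "Sorry, I didn't catch that. Could you rephrase?"

print(respond("turn the lights off", 0.9, True))
# → "Sure, I'll turn the lights off."
```

Note that even the affirmative branch echoes the interpreted command back, giving the user a final chance to catch a misrecognition before the action fires.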
Giving human-like traits to voice interaction creates a relationship between human and device. This anthropomorphization can manifest in various ways: patterns of lights, shapes that bounce, abstract spherical patterns, computer-generated voice, and sounds.
Anthropomorphism is the attribution of human traits, emotions, or intentions to non-human entities.
This relationship cultivates a more intimate bond between user and machine, which can also span across products with similar operating platforms (ex. Google’s Assistant, Amazon’s Alexa, and Apple’s Siri).
- Personality — Brings an extra dimension to the interaction, allowing the virtual personality to relate and empathize with the user. It helps mitigate the negative impacts of incorrectly processed speech.
- Positivity — General positivity encourages repeat interaction and an affirmative tone.
- Confidence & Trust — Encourages additional interaction and complex conversation because the user has additional confidence that the outcome will be positive and add value.
End-to-End Motion UX
Voice interactions should be fluid and dynamic. When we converse with each other in person, we typically use a myriad of facial expressions, changes in tone, body language, and movement. The challenge is capturing this fluid interaction in a digitized environment.
When possible, the entire voice interaction should feel rewarding. Of course, more fleeting interactions like “Turn the lights off” don’t necessarily require a full relationship. However, any more robust interaction, like cooking with a digital assistant, does require a prolonged conversation.
An effective voice motion experience would benefit from the following principles:
- Transitory — Seamlessly handles transitions between different states. The user should feel like they are never waiting, but rather that the assistant is working for them.
- Vivid — Vivid color conveys delight and adds an element of futuristic elegance to the interaction, which encourages repeat interaction.
- Responsive — Responds to user input and gesturing. Gives hints regarding which words are being processed and allows the user to see if their speech/intent is being accurately parsed.
Conclusion and Resources
VUIs are extremely complex, multifaceted, and often hybrid amalgams of interaction. In fact, there isn’t really an all-encompassing definition. What’s important to remember is that an increasingly digitized world means we may actually be spending more time with our devices than we do with each other. Will VUIs eventually become our primary means of interaction with our world? We’ll see.
In the meantime, are you looking to build a world-class VUI? Here are some helpful resources:
Published at DZone with permission of Justin Baker, DZone MVB. See the original article here.