[Part-3] Text to Action: Adding Voice Control to Your Smart Calendar
In Part 3 of “Text to Action,” we add voice commands to our calendar assistant, converting speech to events using the Web Speech API and our existing NLP pipeline.
Welcome to the third installment of our “Text to Action” series, where we’re building intelligent systems that transform natural language into real-world actions using AI.
In "[Part-1] Text to Action: Build a Smart Calendar AI Assistant," we established our foundation by creating an Express.js backend that connects to Google Calendar’s API. This gave us the ability to programmatically create calendar events through an exposed API endpoint.
In "[Part-2] Text to Action: Words to Calendar Events," we added natural language processing (NLP) capabilities, enabling users to type descriptions like “Schedule a team meeting tomorrow at 3pm” and have our system intelligently transform these words into calendar events.
Today, we’re taking another step forward by adding voice command capabilities to create a truly hands-free experience. Imagine being able to simply speak, “Schedule a lunch meeting tomorrow at noon,” and have your calendar updated automatically — no typing required!
What We’re Building
We’re adding a voice interface to our existing application that will:
- Listen for user speech using the Web Speech API
- Convert spoken words to text
- Process the text using our existing NLP pipeline
- Create calendar events based on the spoken commands
- Provide voice feedback to confirm actions
This creates a complete voice-to-action pipeline that demonstrates how modern web technologies can create powerful, accessible interfaces.
The complete code is available on GitHub.
The Voice-to-Action Flow
Before diving into implementation, let’s understand the complete flow:
- Simplified flow: Speech → Text → NLP Processing → Calendar Event Creation
- High-level flow overview:
  - Voice command (your natural speech)
  - → Text conversion using the Web Speech API
  - → Text sent for NLP processing (existing backend endpoint /api/text-to-event)
  - → Structured JSON payload (produced by the LLM)
  - → Google Calendar event (created through the Google Calendar API with the JSON payload)
The spoken feedback played back at the end is a finishing touch that completes the conversational experience.
Project Architecture
Our application follows a modular architecture where each tutorial part builds on the previous ones:
- Backend: Express.js server that handles API requests and connects to Google Calendar
- NLP Processing: Uses Ollama to understand natural language (from Part 2)
- Frontend: Three separate interfaces that all connect to the same backend API
For styling, we’ve organized our CSS into separate files for each part of the tutorial in the /public/css/ directory, making it easy to understand which styles belong to which functionality.
Core Implementation Code-Walkthrough
1. Creating a Press-and-Hold Interface
Rather than a simple toggle button, we implemented a press-and-hold interface similar to Google Assistant, which feels more intuitive. Users press the microphone button to speak and release it when they’re done:
// Press and hold to speak pattern
micButton.addEventListener('mousedown', startListening); // Desktop
micButton.addEventListener('touchstart', startListening); // Mobile
document.addEventListener('mouseup', stopListening); // Desktop
document.addEventListener('touchend', stopListening); // Mobile

function startListening(e) {
  // Prevent default behavior for touch events
  if (e.type === 'touchstart') {
    e.preventDefault();
  }
  // Only start if we're not already listening
  if (!isListening) {
    // Reset transcripts
    finalTranscript = '';
    interimTranscript = '';
    // Start speech recognition
    recognition.start();
    isListening = true;
    statusEl.textContent = 'Listening...';
    micButton.classList.add('listening');
    transcriptEl.innerHTML = '<em>Listening...</em>';
  }
}

function stopListening() {
  // Only process if we were listening
  if (isListening) {
    recognition.stop();
    isListening = false;
    statusEl.textContent = 'Processing...';
    micButton.classList.remove('listening');
    // Processing will happen in the onend handler
  }
}
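The start/stop guard can also be kept separate from the DOM, which makes it easier to reason about and to test. The following is a minimal sketch, not the code from the repository: createPressAndHold is a hypothetical helper, and its recognition argument is assumed to be any object with start() and stop() methods.

```javascript
// Hypothetical helper: wraps start/stop so repeated presses or stray
// releases never call the recognizer twice in a row.
function createPressAndHold(recognition) {
  let listening = false;

  return {
    // Call from mousedown / touchstart
    press() {
      if (listening) return false; // already listening, ignore
      recognition.start();
      listening = true;
      return true;
    },
    // Call from mouseup / touchend
    release() {
      if (!listening) return false; // nothing to stop
      recognition.stop();
      listening = false;
      return true;
    },
    isListening() {
      return listening;
    }
  };
}

// Usage with a stand-in recognizer that only records calls
const calls = [];
const control = createPressAndHold({
  start: () => calls.push('start'),
  stop: () => calls.push('stop')
});
control.press();
control.press();   // ignored: already listening
control.release();
control.release(); // ignored: nothing to stop
console.log(calls.join(', ')); // start, stop
```

In the page itself, press and release would be wired to the same mouse and touch events shown above.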
2. Understanding the Web Speech API
The Web Speech API has two main components we’ll use:
- SpeechRecognition: Converts spoken words to text
- SpeechSynthesis: Converts text to spoken words (voice playback)
Browser support varies, with Chrome offering the best compatibility. A recognition session follows one of the two event flows below.
Success Speech Recognition Function Execution Flow
onstart → onresult (multiple times) → onend
Failure Speech Recognition Function Execution Flow
onstart → onresult (maybe) → onerror → onend
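Per the Web Speech API specification, the end event fires whenever a session finishes, including after an error, so both flows can be funneled into a single outcome. The sketch below illustrates that idea under stated assumptions: trackSession and fakeResult are hypothetical names, and the fake object only imitates the handler properties of a real SpeechRecognition instance.

```javascript
// Hypothetical helper: attaches handlers to a recognition-like object and
// reports exactly one outcome per session through onDone.
function trackSession(recognition, onDone) {
  let transcript = '';
  let error = null;

  recognition.onstart = () => {
    transcript = '';
    error = null;
  };
  recognition.onresult = (event) => {
    // Collect the top alternative of each newly finalized result
    for (let i = event.resultIndex; i < event.results.length; i++) {
      if (event.results[i].isFinal) {
        transcript += event.results[i][0].transcript;
      }
    }
  };
  recognition.onerror = (event) => {
    error = event.error; // remember it; onend still fires afterwards
  };
  recognition.onend = () => {
    onDone(error ? { ok: false, error } : { ok: true, transcript });
  };
}

// Stand-in for a browser result: array-like with an isFinal flag
function fakeResult(text, isFinal) {
  const result = [{ transcript: text }];
  result.isFinal = isFinal;
  return result;
}

const outcomes = [];
const fake = {};
trackSession(fake, (outcome) => outcomes.push(outcome));

// Success flow: onstart → onresult → onend
fake.onstart();
fake.onresult({ resultIndex: 0, results: [fakeResult('lunch at noon', true)] });
fake.onend();

// Failure flow: onstart → onerror → onend
fake.onstart();
fake.onerror({ error: 'no-speech' });
fake.onend();

console.log(JSON.stringify(outcomes));
// [{"ok":true,"transcript":"lunch at noon"},{"ok":false,"error":"no-speech"}]
```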
Speech Recognition Setup and Configuration
// Check for Speech Recognition API support
if (!('webkitSpeechRecognition' in window) && !('SpeechRecognition' in window)) {
  alert('Your browser does not support the Speech Recognition API. Please use Chrome, Edge, or Safari.');
  document.getElementById('micButton').disabled = true;
  document.getElementById('status').textContent = 'Speech recognition not available';
  return;
}

// Initialize speech recognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

// Initialize speech synthesis
const speechSynthesis = window.speechSynthesis;

// Configure recognition
recognition.continuous = false;
recognition.interimResults = true;
recognition.lang = 'en-US';
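The detection and configuration steps above can be folded into one small factory. This is a sketch rather than the project's actual code: getSpeechRecognition and createRecognizer are hypothetical names, and the window-like object is passed in so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: returns the recognition constructor for the given
// window-like object, or null when neither name is exposed.
function getSpeechRecognition(win) {
  return win.SpeechRecognition || win.webkitSpeechRecognition || null;
}

// Hypothetical helper: builds a configured recognizer, or null if unsupported.
// Defaults mirror the configuration used in this tutorial.
function createRecognizer(win, options = {}) {
  const Ctor = getSpeechRecognition(win);
  if (!Ctor) return null;

  const recognition = new Ctor();
  recognition.continuous = options.continuous ?? false;
  recognition.interimResults = options.interimResults ?? true;
  recognition.lang = options.lang ?? 'en-US';
  return recognition;
}

// Usage with stand-in window objects
class FakeRecognition {}
const supported = createRecognizer({ webkitSpeechRecognition: FakeRecognition });
const unsupported = createRecognizer({});
console.log(supported.lang);  // en-US
console.log(unsupported);     // null
```

In the browser, the call would be createRecognizer(window), with a null result handled by disabling the microphone button as shown above.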
Speech Recognition Event Handlers
// Tracks whether the current session reported an error
let recognitionFailed = false;

recognition.onstart = () => {
  isListening = true;
  recognitionFailed = false;
  statusEl.textContent = 'Listening...';
  micButton.classList.add('listening');
};

recognition.onend = () => {
  isListening = false;
  // Recognition can end on its own, so always reset the button state
  micButton.classList.remove('listening');
  // onend also fires after onerror; keep the error status in that case
  if (recognitionFailed) {
    return;
  }
  if (finalTranscript) {
    statusEl.textContent = 'Processing...';
    processVoiceCommand(finalTranscript);
  } else if (interimTranscript) {
    // If we only have interim results when recognition ends,
    // use those as our final transcript
    finalTranscript = interimTranscript;
    statusEl.textContent = 'Processing...';
    processVoiceCommand(finalTranscript);
  } else {
    statusEl.textContent = 'No speech detected. Try again.';
    speak('I didn\'t hear anything. Please try again.');
  }
};

recognition.onerror = (event) => {
  recognitionFailed = true;
  isListening = false;
  micButton.classList.remove('listening');
  statusEl.textContent = `Error: ${event.error}`;
  console.error('Speech Recognition Error:', event.error);
};
3. Real-Time Speech Recognition
We show both final and interim transcription results in real-time:
recognition.onresult = (event) => {
  interimTranscript = '';
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const transcript = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      finalTranscript += transcript;
    } else {
      interimTranscript += transcript;
    }
  }
  transcriptEl.innerHTML = `
    <div class="final">${finalTranscript}</div>
    <div class="interim"><em>${interimTranscript}</em></div>
  `;
};
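The final/interim split in the handler above can be expressed as a pure function, and because the transcript is written with innerHTML, it may be worth escaping it first. The following is a minimal sketch under those assumptions: splitTranscripts, escapeHtml, and fakeResult are hypothetical helpers and are not part of the Web Speech API.

```javascript
// Hypothetical helper: splits the results of one `result` event into
// newly finalized text and still-changing interim text.
function splitTranscripts(event) {
  let finalText = '';
  let interimText = '';
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      finalText += result[0].transcript;
    } else {
      interimText += result[0].transcript;
    }
  }
  return { finalText, interimText };
}

// Hypothetical helper: escapes text before it is placed into innerHTML
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Stand-in for a browser result: array-like with an isFinal flag
function fakeResult(text, isFinal) {
  const result = [{ transcript: text }];
  result.isFinal = isFinal;
  return result;
}

const parts = splitTranscripts({
  resultIndex: 0,
  results: [fakeResult('schedule lunch ', true), fakeResult('tomorrow at', false)]
});
console.log(parts.finalText + '|' + parts.interimText);
// schedule lunch |tomorrow at

console.log(escapeHtml('<b>lunch</b> & "coffee"'));
// &lt;b&gt;lunch&lt;/b&gt; &amp; &quot;coffee&quot;
```

Inside onresult, the handler would append finalText to finalTranscript, replace interimTranscript with interimText, and pass both through escapeHtml before building the markup.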
4. Connecting Voice Input to Our Existing NLP Pipeline
The key integration point of our system is how we connect the voice interface to our existing NLP pipeline from Part 2.
This is where a spoken command turns into a calendar event:
// Process the voice command
async function processVoiceCommand(text) {
  statusEl.textContent = 'Creating event...';
  resultEl.innerHTML = `<div class="loading">Processing your request <span></span></div>`;
  resultEl.style.display = 'block';
  try {
    // Get user timezone
    const timezone = Intl.DateTimeFormat().resolvedOptions().timeZone;
    // Send the command to our existing NLP API from Part 2
    const response = await fetch('/api/text-to-event', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Timezone': timezone
      },
      body: JSON.stringify({ text })
    });
    const data = await response.json();
    if (data.success) {
      // Format response for display
      const eventData = data.eventData;
      const startDate = new Date(eventData.startDateTime);
      const endDate = new Date(eventData.endDateTime);
      const formattedStart = startDate.toLocaleString();
      const formattedEnd = endDate.toLocaleString();
      resultEl.innerHTML = `
        <div class="success-message">
          <h3>Event Created Successfully!</h3>
          <p><strong>Summary:</strong> ${eventData.summary}</p>
          <p><strong>Start:</strong> ${formattedStart}</p>
          <p><strong>End:</strong> ${formattedEnd}</p>
          <p><a href="${data.eventLink}" target="_blank">View in Google Calendar</a></p>
        </div>
      `;
      statusEl.textContent = 'Event created!';
      // Provide voice feedback
      speak(`Event created successfully: ${eventData.summary} on ${startDate.toLocaleDateString()} at ${startDate.toLocaleTimeString()}.`);
    } else {
      throw new Error(data.error || 'Failed to create event');
    }
  } catch (error) {
    console.error('Error:', error);
    resultEl.innerHTML = `<div class="error-message">Error: ${error.message}</div>`;
    statusEl.textContent = 'Error creating event';
    speak('Sorry, I couldn\'t create that event. Please try again.');
  }
}
Notice how we’re simply sending the recognized speech text to our existing /api/text-to-event endpoint from Part 2.
This demonstrates the power of good architectural design — we can add new interface modes (like voice) without having to recreate our NLP and calendar integration logic.
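Because every interface talks to the same endpoint, the request and response handling can live in two small functions shared by the typed and spoken front ends. This is a sketch under stated assumptions: buildTextToEventRequest and readEventResponse are hypothetical names, and the response fields (success, eventData, eventLink, error) are the ones used in processVoiceCommand above.

```javascript
// Hypothetical helper: builds the fetch options used for /api/text-to-event.
function buildTextToEventRequest(text, timezone) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Timezone': timezone
    },
    body: JSON.stringify({ text })
  };
}

// Hypothetical helper: turns the parsed JSON reply into event details or a
// thrown Error, mirroring the success and error branches shown above.
function readEventResponse(data) {
  if (data && data.success) {
    return { eventData: data.eventData, eventLink: data.eventLink };
  }
  throw new Error((data && data.error) || 'Failed to create event');
}

// Usage with sample values
const request = buildTextToEventRequest('Schedule a lunch meeting tomorrow at noon', 'America/Chicago');
console.log(request.headers['X-Timezone']); // America/Chicago
console.log(request.body); // {"text":"Schedule a lunch meeting tomorrow at noon"}
```

In the browser, the two helpers would wrap the call as fetch('/api/text-to-event', buildTextToEventRequest(text, timezone)), followed by readEventResponse(await response.json()).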
5. Adding Voice Feedback
To create a conversational experience, we provide spoken feedback using the SpeechSynthesis interface of the Web Speech API:
function speak(text) {
  // Cancel any ongoing speech
  speechSynthesis.cancel();
  // Create a new utterance
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'en-US';
  utterance.rate = 1.0;
  utterance.pitch = 1.0;
  // Speak the text
  speechSynthesis.speak(utterance);
}
After creating an event, we provide spoken confirmation:
speak(`Event created successfully: ${eventData.summary} on ${startDate.toLocaleDateString()} at ${startDate.toLocaleTimeString()}.`);
We also make example commands clickable — when clicked, they’re spoken aloud so the user can hear how they should phrase their commands:
// Click example commands to hear them spoken
document.querySelectorAll('.command-example').forEach(example => {
  example.addEventListener('click', () => {
    speak(example.textContent);
  });
});
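The speak function above assumes speech synthesis is always available. One way to make that dependency explicit, and to skip speech quietly where it is missing, is a small factory. This is only a sketch: createSpeaker is a hypothetical name, and the synth and Utterance arguments stand in for window.speechSynthesis and window.SpeechSynthesisUtterance.

```javascript
// Hypothetical factory: returns a speak() function bound to a given
// speechSynthesis-like object and utterance constructor.
function createSpeaker(synth, Utterance, defaults = {}) {
  return function speak(text) {
    if (!synth || !Utterance) return false; // no speech synthesis available
    synth.cancel(); // stop anything still being spoken
    const utterance = new Utterance(text);
    utterance.lang = defaults.lang ?? 'en-US';
    utterance.rate = defaults.rate ?? 1.0;
    utterance.pitch = defaults.pitch ?? 1.0;
    synth.speak(utterance);
    return true;
  };
}

// Usage with stand-ins that record what would be spoken
class FakeUtterance {
  constructor(text) {
    this.text = text;
  }
}
const spoken = [];
const fakeSynth = {
  cancel: () => {},
  speak: (utterance) => spoken.push(utterance.text)
};

const speak = createSpeaker(fakeSynth, FakeUtterance);
speak('Event created!');
console.log(spoken.join(', ')); // Event created!

const silent = createSpeaker(undefined, undefined);
console.log(silent('Nothing happens')); // false
```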
The Complete Voice-to-Action Pipeline
Here’s the full flow when a user speaks a command:
- Initiation: User presses and holds the microphone button
- Recognition: Web Speech API converts speech to text in real-time
- Visualization: Recognized text appears in the transcript area
- Processing: When the button is released, text is sent to our backend API
- NLP: Our existing NLP service (from Part 2) processes the text
- Creation: A calendar event is created in the user’s Google Calendar
- Feedback: Both visual and spoken confirmations are provided
This creates a seamless experience where speaking a command leads directly to a real-world outcome.
Testing the Voice Interface
To try it out:
- Clone the vivekvells/text-to-calendar repo and install the required dependencies
- Start the server with npm start, and make sure Ollama is running
- Go to http://localhost:3000/voice-commands.html
- Press and hold the microphone button
- Speak a command like “Schedule a team meeting tomorrow at 2pm”
- Release the button and watch as your event is created
Troubleshooting Tips
If you’re having trouble with voice recognition:
- Use Chrome for best compatibility
- Ensure your microphone is working and has browser permission
- Speak clearly at a normal pace
- Try one of the example commands first
- Check that your environment isn’t too noisy
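Several of these tips correspond to error codes that the Web Speech API reports in event.error, so the status line can show a hint instead of the raw code. The sketch below is illustrative: the codes are taken from the Web Speech API specification, while describeSpeechError and the hint wording are hypothetical.

```javascript
// Hints keyed by SpeechRecognitionErrorEvent.error codes (wording is illustrative)
const ERROR_HINTS = new Map([
  ['no-speech', 'No speech was detected. Try speaking closer to the microphone.'],
  ['audio-capture', 'No microphone was found. Check that one is connected.'],
  ['not-allowed', 'Microphone access was blocked. Allow it in the browser settings.'],
  ['service-not-allowed', 'The browser does not allow speech recognition here.'],
  ['network', 'The speech service could not be reached. Check your connection.'],
  ['aborted', 'Listening was cancelled.'],
  ['language-not-supported', 'The selected language is not supported.']
]);

// Hypothetical helper: maps an error code to a user-facing hint,
// falling back to the raw code for anything unrecognized.
function describeSpeechError(code) {
  return ERROR_HINTS.get(code) ?? `Speech recognition error: ${code}`;
}

console.log(describeSpeechError('not-allowed'));
// Microphone access was blocked. Allow it in the browser settings.
console.log(describeSpeechError('something-else'));
// Speech recognition error: something-else
```

In the onerror handler, statusEl.textContent could then be set to describeSpeechError(event.error).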
Conclusion
In this tutorial, we’ve added voice command capabilities to our Calendar Assistant, creating a hands-free way to schedule events. We’ve learned how to:
- Use the Web Speech API for speech recognition and synthesis
- Create an intuitive press-and-hold interface
- Process spoken commands with our existing NLP pipeline
- Provide spoken feedback for a conversational experience
Our application now offers three different ways to create Google Calendar events:
- REST API call (Part 1): Programmatic creation via structured JSON and the Google Calendar API
- NLP pipeline (Part 2): Natural language text processing using an Ollama LLM to convert text into structured API calls
- Voice command interface (Part 3): Speech captured with the Web Speech API, which then flows through our established NLP pipeline
This makes our calendar assistant more accessible and versatile for different contexts and user preferences.
The complete code for this project is available on GitHub.
I hope you enjoyed this tutorial series on building a complete “Text to Action” system. Let me know in the comments what you’d like to see next!
Published at DZone with permission of Vivek Vellaiyappan Surulimuthu. See the original article here.