[Part-3] Text to Action: Adding Voice Control to Your Smart Calendar

In Part 3 of “Text to Action,” we add voice commands to our calendar assistant, converting speech to calendar events using the Web Speech API and our existing NLP pipeline.

Jul. 23, 25 · Tutorial


Welcome to the third installment of our “Text to Action” series, where we’re building intelligent systems that transform natural language into real-world actions using AI.

In "[Part-1] Text to Action: Build a Smart Calendar AI Assistant," we established our foundation by creating an Express.js backend that connects to Google Calendar’s API. This gave us the ability to programmatically create calendar events through an exposed API endpoint.

In "[Part-2] Text to Action: Words to Calendar Events," we added natural language processing (NLP) capabilities, enabling users to type descriptions like “Schedule a team meeting tomorrow at 3pm” and have our system intelligently transform these words into calendar events.

Today, we’re taking another step forward by adding voice command capabilities to create a truly hands-free experience. Imagine being able to simply speak, “Schedule a lunch meeting tomorrow at noon,” and have your calendar updated automatically — no typing required!

What We’re Building

We’re adding a voice interface to our existing application that will:

Listen for user speech using the Web Speech API
Convert spoken words to text
Process the text using our existing NLP pipeline
Create calendar events based on the spoken commands
Provide voice feedback to confirm actions

This creates a complete voice-to-action pipeline that demonstrates how modern web technologies can create powerful, accessible interfaces.

The complete code is available on GitHub.

The Voice-to-Action Flow

Before diving into implementation, let’s understand the complete flow:

Simplified flow: Speech → Text → NLP Processing → Calendar Event Creation

High-level flow overview:

Start with your spoken command
→ Text conversion using the Web Speech API
→ Text sent for NLP processing (existing backend endpoint /api/text-to-event)
→ Structured JSON payload (produced by the LLM)
→ Google Calendar event (created through the Google Calendar API from the JSON payload)

The spoken feedback at the end is not required for event creation; it rounds out the conversational experience.

Project Architecture

Our application follows a modular architecture where each tutorial part builds on the previous ones:

Backend: Express.js server that handles API requests and connects to Google Calendar
NLP Processing: Uses Ollama to understand natural language (from Part 2)
Frontend: Three separate interfaces that all connect to the same backend API

For styling, we’ve organized our CSS into separate files for each part of the tutorial in the /public/css/ directory, making it easy to understand which styles belong to which functionality.

Core Implementation: Code Walkthrough

1. Creating a Press-and-Hold Interface

Rather than a simple toggle button, we use a press-and-hold interface similar to Google Assistant. Users press the microphone button to speak and release it when they’re done:

// Press and hold to speak pattern
micButton.addEventListener('mousedown', startListening); // Desktop
micButton.addEventListener('touchstart', startListening); // Mobile
document.addEventListener('mouseup', stopListening); // Desktop
document.addEventListener('touchend', stopListening); // Mobile

function startListening(e) {
  // Prevent default behavior for touch events
  if (e.type === 'touchstart') {
    e.preventDefault();
  }
  
  // Only start if we're not already listening
  if (!isListening) {
    // Reset transcripts
    finalTranscript = '';
    interimTranscript = '';
    
    // Start speech recognition
    recognition.start();
    isListening = true;
    statusEl.textContent = 'Listening...';
    micButton.classList.add('listening');
    transcriptEl.innerHTML = '<em>Listening...</em>';
  }
}

function stopListening() {
  // Only process if we were listening
  if (isListening) {
    recognition.stop();
    isListening = false;
    statusEl.textContent = 'Processing...';
    micButton.classList.remove('listening');
    
    // Processing will happen in the onend handler
  }
}
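
The isListening guard in the two functions above can also be isolated from the DOM, which makes it easy to test. The following is a minimal sketch, not the project's code: createHoldToTalk, onStart, and onStop are hypothetical names introduced only for illustration.

```javascript
// Minimal sketch: the press-and-hold guard logic, isolated from the DOM.
// `createHoldToTalk`, `onStart`, and `onStop` are hypothetical names.
function createHoldToTalk({ onStart, onStop }) {
  let active = false;

  return {
    // Call from the mousedown / touchstart handlers
    press() {
      if (active) return false; // already listening: ignore repeat presses
      active = true;
      onStart();
      return true;
    },
    // Call from the mouseup / touchend handlers
    release() {
      if (!active) return false; // stray release with no matching press
      active = false;
      onStop();
      return true;
    },
    isActive() {
      return active;
    }
  };
}
```

In the page, onStart would call recognition.start() and onStop would call recognition.stop(). Because mouseup and touchend are registered on the document, stray releases are common, which is why release() ignores them.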

2. Understanding the Web Speech API

The Web Speech API has two main components we’ll use:

SpeechRecognition: Converts spoken words to text
SpeechSynthesis: Converts text to spoken words (voice playback)

Browser support varies, with Chrome offering the best compatibility. Speech recognition fires its events in one of two sequences, depending on whether recognition succeeds. Note that the end event fires in both cases, so onend also runs after an error.

Success Speech Recognition Function Execution Flow

onstart → onresult (multiple times) → onend

Failure Speech Recognition Function Execution Flow

onstart → onresult (maybe) → onerror → onend

Speech Recognition Setup and Configuration

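// Check for Speech Recognition API support.
// Note: the `return` below assumes this code runs inside a function
// (for example a DOMContentLoaded handler); a top-level return is a syntax error.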
if (!('webkitSpeechRecognition' in window) && !('SpeechRecognition' in window)) {
  alert('Your browser does not support the Speech Recognition API. Please use Chrome, Edge, or Safari.');
  document.getElementById('micButton').disabled = true;
  document.getElementById('status').textContent = 'Speech recognition not available';
  return;
}
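
The support check can also be written as a small helper that takes a window-like object, so it does not rely on a bare return and can be tested outside a browser. This is a sketch under that assumption; getSpeechRecognition is a hypothetical name, not part of the Web Speech API.

```javascript
// Sketch: resolve the recognition constructor from a window-like object.
// `getSpeechRecognition` is a hypothetical helper name.
function getSpeechRecognition(win) {
  // Prefer the standard name, then fall back to the webkit-prefixed one
  return win.SpeechRecognition || win.webkitSpeechRecognition || null;
}

// Possible use in the page (commented out so the sketch runs anywhere):
// const Recognition = getSpeechRecognition(window);
// if (!Recognition) { /* disable the mic button and show a message */ }
// else { const recognition = new Recognition(); }
```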

// Initialize speech recognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

// Initialize speech synthesis
const speechSynthesis = window.speechSynthesis;

// Configure recognition
recognition.continuous = false;
recognition.interimResults = true;
recognition.lang = 'en-US';

Speech Recognition Event Handlers

recognition.onstart = () => {
  isListening = true;
  statusEl.textContent = 'Listening...';
  micButton.classList.add('listening');
};

recognition.onend = () => {
  isListening = false;
  
  if (finalTranscript) {
    statusEl.textContent = 'Processing...';
    processVoiceCommand(finalTranscript);
  } else if (interimTranscript) {
    // If we only have interim results when recognition ends,
    // use those as our final transcript
    finalTranscript = interimTranscript;
    statusEl.textContent = 'Processing...';
    processVoiceCommand(finalTranscript);
  } else {
    statusEl.textContent = 'No speech detected. Try again.';
    speak('I didn\'t hear anything. Please try again.');
  }
};

recognition.onerror = (event) => {
  isListening = false;
  micButton.classList.remove('listening');
  statusEl.textContent = `Error: ${event.error}`;
  console.error('Speech Recognition Error:', event.error);
};
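
Because the end event also fires after an error, the onend handler above will run once onerror has finished. With no transcript it replaces the error status with 'No speech detected'. One way to avoid that is to remember that an error occurred. The sketch below shows the idea with plain callbacks; createEndGuard and its callback names are hypothetical.

```javascript
// Sketch: keep the end handler from overwriting an error status.
// `createEndGuard` and the callback names are hypothetical.
function createEndGuard({ onTranscript, onSilence, onFailure }) {
  let errored = false;

  return {
    handleStart() {
      errored = false; // a new session starts clean
    },
    handleError(code) {
      errored = true;
      onFailure(code);
    },
    // `transcript` is whatever text was collected (final or interim)
    handleEnd(transcript) {
      if (errored) return; // the error was already reported
      if (transcript) {
        onTranscript(transcript);
      } else {
        onSilence();
      }
    }
  };
}
```

Wired into the page, recognition.onerror would call handleError(event.error) and recognition.onend would call handleEnd(finalTranscript || interimTranscript). Skipping any transcript collected before an error is a design choice of this sketch, not a requirement.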

3. Real-Time Speech Recognition

We show both final and interim transcription results in real time:

recognition.onresult = (event) => {
  interimTranscript = '';
  
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const transcript = event.results[i][0].transcript;
    
    if (event.results[i].isFinal) {
      finalTranscript += transcript;
    } else {
      interimTranscript += transcript;
    }
  }
  
  transcriptEl.innerHTML = `
    <div class="final">${finalTranscript}</div>
    <div class="interim"><em>${interimTranscript}</em></div>
  `;
};
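
The loop in the handler above can be pulled out into a pure function, and the text can be escaped before it is written with innerHTML. The following sketch assumes nothing beyond the result shape the handler already reads; collectTranscripts and escapeHtml are hypothetical helper names.

```javascript
// Sketch: the onresult loop as a pure function, plus HTML escaping.
// `collectTranscripts` and `escapeHtml` are hypothetical helper names.
function collectTranscripts(results, resultIndex, finalSoFar = '') {
  let finalText = finalSoFar;
  let interimText = '';

  for (let i = resultIndex; i < results.length; i++) {
    const transcript = results[i][0].transcript; // top alternative
    if (results[i].isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}

// Escape characters that have special meaning in HTML
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```

Inside onresult, this would be called as collectTranscripts(event.results, event.resultIndex, finalTranscript), with escapeHtml applied to both strings before they go into the template.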

4. Connecting Voice Input to Our Existing NLP Pipeline

The key integration point is the connection between the voice interface and our existing NLP pipeline from Part 2.

This is the step that turns a spoken command into a calendar event:

// Process the voice command
async function processVoiceCommand(text) {
  statusEl.textContent = 'Creating event...';
  resultEl.innerHTML = `<div class="loading">Processing your request <span></span></div>`;
  resultEl.style.display = 'block';
  
  try {
    // Get user timezone
    const timezone = Intl.DateTimeFormat().resolvedOptions().timeZone;
    
    // Send the command to our existing NLP API from Part 2
    const response = await fetch('/api/text-to-event', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Timezone': timezone
      },
      body: JSON.stringify({ text })
    });
    
    const data = await response.json();
    
    if (data.success) {
      // Format response for display
      const eventData = data.eventData;
      const startDate = new Date(eventData.startDateTime);
      const endDate = new Date(eventData.endDateTime);
      
      const formattedStart = startDate.toLocaleString();
      const formattedEnd = endDate.toLocaleString();
      
      resultEl.innerHTML = `
        <div class="success-message">
          <h3>Event Created Successfully!</h3>
          <p><strong>Summary:</strong> ${eventData.summary}</p>
          <p><strong>Start:</strong> ${formattedStart}</p>
          <p><strong>End:</strong> ${formattedEnd}</p>
          <p><a href="${data.eventLink}" target="_blank">View in Google Calendar</a></p>
        </div>
      `;
      
      statusEl.textContent = 'Event created!';
      
      // Provide voice feedback
      speak(`Event created successfully: ${eventData.summary} on ${startDate.toLocaleDateString()} at ${startDate.toLocaleTimeString()}.`);
    } else {
      throw new Error(data.error || 'Failed to create event');
    }
  } catch (error) {
    console.error('Error:', error);
    resultEl.innerHTML = `<div class="error-message">Error: ${error.message}</div>`;
    statusEl.textContent = 'Error creating event';
    speak('Sorry, I couldn\'t create that event. Please try again.');
  }
}

Notice how we’re simply sending the recognized speech text to our existing /api/text-to-event endpoint from Part 2.

This demonstrates the power of good architectural design — we can add new interface modes (like voice) without having to recreate our NLP and calendar integration logic.
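
The handler above reads the response body without checking the HTTP status or the shape of the payload. A small validation step, sketched below, makes failures easier to diagnose. The field names are the ones the handler already reads (success, eventData, summary, startDateTime, endDateTime, eventLink, error); parseEventResponse itself is a hypothetical helper.

```javascript
// Sketch: validate the backend response before using it.
// `parseEventResponse` is a hypothetical helper; `ok` is fetch's response.ok.
function parseEventResponse(ok, data) {
  if (!ok || !data || !data.success) {
    throw new Error((data && data.error) || 'Failed to create event');
  }

  const { summary, startDateTime, endDateTime } = data.eventData || {};
  const start = new Date(startDateTime);
  const end = new Date(endDateTime);

  // new Date(undefined) and unparseable strings both yield an invalid date
  if (!summary || Number.isNaN(start.getTime()) || Number.isNaN(end.getTime())) {
    throw new Error('Response is missing event details');
  }

  return { summary, start, end, link: data.eventLink || null };
}
```

In processVoiceCommand, this would be called as parseEventResponse(response.ok, await response.json()) inside the existing try block, so any thrown error reaches the same catch branch.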

5. Adding Voice Feedback

To create a conversational experience, we provide spoken feedback using the SpeechSynthesis interface of the Web Speech API:

function speak(text) {
  // Cancel any ongoing speech
  speechSynthesis.cancel();
  
  // Create a new utterance
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'en-US';
  utterance.rate = 1.0;
  utterance.pitch = 1.0;
  
  // Speak the text
  speechSynthesis.speak(utterance);
}

After creating an event, we provide spoken confirmation:

speak(`Event created successfully: ${eventData.summary} on ${startDate.toLocaleDateString()} at ${startDate.toLocaleTimeString()}.`);
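
By default, toLocaleTimeString() includes seconds in many locales, which sounds clumsy when read aloud. The sketch below builds the sentence with explicit Intl.DateTimeFormat options instead. buildConfirmation is a hypothetical helper, and the default locale and the chosen date fields are assumptions for illustration.

```javascript
// Sketch: build the spoken confirmation with explicit formatting options.
// `buildConfirmation` is a hypothetical helper; locale and fields are assumptions.
function buildConfirmation(summary, startDate, locale = 'en-US', timeZone) {
  // Weekday, month, and day only: no year, which is rarely useful when spoken
  const day = new Intl.DateTimeFormat(locale, {
    weekday: 'long',
    month: 'long',
    day: 'numeric',
    timeZone
  }).format(startDate);

  // Hours and minutes only: no seconds
  const time = new Intl.DateTimeFormat(locale, {
    hour: 'numeric',
    minute: '2-digit',
    timeZone
  }).format(startDate);

  return `Event created: ${summary} on ${day} at ${time}.`;
}
```

Leaving timeZone undefined uses the browser's own time zone, which matches what the toLocale* calls above do. The result would be passed straight to speak().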

We also make example commands clickable — when clicked, they’re spoken aloud so the user can hear how they should phrase their commands:

// Click example commands to hear them spoken
document.querySelectorAll('.command-example').forEach(example => {
  example.addEventListener('click', () => {
    speak(example.textContent);
  });
});

The Complete Voice-to-Action Pipeline

Here’s the full flow when a user speaks a command:

Initiation: User presses and holds the microphone button
Recognition: Web Speech API converts speech to text in real-time
Visualization: Recognized text appears in the transcript area
Processing: When the button is released, text is sent to our backend API
NLP: Our existing NLP service (from Part 2) processes the text
Creation: A calendar event is created in the user’s Google Calendar
Feedback: Both visual and spoken confirmations are provided

This creates a seamless experience where speaking a command leads directly to a real-world outcome.

Testing the Voice Interface

To try it out:

Clone the vivekvells/text-to-calendar repo and install the required dependencies
Start the server with npm start, and make sure Ollama is running
Go to http://localhost:3000/voice-commands.html
Press and hold the microphone button
Speak a command like “Schedule a team meeting tomorrow at 2pm”
Release the button and watch as your event is created

Troubleshooting Tips

If you’re having trouble with voice recognition:

Use Chrome for best compatibility
Ensure your microphone is working and has browser permission
Speak clearly at a normal pace
Try one of the example commands first
Check that your environment isn’t too noisy
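
Several of these tips correspond to error codes that the Web Speech API reports in event.error. Showing a hint instead of the raw code can help users fix the problem themselves. The mapping below is a sketch: the codes come from the API, while the wording and the describeRecognitionError name are illustrative.

```javascript
// Sketch: map SpeechRecognition error codes to user-facing hints.
// The codes are defined by the Web Speech API; the messages are illustrative.
const ERROR_HINTS = new Map([
  ['no-speech', 'No speech was detected. Try speaking closer to the microphone.'],
  ['audio-capture', 'No microphone was found. Check that one is connected.'],
  ['not-allowed', 'Microphone access was blocked. Allow it in the browser settings.'],
  ['network', 'The recognition service could not be reached. Check your connection.'],
  ['aborted', 'Listening was cancelled.']
]);

// `describeRecognitionError` is a hypothetical helper name
function describeRecognitionError(code) {
  return ERROR_HINTS.get(code) ?? `Speech recognition error: ${code}`;
}
```

In the onerror handler, statusEl.textContent could then be set to describeRecognitionError(event.error) rather than the bare code.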

Conclusion

In this tutorial, we’ve added voice command capabilities to our Calendar Assistant, creating a hands-free way to schedule events. We’ve learned how to:

Use the Web Speech API for speech recognition and synthesis
Create an intuitive press-and-hold interface
Process spoken commands with our existing NLP pipeline
Provide spoken feedback for a conversational experience

Our application now offers three different ways to create Google Calendar events:

REST API: Programmatic creation via structured JSON and the Google Calendar API (Part 1)
NLP Pipeline: Natural language text processing using an Ollama LLM to convert text to structured API calls (Part 2)
Voice Command Interface: Speech captured with the Web Speech API, which then flows through our established NLP pipeline (Part 3)

This makes our calendar assistant more accessible and versatile for different contexts and user preferences.

The complete code for this project is available on GitHub.


I hope you enjoyed this tutorial series on building a complete “Text to Action” system. Let me know in the comments what you’d like to see next!


Published at DZone with permission of Vivek Vellaiyappan Surulimuthu. See the original article here.
