[Part-3] Text to Action: Adding Voice Control to Your Smart Calendar

In Part 3 of “Text to Action,” we add voice commands to our calendar assistant, converting speech to events using the Web Speech API and NLP pipeline.

By 
Vivek Vellaiyappan Surulimuthu
·
Jul. 23, 25 · Tutorial

Welcome to the third installment of our “Text to Action” series, where we’re building intelligent systems that transform natural language into real-world actions using AI.

In "[Part-1] Text to Action: Build a Smart Calendar AI Assistant," we established our foundation by creating an Express.js backend that connects to Google Calendar’s API. This gave us the ability to programmatically create calendar events through an exposed API endpoint.

In "[Part-2] Text to Action: Words to Calendar Events," we added natural language processing (NLP) capabilities, enabling users to type descriptions like “Schedule a team meeting tomorrow at 3pm” and have our system intelligently transform these words into calendar events.

Today, we’re taking another step forward by adding voice command capabilities to create a truly hands-free experience. Imagine being able to simply speak, “Schedule a lunch meeting tomorrow at noon,” and have your calendar updated automatically — no typing required!

What We’re Building

We’re adding a voice interface to our existing application that will:

  1. Listen for user speech using the Web Speech API
  2. Convert spoken words to text
  3. Process the text using our existing NLP pipeline
  4. Create calendar events based on the spoken commands
  5. Provide voice feedback to confirm actions

This creates a complete voice-to-action pipeline that demonstrates how modern web technologies can create powerful, accessible interfaces.

The complete code is available on GitHub.


The Voice-to-Action Flow

Before diving into implementation, let’s understand the complete flow:

  1. Simplified flow: Speech → Text → NLP processing → Calendar event creation
  2. High-level flow, starting from your spoken command:

  • → Text conversion using the Web Speech API
  • → Text sent for NLP processing (existing backend endpoint /api/text-to-event)
  • → Structured JSON payload (generated by the LLM)
  • → Google Calendar event (created through the Google Calendar API from the JSON payload)

Spoken feedback at the end confirms the result and completes the conversational experience.

Project Architecture

Our application follows a modular architecture where each tutorial part builds on the previous ones:

  • Backend: Express.js server that handles API requests and connects to Google Calendar
  • NLP Processing: Uses Ollama to understand natural language (from Part 2)
  • Frontend: Three separate interfaces that all connect to the same backend API

For styling, we’ve organized our CSS into separate files for each part of the tutorial in the /public/css/ directory, making it easy to understand which styles belong to which functionality.

Core Implementation Code-Walkthrough

1. Creating a Press-and-Hold Interface

Rather than a simple toggle button, we use a press-and-hold interface similar to Google Assistant, which feels more intuitive. Users press the microphone button to speak and release it when they’re done:

// Press and hold to speak pattern
micButton.addEventListener('mousedown', startListening); // Desktop
micButton.addEventListener('touchstart', startListening); // Mobile
document.addEventListener('mouseup', stopListening); // Desktop
document.addEventListener('touchend', stopListening); // Mobile

function startListening(e) {
  // Prevent default behavior for touch events
  if (e.type === 'touchstart') {
    e.preventDefault();
  }
  
  // Only start if we're not already listening
  if (!isListening) {
    // Reset transcripts
    finalTranscript = '';
    interimTranscript = '';
    
    // Start speech recognition
    recognition.start();
    isListening = true;
    statusEl.textContent = 'Listening...';
    micButton.classList.add('listening');
    transcriptEl.innerHTML = '<em>Listening...</em>';
  }
}

function stopListening() {
  // Only process if we were listening
  if (isListening) {
    recognition.stop();
    isListening = false;
    statusEl.textContent = 'Processing...';
    micButton.classList.remove('listening');
    
    // Processing will happen in the onend handler
  }
}
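One edge case with this pattern is worth guarding against: per the Web Speech API specification, `recognition.start()` throws an `InvalidStateError` if it is called while a session is still active, which can happen when the user presses again before the previous `onend` has fired. The following is a minimal sketch of such a guard, not the project's actual code. `createHoldController` is a hypothetical helper name, and it assumes only that the object passed in has `start()` and `stop()` methods.

```javascript
// Hypothetical helper: tracks the session state so start() is never called
// while a previous session is still winding down.
function createHoldController(recognition) {
  let state = 'idle'; // 'idle' | 'listening' | 'stopping'

  return {
    get state() {
      return state;
    },

    // Call on mousedown / touchstart
    press() {
      if (state !== 'idle') return false; // ignore presses during an active session
      try {
        recognition.start();
      } catch (err) {
        return false; // e.g. InvalidStateError thrown by the browser
      }
      state = 'listening';
      return true;
    },

    // Call on mouseup / touchend
    release() {
      if (state !== 'listening') return false;
      state = 'stopping';
      recognition.stop();
      return true;
    },

    // Call from recognition.onend so the next press is accepted again
    ended() {
      state = 'idle';
    },
  };
}
```

The three-state design means a press that arrives between `stop()` and `onend` is simply ignored instead of raising an exception.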


2. Understanding the Web Speech API

The Web Speech API has two main components we’ll use:

  • SpeechRecognition: Converts spoken words to text
  • SpeechSynthesis: Converts text to spoken words (voice playback)

Browser support varies, with Chrome offering the best compatibility. A speech recognition session follows one of the two event flows shown below.

Speech Recognition Event Flow on Success

onstart → onresult (multiple times) → onend


Speech Recognition Event Flow on Failure

onstart → onresult (maybe) → onerror → onend

Note that onend also fires after onerror, so the onend handler runs in both cases.
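To check the order of events in your own browser, a small tracer can help. This is a sketch that assumes the object passed in exposes the standard `onstart`, `onresult`, `onerror`, and `onend` handler properties; `traceRecognitionEvents` is a hypothetical name. According to the Web Speech API specification, the `end` event is generated whenever a session finishes, including after an error.

```javascript
// Hypothetical helper: records the order in which recognition events fire.
// Call it after your own handlers are assigned, because it wraps them.
function traceRecognitionEvents(recognition, log = []) {
  for (const name of ['start', 'result', 'error', 'end']) {
    const prop = `on${name}`;
    const previous = recognition[prop]; // keep any handler that is already set
    recognition[prop] = (event) => {
      log.push(name);
      if (typeof previous === 'function') {
        previous.call(recognition, event);
      }
    };
  }
  return log;
}
```

In the browser, you could call `const log = traceRecognitionEvents(recognition);` once and print `log.join(' → ')` after each session ends.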


Speech Recognition Setup and Configuration

// Check for Speech Recognition API support
if (!('webkitSpeechRecognition' in window) && !('SpeechRecognition' in window)) {
  alert('Your browser does not support the Speech Recognition API. Please use Chrome, Edge, or Safari.');
  document.getElementById('micButton').disabled = true;
  document.getElementById('status').textContent = 'Speech recognition not available';
  return;
}

// Initialize speech recognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

// Initialize speech synthesis
const speechSynthesis = window.speechSynthesis;

// Configure recognition
recognition.continuous = false;
recognition.interimResults = true;
recognition.lang = 'en-US';


Speech Recognition Event Handlers

recognition.onstart = () => {
  isListening = true;
  statusEl.textContent = 'Listening...';
  micButton.classList.add('listening');
};

// Set in onerror so that onend (which also fires after an error) can skip processing
let hadError = false;

recognition.onend = () => {
  isListening = false;
  // Recognition can also end on its own after a pause, so reset the button here too
  micButton.classList.remove('listening');

  // Keep the error message on screen instead of overwriting it
  if (hadError) {
    hadError = false;
    return;
  }

  if (finalTranscript) {
    statusEl.textContent = 'Processing...';
    processVoiceCommand(finalTranscript);
  } else if (interimTranscript) {
    // If we only have interim results when recognition ends,
    // use those as our final transcript
    finalTranscript = interimTranscript;
    statusEl.textContent = 'Processing...';
    processVoiceCommand(finalTranscript);
  } else {
    statusEl.textContent = 'No speech detected. Try again.';
    speak('I didn\'t hear anything. Please try again.');
  }
};

recognition.onerror = (event) => {
  hadError = true;
  isListening = false;
  micButton.classList.remove('listening');
  statusEl.textContent = `Error: ${event.error}`;
  console.error('Speech Recognition Error:', event.error);
};
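The raw `event.error` value is a short code such as `not-allowed`, which is not very helpful to end users. The sketch below maps some of the error codes defined by the Web Speech API to friendlier text. The `describeRecognitionError` name and the message wording are illustrative assumptions, not part of the original project.

```javascript
// Error codes come from the Web Speech API; the messages are illustrative.
const RECOGNITION_ERROR_MESSAGES = new Map([
  ['no-speech', 'No speech was detected. Please try again.'],
  ['audio-capture', 'No microphone could be captured. Check that one is connected.'],
  ['not-allowed', 'Microphone access was denied. Allow it in your browser settings.'],
  ['network', 'A network error interrupted speech recognition.'],
  ['aborted', 'Listening was cancelled.'],
]);

// Hypothetical helper: falls back to the raw code for anything not in the map.
function describeRecognitionError(code) {
  return RECOGNITION_ERROR_MESSAGES.get(code) || `Speech recognition error: ${code}`;
}
```

Inside `onerror`, the status line could then be set with `statusEl.textContent = describeRecognitionError(event.error);`.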


3. Real-Time Speech Recognition

We show both final and interim transcription results in real-time:

recognition.onresult = (event) => {
  interimTranscript = '';
  
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const transcript = event.results[i][0].transcript;
    
    if (event.results[i].isFinal) {
      finalTranscript += transcript;
    } else {
      interimTranscript += transcript;
    }
  }
  
  transcriptEl.innerHTML = `
    <div class="final">${finalTranscript}</div>
    <div class="interim"><em>${interimTranscript}</em></div>
  `;
};
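The accumulation logic in this handler can also be written as a pure function, which makes it easy to test without a microphone. The sketch below is one possible arrangement: `collectTranscripts` and `renderTranscripts` are hypothetical names, and the rendering step assumes two separate elements for final and interim text. It writes with `textContent` rather than `innerHTML`, so recognized text is never interpreted as markup.

```javascript
// Hypothetical helper: splits the new results of an onresult event into final and interim text.
function collectTranscripts(event, previousFinal = '') {
  let finalTranscript = previousFinal;
  let interimTranscript = '';

  // Only results from resultIndex onward have changed since the last event
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const text = result[0].transcript; // first alternative of this result
    if (result.isFinal) {
      finalTranscript += text;
    } else {
      interimTranscript += text;
    }
  }

  return { finalTranscript, interimTranscript };
}

// Hypothetical helper: writes the transcripts as plain text, not HTML.
function renderTranscripts(finalEl, interimEl, transcripts) {
  finalEl.textContent = transcripts.finalTranscript;
  interimEl.textContent = transcripts.interimTranscript;
}
```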


4. Connecting Voice Input to Our Existing NLP Pipeline

The key integration point is the connection between the voice interface and our existing NLP pipeline from Part 2.

This is where a spoken command becomes a calendar event:

// Process the voice command
async function processVoiceCommand(text) {
  statusEl.textContent = 'Creating event...';
  resultEl.innerHTML = `<div class="loading">Processing your request <span></span></div>`;
  resultEl.style.display = 'block';
  
  try {
    // Get user timezone
    const timezone = Intl.DateTimeFormat().resolvedOptions().timeZone;
    
    // Send the command to our existing NLP API from Part 2
    const response = await fetch('/api/text-to-event', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Timezone': timezone
      },
      body: JSON.stringify({ text })
    });
    
    const data = await response.json();
    
    if (data.success) {
      // Format response for display
      const eventData = data.eventData;
      const startDate = new Date(eventData.startDateTime);
      const endDate = new Date(eventData.endDateTime);
      
      const formattedStart = startDate.toLocaleString();
      const formattedEnd = endDate.toLocaleString();
      
      resultEl.innerHTML = `
        <div class="success-message">
          <h3>Event Created Successfully!</h3>
          <p><strong>Summary:</strong> ${eventData.summary}</p>
          <p><strong>Start:</strong> ${formattedStart}</p>
          <p><strong>End:</strong> ${formattedEnd}</p>
          <p><a href="${data.eventLink}" target="_blank">View in Google Calendar</a></p>
        </div>
      `;
      
      statusEl.textContent = 'Event created!';
      
      // Provide voice feedback
      speak(`Event created successfully: ${eventData.summary} on ${startDate.toLocaleDateString()} at ${startDate.toLocaleTimeString()}.`);
    } else {
      throw new Error(data.error || 'Failed to create event');
    }
  } catch (error) {
    console.error('Error:', error);
    resultEl.innerHTML = `<div class="error-message">Error: ${error.message}</div>`;
    statusEl.textContent = 'Error creating event';
    speak('Sorry, I couldn\'t create that event. Please try again.');
  }
}


Notice how we’re simply sending the recognized speech text to our existing /api/text-to-event endpoint from Part 2.

This demonstrates the power of good architectural design — we can add new interface modes (like voice) without having to recreate our NLP and calendar integration logic.
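Because the HTTP request is the only contract between the voice interface and the backend, it can be isolated in a small function and checked without a browser. The following is a sketch under that assumption. `buildEventRequest` is a hypothetical helper; the endpoint, header names, and body shape are taken from the code above, while the empty-transcript check is an addition.

```javascript
// Hypothetical helper: builds the fetch arguments for the existing /api/text-to-event endpoint.
function buildEventRequest(text, timezone) {
  const trimmed = String(text).trim();
  if (!trimmed) {
    // Added check: avoid a round trip for an empty transcript
    throw new Error('Cannot create an event from an empty transcript');
  }

  return {
    url: '/api/text-to-event',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Timezone': timezone,
      },
      body: JSON.stringify({ text: trimmed }),
    },
  };
}

// Possible usage inside processVoiceCommand:
//   const { url, options } = buildEventRequest(text, timezone);
//   const response = await fetch(url, options);
```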

5. Adding Voice Feedback

To create a conversational experience, we provide spoken feedback using the SpeechSynthesis part of the Web Speech API:

function speak(text) {
  // Cancel any ongoing speech
  speechSynthesis.cancel();
  
  // Create a new utterance
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'en-US';
  utterance.rate = 1.0;
  utterance.pitch = 1.0;
  
  // Speak the text
  speechSynthesis.speak(utterance);
}


After creating an event, we provide spoken confirmation:

speak(`Event created successfully: ${eventData.summary} on ${startDate.toLocaleDateString()} at ${startDate.toLocaleTimeString()}.`);
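One detail to keep in mind: `toLocaleDateString()` and `toLocaleTimeString()` format with the browser's default locale, while the utterance is spoken with `lang = 'en-US'`. If the two differ, the date may be read out awkwardly. A minimal sketch of one way to keep them aligned follows, using `Intl.DateTimeFormat` with an explicit locale. `formatConfirmation` is a hypothetical helper, and its wording is only an example.

```javascript
// Hypothetical helper: builds the spoken confirmation with an explicit locale.
// timeZone is optional; when omitted, the runtime's default time zone is used.
function formatConfirmation(summary, startDate, { locale = 'en-US', timeZone } = {}) {
  const date = new Intl.DateTimeFormat(locale, { dateStyle: 'long', timeZone }).format(startDate);
  const time = new Intl.DateTimeFormat(locale, { timeStyle: 'short', timeZone }).format(startDate);
  return `Event created successfully: ${summary} on ${date} at ${time}.`;
}
```

The result can be passed straight to `speak()`, using the same locale string that is assigned to `utterance.lang`.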


We also make example commands clickable — when clicked, they’re spoken aloud so the user can hear how they should phrase their commands:

// Click example commands to hear them spoken
document.querySelectorAll('.command-example').forEach(example => {
  example.addEventListener('click', () => {
    speak(example.textContent);
  });
});


The Complete Voice-to-Action Pipeline

Here’s the full flow when a user speaks a command:

  1. Initiation: User presses and holds the microphone button
  2. Recognition: Web Speech API converts speech to text in real-time
  3. Visualization: Recognized text appears in the transcript area
  4. Processing: When the button is released, text is sent to our backend API
  5. NLP: Our existing NLP service (from Part 2) processes the text
  6. Creation: A calendar event is created in the user’s Google Calendar
  7. Feedback: Both visual and spoken confirmations are provided

This creates a seamless experience where speaking a command leads directly to a real-world outcome.

Testing the Voice Interface

To try it out:

  1. Clone the vivekvells/text-to-calendar repo and install the required dependencies
  2. Make sure Ollama is running, then start the server with npm start
  3. Go to http://localhost:3000/voice-commands.html
  4. Press and hold the microphone button
  5. Speak a command like “Schedule a team meeting tomorrow at 2pm”
  6. Release the button and watch as your event is created

Troubleshooting Tips

If you’re having trouble with voice recognition:

  • Use Chrome for best compatibility
  • Ensure your microphone is working and has browser permission
  • Speak clearly at a normal pace
  • Try one of the example commands first
  • Check that your environment isn’t too noisy

Conclusion

In this tutorial, we’ve added voice command capabilities to our Calendar Assistant, creating a hands-free way to schedule events. We’ve learned how to:

  • Use the Web Speech API for speech recognition and synthesis
  • Create an intuitive press-and-hold interface
  • Process spoken commands with our existing NLP pipeline
  • Provide spoken feedback for a conversational experience

Our application now offers three different ways to create Google Calendar events:

  1. REST API (Part 1): Programmatic creation via structured JSON and the Google Calendar API
  2. NLP Pipeline (Part 2): Natural language text processing using an Ollama LLM to convert text into structured API calls
  3. Voice Command Interface (Part 3): Speech captured with the Web Speech API, which then flows through our established NLP pipeline

This makes our calendar assistant more accessible and versatile for different contexts and user preferences.

The complete code for this project is available on GitHub.

Resources

  • Web Speech API Documentation
  • Google Calendar API
  • Ollama Documentation

I hope you enjoyed this tutorial series on building a complete “Text to Action” system. Let me know in the comments what you’d like to see next!


Published at DZone with permission of Vivek Vellaiyappan Surulimuthu. See the original article here.
