DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • The Documentation Crisis Nobody Sees: Why AI Agents Are Breaking Faster Than Humans Can Document Them
  • A Practical Blueprint for Deploying Agentic Solutions
  • Beyond Manual Annotation: Engineering Self-Correcting Pseudo-Labeling Pipelines
  • Building Threat Intelligence Pipelines Using Python, APIs, and Elasticsearch

Trending

  • Using LLMs to Automate Data Cleaning and Transformation Pipelines
  • The Serverless Illusion: When “Pay for What You Use” Becomes Expensive
  • Monitoring Spring Boot Applications with Prometheus and Grafana
  • AI-Driven Integration in Large-Scale Agile Environments
  1. DZone
  2. Data Engineering
  3. Databases
  4. Mastering Audio Transcription With Gemini APIs: A Developer's Guide

Mastering Audio Transcription With Gemini APIs: A Developer's Guide

This guide walks you through implementing audio transcription in your applications using Gemini models, from basic implementations to real-time streaming solutions.

By 
Roline Stapny Saldanha user avatar
Roline Stapny Saldanha
·
Oct. 22, 25 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
5.7K Views

Join the DZone community and get the full member experience.

Join For Free

Understanding Audio Transcription via Gemini APIs

Gemini models are multimodal large language models. They can process and generate various types of data, including text, code, images, audio, and video. Gemini models also offer powerful audio transcription capabilities, enabling developers to convert spoken content into text. This can help in building a transcription service, creating subtitles for videos, and developing voice-enabled applications. If you are looking to convert speech to text using Gemini's powerful AI models, this comprehensive guide will show you how to implement audio transcription using different Gemini APIs. We will go from basic implementation to advanced real-time streaming.

Gemini supports the following audio formats as input: WAV, MP3, AIFF, AAC, OGG, and FLAC. We will look at generateContent, streamGenerateContent, and BidiGenerateContent(LiveAPI) APIs. You can find all supported APIs at https://ai.google.dev/api. generateContent is a standard REST endpoint, which processes the request and returns a single response. streamGenerateContent uses SSE (server-sent-events) to send partial responses as they are generated. This API is a better choice for applications like chatbots, which need a faster and more interactive experience. 

BidiGenerateContent or LiveAPI creates a stateful bidirectional streaming connection for real-time conversational use cases. Google provides the official Google GenAI SDK in multiple programming languages to access these APIs.

Diagram showing how Gemini-supported bidirectional streaming connection works

Pre-requisites

  1. Install Python version > 3.10
  2. Install pip
  3. Install Google GenAI SDK  pip install google-genai

Authentication and Setup

To use the Gemini API, you need an API key. You can create a new key for free at Google AI Studio. You will need to log in to your Google account or create a new one if you don’t have one. We will provide the API key explicitly while creating the GenAI client. More documentation is available at https://ai.google.dev/gemini-api/docs/api-key.

Audio Input to Gemini Model

You can provide audio data to Gemini in two ways. 

1. Upload the audio file to Google Cloud using files API. 

Python
 
myfile = client.files.upload(file="path/to/sample.mp3")
response = client.models.generate_content(
    model="gemini-2.5-flash", contents=["Transcribe the Audio", myfile]
)


2. Send audio file contents input as inline data bytes. In this method, the maximum request size, including audio data, is limited to 20 MB. Hence, for long audio files, you should use the files' API to pass the audio input. For our sample code in this guide, we are passing audio as inline data.

Python
 
with open('path/to/small-sample.wav', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
  model='gemini-2.5-flash',
  contents=[
    'Transcribe the Audio',
    types.Part.from_bytes(
      data=audio_bytes,
      mime_type='audio/wav',
    )
  ]
)


Basic Audio Transcription

To get audio transcription from generateContent and streamGenerateContent API, the user must prompt the model. Live API is slightly different and will be explained in a later section. A simple prompt to trigger audio transcription would be, “Transcribe this audio” or “Transcribe this audio in English.” You can also ask for a transcription with specific instructions in the prompt, as shown below.

Transcribe the audio following these formatting rules:
1. Currency:
- For exact dollar amounts, include "$" and ".00"
Example: "ten dollars" → "$10.00"

2. Dates:
- Use format MMDDYYYY without any separators
- Convert written months to numbers
Example: "December 20th 2020" → "12202020"
Example: "February third 2023" → "02032023"

Sample Code to Get Transcription Using GenerateContent  API

Python
 
from google import genai
from google.genai import types
import os
import base64

os.environ["GEMINI_API_KEY"]="GEMINI_API_KEY_VALUE"

with open('pathToAudio/audioFile.wav', 'rb') as f:
    audio_bytes = base64.b64encode(f.read()).decode('utf-8')
    client = genai.Client()
    response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
      'Transcribe this audio',
      types.Part.from_bytes(
        data=audio_bytes,
        mime_type='audio/wav',
      )
    ]
  )
print(response.text)


Sample Code to Get Transcription Using StreamGenerateContent API

from google import genai
from google.genai import types
import os
import base64

os.environ["GEMINI_API_KEY"]="GEMINI_API_KEY_VALUE"

with open('pathToAudio/audioFile.wav', 'rb') as f:
  audio_bytes = base64.b64encode(f.read()).decode('utf-8')
  client = genai.Client()
  response = client.models.generate_content_stream(
    model='gemini-2.5-flash',
    contents=[
      'Transcribe this audio',
      types.Part.from_bytes(
        data=audio_bytes,
        mime_type='audio/wav',
      )
    ]
  )
for chunk in response:
    print(chunk.text)

You could choose from any Gemini model that takes audio input and provides text output available here https://ai.google.dev/gemini-api/docs/models.

Advanced Usage With Live API

The Gemini Live API is still in preview and enables bidirectional real-time voice and video interaction with Gemini models. This is comparable with InvokeModelWithBidirectionalStream API for Amazon’s Nova Sonic model and Realtime API for OpenAI models. These APIs are designed for human-like interaction with the model and provides both audio and text as output. Live API can provide transcriptions for both input and output audio. To get input audio’s transcription, you need to send input_audio_transcription in setup config. Once transcription is enabled, we can access the ASR transcript from the response object as response.server_content.input_transcription. Since we only need the audio transcription , we can choose either TEXT or AUDIO for  response_modalities. Response modalities indicate the type of output from the model, i.e., text or voice. The input audio for Live API must be Raw 16-bit PCM audio at 16kHz, little-endian.

The following code is an example to get input audio transcription from Gemini Live API. As a pre-requisite, install the google-genai and soundfile libraries. The google-genai library provides Gemini API access, and soundfile helps in reading audio data from file.

Python
 
pip install google-genai
 pip install soundfile


Python
 
import asyncio
from google import genai
import soundfile as sf
import numpy as np
from google.genai.types import (
    AudioTranscriptionConfig,
    LiveConnectConfig,
    Blob,
)

class AudioFileManager:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 512  # Adjust chunk size as needed
        self.sleep_duration = self.chunk_size / self.sample_rate
        
    async def stream_audio_file(self, file_path):
        """Stream audio file in chunks."""
        audio_data, file_sample_rate = sf.read(file_path)
            
        # Convert to int16
        audio_data = (audio_data * 32767).astype(np.int16)
        
        # Stream audio in chunks
        for i in range(0, len(audio_data), self.chunk_size):
            chunk = audio_data[i:i + self.chunk_size]
            yield chunk.tobytes()
            await asyncio.sleep(self.sleep_duration)

async def process_responses(session):
    """Process responses from the model in real-time."""
    while True:  # Keep receiving responses
        turn = session.receive()
        
        async for message in turn:
            if message.go_away is not None:
                # The connection will soon be terminated
                print("Received go away Message")
                print(message.go_away.time_left)
                continue
            
            if message.server_content.interrupted is True:
                print("The session was interrupted")
                continue

            if message.server_content.input_transcription:
                print(f"Input Audio Transcription: {message.server_content.input_transcription.text}")

            if message.text:
                print(f"Model Response: {message.text}")

client = genai.Client(api_key="GEMINI_API_KEY_VALUE")
MODEL_ID = "gemini-live-2.5-flash-preview" # you could choose any live api supported models

CONFIG = LiveConnectConfig(
    response_modalities=["TEXT"],
    input_audio_transcription=AudioTranscriptionConfig(), # Enable Audio Input Transcription
)

async def main():
    async with client.aio.live.connect(model=MODEL_ID, config=CONFIG) as session:
        # Start response processing task
        response_task = asyncio.create_task(process_responses(session))

        audio_manager = AudioFileManager()
        async for chunk in audio_manager.stream_audio_file("pathToAudio/audioFile.wav"):
            await session.send_realtime_input(
                media=Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )
        await session.send_realtime_input(audio_stream_end=True)
        await response_task


if __name__ == "__main__":
    asyncio.run(main())


Things to Know

  1. In this guide, I have used an API key obtained from Google AI Studio to access Gemini APIs. However, you can also access these APIs using vertexAI by creating Google Cloud project credentials. For Google Cloud users who develop applications at scale, vertexAI should be used. Creating GenAI client enabling vertexAI looks like  client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)During testing, I noticed Live API starting sending Gibberish text after certain point in the audio.
  2. The generateContent and generateContentStream APIs can return empty responses.
  3. Gemini 2.5 Pro provides better transcription accuracy than Gemini 2.0 Flash. However, the transcription is received much faster when using Gemini 2.0 Flash.

In conclusion, Gemini APIs offer robust and flexible solutions for audio transcription needs. Through generateContent, streamGenerateContent, and the preview Live API, developers can choose the most suitable approach for their specific use cases. Whether it is simple one-off transcriptions or real-time streaming, Gemini APIs support them. With support for multiple audio formats, customizable prompts, and integration through the Google GenAI SDK, implementing audio transcription in applications has never been more accessible. As you begin implementing these APIs, remember to consider factors like file size limits, audio quality, model cost, and proper error handling to ensure optimal performance in your transcription services.

API

Opinions expressed by DZone contributors are their own.

Related

  • The Documentation Crisis Nobody Sees: Why AI Agents Are Breaking Faster Than Humans Can Document Them
  • A Practical Blueprint for Deploying Agentic Solutions
  • Beyond Manual Annotation: Engineering Self-Correcting Pseudo-Labeling Pipelines
  • Building Threat Intelligence Pipelines Using Python, APIs, and Elasticsearch

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook