Building a Real-Time Audio Transcription System With OpenAI’s Realtime API

Dive into real-time audio transcription with this hands-on guide, showing how to stream and transcribe audio in Java using OpenAI’s WebSocket API and GPT-4o models.

By Roline Stapny Saldanha · May. 27, 25 · Tutorial

OpenAI launched two new speech-to-text models, gpt-4o-mini-transcribe and gpt-4o-transcribe, in March 2025. These models support streaming transcription for both completed and ongoing audio. Audio transcription refers to converting audio input into text output (in text or JSON format). Transcribing already completed audio is the simpler case and can be done with OpenAI's standard transcription API.

Real-time transcription is useful in applications that require immediate feedback, such as voice assistants, live captioning, interactive voice applications, meeting transcription, and accessibility tools. OpenAI provides a Realtime Transcription API (currently in beta) that allows you to stream audio data and receive transcription results in real time. The Realtime API must be invoked over WebSocket or WebRTC; this article focuses on invoking it with a Java WebSocket implementation.


[Figure: Real-time transcription using a Java WebSocket client connected to the OpenAI Realtime API. Image designed using resources from Flaticon.com.]


What Are WebSockets?

WebSockets are a bidirectional communication protocol designed for ongoing communication between a client and a server. This differs from HTTP, which follows a request-response model: the client must submit a request in order to get a response from the service. WebSockets create a single, long-lived TCP connection that allows two-way communication between client and server. In real-time audio transcription, the client sends audio chunks as they become available, and the OpenAI API returns transcription results as it processes them.

WebSocket Methods

  • onOpen(): Invoked when the WebSocket connection is established.
  • onMessage(): Invoked when the client receives a message from the server. The core logic for detecting errors and processing transcription responses goes here.
  • onClose(): Invoked when the WebSocket connection is closed. The connection can be closed by either the client or the server.
  • onError(): Invoked when the WebSocket encounters an error. The connection is always closed after an error.
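
A minimal skeleton of these callbacks with the Java-WebSocket library looks like the following. The class name SkeletonClient and the log messages are illustrative only; the actual client used in this article is built up in the sections that follow.

Java
 
import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;
import java.net.URI;

// Illustrative skeleton of the four WebSocket callbacks
public class SkeletonClient extends WebSocketClient {

    public SkeletonClient(URI serverUri) {
        super(serverUri);
    }

    @Override
    public void onOpen(ServerHandshake handshake) {
        System.out.println("Connection established");
    }

    @Override
    public void onMessage(String message) {
        // Parse transcription events here
        System.out.println("Received: " + message);
    }

    @Override
    public void onClose(int code, String reason, boolean remote) {
        System.out.println("Closed by " + (remote ? "server" : "client") + ": " + reason);
    }

    @Override
    public void onError(Exception ex) {
        // The connection is always closed after an error
        ex.printStackTrace();
    }
}
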

Implementation

In your Java project, add the Java-WebSocket dependency to your pom.xml file. Pick the latest stable version from the MVN Repository.

XML
 
<!--WebSocket client -->
<dependency>
    <groupId>org.java-websocket</groupId>
    <artifactId>Java-WebSocket</artifactId>
    <version>1.6.0</version>
</dependency>


Create two classes: SimpleOpenAITranscription and TranscriptionWebSocketClient. TranscriptionWebSocketClient contains the WebSocket client implementation, and SimpleOpenAITranscription contains the main method that coordinates audio streaming and transcription. In the code examples below, a comment indicates the class each snippet belongs to.

Establish a WebSocket Connection to OpenAI's API

The OpenAI API uses API keys for authentication. To create an API key, log in to your OpenAI account and go to the organization settings. The connection is established by invoking client.connect() from the main method. The WebSocket URL is wss://api.openai.com/v1/realtime?intent=transcription.

Java
 
//TranscriptionWebSocketClient
// Set the request headers
private static class TranscriptionWebSocketClient extends WebSocketClient {

    public TranscriptionWebSocketClient(URI serverUri) {
        super(serverUri, createHeaders());
    }

    private static Map<String, String> createHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Authorization", "Bearer " + API_KEY);
        headers.put("openai-beta", "realtime=v1");
        return headers;
    }
}


Java
 
//SimpleOpenAITranscription

TranscriptionWebSocketClient client = new TranscriptionWebSocketClient(new URI(WEBSOCKET_URL));
client.connect();

// Wait until the WebSocket connection is established
while (!client.isOpen()) {
    System.out.println("Waiting for WebSocket connection...");
    Thread.sleep(100);
}
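

As an alternative to polling isOpen(), the Java-WebSocket library also offers connectBlocking(), which blocks until the handshake completes. A minimal sketch, not part of the original implementation:

Java
 
//SimpleOpenAITranscription

// Alternative: block until the handshake completes instead of polling isOpen()
TranscriptionWebSocketClient client = new TranscriptionWebSocketClient(new URI(WEBSOCKET_URL));
if (!client.connectBlocking()) {
    throw new IllegalStateException("Failed to connect to " + WEBSOCKET_URL);
}
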


Set Up Transcription Session

A config message is required when creating the transcription session. The model name in the config can be either gpt-4o-transcribe or gpt-4o-mini-transcribe, depending on your application's requirements. The language (e.g., "en", "fr", "ko") is an optional field, but specifying it improves accuracy. The turn_detection field is used to set up Voice Activity Detection (VAD). When VAD is enabled, OpenAI automatically detects silence in the audio and commits the audio message; once committed, the server responds with the transcription result. I have disabled VAD for simplicity.

Java
 
// SimpleOpenAITranscription
client.sendInitialConfig();


Java
 
//TranscriptionWebSocketClient

public void sendInitialConfig() {
    JSONObject config = new JSONObject()
        .put("type", "transcription_session.update")
        .put("session", new JSONObject()
            .put("input_audio_format", "pcm16")
            .put("input_audio_transcription", new JSONObject()
                .put("model", "gpt-4o-transcribe")
                .put("language", "en"))
            .put("turn_detection", JSONObject.NULL)
            .put("input_audio_noise_reduction", new JSONObject()
                .put("type", "near_field")));
    send(config.toString());
}
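

If you prefer server-side turn detection over manual commits, the turn_detection field can be set to a server_vad object instead of JSONObject.NULL. The sketch below is a hedged variant of the session payload; the field values are illustrative, so verify them against the Realtime API documentation:

Java
 
//TranscriptionWebSocketClient

// Sketch: the same session payload with server-side VAD enabled instead of manual commits.
// Field values are illustrative; check the Realtime API docs for supported options.
JSONObject vadSession = new JSONObject()
    .put("input_audio_format", "pcm16")
    .put("input_audio_transcription", new JSONObject()
        .put("model", "gpt-4o-transcribe")
        .put("language", "en"))
    .put("turn_detection", new JSONObject()
        .put("type", "server_vad")
        .put("prefix_padding_ms", 300)
        .put("silence_duration_ms", 500));
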


Stream Audio Data in Chunks

The audio chunks are sent using the "input_audio_buffer.append" message type. When VAD is disabled, we must commit the messages manually in order to receive a transcription result.

Java
 
//SimpleOpenAITranscription
File audioFile = new File(args[0]);
AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(audioFile);

byte[] buffer = new byte[CHUNK_SIZE];
int bytesRead;

while ((bytesRead = audioInputStream.read(buffer)) != -1) {
    client.sendAudioChunk(buffer, bytesRead);
}

client.commitAndClearBuffer();


Java
 
//TranscriptionWebSocketClient
public void sendAudioChunk(byte[] buffer, int bytesRead) {
    // Create a new byte array with only the read bytes
    byte[] audioData = Arrays.copyOf(buffer, bytesRead);
    String base64Audio = Base64.getEncoder().encodeToString(audioData);
    JSONObject audioMessage = new JSONObject()
        .put("type", "input_audio_buffer.append")
        .put("audio", base64Audio);
    send(audioMessage.toString());
}

public void commitAndClearBuffer() {
    send(new JSONObject().put("type", "input_audio_buffer.commit").toString());
    send(new JSONObject().put("type", "input_audio_buffer.clear").toString());
}


Receive and Process the Responses in Real-Time

The client receives messages from the server through the WebSocket's onMessage() method. The transcript is present in the conversation.item.input_audio_transcription.completed message type. The application should listen for these events and display the transcription response to users.

Java
 
//TranscriptionWebSocketClient
@Override
public void onMessage(String message) {
    System.out.println("message: " + message);
    JSONObject response = new JSONObject(message);
    if ("conversation.item.input_audio_transcription.completed".equals(response.get("type"))) {
        System.out.println("Transcription: " + response.getString("transcript"));
        this.close();
    }
}

These are a few other response types that are helpful to track:

  • transcription_session.created: The session has been created.
  • transcription_session.updated: The session has been updated based on the config payload.
  • input_audio_buffer.committed: The audio was committed by the client.
  • conversation.item.input_audio_transcription.delta: A partial transcription received from the server.
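
To surface partial results as they stream in, onMessage() can also branch on the delta events. A sketch, assuming the delta event carries the partial text in a "delta" field:

Java
 
//TranscriptionWebSocketClient
@Override
public void onMessage(String message) {
    JSONObject response = new JSONObject(message);
    String type = response.getString("type");
    if ("conversation.item.input_audio_transcription.delta".equals(type)) {
        // Assumed field name "delta" for the partial text; verify against the API docs
        System.out.print(response.optString("delta"));
    } else if ("conversation.item.input_audio_transcription.completed".equals(type)) {
        System.out.println("\nTranscription: " + response.getString("transcript"));
        this.close();
    }
}
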

Tips to Improve Your Application

  1. To better track the transcription session, create sessionCreatedLatch and streamLatch variables of type CountDownLatch. Use sessionCreatedLatch to hold off sending audio until the "transcription_session.created" event is received, and use streamLatch to keep the audio stream open until the client has sent all events and received all transcriptions back (see the sketch after this list).
  2. The Voice Activity Detection (VAD) feature automatically detects the start and end of speech turns. When using the server_vad mode, you might need to adjust silence_duration_ms if silences in your audio are incorrectly identified.
  3. Never hardcode your API key in your application. You could use AWS Secrets Manager to store the secrets.
  4. When VAD is disabled, commit and clear the audio buffer at periodic intervals to avoid exceeding the input buffer limit.
  5. Adding a small delay between events prevents overwhelming the connection. OpenAI's response contains gibberish text when the connection is overwhelmed.
  6. Close connections properly on errors to free up resources. The WebSocket connection times out after 30 minutes.
  7. As OpenAI's real-time audio transcription feature is actively developed, regularly check the official documentation for updates.
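
A minimal sketch of the latch idea from tip 1. The latches are assumed to be counted down from onMessage() when the corresponding events arrive; the wiring and the timeout value are illustrative, not the article's full implementation.

Java
 
//SimpleOpenAITranscription
// Requires java.util.concurrent.CountDownLatch and java.util.concurrent.TimeUnit

// Gate audio streaming on session creation and keep the process alive until the
// final transcript arrives. Count the latches down in onMessage() when the
// "transcription_session.created" and completed transcription events are received.
CountDownLatch sessionCreatedLatch = new CountDownLatch(1);
CountDownLatch streamLatch = new CountDownLatch(1);

client.connect();
sessionCreatedLatch.await();               // wait for transcription_session.created
// ... send audio chunks and commit ...
streamLatch.await(2, TimeUnit.MINUTES);    // wait for the final transcript (illustrative timeout)
client.close();
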

By following this guide, you should now be able to integrate OpenAI's Realtime Transcription API into your Java applications.
