Building a Real-Time Audio Transcription System With OpenAI’s Realtime API
Dive into real-time audio transcription with this hands-on guide, showing how to stream and transcribe audio in Java using OpenAI’s WebSocket API and GPT-4o models.
OpenAI launched two new speech-to-text models, gpt-4o-mini-transcribe and gpt-4o-transcribe, in March 2025. These models support streaming transcription for both completed and ongoing audio. Audio transcription is the process of converting audio input into text output (the output format is text or JSON). Transcribing already-completed audio is much simpler and uses the transcription API provided by OpenAI.
Real-time transcription is useful in applications that require immediate feedback, such as voice assistants, live captioning, interactive voice applications, meeting transcription, and accessibility tools. OpenAI provides a Realtime Transcription API (currently in beta) that lets you stream audio data and receive transcription results in real time. The Realtime API is invoked over WebSocket or WebRTC; this article focuses on invoking it with a Java WebSocket implementation.
What Are WebSockets
WebSockets are a bidirectional communication protocol suited to ongoing communication between a client and a server. This differs from HTTP, which follows a request-response model in which the client must submit a request to get a response from the service. A WebSocket creates a single, long-lived TCP connection that allows two-way communication between client and server. In real-time audio transcription, the client sends audio chunks as they become available, and the OpenAI API returns transcription results as it processes them.
WebSocket Methods
- onOpen(): Invoked when the WebSocket connection is established.
- onMessage(): Invoked when the client receives a message from the server. The core logic to detect errors and process the transcription response belongs here.
- onClose(): Invoked when the WebSocket connection is closed. The connection can be closed by either the client or the server.
- onError(): Invoked when the WebSocket encounters an error. The WebSocket always closes the connection after an error.
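To make these callbacks concrete, here is a minimal skeleton of a client built on the Java-WebSocket library that we add in the next section. The class name is only illustrative; the actual client used in this article is built later.
import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;
import java.net.URI;

// Illustrative skeleton showing where each lifecycle method fits.
public class LifecycleSketch extends WebSocketClient {

    public LifecycleSketch(URI serverUri) {
        super(serverUri);
    }

    @Override
    public void onOpen(ServerHandshake handshake) {
        // Connection established; it is now safe to start sending messages.
        System.out.println("Connected: " + handshake.getHttpStatusMessage());
    }

    @Override
    public void onMessage(String message) {
        // Every server event arrives here; parse it and route by type.
        System.out.println("Received: " + message);
    }

    @Override
    public void onClose(int code, String reason, boolean remote) {
        // remote == true means the server initiated the close.
        System.out.println("Closed (" + code + "): " + reason);
    }

    @Override
    public void onError(Exception ex) {
        // The connection is closed after an error is reported.
        ex.printStackTrace();
    }
}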
Implementation
In your Java project, add the Java-WebSocket dependency to your pom.xml file. Pick the latest stable version from the MVN Repository.
<!--WebSocket client -->
<dependency>
    <groupId>org.java-websocket</groupId>
    <artifactId>Java-WebSocket</artifactId>
    <version>1.6.0</version>
</dependency>
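The snippets later in this article also use JSONObject from the org.json library to build event payloads. If it is not already on your classpath, add it as well; the version shown is one recent release, so pick the latest from the MVN Repository.
<!--JSON payloads (org.json JSONObject used in the snippets below) -->
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20240303</version>
</dependency>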
Create two classes: SimpleOpenAITranscription and TranscriptionWebSocketClient. TranscriptionWebSocketClient will contain the WebSocket client implementation, and SimpleOpenAITranscription will contain the main method that coordinates audio streaming and transcription. In the code examples below, a comment indicates which class each snippet belongs to.
Establish a WebSocket connection to OpenAI's API
The OpenAI API uses API keys for authentication. To create an API key, log in to your OpenAI account and go to the organization settings. The connection is established by invoking client.connect() from the main method. The WebSocket URL is wss://api.openai.com/v1/realtime?intent=transcription.
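The snippets below also refer to a few constants that are not spelled out elsewhere in this article. One reasonable way to define them is shown here; the chunk size is an assumption rather than a prescribed value, and the API key is read from an environment variable instead of being hardcoded.
//SimpleOpenAITranscription
// Assumed constants used throughout the snippets below.
private static final String API_KEY = System.getenv("OPENAI_API_KEY"); // never hardcode the key
private static final String WEBSOCKET_URL = "wss://api.openai.com/v1/realtime?intent=transcription";
private static final int CHUNK_SIZE = 8192; // bytes of audio per append message (illustrative)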
//TranscriptionWebSocketClient
// Set the request headers
private static class TranscriptionWebSocketClient extends WebSocketClient {

    public TranscriptionWebSocketClient(URI serverUri) {
        super(serverUri, createHeaders());
    }

    private static Map<String, String> createHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Authorization", "Bearer " + API_KEY);
        headers.put("openai-beta", "realtime=v1");
        return headers;
    }
}
//SimpleOpenAITranscription
TranscriptionWebSocketClient client = new TranscriptionWebSocketClient(new URI(WEBSOCKET_URL));
client.connect();

// Wait until the WebSocket connection is established
while (!client.isOpen()) {
    System.out.println("Websocket is not open.");
    Thread.sleep(100);
}
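The polling loop above works, but the Java-WebSocket library also provides connectBlocking(), which waits for the handshake to finish before returning. A minimal alternative sketch:
//SimpleOpenAITranscription (alternative to the polling loop)
TranscriptionWebSocketClient client = new TranscriptionWebSocketClient(new URI(WEBSOCKET_URL));
// connectBlocking() returns true once the handshake succeeds, false otherwise.
if (!client.connectBlocking()) {
    throw new IllegalStateException("Could not open the WebSocket connection");
}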
Set Up Transcription Session
A config message is required when creating the transcription session. The model name in the config can be either gpt-4o-transcribe or gpt-4o-mini-transcribe, based on your application's requirements. The language (e.g., "en", "fr", "ko") is an optional field, but specifying it improves accuracy. The turn_detection field is used to set up Voice Activity Detection (VAD). When VAD is enabled, OpenAI automatically detects silence in the audio and commits the audio message; once committed, the server responds with the transcription result. I have disabled VAD for simplicity.
// SimpleOpenAITranscription
client.sendInitialConfig();
//TranscriptionWebSocketClient
public void sendInitialConfig() {
    JSONObject config = new JSONObject()
        .put("type", "transcription_session.update")
        .put("session", new JSONObject()
            .put("input_audio_format", "pcm16")
            .put("input_audio_transcription", new JSONObject()
                .put("model", "gpt-4o-transcribe")
                .put("language", "en"))
            .put("turn_detection", JSONObject.NULL)
            .put("input_audio_noise_reduction", new JSONObject()
                .put("type", "near_field")));
    send(config.toString());
}
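If you prefer to let the server commit audio automatically, replace the turn_detection value with a server_vad object instead of JSONObject.NULL. The sketch below is a variant of the method above; the numeric values are illustrative assumptions, and you may need to tune silence_duration_ms if silences in your audio are misdetected (see the tips at the end of this article).
//TranscriptionWebSocketClient (variant with server-side VAD enabled)
public void sendInitialConfigWithVad() {
    JSONObject config = new JSONObject()
        .put("type", "transcription_session.update")
        .put("session", new JSONObject()
            .put("input_audio_format", "pcm16")
            .put("input_audio_transcription", new JSONObject()
                .put("model", "gpt-4o-transcribe")
                .put("language", "en"))
            // Illustrative VAD settings; tune silence_duration_ms for your audio.
            .put("turn_detection", new JSONObject()
                .put("type", "server_vad")
                .put("prefix_padding_ms", 300)
                .put("silence_duration_ms", 500))
            .put("input_audio_noise_reduction", new JSONObject()
                .put("type", "near_field")));
    send(config.toString());
}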
Stream Audio Data in Chunks
The audio chunks are sent using the input_audio_buffer.append message type. When VAD is disabled, we must commit the buffered audio manually to receive the transcription result.
//SimpleOpenAITranscription
File audioFile = new File(args[0]);
AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(audioFile);
byte[] buffer = new byte[CHUNK_SIZE];
int bytesRead;
while ((bytesRead = audioInputStream.read(buffer)) != -1) {
    client.sendAudioChunk(buffer, bytesRead);
}
client.commitAndClearBuffer();

//TranscriptionWebSocketClient
public void sendAudioChunk(byte[] buffer, int bytesRead) {
    // Create a new byte array with only the read bytes
    byte[] audioData = Arrays.copyOf(buffer, bytesRead);
    String base64Audio = Base64.getEncoder().encodeToString(audioData);
    JSONObject audioMessage = new JSONObject()
        .put("type", "input_audio_buffer.append")
        .put("audio", base64Audio);
    send(audioMessage.toString());
}

public void commitAndClearBuffer() {
    send(new JSONObject().put("type", "input_audio_buffer.commit").toString());
    send(new JSONObject().put("type", "input_audio_buffer.clear").toString());
}
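The tips at the end of this article suggest pacing the sends and committing periodically when VAD is disabled. The sketch below is one way to adapt the loop above accordingly; the 50 ms delay and the 50-chunk commit interval are assumptions you should tune for your audio.
//SimpleOpenAITranscription (streaming loop with pacing and periodic commits)
byte[] buffer = new byte[CHUNK_SIZE];
int bytesRead;
int chunksSinceCommit = 0;
while ((bytesRead = audioInputStream.read(buffer)) != -1) {
    client.sendAudioChunk(buffer, bytesRead);
    chunksSinceCommit++;
    if (chunksSinceCommit >= 50) {  // commit periodically instead of once at the end
        client.commitAndClearBuffer();
        chunksSinceCommit = 0;
    }
    Thread.sleep(50);               // small delay to avoid overwhelming the connection
}
if (chunksSinceCommit > 0) {
    client.commitAndClearBuffer();  // flush whatever audio is left
}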
Receive and Process the Responses in Real-Time
The client receives messages from the server through the onMessage() method of the WebSocket. The transcript is present in the conversation.item.input_audio_transcription.completed message type. The application should listen for these events and display the transcription response to users.
//TranscriptionWebSocketClient
@Override
public void onMessage(String message) {
    System.out.println("message: " + message);
    JSONObject response = new JSONObject(message);
    if ("conversation.item.input_audio_transcription.completed".equals(response.get("type"))) {
        System.out.println("Transcription: " + response.getString("transcript"));
        this.close();
    }
}
Here are a few other response types that are helpful to track:
- transcription_session.created: The session is created.
- transcription_session.updated: The session is updated based on the config payload.
- input_audio_buffer.committed: The audio was committed by the client.
- conversation.item.input_audio_transcription.delta: Partial transcriptions received from the server.
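Building on the event list above, here is a fuller onMessage() sketch that also surfaces partial transcripts and error events. It assumes the delta event carries a delta field and the error event carries an error object, and unlike the minimal handler above it keeps the connection open so multiple commits can be transcribed.
//TranscriptionWebSocketClient (extended handler sketch)
@Override
public void onMessage(String message) {
    JSONObject response = new JSONObject(message);
    String type = response.getString("type");
    switch (type) {
        case "conversation.item.input_audio_transcription.delta":
            // Partial transcript; append it to the live caption display.
            System.out.print(response.optString("delta"));
            break;
        case "conversation.item.input_audio_transcription.completed":
            System.out.println("\nTranscription: " + response.getString("transcript"));
            break;
        case "error":
            // Surface the failure and stop streaming.
            System.err.println("Server error: " + response.optJSONObject("error"));
            this.close();
            break;
        default:
            // transcription_session.created/updated, input_audio_buffer.committed, etc.
            System.out.println("Event: " + type);
    }
}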
Tips to Improve Your Application
- For better tracking of the transcription session, create sessionCreatedLatch and streamLatch variables of type CountDownLatch. Use sessionCreatedLatch to hold off sending audio until the "transcription_session.created" event is received. Use streamLatch to keep the audio stream open until the client has sent all events and received all transcriptions back (see the sketch after this list).
- The Voice Activity Detection (VAD) feature automatically detects the start and end of speech turns. When using the server_vad mode of VAD, you might need to adjust silence_duration_ms if silences in your audio are incorrectly identified.
- Never hardcode your API key in your application. You could use AWS Secrets Manager to store the secrets.
- When VAD is disabled, commit and clear the audio buffer at periodic intervals to avoid exceeding the input buffer limit.
- Adding a small delay between events prevents overwhelming the connection. The OpenAI response contains gibberish text when the connection is overwhelmed.
- Close connections properly during errors to free up resources. The WebSocket connection times out after 30 minutes.
- As OpenAI's real-time audio transcription feature is actively developed, regularly check the official documentation for updates.
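Here is a sketch of the latch-based coordination described in the first tip; the timeout values are assumptions.
//SimpleOpenAITranscription (latch-based coordination sketch)
// Requires java.util.concurrent.CountDownLatch and java.util.concurrent.TimeUnit.
CountDownLatch sessionCreatedLatch = new CountDownLatch(1);
CountDownLatch streamLatch = new CountDownLatch(1);

// In onMessage(), release the latches when the matching events arrive:
//   if ("transcription_session.created".equals(type)) sessionCreatedLatch.countDown();
//   if ("conversation.item.input_audio_transcription.completed".equals(type)) streamLatch.countDown();

// In main(), wait for the session before sending config and audio, then wait for the result.
sessionCreatedLatch.await(10, TimeUnit.SECONDS);
// ... send config and stream audio chunks ...
streamLatch.await(60, TimeUnit.SECONDS);
client.close();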
By following this guide, you should now be able to integrate OpenAI's Realtime Transcription API into your Java applications.