Building a Real-Time Audio Transcription System With OpenAI’s Realtime API
Dive into real-time audio transcription with this hands-on guide, showing how to stream and transcribe audio in Java using OpenAI’s WebSocket API and GPT-4o models.
OpenAI launched two new speech-to-text models, gpt-4o-mini-transcribe and gpt-4o-transcribe, in March 2025. These models support streaming transcription for both completed and ongoing audio. Audio transcription is the process of converting audio input into text output (the output format is text or JSON). Transcribing already-completed audio is much simpler and uses the transcription API provided by OpenAI.
Real-time transcription is useful in applications that require immediate feedback, such as voice assistants, live captioning, interactive voice applications, meeting transcription, and accessibility tools. OpenAI provides a Realtime Transcription API (currently in beta) that lets you stream audio data and receive transcription results in real time. The Realtime API is invoked over WebSocket or WebRTC; this article focuses on invoking it with a Java WebSocket implementation.
What Are WebSockets
WebSockets are a bidirectional communication protocol suited to ongoing communication between a client and a server. This differs from HTTP, which follows a request-response model in which the client must submit a request to get a response from the service. A WebSocket creates a single, long-lived TCP connection that allows two-way communication between client and server. In real-time audio transcription, the client sends audio chunks as they become available, and the OpenAI API returns transcription results as it processes them.
WebSocket Methods
- onOpen(): Invoked when the WebSocket connection is established.
- onMessage(): Invoked when the client receives a message from the server. The core logic to detect errors and process the transcription response belongs here.
- onClose(): Invoked when the WebSocket connection is closed. The connection can be closed by either the client or the server.
- onError(): Invoked when the WebSocket encounters an error. The WebSocket always closes the connection after an error.
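To make these callbacks concrete, here is a minimal skeleton of a client built on the Java-WebSocket library that we add in the next section. The class name is only illustrative; the actual client used in this article is built later.
import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;
import java.net.URI;

// Illustrative skeleton showing where each lifecycle method fits.
public class LifecycleSketch extends WebSocketClient {

    public LifecycleSketch(URI serverUri) {
        super(serverUri);
    }

    @Override
    public void onOpen(ServerHandshake handshake) {
        // Connection established; it is now safe to start sending messages.
        System.out.println("Connected: " + handshake.getHttpStatusMessage());
    }

    @Override
    public void onMessage(String message) {
        // Every server event arrives here; parse it and route by type.
        System.out.println("Received: " + message);
    }

    @Override
    public void onClose(int code, String reason, boolean remote) {
        // remote == true means the server initiated the close.
        System.out.println("Closed (" + code + "): " + reason);
    }

    @Override
    public void onError(Exception ex) {
        // The connection is closed after an error is reported.
        ex.printStackTrace();
    }
}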
Implementation
In your Java project, add the Java-WebSocket dependency to your pom.xml file. Pick the latest stable version from the MVN Repository.
<!--WebSocket client -->
<dependency>
    <groupId>org.java-websocket</groupId>
    <artifactId>Java-WebSocket</artifactId>
    <version>1.6.0</version>
</dependency>
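The snippets later in this article also use JSONObject from the org.json library to build event payloads. If it is not already on your classpath, add it as well; the version shown is one recent release, so pick the latest from the MVN Repository.
<!--JSON payloads (org.json JSONObject used in the snippets below) -->
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20240303</version>
</dependency>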
Create two classes: SimpleOpenAITranscription and TranscriptionWebSocketClient. TranscriptionWebSocketClient will contain the WebSocket client implementation, and SimpleOpenAITranscription will contain the main method that coordinates audio streaming and transcription. In the code examples below, a comment indicates which class each snippet belongs to.
Establish a WebSocket connection to OpenAI's API
The OpenAI API uses API keys for authentication. To create an API key, log in to your OpenAI account and go to the organization settings. The connection is established by invoking client.connect() from the main method. The WebSocket URL is wss://api.openai.com/v1/realtime?intent=transcription.
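The snippets below also refer to a few constants that are not spelled out elsewhere in this article. One reasonable way to define them is shown here; the chunk size is an assumption rather than a prescribed value, and the API key is read from an environment variable instead of being hardcoded.
//SimpleOpenAITranscription
// Assumed constants used throughout the snippets below.
private static final String API_KEY = System.getenv("OPENAI_API_KEY"); // never hardcode the key
private static final String WEBSOCKET_URL = "wss://api.openai.com/v1/realtime?intent=transcription";
private static final int CHUNK_SIZE = 8192; // bytes of audio per append message (illustrative)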
//TranscriptionWebSocketClient
// Set the request headers
private static class TranscriptionWebSocketClient extends WebSocketClient {

    public TranscriptionWebSocketClient(URI serverUri) {
        super(serverUri, createHeaders());
    }

    private static Map<String, String> createHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Authorization", "Bearer " + API_KEY);
        headers.put("openai-beta", "realtime=v1");
        return headers;
    }
}
//SimpleOpenAITranscription
TranscriptionWebSocketClient client = new TranscriptionWebSocketClient(new URI(WEBSOCKET_URL));
client.connect();

// Wait until the WebSocket connection is established
while (!client.isOpen()) {
    System.out.println("Websocket is not open.");
    Thread.sleep(100);
}
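The polling loop above works, but the Java-WebSocket library also provides connectBlocking(), which waits for the handshake to finish before returning. A minimal alternative sketch:
//SimpleOpenAITranscription (alternative to the polling loop)
TranscriptionWebSocketClient client = new TranscriptionWebSocketClient(new URI(WEBSOCKET_URL));
// connectBlocking() returns true once the handshake succeeds, false otherwise.
if (!client.connectBlocking()) {
    throw new IllegalStateException("Could not open the WebSocket connection");
}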
Set Up Transcription Session
A config message is required when creating the transcription session. The model name in the config can be either gpt-4o-transcribe or gpt-4o-mini-transcribe, based on your application's requirements. The language (e.g., "en", "fr", "ko") is an optional field, but specifying it improves accuracy. The turn_detection field is used to set up Voice Activity Detection (VAD). When VAD is enabled, OpenAI automatically detects silence in the audio and commits the audio message; once committed, the server responds with the transcription result. I have disabled VAD for simplicity.
// SimpleOpenAITranscription
client.sendInitialConfig();
//TranscriptionWebSocketClient
public void sendInitialConfig() {
    JSONObject config = new JSONObject()
        .put("type", "transcription_session.update")
        .put("session", new JSONObject()
            .put("input_audio_format", "pcm16")
            .put("input_audio_transcription", new JSONObject()
                .put("model", "gpt-4o-transcribe")
                .put("language", "en"))
            .put("turn_detection", JSONObject.NULL)
            .put("input_audio_noise_reduction", new JSONObject()
                .put("type", "near_field")));
    send(config.toString());
}
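If you prefer to let the server commit audio automatically, replace the turn_detection value with a server_vad object instead of JSONObject.NULL. The sketch below is a variant of the method above; the numeric values are illustrative assumptions, and you may need to tune silence_duration_ms if silences in your audio are misdetected (see the tips at the end of this article).
//TranscriptionWebSocketClient (variant with server-side VAD enabled)
public void sendInitialConfigWithVad() {
    JSONObject config = new JSONObject()
        .put("type", "transcription_session.update")
        .put("session", new JSONObject()
            .put("input_audio_format", "pcm16")
            .put("input_audio_transcription", new JSONObject()
                .put("model", "gpt-4o-transcribe")
                .put("language", "en"))
            // Illustrative VAD settings; tune silence_duration_ms for your audio.
            .put("turn_detection", new JSONObject()
                .put("type", "server_vad")
                .put("prefix_padding_ms", 300)
                .put("silence_duration_ms", 500))
            .put("input_audio_noise_reduction", new JSONObject()
                .put("type", "near_field")));
    send(config.toString());
}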
Stream Audio Data in Chunks
The audio chunks are sent using the input_audio_buffer.append message type. When VAD is disabled, we must commit the buffered audio manually to receive the transcription result.
//SimpleOpenAITranscription
File audioFile = new File(args[0]);
AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(audioFile);
byte[] buffer = new byte[CHUNK_SIZE];
int bytesRead;
while ((bytesRead = audioInputStream.read(buffer)) != -1) {
    client.sendAudioChunk(buffer, bytesRead);
}
client.commitAndClearBuffer();

//TranscriptionWebSocketClient
public void sendAudioChunk(byte[] buffer, int bytesRead) {
    // Create a new byte array with only the read bytes
    byte[] audioData = Arrays.copyOf(buffer, bytesRead);
    String base64Audio = Base64.getEncoder().encodeToString(audioData);
    JSONObject audioMessage = new JSONObject()
        .put("type", "input_audio_buffer.append")
        .put("audio", base64Audio);
    send(audioMessage.toString());
}

public void commitAndClearBuffer() {
    send(new JSONObject().put("type", "input_audio_buffer.commit").toString());
    send(new JSONObject().put("type", "input_audio_buffer.clear").toString());
}
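The tips at the end of this article suggest pacing the sends and committing periodically when VAD is disabled. The sketch below is one way to adapt the loop above accordingly; the 50 ms delay and the 50-chunk commit interval are assumptions you should tune for your audio.
//SimpleOpenAITranscription (streaming loop with pacing and periodic commits)
byte[] buffer = new byte[CHUNK_SIZE];
int bytesRead;
int chunksSinceCommit = 0;
while ((bytesRead = audioInputStream.read(buffer)) != -1) {
    client.sendAudioChunk(buffer, bytesRead);
    chunksSinceCommit++;
    if (chunksSinceCommit >= 50) {  // commit periodically instead of once at the end
        client.commitAndClearBuffer();
        chunksSinceCommit = 0;
    }
    Thread.sleep(50);               // small delay to avoid overwhelming the connection
}
if (chunksSinceCommit > 0) {
    client.commitAndClearBuffer();  // flush whatever audio is left
}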
Receive and Process the Responses in Real-Time
The client receives messages from the server through the onMessage() method of the WebSocket. The transcript is present in the conversation.item.input_audio_transcription.completed message type. The application should listen for these events and display the transcription response to users.
//TranscriptionWebSocketClient
@Override
public void onMessage(String message) {
    System.out.println("message: " + message);
    JSONObject response = new JSONObject(message);
    if ("conversation.item.input_audio_transcription.completed".equals(response.get("type"))) {
        System.out.println("Transcription: " + response.getString("transcript"));
        this.close();
    }
}
Here are a few other response types that are helpful to track:
- transcription_session.created: The session is created.
- transcription_session.updated: The session is updated based on the config payload.
- input_audio_buffer.committed: The audio was committed by the client.
- conversation.item.input_audio_transcription.delta: Partial transcriptions received from the server.
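Building on the event list above, here is a fuller onMessage() sketch that also surfaces partial transcripts and error events. It assumes the delta event carries a delta field and the error event carries an error object, and unlike the minimal handler above it keeps the connection open so multiple commits can be transcribed.
//TranscriptionWebSocketClient (extended handler sketch)
@Override
public void onMessage(String message) {
    JSONObject response = new JSONObject(message);
    String type = response.getString("type");
    switch (type) {
        case "conversation.item.input_audio_transcription.delta":
            // Partial transcript; append it to the live caption display.
            System.out.print(response.optString("delta"));
            break;
        case "conversation.item.input_audio_transcription.completed":
            System.out.println("\nTranscription: " + response.getString("transcript"));
            break;
        case "error":
            // Surface the failure and stop streaming.
            System.err.println("Server error: " + response.optJSONObject("error"));
            this.close();
            break;
        default:
            // transcription_session.created/updated, input_audio_buffer.committed, etc.
            System.out.println("Event: " + type);
    }
}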
Tips to Improve Your Application
- For better tracking of the transcription session, create sessionCreatedLatch and streamLatch variables of type CountDownLatch. Use sessionCreatedLatch to hold off sending audio until the "transcription_session.created" event is received. Use streamLatch to keep the audio stream open until the client has sent all events and received all transcriptions back (see the sketch after this list).
- The Voice Activity Detection (VAD) feature automatically detects the start and end of speech turns. When using the server_vad mode of VAD, you might need to adjust silence_duration_ms if silences in your audio are incorrectly identified.
- Never hardcode your API key in your application. You could use AWS Secrets Manager to store the secrets.
- When VAD is disabled, commit and clear the audio buffer at periodic intervals to avoid exceeding the input buffer limit.
- Adding a small delay between events prevents overwhelming the connection. The OpenAI response contains gibberish text when the connection is overwhelmed.
- Close connections properly during errors to free up resources. The WebSocket connection times out after 30 minutes.
- As OpenAI's real-time audio transcription feature is actively developed, regularly check the official documentation for updates.
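Here is a sketch of the latch-based coordination described in the first tip; the timeout values are assumptions.
//SimpleOpenAITranscription (latch-based coordination sketch)
// Requires java.util.concurrent.CountDownLatch and java.util.concurrent.TimeUnit.
CountDownLatch sessionCreatedLatch = new CountDownLatch(1);
CountDownLatch streamLatch = new CountDownLatch(1);

// In onMessage(), release the latches when the matching events arrive:
//   if ("transcription_session.created".equals(type)) sessionCreatedLatch.countDown();
//   if ("conversation.item.input_audio_transcription.completed".equals(type)) streamLatch.countDown();

// In main(), wait for the session before sending config and audio, then wait for the result.
sessionCreatedLatch.await(10, TimeUnit.SECONDS);
// ... send config and stream audio chunks ...
streamLatch.await(60, TimeUnit.SECONDS);
client.close();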
By following this guide, you should now be able to integrate OpenAI's Realtime Transcription API into your Java applications.