Voice AI has always had a latency problem. Traditional pipelines—speech-to-text, LLM processing, text-to-speech—stack delays that make conversations feel robotic. Users wait 2-3 seconds for responses. It kills the magic.
OpenAI's Realtime API changes everything. We're talking sub-200ms response times. Bidirectional audio streaming. Real conversations with AI.
Here's how I built it.
The Architecture
```
Twilio (Phone)  <->  WebSocket Server      <->  OpenAI Realtime API
      ↓                      ↓                           ↓
  PSTN Audio        Media Stream Bridge          GPT-4o Realtime
   (μ-law)               (Base64)                  (PCM 24kHz)
```
The key insight: no transcription step. Audio goes directly to the model, and audio comes directly back. The model "hears" and "speaks" natively.
Twilio Media Streams
When someone calls your Twilio number, you respond with TwiML that opens a WebSocket:
```xml
<Response>
  <Connect>
    <Stream url="wss://your-server.com/media-stream">
      <Parameter name="callerNumber" value="{From}"/>
    </Stream>
  </Connect>
</Response>
```
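How you serve that TwiML is up to your stack. Here's a minimal sketch assuming Express, with a hypothetical `/incoming-call` webhook route and port (Twilio posts the caller's number as `From` in the form-encoded body):

```javascript
const express = require('express');
const app = express();
app.use(express.urlencoded({ extended: false }));

// Point your Twilio number's voice webhook at this route (route name is a placeholder).
app.post('/incoming-call', (req, res) => {
  res.type('text/xml').send(`
    <Response>
      <Connect>
        <Stream url="wss://your-server.com/media-stream">
          <Parameter name="callerNumber" value="${req.body.From}"/>
        </Stream>
      </Connect>
    </Response>
  `);
});

app.listen(3000);
```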
Twilio sends audio chunks as base64-encoded μ-law (8kHz). You'll need to transcode to PCM 24kHz for OpenAI.
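Here's a rough sketch of that transcoding step (the `transcode` helper called in the bridge code below; the helper names are mine). It decodes G.711 μ-law to 16-bit PCM and naively upsamples 8 kHz → 24 kHz by repeating each sample three times; a production bridge would use a proper resampler.

```javascript
// Decode one G.711 mu-law byte to a 16-bit linear PCM sample.
function mulawToPcm16(mulawByte) {
  const BIAS = 0x84;
  const u = ~mulawByte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -sample : sample;
}

// base64 mu-law (8kHz, from Twilio) -> base64 PCM16 (24kHz, for OpenAI).
// Upsampling is nearest-neighbor (each sample repeated 3x) to keep the sketch short.
function transcode(base64Mulaw) {
  const mulaw = Buffer.from(base64Mulaw, 'base64');
  const pcm = Buffer.alloc(mulaw.length * 3 * 2); // 3x the samples, 2 bytes each
  for (let i = 0; i < mulaw.length; i++) {
    const sample = mulawToPcm16(mulaw[i]);
    for (let j = 0; j < 3; j++) {
      pcm.writeInt16LE(sample, (i * 3 + j) * 2);
    }
  }
  return pcm.toString('base64');
}
```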
The WebSocket Bridge
Your server maintains two WebSocket connections:
- Twilio → Your Server: Receives caller audio
- Your Server → OpenAI: Sends/receives audio from the model
```javascript
// Simplified flow
twilioWs.on('message', (data) => {
  const { event, media } = JSON.parse(data);
  if (event === 'media') {
    const pcmAudio = transcode(media.payload);
    openaiWs.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: pcmAudio
    }));
  }
});
```
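The return path (model → caller) is the mirror image. A sketch assuming the beta `response.audio.delta` server event and a hypothetical `pcmToMulaw` helper that takes 24kHz PCM back down to 8kHz μ-law; `streamSid` comes from Twilio's `start` event:

```javascript
openaiWs.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'response.audio.delta') {
    // event.delta is base64 PCM16 at 24kHz; Twilio expects base64 mu-law at 8kHz.
    twilioWs.send(JSON.stringify({
      event: 'media',
      streamSid,
      media: { payload: pcmToMulaw(event.delta) }
    }));
  }
});
```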
Server VAD: The Secret Sauce
`turn_detection: { type: 'server_vad' }` enables Voice Activity Detection on OpenAI's side. The model automatically detects when the user stops speaking and begins responding.
This is crucial for natural conversation flow.
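You enable it when configuring the session. A sketch using the beta `session.update` event shape (field names and values here follow the beta Realtime API and may differ in later versions):

```javascript
openaiWs.on('open', () => {
  openaiWs.send(JSON.stringify({
    type: 'session.update',
    session: {
      turn_detection: { type: 'server_vad' },  // OpenAI decides when the caller is done talking
      input_audio_format: 'pcm16',             // the 24kHz PCM we transcode to
      output_audio_format: 'pcm16',
      voice: 'alloy',
      instructions: 'You are a helpful assistant answering a phone call.'
    }
  }));
});
```

Server VAD also takes tuning knobs like `threshold` and `silence_duration_ms` if the defaults cut callers off too early or wait too long.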
Latency Breakdown
| Component | Time |
|---|---|
| Twilio → Server | ~50ms |
| Server → OpenAI | ~30ms |
| Model Processing | ~80ms |
| OpenAI → Server | ~30ms |
| Server → Twilio | ~50ms |
| Total | ~240ms |
Compare this to traditional pipelines (2-3 seconds) and it's night and day.
Production Considerations
- Audio Format Hell: Twilio uses μ-law 8kHz. OpenAI wants PCM 24kHz.
- WebSocket Lifecycle: Handle disconnections gracefully.
- Costs: Realtime API pricing is per-minute of audio.
- Interruptions: Handle `response.cancelled` events (see the sketch below the list).
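For interruptions specifically, here's a rough sketch assuming the beta `input_audio_buffer.speech_started` server event and Twilio's `clear` message: when the caller talks over the assistant, cancel the in-flight response and flush the audio Twilio has already buffered.

```javascript
openaiWs.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'input_audio_buffer.speech_started') {
    // Caller started speaking: stop the model's current response...
    openaiWs.send(JSON.stringify({ type: 'response.cancel' }));
    // ...and drop any audio Twilio has queued but not yet played to the caller.
    twilioWs.send(JSON.stringify({ event: 'clear', streamSid }));
  }
});
```

In a real bridge these handlers all live in one `openaiWs.on('message')` switch; they're split up here for readability.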
The voice assistant feels genuinely conversational. This is the future of voice interfaces.
Originally published at ryancwynar.com