Voice and Audio AI Models
Real-time voice conversation has crossed the uncanny valley, and voice cloning now takes only seconds of audio. The current state and what the platforms ship.
The two architectures
Older systems: speech-to-text + LLM + text-to-speech. Three components, so their latencies stack. Newer end-to-end voice models: audio in, audio out, a single model with no intermediate text.
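For concreteness, here is a minimal sketch of the pipeline approach using OpenAI's hosted models. The file names and model choices (whisper-1, gpt-4o-mini, tts-1) are illustrative, not a recommendation; the point is the three separate network hops.

```python
# Minimal STT -> LLM -> TTS pipeline. Each stage is its own network
# round trip, which is why the latencies stack.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech-to-text: transcribe the user's recorded turn.
with open("user_turn.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. LLM: draft a reply from the transcript text.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Text-to-speech: synthesize the reply as audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)
```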
Speech-to-text
OpenAI Whisper (open-weight, multilingual, free), Deepgram, AssemblyAI. Word error rates under 5% on clean English. Streaming variants achieve sub-300ms latency.
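Because Whisper's weights are open, transcription can also run entirely on your own machine. A minimal sketch with the openai-whisper package (the audio file name is illustrative):

```python
# Local transcription with open-weight Whisper.
# Requires: pip install openai-whisper, plus ffmpeg on the PATH.
import whisper

model = whisper.load_model("base")        # small and fast; "large-v3" is more accurate
result = model.transcribe("meeting.wav")  # file name is illustrative
print(result["text"])
```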
Text-to-speech
ElevenLabs (commercial, best quality), OpenAI TTS, Google’s WaveNet successors. Voice cloning from 30 seconds of audio is in production, and to casual listeners the result is indistinguishable from the source.
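As a sketch of what calling one of these services looks like, here is a request to ElevenLabs' text-to-speech REST endpoint. The voice ID is a placeholder, and the request shape reflects the documentation at the time of writing, so check the current docs before relying on it.

```python
# Synthesize speech with a chosen (or consensually cloned) ElevenLabs
# voice via the REST API. VOICE_ID is a placeholder, not a real voice.
import os
import requests

VOICE_ID = "YOUR_VOICE_ID"  # a stock voice or one cloned from your own samples
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "Hello from a synthetic voice.",
        "model_id": "eleven_multilingual_v2",
    },
    timeout=30,
)
resp.raise_for_status()
with open("speech.mp3", "wb") as f:
    f.write(resp.content)  # response body is the audio itself
```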
End-to-end voice
OpenAI Realtime API, Gemini Live, Sesame, Hume. Audio in, audio out, latency under 300 ms. These models preserve prosody, emotion, and turn-taking cues that pipeline systems lose.
The user-experience leap is real: end-to-end voice agents feel like phone calls; pipeline ones feel like radio plays.
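A minimal sketch of one conversational turn over one such interface, OpenAI's Realtime API: a single WebSocket carries audio both ways, with no separate STT or TTS hop. Event names follow the beta documentation and may change; the input is assumed to be base64-encoded 16-bit PCM at the session's expected sample rate (24 kHz mono by default).

```python
# One end-to-end voice turn over OpenAI's Realtime API (beta).
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def one_turn(pcm16_audio: bytes) -> bytes:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    out = bytearray()
    # Older websockets releases name this kwarg extra_headers.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Stream the user's audio into the input buffer, then commit it.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))
        # Collect the model's audio deltas until the response completes.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                out.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break
    return bytes(out)  # raw PCM16, ready to play or write to a WAV file
```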
Safety
Voice cloning makes social-engineering attacks easier. The 2026 mitigations: content credentials baked into generated audio, watermark-detection APIs, and platform-side limits on cloning a user’s voice without consent. None is bulletproof; all raise the cost of casual misuse.
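A hypothetical sketch of what the platform-side gate might look like. The two helper functions are placeholders standing in for vendor-specific watermark and consent checks, not real library calls.

```python
# Hypothetical platform-side gate on voice-cloning requests.
# detect_watermark() and has_consent_record() are illustrative stubs.

def detect_watermark(sample: bytes) -> bool:
    """Placeholder: call an audio-watermark detection API here."""
    return False

def has_consent_record(voice_owner_id: str) -> bool:
    """Placeholder: look up a signed consent record for the voice owner."""
    return True

def allow_clone(sample: bytes, voice_owner_id: str) -> tuple[bool, str]:
    if detect_watermark(sample):
        return False, "sample appears to be AI-generated audio"
    if not has_consent_record(voice_owner_id):
        return False, "no consent on file for this voice"
    return True, "ok"

print(allow_clone(b"\x00\x00", "user-123"))  # -> (True, 'ok') with these stubs
```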