We’re introducing three audio models in the API
Summary
OpenAI has launched three real-time audio models in the API, including GPT-Realtime-Translate, a real-time translation model that supports 70 languages, and GPT-Realtime-2, a voice agent model with reasoning capabilities, enabling developers to build more natural voice interfaces.
Similar Articles
Advancing voice intelligence with new models in the API
OpenAI has announced three new voice models in its API: GPT-Realtime-2 with advanced reasoning, GPT-Realtime-Translate for live multilingual translation, and GPT-Realtime-Whisper for streaming transcription, aiming to enable more natural and action-oriented voice applications.
@seclink: OpenAI Launches GPT-Realtime-2, Its Most Intelligent Voice Model to Date. The model features GPT-5-level reasoning, a 128,000 token context window, and supports adjusting 'effort level' for more natural conversation. It can pair with GPT-R…
OpenAI released the GPT-Realtime-2 voice model, featuring GPT-5-level reasoning capabilities and a 128,000-token context window. Paired with translation, it supports real-time translation from over 70 input languages into 13 output languages, and it achieves 96.6% accuracy on the Big Bench Audio Intelligence benchmark. Greg Brockman called it a milestone in voice translation.
Introducing next-generation audio models in the API
OpenAI introduced next-generation audio models for the API, including improved speech-to-text (gpt-4o-transcribe, gpt-4o-mini-transcribe) and customizable text-to-speech models that enable developers to build more intelligent and expressive voice agents with enhanced accuracy across challenging scenarios.
OpenAI's New Voice Models Want to Do More Than Talk Back
OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.
Introducing the Realtime API
OpenAI introduces the Realtime API, enabling developers to build low-latency multimodal speech-to-speech conversational experiences with natural voice interactions powered by GPT-4o. The API supports six preset voices and simplifies development by eliminating the need to integrate multiple models.