We’re introducing three audio models in the API


Summary

OpenAI has launched three real-time audio models in the API, including GPT Realtime Translate, a real-time translation model that supports 70 languages, and GPT Realtime 2, a voice-agent model with reasoning capabilities, enabling developers to build more natural voice interaction interfaces.


TL;DR: OpenAI has introduced three real-time audio models in the API, including GPT Realtime Translate for real-time translation (supporting 70 languages) and GPT Realtime 2 for voice agents (featuring reasoning and parallel tool calling).

## New Audio Model Overview

The release demo for the new real-time audio models in the OpenAI API showcases two core capabilities: real-time translation and voice agents. Speech is processed in real time and the demo is unedited, with audio captured directly from a laptop and the transcribed text shown alongside.

## GPT Realtime Translate: Real-Time Translation

The presenter first speaks in French, and the model listens and translates into English in real time. Key features include:

- **Real-time following**: The model waits for key words (e.g., verbs) while the speaker is talking, then begins translating immediately, creating a natural conversational rhythm.
- **Multi-language switching**: When the demo switches from French to German, the model tracks the change seamlessly and transitions smoothly between the two languages.
- **Technical term handling**: Professional terms such as “GPT real time”, “OpenAI”, and “computer use” are handled effortlessly.
- **Support for 70 languages**: The model can translate 70 different languages in real time, adapting to the inflection of each sentence.

Use cases include media platforms, customer support, and educational tools, with the aim of breaking down language barriers.

## GPT Realtime 2: Intelligent Reasoning for Voice Agents

The new model GPT Realtime 2 brings reasoning capabilities to voice agents. The demo uses a personal voice assistant to perform tasks.

### Schedule Query and Parallel Tool Calling

The user asks: “I have a customer meeting coming up. Can you check my schedule?” The model replies: “You have a meeting with Sable Crust Robotics in 12 minutes, with their CTO Alex Kim.” Because the model has reasoning and parallel tool calling capabilities, it is important to use preambles so the model can explain what it is doing and keep the user informed.

### Maintaining Conversational Continuity and Confirmation Mechanism

Executing an operation takes a few seconds. During reasoning and tool calling, the model keeps talking with the user, so the user is always aware of progress. The voice agent also stays engaged in the conversation: in the demo, the model keeps listening but does not interrupt until the user says “back to the demo.”

### CRM Update Example

The user says: “Hey, can you help me update the CRM? Mark today’s meeting as brief and add next steps.” The model responds: “Let me get the latest context and update your CRM. Sablerest released a warehouse automation solution this morning. Expansion plans are underway, and security review is the bottleneck.” The model then confirms the task is complete.

### Connecting to External Systems

The model can connect to any system, including dashboards, services, and connected devices. With voice as the primary interface, it can maintain a fluid conversation while thinking in the background.

## Model Capabilities Summary

- **Real-time translation**: Supports 70 languages with a natural conversational rhythm.
- **Reasoning and tool calling**: Communicates with the user while it thinks and executes actions in parallel.
- **Context retention**: Continuously listens without interrupting until the user indicates it should respond.
- **System integration**: Can connect to external products and services.

Hedged code sketches of the translation and voice-agent patterns described above follow below.
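The translation demo maps naturally onto a streaming session. Below is a minimal sketch, assuming the new models are exposed through the existing Realtime API WebSocket interface; the model name `gpt-realtime-translate` and the `translate_microphone_audio` helper are illustrative assumptions, not confirmed API details.

```python
# Hedged sketch: streaming speech to the Realtime API for live English translation.
# The model name "gpt-realtime-translate" is a placeholder; check the models list
# for the exact identifier before using this.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

MODEL = "gpt-realtime-translate"  # placeholder, not a confirmed model string
URL = f"wss://api.openai.com/v1/realtime?model={MODEL}"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}


async def translate_microphone_audio(pcm_chunks):
    """Send 16-bit PCM chunks and print the translated transcript as it streams."""
    # extra_headers is the long-standing websockets keyword; newer releases
    # accept additional_headers instead.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Configure the session: audio in, audio + text out, translate to English.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "Translate everything you hear into English.",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
            },
        }))

        async def send_audio():
            for chunk in pcm_chunks:  # e.g. frames captured from the laptop mic
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        async def read_events():
            async for raw in ws:  # runs until the connection closes
                event = json.loads(raw)
                # Print the English transcript as it streams back.
                if event.get("type") == "response.audio_transcript.delta":
                    print(event["delta"], end="", flush=True)

        await asyncio.gather(send_audio(), read_events())
```

The voice-agent behaviour described for GPT Realtime 2 (preambles, tool calling, returning results to the conversation) could be wired up along similar lines. Again a hedged sketch: the model name `gpt-realtime-2` and the `check_schedule` tool are placeholders for whatever calendar or CRM integration an application actually uses, and the audio capture path is omitted (see the translation sketch above).

```python
# Hedged sketch: a voice-agent session that registers a calendar-lookup tool.
# "gpt-realtime-2" is a placeholder model name; check_schedule is a hypothetical
# stand-in for the CRM/calendar systems shown in the demo.
import asyncio
import json
import os

import websockets

MODEL = "gpt-realtime-2"  # placeholder, not a confirmed model string
URL = f"wss://api.openai.com/v1/realtime?model={MODEL}"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

SCHEDULE_TOOL = {
    "type": "function",
    "name": "check_schedule",
    "description": "Return the user's upcoming meetings.",
    "parameters": {
        "type": "object",
        "properties": {"window_minutes": {"type": "integer"}},
        "required": ["window_minutes"],
    },
}


def check_schedule(window_minutes: int) -> str:
    # Stand-in for a real calendar integration.
    return "Meeting with Sable Crust Robotics in 12 minutes (CTO: Alex Kim)."


async def run_voice_agent():
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Register the tool and ask the model to use spoken preambles.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": (
                    "You are a personal assistant. Before calling a tool, say a "
                    "short preamble so the user knows what you are doing."
                ),
                "tools": [SCHEDULE_TOOL],
                "tool_choice": "auto",
            },
        }))

        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.function_call_arguments.done":
                args = json.loads(event["arguments"])
                result = check_schedule(**args)
                # Hand the tool result back so the model can speak the answer.
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": event["call_id"],
                        "output": result,
                    },
                }))
                await ws.send(json.dumps({"type": "response.create"}))


asyncio.run(run_voice_agent())
```

Both sketches rely on the Realtime API's default server-side turn detection; a production agent would also need to play back the `response.audio.delta` chunks it receives so the user actually hears the reply.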
## Conclusion

These new real-time audio models are now available in the OpenAI API, enabling developers to build more natural voice interaction interfaces. OpenAI looks forward to seeing the community create more applications with these models.

Source: We’re introducing three audio models in the API – OpenAI (https://www.youtube.com/watch?v=JOu8v6CBjkE)

Similar Articles

Advancing voice intelligence with new models in the API

OpenAI Blog

OpenAI has announced three new voice models in its API: GPT-Realtime-2 with advanced reasoning, GPT-Realtime-Translate for live multilingual translation, and GPT-Realtime-Whisper for streaming transcription, aiming to enable more natural and action-oriented voice applications.

@seclink: OpenAI Launches GPT-Realtime-2, Its Most Intelligent Voice Model to Date. The model features GPT-5-level reasoning, a 128,000 token context window, and supports adjusting 'effort level' for more natural conversation. It can pair with GPT-R…

X AI KOLs Following

OpenAI released the GPT-Realtime-2 voice model, featuring GPT-5-level reasoning capabilities and a 128,000 token context window. It supports real-time translation from over 70 input languages to 13 output languages, achieving 96.6% accuracy on the Big Bench Audio Intelligence benchmark. Greg Brockman called it a milestone in voice translation.

Introducing next-generation audio models in the API

OpenAI Blog

OpenAI introduced next-generation audio models for the API, including improved speech-to-text (gpt-4o-transcribe, gpt-4o-mini-transcribe) and customizable text-to-speech models that enable developers to build more intelligent and expressive voice agents with enhanced accuracy across challenging scenarios.

Introducing the Realtime API

OpenAI Blog

OpenAI introduces the Realtime API, enabling developers to build low-latency multimodal speech-to-speech conversational experiences with natural voice interactions powered by GPT-4o. The API supports six preset voices and simplifies development by eliminating the need to integrate multiple models.