OpenAI's New Voice Models Want to Do More Than Talk Back

Reddit r/ArtificialInteligence Models

Summary

OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.

Cached at: 05/08/26, 01:31 PM

# OpenAI's New Voice Models Want to Do More Than Talk Back - Firethering

Source: [https://firethering.com/openai-new-voice-models-realtime-api/](https://firethering.com/openai-new-voice-models-realtime-api/)

OpenAI is pushing deeper into voice. The company just launched three new realtime audio models in its API: GPT-Realtime-2 for conversational reasoning, GPT-Realtime-Translate for live multilingual translation, and GPT-Realtime-Whisper for streaming speech transcription.

GPT-Realtime-2 can now handle longer conversations, recover from interruptions more naturally, use tools while someone is still talking, and respond with different reasoning levels depending on the task. OpenAI says the model is designed for things like customer support, scheduling, travel assistance, and other workflows where the AI actually has to keep track of context instead of just replying quickly.

OpenAI is no longer treating voice as a side feature attached to chatbots. It’s starting to position voice as the interface itself. That means live translation during conversations, realtime transcription while meetings are still happening, and AI agents that can check your calendar, pull information from apps, or complete actions while the conversation keeps moving.
## **Table of Contents**

- [Voice models are starting to behave more like agents](https://firethering.com/openai-new-voice-models-realtime-api/#voice-models-are-starting-to-behave-more-like-agents)
- [Voice changes how people use software](https://firethering.com/openai-new-voice-models-realtime-api/#voice-changes-how-people-use-software)
- [GPT-Realtime-Translate may end up being the sleeper feature](https://firethering.com/openai-new-voice-models-realtime-api/#gpt-realtime-translate-may-end-up-being-the-sleeper-feature)
- [Pricing and availability](https://firethering.com/openai-new-voice-models-realtime-api/#pricing-and-availability)

## **Voice models are starting to behave more like agents**

[Image: Three audio models in the OpenAI API]

The most interesting part of this launch is not the voices themselves. It’s the fact that OpenAI keeps framing these systems around actions and workflows instead of conversations.

The company highlighted examples like Zillow building voice agents that can search for homes and schedule tours, Deutsche Telekom testing multilingual customer support, and Priceline exploring trip planning that happens conversationally from start to finish.

That points to a shift happening across AI right now. Voice assistants used to exist mainly to answer questions. These new systems are being designed to stay active while tasks are unfolding: checking calendars, updating bookings, pulling information from apps, translating conversations live, or handling interruptions without restarting the interaction.

That’s also why OpenAI focused heavily on realtime reasoning and tool use in this launch. A voice assistant that simply sounds natural is no longer enough. The hard part is making the system useful while the conversation is still moving.

## **Voice changes how people use software**

Typing naturally creates pauses. People send a prompt, wait for a response, then move on. Voice interactions work differently.
Conversations keep moving even when requests change halfway through or multiple things happen at once.

That creates a much harder problem for AI systems. The model has to listen continuously, decide when to respond, remember context across longer sessions, and sometimes use tools without interrupting the flow of the conversation itself.

And that’s probably why companies like OpenAI are suddenly investing so heavily in realtime infrastructure.

## **GPT-Realtime-Translate may end up being the sleeper feature**

The reasoning upgrades will get most of the attention, but the realtime translation model could end up having the bigger commercial impact.

OpenAI says GPT-Realtime-Translate can handle more than 70 input languages and translate into 13 output languages while keeping pace with live conversations. That opens the door for customer support, meetings, events, travel assistance, and sales calls where people no longer need to speak the same language fluently to communicate smoothly.

And unlike older translation systems, OpenAI is clearly pushing for conversations that continue naturally while the translation happens in the background.

The company also highlighted testing from BolnaAI, which said the model handled regional Indian languages like Hindi, Tamil, and Telugu with lower word error rates and fewer fallback failures than the other systems it tested.

Vimeo is experimenting with the model as well. The company says it’s using GPT-Realtime-Translate for live translation during broadcasts so creators can reach global audiences while streaming in real time. According to Vimeo, one of the biggest improvements was how well the system handled multilingual conversations without breaking flow mid-interaction.
Multilingual voice AI becomes much more useful once it starts handling accents, interruptions, and regional speech patterns reliably in real time.

## **Pricing and availability**

All three models are available through OpenAI’s Realtime API. GPT-Realtime-2 is priced at $32 per million audio input tokens and $64 per million audio output tokens, while GPT-Realtime-Translate costs $0.034 per minute and GPT-Realtime-Whisper costs $0.017 per minute.

Developers can also test the models through OpenAI’s Playground before integrating them into apps and workflows.
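
To make the quoted prices concrete, here is a rough cost sketch in Python. The four rates come from the pricing above. Everything else is an assumption for illustration: the function names are hypothetical, and the token counts in the example are made up, since the article does not say how many audio tokens a minute of speech consumes.

```python
from decimal import Decimal

# Prices quoted in the article, in USD.
REALTIME_2_INPUT_PER_MILLION = Decimal("32")    # per 1M audio input tokens
REALTIME_2_OUTPUT_PER_MILLION = Decimal("64")   # per 1M audio output tokens
TRANSLATE_PER_MINUTE = Decimal("0.034")
WHISPER_PER_MINUTE = Decimal("0.017")

MILLION = Decimal(1_000_000)


def realtime_2_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate GPT-Realtime-2 cost from audio token counts (hypothetical helper)."""
    return (
        Decimal(input_tokens) * REALTIME_2_INPUT_PER_MILLION
        + Decimal(output_tokens) * REALTIME_2_OUTPUT_PER_MILLION
    ) / MILLION


def per_minute_cost(minutes: float, rate_per_minute: Decimal) -> Decimal:
    """Estimate cost for the per-minute models (hypothetical helper)."""
    # Go through str() so a float like 1.5 becomes the exact decimal "1.5".
    return Decimal(str(minutes)) * rate_per_minute


if __name__ == "__main__":
    # Illustrative token counts only; real usage depends on the session.
    print(f"Realtime-2, 500k in / 250k out: ${realtime_2_cost(500_000, 250_000):.2f}")
    print(f"Translate, 60 minutes: ${per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")
    print(f"Whisper, 60 minutes: ${per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")
```

At these rates, an hour of live translation comes to about $2.04 and an hour of streaming transcription to about $1.02. GPT-Realtime-2 costs scale with token volume, so they cannot be estimated from minutes alone.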
