@miramurati: Today we're sharing our work on interaction models. A new class of model trained from scratch to handle real-time inter…

X AI KOLs Following 05/11/26, 08:43 PM Models

Summary

Mira Murati's team showcased a preview of the new interaction model. Trained from scratch, it natively supports full-duplex real-time audio and video conversations, instant interruptions, multi-language translation, and dynamic multi-tasking. The demonstration verified its core capabilities in low-latency streaming interaction, multimodal perception, and concurrent task execution.

Today we're sharing our work on interaction models. A new class of model trained from scratch to handle real-time interaction natively, instead of gluing it onto a turn-based one. https://t.co/MoS5s4cm60

Cached at: 05/11/26, 10:47 PM

TL;DR: Mira Murati’s team unveils a new preview of an interaction model, supporting full-duplex real-time audio/video conversation, interruptible responses during speech, real-time multilingual translation, web search, and dynamic artifact generation.

Full-Duplex Real-Time Audio/Video Interaction Architecture

The demo opens by introducing a new full-duplex audio and video system. Users stream input to the model in real time, and the model can respond while the user is still speaking, which yields a low-latency, seamless interaction.
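The full-duplex idea can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the model's actual architecture, which the source does not describe: time is divided into discrete ticks, and on every tick the loop both ingests any new user input and emits at most one output token. The names `run_full_duplex`, `reply_for`, and `echo_reply` are hypothetical.

```python
from collections import deque


def run_full_duplex(user_input_by_tick, reply_for, total_ticks):
    """Toy loop: each tick ingests input AND emits at most one output token.

    user_input_by_tick: dict mapping tick number -> user utterance (assumed format).
    reply_for: hypothetical function turning an utterance into a list of tokens.
    """
    pending = deque()   # tokens of the reply currently being spoken
    emitted = []        # (tick, token) pairs that were actually output
    for tick in range(total_ticks):
        heard = user_input_by_tick.get(tick)
        if heard is not None:
            # Input arrived while a reply may still be playing:
            # drop the rest of the old reply (interruption) and start a new one.
            pending.clear()
            pending.extend(reply_for(heard))
        if pending:
            emitted.append((tick, pending.popleft()))
    return emitted


def echo_reply(text):
    # Hypothetical stand-in for the model: a three-token reply.
    return [f"re:{text}:{i}" for i in range(3)]


# The user speaks at tick 0 and again at tick 1, cutting off the first reply.
events = run_full_duplex({0: "a", 1: "b"}, echo_reply, total_ticks=5)
```

In this run the reply to "a" is cut off after one token because new input arrives at tick 1, which mirrors the interruptible behavior described above. A turn-based loop would instead finish the first reply before reading the second input.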

To test the model’s real-time visual perception and instruction following, the demo set a trigger rule: whenever a new person enters the frame, the model must immediately say “friend.” The model followed this rule correctly as several people took turns entering the shot.
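The trigger rule can be sketched as a simple per-frame comparison. This is a minimal illustration, assuming a hypothetical upstream detector that yields a set of person IDs for each video frame; the source does not say how the model perceives people, and whether a person who leaves and re-enters counts as "new" is also an assumption here (this sketch treats re-entry as a new trigger).

```python
def friend_triggers(frames):
    """Return (frame_index, "friend") once for every person entering the frame.

    frames: list of sets of person IDs, one set per frame (assumed input format).
    """
    previous = set()
    events = []
    for index, current in enumerate(frames):
        entered = current - previous      # IDs present now but not one frame ago
        events.extend((index, "friend") for _ in entered)
        previous = current
    return events


# Hypothetical sequence: p1 enters, p2 joins, p1 leaves, p1 comes back.
frames = [set(), {"p1"}, {"p1"}, {"p1", "p2"}, {"p2"}, {"p1", "p2"}]
triggers = friend_triggers(frames)
```

With this input the rule fires at frames 1, 3, and 5, and stays silent while the set of people in view is unchanged or shrinking.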

Real-Time Multilingual Translation

The preview model also supports low-latency real-time voice translation, further lowering the barrier to human-computer dialogue. In the demo, participant Rowan proposed continuing in Hindi and asked the model to translate into English simultaneously for the onsite audience and viewers. The model confirmed that the preview model can do this, enabling cross-language interaction in which it translates while the speaker is still talking.

Web Search and Dynamic Artifact Generation

The model integrates real-time web search and artifact generation and can run these tasks in parallel. In the demo, the participant asked about typical simple human reaction times to tactile, auditory, and visual signals. The model retrieved the following figures in real time:

Tactile: Approximately 150 milliseconds
Auditory: 140 to 170 milliseconds
Visual: 180 to 250 milliseconds

After receiving the data, the participant asked for a visualization, and the model immediately generated a bar chart comparing the reaction times. While the chart was rendering, the model kept the conversation going and answered the participant’s follow-up questions.
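For reference, the comparison can be reproduced as a plain-text bar chart from the three figures quoted above. This is a sketch only: the chart shown in the demo is not reproduced in the source, and plotting the midpoint of each range (and one `#` per 5 ms) is an assumption made here for illustration.

```python
# Reaction times in milliseconds as (low, high), taken from the figures above.
REACTION_MS = {
    "Tactile": (150, 150),   # "approximately 150 ms"
    "Auditory": (140, 170),
    "Visual": (180, 250),
}


def text_bar_chart(data, ms_per_char=5):
    """Render one bar per entry, scaled to the midpoint of its range (assumption)."""
    lines = []
    for label, (low, high) in data.items():
        mid = (low + high) / 2
        bar = "#" * int(mid // ms_per_char)
        lines.append(f"{label:<9}{bar} {mid:.0f} ms")
    return "\n".join(lines)


chart = text_bar_chart(REACTION_MS)
print(chart)
```

The midpoints work out to 150 ms, 155 ms, and 215 ms, so the visual bar is clearly the longest while the tactile and auditory bars are nearly equal.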

Neurological Principles of Sensory Reaction Speed

Asked why auditory reactions are faster than visual ones, the model gave a concise mechanistic explanation: the neural pathway that carries auditory signals to the brain is shorter and more direct than the visual pathway, so auditory signals are processed and acted on sooner. This explanation is consistent with the retrieved figures and shows the model combining real-time retrieved data with basic scientific knowledge.

At the end of the demo, the participant affirmed the model’s response speed and multimodal coordination (“This is great, Tar”). Taken together, the demo showcased the interaction model’s core features: real-time streaming input, multimodal understanding, dynamic generation, and concurrent task processing.

Source: https://youtu.be/A12AVongNN4
