@miramurati: Today we're sharing our work on interaction models. A new class of model trained from scratch to handle real-time inter…

Summary

Mira Murati's team previewed a new interaction model. Trained from scratch, it natively supports full-duplex real-time audio and video conversation, instant interruption, multilingual translation, and dynamic multitasking. The demonstration showed its core capabilities: low-latency streaming interaction, multimodal perception, and concurrent task execution.

Today we're sharing our work on interaction models. A new class of model trained from scratch to handle real-time interaction natively, instead of gluing it onto a turn-based one. https://t.co/MoS5s4cm60

Cached at: 05/11/26, 10:47 PM

TL;DR: Mira Murati’s team unveils a preview of an interaction model that supports full-duplex real-time audio/video conversation, responses that can be interrupted mid-speech, real-time multilingual translation, web search, and dynamic artifact generation.

Full-Duplex Real-Time Audio/Video Interaction Architecture

The demo opens by introducing a new full-duplex audio and video system. Users stream input to the model in real time, and the model can respond while the user is still speaking, which gives a low-latency, continuous interaction.
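The full-duplex idea can be illustrated with a minimal Python sketch. It is not the model's actual architecture, which the source does not describe. The names `listen`, `speak`, `full_duplex`, and `run_demo` are hypothetical, and text chunks stand in for audio/video frames. The sketch only shows the shape of the interaction: input keeps arriving while output is already being produced.

```python
import asyncio

async def listen(user_chunks, inbox, log):
    """Forward user input to the responder chunk by chunk as it arrives."""
    for chunk in user_chunks:
        await inbox.put(chunk)
        log.append(("heard", chunk))
        await asyncio.sleep(0)  # yield control so the responder can run
    await inbox.put(None)  # end-of-stream marker (an assumption of this sketch)

async def speak(inbox, log):
    """Respond to each chunk immediately instead of waiting for the full turn."""
    while True:
        chunk = await inbox.get()
        if chunk is None:
            break
        log.append(("said", f"ack:{chunk}"))

async def full_duplex(user_chunks):
    """Run the listening and speaking loops concurrently on one event loop."""
    inbox = asyncio.Queue()
    log = []
    await asyncio.gather(listen(user_chunks, inbox, log), speak(inbox, log))
    return log

def run_demo(chunks):
    return asyncio.run(full_duplex(chunks))

for kind, text in run_demo(["hello", "can you", "hear me"]):
    print(kind, text)
```

In a turn-based design, every "said" entry would come after the final "heard" entry. Here the two kinds of entries interleave, which is the property the demo emphasizes.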

To test the model’s real-time visual perception and instruction following, the demo set a trigger rule: whenever a new person enters the frame, the model must immediately say “friend.” The model followed this rule correctly as several people took turns entering the shot.
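The trigger rule can be written down as a small Python sketch. It assumes a hypothetical upstream detector that reports a set of person IDs for each video frame, and it treats anyone absent from the previous frame as "new", so re-entries also count. The source does not say how the model perceives people or whether re-entries count, so both points are assumptions.

```python
def friend_events(frames):
    """Yield (frame_index, "friend") once per person who enters the frame.

    `frames` is a sequence of sets of person IDs visible in each frame.
    The IDs stand in for the output of a hypothetical person detector.
    """
    previous = set()
    for index, visible in enumerate(frames):
        # Anyone visible now but not in the previous frame has just entered.
        for _ in visible - previous:
            yield index, "friend"
        previous = set(visible)

# Hypothetical sequence: p1 enters, p2 joins, p1 leaves, p1 comes back.
frames = [set(), {"p1"}, {"p1"}, {"p1", "p2"}, {"p2"}, {"p1", "p2"}]
events = list(friend_events(frames))
print(events)
```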

Real-Time Multilingual Translation

The preview model also supports low-latency real-time voice translation, lowering the barrier to human-computer dialogue. In the demo, participant Rowan offered to add remarks in Hindi and asked the model to translate them into English simultaneously for the onsite audience and viewers. The model confirmed that the preview model can do this, enabling cross-language interaction in which translation happens while the speaker is still talking.

Web Search and Dynamic Artifact Generation

The model integrates real-time web search and artifact generation and can run these tasks in parallel with the conversation. In the demo, the participant asked about typical simple reaction times in humans for tactile, auditory, and visual signals. The model retrieved the following figures in real time:

  • Tactile: Approximately 150 milliseconds
  • Auditory: 140 to 170 milliseconds
  • Visual: 180 to 250 milliseconds

After receiving the data, the participant asked for a visualization. The model generated a bar chart comparing the reaction times. While the chart was rendering, the model kept the conversation going and answered the participant’s follow-up questions.
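As a rough stand-in for the generated artifact, the following Python sketch renders the same comparison as a text bar chart. The figures are the ones reported above. Plotting the midpoint of each range, the 5 ms per character scale, and the names `REACTION_MS` and `render_bars` are choices made for this sketch, not details from the demo.

```python
# Reaction-time ranges in milliseconds, as reported in the demo.
REACTION_MS = {
    "Tactile": (150, 150),
    "Auditory": (140, 170),
    "Visual": (180, 250),
}

def render_bars(ranges, ms_per_char=5):
    """Render one text bar per sense, sized by the midpoint of its range."""
    lines = []
    for label, (low, high) in ranges.items():
        midpoint = (low + high) // 2
        bar = "#" * (midpoint // ms_per_char)
        lines.append(f"{label:<9}{bar} {midpoint} ms")
    return "\n".join(lines)

chart = render_bars(REACTION_MS)
print(chart)
```

The midpoints (150, 155, and 215 ms) make the visual bar clearly the longest, while the tactile and auditory bars are nearly equal.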

Neurological Principles of Sensory Reaction Speed

Asked why auditory reactions are faster than visual ones, the model gave a concise mechanistic explanation: the neural pathway that carries auditory signals to the brain is shorter and more direct than the visual pathway, so auditory signals are processed and acted on sooner. The explanation is consistent with the retrieved figures and shows the model combining real-time retrieval with general scientific knowledge.

At the end of the demo, the participant praised the model’s response speed and multimodal coordination (“This is great, Tar”), closing a demonstration of the interaction model’s core features: real-time streaming input, multimodal understanding, dynamic generation, and concurrent task processing.

Source: https://youtu.be/A12AVongNN4
