MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Summary
MiniCPM-o 4.5 is a 9B-parameter multimodal model built on Omni-Flow, a framework for real-time full-duplex interaction in which the model perceives and responds at the same time and can act proactively. It delivers state-of-the-art open-source performance at its scale, approaching Gemini 2.5 Flash in vision-language capabilities, and runs on edge devices with less than 12GB of RAM.
Cached at: 05/08/26, 07:57 AM
Source: https://huggingface.co/papers/2604.27393
Abstract
MiniCPM-o 4.5 enables real-time full-duplex multimodal interaction through Omni-Flow, a unified streaming framework that aligns inputs and outputs temporally for simultaneous perception and response.
Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.
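The abstract does not say how Omni-Flow is implemented. As a rough illustration only of what placing inputs and outputs on a shared temporal axis could mean, the toy sketch below groups timestamped events from several streams into fixed-length time chunks, so that model output sits in the same chunk as the input arriving at that moment. The `Event` type, the stream names, the `align_on_time_axis` helper, and the 1-second chunk length are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    t: float      # timestamp in seconds (hypothetical unit)
    stream: str   # hypothetical stream name, e.g. "video_in", "speech_out"
    payload: str  # placeholder for a frame, audio chunk, or generated token


def align_on_time_axis(events: List[Event], chunk_s: float = 1.0) -> List[Dict[str, List[str]]]:
    """Group events from all streams into fixed-length chunks on one time axis."""
    if not events:
        return []
    n_chunks = int(max(e.t for e in events) // chunk_s) + 1
    chunks: List[Dict[str, List[str]]] = [dict() for _ in range(n_chunks)]
    for e in sorted(events, key=lambda ev: ev.t):
        idx = int(e.t // chunk_s)  # index of the chunk this event falls into
        chunks[idx].setdefault(e.stream, []).append(e.payload)
    return chunks


# Toy data: the model's speech output overlaps with ongoing user audio and video.
events = [
    Event(0.2, "video_in", "frame0"),
    Event(0.5, "audio_in", "user: hel-"),
    Event(1.1, "audio_in", "user: -lo"),
    Event(1.4, "speech_out", "model: hi"),
    Event(1.6, "video_in", "frame1"),
]
timeline = align_on_time_axis(events)
for i, chunk in enumerate(timeline):
    print(i, chunk)
```

In a turn-based layout the `speech_out` event would only appear after all user input had ended; in this toy timeline it shares a chunk with incoming audio and video, which is the full-duplex property the abstract describes.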
Similar Articles
@rohanpaul_ai: Just a few days back, Thinking Machines Lab (TML), showcased a way of making AI interaction continuous instead of turn-…
Thinking Machines Lab and OpenBMB released MiniCPM-o 4.5, a 9B full-duplex omnimodal model with the Omni-Flow framework that enables continuous, time-aligned real-time video and voice interaction, surpassing previous models and available as open source.
@AdinaYakup: MiniCPM V4.6 a 1B MLLM that actually runs on your phone, just released by @OpenBMB 1B - Apache2.0 Runs on iOS, Android,…
OpenBMB has released MiniCPM V4.6, a 1B-parameter multimodal large language model optimized for mobile devices under the Apache 2.0 license. It features mixed visual token compression and claims approximately 1.5x faster throughput than Qwen3.5 0.8B while running natively on iOS, Android, and HarmonyOS.
MiniCPM-V 4.6
MiniCPM-V 4.6 is an ultra-efficient 1.3B vision-language model optimized for mobile devices.
MiniCPM4: Ultra-Efficient LLMs on End Devices
MiniCPM4 is a highly efficient large language model designed for end devices, achieving strong performance with 0.5B and 8B parameter versions through innovations in sparse attention, data filtering, training algorithms, and inference systems.
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly surpasses larger proprietary and open-source benchmarks while significantly reducing GPU memory usage and inference time.