MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Hugging Face Daily Papers

Summary

MiniCPM-o 4.5 is a 9B-parameter multimodal model built on Omni-Flow, a framework for real-time full-duplex interaction in which the model perceives and responds at the same time and can act proactively. It approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale, and runs on edge devices with less than 12GB of RAM.

Recent progress in multimodal large language models (MLLMs) has moved AI capabilities from static offline data processing to real-time streaming interaction, yet these models remain far from human-level multimodal interaction. The key bottleneck is no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, which prevents models from incorporating new inputs and adjusting in time during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in an evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which narrows these gaps through real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computational efficiency. Thanks to its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices using less than 12GB of RAM.
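To put the 9B-parameter and 12GB figures side by side, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at a few common numeric precisions. The abstract does not state which precision or quantization scheme is used, so the precisions listed here are assumptions for illustration, and the estimate ignores activations, caches, and runtime overhead.

```python
# Weights-only memory estimate for a 9B-parameter model.
# Assumption: the bit widths below are illustrative; the abstract
# does not say which precision the edge deployment actually uses.
PARAMS = 9e9   # total parameter count stated in the abstract
GB = 1e9       # decimal gigabytes


def weight_gb(bits_per_param: float) -> float:
    """Return the size of the weights in GB at the given bit width."""
    return PARAMS * bits_per_param / 8 / GB


for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_gb(bits):.1f} GB")
```

Under these assumptions, 16-bit weights alone would come to about 18 GB, while 8-bit and 4-bit weights would come to about 9 GB and 4.5 GB. A budget below 12GB is therefore consistent with some form of reduced-precision weights or comparable optimization, although the abstract itself only mentions "inference optimization" without detail.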

Cached at: 05/08/26, 07:57 AM

Source: https://huggingface.co/papers/2604.27393

Abstract

MiniCPM-o 4.5 enables real-time full-duplex multimodal interaction through Omni-Flow, a unified streaming framework that aligns inputs and outputs temporally for simultaneous perception and response.
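The idea of aligning inputs and outputs on a shared temporal axis can be illustrated with a toy loop. This is only a minimal sketch, not the paper's implementation: `Slice`, `toy_policy`, and `full_duplex` are hypothetical names, the inputs are plain strings rather than audio or video streams, and the rule-based policy stands in for the model. In every time slice the loop perceives one input chunk and decides whether to emit output, so a response can overlap with ongoing input and an unprompted remark can arise from the same mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slice:
    t: int                 # position on the shared temporal axis
    heard: str             # input chunk perceived during this slice
    spoken: Optional[str]  # output emitted in the same slice, None = silent


def toy_policy(history: List[str]) -> Optional[str]:
    """Hypothetical stand-in for the model's per-slice decision."""
    latest = history[-1]
    if "kettle boiling" in latest:
        return "Reminder: the kettle is boiling."  # proactive, no request made
    if latest.endswith("?"):
        return "Let me check."                     # reactive reply
    return None                                    # keep listening


def full_duplex(chunks: List[str]) -> List[Slice]:
    """Perceive one chunk and decide on an output within every time slice."""
    history: List[str] = []
    timeline: List[Slice] = []
    for t, chunk in enumerate(chunks):
        history.append(chunk)
        timeline.append(Slice(t, chunk, toy_policy(history)))
    return timeline


stream = [
    "user: what is on",
    "user: the stove?",
    "video: kettle boiling",
    "user: thanks",
]
timeline = full_duplex(stream)
for s in timeline:
    print(s.t, repr(s.heard), "->", s.spoken)
```

A turn-based loop would instead consume the whole stream before producing any output. Here the reply at slice 1 and the reminder at slice 2 are both emitted while input is still arriving.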

Get this paper in your agent:

hf papers read 2604.27393

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

MiniCPM-V 4.6

Product Hunt

MiniCPM-V 4.6 is an ultra-efficient 1.3B vision-language model optimized for mobile devices.

MiniCPM4: Ultra-Efficient LLMs on End Devices

Papers with Code Trending

MiniCPM4 is a highly efficient large language model designed for end devices, achieving strong performance with 0.5B and 8B parameter versions through innovations in sparse attention, data filtering, training algorithms, and inference systems.

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Papers with Code Trending

MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly surpasses larger proprietary and open-source models on benchmarks while significantly reducing GPU memory usage and inference time.