This paper shows that text+image coding agents with sandboxed tool use can match or outperform native omni-modal models on audio-video benchmarks by converting omni-modal tasks into retrieval and information-processing problems.
Announcing Cosmos 3, an omnimodal world model for Physical AI that can understand and generate language, images, video, audio, and actions within a unified architecture.
Cosmos 3 is a family of omnimodal world models from NVIDIA that jointly processes language, image, video, audio, and action sequences using a unified mixture-of-transformers architecture, achieving state-of-the-art performance in understanding and generation tasks for Physical AI.
OmniInteract introduces a streaming benchmark for real-time omnimodal LLMs, evaluating online audio-visual processing with temporal grounding and interactive response requirements. Experiments show that current models perform poorly, with the best overall IA-QTF1 score reaching only 0.368.
LatentOmni proposes a unified latent space for audio-visual reasoning, avoiding the information loss of text-based chain-of-thought. It achieves state-of-the-art performance among open-source models on audio-visual reasoning benchmarks.
Thinking Machines Lab and OpenBMB released MiniCPM-o 4.5, a 9B full-duplex omnimodal model whose Omni-Flow framework enables continuous, time-aligned real-time video and voice interaction. The model surpasses previous models and is available as open source.
MiMo-V2.5 is a native omnimodal AI model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified sparse MoE architecture.
This paper introduces Omni-Persona, the first comprehensive benchmark for omnimodal personalization across text, image, and audio, featuring a Persona Modality Graph and a new Calibrated Accuracy metric to evaluate grounding behaviors.