omnimodal

#omnimodal

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

arXiv cs.CL · 6d ago

This paper shows that text-and-image coding agents with sandboxed tool use can match or outperform native omni-modal models on audio-video benchmarks by converting omni-modal tasks into retrieval and information-processing problems.

@mli0603: This is THE moment of Physical AI! We are officially announcing Cosmos 3: Omnimodal World Models for Physical AI - Cosm…

X AI KOLs Following · 2026-06-01

Announcing Cosmos 3, an omnimodal world model for Physical AI that can understand and generate language, images, video, audio, and actions within a unified architecture.

Cosmos 3: Omnimodal World Models for Physical AI

Hugging Face Daily Papers · 2026-06-01

Cosmos 3 is a family of omnimodal world models from NVIDIA that jointly processes language, image, video, audio, and action sequences using a unified mixture-of-transformers architecture, achieving state-of-the-art performance in understanding and generation tasks for Physical AI.

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Hugging Face Daily Papers · 2026-05-26

OmniInteract introduces a streaming benchmark for real-time omnimodal LLMs, evaluating online audio-visual processing with temporal grounding and interactive response requirements. Experiments show that current models perform poorly, with the best overall IA-QTF1 score reaching only 0.368.

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

arXiv cs.CL · 2026-05-22

LatentOmni proposes a unified latent space for audio-visual reasoning, avoiding the information loss of text-based chain-of-thought. It achieves state-of-the-art performance among open-source models on audio-visual reasoning benchmarks.

@rohanpaul_ai: Just a few days back, Thinking Machines Lab (TML), showcased a way of making AI interaction continuous instead of turn-…

X AI KOLs Following · 2026-05-17

Thinking Machines Lab and OpenBMB released MiniCPM-o 4.5, a 9B full-duplex omnimodal model whose Omni-Flow framework enables continuous, time-aligned, real-time video and voice interaction. The model is reported to surpass previous models and is available as open source.

unsloth/MiMo-V2.5-GGUF · Hugging Face

Reddit r/LocalLLaMA · 2026-05-11

MiMo-V2.5 is a native omnimodal AI model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified sparse MoE architecture.

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

Hugging Face Daily Papers · 2026-05-11

This paper introduces Omni-Persona, the first comprehensive benchmark for omnimodal personalization across text, image, and audio, featuring a Persona Modality Graph and a new Calibrated Accuracy metric to evaluate grounding behaviors.
