omni-modal

#omni-modal

Native Active Perception as Reasoning for Omni-Modal Understanding

Hugging Face Daily Papers · 2026-06-17

Introduces OmniAgent, an omni-modal agent that uses an iterative Observation-Thought-Action cycle with active perception for long-video understanding, outperforming larger models such as Qwen2.5-VL-72B on benchmarks.

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

arXiv cs.AI · 2026-06-11

This paper proposes MODF-SIR, a multi-agent collaborative framework built on a lightweight multimodal large language model for social intelligence reasoning. It employs knowledge distillation, long-tail event extraction, and test-time adaptation to achieve state-of-the-art results with reduced training data.

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Hugging Face Daily Papers · 2026-06-07

Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

arXiv cs.AI · 2026-05-19

TOBench is a new benchmark for evaluating AI agents on real-world, task-oriented tool use with multimodal inputs and closed-loop verification. Experiments show top models like Qwen 3.5 Plus achieve only 41% success, far below the 94% human benchmark, highlighting a significant gap.

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Hugging Face Daily Papers · 2026-05-19

SEATS is a training-free, stage-adaptive token selection method that reduces computational overhead in omni-modal LLMs by progressively pruning redundant visual and audio tokens. It achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

Qwen3.7 Preview lands on Arena (1 minute read)

TLDR AI · 2026-05-19

Alibaba Qwen announces two major model releases: Qwen3-Omni, the first natively end-to-end omni-modal AI unifying text, image, audio and video, and Qwen3-Next-80B-A3B, an ultra-efficient MoE model with 3B activated parameters per token, achieving SOTA performance and 10x faster inference than Qwen3-32B.

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

Hugging Face Blog · 2026-04-28

NVIDIA releases Nemotron 3 Nano Omni, a new long-context multimodal AI model capable of processing documents, audio, video, and text with high accuracy and efficiency.

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Hugging Face Daily Papers · 2026-04-18

This paper investigates modality preference in omni-modal large language models (OLLMs), revealing a paradigm shift from text-dominance to visual preference. The authors introduce a conflict-based benchmark and layer-wise probing to diagnose cross-modal hallucinations using internal model signals.

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Hugging Face Daily Papers · 2026-04-03

OmniGUI introduces a step-level benchmark for GUI agents that integrates static images, synchronous audio, and video clips to simulate real smartphone interactions. Evaluation shows current models struggle with temporal and auditory inputs, highlighting the need for omni-modal capabilities.
