Tag
Introduces OmniAgent, an omni-modal agent that uses an iterative Observation-Thought-Action cycle with active perception for long-video understanding, outperforming larger models such as Qwen2.5-VL-72B on benchmarks.
This paper proposes MODF-SIR, a multi-agent collaborative framework built on a lightweight multimodal large language model for social intelligence reasoning. It employs knowledge distillation, long-tail event extraction, and test-time adaptation to achieve state-of-the-art results with reduced training data.
Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.
TOBench is a new benchmark for evaluating AI agents on real-world, task-oriented tool use with multimodal inputs and closed-loop verification. Experiments show that top models such as Qwen 3.5 Plus achieve only 41% success, far below the 94% achieved by humans.
SEATS is a training-free, stage-adaptive token selection method that reduces computational overhead in omni-modal LLMs by progressively pruning redundant visual and audio tokens, achieving a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of performance.
Alibaba Qwen announces two major model releases: Qwen3-Omni, the first natively end-to-end omni-modal AI unifying text, image, audio and video, and Qwen3-Next-80B-A3B, an ultra-efficient MoE model with 3B activated parameters per token, achieving SOTA performance and 10x faster inference than Qwen3-32B.
NVIDIA releases Nemotron 3 Nano Omni, a new long-context multimodal AI model capable of processing documents, audio, video, and text with high accuracy and efficiency.
This paper investigates modality preference in omni-modal large language models (OLLMs), revealing a shift from text dominance to visual preference. The authors introduce a conflict-based benchmark and layer-wise probing to diagnose cross-modal hallucinations using internal model signals.
OmniGUI introduces a step-level benchmark for GUI agents that integrates static images, synchronous audio, and video clips to simulate real smartphone interactions. Evaluation shows current models struggle with temporal and auditory inputs, highlighting the need for omni-modal capabilities.