ByteDance's Seed 2.1 model achieved strong results on multimodal agentic (Claw-Eval) and long video understanding (Video-MME) benchmarks, though a gap remains between perception and agentic capabilities.
Introduces OmniAgent, an omni-modal agent that uses an iterative Observation-Thought-Action cycle with active perception to achieve superior long video understanding, outperforming larger models like Qwen2.5-VL-72B on benchmarks.
LLaVA-OneVision-2 introduces codec-stream tokenization for efficient video understanding, significantly outperforming Qwen3-VL-8B on temporal and spatial benchmarks. The model, data, and code are open-sourced.
InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.
This paper studies whether multimodal large language models (MLLMs) can detect when the correct answer is absent in video understanding tasks, finding that models systematically fail by selecting plausible distractors instead of recognizing that no valid option exists. The failure worsens on temporal reasoning questions and with dense frame sampling, and chain-of-thought prompting only partially mitigates the issue.
Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.
A survey presenting a human-view perspective on video understanding with multimodal large language models, organized around watching, remembering, and reasoning abilities, covering challenges, methods, and applications.
VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.
GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.
This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query. It contributes a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that yield state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.
ByteDance Seed has open-sourced the TaskMem checkpoint, trained on Qwen3-VL-30B-A3B. It uses two-stage reinforcement learning to teach multimodal agents to generate long-term memory from video streams, achieving significant improvements on benchmarks such as Video-MME and EgoLife.
Introduces VSTAT, a new benchmark to measure how well multimodal LLMs track states in videos, revealing that frontier models struggle with tasks humans find easy.
VSTAT is a new benchmark for visual state tracking in videos that reveals perceptual gaps between humans and multimodal LLMs.
M^3Eval is a comprehensive evaluation framework and benchmark for probing memory capabilities in multimodal models, grounded in cognitive psychology. Experiments reveal consistent weaknesses in memory maintenance, susceptibility to interference, and spatial-temporal grounding.
Introduces VSTAT, a benchmark for evaluating visual state tracking in multimodal large language models (MLLMs) using 834 clips and 1,500 questions. Current MLLMs perform poorly compared to humans, failing at visual perception rather than reasoning.
X-Stream introduces the first benchmark for multi-stream video understanding, evaluating MLLMs as multiplexers across multiple concurrent streams. The study reveals that current MLLMs achieve only about 50% accuracy, exposing significant limitations in handling multiple streams.
StateKV is an inference-time method that enables linear-time video prefill for long-video vision-language models by carrying cross-frame context in a fixed-capacity recurrent state, maintaining accuracy close to full self-attention without fine-tuning.
EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining accuracy, achieving up to 2.65x TTFT reduction.
Kwai-Keye releases Keye-VL-2.0-30B-A3B, a 30B-class vision-language model with advanced video understanding, sparse attention, and agent capabilities, reporting top benchmark results.
LLaVA-OneVision-2 introduces codec-stream tokenization and windowed attention for efficient video understanding, achieving state-of-the-art performance across multiple multimodal benchmarks including video, spatial, and tracking tasks.