video-understanding

#video-understanding

@_TobiasLee: Seed 2.1 from Bytedance achieved impressive results on two of our benchmarks. Claw-Eval (Multimodal, https://claw-eval.…

X AI KOLs Timeline · yesterday

ByteDance's Seed 2.1 model achieved strong results on multimodal agentic (Claw-Eval) and long video understanding (Video-MME) benchmarks, though a gap remains between perception and agentic capabilities.


Native Active Perception as Reasoning for Omni-Modal Understanding

Hugging Face Daily Papers · 2026-06-17

Introduces OmniAgent, an omni-modal agent that uses an iterative Observation-Thought-Action cycle with active perception to achieve superior long video understanding, outperforming larger models like Qwen2.5-VL-72B on benchmarks.


@jiqizhixin: What if your AI could “see” video like a streaming codec—spending tokens only on the most important moments? Introducin…

X AI KOLs Timeline · 2026-06-15

LLaVA-OneVision-2 introduces codec-stream tokenization for efficient video understanding, significantly outperforming Qwen3-VL-8B on temporal and spatial benchmarks. The model, data, and code are open-sourced.


InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Hugging Face Daily Papers · 2026-06-10

InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.


When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

arXiv cs.AI · 2026-06-09

This paper studies whether multimodal large language models (MLLMs) can detect that the correct answer is absent in video understanding tasks. It finds that models systematically fail, selecting plausible distractors instead of recognizing that no valid option exists. The failure worsens on temporal reasoning tasks and with dense frame sampling, and chain-of-thought prompting only partially mitigates it.


OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Hugging Face Daily Papers · 2026-06-07

Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.


Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Hugging Face Daily Papers · 2026-06-05

A survey presenting a human-view perspective on video understanding with multimodal large language models, organized around watching, remembering, and reasoning abilities, covering challenges, methods, and applications.


VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv cs.CL · 2026-06-04

VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.


Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Hugging Face Daily Papers · 2026-06-04

GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.


Towards One-to-Many Temporal Grounding

Hugging Face Daily Papers · 2026-06-04

This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query, along with a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that achieve state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.


@MaxForAI: Yesterday, ByteDance Seed open-sourced a very interesting checkpoint, TaskMem. It is trained on Qwen3-VL-30B-A3B, and its goal is not to answer questions directly but to enable multimodal agents to learn to generate more useful long-term memory from video/environment streams. The key is to let the agent learn in continuous video…

X AI KOLs Timeline · 2026-06-03

ByteDance Seed has open-sourced the TaskMem checkpoint, trained on Qwen3-VL-30B-A3B. It uses two-stage reinforcement learning to teach multimodal agents to generate long-term memory from video streams, achieving significant improvements on benchmarks such as Video-MME and EgoLife.


@PinzhiHuang: State tracking is a core pillar of video understanding: it requires identifying entities and events, and mapping how th…

X AI KOLs Following · 2026-06-03

Introduces VSTAT, a new benchmark to measure how well multimodal LLMs track states in videos, revealing that frontier models struggle with tasks humans find easy.


@ma_nanye: VSTAT highlights the substantial perceptual gap between humans and MLLMs, but it goes far beyond that. Its diverse task…

X AI KOLs Following · 2026-06-03

VSTAT is a new benchmark for visual state tracking in videos that reveals perceptual gaps between humans and multimodal LLMs.


M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Hugging Face Daily Papers · 2026-06-03

M^3Eval is a comprehensive evaluation framework and benchmark for probing memory capabilities in multi-modal models, grounded in cognitive psychology. Experiments reveal consistent weaknesses in memory maintenance, interference patterns, and spatial-temporal grounding.


Benchmarking Visual State Tracking in Multimodal Video Understanding

Hugging Face Daily Papers · 2026-06-02

Introduces VSTAT, a benchmark for evaluating visual state tracking in multimodal large language models (MLLMs) using 834 clips and 1,500 questions. Current MLLMs perform poorly compared to humans, failing at visual perception rather than reasoning.


X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Hugging Face Daily Papers · 2026-06-01

X-Stream introduces the first benchmark for multi-stream video understanding, evaluating MLLMs as multiplexers across multiple concurrent streams. The study reveals that current MLLMs achieve only about 50% accuracy, exposing significant limitations in handling multiple streams.


Linear Scaling Video VLMs for Long Video Understanding

Hugging Face Daily Papers · 2026-05-29

StateKV is an inference-time method that enables linear-time video prefill for long-video vision-language models by carrying cross-frame context in a fixed-capacity recurrent state, maintaining accuracy close to full self-attention without fine-tuning.


EarlyTom: Early Token Compression Completes Fast Video Understanding

Hugging Face Daily Papers · 2026-05-28

EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining accuracy, achieving up to 2.65x TTFT reduction.


Kwai-Keye/Keye-VL-2.0-30B-A3B

Hugging Face Models Trending · 2026-05-25

Kwai-Keye releases Keye-VL-2.0-30B-A3B, a 30B-class vision-language model with advanced video understanding, sparse attention, and agent capabilities, reporting top benchmark results.


LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Hugging Face Daily Papers · 2026-05-25

LLaVA-OneVision-2 introduces codec-stream tokenization and windowed attention for efficient video understanding, achieving state-of-the-art performance across multiple multimodal benchmarks including video, spatial, and tracking tasks.
