Benchmarking Visual State Tracking in Multimodal Video Understanding

Hugging Face Daily Papers 06/02/26, 12:00 AM Papers

Summary

Introduces VSTAT, a benchmark for evaluating visual state tracking in multimodal large language models (MLLMs) using 834 clips and 1,500 questions. Current MLLMs perform poorly compared to humans, failing at visual perception rather than reasoning.

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

Original Article


Cached at: 06/03/26, 03:35 AM


Source: https://huggingface.co/papers/2606.03920

Abstract

Current multimodal large language models struggle with visual state tracking in videos, performing far below humans, and recent agentic approaches do not readily resolve these failures.

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs’ thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.
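The abstract compares models against "answer-prior baselines" without describing how they are built. As a rough illustration only, the sketch below shows one common way such a baseline can be formed: predict the most frequent answer for each question type while never looking at the video. The field names `task` and `answer` and the toy records are hypothetical and are not taken from the VSTAT release; the prior is also fitted and scored on the same toy set purely for brevity.

```python
from collections import Counter, defaultdict


def fit_answer_prior(examples):
    """Return the most frequent answer per question type (video is ignored)."""
    counts = defaultdict(Counter)
    for ex in examples:
        counts[ex["task"]][ex["answer"]] += 1
    # most_common(1) yields [(answer, count)] for the top answer of each task
    return {task: c.most_common(1)[0][0] for task, c in counts.items()}


def accuracy(prior, examples):
    """Fraction of examples whose answer matches the per-task prior guess."""
    correct = sum(prior.get(ex["task"]) == ex["answer"] for ex in examples)
    return correct / len(examples)


# Hypothetical toy records; real benchmark fields may differ.
toy = [
    {"task": "count", "answer": "3"},
    {"task": "count", "answer": "3"},
    {"task": "count", "answer": "5"},
    {"task": "order", "answer": "B"},
    {"task": "order", "answer": "A"},
    {"task": "order", "answer": "B"},
]

prior = fit_answer_prior(toy)
prior_accuracy = accuracy(prior, toy)
print(prior, round(prior_accuracy, 3))
```

A model that scores only modestly above a baseline of this kind is extracting little usable information from the video itself, which is the comparison the abstract draws.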


Get this paper in your agent:

hf papers read 2606.03920

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

#### nyu-visionx/vstat



Similar Articles

@PinzhiHuang: State tracking is a core pillar of video understanding: it requires identifying entities and events, and mapping how th…

@ma_nanye: VSTAT highlights the substantial perceptual gap between humans and MLLMs, but it goes far beyond that. Its diverse task…

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

ViMU: Benchmarking Video Metaphorical Understanding
