Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Hugging Face Daily Papers 06/05/26, 12:00 AM Papers

video-understanding multimodal-llms survey perception memory reasoning

Summary

A survey presenting a human-view perspective on video understanding with multimodal large language models, organized around watching, remembering, and reasoning abilities, covering challenges, methods, and applications.

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
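The abstract describes a formulation in terms of perceptual representations, memory states, reasoning traces, and final predictions, but does not give the formulation itself. The sketch below is only a toy illustration of how those four quantities might be threaded through a watch, remember, reason loop. Every name in it (`VideoState`, `watch`, `remember`, `reason`, `run`, the `capacity` parameter) is hypothetical and not taken from the paper, and text captions stand in for real video frames.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoState:
    """Hypothetical container for the four quantities named in the abstract."""
    percepts: List[str] = field(default_factory=list)  # perceptual representations
    memory: List[str] = field(default_factory=list)    # memory state
    trace: List[str] = field(default_factory=list)     # reasoning trace
    prediction: str = ""                                # final prediction


def watch(frame: str) -> str:
    """Toy perception: normalise a frame caption into a percept."""
    return frame.strip().lower()


def remember(memory: List[str], percept: str, capacity: int = 3) -> List[str]:
    """Toy streaming memory: keep only the most recent `capacity` percepts."""
    return (memory + [percept])[-capacity:]


def reason(memory: List[str], question: str) -> Tuple[List[str], str]:
    """Toy reasoning: the trace is every memory entry sharing a word with the question."""
    words = set(question.lower().split())
    trace = [m for m in memory if words & set(m.split())]
    prediction = trace[-1] if trace else "unknown"
    return trace, prediction


def run(frames: List[str], question: str, capacity: int = 3) -> VideoState:
    """Process frames one at a time, then answer from what is left in memory."""
    state = VideoState()
    for frame in frames:
        percept = watch(frame)
        state.percepts.append(percept)
        state.memory = remember(state.memory, percept, capacity)
    state.trace, state.prediction = reason(state.memory, question)
    return state


if __name__ == "__main__":
    frames = [
        "A chef chops onions",
        "The chef fries onions",
        "A dog barks",
        "The chef plates pasta",
    ]
    result = run(frames, "what does the chef do with pasta", capacity=3)
    print(result.memory)
    print(result.trace)
    print(result.prediction)
```

With a capacity of three, the first caption is evicted before reasoning starts, which is a small-scale version of the limited-budget constraint the abstract raises: the prediction can only be grounded in evidence that memory retained.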

Cached at: 06/08/26, 07:14 AM

Source: https://huggingface.co/papers/2606.07433

Abstract

The survey structures multimodal large language models for video understanding around three core capabilities: watching, remembering, and reasoning. It spans applications across multiple video domains and addresses challenges in perception, memory, and reasoning.

Similar Articles

Benchmarking Visual State Tracking in Multimodal Video Understanding

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Learning to reason with LLMs

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
