InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
Summary
InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.
Cached at: 06/11/26, 01:40 PM
Source: https://huggingface.co/papers/2606.12195
Abstract
InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities.
Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
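The closed-loop view of MCR described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Context` class, `run_closed_loop`, the `retrieve` and `verify` callables, and the three-clip "video" are all hypothetical names and data invented for illustration. The sketch only shows the general pattern of acting with a tool, appending the result to a shared context, and checking whether the accumulated evidence supports an answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Context:
    """Hypothetical shared, evolving context (field names follow the abstract)."""
    instruction: str
    observations: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    tool_actions: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)


def run_closed_loop(
    ctx: Context,
    retrieve: Callable[[str, int], Optional[str]],
    verify: Callable[[Context], Optional[str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Accumulate evidence step by step until verify() accepts an answer."""
    for step in range(max_steps):
        # Act: record the tool call, then run it.
        ctx.tool_actions.append(f"retrieve(step={step})")
        evidence = retrieve(ctx.instruction, step)
        # Observe: fold any new evidence back into the shared context.
        if evidence is not None:
            ctx.observations.append(evidence)
            ctx.memory.append(evidence)
        # Verify: check whether the evidence gathered so far supports an answer.
        answer = verify(ctx)
        status = "answered" if answer is not None else "need more evidence"
        ctx.reasoning.append(f"step {step}: {status}")
        if answer is not None:
            return answer
    return None


# Toy stand-ins for a video and its tools (assumptions, not real APIs).
CLIPS = [
    "a person enters the kitchen",
    "the person opens the fridge",
    "the person takes out milk",
]


def toy_retrieve(instruction: str, step: int) -> Optional[str]:
    # Pretend retrieval: return the next clip caption, if any remain.
    return CLIPS[step] if step < len(CLIPS) else None


def toy_verify(ctx: Context) -> Optional[str]:
    # Accept an answer only once a supporting observation is in the context.
    for obs in ctx.observations:
        if "takes out" in obs:
            return obs.split("takes out ")[1]
    return None


ctx = Context(instruction="What does the person take out of the fridge?")
answer = run_closed_loop(ctx, toy_retrieve, toy_verify)
print(answer, len(ctx.observations))
```

In this toy run the loop stops as soon as the verification step finds supporting evidence, so the number of tool calls depends on where the evidence sits in the video rather than on a fixed budget.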
Get this paper in your agent:
hf papers read 2606.12195
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
yanziang/InternVideo3-8B-Instruct • Video-Text-to-Text • 9B • Updated about 11 hours ago • 105 • 2
Datasets citing this paper: 1
yanziang/InternVideo3_Dataset • Viewer • Updated about 11 hours ago • 380k • 28 • 1
Similar Articles
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
A survey presenting a human-view perspective on video understanding with multimodal large language models, organized around watching, remembering, and reasoning abilities, covering challenges, methods, and applications.
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR is a research paper proposing a closed-loop framework that collaboratively integrates vision-language models with video generation models to improve visual reasoning and correct failures in real-time.
M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
M^3Eval is a comprehensive evaluation framework and benchmark for probing memory capabilities in multi-modal models, grounded in cognitive psychology. Experiments reveal consistent weaknesses in memory maintenance, interference patterns, and spatial-temporal grounding.
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding
VideoKR introduces a large-scale video reasoning dataset and benchmark designed to enhance knowledge-intensive video understanding through expert-domain content and human-in-the-loop example generation. The dataset contains 315K video reasoning examples over 145K expert-domain videos.