InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Hugging Face Daily Papers 06/10/26, 12:00 AM Papers

multimodal video-understanding reasoning foundation-models attention reinforcement-learning agent

Summary

InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
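The closed-loop idea in the abstract can be illustrated with a toy sketch. This is not the InternVideo3 implementation: the `Context` class, the `retrieve_clip` tool, the caption-based "video", and the stopping rule are all hypothetical stand-ins, chosen only to show a single shared context that accumulates instructions, tool actions, observations, reasoning, and memory until enough evidence has been verified.

```python
from dataclasses import dataclass, field

# Entry kinds mirror the abstract's list of context contents (names are assumptions).
KINDS = {"observation", "instruction", "reasoning", "tool_action", "memory"}


@dataclass
class Context:
    """Shared, evolving context; every step of the loop appends to it."""
    entries: list = field(default_factory=list)

    def add(self, kind, content):
        assert kind in KINDS, f"unknown entry kind: {kind}"
        self.entries.append((kind, content))

    def of_kind(self, kind):
        return [content for k, content in self.entries if k == kind]


def retrieve_clip(captions, keyword, seen):
    """Toy retrieval tool: first unseen clip whose caption mentions the keyword."""
    for idx, caption in enumerate(captions):
        if idx not in seen and keyword in caption:
            return idx, caption
    return None


def answer_with_evidence(captions, keyword, needed=2, max_steps=10):
    """Closed loop: act, observe, record, and stop once evidence is sufficient."""
    ctx = Context()
    ctx.add("instruction", f"find {needed} clips mentioning '{keyword}'")
    seen = set()
    for _ in range(max_steps):
        # Verification: check accumulated evidence before acting again.
        if len(ctx.of_kind("memory")) >= needed:
            ctx.add("reasoning", "enough evidence collected")
            break
        ctx.add("tool_action", f"retrieve_clip('{keyword}')")
        hit = retrieve_clip(captions, keyword, seen)
        if hit is None:
            ctx.add("reasoning", "no further evidence available")
            break
        idx, caption = hit
        seen.add(idx)
        ctx.add("observation", caption)
        ctx.add("memory", idx)  # remember which clip supplied the evidence
    return ctx.of_kind("memory"), ctx


# Captions stand in for video clips; a real system would operate on frames.
captions = ["a cat sleeps", "a dog runs", "the cat jumps", "rain falls", "cat eats"]
evidence, ctx = answer_with_evidence(captions, "cat")
print(evidence)
```

The point of the sketch is the data flow: the tool's output is written back into the same context that the next decision reads, so the loop stops on accumulated evidence rather than on a fixed number of steps.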


Cached at: 06/11/26, 01:40 PM


Source: https://huggingface.co/papers/2606.12195

Abstract

InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities.

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
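To make "compressing KV-cache states while retaining the full token stream" concrete, the arithmetic below compares a conventional per-head key/value cache with a cache that stores one compressed latent vector per token. The abstract does not give M^2LA's dimensions or exact design, so every number here (layer count, head count, head size, latent width, 16-bit values) is an illustrative assumption, and the sizing follows the generic latent-cache idea rather than the paper's specific method.

```python
def kv_cache_bytes(num_tokens, num_layers, per_token_width, bytes_per_value=2):
    """Cache size when each token stores `per_token_width` values in every layer."""
    return num_tokens * num_layers * per_token_width * bytes_per_value


# Illustrative configuration only; none of these values come from the paper.
num_tokens = 100_000   # long video context; identical in both settings
num_layers = 32
num_heads = 32
head_dim = 128
latent_dim = 512       # hypothetical width of the compressed per-token state

# Conventional cache: one key and one value vector per head, per token.
full_width = 2 * num_heads * head_dim
full_bytes = kv_cache_bytes(num_tokens, num_layers, full_width)

# Latent cache: one compressed vector per token; the token count is unchanged.
latent_bytes = kv_cache_bytes(num_tokens, num_layers, latent_dim)

ratio = full_bytes / latent_bytes
print(f"full: {full_bytes / 1e9:.2f} GB, latent: {latent_bytes / 1e9:.2f} GB, "
      f"reduction: {ratio:.0f}x")
```

Under these assumed sizes the saving comes entirely from the narrower per-token state: `num_tokens` is the same in both calls, which is what distinguishes this kind of compression from dropping or merging tokens.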


Get this paper in your agent:

hf papers read 2606.12195

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### yanziang/InternVideo3-8B-Instruct Video-Text-to-Text • 9B • Updated about 11 hours ago • 105 • 2

Datasets citing this paper: 1

#### yanziang/InternVideo3_Dataset Viewer • Updated about 11 hours ago • 380k • 28 • 1

Spaces citing this paper: 0

No Space linking this paper


Collections including this paper: 3

Similar Articles

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Light-Omni: Reflex over Reasoning in Agentic Video Understanding with Long-Term Memory

Native Active Perception as Reasoning for Omni-Modal Understanding

VideoChat3: Fully Open Video MLLM for Efficient and Generalist Video Understanding

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
