ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Hugging Face Daily Papers 05/19/26, 12:00 AM Papers

video-understanding reinforcement-learning multi-agent tool-calling open-source grpo

Summary

ParaVT introduces the first multi-agent end-to-end RL framework for parallel video tool calling, addressing the Tool Prior Paradox with PARA-GRPO, and fully open-sources the paper, code, weights, and data.

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

Source: https://huggingface.co/papers/2605.20342

Long-video understanding is becoming agentic: LMMs are post-trained with RL to natively invoke video tools (e.g., temporal cropping). But every existing native-RL recipe (including our own LongVT @ CVPR 2026) dispatches tool calls sequentially, one per turn: a bad crop has no peer correction, multi-turn calls drift the context, and inference cost grows linearly with turns.

ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling. A main agent emits multiple temporal-window crops in a single turn, weight-sharing sub-agents process them concurrently, and a gather-and-reason step produces the final answer.
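The dispatch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the ParaVT implementation: `propose_windows`, `inspect_window`, and `gather_and_reason` are hypothetical stand-ins for the main agent, a sub-agent call, and the final reasoning step, and the equal-width window split is an assumption made only so the example runs without a model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)


def propose_windows(question: str, duration: float, n: int = 4) -> List[Window]:
    """Hypothetical main agent: here it simply splits the video into n equal windows."""
    step = duration / n
    return [(i * step, (i + 1) * step) for i in range(n)]


def inspect_window(question: str, window: Window) -> str:
    """Hypothetical sub-agent: a real system would run the shared-weight LMM on the crop."""
    start, end = window
    return f"notes for [{start:.1f}s, {end:.1f}s]"


def gather_and_reason(question: str, notes: List[str]) -> str:
    """Hypothetical gather step: a real system would reason over the notes with the LMM."""
    return " | ".join(notes)


def answer(question: str, duration: float, n: int = 4) -> str:
    # One turn: the main agent emits all windows at once.
    windows = propose_windows(question, duration, n)
    # Sub-agents run concurrently; pool.map keeps results in window order.
    with ThreadPoolExecutor(max_workers=len(windows)) as pool:
        notes = list(pool.map(lambda w: inspect_window(question, w), windows))
    return gather_and_reason(question, notes)
```

Because all windows are issued in one turn, the number of model turns stays fixed as more windows are added, whereas a sequential recipe needs one turn per crop.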

But applying standard GRPO on top of a tool-native LMM surfaces two coupled failures driven by the same pretrained tool prior. We call this the Tool Prior Paradox:

Format Fragility: SFT-learned <think>/<tool_call>/<answer> closures collapse under temperature sampling.

Tool Necessity Gap: with a 64-frame overview, "skip-tool" becomes a shortcut and the GRPO advantage of calling vs. skipping flattens to zero.
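To see why the advantage can flatten to zero, consider the group-normalized advantage commonly used in GRPO, where each rollout's reward is centered on the group mean and scaled by the group standard deviation. The sketch below assumes that standard form; the exact normalization used in ParaVT is not stated in this summary, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r_i - mean) / (std + eps) within one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards only. If the overview alone is enough to answer,
# rollouts that skip the tool score the same as rollouts that call it.
easy_prompt = [1.0, 1.0, 1.0, 1.0]
# If the overview is too sparse, only the tool-calling rollouts succeed.
hard_prompt = [1.0, 1.0, 0.0, 0.0]

flat = group_advantages(easy_prompt)      # every entry is exactly 0.0
separated = group_advantages(hard_prompt)  # positive for calls, negative for skips
```

When every rollout in a group earns the same reward, all advantages are zero and the policy gradient carries no preference for calling the tool over skipping it.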

We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), pairing one targeted fix per failure: (i) a format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt overview-frame randomization K ∼ Uniform{4, 8, 16, 32, 64} that keeps the tool-call advantage non-degenerate.
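A rough sketch of the two ingredients, under stated assumptions: `sample_overview_frames` draws K uniformly from the set given above, and `is_parseable` is a simplified whole-response check that the three structural tags open and close cleanly. The paper's format reward is applied at specific structural-token positions, which this check does not reproduce; both function names are hypothetical.

```python
import random
import re

FRAME_BUDGETS = (4, 8, 16, 32, 64)  # the set K is drawn from, per the text
TAG_RE = re.compile(r"<(/?)(think|tool_call|answer)>")


def sample_overview_frames(rng: random.Random) -> int:
    """Draw the per-prompt overview frame budget K uniformly from FRAME_BUDGETS."""
    return rng.choice(FRAME_BUDGETS)


def is_parseable(text: str) -> bool:
    """Simplified check: tags must open and close in matching pairs, without nesting."""
    open_tag = None
    seen_any = False
    for match in TAG_RE.finditer(text):
        is_closing = match.group(1) == "/"
        name = match.group(2)
        if not is_closing:
            if open_tag is not None:
                return False  # a new tag opened before the previous one closed
            open_tag = name
            seen_any = True
        else:
            if open_tag != name:
                return False  # closing tag without a matching opener
            open_tag = None
    return seen_any and open_tag is None


rng = random.Random(0)
budgets = [sample_overview_frames(rng) for _ in range(100)]
ok = is_parseable("<think>check 0-30s</think><answer>B</answer>")
broken = is_parseable("<think>check 0-30s<answer>B</answer>")
```

Sampling small values of K makes the overview too sparse to answer some prompts without a crop, which is one way the reward gap between calling and skipping can be kept non-zero during training.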

Fully open (paper, code, weights, data): 📄 arxiv.org/abs/2605.20342 · 💻 github.com/EvolvingLMMs-Lab/ParaVT · 🤖 https://huggingface.co/ParaVT · 🌐 evolvinglmms-lab.github.io/ParaVT

Similar Articles

@TheTuringPost: 10 open-source tools for the Agent RL stack ↓ OpenPipe ART verl-agent Agent Lightning Unsloth OpenRLHF SkyRL NVIDIA’s P…

Visual Reasoning through Tool-supervised Reinforcement Learning

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

AgentV-RL: Scaling Reward Modeling with Agentic Verifier
