ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Summary
ParaVT introduces the first multi-agent end-to-end RL framework for parallel video tool calling, addressing the Tool Prior Paradox with PARA-GRPO, and fully open-sources the paper, code, weights, and data.
Cached at: 05/26/26, 06:43 AM
Source: https://huggingface.co/papers/2605.20342
Long-video understanding is becoming agentic, with LMMs post-trained with RL to natively invoke video tools (e.g., temporal cropping). But every existing native-RL recipe (including our own LongVT @ CVPR 2026) dispatches tool calls sequentially, one per turn: a bad crop has no peer correction, multi-turn calls drift the context, and inference cost grows linearly with turns.
ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling. A main agent emits multiple temporal-window crops in a single turn, weight-sharing sub-agents process them concurrently, and a gather-and-reason step produces the final answer.
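The dispatch-and-gather flow described above can be sketched roughly as follows. This is a minimal illustration, not the ParaVT implementation: `run_parallel_turn`, `toy_sub_agent`, and `toy_gather` are hypothetical names, the sub-agent is a stand-in for a model call, and a thread pool is just one assumed way to run the crops concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# A temporal window as (start_sec, end_sec). Hypothetical representation.
Window = Tuple[float, float]


def run_parallel_turn(
    windows: List[Window],
    sub_agent: Callable[[Window], str],
    gather: Callable[[List[str]], str],
    max_workers: int = 4,
) -> str:
    """Run one sub-agent call per window concurrently, then gather the results."""
    if not windows:
        return gather([])
    workers = min(max_workers, len(windows))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so observations line up with windows.
        observations = list(pool.map(sub_agent, windows))
    return gather(observations)


def toy_sub_agent(window: Window) -> str:
    """Stand-in for a weight-sharing sub-agent that inspects one crop."""
    start, end = window
    return f"[{start:.0f}s-{end:.0f}s] observation"


def toy_gather(observations: List[str]) -> str:
    """Stand-in for the gather-and-reason step."""
    return " | ".join(observations)


if __name__ == "__main__":
    # All crops are requested in a single turn rather than one per turn.
    print(run_parallel_turn([(0.0, 10.0), (30.0, 45.0)], toy_sub_agent, toy_gather))
```

Because every crop is issued in the same turn, the number of turns stays fixed as the number of windows grows, which is the contrast with the sequential one-call-per-turn recipes described earlier.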
But applying standard GRPO on top of a tool-native LMM surfaces two coupled failures driven by the same pretrained tool prior. We call this the Tool Prior Paradox:
Format Fragility: SFT-learned <think>/<tool_call>/<answer> closures collapse under temperature sampling.
Tool Necessity Gap: with a 64-frame overview, “skip-tool” becomes a shortcut and the GRPO advantage of calling vs. skipping flattens to zero.
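To see why the advantage can flatten, consider the usual group-normalized GRPO advantage, where each rollout's reward is centered and scaled by the statistics of its own group. The sketch below uses that standard form with a hypothetical helper `group_advantages`; the reward values are made up for illustration and are not results from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantage: (r_i - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative case: tool-calling and skip-tool rollouts all earn the same
    # reward, so every advantage is zero and there is no signal to prefer tools.
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))

    # Illustrative case: rewards differ across rollouts, so the advantages
    # separate the better rollouts from the worse ones.
    print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

When skipping the tool is already enough to answer correctly, the first case dominates and the gradient for tool use vanishes.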
We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), pairing one targeted fix per failure: (i) a format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt overview-frame randomization K ∼ Uniform{4, 8, 16, 32, 64} that keeps the tool-call advantage non-degenerate.
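A rough sketch of the two ingredients, under stated assumptions: `sample_overview_frames` draws K uniformly from the set given above, and `is_parseable` is a simplified response-level check that each structural tag is opened and closed the same number of times. The paper's format reward is applied at specific structural-token positions, which this simplified check does not reproduce; both function names are hypothetical.

```python
import random
import re

# Overview-frame budgets from the text: K ~ Uniform{4, 8, 16, 32, 64}.
OVERVIEW_FRAME_CHOICES = (4, 8, 16, 32, 64)

# Structural tags named in the text.
STRUCTURAL_TAGS = ("think", "tool_call", "answer")


def sample_overview_frames(rng: random.Random) -> int:
    """Draw one overview-frame count per prompt, uniformly over the choices."""
    return rng.choice(OVERVIEW_FRAME_CHOICES)


def is_parseable(response: str) -> bool:
    """Simplified check: every structural tag has matching open/close counts."""
    for tag in STRUCTURAL_TAGS:
        opens = len(re.findall(rf"<{tag}>", response))
        closes = len(re.findall(rf"</{tag}>", response))
        if opens != closes:
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_overview_frames(rng))
    print(is_parseable("<think>look at 0-10s</think><answer>B</answer>"))
    print(is_parseable("<think>look at 0-10s<answer>B</answer>"))
```

Sampling K per prompt means some prompts get only a few overview frames, so skipping the tool is less often sufficient and calling versus skipping can earn different rewards within a group.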
Fully open: paper, code, weights, data 📄 arxiv.org/abs/2605.20342 · 💻 github.com/EvolvingLMMs-Lab/ParaVT · 🤖 https://huggingface.co/ParaVT · 🌐 evolvinglmms-lab.github.io/ParaVT
Similar Articles
@TheTuringPost: 10 open-source tools for the Agent RL stack ↓ OpenPipe ART verl-agent Agent Lightning Unsloth OpenRLHF SkyRL NVIDIA’s P…
A curated roundup of 10 open-source tools for training AI agents using reinforcement learning, covering frameworks like OpenPipe ART, verl-agent, Agent Lightning, and Unsloth, with details on their use cases and strengths.
Visual Reasoning through Tool-supervised Reinforcement Learning
Introduces ToolsRL, a two-stage reinforcement learning framework that teaches multimodal LLMs to use simple visual tools for complex visual reasoning tasks.
Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning
Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
OpenWebRL presents an open framework for training visual web agents using online multi-turn reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision. Their 4B-parameter model outperforms prior open agents and competes with proprietary systems like OpenAI CUA and Gemini CUA.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
AgentV-RL introduces an Agentic Verifier framework that enhances reward modeling through bidirectional verification with forward and backward agents augmented with tools, achieving 25.2% improvement over state-of-the-art ORMs. The approach addresses error propagation and grounding issues in verifiers for complex reasoning tasks through multi-turn deliberative processes combined with reinforcement learning.