Visual Reasoning through Tool-supervised Reinforcement Learning
Summary
Introduces ToolsRL, a two-stage reinforcement learning framework that teaches multimodal LLMs to use simple visual tools for complex visual reasoning tasks.
Cached at: 04/23/26, 07:47 AM
Source: https://huggingface.co/papers/2604.19945
Abstract
A novel Tool-supervised Reinforcement Learning framework is presented that enables multimodal large language models to effectively learn tool-use for complex visual reasoning through a two-stage curriculum approach.
In this paper, we investigate how Multimodal Large Language Models can effectively master tool-use to solve complex visual reasoning tasks. To that end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, in which the first stage is optimized solely by a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while still allowing tool calls. In this way, tool-calling capability is mastered before tools are used to complete visual reasoning tasks, avoiding potential optimization conflicts among these heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities for complex visual reasoning tasks.
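The two-stage curriculum described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not specify the reward definitions, so the `Rollout` structure, the IoU-based zoom-in reward, the exact-match accuracy reward, and all function names below are hypothetical choices made only to show how the reward source might switch between stages.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical record of one model rollout (not from the paper)."""
    tool_calls: list = field(default_factory=list)  # e.g. [("zoom_in", (x0, y0, x1, y1))]
    answer: str = ""


def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tool_reward(rollout, target_box):
    """Stage 1 (assumed form): score the best zoom-in call against an annotated box."""
    ious = [box_iou(args, target_box)
            for name, args in rollout.tool_calls if name == "zoom_in"]
    return max(ious, default=0.0)


def accuracy_reward(rollout, gold_answer):
    """Stage 2 (assumed form): exact-match reward on the final answer."""
    return 1.0 if rollout.answer.strip().lower() == gold_answer.strip().lower() else 0.0


def curriculum_reward(stage, rollout, *, target_box=None, gold_answer=None):
    """Select the reward by curriculum stage: tool supervision first, accuracy second."""
    if stage == 1:
        return tool_reward(rollout, target_box)
    if stage == 2:
        return accuracy_reward(rollout, gold_answer)
    raise ValueError(f"unknown stage: {stage}")
```

In this sketch, stage 1 ignores the final answer entirely and stage 2 ignores how the tools were called, which mirrors the abstract's point that separating the two objectives avoids optimizing them against each other at the same time.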
Similar Articles
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
Introduces AutoTool, a model that adaptively decides whether to invoke tools for multimodal LLM reasoning, achieving significant accuracy and efficiency gains through reinforcement learning and dual-mode reasoning.
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning
This paper identifies that failures in visual reasoning often stem from breakdowns in dynamic cross-modal coordination between visual and textual evidence during chain-of-thought generation. It introduces DyCo-RL, a reinforcement learning framework that rewards effective cross-modal coordination, leading to improved reasoning performance.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.