Visual Reasoning through Tool-supervised Reinforcement Learning

Hugging Face Daily Papers 04/21/26, 12:00 AM Papers

Summary

Introduces ToolsRL, a two-stage reinforcement learning framework that teaches multimodal LLMs to use simple visual tools for complex visual reasoning tasks.

In this paper, we investigate how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To this end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, which uses direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely by a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while still allowing tool calls. In this way, tool-calling capability is mastered before tools are used to complete visual reasoning tasks, avoiding potential optimization conflict among these heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
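To make the tool set named above concrete, here is a minimal sketch of zoom-in, rotate, flip, and draw point/line on a toy image. The abstract does not specify the tools' interfaces, so every function name, signature, and the list-of-rows image representation below is an assumption made for illustration, not the paper's implementation.

```python
from typing import List, Tuple

# Hypothetical representation: a grayscale image as rows of pixel values.
Image = List[List[int]]


def zoom_in(img: Image, box: Tuple[int, int, int, int]) -> Image:
    """Crop to box = (left, top, right, bottom); right and bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def draw_point(img: Image, x: int, y: int, value: int = 255) -> Image:
    """Return a copy of the image with pixel (x, y) set to value."""
    out = [row[:] for row in img]
    out[y][x] = value
    return out


def draw_line(img: Image, p0: Tuple[int, int], p1: Tuple[int, int],
              value: int = 255) -> Image:
    """Return a copy with a line from p0 to p1, using simple linear interpolation."""
    (x0, y0), (x1, y1) = p0, p1
    steps = max(abs(x1 - x0), abs(y1 - y0))
    out = [row[:] for row in img]
    for i in range(steps + 1):
        t = i / steps if steps else 0.0
        out[round(y0 + t * (y1 - y0))][round(x0 + t * (x1 - x0))] = value
    return out
```

Each function returns a new image rather than mutating its input, so a sequence of tool calls can be replayed or compared against a supervised tool trace without side effects.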

Cached at: 04/23/26, 07:47 AM

Source: https://huggingface.co/papers/2604.19945

Abstract

A novel Tool-supervised Reinforcement Learning framework is presented that enables multimodal large language models to effectively learn tool-use for complex visual reasoning through a two-stage curriculum approach.

In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Models. To achieve that, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, where the first stage is solely optimized by a set of well motivated tool-specific rewards, and the second stage is trained with the accuracy targeted rewards while allowing calling tools. In this way, tool calling capability is mastered before using tools to complete visual reasoning tasks, avoiding the potential optimization conflict among those heterogeneous tasks. Our experiments have shown that the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities for complex visual reasoning tasks.
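The two-stage curriculum described above can be sketched as a reward function that switches on the training step. The abstract does not define the tool-specific rewards, so this sketch assumes, purely for illustration, that a zoom-in call is scored by intersection-over-union against a supervised box and that accuracy is an exact-match check; `stage1_steps` and all function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tool_reward(pred_box: Box, target_box: Box) -> float:
    """Assumed stage-1 reward: how well a zoom-in box matches the supervised box."""
    return box_iou(pred_box, target_box)


def accuracy_reward(pred_answer: str, gold_answer: str) -> float:
    """Assumed stage-2 reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0


def curriculum_reward(step: int, stage1_steps: int,
                      pred_box: Box, target_box: Box,
                      pred_answer: str, gold_answer: str) -> float:
    """Stage 1 uses only the tool reward; stage 2 uses only the accuracy reward."""
    if step < stage1_steps:
        return tool_reward(pred_box, target_box)
    return accuracy_reward(pred_answer, gold_answer)
```

Keeping the two rewards in separate stages, rather than summing them, mirrors the stated motivation of avoiding optimization conflict between learning to call tools and learning to answer correctly.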

Get this paper in your agent:

hf papers read 2604.19945

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
