ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents

arXiv cs.AI 07/01/26, 04:00 AM Papers
Summary
ReGRPO introduces a reflection-augmented policy optimization framework for tool-using vision-language agents, leveraging structured failure observations and joint optimization of reflection tokens and actions to improve recovery from tool failures, achieving state-of-the-art results on GTA and GAIA benchmarks.
arXiv:2606.31392v1 Announce Type: new Abstract: Tool-augmented vision-language models (VLMs) can solve multimodal, multi-step tasks by calling external tools, yet they remain fragile in practice. Existing works have two common gaps. Supervised fine-tuning (SFT) is built mostly on successful trajectories and offers little signal for recovery after tool failures, while sparse trajectory-level RL rewards provide limited guidance on which step failed and how to repair it. We introduce ReGRPO (Reflection-augmented Group Relative Policy Optimization), a framework that learns reflection-guided correction in tool-using agents. ReGRPO starts with a structured reflective data engine: we execute near-miss actions to collect grounded failure observations, then build Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions for warm-start SFT. We then optimize reflection tokens and corrective actions jointly within local trajectories using group-relative advantages, and include a reflection-cost term to reduce unnecessary reflection. Experiments on GTA and GAIA show that, under the same backbone and tool suite, ReGRPO consistently outperforms strong open-source baselines and achieves the best results among the compared open-source controllers. Code and RoT data are available at https://github.com/showlab/ReGRPO.
Cached at: 07/01/26, 05:38 AM
# ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents
Source: [https://arxiv.org/html/2606.31392](https://arxiv.org/html/2606.31392)
Show Lab, National University of Singapore

Email: binjie97@u.nus.edu

###### Abstract

Tool-augmented vision–language models (VLMs) can solve multimodal, multi-step tasks by calling external tools, yet they remain fragile in practice. Existing works have two common gaps. Supervised fine-tuning (SFT) is built mostly on successful trajectories and offers little signal for recovery after tool failures, while sparse trajectory-level RL rewards provide limited guidance on which step failed and how to repair it. We introduce ReGRPO (Reflection-augmented Group Relative Policy Optimization), a framework that learns reflection-guided correction in tool-using agents. ReGRPO starts with a structured reflective data engine: we execute near-miss actions to collect grounded failure observations, then build Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions for warm-start SFT. We then optimize reflection tokens and corrective actions jointly within local trajectories using group-relative advantages, and include a reflection-cost term to reduce unnecessary reflection. Experiments on GTA and GAIA show that, under the same backbone and tool suite, ReGRPO consistently outperforms strong open-source baselines and achieves the best results among the compared open-source controllers. Code and RoT data are available at [https://github.com/showlab/ReGRPO](https://github.com/showlab/ReGRPO).

![Refer to caption](https://arxiv.org/html/2606.31392v1/imgs/rot_pipeline.png)

Figure 1: Pipeline of the Structured Reflective Data Engine. Given a task trajectory, we first induce a tool failure (for example, OCR on a face image returns empty text). A teacher model (GPT-4o by default) then generates a structured Reflection-of-Thought with explicit ErrorType, Evidence, and FixPlan, which explains the failure and proposes the next action (for example, switching to face detection). The agent executes the corrected action to recover and finish the task. This Error, Reflection, and Correction loop converts raw failures into grounded supervision for recovery.

## 1 Introduction

External tools such as web search, OCR, table readers, PDF parsers, code execution, and visual operators expand the capability of vision-language models (VLMs) beyond in-context prompting [suris2023vipergpt,fan2024videoagent,lu2024deepseek,guo2025deepseek]. In these settings, success depends not only on the final answer, but also on planning and executing grounded intermediate steps, including which tool to call, when to call it, and how to set its arguments from multimodal evidence.

One common approach is to train trajectory-tuned controllers on synthetic tasks and verified traces. MAT-AGENT [gao2024multi] follows this approach by generating multimodal tool-use tasks, executing tools to collect successful trajectories, and applying supervised fine-tuning (SFT) to learn tool invocation. This strategy is effective, but it also ties the controller closely to supervised traces. When the policy deviates, for example under a new PDF layout or receipt style, the training signal offers limited guidance for recovery.

A complementary direction uses self-exploration. SPORT [li2025iterative] alternates between sampling tool-use steps and verifying them, then converts rollouts into step-wise preferences without human labels. This introduces process-level feedback, but the supervision remains relatively unstructured. For example, SPORT does not explicitly encode multimodal chain-of-thought, grounding tags that link language to regions or cells, or verifier rationales. As a result, the learned preferences are often hard to interpret and can fail under distribution shift.

These approaches share two limitations\. SFT\-only tuning on expert trajectories tends to saturate quickly because the loss depends heavily on one teacher trace and provides weak signals for local repair\. Standard reinforcement learning also provides limited recovery\-oriented supervision in long\-horizon tasks, because a scalar failure reward does not identify which decision should be revised\.

We therefore train tool\-using VLM agents with reflection\-augmented reinforcement learning, where diagnostic reflections are optimized as explicit recovery signals\. Instead of treating reflection as a test\-time prompting heuristic, we model it as a learnable variable that links a failed action to its correction\.

We introduce ReGRPO (Reflection-augmented Group Relative Policy Optimization) to realize this idea. On the data side, ReGRPO builds a Structured Reflective Data Engine that converts tool-execution failures into grounded Reflection-of-Thought (RoT) triplets. The engine perturbs MAT-AGENT trajectories [gao2024multi], executes the perturbed actions to obtain real failure observations, and then generates structured reflections with ErrorType, Evidence, and FixPlan. On the optimization side, ReGRPO jointly optimizes reflection and correction tokens within the same local trajectory objective, which gives the policy step-level supervision beyond final success or failure. ReGRPO keeps the standard GRPO optimizer and adds structured reflection parameterization to enable end-to-end policy learning.

At inference time, ReGRPO follows the same principle that failures should trigger explicit local repair rather than blind continuation\. We use a lightweight single\-path, zero\-verifier setup with a deterministic trigger, which opens a reflection\-correction block only when needed\. This design allows recovery from tool errors while keeping deployment efficient\.

On GTA [wang2024gta] and GAIA [mialon2023gaia], under the same backbone and tool suite as MAT-AGENT and SPORT, ReGRPO consistently improves overall accuracy, reaching 67.66 AnsAcc on GTA and 23.35 on GAIA versus 60.26 and 20.61 for SPORT (Table 1).

In summary, this paper makes three contributions:

- We build a Structured Reflective Data Engine that converts tool-execution failures into grounded Reflection-of-Thought supervision with ErrorType, Evidence, and FixPlan, paired with corrected actions.
- We present ReGRPO, which adopts the standard GRPO optimizer while adding structured reflection trajectory parameterization and a zero-verifier trigger, enabling joint optimization of reflection and correction tokens under reproducible single-path deployment.
- Under the same backbone and tool settings, ReGRPO achieves the strongest results among the compared open-source controllers on GTA and GAIA.

## 2 Related Work

### 2.1 Multi-Modal Agents

Multimodal agents [li2024llava,yin2024agent,zhang2024agentohana,wang2024mllm,liu2024llavanext] can solve complex problems via external tools and APIs. For example, CLOVA [gao2024clova] uses LLMs as controllers to compose off-the-shelf visual tools. ViperGPT [suris2023vipergpt] uses code-generation models to compose vision-and-language models into subroutines to produce results for arbitrary queries. VideoAgent [fan2024videoagent] adopts multi-step reasoning, where the agent selects tools according to intermediate observations.

Though LLM-driven methods can achieve strong results, VLM-driven agents [sasazawa2024layout,zheng2024gpt,wang2024genartist] are often more efficient for visual tasks because the controller directly consumes images or videos and tool outputs. For example, GenArtist [wang2024genartist] proposes a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent.

In addition, other works [xiong2025llava,liao2024can,wang2024divide] use models to generate AI feedback that improves performance. For example, LLaVA-Critic [xiong2025llava] presents a high-quality dataset tailored to follow instructions in complex evaluation settings, providing quantitative judgments and accompanying reasoning.

### 2.2 Datasets for Tool-Using Agents

The reasoning ability of VLM-driven agents is often weaker than that of large text-only LLMs. To bridge this gap, recent works synthesize tool-usage data to tune open-source VLMs [wang2024mllm,liu2023llava,liu2024visualagentbench]. For example, DEDER [choiembodied] uses in-context learning to generate trajectories and distills chain-of-thought reasoning from LLMs to smaller models. Lumos [yin2024agent] converts ground-truth reasoning steps from existing benchmarks into tool-usage trajectories. TASKBENCH [shen2023taskbench] samples trajectories from pre-defined graphs. MAT-AGENT [gao2024multi] scales up trajectory tuning for open VLM controllers with a diverse tool suite and synthetic tasks.

### 2.3 Reflective and Self-Correcting Agents

The concept of reflection, prompting a model to critique its own outputs, has been widely explored in LLMs. Reflexion [shinn2024reflexion] and similar frameworks [madaan2024selfrefine,chen2023teaching] use verbal feedback and episodic memory to improve performance over multiple trials at inference time. However, these methods typically treat reflection as a frozen prompting strategy or rely on external scalar feedback, without optimizing the reflection generation process itself. In contrast, ReGRPO explicitly learns how to generate a diagnostic reflection and how that reflection guides the next corrective tool call under an explicit trigger mechanism. Rather than relying on inference-time trial-and-error alone, ReGRPO trains the model to produce grounded diagnostics that enable successful recovery, internalizing the correction loop into the policy. We evaluate this approach on GTA and GAIA to assess effectiveness and sample efficiency in tool-using settings.

![Refer to caption](https://arxiv.org/html/2606.31392v1/x1.png)

Figure 2: Overview of ReGRPO. (1) Structured Reflective Data Engine: from multimodal inputs and a successful action, synthesize a near-miss failure (wrong crop/tool/argument), execute it to obtain grounded failure observations (e.g., empty OCR or tool error), then use a teacher VLM (e.g., GPT-4o) to generate a structured Reflection-of-Thought triplet (ErrorType, Evidence, FixPlan). Pair the reflection with the corrected action to form reflective supervision (failure action, failure observation, reflection, corrected action), and warm-start SFT on these trajectories. (2) ReGRPO training: form groups of candidate local trajectories, including one-shot successes and reflection-based recoveries $a^{(0)} \rightarrow o^{(0)} \rightarrow z \rightarrow a^{(1)} \rightarrow o^{(1)}$; combine execution success, an optional teacher-derived verifier score (training only, computed deterministically from the teacher's RoT metadata with no in-loop LLM call), and reflection cost into a reward; compute group-relative advantages to update both reflection and correction tokens. (3) Zero-Verifier Inference Stage: single-path execution with a deterministic trigger that opens a local reflection-correction block only when failure evidence appears, enabling efficient recovery without external verifier calls.

## 3 Method

Existing training strategies for tool-using agents have two key issues. First, supervised trajectories rarely include recovery steps [gao2024multi]. As a result, the model learns only successful traces. A small mistake can then cascade without guidance on how to fix it. For example, in receipt QA, SFT teaches the correct `object_loc` followed by OCR. At test time, a slight layout shift may lead to a nearby crop and empty OCR, but the training data provides no corrective signal (e.g., expand the box or re-localize). Second, standard RL provides weak localization. A multi-step document QA run can fail after a wrong crop, then an empty OCR, then an incorrect answer. A final reward of 0 does not reveal which step was wrong or how to correct it. This ambiguity makes learning slow and unstable.

We introduce ReGRPO (Reflection-Augmented Group Relative Policy Optimization), which explicitly learns *how* to generate diagnostic reflection and *how* reflection guides the next corrective tool call in local trajectories. As shown in Figure [2](https://arxiv.org/html/2606.31392#S2.F2), ReGRPO follows three principles. First, it uses a Structured Reflective Data Engine that converts execution failures into grounded diagnostic triplets. Second, it applies reflection-augmented GRPO training to optimize the generation of reflection steps that lead to successful corrections. Third, it uses a Zero-Verifier Inference Stage with single-path execution and no external verifier calls. We evaluate the resulting agent on GTA and GAIA benchmarks to assess overall tool-use performance.

The remainder of this section presents the problem setup (Sec. [3.1](https://arxiv.org/html/2606.31392#S3.SS1)), the Structured Reflective Data Engine (Sec. [3.2](https://arxiv.org/html/2606.31392#S3.SS2)), ReGRPO training (Sec. [3.3](https://arxiv.org/html/2606.31392#S3.SS3)), and the inference strategy (Sec. [3.5](https://arxiv.org/html/2606.31392#S3.SS5)).

### 3.1 Problem Setup

A task is given by $(Q, F)$, where $Q$ is a user query and $F$ is a set of files or images. At step $i$, the agent observes history $h_i = \{(t_j, c_j, o_j)\}_{j=1}^{i-1}$. Standard agents predict an action $a_i = (t_i, c_i)$, where $t_i$ is a thought and $c_i$ is a tool call.

In ReGRPO, the action space includes an additional reflection step $z_i$. When an action $a_i^{(0)}$ fails and produces observation $o_i^{(0)}$, the agent may generate reflection $z_i$ before attempting correction $a_i^{(1)}$. The local trajectory segment is

$$
\tau_i = \begin{cases} (a_i^{(0)}, o_i^{(0)}) & \text{if success} \\ (a_i^{(0)}, o_i^{(0)}, z_i, a_i^{(1)}, o_i^{(1)}) & \text{if reflection triggered} \end{cases} \tag{1}
$$

The goal is to learn a policy $\pi_\theta$ that both acts and reflects to self-correct when necessary.

### 3.2 Structured Reflective Data Engine

Existing trajectory cloning methods and outcome-based reward models provide only coarse supervision in multimodal environments. When a trajectory fails, a scalar penalty does not reveal *which* step or parameter caused the error, so the model cannot learn how to repair it. To address this, we introduce a Structured Reflective Data Engine that converts failures into explicit causal evidence. Instead of treating failed actions as generic negatives (which can worsen off-policy shifts), we reformulate them as instructional "Error-Reflection-Correction" trajectories, giving the agent concrete recovery supervision from the outset.

Failure Induction. To teach recovery, the model must observe concrete failure states rather than only successful traces. We therefore construct *initial failures* that are realistic yet recoverable, so the agent can learn how errors arise and how to fix them. Specifically, starting from a ground-truth step $a_i^{*}$ (from MM-Traj [gao2024multi]), we synthesize plausible "near-miss" actions $a_i^{fail}$ by perturbing tool choices or arguments (e.g., shifting a bounding box, selecting an adjacent table column, or calling a mismatched tool). We then *execute* $a_i^{fail}$ in the sandbox to obtain real failure observations $o_i^{fail}$ (e.g., API exceptions, empty OCR outputs, or irrelevant crops). These executed failures provide grounded error signals that the Reflective Data Engine links to diagnostic reflections and corrections. We organize the resulting annotations using a Reflection-of-Thought (RoT) reflection $z_i$. Each annotation is a RoT triplet $(a_i^{fail}, o_i^{fail}, z_i)$ paired with the corrected action $a_i^{*}$, which together explicitly encode the failure, structured reflection, and recovery target.
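The failure-induction step can be sketched as follows. This is a minimal illustration, not the authors' engine: the action dictionary format, the `MISMATCHED_TOOL` table, the `max_shift` parameter, and the caller-supplied `execute` sandbox function are all assumptions introduced here.

```python
import random

# Illustrative tool swaps only; the real tool suite is not specified here.
MISMATCHED_TOOL = {"ocr": "face_detection", "face_detection": "ocr"}


def perturb_action(action, rng, max_shift=40):
    """Return a near-miss copy of a ground-truth action (hypothetical dict format)."""
    tool, args = action["tool"], dict(action["args"])
    options = []
    if "box" in args:
        options.append("shift_box")
    if tool in MISMATCHED_TOOL:
        options.append("swap_tool")
    if not options:
        raise ValueError("no perturbation applies to this action")
    if rng.choice(options) == "shift_box":
        x0, y0, x1, y1 = args["box"]
        # Nonzero offsets, so the perturbed box always differs from the original.
        dx = rng.choice([-1, 1]) * rng.randint(1, max_shift)
        dy = rng.choice([-1, 1]) * rng.randint(1, max_shift)
        args["box"] = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    else:
        tool = MISMATCHED_TOOL[tool]
    return {"tool": tool, "args": args}


def collect_failure(gold_action, execute, rng):
    """Run the near-miss action through `execute`, a caller-supplied sandbox function."""
    failed_action = perturb_action(gold_action, rng)
    return failed_action, execute(failed_action)
```

Executing the perturbed action, rather than only generating it, is what makes the resulting observation a grounded failure signal instead of a guessed one.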

Reflection Annotation. To bridge the causal gap between the faulty action $a_i^{fail}$ and the failure observation $o_i^{fail}$, we use a teacher vision-language model (GPT-4o by default) to generate a structured RoT reflection $z_i$, conditioned on the failure context $(h_i, a_i^{fail}, o_i^{fail})$. To prevent the agent from generating free-form, hallucinated excuses, $z_i$ is strictly constrained to a triplet schema:

- ErrorType: A categorical diagnosis of the failure (e.g., `ToolMismatch`, `ArgInvalid`, `GroundingDrift`, `InfoInsufficient`).
- Evidence: A mandatory reference to the visual or textual observation that triggered the error. For instance, rather than a generic "the tool failed", the model must explicitly ground its reasoning: "The OCR output for the specific region `#box1` returned empty text, indicating an incorrect crop."
- FixPlan: A concrete, actionable natural language strategy to correct the error and reach the target state (e.g., "Expand the bounding box slightly to cover the text" or "Switch to a visual localization tool").
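The triplet schema above could be represented as a small validated record. The sketch below is an assumption about one reasonable encoding: the class names, the `render` format, and the validation rules are illustrative, and the enum lists only the four error types named in the text.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    # Only the categories named in the text; the full taxonomy may be larger.
    TOOL_MISMATCH = "ToolMismatch"
    ARG_INVALID = "ArgInvalid"
    GROUNDING_DRIFT = "GroundingDrift"
    INFO_INSUFFICIENT = "InfoInsufficient"


@dataclass(frozen=True)
class Reflection:
    """One Reflection-of-Thought record: ErrorType, Evidence, FixPlan."""
    error_type: ErrorType
    evidence: str
    fix_plan: str

    def __post_init__(self):
        # Evidence is mandatory in the schema; an empty plan is also rejected here.
        if not self.evidence.strip():
            raise ValueError("Evidence must reference a concrete observation")
        if not self.fix_plan.strip():
            raise ValueError("FixPlan must describe a concrete correction")

    def render(self) -> str:
        """Serialize to a hypothetical three-line text format for training targets."""
        return (
            f"ErrorType: {self.error_type.value}\n"
            f"Evidence: {self.evidence}\n"
            f"FixPlan: {self.fix_plan}"
        )
```

Constraining the error type to an enum and rejecting empty evidence mirrors the stated goal of ruling out free-form, ungrounded explanations.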

Correction Pairing and Internalization. Finally, the original ground-truth action $a_i^{*}$ is appended as the corrected action, closing the causal loop. This pipeline converts the MM-Traj dataset [gao2024multi] into a supervised corpus of augmented trajectories: $\tau_i^{reflective} = (x_i, a_i^{fail}, o_i^{fail}, z_i, a_i^{*})$. During supervised fine-tuning (SFT), the model is trained to maximize the conditional likelihood $P(z_i, a_i^{*} \mid x_i, a_i^{fail}, o_i^{fail})$. Consequently, the agent internalizes the multimodal error diagnostic capability at the parameter level, enabling subsequent training-time exploration over reflective recoveries in ReGRPO.

### 3.3 Reflection-Augmented Group Relative Policy Optimization

Dynamic Local Trajectory Formulation. For a given state context $x_i$, rather than scoring atomic actions, we form a group of $K$ candidate local trajectories $\{\tau_i^{(k)}\}_{k=1}^{K}$ and score each under the current policy $\pi_\theta$. The structure of each candidate trajectory depends on the intermediate environmental feedback: a candidate begins with an initial action $a_i^{(0)}$ and its observation $o_i^{(0)}$. If the execution succeeds, the trajectory terminates early: $\tau_i = (a_i^{(0)}, o_i^{(0)})$. Otherwise, if $o_i^{(0)}$ indicates a hard failure (e.g., execution error, empty visual crop) or low task–observation consistency, the policy generates a diagnostic reflection $z_i$ and a corrective action $a_i^{(1)}$, forming the recovery segment in ([1](https://arxiv.org/html/2606.31392#S3.E1)).

Reflection-Aware Process Reward. To encourage meaningful exploration while penalizing infinite loops or verbosity, we define a composite reward function $R(\tau)$ that balances task success with execution efficiency:

$$
R(\tau) = \lambda_{\text{exec}}\,\mathbf{1}\{\text{success}\} - \eta\,C(\tau) + \lambda_{\text{val}}\,V(x_i, \tau). \tag{2}
$$

Here, $\mathbf{1}\{\text{success}\}$ is deterministic environment feedback indicating task completion, and $C(\tau)$ penalizes unnecessary reflection length. The first two terms already form a complete *verifier-free* objective. $V(\cdot)$ is an *optional* training-only verifier value, computed *deterministically* from the active record's RoT metadata (no model is queried inside the RL loop): it can be enabled with $\lambda_{\text{val}} > 0$ for extra stabilization, or disabled with $\lambda_{\text{val}} = 0$ without changing the deployment algorithm. Default coefficients in our reported setting are $\lambda_{\text{exec}} = 1.0$, $\lambda_{\text{val}} = 0.3$, and $\eta = 0.1$.

The term $C(\tau)$ introduces a reflection cost penalty. If the trajectory invokes the reflection step $z_i$, $C(\tau)$ is proportional to the token length of $z_i$; otherwise, for a one-shot success, $C(\tau) = 0$. The penalty coefficient $\eta$ pushes the agent to reflect *only* when necessary: reflection pays off only when the expected reward gain from a successful recovery outweighs the penalty of generating additional reasoning tokens.
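A minimal sketch of the reward in Eq. (2), using the default coefficients stated above. The proportionality constant `cost_per_token` is an assumption, since the text only says the cost is proportional to the reflection's token length.

```python
def reflection_cost(reflection_tokens, cost_per_token=0.01):
    """C(tau): proportional to reflection length; 0 when no reflection was generated."""
    return cost_per_token * reflection_tokens


def reward(success, reflection_tokens, verifier_value=0.0,
           lam_exec=1.0, lam_val=0.3, eta=0.1, cost_per_token=0.01):
    """R(tau) = lam_exec * 1{success} - eta * C(tau) + lam_val * V."""
    return (
        lam_exec * float(success)
        - eta * reflection_cost(reflection_tokens, cost_per_token)
        + lam_val * verifier_value
    )


# Verifier-free setting (lam_val = 0): a one-shot success versus a recovery
# that needed a 50-token reflection.
one_shot = reward(True, 0, lam_val=0.0)
recovered = reward(True, 50, lam_val=0.0)
```

Under these assumed values a recovery still scores close to a one-shot success, while an unrecovered failure scores at or below zero, so the cost term mainly separates concise reflections from verbose ones.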

Optional training-time verifier design. When enabled, the verifier value $V(x_i, \tau)$ is computed *deterministically* from each candidate and the active record's RoT metadata; no GPT-4o (or any LLM) is queried inside the RL loop. We derive three subscores in $[0,1]$ by signature matching and a grounded-reflection check rather than by model scoring:

- Plan validity $s_p$: $s_p = 1$ iff the candidate's normalized primary tool and first argument match the stored corrected-action signature, else $0$.
- Answer consistency $s_a$: $s_a = 1$ iff the candidate's replay succeeds (its terminal action agrees with the group's correct answer), else $0$.
- Grounding $s_g$: $s_g = 1$ iff the candidate carries a reflection whose evidence is text-grounded in the stored failure observation *and* $s_p > 0$, else $0$. There is intentionally no fallback to $s_p$ for reflection-less candidates, so $V$ rewards grounded reflection above a bare plan repair.

The overall score is the weighted sum $V = w_a s_a + w_g s_g + w_p s_p$, with each subscore clamped to $[0,1]$ and weights set to emphasize grounding ($w_g \geq w_a, w_p$); we use $(w_a, w_g, w_p) = (0.25, 0.50, 0.25)$ by default. Because $V$ is a function of the candidate and the metadata, it is used only as a training-time reward-shaping signal and never as a deployment-time call. A GPT-4o teacher is used *only offline*, to synthesize the RoT reflections in the data engine (Sec. [3.2](https://arxiv.org/html/2606.31392#S3.SS2)); it is never queried during RL or at inference. Full coefficient ranges, the deterministic subscore definitions, and the selection protocol are provided in Appendix [A3](https://arxiv.org/html/2606.31392#Pt0.A3).
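The weighted sum can be sketched as below. The three boolean inputs are assumed to be precomputed by the signature-matching and grounding checks described above, whose exact definitions are deferred to the appendix; only the combination logic is shown, and the function names are illustrative.

```python
def clamp01(value):
    """Clamp a subscore to [0, 1]."""
    return min(1.0, max(0.0, float(value)))


def verifier_value(plan_match, replay_success, grounded_reflection,
                   weights=(0.25, 0.50, 0.25)):
    """V = w_a * s_a + w_g * s_g + w_p * s_p with weights (w_a, w_g, w_p)."""
    w_a, w_g, w_p = weights
    s_p = clamp01(plan_match)
    s_a = clamp01(replay_success)
    # Grounding only counts when the plan is also valid; there is no fallback
    # to s_p for candidates without a grounded reflection.
    s_g = clamp01(grounded_reflection and s_p > 0)
    return w_a * s_a + w_g * s_g + w_p * s_p


full_recovery = verifier_value(True, True, True)      # grounded reflection + valid plan
bare_repair = verifier_value(True, True, False)       # valid plan, no grounded reflection
```

With the default weights, a bare plan repair earns half the score of a grounded recovery, which is the ordering the text says the design intends.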

### 3.4 Training Objective

Structured Reflective Data in Supervised Fine-Tuning (warm start). We first teach the model to diagnose and correct failures with explicit supervision, so Reinforcement Learning (RL) can focus on *how* reflection should be written under explicit trigger gating rather than learning recovery from scratch. Given the failure context $(x_i, a_i^{fail}, o_i^{fail})$, we maximize the likelihood of the structured reflection and corrected action:

$$
\mathcal{L}_{\mathrm{SFT}} = -\mathbb{E}_{(x_i, a_i^{fail}, o_i^{fail}, z_i, a_i^{*})}\left[\log P_\theta(z_i, a_i^{*} \mid x_i, a_i^{fail}, o_i^{fail})\right]. \tag{3}
$$

This stage teaches the model *what* to diagnose and *how* to fix errors, providing a strong initialization for subsequent RL.
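For a single training sequence, the SFT objective in Eq. (3) reduces to a negative log-likelihood over the reflection and corrected-action tokens, with the failure context excluded from the loss. The sketch below assumes per-token log-probabilities are already available and normalizes by the number of target tokens; that normalization is a common convention assumed here, not something the text specifies.

```python
import math


def sft_loss(token_logprobs, is_target):
    """Mean negative log-likelihood over target tokens (reflection + corrected action).

    Context tokens (state, failed action, failure observation) carry
    is_target=False and do not contribute to the loss.
    """
    target = [lp for lp, keep in zip(token_logprobs, is_target) if keep]
    if not target:
        raise ValueError("sequence has no target tokens")
    return -sum(target) / len(target)


# One context token followed by two target tokens with probabilities 0.5 and 0.25.
example_loss = sft_loss(
    [math.log(0.9), math.log(0.5), math.log(0.25)],
    [False, True, True],
)
```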

![Refer to caption](https://arxiv.org/html/2606.31392v1/x2.png)

Figure 3: Comparison of PPO [schulman2017proximal], DPO [rafailov2023direct], GRPO [shao2024deepseekmath], and ReGRPO (ours). PPO and DPO optimize actions or preferences without treating reflection as a decision variable; GRPO reduces variance via group-relative rewards; ReGRPO further includes reflection in the optimized trajectory to provide stronger recovery-oriented supervision for failed steps.

Structured Reflective Data in Policy Optimization. SFT on RoT data teaches the model to fix common tool-call errors (e.g., crop shift, wrong region, mismatched tool arguments), but real-world failures are more diverse. Therefore, to generalize beyond curated patterns, the agent must self-explore to discover new failure modes and recovery strategies. This motivates an RL stage to expand coverage and improve robustness.

However, PPO [schulman2017proximal] and DPO [rafailov2023direct] optimize actions or preference pairs without modeling reflection as a decision variable. In long-horizon tool use, a scalar success/failure signal does not indicate *which* step failed or what correction would fix the trajectory. Moreover, DPO relies on preference data that is difficult to collect for multi-step tool executions, and PPO can be sample-inefficient under sparse rewards. In contrast, GRPO [shao2024deepseekmath] optimizes relative rewards within a group of sampled trajectories, producing lower-variance advantages that are well suited to sparse, delayed rewards. By comparing recoveries against nearby failures, GRPO yields more informative gradients than absolute-reward optimization.

As illustrated in Figure [3](https://arxiv.org/html/2606.31392#S3.F3), ReGRPO extends GRPO by making reflection $z_i$ part of the optimized trajectory. This ties the advantage signal directly to diagnostic tokens, so the policy learns *which* reflections are useful and *how much* reflection is worth generating. Consequently, reflection quality and brevity become explicitly optimized rather than a fixed prompting heuristic.

Reflection-Driven Advantage and Optimization. Within the sampled group, we compute the baseline as the mean reward $\bar{R}_i = \frac{1}{K}\sum_{k=1}^{K} R(\tau_i^{(k)})$, and extract the advantage for each trajectory: $A_i^{(k)} = R(\tau_i^{(k)}) - \bar{R}_i$.

For two trajectories from the same state context, an unrecovered failure $\tau^{-}$ and a recovered trajectory $\tau^{+}$, the reward gap is $\Delta R = R(\tau^{+}) - R(\tau^{-})$. When recovery succeeds and reflection remains concise, $\Delta R$ is typically positive, so $A(\tau^{+}) > A(\tau^{-})$ within the same sampled group. Importantly, this remains true in the verifier-free setting ($\lambda_{\text{val}} = 0$), where improvements come entirely from reflection-conditioned recovery and reflection-cost control.
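The group-relative advantage and the reward gap can be illustrated with a small worked example. The four reward values are illustrative numbers for the verifier-free setting, not results from the paper.

```python
def group_advantages(rewards):
    """A^(k) = R(tau^(k)) - mean reward of the sampled group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Illustrative group of K = 4 candidates from the same state context:
# one-shot success, recovery after a short reflection, failure without
# reflection, and failure despite a reflection (which still pays the cost).
rewards = [1.0, 0.95, 0.0, -0.05]
advantages = group_advantages(rewards)

# Reward gap between the recovered trajectory and the unrecovered failure.
delta_r = rewards[1] - rewards[2]
```

Here the group baseline is 0.475, so the recovered trajectory receives a positive advantage and both failures receive negative ones, which is the ordering described above.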

The ReGRPO objective is optimized as:

$$
\mathcal{L}_{\mathrm{ReGRPO}} = -\mathbb{E}_{x_i, \tau \sim \pi_\theta}\left[\frac{1}{K}\sum_{k=1}^{K} A_i^{(k)} \log \pi_\theta(\tau_i^{(k)} \mid x_i)\right] + \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
$$

Since the log-probability of the structured reflection, $\log \pi_\theta(z_i \mid x_i, a_i^{(0)}, o_i^{(0)})$, appears as an additive term in the trajectory log-likelihood $\log \pi_\theta(\tau_i^{(k)} \mid x_i)$, relative advantages directly scale gradients on reflection tokens. ReGRPO therefore uses the standard GRPO optimizer and advantage estimator; the contribution is the structured reflection trajectory design and training/deployment protocol, not a new value estimator. Detailed quantitative results are presented in Sec. [4](https://arxiv.org/html/2606.31392#S4), and extended diagnostics are reported in the appendix.
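The factorization behind this statement can be written out for a recovery segment. This is a sketch under two assumptions: observations are produced by the environment, so they add no policy terms, and advantages are treated as constants when differentiating. The KL term is omitted.

$$
\begin{aligned}
\log \pi_\theta(\tau_i^{(k)} \mid x_i) &= \log \pi_\theta(a_i^{(0)} \mid x_i) + \log \pi_\theta(z_i \mid x_i, a_i^{(0)}, o_i^{(0)}) + \log \pi_\theta(a_i^{(1)} \mid x_i, a_i^{(0)}, o_i^{(0)}, z_i), \\
-\frac{1}{K} A_i^{(k)} \nabla_\theta \log \pi_\theta(\tau_i^{(k)} \mid x_i) &\;\supset\; -\frac{1}{K} A_i^{(k)} \nabla_\theta \log \pi_\theta(z_i \mid x_i, a_i^{(0)}, o_i^{(0)}).
\end{aligned}
$$

The second line singles out the reflection term: a candidate with positive advantage raises the likelihood of its reflection tokens, and one with negative advantage lowers it, by the same factor that scales its action tokens.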

### 3.5 Zero-Verifier Inference Stage

At deployment, we use single\-path inference without external verifier calls\. A deterministic gate decides whether the policy should open one local reflection\-correction block\. Tool outputs are first normalized into a canonical schema

$$
\hat{o}_i = \{\texttt{status}, \texttt{payload}, \texttt{meta}\}, \tag{4}
$$

where `status` stores backend error codes and `payload` stores normalized content. We then use a minimal trigger

$$
g_i = \mathbf{1}\{\texttt{ToolError}(\hat{o}_i) \lor \texttt{EmptyObs}(\hat{o}_i) \lor u_i < \kappa_i\}, \tag{5}
$$

with policy confidence

$$
u_i = \exp\!\left(\frac{1}{|a_i^{(0)}|}\sum_{j} \log \pi_\theta(a_{i,j}^{(0)} \mid x_i, h_i)\right). \tag{6}
$$

The first two terms of the trigger in (5) capture explicit runtime failures. The confidence term is a lightweight uncertainty proxy: it measures token-level confidence for the predicted action sequence. We therefore use it only as a trigger heuristic for potential silent failures. To avoid per-tool offline calibration, we use an online adaptive threshold computed from the current trajectory:

$$
\kappa_i = \frac{1}{\max(1, i-1)}\sum_{j=1}^{i-1} u_j. \tag{7}
$$

For $i = 1$, we disable the confidence trigger and rely on hard-failure checks only. If $g_i = 0$, inference continues with the next standard action. If $g_i = 1$, the policy executes one local block $a_i^{(0)} \rightarrow o_i^{(0)} \rightarrow z_i \rightarrow a_i^{(1)} \rightarrow o_i^{(1)}$, with at most one reflection-correction block per step. This design keeps inference deterministic and lightweight, while reducing manual feature engineering and per-tool tuning.
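The trigger in Eqs. (5) to (7) can be sketched as follows. This is an illustration, not the authors' appendix pseudo-code: the `"ok"` status value and the emptiness check on `payload` are assumptions about how `ToolError` and `EmptyObs` might be implemented.

```python
import math


def confidence(token_logprobs):
    """u_i: exponentiated mean token log-probability of the predicted action."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def adaptive_threshold(past_confidences):
    """kappa_i: mean of u_1 .. u_{i-1} from the current trajectory."""
    return sum(past_confidences) / max(1, len(past_confidences))


def should_reflect(obs, token_logprobs, past_confidences):
    """g_i: open a reflection-correction block on hard failure or low confidence."""
    tool_error = obs["status"] != "ok"   # assumed encoding of ToolError
    empty_obs = not obs["payload"]       # assumed encoding of EmptyObs
    if tool_error or empty_obs:
        return True
    if not past_confidences:
        # Step i = 1: the confidence trigger is disabled.
        return False
    return confidence(token_logprobs) < adaptive_threshold(past_confidences)
```

Because the threshold is the running mean of earlier confidences in the same trajectory, the gate needs no per-tool calibration, at the price of being uninformative on the first step.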

Pseudo\-code for this trigger is provided in the Appendix\.

## 4 Experiments

### 4.1 Benchmarks and Metrics

We evaluate ReGRPO on two multimodal tool-use benchmarks in the MAT-AGENT setting [gao2024multi], where agents must reason over images and documents with real tools.

GTA Dataset\.GTA\[wang2024gta\]contains 229 tasks paired with 252 images\. Each task requires 2–8 tool\-use steps \(typically 2–4\) and tests visual perception, local operations \(e\.g\., reading receipts or charts\), and short reasoning chains over screenshots and UI\-like images\.

GAIA Dataset\.GAIA\[mialon2023gaia\]is a document\-centric benchmark with 446 tasks over 109 files in PPTX, PDF, and XLSX formats\. Tasks are grouped into three difficulty levels and often require multiple tool calls for document understanding, web navigation, logical reasoning, and summarization\.

Table 1: Main comparison on GTA and GAIA under a unified single-path, zero-verifier inference protocol. The default verifier-free ReGRPO ($\lambda_{\text{val}}=0$) achieves the best performance among the evaluated open-source controllers on both benchmarks.

| Method | Controller | GTA ToolAcc | GTA CodeExec | GTA AnsAcc | GAIA Level 1 | GAIA Level 2 | GAIA Level 3 | GAIA AnsAcc |
|---|---|---|---|---|---|---|---|---|
| *Closed-source Controller* | | | | | | | | |
| Lego Agent | GPT-4 | - | - | 46.59 | - | - | - | - |
| Lego Agent | GPT-4o | - | - | 41.52 | - | - | - | - |
| Warm-up Agent | GPT-4-turbo | - | - | - | 30.20 | 15.10 | 0.00 | 17.60 |
| HF Agent | GPT-4o | 63.41 | 95.12 | 57.05 | 47.17 | 31.40 | 11.54 | 33.40 |
| HF Agent | GPT-4o-mini | 56.10 | 100.00 | 57.69 | 33.96 | 27.91 | 3.84 | 26.06 |
| *Open-Source Controller* | | | | | | | | |
| HF Agent | LLaVA-NeXT-8B | 14.97 | 25.08 | 14.10 | 9.43 | 1.16 | 0.00 | 3.64 |
| HF Agent | InternVL2-8B | 36.75 | 52.18 | 32.05 | 7.55 | 4.65 | 0.00 | 4.85 |
| HF Agent | MiniCPM-V-8.5B | 36.59 | 56.10 | 33.97 | 13.21 | 5.81 | 0.00 | 7.27 |
| HF Agent | Qwen2-VL-7B | 44.85 | 65.19 | 42.31 | 16.98 | 8.14 | 0.00 | 9.70 |
| T3-Agent | MAT-MiniCPM-V-8.5B | 65.85 | 80.49 | 52.56 | 26.42 | 11.63 | 3.84 | 15.15 |
| T3-Agent | MAT-Qwen2-VL-7B | 64.63 | 84.32 | 53.85 | 26.42 | 15.12 | 3.84 | 16.97 |
| SPORT Agent | Tuned-Qwen2-VL-7B | 72.41 | 91.87 | 60.26 | 35.85 | 16.28 | 3.84 | 20.61 |
| *Ours* | | | | | | | | |
| ReGRPO (default, $\lambda_{\text{val}}=0$) | MAT-Qwen2-VL-7B | 76.35 | 93.77 | 67.66 | 39.02 | 18.71 | 4.89 | 23.35 |

Metrics. We report standard accuracy metrics: *AnsAcc* (answer accuracy), *ToolAcc* (tool-call validity), and *CodeExec* (execution success rate). Unless noted otherwise, all main-text comparisons use zero-verifier deployment, with no external verifier calls at test time.

### 4.2 Baselines

We compare ReGRPO with both closed-source and open-source controllers.

Closed-source agents. We report GPT-4/4o-based results from prior work, including Lego Agent and Warm-up Agent [mialon2023gaia, wang2024gta], and HF Agents powered by GPT-4o and GPT-4o-mini. These models provide strong proprietary references.

Open-source agents. We include HF Agents based on LLaVA-NeXT-8B, InternVL2-8B, MiniCPM-V-8.5B, and Qwen2-VL-7B. We also evaluate MAT-AGENT (T3-Agent) with MAT-MiniCPM-V-8.5B and MAT-Qwen2-VL-7B, and SPORT Agent with a tuned Qwen2-VL-7B controller. For fair internal comparisons, all ablations use the same Qwen2-VL-7B controller and tool suite.

Our model. We use the same Qwen2-VL-7B backbone and toolset as SPORT [li2025iterative] to isolate the effect of our method. We compare controlled variants under matched training settings (same total updates, rollout budget, and token budget): (1) MAT-AGENT: fine-tuned on MM-Traj without RL; (2) +RoT SFT: SFT with structured reflective data only; (3) +Optional Verifier Distill: adds verifier-aware distillation but no RL; (4) GRPO-only: GRPO on actions only (no reflection tokens); (5) GRPO + Free-form Reflection: allows unstructured reflection text without schema constraints; (6) ReGRPO core (default, verifier-free RL): structured reflection with $\lambda_{\text{val}}=0$ during RL; (7) ReGRPO + optional verifier reward: additive shaping with the deterministic, metadata-derived verifier value ($\lambda_{\text{val}}>0$). All variants use single-path, zero-verifier inference at test time.

### 4.3 Implementation Details

Model and optimization. We use Qwen2-VL-7B as the controller. The vision encoder and visual token compressor are frozen. We fine-tune the language model with LoRA [hu2021lora] (rank 32), applied to query, key, and value projections in all self-attention layers. We optimize with AdamW and cosine learning-rate decay, using a base learning rate of $1.0\times 10^{-6}$ and batch size 2 per device.

Training stages. Our default pipeline has two mandatory stages: (1) *SFT warm start* on our Structured Reflective Data (triplets); (2) *ReGRPO process RL* on a mixture of offline groups and online self-exploration groups. Optionally, we insert verifier-aware distillation between (1) and (2), where the controller predicts the deterministic, metadata-derived verifier subscores for better calibration and more stable initialization. ReGRPO uses the reflection-aware reward in Eq. [2](https://arxiv.org/html/2606.31392#S3.E2), with a reflection-cost penalty $\eta=0.1$. By default, we set $\lambda_{\text{val}}=0$ (verifier-free RL); runs with $\lambda_{\text{val}}>0$ using the deterministic, metadata-derived verifier value are reported as additive variants. For controlled comparisons, variants with and without distillation use identical optimizer settings and matched update steps to avoid gains from extra training budget.

Table 2: Ablation of ReGRPO under the same backbone and tool settings. The verifier-free core pipeline (RoT + ReGRPO, $\lambda_{\text{val}}=0$) reaches 67.66/23.35 GTA/GAIA AnsAcc, and adding the optional teacher-derived verifier reward further improves to 68.49/24.01. Most gains come from the RoT SFT + structured RL combination, while verifier signals act as optional additive improvements.

(a) MAT SFT vs. RoT training example (verbatim record `0FLZe2lb_rot_s0_GroundingDrift`)

MAT SFT has top-level fields `id`, `image`, `answer`, `conversations`. RoT keeps `id`/`image`/`conversations` and adds a `reflection` triplet (`error_type`, `evidence`, `fix_plan`) plus a `corrected_action`.

MAT instance (abbreviated)

    {"id":"0FLZe2lb", "image":"coco/.../000000043093.jpg",
     "answer":"...smoothie calorie estimate...",
     "conversations":[{"role":"user",...},{"role":"assistant",...}]}

RoT instance (abbreviated)

    {"id":"0FLZe2lb_rot_s0_GroundingDrift",
     "image":"coco/.../000000043093.jpg",
     "failed_action":"visualizer(q='fruits/utensils on the table')",
     "reflection":{"error_type":"GroundingDrift",
                   "evidence":"cutting board, banana, knife",
                   "fix_plan":"re-ground to the smoothie glass"},
     "corrected_action":"visualizer(q='ingredients in this smoothie')"}

(b) Mechanism-aligned reflection (real (ErrorType, Evidence, FixPlan) triplet)

Task. “How many calories are in the smoothie shown in this image?” The image shows a smoothie glass beside a cutting board with a sliced banana, a knife, and granola.

Silent failure (image grounding). The perturbed action asks `visualizer` about the *fruits and utensils on the table*; the tool returns a valid, non-empty description (“a wooden cutting board with a sliced banana, a knife, …; the smoothie glass is not described”), so no explicit tool error is raised. Low policy confidence ($u_i<\kappa_i$) opens one local repair block.

Reflection (ReGRPO schema, verbatim). ErrorType=GroundingDrift; Evidence="wooden cutting board with a sliced banana, a knife"; FixPlan="refocus from the table/cutting board to the smoothie glass and ask about its contents". The corrected action re-grounds the query to the smoothie itself.

Takeaway. This real record exposes the full Error → Reflection → Correction chain that ReGRPO optimizes, under the same single-path zero-verifier deployment.

(c) Inference Path Comparison

Sample content

    tools: visualizer
    files: coco/train2017/000000043093.jpg
    dialogs[0]: {"role":"user","content":"How many calories are in the smoothie shown in this image?"}

Note: abbreviated from the verbatim RoT record; tool outputs are shortened for readability.

Baseline path (no local repair block)

1. Step 1: `visualizer(image, q="fruits/utensils on the table")` → “cutting board, sliced banana, knife, granola; smoothie glass not described”.
2. Step 2: estimate calories from the off-target items on the board → ungrounded ingredient list (wrong basis).

ReGRPO path (single-path, zero-verifier)

1. Step 1: same first call; the description omits the target smoothie ($g_i=0$ so far).
2. Step 1 (gate): the target object is missing from the observation, so confidence is low ($u_i<\kappa_i$); set $g_i=1$.
3. Local block: $a_i^{(0)} \rightarrow o_i^{(0)} \rightarrow z_i \rightarrow a_i^{(1)} \rightarrow o_i^{(1)}$. Reflection emits ErrorType=GroundingDrift, Evidence=cutting board / banana / knife, and FixPlan=re-ground to the smoothie glass.
4. Step 2: corrected action `visualizer(image, q="ingredients in this smoothie")` → grounded smoothie contents, then a grounded calorie estimate.

Figure 4: Figure-level evidence for ReGRPO, instantiated on a verbatim synthesized RoT record (`0FLZe2lb_rot_s0_GroundingDrift`). (a) RoT augments the MAT SFT format with explicit reflective fields (a `reflection` triplet and a `corrected_action`). (b) The real (ErrorType, Evidence, FixPlan) reflection diagnoses a silent grounding failure (the tool answers about the cutting board rather than the smoothie) and prescribes a re-grounding fix. (c) Inference-path comparison contrasts a brittle baseline route that estimates calories from off-target items with a ReGRPO route where one confidence trigger opens a single local reflection–correction block that re-grounds to the smoothie.

### 4.4 Quantitative Results

Table [1](https://arxiv.org/html/2606.31392#S4.SS1) summarizes results on GTA and GAIA [wang2024gta, mialon2023gaia]. Under the same single-path, zero-verifier setup, default ReGRPO ($\lambda_{\text{val}}=0$) gives the strongest results among the compared open-source controllers. On GTA, it reaches 76.35 ToolAcc and 67.66 AnsAcc, improving over SPORT [li2025iterative] by +3.94 ToolAcc and +7.40 AnsAcc. The larger AnsAcc gain suggests that reflection helps end-to-end reasoning beyond tool-call validity alone. On GAIA, default ReGRPO raises overall AnsAcc to 23.35 (+2.74 over SPORT), indicating stronger results on document-centric multi-step tasks even without verifier reward. Adding the optional teacher-derived verifier reward further improves performance to 68.49/24.01 GTA/GAIA AnsAcc in Table [2](https://arxiv.org/html/2606.31392#S4.T2), while the main gains are already achieved in verifier-free training and deployment. Across both datasets, the results support our hypothesis that structured reflections improve tool-grounded reasoning. Multi-seed statistics and extended analyses are provided in the appendix.

### 4.5 Ablation Studies

Table [2](https://arxiv.org/html/2606.31392#S4.T2) reports a controlled ablation in which we progressively add reflective supervision and reflection-aware optimization.

Contribution coverage. Our experimental design covers these method components: (1) the *Structured Reflective Data Engine* is evaluated by MAT-AGENT → +RoT SFT and the data-format comparison in Figure [4](https://arxiv.org/html/2606.31392#S4.F4); (2) the *GRPO-based reflection protocol* is demonstrated by comparing GRPO-only and free-form reflection against structured ReGRPO under the same settings; (3) *deterministic zero-verifier trigger gating* is assessed under single-path, zero-verifier deployment, with extended trigger diagnostics in the appendix.

Baseline and reflective SFT gains. Starting from MAT-AGENT (MM-Traj, no RL), performance is 53.85 GTA AnsAcc and 16.97 GAIA AnsAcc. Replacing SFT data with RoT (+RoT SFT) raises GTA AnsAcc to 58.59 and GAIA AnsAcc to 19.03 (+4.74 GTA / +2.06 GAIA), showing that explicit Error-Reflection-Correction supervision improves trajectory quality before RL.

Optional verifier distillation contributes modestly. Adding optional verifier distillation on top of RoT SFT further improves performance to 59.72 GTA AnsAcc and 19.84 GAIA AnsAcc (an additional +1.13 GTA / +0.81 GAIA over +RoT SFT). This suggests the deterministic verifier subscores help intermediate decisions, but alone they do not close the gap to RL-based methods.

Action-only RL vs. reflection-aware variants. GRPO-only (without RoT reflection tokens in policy optimization) reaches 64.51 GTA AnsAcc and 18.92 GAIA AnsAcc, indicating that RL exploration is effective on GTA but more limited on GAIA. Allowing free-form reflection improves results to 65.34 GTA AnsAcc and 21.38 GAIA AnsAcc (+0.83 GTA / +2.46 GAIA over GRPO-only), suggesting reflection text helps, while unstructured reflection remains less efficient and less consistent.

Effect of structured ReGRPO and optional verifier reward. Switching from free-form reflection to structured ReGRPO with $\lambda_{\text{val}}=0$ yields 67.66 GTA AnsAcc and 23.35 GAIA AnsAcc (+2.32 GTA / +1.97 GAIA), showing that schema-constrained reflection and reflection-correction coupling add gains even without verifier reward in RL. Adding the optional teacher-derived verifier reward further improves to 68.49 GTA AnsAcc and 24.01 GAIA AnsAcc (+0.83 GTA / +0.66 GAIA over ReGRPO core), indicating that verifier reward is useful but not essential.
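The stepwise gains quoted in the ablation paragraphs can be re-derived from the AnsAcc values they report. The short script below is only a bookkeeping sketch: the dictionary holds the (GTA, GAIA) AnsAcc pairs stated in the text, and the helper name `delta` is ours.

```python
# (GTA AnsAcc, GAIA AnsAcc) for each variant, as quoted in the ablation text.
ABLATION = {
    "MAT-AGENT": (53.85, 16.97),
    "+RoT SFT": (58.59, 19.03),
    "+Optional Verifier Distill": (59.72, 19.84),
    "GRPO-only": (64.51, 18.92),
    "GRPO + Free-form Reflection": (65.34, 21.38),
    "ReGRPO core": (67.66, 23.35),
    "ReGRPO + verifier reward": (68.49, 24.01),
}

# (baseline, variant) pairs compared in the text.
COMPARISONS = [
    ("MAT-AGENT", "+RoT SFT"),
    ("+RoT SFT", "+Optional Verifier Distill"),
    ("GRPO-only", "GRPO + Free-form Reflection"),
    ("GRPO + Free-form Reflection", "ReGRPO core"),
    ("ReGRPO core", "ReGRPO + verifier reward"),
]


def delta(baseline: str, variant: str) -> tuple:
    """Return the (GTA, GAIA) AnsAcc gain of `variant` over `baseline`."""
    b, v = ABLATION[baseline], ABLATION[variant]
    return (round(v[0] - b[0], 2), round(v[1] - b[1], 2))


if __name__ == "__main__":
    for baseline, variant in COMPARISONS:
        gta, gaia = delta(baseline, variant)
        print(f"{baseline} -> {variant}: +{gta:.2f} GTA / +{gaia:.2f} GAIA")
```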

Key insight. The ablation shows complementary effects: (1) RoT data provides a strong starting point, (2) structured reflection-aware RL adds consistent gains on top of that initialization, and (3) verifier signals provide modest additive shaping rather than core capability. ReGRPO therefore remains effective in a verifier-free default setting during both training ($\lambda_{\text{val}}=0$) and inference (zero-verifier single-path execution). We further verify that ReGRPO preserves the base model’s VQA ability (Appendix Sec. [A6](https://arxiv.org/html/2606.31392#Pt0.A6)).

### 4.6 Qualitative Analysis

Figure [4](https://arxiv.org/html/2606.31392#S4.F4) presents three views of the mechanism, instantiated on a verbatim synthesized RoT record (a `GroundingDrift` case). Panel (a) shows that RoT extends the MAT SFT format with explicit reflective fields (a `reflection` triplet and a `corrected_action`), so supervision covers failure diagnosis and correction rather than tool calls alone. Panel (b) gives case-level evidence aligned with Sec. [3.5](https://arxiv.org/html/2606.31392#S3.SS5): the tool returns a valid but off-target description (it answers about the cutting board, not the smoothie), so no explicit error is raised, and low confidence ($u_i<\kappa_i$) triggers a structured reflection whose schema (ErrorType, Evidence, FixPlan) re-grounds the query rather than adding generic free-form text. Panel (c) shows the same mechanism as paths: the baseline route estimates from off-target items, while ReGRPO opens exactly one local block at the uncertain step and rewrites the action to re-ground on the target object. The full record is given in Appendix Table [12](https://arxiv.org/html/2606.31392#Pt0.A7.T12).

## 5 Conclusion

We presented ReGRPO, a reflection-augmented framework for training tool-using vision–language agents. ReGRPO builds a Structured Reflective Data Engine and applies a GRPO-based reflection protocol to optimize reflection and action generation jointly in tool-using trajectories. Our method keeps the standard GRPO optimizer, with the main design changes in structured reflection representation and the training/deployment protocol.

On the GTA and GAIA benchmarks, our approach consistently outperforms action\-level and SFT\-only baselines under the same backbone/tool setting\. The ablation results show complementary gains: RoT SFT provides a strong foundation, structured RL contributes further improvements, and optional verifier distillation/reward adds further but smaller gains\. Overall, ReGRPO improves answer accuracy and tool correctness over MAT\-AGENT and SPORT while using the same VLM backbone and tool suite\.

#### Acknowledgements\.

This project is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 \(Award No: MOE\-T2EP20124\-0012\)\.

## References

Appendix

## Appendix A1 Algorithmic Details

In this section we provide pseudo-code and implementation details for Reflection-Augmented Group Relative Policy Optimization (ReGRPO).

### A1.1 ReGRPO Training

We assume a dataset of task states. For each state $s_i$, we sample a group of trajectories and optimize a verifier-aware reward during training. The verifier term is used only in training; inference is strictly zero-verifier.

Algorithm 1: ReGRPO Training Loop (Verifier-Aware Reward)

Require: Dataset $\mathcal{D}$, policy $\pi_\theta$, reference policy $\pi_{\text{ref}}$, coefficients $\lambda_{\text{exec}}, \lambda_{\text{val}}, \eta$, KL weight $\beta$, verifier weights $w_a, w_g, w_p$, group size $K$.

1. while not converged do
2. Sample batch of states $S \subset \mathcal{D}$
3. for all states $s \in S$ do
4. Construct a group of $K$ candidate trajectories $\{\tau^k\}_{k=1}^{K}$
5. for $k=1$ to $K$ do
6. Replay $\tau^k$ to collect success signal and tool observations
7. Compute verifier subscores $(s_a^k, s_g^k, s_p^k)$
8. $V(\tau^k) = w_a s_a^k + w_g s_g^k + w_p s_p^k$
9. Compute reward $R(\tau^k) = \lambda_{\text{exec}} \cdot \text{Success}(\tau^k) + \lambda_{\text{val}} \cdot V(\tau^k) - \eta \cdot C(\tau^k)$
10. end for
11. Compute baseline $\bar{R} = \frac{1}{K}\sum_{k=1}^{K} R(\tau^k)$
12. Compute advantages $A^k = R(\tau^k) - \bar{R}$
13. Accumulate gradients $\nabla_\theta \mathcal{L} = -\sum_{k} A^k \nabla \log \pi_\theta(\tau^k) + \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$
14. end for
15. Update $\theta$
16. end while
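To make the reward and advantage steps of Algorithm 1 concrete, here is a minimal sketch of steps 9, 11, and 12 for a single group. It follows the mean-baseline form shown in the algorithm (no variance normalization). The `Rollout` container, the reading of $C(\tau)$ as a count of reflection blocks, and the default $\lambda_{\text{exec}}=1.0$ are assumptions for illustration; $\eta=0.1$ and $\lambda_{\text{val}}=0$ are the defaults stated in the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Rollout:
    success: float          # Success(tau): 1.0 if the replay succeeds, else 0.0
    verifier_value: float   # V(tau) in [0, 1]
    reflection_cost: float  # C(tau); assumed here to count reflection blocks


def reward(r: Rollout, lam_exec: float = 1.0, lam_val: float = 0.0,
           eta: float = 0.1) -> float:
    """Step 9: R = lam_exec * Success + lam_val * V - eta * C."""
    return lam_exec * r.success + lam_val * r.verifier_value - eta * r.reflection_cost


def group_advantages(group: Sequence[Rollout], **coeffs: float) -> List[float]:
    """Steps 11-12: subtract the group-mean reward from each trajectory reward."""
    rewards = [reward(r, **coeffs) for r in group]
    baseline = sum(rewards) / len(rewards)
    return [x - baseline for x in rewards]


if __name__ == "__main__":
    group = [
        Rollout(success=1.0, verifier_value=0.0, reflection_cost=1.0),
        Rollout(success=1.0, verifier_value=0.0, reflection_cost=0.0),
        Rollout(success=0.0, verifier_value=0.0, reflection_cost=0.0),
        Rollout(success=0.0, verifier_value=0.0, reflection_cost=1.0),
    ]
    print(group_advantages(group))
```

In this toy group, a successful rollout that needed no reflection receives the largest advantage, and a failed rollout that still spent a reflection block receives the smallest, which is the intended effect of the reflection-cost term.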

### A1.2 Reward Coefficients and Group Size

Table [3](https://arxiv.org/html/2606.31392#Pt0.A1.T3) lists the default coefficients used in our reported setting.

Table 3: Default ReGRPO coefficients. In deployment we set $\lambda_{\text{val}}=0$ and do not call external verifiers.

### A1.3 Inference Trigger Implementation

We use a deterministic and lightweight trigger that avoids per-tool feature engineering. For each step, tool output is normalized as

$$\hat{o}_i = \{\texttt{status}, \texttt{payload}, \texttt{meta}\}, \tag{8}$$

where `status` records runtime errors and `payload` stores normalized content. The trigger is

$$g_i = \mathbf{1}\{\texttt{ToolError}(\hat{o}_i) \lor \texttt{EmptyObs}(\hat{o}_i) \lor u_i < \kappa_i\}. \tag{9}$$

Hard failures are captured by `ToolError`/`EmptyObs`; silent failures are captured by low policy confidence

$$u_i = \exp\!\left(\frac{1}{|a_i^{(0)}|}\sum_{j}\log\pi_\theta\big(a_{i,j}^{(0)} \mid x_i, h_i\big)\right). \tag{10}$$

To avoid offline calibration and per-tool thresholds, we use an online adaptive threshold

$$\kappa_i = \frac{1}{\max(1, i-1)}\sum_{j=1}^{i-1} u_j. \tag{11}$$

For $i=1$, confidence-based triggering is disabled and only hard-failure checks are used.

Algorithm 2: Minimal Deterministic Trigger (Zero-Verifier)

Require: Context $x_i$, tentative action $a_i^{(0)}$, tool output $o_i^{(0)}$, previous confidences $\{u_j\}_{j<i}$

1. Normalize $o_i^{(0)}$ to canonical schema $\hat{o}_i$
2. Compute confidence $u_i$ from action log-probabilities
3. Compute adaptive threshold $\kappa_i$ as mean confidence of previous steps
4. $g_i \leftarrow \mathbf{1}\{\texttt{ToolError}(\hat{o}_i) \lor \texttt{EmptyObs}(\hat{o}_i) \lor (i>1 \land u_i<\kappa_i)\}$
5. If $g_i=1$ and local-reflection-count $<1$, run one block $a_i^{(0)} \rightarrow o_i^{(0)} \rightarrow z_i \rightarrow a_i^{(1)} \rightarrow o_i^{(1)}$

Output: Trigger decision $g_i$

This design keeps deployment deterministic, simple, and reproducible while reducing manual tuning\.
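A compact way to read Algorithm 2 is as a pure function of the normalized observation, the current confidence, and the confidence history. The sketch below is illustrative only: the `"ok"` status convention and the emptiness test are assumptions about the canonical schema, which the paper specifies only as `{status, payload, meta}`.

```python
from typing import Any, Dict, Sequence


def tool_error(obs: Dict[str, Any]) -> bool:
    """Hard failure; assumes any status other than "ok" is a backend error code."""
    return obs["status"] != "ok"


def empty_obs(obs: Dict[str, Any]) -> bool:
    """Hard failure; assumes None, blank strings, and empty containers count as empty."""
    payload = obs["payload"]
    if payload is None:
        return True
    if isinstance(payload, str):
        return not payload.strip()
    if isinstance(payload, (list, tuple, dict, set)):
        return len(payload) == 0
    return False


def gate(step_index: int, obs: Dict[str, Any], u_i: float,
         previous_confidences: Sequence[float]) -> bool:
    """Step 4: g_i for a 1-indexed step; the soft trigger is off at step 1."""
    kappa_i = sum(previous_confidences) / max(1, len(previous_confidences))
    low_confidence = step_index > 1 and u_i < kappa_i
    return tool_error(obs) or empty_obs(obs) or low_confidence


def should_open_block(g_i: bool, local_reflection_count: int) -> bool:
    """Step 5: open at most one reflection-correction block per step."""
    return g_i and local_reflection_count < 1


if __name__ == "__main__":
    ok = {"status": "ok", "payload": "cutting board, sliced banana, knife", "meta": {}}
    print(gate(2, ok, u_i=0.5, previous_confidences=[0.9]))   # soft trigger fires
    print(should_open_block(True, local_reflection_count=1))  # cap already reached
```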

## Appendix A2 Structured Reflective Data Engine Statistics

Table [4](https://arxiv.org/html/2606.31392#Pt0.A2.T4) reports corpus statistics over the synthesized Reflection-of-Thought (RoT) data (16,552 records). Each record is labeled with one of the four ErrorType categories used throughout the paper (Sec. [3.2](https://arxiv.org/html/2606.31392#S3.SS2)): `ArgInvalid`, `ToolMismatch`, `InfoInsufficient`, and `GroundingDrift`. This four-way taxonomy is the reflection label set: `ArgInvalid` dominates (45.4%), reflecting that corrupted-argument failures (wrong path/index/span) are the most frequent recoverable error in the source trajectories, followed by tool-capability mismatches (24.7%), insufficient-information queries (19.7%), and grounding drift (10.1%). The corpus is multi-modal: 37.0% of records carry an image, while the remaining 63.0% are text-only.

Table 4: ErrorType distribution over the synthesized Reflection-of-Thought corpus (16,552 records). The reflection label set is the four-way taxonomy used throughout the paper. Across the corpus, 37.0% of records carry an image and 63.0% are text-only.

Table 5: Tool used in the perturbed action over the synthesized Reflection-of-Thought corpus (top entries). The distribution is dominated by the image-QA, search, and file-inspection tools that appear most often in the source trajectories.

Table 6: Component contribution of the Structured Reflective Data Engine and reflection-aware RL.

## Appendix A3 Verifier Subscores and Offline Teacher

We stress that no LLM verifier is queried inside the RL loop. The verifier value $V$ used by the $\lambda_{\text{val}} V$ reward term is computed *deterministically* from each candidate and the active record’s RoT metadata, so $V$ is a function of the candidate, not of any network or external model call:

$$V = w_a s_a + w_g s_g + w_p s_p, \qquad w_g \ge w_a, w_p, \qquad s_a, s_g, s_p \in [0,1],$$

with default weights $(w_a, w_g, w_p) = (0.25, 0.50, 0.25)$. The three subscores are derived as follows (see Fig. [5](https://arxiv.org/html/2606.31392#Pt0.A3.F5)):

- $s_p$ (plan validity): $s_p=1$ iff the candidate’s normalized primary tool and first argument match the stored `corrected_action` signature, else 0.
- $s_a$ (answer consistency): $s_a=1$ iff the candidate’s replay succeeds (its terminal action agrees with the group’s correct answer), else 0.
- $s_g$ (grounding): $s_g=1$ iff the candidate carries a reflection whose `evidence` is text-grounded in the stored `failed_observation` *and* $s_p>0$, else 0. There is intentionally no fallback to $s_p$ for reflection-less candidates, so $V$ rewards grounded reflection above bare plan repair.

A GPT\-4o teacher is used*only offline*, to synthesize the RoT reflections in the data engine \(Sec\.[A5](https://arxiv.org/html/2606.31392#Pt0.A5)\); it is never called during RL training or at inference\.

Deterministic Verifier Subscores (no LLM in the loop)

Inputs (from the active RoT record):

    candidate.code              // the sampled action under evaluation
    meta.corrected_action       // gold action signature a*
    meta.failed_observation, candidate.reflection.evidence
    candidate.success           // replay success flag

Computation:

    s_p = 1 if signature(candidate.code) == signature(corrected_action) else 0
    s_a = 1 if candidate.success else 0
    grounded = evidence_grounded(reflection.evidence, failed_observation)
    s_g = 1 if (grounded and s_p > 0) else 0
    V = 0.25*s_a + 0.50*s_g + 0.25*s_p

No external model query; deterministic given the metadata.

Figure 5: Teacher-derived verifier subscores. At training time $(s_a, s_g, s_p)$ are computed deterministically from the teacher’s RoT metadata via tool/argument-signature matching, the replay success flag, and a grounded-reflection check ($s_g$ requires grounded evidence and $s_p>0$); no GPT-4o (or any LLM) is queried in the RL loop. The GPT-4o teacher is used only offline to synthesize the RoT reflections.
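The subscore computation in Figure 5 translates almost line for line into a small function. The version below is a sketch under stated assumptions: the action signature is simplified to a `(tool, first_arg)` pair, following the $s_p$ definition, and `evidence_grounded` is a stand-in token-overlap rule with an arbitrary 0.5 threshold, since the paper does not spell out the exact grounding check.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Candidate:
    tool: str                       # normalized primary tool name
    first_arg: str                  # normalized first argument
    success: bool                   # replay success flag
    evidence: Optional[str] = None  # reflection evidence; None if no reflection


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def evidence_grounded(evidence: str, failed_observation: str,
                      min_overlap: float = 0.5) -> bool:
    """Assumed heuristic: enough evidence tokens occur in the failed observation."""
    ev = _tokens(evidence)
    if not ev:
        return False
    return len(ev & _tokens(failed_observation)) / len(ev) >= min_overlap


def verifier_value(c: Candidate, corrected_signature: Tuple[str, str],
                   failed_observation: str,
                   weights: Tuple[float, float, float] = (0.25, 0.50, 0.25)) -> float:
    """V = w_a*s_a + w_g*s_g + w_p*s_p with binary subscores."""
    w_a, w_g, w_p = weights
    s_p = 1 if (c.tool, c.first_arg) == corrected_signature else 0
    s_a = 1 if c.success else 0
    grounded = c.evidence is not None and evidence_grounded(c.evidence, failed_observation)
    s_g = 1 if (grounded and s_p > 0) else 0  # no fallback for reflection-less candidates
    return w_a * s_a + w_g * s_g + w_p * s_p


if __name__ == "__main__":
    obs = "a wooden cutting board with a sliced banana, a knife; the smoothie glass is not described"
    gold = ("visualizer", "ingredients in this smoothie")
    with_reflection = Candidate("visualizer", "ingredients in this smoothie", True,
                                evidence="cutting board, banana, knife")
    bare_repair = Candidate("visualizer", "ingredients in this smoothie", True)
    print(verifier_value(with_reflection, gold, obs), verifier_value(bare_repair, gold, obs))
```

With the default weights, a correct repair that carries a grounded reflection scores 1.0, while the same repair without a reflection scores 0.5, which matches the stated intent that $V$ rewards grounded reflection above bare plan repair.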
## Appendix A4 Tool Suite

We use the standard MAT-AGENT tool suite:

- Web search: `searchinformation`, `visit`, `webqa`.
- Image QA: `image_qa`.
- File inspector: `pdf_span`, `table_reader`.
- Object localization: `object_loc`.
- Python sandbox: `python_exec`.

## Appendix A5 Prompt Templates

### A5.1 Reflection Generation Prompt

The Reflection-of-Thought (RoT) corpus is synthesized offline by a teacher LLM (GPT-4o). For each source step, the teacher is asked to produce, under a single requested ErrorType, one realistic near-miss *failed action*, one faithful *failed observation* in the target tool’s native output format, and one structured *reflection* triple. The system prompt fixes the contract and enforces strict JSON output:

System Prompt

You are a strict data generation engine for ReGRPO. Produce one realistic near-miss failed action, one faithful failed observation, and one Reflection-of-Thought triple. Return STRICT JSON only. Do not include markdown, prose, comments, or extra keys. The failed action must keep the same Thought plus Code shape as the provided correct action, but it must be broken according to the requested ErrorType. The failed observation must look like the named tool’s real output format and must be consistent with the failed action. The reflection evidence must quote a concrete token or phrase from failed_observation.

The user message is a JSON payload that supplies the task, the source step, the `requested_error_type` drawn from the four-way taxonomy {`ArgInvalid`, `ToolMismatch`, `InfoInsufficient`, `GroundingDrift`}, a per-error-type guidance string, a per-tool failure-observation template, the history before the step, and the (teacher-only) correct action. It also fixes the exact output schema the teacher must emit:

User Message (abridged JSON)

    {
      "task": ...,
      "source_id": ...,
      "step_index": ...,
      "target_tool": ...,
      "requested_error_type": <one of ArgInvalid|ToolMismatch|InfoInsufficient|GroundingDrift>,
      "error_type_guidance": [...],
      "failure_observation_template": [...],
      "history_before_step": [...],
      "correct_action_teacher_only": [...],
      "output_schema": {
        "failed_action": "<Thought+Code, same shape as correct but broken>",
        "failed_observation": "<realistic failure in the tool's output format>",
        "reflection": {
          "error_type": <the requested ErrorType>,
          "evidence": "<quote a token or phrase from failed_observation>",
          "fix_plan": "<concrete corrective strategy>"
        }
      }
    }

The per-error-type guidance instantiates each label concretely, e.g., `GroundingDrift`: “shift the referenced region, object, crop, or visual target while preserving the tool shape”; `ToolMismatch`: “replace the tool with a different, unsuitable tool for the same intent”; `ArgInvalid`: “corrupt an argument such as a path, page, bbox, object name, query, size, or format”; and `InfoInsufficient`: “drop or underspecify necessary context so the tool cannot retrieve enough information.” Every emitted record passes a deterministic validator that, among other checks, requires `reflection.evidence` to be text-grounded in `failed_observation` (a literal token overlap heuristic) and rejects any record whose `error_type` lies outside the taxonomy.
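The two validator checks named above (taxonomy membership and evidence grounding) plus the "no extra keys" rule from the system prompt can be sketched as follows. This is not the released validator: the function name, the exact key sets, and the 0.5 token-overlap threshold are assumptions chosen for illustration.

```python
import re
from typing import Any, Dict, List

TAXONOMY = {"ArgInvalid", "ToolMismatch", "InfoInsufficient", "GroundingDrift"}
TOP_LEVEL_KEYS = {"failed_action", "failed_observation", "reflection"}
REFLECTION_KEYS = {"error_type", "evidence", "fix_plan"}


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def validate_record(record: Dict[str, Any], requested_error_type: str,
                    min_overlap: float = 0.5) -> List[str]:
    """Return a list of problems; an empty list means the record is accepted."""
    if set(record) != TOP_LEVEL_KEYS:
        return ["unexpected or missing top-level keys"]
    reflection = record["reflection"]
    if not isinstance(reflection, dict) or set(reflection) != REFLECTION_KEYS:
        return ["reflection must contain exactly error_type, evidence, fix_plan"]

    problems: List[str] = []
    error_type = reflection["error_type"]
    if error_type not in TAXONOMY:
        problems.append(f"error_type {error_type!r} lies outside the taxonomy")
    elif error_type != requested_error_type:
        problems.append("error_type does not match the requested ErrorType")

    # Assumed grounding rule: enough evidence tokens occur in the observation.
    evidence = _tokens(reflection["evidence"])
    observed = _tokens(record["failed_observation"])
    if not evidence or len(evidence & observed) / len(evidence) < min_overlap:
        problems.append("evidence is not grounded in failed_observation")
    return problems


if __name__ == "__main__":
    record = {
        "failed_action": "visualizer(q='fruits/utensils on the table')",
        "failed_observation": "a wooden cutting board with a sliced banana, a knife",
        "reflection": {
            "error_type": "GroundingDrift",
            "evidence": "cutting board, banana, knife",
            "fix_plan": "re-ground to the smoothie glass",
        },
    }
    print(validate_record(record, "GroundingDrift"))
```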

## Appendix A6 Base VQA Ability Is Preserved

A natural concern is whether reflection-augmented RL erodes the controller’s underlying vision–language ability. Evaluating the base model and the ReGRPO-trained model on MMBench, we find that ReGRPO largely preserves base VQA ability ($85.0 \to 84.25$, a 0.75-point decrease), so learning structured reflection comes at little cost to general VQA competence.

## Appendix A7 Qualitative Study and Case Studies

We present trajectory-level case studies under the same inference setup as Sec. [3.5](https://arxiv.org/html/2606.31392#S3.SS5): inference is single-path and zero-verifier, and at each step we apply the deterministic gate $g_i = \mathbf{1}\{\texttt{ToolError}(\hat{o}_i) \lor \texttt{EmptyObs}(\hat{o}_i) \lor (i>1 \land u_i<\kappa_i)\}$, allowing at most one local reflection-correction block per step. All trajectories and records are copied *verbatim* from the ReGRPO evaluation caches and the Structured Reflective Data Engine corpus; no trajectory is paraphrased or synthesized, and long fields are truncated with “[…]” for readability only.

### A7.1 Success Trajectories

We first show two correct end-to-end trajectories from the GTA evaluation cache. Correctness is checked with GTA’s official substring rule (for every gold list, at least one gold string is a case-insensitive substring of the final answer). Trajectories use the runtime MAT-AGENT tool names (`visualizer`, `ask_search_agent`), which correspond respectively to the image-QA and web-search tools.
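The substring rule described above can be written in a few lines. This is a sketch of the rule as stated in this section, not GTA's official scoring script; the function name and the representation of gold answers as a list of lists are assumptions.

```python
from typing import Sequence


def gta_substring_match(final_answer: str,
                        gold_lists: Sequence[Sequence[str]]) -> bool:
    """True iff every gold list has at least one string that is a
    case-insensitive substring of the final answer."""
    answer = final_answer.lower()
    return all(
        any(gold.lower() in answer for gold in gold_list)
        for gold_list in gold_lists
    )


if __name__ == "__main__":
    # Hypothetical final answers; the gold strings are taken from Cases A1 and A2.
    print(gta_substring_match("Toast plus Garden Green Salad fits the budget.",
                              [["Toast"], ["Garden Green Salad"]]))
    print(gta_substring_match("The CEO is Tim Cook.", [["tim cook"]]))
```

Note that the check is lenient by construction: it accepts any answer that contains the gold strings, regardless of surrounding text.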

##### Case A1: GTA multi\-step tool use — constrained menu selection\.

Task:“Choose only one dish in Salad and one dish in Sandwich, which combination can be no more than 10\.9 yuan in total?” Input:image\_318\.jpg\(a menu, Figure[6](https://arxiv.org/html/2606.31392#Pt0.A7.F6)\)\. Gold: \{Toast, Garden Green Salad\}\.

![Refer to caption](https://arxiv.org/html/2606.31392v1/imgs/case_gta_318.jpg)Figure 6:Input image for Case A1 \(image\_318\.jpg\): the menu the agent grounds and reads before solving the constrained selection\.Table 7:Case A1 \(GTA,\_\_pf2, id 137\)\. Image grounding \(reading the structured menu\) followed by exact constrained optimization in code: the only pair summing to≤10\.9\\leq 10\.9is found correctly\. Verified by the GTA substring rule\.
##### Case A2: GTA search\+\+reason — image→\\rightarrowentity→\\rightarrowweb fact\.

Task:“Who is the CEO of this company?” Input:image\_417\.jpg\(an iPhone 15 Pro, Figure[7](https://arxiv.org/html/2606.31392#Pt0.A7.F7)\)\. Gold: \{Tim Cook\}\.

![Refer to caption](https://arxiv.org/html/2606.31392v1/imgs/case_gta_417.jpg)Figure 7:Input image for Case A2 \(image\_417\.jpg\): the depicted product the agent recognizes and resolves to its manufacturer before retrieving the CEO\.Table 8:Case A2 \(GTA,\_\_pf2, id 213\)\. A search\-and\-reason chain: recognize the depicted product, resolve it to the manufacturer, then retrieve and report the CEO\. The agent selects the search tool rather than an image tool\.

### A7\.2Failure Modes and Reflective Recovery

The four records below are copied*verbatim*from the Structured Reflective Data Engine corpus \(Sec\.[3\.2](https://arxiv.org/html/2606.31392#S3.SS2)\) and cover the fourErrorTypecategories used throughout the paper:ArgInvalid,ToolMismatch,InfoInsufficient, andGroundingDrift\. For one error type each, the record shows the task, the failed \(perturbed\) actionaifaila\_\{i\}^\{fail\}, the resulting failure observationoifailo\_\{i\}^\{fail\}, the synthesized structured Reflection\-on\-Thoughtziz\_\{i\}\(ErrorType/Evidence/FixPlan, the schema of Sec\.[3\.2](https://arxiv.org/html/2606.31392#S3.SS2)\), and the corrected actionai∗a\_\{i\}^\{\*\}\. During supervised fine\-tuning the reflection is emitted in\-context as aReflection:block immediately before the corrected action, so the model maximizesP\(zi,ai∗∣xi,aifail,oifail\)P\(z\_\{i\},a\_\{i\}^\{\*\}\\mid x\_\{i\},a\_\{i\}^\{fail\},o\_\{i\}^\{fail\}\)\.

Table 9: Case B1, ArgInvalid \(record 00WXZu4s\_rot\_s2\_ArgInvalid\)\. The reflection notices the observation describes traffic signs \(not the target image\), traces the cause to a corrupted path \(\.\.\.\_jpg with an extra underscore instead of \.\.\.\.jpg\), and repairs the argument to the known\-good path\.

Table 10: Case B2, ToolMismatch \(record 00\_tFZUB\_rot\_s0\_ToolMismatch\)\. The reflection reads the tool’s own error \(“does not access historical temperature records”\), recognizes a capability mismatch, and switches from image\_generator to ask\_search\_agent, routing a fact\-retrieval goal to a search tool\.

Table 11: Case B3, InfoInsufficient \(record 09ysaaTv\_rot\_s0\_InfoInsufficient\)\. A generic “what is shown” prompt returned only a scene description \(no artist/year\); the reflection rewrites the query to ask the tool for exactly the requested facts while keeping the correct image path\.

Table 12: Case B4, GroundingDrift \(record 0FLZe2lb\_rot\_s0\_GroundingDrift\)\. The reflection detects that the query drifted to the cutting board and utensils \(“the smoothie glass is not described”\) and re\-anchors the visual question to the smoothie itself, so the extracted ingredients feed the actual calorie estimate\.