Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL
Summary
LLM agents often mis-assess their own performance after observing environment feedback, a problem called the reflection gap. RefGRPO addresses this by augmenting RL with a free calibration bonus and dynamic scheduling, reducing underconfidence from 44.4% to 7.7% and improving task accuracy on text-to-SQL benchmarks.
# Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL
Source: [https://arxiv.org/html/2606.14211](https://arxiv.org/html/2606.14211)
###### Abstract
LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent *reflection gap*: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback, even for questions they answered correctly, and standard RL barely helps due to a credit-assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces the underconfidence rate $44.4\% \to 7.7\%$) and task accuracy (e.g., $75.1\% \to 76.5\%$) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.
Figure 1: High-level overview of our RefGRPO algorithm. We instruct the agent to reflect on environment feedback and generate a binary reflection score $s^{\mathrm{ref}} \in \{0,1\}$. RefGRPO has two key ingredients: (i) a *free* calibration bonus $c_k = \mathbb{I}(s^{\mathrm{ref}}_{k,H} = r_k)$ computed by contrasting post-feedback reflection with the outcome, requiring no additional reward model, LLM judge, or external annotation; it lifts well-calibrated rollouts above miscalibrated ones, giving honest reflection positive relative advantage regardless of task outcome. (ii) a *dynamic* schedule on the calibration coefficient, which enables the model to simultaneously improve reflection calibration and task performance. In effect, RefGRPO turns the agent into its own verifier grounded in environment feedback.

## 1 Introduction
Figure 2: Comparison of the base model, a GRPO-style RL baseline, and our RefGRPO in the multi-turn setting. The model generates reflection scores $s^{\mathrm{ref}} \in \{0,1\}$ based on environment feedback before receiving the outcome reward $r \in \{0,1\}$. (a) Underconfidence rate $\mathsf{UnderConf} = \mathbb{P}(r=1 \mid s^{\mathrm{ref}}=0)$, the fraction of self-flagged errors that are actually correct (*lower is better*). (b) Task accuracy $\mathsf{Acc} = \mathbb{P}(r=1)$ (*higher is better*). (c) A unified metric $\mathsf{ChowScore}$ for task accuracy and reflection calibration (*higher is better*).

LLM agents acting in an environment receive *environment feedback*, e.g., execution results, error messages, or tool outputs, after each action (Jin et al., [2025](https://arxiv.org/html/2606.14211#bib.bib24); Cao et al., [2025](https://arxiv.org/html/2606.14211#bib.bib5); Zhang et al., [2026](https://arxiv.org/html/2606.14211#bib.bib47)). This feedback is concrete evidence about whether an action succeeded, yet agents typically consume it only as context for choosing the next action, rarely explicitly using it to judge whether the task itself has been correctly solved.
In this work, we explicitly instruct the agent to*reflect*on the environment feedback and produce a*binary score*indicating whether its answer is correct—a post\-feedback self\-assessment grounded in evidence it has already observed\. This casts reflection as a prediction problem: an agent that can accurately predict its own correctness from the feedback it has seen demonstrates good understanding of the consequences of its actions\. Post\-feedback reflection quality is therefore a direct measure of how well the agent comprehends what it has done and how the environment evolves, i\.e\., the quality of its*implicit*world model\(Ha and Schmidhuber,[2018](https://arxiv.org/html/2606.14211#bib.bib19)\)\.
Crucially, this is different from existing work on LLM self\-assessment\(Kadavath et al\.,[2022](https://arxiv.org/html/2606.14211#bib.bib25); Tian et al\.,[2023](https://arxiv.org/html/2606.14211#bib.bib41); Xiong et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib44); Tao et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib40); Leng et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib27); Bani\-Harouni et al\.,[2026](https://arxiv.org/html/2606.14211#bib.bib3); Damani et al\.,[2026](https://arxiv.org/html/2606.14211#bib.bib10)\), where a model expresses its confidence in a self\-generated answer*solely*based on its own generation,*without environment feedback*\. In our setting, the agent reflects on its own actions*after*observing concrete environment feedback—strictly more information than the self\-generated answer alone\. Interpreting evidence one has already seen should be easier than forecasting correctness blind, so we would expect agents—especially RL\-trained agents with strong reasoning capabilities—to excel at it\.
Yet they do not. Beyond the well-documented overconfidence of LLMs (Xiong et al., [2024](https://arxiv.org/html/2606.14211#bib.bib44); Leng et al., [2025](https://arxiv.org/html/2606.14211#bib.bib27)), we find a more surprising failure: LLM agents exhibit significant *underconfidence* (flagging correct answers as wrong) even after observing concrete environment feedback, and standard RL recipes fail to fix it. On a multi-turn text-to-SQL setup, a Qwen2.5-Coder-7B-Instruct base is badly underconfident: $54.3\%$ of the answers it flags as wrong are actually correct ([Figure 2](https://arxiv.org/html/2606.14211#S1.F2)(a)). Training that base with a GRPO-style algorithm (Shao et al., [2024](https://arxiv.org/html/2606.14211#bib.bib37); Guo et al., [2025](https://arxiv.org/html/2606.14211#bib.bib18); Yu et al., [2025](https://arxiv.org/html/2606.14211#bib.bib45); Liu et al., [2025](https://arxiv.org/html/2606.14211#bib.bib30); He et al., [2025](https://arxiv.org/html/2606.14211#bib.bib20)) only nudges this rate down to $44.4\%$, even though it lifts task accuracy substantially. Standard RL recipes make the model a stronger task solver but leave its post-feedback reflection quality broken, a persistent failure we call the *reflection gap*. The cause is structural: outcome-only RL assigns advantage purely based on the outcome, so a rollout with an incorrect outcome but an honest error flag receives negative advantage, which trains the agent to suppress correct error flags instead of rewarding them; see [Section 3.1](https://arxiv.org/html/2606.14211#S3.SS1) for a detailed discussion.
To close the reflection gap, we propose a new algorithm RefGRPO that extends outcome-only RL with two key ingredients ([Figure 1](https://arxiv.org/html/2606.14211#S0.F1)). (i) A *free* calibration bonus: since both the reflection $s^{\mathrm{ref}}$ and the outcome $r$ are already available during RL training, we add a calibration bonus $c = \mathbb{I}(s^{\mathrm{ref}} = r)$ to the outcome reward before group normalization. This lifts well-calibrated rollouts above miscalibrated ones, giving honest reflection positive relative advantage regardless of task outcome. (ii) A dynamic schedule on the calibration coefficient $\alpha(t) \geq 0$, which starts with a relatively large value to front-load calibration early and then decays the coefficient to let the model focus on task performance while largely retaining the calibration ability. As shown in [Figure 2](https://arxiv.org/html/2606.14211#S1.F2), compared to a GRPO-style baseline, RefGRPO significantly reduces the underconfidence rate ($44.4\% \to 7.7\%$) while simultaneously improving task accuracy ($75.1\% \to 76.5\%$). We also introduce the $\mathsf{ChowScore}$ from statistical learning (Chow, [1957](https://arxiv.org/html/2606.14211#bib.bib6), [1970](https://arxiv.org/html/2606.14211#bib.bib7)) to the agentic setting as a unified metric for task accuracy and reflection calibration; RefGRPO improves $\mathsf{ChowScore}$ from $73.0\%$ to $76.5\%$.
#### Contributions\.
We make the following main contributions:
1. Post-feedback reflection is broken, and outcome-only RL barely fixes it. We show LLM agents tend to mis-assess their own outputs even after observing concrete environment feedback, with underconfidence rates above $44\%$ in the multi-turn setting ([Figure 2](https://arxiv.org/html/2606.14211#S1.F2)). We propose metrics to quantify the reflection gap ([Section 2](https://arxiv.org/html/2606.14211#S2.SS0.SSS0.Px3)), and trace it to a credit-assignment mismatch problem ([Section 3.1](https://arxiv.org/html/2606.14211#S3.SS1)).
2. Augmenting RL with a free calibration bonus. We augment standard RL recipes with (i) a free calibration bonus computed from the contrast between post-feedback reflection and outcome, and (ii) a dynamic schedule on its coefficient ([Section 3.2](https://arxiv.org/html/2606.14211#S3.SS2)). As shown in [Figure 2](https://arxiv.org/html/2606.14211#S1.F2) and [Section 4.2](https://arxiv.org/html/2606.14211#S4.SS2), our algorithm simultaneously improves reflection calibration (e.g., reduces the underconfidence rate $44.4\% \to 7.7\%$) and task accuracy (e.g., $75.1\% \to 76.5\%$), lifting the unified $\mathsf{ChowScore}$ metric $73.0\% \to 76.5\%$.
3. Calibration enables better self-improvement and selective prediction. The resulting calibrated reflection turns the agent into its own verifier, one *grounded in environment feedback* rather than pure self-assessment. We show that this further enables (i) better self-improvement that uses reflections as pseudo-rewards *without* outcome supervision ([Section 4.3](https://arxiv.org/html/2606.14211#S4.SS3)), and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct ([Section 4.4](https://arxiv.org/html/2606.14211#S4.SS4)).
## 2 Problem Setting
We formalize the interaction between agents and environments, and introduce the metrics to evaluate the quality of an agent’s reflection based on both its own actions and the environment feedback\.
#### The interaction framework\.
We study an environment in which the agent assesses its action*after observing concrete environment feedback*\. Each turn proceeds in three stages \(see[Figure1](https://arxiv.org/html/2606.14211#S0.F1), top\):
1. Action. The agent takes an action $a$ (e.g., a SQL query) after step-by-step reasoning.
2. Observation. The environment executes the action $a$ and returns an observation $o$ (e.g., query results, error messages, or tool outputs).
3. Reflection. The agent reflects on *both* the action $a$ and the observation $o$ and generates a binary reflection score $s^{\mathrm{ref}} \in \{0,1\}$ indicating whether it believes the action successfully completed the task (i.e., $s^{\mathrm{ref}} = 1$) or not (i.e., $s^{\mathrm{ref}} = 0$).

After the final turn, the environment provides a binary outcome reward $r \in \{0,1\}$ based on the correctness of the action; the outcome reward is used for RL training but is *not* revealed to the agent in the interaction.
#### Single\-turn and multi\-turn settings\.
We study both single-turn and multi-turn interactions. In the *single-turn* setting, the agent takes an action, observes the environment feedback, and conducts the reflection. (Footnote 1: Our single-turn setting differs from the non-agentic single-turn setup (e.g., Damani et al. ([2026](https://arxiv.org/html/2606.14211#bib.bib10))): the agent reflects on its action together with the resulting environment feedback, rather than on its self-generated answer alone.) In the *multi-turn* setting, the agent interacts with the environment for up to $H$ turns. The agent can commit to a final action early and terminate the episode if it is confident about the result. At each turn, we instruct the agent to reflect on *all* preceding actions and observations and generate a reflection score; we therefore use the last reflection score $s^{\mathrm{ref}}_{H}$ as the overall self-assessment for the episode.
#### Evaluation metrics\.
We measure task performance and reflection quality with the following metrics. *Task accuracy* $\mathsf{Acc} = \mathbb{P}(r=1)$ measures whether the agent's answer is correct. *Reflection accuracy* $\mathsf{Acc}_{\mathsf{ref}} = \mathbb{P}(s^{\mathrm{ref}} = r)$ measures the agreement between the agent's reflection and the actual outcome. We also consider two directional miscalibration rates to decompose the failure modes when $s^{\mathrm{ref}} \neq r$: *overconfidence rate* $\mathsf{OverConf} = \mathbb{P}(r=0 \mid s^{\mathrm{ref}}=1)$, the fraction of answers it flags as correct that are actually wrong; and *underconfidence rate* $\mathsf{UnderConf} = \mathbb{P}(r=1 \mid s^{\mathrm{ref}}=0)$, the fraction of answers it flags as wrong that are actually correct (lower is better for both).
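The four metrics above can be estimated from paired lists of outcomes and final reflection scores. The following is a minimal sketch, not the authors' evaluation code; the function name `reflection_metrics` and the choice to return `None` when a conditional rate is undefined (no commits or no error flags) are assumptions made for illustration.

```python
def reflection_metrics(outcomes, reflections):
    """Estimate Acc, Acc_ref, OverConf, and UnderConf from binary lists.

    outcomes[i] is the outcome reward r in {0, 1};
    reflections[i] is the final reflection score s_ref in {0, 1}.
    """
    if not outcomes or len(outcomes) != len(reflections):
        raise ValueError("need two non-empty lists of equal length")

    n = len(outcomes)
    pairs = list(zip(outcomes, reflections))
    # Outcomes of rollouts the agent committed to (s_ref = 1) or flagged (s_ref = 0).
    committed = [r for r, s in pairs if s == 1]
    flagged = [r for r, s in pairs if s == 0]

    return {
        "acc": sum(outcomes) / n,                          # P(r = 1)
        "acc_ref": sum(r == s for r, s in pairs) / n,      # P(s_ref = r)
        # P(r = 0 | s_ref = 1); undefined if the agent never commits.
        "over_conf": sum(1 - r for r in committed) / len(committed) if committed else None,
        # P(r = 1 | s_ref = 0); undefined if the agent never flags an error.
        "under_conf": sum(flagged) / len(flagged) if flagged else None,
    }


metrics = reflection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```

Note that both miscalibration rates condition on the reflection rather than on the outcome, so an agent that rarely flags errors can still have a high underconfidence rate if most of those few flags are wrong.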
#### A unified metric: Chow score\.
We introduce the $\mathsf{ChowScore}$ (Chow, [1957](https://arxiv.org/html/2606.14211#bib.bib6), [1970](https://arxiv.org/html/2606.14211#bib.bib7)) from statistical learning to the agentic setting as a unified metric for task accuracy and *reflection calibration*, i.e., the agent's ability to know when it has solved the task. Specifically, the $\mathsf{ChowScore}$ scores commits ($s^{\mathrm{ref}} = 1$) by correctness and self-flagged errors ($s^{\mathrm{ref}} = 0$) by a fixed credit $\beta \in [0,1)$:

$$\mathsf{ChowScore}_{\beta} = \mathbb{P}(s^{\mathrm{ref}}=1, r=1) + \beta \cdot \mathbb{P}(s^{\mathrm{ref}}=0).$$

$\mathsf{ChowScore}_{\beta} = \mathsf{Acc}$ for an always-commit agent and exceeds $\mathsf{Acc}$ whenever the agent's error flags are informative; $\beta$ controls how much we credit honest error detection (which can be interpreted as abstention). We treat $\mathsf{ChowScore}$ as the *main metric* since it captures both task accuracy and reflection calibration. We set $\beta = 0.1$ by default and report a sweep over $\beta \in [0, 0.5]$ in the ablations ([Section 4.5](https://arxiv.org/html/2606.14211#S4.SS5)).
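One way to make "informative" concrete is to expand the gap between the Chow score and task accuracy using only the definitions in this section. The short derivation below is a reading of those definitions, not a statement taken from the paper, and it assumes $\mathbb{P}(s^{\mathrm{ref}}=0) > 0$ so that $\mathsf{UnderConf}$ is defined.

```latex
\begin{align*}
\mathsf{Acc} &= \mathbb{P}(s^{\mathrm{ref}}=1, r=1) + \mathbb{P}(s^{\mathrm{ref}}=0, r=1), \\
\mathsf{ChowScore}_{\beta} - \mathsf{Acc}
  &= \beta \cdot \mathbb{P}(s^{\mathrm{ref}}=0) - \mathbb{P}(s^{\mathrm{ref}}=0, r=1) \\
  &= \mathbb{P}(s^{\mathrm{ref}}=0)\,\bigl(\beta - \mathsf{UnderConf}\bigr).
\end{align*}
```

Under this reading, the Chow score exceeds task accuracy exactly when the underconfidence rate falls below the credit $\beta$, and an always-commit agent ($\mathbb{P}(s^{\mathrm{ref}}=0)=0$) has a gap of zero.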
## 3 Methods
Figure 3: Training curves for GRPO+ and RefGRPO (Ours) in the single-turn setting using Qwen2.5-Coder-3B-Instruct as the base model. Left: underconfidence rate (*lower is better*); Middle: reflection accuracy (*higher is better*); Right: $\mathsf{ChowScore}$ at $\beta = 0.1$ (*higher is better*).

We analyze why outcome-only RL leads to uncalibrated models ([Section 3.1](https://arxiv.org/html/2606.14211#S3.SS1)) and propose a new algorithm that closes the reflection gap ([Section 3.2](https://arxiv.org/html/2606.14211#S3.SS2)).
### 3.1 Limitations of Outcome-Only RL
#### Background: GRPO\.
Group Relative Policy Optimization (GRPO; Shao et al. [2024](https://arxiv.org/html/2606.14211#bib.bib37); Guo et al. [2025](https://arxiv.org/html/2606.14211#bib.bib18)) generates $G$ rollouts $\{\tau_k\}_{k=1}^{G}$ for each prompt $q \sim \mathcal{D}$ and updates the policy via the following clipped surrogate objective (Schulman et al., [2017](https://arxiv.org/html/2606.14211#bib.bib36); Yu et al., [2025](https://arxiv.org/html/2606.14211#bib.bib45)):

$$J_{\text{GRPO}^{+}}(\theta) = \mathbb{E}_{q,\,\{\tau_k\}}\left[\tfrac{1}{N}\sum_{k=1}^{G}\sum_{t=1}^{|\tau_k|}\min\bigl(\rho_{k,t}\,A^{\mathrm{grpo}}_{k},\;\operatorname{clip}(\rho_{k,t},\,1-\varepsilon_{\text{low}},\,1+\varepsilon_{\text{high}})\,A^{\mathrm{grpo}}_{k}\bigr)\right], \tag{1}$$

where $\rho_{k,t} = \pi_{\theta}(a_{k,t} \mid h_{k,t}) / \pi_{\theta_{\mathrm{old}}}(a_{k,t} \mid h_{k,t})$ is the importance ratio and the advantage $A^{\mathrm{grpo}}_{k} = (r_k - \mu)/(\sigma + \varepsilon)$ is normalized with respect to the group mean $\mu$ and standard deviation $\sigma$ over outcome rewards $\{r_j\}_{j=1}^{G}$. We additionally (i) normalize by the total number of tokens $N = \sum_{k} |\tau_k|$ to remove length bias, (ii) apply the asymmetric $\operatorname{clip}(\rho,\,1-\varepsilon_{\text{low}},\,1+\varepsilon_{\text{high}})$ to encourage exploration and prevent entropy collapse, and (iii) drop the KL divergence term to avoid over-constraining the policy and allow learning to be driven more directly by verifiable rewards (Yu et al., [2025](https://arxiv.org/html/2606.14211#bib.bib45); Liu et al., [2025](https://arxiv.org/html/2606.14211#bib.bib30); He et al., [2025](https://arxiv.org/html/2606.14211#bib.bib20)). We denote the resulting objective in [Eq. 1](https://arxiv.org/html/2606.14211#S3.E1) as GRPO+ to distinguish it from the original GRPO (Shao et al., [2024](https://arxiv.org/html/2606.14211#bib.bib37)).
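To make the token-mean normalization and asymmetric clipping concrete, the sketch below evaluates the bracketed term of the GRPO+ objective for a single group, given per-token log-probabilities and precomputed per-rollout advantages. It is an illustrative scalar computation with no gradients, not the authors' training code; the function name and the clipping values `eps_low=0.2` and `eps_high=0.28` are placeholder assumptions, since this section does not state the values used.

```python
import math


def grpo_plus_objective(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate for one group, averaged over all tokens in the group.

    logp_new[k][t] and logp_old[k][t] are log-probabilities of token t in
    rollout k under the current and old policy; advantages[k] is A_k.
    eps_low / eps_high are hypothetical defaults, not values from the paper.
    """
    total = 0.0
    n_tokens = 0
    for new_k, old_k, adv_k in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_k, old_k):
            ratio = math.exp(lp_new - lp_old)  # importance ratio rho_{k,t}
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv_k, clipped * adv_k)
            n_tokens += 1
    # Divide by N = total token count, so long rollouts are not down-weighted per token.
    return total / n_tokens


# Two rollouts of different length with identical old and new policies (ratio = 1).
value = grpo_plus_objective([[0.0, 0.0, 0.0], [0.0]], [[0.0, 0.0, 0.0], [0.0]], [1.0, -1.0])
print(value)
```

Because the normalizer is the group's total token count rather than a per-rollout length, every token contributes equally to the objective, which is the length-bias correction described in item (i).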
#### Credit assignment mismatch\.
The advantage $A^{\mathrm{grpo}}_{k}$ in [Eq. 1](https://arxiv.org/html/2606.14211#S3.E1) depends only on the outcome $r_k$, so the gradient is purely driven by task correctness. On rollouts where the agent self-flags its final action as wrong ($s^{\mathrm{ref}}_{k,H} = 0$), the signal can be misleading for reflection quality:
- Honest error flag ($r_k = 0$, $s^{\mathrm{ref}}_{k,H} = 0$): even though the reflection is correct, the negative advantage $A^{\mathrm{grpo}}_{k} < 0$ induced by the outcome trains the agent to *stop flagging genuine errors*.
- Underconfident error flag ($r_k = 1$, $s^{\mathrm{ref}}_{k,H} = 0$): the positive advantage $A^{\mathrm{grpo}}_{k} > 0$ induced by the outcome trains the agent to *wrongly doubt its success*.

Both cases harm the agent's post-feedback reflection quality. As shown in [Figure 3](https://arxiv.org/html/2606.14211#S3.F3), under standard outcome-only RL (e.g., GRPO+), $\mathsf{UnderConf}$ drifts upward as training proceeds, while $\mathsf{Acc}_{\mathsf{ref}}$ and $\mathsf{ChowScore}$ lag far behind those of our method (RefGRPO, introduced in [Section 3.2](https://arxiv.org/html/2606.14211#S3.SS2)).
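The mismatch can be seen numerically on a toy group containing one rollout of each (outcome, reflection) combination. This is a small illustration under assumed inputs, not an experiment from the paper; the helper name `outcome_only_advantages` and the toy group are hypothetical.

```python
from statistics import mean, pstdev


def outcome_only_advantages(rewards, eps=1e-6):
    """Group-normalized advantage (r_k - mu) / (sigma + eps) over outcome rewards only."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: (outcome r, final reflection s_ref) for four rollouts.
rollouts = [(1, 1), (1, 0), (0, 0), (0, 1)]
advantages = outcome_only_advantages([r for r, _ in rollouts])

for (r, s), adv in zip(rollouts, advantages):
    label = "calibrated" if r == s else "miscalibrated"
    print(f"r={r} s_ref={s} ({label}): advantage={adv:+.3f}")
```

The reflection never enters the computation, so the underconfident rollout `(1, 0)` receives the same positive advantage as the calibrated success `(1, 1)`, and the honest error flag `(0, 0)` receives the same negative advantage as the overconfident failure `(0, 1)`.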
Algorithm 1: RefGRPO: GRPO Augmented with a Free Calibration Bonus and a Dynamic Schedule
Require: Schedule $\{\alpha(t)\}_{t \geq 0}$ with $\alpha(t) \geq 0$
1: for each training step $t$ do
2:   Get $\alpha \leftarrow \alpha(t)$ from the schedule
3:   for each rollout $k = 1, \dots, G$ do
4:     Extract the last reflection score $s^{\mathrm{ref}}_{k,H}$
5:     Get $c_k \leftarrow \mathbb{I}(s^{\mathrm{ref}}_{k,H} = r_k)$ // Calibration bonus
6:     Augment the reward with $\tilde{r}_k \leftarrow r_k + \alpha \cdot c_k$ // Augmented reward
7:   Compute $A^{\mathrm{grpo}}_{k} \leftarrow \dfrac{\tilde{r}_k - \mu}{\sigma + \varepsilon}$ with $\mu = \tfrac{1}{G}\sum_{j=1}^{G}\tilde{r}_j$ and $\sigma^{2} = \tfrac{1}{G}\sum_{j=1}^{G}(\tilde{r}_j - \mu)^{2}$ // GRPO advantage
8:   Update policy with $\{A^{\mathrm{grpo}}_{k}\}$ via the RL objective in [Eq. 1](https://arxiv.org/html/2606.14211#S3.E1)
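Steps 4 to 7 of Algorithm 1 amount to a few lines of reward shaping applied before group normalization. The sketch below is one possible rendering for a single group, assuming binary outcomes and reflections are already extracted; the function name `refgrpo_advantages` and the `eps` default are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def refgrpo_advantages(rewards, reflections, alpha, eps=1e-6):
    """Advantages from calibration-augmented rewards for one group of rollouts.

    rewards[k] is the outcome r_k in {0, 1}; reflections[k] is the last
    reflection score s_ref_{k,H} in {0, 1}; alpha is the current coefficient.
    """
    if len(rewards) != len(reflections):
        raise ValueError("rewards and reflections must have equal length")

    # Step 5: calibration bonus c_k = 1 if the reflection matches the outcome.
    bonuses = [1 if s == r else 0 for r, s in zip(rewards, reflections)]
    # Step 6: augmented reward r~_k = r_k + alpha * c_k.
    augmented = [r + alpha * c for r, c in zip(rewards, bonuses)]
    # Step 7: normalize by the group mean and population standard deviation.
    mu = mean(augmented)
    sigma = pstdev(augmented)
    return [(x - mu) / (sigma + eps) for x in augmented]


adv = refgrpo_advantages([1, 1, 0, 0], [1, 0, 0, 1], alpha=0.1)
print([round(a, 3) for a in adv])
```

Setting `alpha=0` recovers the outcome-only advantage, so the same routine covers both the baseline and the final stage of a schedule that decays the coefficient to zero.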
### 3.2 RefGRPO: Free Calibration from the Reflection-Outcome Contrast
We introduce a new algorithm RefGRPO ([Algorithm 1](https://arxiv.org/html/2606.14211#alg1)) to resolve the credit-assignment mismatch in outcome-only RL, thereby closing the reflection gap. RefGRPO has two key ingredients: (i) a free calibration bonus computed by contrasting the agent's reflection with the actual outcome, and (ii) a dynamic schedule on the calibration coefficient.
#### A free calibration bonus from the reflection\-outcome contrast\.
The key observation is that, for each rollout $k$, the agent's post-feedback reflection $s^{\mathrm{ref}}_{k,H}$ and the outcome reward $r_k$ are both available during training, and *the contrast between them* can be used to compute a calibration signal. Specifically, we compute the binary calibration bonus $c_k = \mathbb{I}(s^{\mathrm{ref}}_{k,H} = r_k)$ and augment the outcome reward $r_k$ with this calibration bonus (with coefficient $\alpha(t) \geq 0$ at training step $t$):

$$\tilde{r}_k(t) = r_k + \alpha(t) \cdot c_k.$$

This augmentation assigns positive relative advantage to correct post-feedback reflection *independently* of the task outcome. As shown in [Figure 1](https://arxiv.org/html/2606.14211#S0.F1), no matter whether the task succeeds or not, a well-calibrated rollout receives higher reward than a miscalibrated one with the same outcome, which pushes the agent toward better-calibrated reflection. [Figure 3](https://arxiv.org/html/2606.14211#S3.F3) further corroborates this effect: compared to GRPO+, RefGRPO significantly reduces $\mathsf{UnderConf}$ and improves $\mathsf{Acc}_{\mathsf{ref}}$, while achieving higher $\mathsf{ChowScore}$.

Notably, the calibration bonus is *free*: it is computed by contrasting the agent's own reflection with the actual outcome, requiring no additional reward model, LLM judge, or external annotation.
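Enumerating the four (outcome, reflection) combinations shows how the augmented reward orders rollouts. The table below is a worked instance of the formula $\tilde{r}_k = r_k + \alpha \cdot c_k$; the observation about $0 < \alpha < 1$ is simple arithmetic on these values rather than a claim quoted from the paper.

```latex
\begin{array}{c|c|c|c|l}
r_k & s^{\mathrm{ref}}_{k,H} & c_k & \tilde{r}_k & \text{case} \\ \hline
1 & 1 & 1 & 1 + \alpha & \text{correct, committed} \\
1 & 0 & 0 & 1          & \text{correct, underconfident flag} \\
0 & 0 & 1 & \alpha     & \text{incorrect, honest flag} \\
0 & 1 & 0 & 0          & \text{incorrect, overconfident commit}
\end{array}
```

For any $0 < \alpha < 1$ the ordering is $1 + \alpha > 1 > \alpha > 0$: within each outcome the calibrated rollout ranks higher, while a correct answer still outranks any incorrect one.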
#### A dynamic schedule on the calibration bonus coefficient\.
While the augmented reward $\tilde{r}_k$ helps the agent improve its reflection calibration, it can slightly reduce task accuracy, as part of the signal is reserved for calibration quality rather than correctness. To balance the two, we introduce a dynamic schedule on the calibration bonus coefficient $\alpha(t) \geq 0$ that decays the calibration bonus as training proceeds. We instantiate $\alpha(t)$ as a simple *two-stage* schedule:

$$\alpha(t) = \begin{cases} \alpha_0 & t \leq \gamma \cdot T, \\ \alpha_1 & t > \gamma \cdot T, \end{cases} \qquad \alpha_1 < \alpha_0,$$

where $T$ is the total number of training steps. The relatively larger coefficient $\alpha_0$ in the first $\gamma \cdot T$ steps front-loads calibration; the smaller coefficient $\alpha_1$ in the remaining $(1-\gamma) \cdot T$ steps lets the model focus on task performance while largely retaining the calibration ability built up in the first stage. By default we set $\alpha_0 = 0.1$, $\alpha_1 = 0$, and $\gamma = 2/3$. As shown in [Figure 2](https://arxiv.org/html/2606.14211#S1.F2), [Table 1](https://arxiv.org/html/2606.14211#S4.T1), and [Table 2](https://arxiv.org/html/2606.14211#S4.T2), RefGRPO with this schedule improves both calibration quality and task accuracy compared to outcome-only RL baselines. We provide ablations of the dynamic schedule in [Section 4.5](https://arxiv.org/html/2606.14211#S4.SS5).
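The two-stage schedule is a step function of the training step. A minimal sketch follows, using the stated defaults $\alpha_0 = 0.1$, $\alpha_1 = 0$, $\gamma = 2/3$; the function name `alpha_schedule` and the argument validation are illustrative assumptions.

```python
def alpha_schedule(t, total_steps, alpha0=0.1, alpha1=0.0, gamma=2 / 3):
    """Two-stage calibration coefficient: alpha0 while t <= gamma * T, else alpha1."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= alpha1 < alpha0:
        raise ValueError("expected 0 <= alpha1 < alpha0")
    # Floating-point rounding can matter exactly at the boundary t = gamma * T.
    return alpha0 if t <= gamma * total_steps else alpha1


# With T = 300 steps, the bonus is active early and switched off late in training.
print(alpha_schedule(100, 300), alpha_schedule(250, 300))
```

With the default $\alpha_1 = 0$, the final third of training reduces to the outcome-only objective.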
## 4 Experiments
Table 1: Single-turn results across five metrics: task accuracy, reflection accuracy, overconfidence rate, underconfidence rate, and Chow score. All values are percentages. Best results are highlighted in bold. Results are averaged across 5 benchmarks; per-dataset breakdowns are provided in [Section A.2](https://arxiv.org/html/2606.14211#A1.SS2).

### 4.1 Experimental Setup
We evaluate on text\-to\-SQL, a widely\-studied agentic environment with verifiable rewards\(Yu et al\.,[2018](https://arxiv.org/html/2606.14211#bib.bib46); Li et al\.,[2023](https://arxiv.org/html/2606.14211#bib.bib29); Gao et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib14); Li et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib28); Pourreza et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib35); Ma et al\.,[2025a](https://arxiv.org/html/2606.14211#bib.bib31)\)\. It provides unambiguous binary outcomes \(execution correctness\) and concrete environment feedback \(SQL query results or error messages\) at a difficulty level suited to open\-source models\.
#### Data and models\.
We train on 4,660 text-to-SQL problems drawn from the training sets of Spider (Yu et al., [2018](https://arxiv.org/html/2606.14211#bib.bib46)) and OmniSQL (Li et al., [2025](https://arxiv.org/html/2606.14211#bib.bib28)), and evaluate on five standard benchmarks: Spider-Dev (Yu et al., [2018](https://arxiv.org/html/2606.14211#bib.bib46)), Spider-Domain Knowledge (DK; Gan et al. [2021](https://arxiv.org/html/2606.14211#bib.bib12)), Spider-Realistic (Deng et al., [2021](https://arxiv.org/html/2606.14211#bib.bib11)), Spider-Test (Yu et al., [2018](https://arxiv.org/html/2606.14211#bib.bib46)), and Bird-Dev (Li et al., [2023](https://arxiv.org/html/2606.14211#bib.bib29)). We report results averaged across the five benchmarks in the main text and provide per-dataset breakdowns in the appendix ([Appendix A](https://arxiv.org/html/2606.14211#A1)). We experiment with three instruction-tuned models across two scales: Qwen2.5-Coder-3B/7B-Instruct (Hui et al., [2024](https://arxiv.org/html/2606.14211#bib.bib22)) and Llama-3.2-3B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.14211#bib.bib16)). (Footnote 2: We use instruction-tuned rather than RL-trained models as our base to avoid potential failure modes induced by standard RL training ([Section 3.1](https://arxiv.org/html/2606.14211#S3.SS1.SSS0.Px2)).)
#### Baselines\.
We use GRPO+ ([Eq. 1](https://arxiv.org/html/2606.14211#S3.E1)) as the outcome-only RL baseline; it augments the original GRPO objective (Shao et al., [2024](https://arxiv.org/html/2606.14211#bib.bib37)) with DAPO-style improvements (Yu et al., [2025](https://arxiv.org/html/2606.14211#bib.bib45)): asymmetric clipping, token-mean normalization, and the removal of the KL divergence term. Our RefGRPO ([Algorithm 1](https://arxiv.org/html/2606.14211#alg1)) further augments GRPO+ with a free calibration bonus and a dynamic schedule on its coefficient ([Section 3.2](https://arxiv.org/html/2606.14211#S3.SS2.SSS0.Px2)). We train both GRPO+ and RefGRPO for 6 epochs, using the same hyperparameters detailed in [Section A.1](https://arxiv.org/html/2606.14211#A1.SS1). We additionally benchmark against two open-source 7B SQL specialists: OmniSQL-7B (SFT with 2.5M CoT data; Li et al. [2025](https://arxiv.org/html/2606.14211#bib.bib28)) and SQL-R1-7B (GRPO with outcome reward; Ma et al. [2025a](https://arxiv.org/html/2606.14211#bib.bib31)).
#### Metrics\.
We report five metrics for both single- and multi-turn (up to 6 turns) results: task accuracy $\mathsf{Acc}$, reflection accuracy $\mathsf{Acc}_{\mathsf{ref}}$, the two miscalibration rates $\mathsf{OverConf}$ and $\mathsf{UnderConf}$, and the unified metric $\mathsf{ChowScore}$. (Footnote 3: To reduce clutter, in this section we drop the % symbol after reported results whenever context makes the unit unambiguous.) We consider $\mathsf{ChowScore}$ the *main metric* that captures both task accuracy and reflection calibration ([Section 2](https://arxiv.org/html/2606.14211#S2.SS0.SSS0.Px3)). Following prior work (Li et al., [2025](https://arxiv.org/html/2606.14211#bib.bib28); Ma et al., [2025a](https://arxiv.org/html/2606.14211#bib.bib31)), the main results are reported with greedy decoding; in [Section 4.4](https://arxiv.org/html/2606.14211#S4.SS4), we report aggregated results with $k = 8$ samples at temperature $0.6$. We default to $\mathsf{ChowScore}_{0.1}$ and provide a sweep over error-flag credits $\beta \in [0, 0.5]$ in the ablations ([Section 4.5](https://arxiv.org/html/2606.14211#S4.SS5)), where we also evaluate error detection via other metrics such as precision and recall.
### 4.2 Main Results
#### Single\-turn results\.
[Table 1](https://arxiv.org/html/2606.14211#S4.T1) reports results in the single-turn setting for two base models (Llama-3.2-3B-Instruct and Qwen2.5-Coder-3B-Instruct). RefGRPO dominates GRPO+ on all five metrics across both base models, most notably improving $\mathsf{Acc}_{\mathsf{ref}}$ (Qwen: GRPO+ 76.6 → RefGRPO 79.7; Llama: GRPO+ 74.5 → RefGRPO 76.0) and reducing $\mathsf{UnderConf}$ (Qwen: GRPO+ 23.7 → RefGRPO 1.3; Llama: GRPO+ 30.9 → RefGRPO 23.5). Compared to the base, GRPO+ only marginally reduces $\mathsf{UnderConf}$ (Qwen: Base 25.2 → GRPO+ 23.7; Llama: Base 31.4 → GRPO+ 30.9), while RefGRPO reduces it substantially (Qwen: Base 25.2 → RefGRPO 1.3; Llama: Base 31.4 → RefGRPO 23.5). The large improvement in reflection quality confirms that the free calibration bonus directly addresses the credit-assignment mismatch in outcome-only RL ([Section 3.1](https://arxiv.org/html/2606.14211#S3.SS1.SSS0.Px2)). Overall, these effects translate into substantial $\mathsf{ChowScore}$ gains for both base models (Qwen: GRPO+ 68.5 → RefGRPO 71.1; Llama: GRPO+ 58.4 → RefGRPO 62.2), the main metric that captures both task accuracy and reflection calibration.
Table 2: Multi-turn results across five metrics: task accuracy, reflection accuracy, overconfidence rate, underconfidence rate, and Chow score. All values are percentages. Best results are highlighted in bold. Results are averaged across 5 benchmarks; per-dataset breakdowns are provided in [Section A.2](https://arxiv.org/html/2606.14211#A1.SS2).
#### Multi\-turn results\.
The multi-turn setting tests long-horizon agentic performance with error correction. [Table 2](https://arxiv.org/html/2606.14211#S4.T2) reports results across five metrics at both 3B and 7B scales. The results match the pattern in the single-turn setting: RefGRPO outperforms GRPO+ across metrics, especially in improving $\mathsf{Acc}_{\mathsf{ref}}$ (Qwen-7B: GRPO+ 76.2 → RefGRPO 77.8) and reducing $\mathsf{UnderConf}$ (Qwen-7B: GRPO+ 44.4 → RefGRPO 7.7; Qwen-3B: GRPO+ 36.5 → RefGRPO 2.2). On Qwen-3B, GRPO+ achieves slightly better $\mathsf{OverConf}$ (GRPO+ 24.0 vs. RefGRPO 24.8) but at the cost of worsening $\mathsf{UnderConf}$ over the base model (Base 18.7 → GRPO+ 36.5); RefGRPO instead reduces both $\mathsf{OverConf}$ and $\mathsf{UnderConf}$, with especially substantial improvement in $\mathsf{UnderConf}$ (Base 18.7 → RefGRPO 2.2). On Qwen-7B, RefGRPO also meaningfully lifts $\mathsf{Acc}$ (GRPO+ 75.1 → RefGRPO 76.5). Overall, these effects translate into substantial $\mathsf{ChowScore}$ gains at both scales (Qwen-7B: GRPO+ 73.0 → RefGRPO 76.5; Qwen-3B: GRPO+ 70.8 → RefGRPO 73.1), the main metric that captures both task accuracy and reflection calibration.
Figure 4:Comparison of𝖠𝖼𝖼𝗋𝖾𝖿−𝖠𝖼𝖼\{\\mathsf\{Acc\}\_\{\\mathsf\{ref\}\}\}\-\{\\mathsf\{Acc\}\}across three 7B models\.*Greater than zero indicates the reflection is informative\.*Best result is highlighted inbold\. Results are averaged across 5 benchmarks; per\-dataset breakdowns are provided in[SectionA\.3](https://arxiv.org/html/2606.14211#A1.SS3)\.
#### Comparison with other 7B SQL specialists\.
We compare our 7B model against two open-weight 7B SQL specialists: OmniSQL-7B (SFT with 2.5M CoT data; Li et al. [2025](https://arxiv.org/html/2606.14211#bib.bib28)) and SQL-R1-7B (GRPO with outcome reward; Ma et al. [2025a](https://arxiv.org/html/2606.14211#bib.bib31)). Since these specialists use different training data and hyperparameters, we focus on the calibration delta $\Delta = \mathsf{Acc}_{\mathsf{ref}} - \mathsf{Acc}$, which isolates the reflection contribution beyond raw task accuracy: a trivial “always-commit” model has $\Delta = 0$, while a better-calibrated model achieves $\Delta > 0$ by correctly flagging some of its own errors. As shown in [Figure 4](https://arxiv.org/html/2606.14211#S4.F4), RefGRPO achieves the largest $\Delta$ at $+1.3$; SQL-R1 is barely positive ($+0.2$); and OmniSQL is even *negative* ($-1.0$), meaning its reflection is not informative: the model can answer some questions correctly but cannot tell whether its own answers are right, even with environment feedback (e.g., SQL query results).
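To make the calibration delta concrete, the following minimal sketch computes $\Delta$ from per-sample pairs of reflection score and outcome. It assumes that reflection accuracy is the fraction of samples whose binary reflection matches the binary outcome, a reading that is consistent with the always-commit case giving $\Delta = 0$; the function names and toy data are illustrative only.

```python
from typing import Sequence


def task_accuracy(outcomes: Sequence[int]) -> float:
    """Fraction of samples whose outcome r is 1 (correct)."""
    return sum(outcomes) / len(outcomes)


def reflection_accuracy(reflections: Sequence[int], outcomes: Sequence[int]) -> float:
    """Fraction of samples whose reflection s_ref equals the outcome r (assumed definition)."""
    matches = sum(int(s == r) for s, r in zip(reflections, outcomes))
    return matches / len(outcomes)


def calibration_delta(reflections: Sequence[int], outcomes: Sequence[int]) -> float:
    """Delta = Acc_ref - Acc."""
    return reflection_accuracy(reflections, outcomes) - task_accuracy(outcomes)


# Toy data (hypothetical): 3 of 5 answers are correct.
outcomes = [1, 1, 1, 0, 0]
always_commit = [1, 1, 1, 1, 1]      # never flags an error
flags_one_error = [1, 1, 1, 0, 1]    # correctly flags one of the two errors
underconfident = [0, 1, 1, 1, 1]     # rejects a correct answer and misses both errors

print(calibration_delta(always_commit, outcomes))    # zero: reflection adds nothing
print(calibration_delta(flags_one_error, outcomes))  # positive: reflection is informative
print(calibration_delta(underconfident, outcomes))   # negative: reflection is misleading
```

Under this reading, a negative $\Delta$ arises whenever the reflection disagrees with the outcome more often than a constant "commit" would.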
Figure 5: Self-improvement and selective prediction results using Qwen2.5-Coder-3B-Instruct as the base model (averaged across 5 benchmarks; per-dataset breakdowns are provided in [Sections A.4](https://arxiv.org/html/2606.14211#A1.SS4) and [A.5](https://arxiv.org/html/2606.14211#A1.SS5)). (a) Task accuracy before (Ckpt) and after (Self) self-improvement in the single-turn setting. (b) Commit rate $\mathbb{P}(s^{\mathrm{ref}} = 1)$ during single-turn self-improvement training. (c) Selective-prediction lift $\mathsf{Acc}_{\mathsf{sel}}@8 - \mathrm{Avg}@8$ for GRPO+ vs. RefGRPO in single-turn and multi-turn settings.
### 4.3 RefGRPO Enables Verifier-Free Self-Improvement
Calibration turns the agent into its own verifier: its calibrated reflection becomes an informative reward signal, so we can continue RL training using reflection scores as pseudo-rewards *without* any outcome supervision. To test this, we run a self-supervised RL phase from GRPO+ and RefGRPO checkpoints taken before performance plateaus, with the outcome reward replaced by the agent’s reflection score $s^{\mathrm{ref}}$ (no calibration bonus, $\alpha = 0$); reflection tokens are masked so only task tokens receive gradients. We run this phase for 2 epochs with the same hyperparameters as the original training run.
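The setup above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Rollout` container, its field names, and the per-token segment flags are hypothetical, and the sketch only shows how the pseudo-reward and the gradient mask would be derived from one rollout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical container for one rollout in the self-supervised phase."""
    tokens: List[str]
    is_reflection: List[bool]  # True for tokens that belong to the reflection segment
    s_ref: int                 # the agent's own binary reflection score (0 or 1)


def pseudo_reward(rollout: Rollout) -> float:
    """The reflection score replaces the outcome reward; no calibration bonus (alpha = 0)."""
    return float(rollout.s_ref)


def loss_mask(rollout: Rollout) -> List[int]:
    """1 for task tokens (receive gradients), 0 for reflection tokens (masked out)."""
    return [0 if flag else 1 for flag in rollout.is_reflection]


# Toy rollout (hypothetical tokens): three task tokens followed by two reflection tokens.
example = Rollout(
    tokens=["SELECT", "name", "FROM", "<reflect>", "correct"],
    is_reflection=[False, False, False, True, True],
    s_ref=1,
)

print(pseudo_reward(example))
print(loss_mask(example))
```

Masking the reflection tokens means the policy cannot raise its reward simply by changing what it says about itself; only the task tokens are updated.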
#### Results\.
[Figure 5](https://arxiv.org/html/2606.14211#S4.F5)(a) shows the key result: starting from a RefGRPO checkpoint, self-improvement lifts task accuracy by $+2.8$ points (67.1 → 69.9), while starting from a GRPO+ checkpoint yields only $+0.5$ points (68.3 → 68.8). Despite starting from a lower-accuracy checkpoint, the RefGRPO trajectory ends with higher final task accuracy than GRPO+ (69.9 vs. 68.8).
[Figure 5](https://arxiv.org/html/2606.14211#S4.F5)(b) explains why. During self-improvement, the GRPO+ checkpoint’s commit rate $\mathbb{P}(s^{\mathrm{ref}} = 1)$ rapidly saturates near $1.0$: the model asserts confidence on nearly every rollout, so almost all self-rewards collapse to $1$ regardless of correctness. This creates two problems: (i) the pseudo-rewards are systematically miscalibrated: since the actual task accuracy is below 70% ([Figure 5](https://arxiv.org/html/2606.14211#S4.F5)(a)), the agent rewards itself for many incorrect rollouts; and (ii) GRPO+’s within-group normalization on near-constant rewards produces near-zero advantages, effectively halting learning. The RefGRPO checkpoint, by contrast, retains a more calibrated commit rate of around $0.83$, so the pseudo-reward inherits the informativeness of $s^{\mathrm{ref}}$ induced by the calibration bonus, and continues to drive policy improvement.
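Problem (ii) can be illustrated with a small sketch. It assumes the common GRPO-style advantage (mean-centered and divided by the within-group standard deviation); the exact normalization used by GRPO+ is not restated here, and the group size of 8 and the independence of commits across rollouts are assumptions made purely for illustration.

```python
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Mean-center and std-normalize rewards within one group of rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    centered = [r - mean for r in rewards]
    std = (sum(c * c for c in centered) / n) ** 0.5
    return [c / (std + eps) for c in centered]


def prob_all_commit(commit_rate: float, group_size: int) -> float:
    """Chance that every rollout in a group self-rewards 1, assuming independent commits."""
    return commit_rate ** group_size


# A saturated group: every rollout commits, so all self-rewards equal 1.
saturated = group_advantages([1.0] * 8)
# A group where one rollout flags itself as incorrect.
mixed = group_advantages([1.0] * 7 + [0.0])

print(saturated)                     # all zeros: the group carries no learning signal
print(mixed)                         # non-zero advantages
print(prob_all_commit(0.99, 8))      # most groups are saturated near commit rate 1.0
print(prob_all_commit(0.83, 8))      # far fewer groups are saturated at commit rate 0.83
```

Under these assumptions, a commit rate near 1.0 leaves the large majority of groups with identical rewards and hence zero advantage, whereas a commit rate around 0.83 keeps most groups informative.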
### 4.4 RefGRPO Enables Better Test-Time Selective Prediction
In test-time scaling with $k$ rollouts per question, the agent can act as its own verifier, using its reflection $s^{\mathrm{ref}}$ to select which rollouts to commit to. We measure this with *selective prediction accuracy* $\mathsf{Acc}_{\mathsf{sel}}@k$, the expected correctness over the committed set $C(q) = \{i : s^{\mathrm{ref}}_{i} = 1\}$:

$$\mathsf{Acc}_{\mathsf{sel}}@k = \mathbb{E}_{q}\left[\frac{\sum_{i=1}^{k} \mathbb{I}(s^{\mathrm{ref}}_{i} = 1)\, r_{i}}{\max\bigl(1,\, |C(q)|\bigr)}\right].$$

On each question we average correctness over the rollouts the agent commits to ($s^{\mathrm{ref}}_{i} = 1$); if it commits to none of the $k$ samples, the question scores $0$. An always-commit agent recovers $\mathsf{Acc}_{\mathsf{sel}}@k = \mathrm{Avg}@k$, while a well-calibrated agent achieves $\mathsf{Acc}_{\mathsf{sel}}@k > \mathrm{Avg}@k$ by selectively committing to correct rollouts.
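The definition above translates directly into a few lines of code. The sketch below is illustrative: the data layout (one list of `(s_ref, r)` pairs per question) and the function names are assumptions, and the toy data is made up.

```python
from typing import Sequence, Tuple

Rollout = Tuple[int, int]  # (s_ref, r): reflection score and correctness, both in {0, 1}


def acc_sel_at_k(questions: Sequence[Sequence[Rollout]]) -> float:
    """Selective prediction accuracy: mean correctness over committed rollouts, per question."""
    per_question = []
    for rollouts in questions:
        committed = [r for s_ref, r in rollouts if s_ref == 1]
        # max(1, |C(q)|) makes a question with no committed rollouts score 0.
        per_question.append(sum(committed) / max(1, len(committed)))
    return sum(per_question) / len(per_question)


def avg_at_k(questions: Sequence[Sequence[Rollout]]) -> float:
    """Average correctness over all k rollouts, per question."""
    per_question = [sum(r for _, r in rollouts) / len(rollouts) for rollouts in questions]
    return sum(per_question) / len(per_question)


# Toy data (hypothetical) with k = 4 rollouts per question.
selective = [
    [(1, 1), (1, 1), (0, 0), (0, 0)],  # commits only to the two correct rollouts
    [(0, 1), (0, 0), (0, 0), (0, 0)],  # commits to nothing, so the question scores 0
]
# The same outcomes, but the agent commits to every rollout.
always_commit = [[(1, r) for _, r in rollouts] for rollouts in selective]

print(acc_sel_at_k(selective), avg_at_k(selective))
print(acc_sel_at_k(always_commit), avg_at_k(always_commit))
```

The second question shows the cost of abstaining on everything: it scores 0 even though one rollout was correct, so excessive underconfidence is penalized by this metric as well.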
#### Results\.
[Figure 5](https://arxiv.org/html/2606.14211#S4.F5)(c) reports the selective-prediction *lift* $\mathsf{Acc}_{\mathsf{sel}}@k - \mathrm{Avg}@k$ for GRPO+ and RefGRPO in both the single-turn and multi-turn settings. RefGRPO attains the higher lift in *both* regimes: GRPO+ $+0.6$ → RefGRPO $+1.6$ in single-turn, and GRPO+ $+0.9$ → RefGRPO $+1.1$ in multi-turn, a meaningful improvement especially in the single-turn setting.
### 4.5 Analyses and Ablations
We conduct additional analyses and ablations using Qwen2\.5\-Coder\-3B\-Instruct, reporting averaged results across the 5 evaluation benchmarks\.
Figure 6: Comparison of GRPO+, RefGRPO with a fixed schedule $\alpha = 0.1$, and RefGRPO with a dynamic schedule $\alpha: 0.1 \to 0$ in the single-turn setting using Qwen2.5-Coder-3B-Instruct as the base model (averaged across 5 benchmarks). (a) Task accuracy vs. $\mathsf{ChowScore}$ at $\beta = 0.1$. (b) Precision vs. recall for error detection. (c) $\mathsf{ChowScore}$ across error-flag credits $\beta \in [0, 0.5]$.

#### Dynamic vs. fixed schedule.
[Figure 6](https://arxiv.org/html/2606.14211#S4.F6)(a) plots $\mathsf{Acc}$ vs. $\mathsf{ChowScore}$ for GRPO+, RefGRPO with a fixed schedule $\alpha = 0.1$, and RefGRPO with a dynamic schedule $\alpha: 0.1 \to 0$. RefGRPO with a fixed schedule improves $\mathsf{ChowScore}$ over GRPO+ but at the cost of a small drop in $\mathsf{Acc}$; RefGRPO with a dynamic schedule front-loads calibration and then switches to optimizing task performance, recovering the drop in $\mathsf{Acc}$ and dominating GRPO+ on both metrics.
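The difference between the two schedules can be sketched in code. Note the assumptions: this section does not specify the shape of the dynamic schedule or the exact form of the bonus, so the linear decay below and the agreement bonus $\mathbb{I}(s^{\mathrm{ref}} = r)$ added to the outcome reward are hypothetical choices made for illustration, as are all function and parameter names.

```python
def alpha_schedule(step: int, total_steps: int,
                   alpha_start: float = 0.1, alpha_end: float = 0.0) -> float:
    """Linearly interpolate the calibration coefficient (assumed shape, for illustration)."""
    frac = min(max(step / max(1, total_steps), 0.0), 1.0)
    return alpha_start + (alpha_end - alpha_start) * frac


def shaped_reward(outcome: int, s_ref: int, alpha: float) -> float:
    """Outcome reward plus alpha times an agreement bonus (assumed bonus form)."""
    bonus = 1.0 if s_ref == outcome else 0.0
    return float(outcome) + alpha * bonus


total_steps = 100  # hypothetical training length
for step in (0, 50, 100):
    alpha = alpha_schedule(step, total_steps)
    # An incorrect rollout that the agent correctly flags (outcome = 0, s_ref = 0)
    # earns only the bonus, which shrinks to zero as training proceeds.
    print(step, alpha, shaped_reward(outcome=0, s_ref=0, alpha=alpha))
```

With a fixed schedule, `alpha` would stay at 0.1 throughout, so the bonus keeps competing with the outcome reward; with a decaying schedule, the reward reduces to the plain outcome reward by the end of training.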
#### Error detection analysis\.
[Figure 6](https://arxiv.org/html/2606.14211#S4.F6)(b) shows the precision-recall trade-off for error detection. GRPO+ has moderate precision ($76.3$) and low recall ($26.3$). Both RefGRPO variants reach precision $\geq 98.7$. The variant with a fixed schedule $\alpha = 0.1$ achieves the highest recall ($30.2$), while the variant with a dynamic schedule $\alpha: 0.1 \to 0$ accepts a small recall drop in exchange for higher $\mathsf{Acc}$ (as shown in [Figure 6](https://arxiv.org/html/2606.14211#S4.F6)(a)).
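For reference, error-detection precision and recall can be computed as in the sketch below. It assumes the standard reading in which a "positive" is a rollout the agent flags as incorrect ($s^{\mathrm{ref}} = 0$) and a true positive is a flagged rollout that is in fact incorrect ($r = 0$); the function name, the zero-division convention, and the toy data are illustrative.

```python
from typing import Sequence, Tuple


def error_detection_pr(reflections: Sequence[int], outcomes: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of flagging errors, where s_ref = 0 means 'flagged as incorrect'."""
    flagged = [(s, r) for s, r in zip(reflections, outcomes) if s == 0]
    true_positives = sum(1 for _, r in flagged if r == 0)
    total_errors = sum(1 for r in outcomes if r == 0)
    # Convention for this sketch: return 0.0 when the denominator is empty.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_errors if total_errors else 0.0
    return precision, recall


# Toy data (hypothetical): 4 correct and 4 incorrect rollouts.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0]
cautious_flagger = [1, 1, 1, 1, 0, 1, 1, 1]  # flags rarely, but only real errors
noisy_flagger = [0, 1, 1, 1, 0, 0, 1, 1]     # flags more often, including a correct rollout

print(error_detection_pr(cautious_flagger, outcomes))  # high precision, low recall
print(error_detection_pr(noisy_flagger, outcomes))     # lower precision, higher recall
```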
#### $\mathsf{ChowScore}$ across error-flag credits.
[Figure 6](https://arxiv.org/html/2606.14211#S4.F6)(c) plots $\mathsf{ChowScore}_{\beta}$ across error-flag credits $\beta \in [0, 0.5]$. RefGRPO dominates GRPO+ across the entire range: RefGRPO with a dynamic schedule $\alpha: 0.1 \to 0$ leads by $+2.8$ at $\beta = 0$ and $+1.9$ at $\beta = 0.5$. This demonstrates that the improvements in $\mathsf{ChowScore}_{\beta}$ induced by RefGRPO are robust to the choice of $\beta$.
#### Calibration coefficient sweep \(multi\-turn\)\.
[Table 3](https://arxiv.org/html/2606.14211#S4.T3) sweeps the calibration coefficient with fixed schedules $\alpha \in \{0, 0.1, 0.2\}$ and compares against the variant with a dynamic schedule ($\alpha: 0.1 \to 0$) in the multi-turn setting. Increasing $\alpha$ improves $\mathsf{Acc}_{\mathsf{ref}}$ monotonically (from $75.8$ at $\alpha = 0$ to $76.5$ at $\alpha = 0.1$, and to $77.4$ at $\alpha = 0.2$) but can degrade $\mathsf{Acc}$ (e.g., from $72.5$ at $\alpha = 0.1$ to $71.5$ at $\alpha = 0.2$); the dynamic schedule $\alpha: 0.1 \to 0$ attains the highest $\mathsf{Acc}$ and $\mathsf{ChowScore}$ simultaneously.
Table 3: Effect of the calibration coefficient schedule for RefGRPO in the multi-turn setting using Qwen2.5-Coder-3B-Instruct as the base model: fixed schedules $\alpha \in \{0, 0.1, 0.2\}$ versus a dynamic schedule $\alpha: 0.1 \to 0$. Note that $\alpha = 0$ recovers GRPO+. All values are percentages. Best results are highlighted in bold. Results are averaged across 5 benchmarks.
## 5 Related Work
#### Reinforcement learning for LLMs\.
RL has become a central post\-training paradigm for large language models, driving substantial gains across diverse domains\(Ouyang et al\.,[2022](https://arxiv.org/html/2606.14211#bib.bib34); Bai et al\.,[2022](https://arxiv.org/html/2606.14211#bib.bib2); Achiam et al\.,[2023](https://arxiv.org/html/2606.14211#bib.bib1); Comanici et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib9)\)\. Progress is especially pronounced in domains with verifiable rewards such as math and coding, where the correctness of each action provides a clean training signal\(Shao et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib37); Jaech et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib23); Guo et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib18); Yu et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib45); Liu et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib30); Zheng et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib48); He et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib20)\)\. More recently, attention has shifted to agentic settings where models interact with environments over multiple turns and tool calls\(Jin et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib24); Wang et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib42); Cao et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib5); Wei et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib43); Gandhi et al\.,[2026](https://arxiv.org/html/2606.14211#bib.bib13); Zhang et al\.,[2026](https://arxiv.org/html/2606.14211#bib.bib47)\)\. However, the primary objective of these works remains task accuracy alone\. This includes concurrent and independent work\(Shrivastava et al\.,[2026](https://arxiv.org/html/2606.14211#bib.bib39)\), which also learns from environment feedback but optimizes a cross\-entropy loss on observation tokens rather than training the agent’s reflection to be calibrated\. 
We instead develop an RL algorithm that, beyond task accuracy, trains the agent to produce calibrated reflection on environment feedback: with a free calibration bonus and a dynamic schedule on the calibration coefficient, our method simultaneously improves reflection calibration and task performance relative to standard outcome\-only RL\.
#### Selective prediction and calibration\.
Selective prediction with an abstention option, and the associated error\-reject tradeoff, are classical topics in statistical learning\(Chow,[1957](https://arxiv.org/html/2606.14211#bib.bib6),[1970](https://arxiv.org/html/2606.14211#bib.bib7)\): abstaining on uncertain inputs yields polynomial sample\-complexity speedups for passive learners\(Bousquet and Zhivotovskiy,[2021](https://arxiv.org/html/2606.14211#bib.bib4)\)and exponential speedups for active learners\(Zhu and Nowak,[2022a](https://arxiv.org/html/2606.14211#bib.bib49),[b](https://arxiv.org/html/2606.14211#bib.bib50)\)\. A closely related concept, confidence calibration, studies how to align self\-predicted confidence with empirical correctness\(Guo et al\.,[2017](https://arxiv.org/html/2606.14211#bib.bib17)\)\. These ideas have recently been imported into LLM research, where models are probed for whether they know what they know\(Kadavath et al\.,[2022](https://arxiv.org/html/2606.14211#bib.bib25); Tian et al\.,[2023](https://arxiv.org/html/2606.14211#bib.bib41); Xiong et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib44)\)or trained to be better calibrated\(Tao et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib40); Leng et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib27); Bani\-Harouni et al\.,[2026](https://arxiv.org/html/2606.14211#bib.bib3)\)\. Closest to our setting isDamani et al\. \([2026](https://arxiv.org/html/2606.14211#bib.bib10)\), which uses RL with a*fixed\-coefficient*Brier bonus on the model’s confidence in its self\-generated answer*without environment feedback*\. We differ in two key respects\. First, our agent reflects on*environment feedback*—execution results or error messages—in the agentic setting, a strictly richer signal than self\-assessment over a self\-generated answer\. 
Second, we introduce a*dynamic*schedule on the calibration coefficient, which is important for simultaneously improving calibration and task accuracy:[Figure6](https://arxiv.org/html/2606.14211#S4.F6)\(a\) shows that a fixed coefficient improves calibration at the cost of degrading task accuracy\.
#### Self\-improvement in LLMs\.
As LLMs are increasingly deployed as autonomous agents, a central question is how they can improve themselves over time\. One line of work performs iterative verbal self\-refinement at inference time, guided by self\-generated critiques \(e\.g\., via prompting\) or external feedback\(Shinn et al\.,[2023](https://arxiv.org/html/2606.14211#bib.bib38); Madaan et al\.,[2023](https://arxiv.org/html/2606.14211#bib.bib33); Gou et al\.,[2024](https://arxiv.org/html/2606.14211#bib.bib15)\)\. A complementary line uses gradient\-based updates \(SFT or RL\) to train models to self\-correct or self\-improve, optimizing for a more correct final answer with self\-generated or external signals\(Kumar et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib26); Ma et al\.,[2025b](https://arxiv.org/html/2606.14211#bib.bib32); Zuo et al\.,[2025](https://arxiv.org/html/2606.14211#bib.bib52); Zhu et al\.,[2026](https://arxiv.org/html/2606.14211#bib.bib51)\)\. Despite some promising results,Huang et al\. \([2024](https://arxiv.org/html/2606.14211#bib.bib21)\)find that LLMs can struggle to self\-correct solely based on self\-generated signals\. Our approach differs in*what*it trains\. Rather than training the policy to directly produce more correct answers, we train the agent’s*reflection*to*accurately judge*its own correctness after observing environment feedback—a calibration objective orthogonal to task performance\. This yields the reliable signal that naive self\-correction lacks: once calibrated, the reflections serve as pseudo\-rewards for further self\-improvement without outcome supervision\. In effect, our method turns the agent into its own verifier grounded in environment feedback, instead of requiring separately trained verifiers\(Cobbe et al\.,[2021](https://arxiv.org/html/2606.14211#bib.bib8)\)\.
## 6 Discussion
We identified a persistent reflection gap: LLM agents mis-assess their own actions even after observing concrete environment feedback, and outcome-only RL barely fixes it. We offer a simple yet effective fix that augments the outcome reward with a calibration bonus and a dynamic schedule on its coefficient. The calibration bonus is computed for free by contrasting the agent’s own reflection with the actual outcome, while the dynamic schedule enables the agent to simultaneously improve reflection calibration (e.g., reducing the underconfidence rate from 44.4% to 7.7%) and task accuracy (e.g., from 75.1% to 76.5%), lifting the unified $\mathsf{ChowScore}$ metric from 73.0% to 76.5%. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.
#### Limitations and future work\.
Due to compute constraints, the largest model we trained is at the 7B scale; validating RefGRPO at larger scales is an important next step. In addition, we focus on a binary reflection score $s^{\mathrm{ref}} \in \{0, 1\}$, and extending it to real-valued confidence scores in $[0, 1]$ is an interesting direction.
## References
- Achiam et al\. \(2023\)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al\.Gpt\-4 technical report\.*arXiv preprint arXiv:2303\.08774*, 2023\.
- Bai et al\. \(2022\)Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al\.Constitutional ai: Harmlessness from ai feedback\.*arXiv preprint arXiv:2212\.08073*, 2022\.
- Bani\-Harouni et al\. \(2026\)David Bani\-Harouni, Chantal Pellegrini, Paul Stangel, Ege Özsoy, Kamilia Zaripova, Nassir Navab, and Matthias Keicher\.Rewarding doubt: A reinforcement learning approach to calibrated confidence expression of large language models\.In*The Fourteenth International Conference on Learning Representations*, 2026\.
- Bousquet and Zhivotovskiy \(2021\)Olivier Bousquet and Nikita Zhivotovskiy\.Fast classification rates without standard margin assumptions\.*Information and Inference: A Journal of the IMA*, 10\(4\):1389–1421, 2021\.
- Cao et al\. \(2025\)Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, et al\.Skyrl\-agent: Efficient rl training for multi\-turn llm agent\.*arXiv preprint arXiv:2511\.16108*, 2025\.
- Chow \(1957\)Chi\-Keung Chow\.An optimum character recognition system using decision functions\.*IRE Transactions on Electronic Computers*, \(4\):247–254, 1957\.
- Chow \(1970\)CK Chow\.On optimum recognition error and reject tradeoff\.*IEEE Transactions on Information Theory*, 1970\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\.
- Comanici et al\. \(2025\)Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al\.Gemini 2\.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.*arXiv preprint arXiv:2507\.06261*, 2025\.
- Damani et al\. \(2026\)Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, and Jacob Andreas\.Beyond binary rewards: Training LMs to reason about their uncertainty\.In*The Fourteenth International Conference on Learning Representations*, 2026\.
- Deng et al\. \(2021\)Xiang Deng, Ahmed Hassan, Christopher Meek, Oleksandr Polozov, Huan Sun, and Matthew Richardson\.Structure\-grounded pretraining for text\-to\-sql\.In*Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 1337–1350, 2021\.
- Gan et al\. \(2021\)Yujian Gan, Xinyun Chen, and Matthew Purver\.Exploring underexplored limitations of cross\-domain text\-to\-sql generalization\.In*Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing*, pages 8926–8931, 2021\.
- Gandhi et al\. \(2026\)Kanishk Gandhi, Shivam Garg, Noah D Goodman, and Dimitris Papailiopoulos\.Endless terminals: Scaling rl environments for terminal agents\.*arXiv preprint arXiv:2601\.16443*, 2026\.
- Gao et al\. \(2024\)Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yuntao Hong, Zhiling Luo, et al\.A preview of xiyan\-sql: A multi\-generator ensemble framework for text\-to\-sql\.*arXiv preprint arXiv:2411\.08599*, 2024\.
- Gou et al\. \(2024\)Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Nan Duan, and Weizhu Chen\.CRITIC: Large language models can self\-correct with tool\-interactive critiquing\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Grattafiori et al\. \(2024\)Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\.
- Guo et al\. \(2017\)Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger\.On calibration of modern neural networks\.In*International conference on machine learning*, pages 1321–1330\. PMLR, 2017\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*arXiv preprint arXiv:2501\.12948*, 2025\.
- Ha and Schmidhuber \(2018\)David Ha and Jürgen Schmidhuber\.World models\.*arXiv preprint arXiv:1803\.10122*, 2\(3\):440, 2018\.
- He et al\. \(2025\)Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al\.Justrl: Scaling a 1\.5 b llm with a simple rl recipe\.*arXiv preprint arXiv:2512\.16649*, 2025\.
- Huang et al\. \(2024\)Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou\.Large language models cannot self\-correct reasoning yet\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Hui et al. (2024) Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. *arXiv preprint arXiv:2409.12186*, 2024.
- Jaech et al\. \(2024\)Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El\-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al\.Openai o1 system card\.*arXiv preprint arXiv:2412\.16720*, 2024\.
- Jin et al\. \(2025\)Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han\.Search\-r1: Training LLMs to reason and leverage search engines with reinforcement learning\.In*Second Conference on Language Modeling*, 2025\.
- Kadavath et al\. \(2022\)Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield\-Dodds, Nova DasSarma, Eli Tran\-Johnson, et al\.Language models \(mostly\) know what they know\.*arXiv preprint arXiv:2207\.05221*, 2022\.
- Kumar et al\. \(2025\)Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co\-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust\.Training language models to self\-correct via reinforcement learning\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Leng et al\. \(2025\)Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang\.Taming overconfidence in LLMs: Reward calibration in RLHF\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Li et al\. \(2025\)Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, et al\.Omnisql: Synthesizing high\-quality text\-to\-sql data at scale\.*arXiv preprint arXiv:2503\.02240*, 2025\.
- Li et al\. \(2023\)Jinyang Li, Binyuan Hui, GE QU, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chang, Fei Huang, Reynold Cheng, and Yongbin Li\.Can LLM already serve as a database interface? a BIg bench for large\-scale database grounded text\-to\-SQLs\.In*Thirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track*, 2023\.
- Liu et al\. \(2025\)Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin\.Understanding r1\-zero\-like training: A critical perspective\.In*Second Conference on Language Modeling*, 2025\.
- Ma et al\. \(2025a\)Peixian Ma, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, and Jian Guo\.Sql\-r1: Training natural language to sql reasoning model by reinforcement learning\.*arXiv preprint arXiv:2504\.08600*, 2025a\.
- Ma et al\. \(2025b\)Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, and Jia Li\.S2r: Teaching llms to self\-verify and self\-correct via reinforcement learning\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 22632–22654, 2025b\.
- Madaan et al\. \(2023\)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al\.Self\-refine: Iterative refinement with self\-feedback\.*Advances in neural information processing systems*, 36:46534–46594, 2023\.
- Ouyang et al\. \(2022\)Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al\.Training language models to follow instructions with human feedback\.*Advances in neural information processing systems*, 35:27730–27744, 2022\.
- Pourreza et al\. \(2025\)Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, and Sercan O Arik\.Reasoning\-SQL: Reinforcement learning with SQL tailored partial rewards for reasoning\-enhanced text\-to\-SQL\.In*Second Conference on Language Modeling*, 2025\.
- Schulman et al\. \(2017\)John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*, 2024\.
- Shinn et al\. \(2023\)Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\.Reflexion: Language agents with verbal reinforcement learning\.*Advances in neural information processing systems*, 36:8634–8652, 2023\.
- Shrivastava et al\. \(2026\)Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, and Dimitris Papailiopoulos\.Echo: Terminal agents learn world models for free\.*arXiv preprint arXiv:2605\.24517*, 2026\.
- Tao et al\. \(2024\)Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie, Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, and Bolin Ding\.When to trust llms: Aligning confidence with response quality\.In*Findings of the Association for Computational Linguistics: ACL 2024*, pages 5984–5996, 2024\.
- Tian et al\. \(2023\)Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning\.Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 5433–5442, 2023\.
- Wang et al\. \(2025\)Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al\.Ragen: Understanding self\-evolution in llm agents via multi\-turn reinforcement learning\.*arXiv preprint arXiv:2504\.20073*, 2025\.
- Wei et al\. \(2025\)Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, et al\.Reinforcing multi\-turn reasoning in llm agents via turn\-level reward design\.*arXiv preprint arXiv:2505\.11821*, 2025\.
- Xiong et al\. \(2024\)Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi\.Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Yu et al\. \(2025\)Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al\.Dapo: An open\-source llm reinforcement learning system at scale\.*arXiv preprint arXiv:2503\.14476*, 2025\.
- Yu et al\. \(2018\)Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al\.Spider: A large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-sql task\.In*Proceedings of the 2018 conference on empirical methods in natural language processing*, pages 3911–3921, 2018\.
- Zhang et al\. \(2026\)Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong\-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru WANG, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng YAN, Philip Torr, and LEI BAI\.The landscape of agentic reinforcement learning for LLMs: A survey\.*Transactions on Machine Learning Research*, 2026\.ISSN 2835\-8856\.
- Zheng et al\. \(2025\)Chujie Zheng, Shixuan Liu, Mingze Li, Xiong\-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al\.Group sequence policy optimization\.*arXiv preprint arXiv:2507\.18071*, 2025\.
- Zhu and Nowak \(2022a\)Yinglun Zhu and Robert Nowak\.Active learning with neural networks: Insights from nonparametric statistics\.*Advances in Neural Information Processing Systems*, 35:142–155, 2022a\.
- Zhu and Nowak \(2022b\)Yinglun Zhu and Robert Nowak\.Efficient active learning with abstention\.*Advances in Neural Information Processing Systems*, 35:35379–35391, 2022b\.
- Zhu et al\. \(2026\)Yinglun Zhu, Jiancheng Zhang, and Fuzhi Tang\.Test\-time matching: Unlocking compositional reasoning in multimodal models\.In*The Fourteenth International Conference on Learning Representations*, 2026\.
- Zuo et al\. \(2025\)Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al\.Ttrl: Test\-time reinforcement learning\.*arXiv preprint arXiv:2504\.16084*, 2025\.
## Appendix A Additional Experimental Details and Results
### A.1 Implementation Details
[Table 4](https://arxiv.org/html/2606.14211#A1.T4) lists the training hyperparameters shared between GRPO+ and RefGRPO. The maximum response length is enforced *per turn*, so a $k$-turn rollout can emit up to $3000 \times k$ tokens in total.
Table 4: Training hyperparameters shared by GRPO+ and RefGRPO.

For evaluation, we download all base and external models from Hugging Face and run our own evaluation. The external models (OmniSQL-7B and SQL-R1-7B) are evaluated only in the single-turn setting, since they are trained in a single-turn format with long chain-of-thought reasoning. For reflection accuracy ($\mathsf{Acc}_{\mathsf{ref}}$), we report the conditional variant, which normalizes over the samples in which the model actually produces a valid reflection. (Footnote 4: The other metrics are unaffected by this normalization: task accuracy does not depend on reflection; the overconfidence rate and underconfidence rate are themselves conditional probabilities; and for the Chow score, a sample with an invalid reflection score is scored by its task accuracy rather than discarded.) [Table 5](https://arxiv.org/html/2606.14211#A1.T5) reports the probability of generating such a valid reflection: it is consistently lower for the base and external models, whereas for our trained models (GRPO+ and RefGRPO) it is almost always 100% (and 99.9% in the rare exception). Crucially, although this conditional normalization *inflates* $\mathsf{Acc}_{\mathsf{ref}}$ for the base and external models, since samples with invalid reflections are not counted as errors, RefGRPO still attains the best average $\mathsf{Acc}_{\mathsf{ref}}$ overall (as shown in [Tables 1](https://arxiv.org/html/2606.14211#S4.T1) and [2](https://arxiv.org/html/2606.14211#S4.T2)).
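The effect of the conditional normalization can be seen in a small sketch. The representation of an invalid reflection as `None`, the function names, and the toy data are assumptions for illustration; reflection accuracy is again taken to be agreement between the reflection and the outcome.

```python
from typing import Optional, Sequence


def acc_ref_conditional(reflections: Sequence[Optional[int]], outcomes: Sequence[int]) -> float:
    """Reflection accuracy over samples with a valid reflection only."""
    valid = [(s, r) for s, r in zip(reflections, outcomes) if s is not None]
    return sum(int(s == r) for s, r in valid) / max(1, len(valid))


def acc_ref_unconditional(reflections: Sequence[Optional[int]], outcomes: Sequence[int]) -> float:
    """Reflection accuracy over all samples, counting an invalid reflection as an error."""
    hits = sum(int(s is not None and s == r) for s, r in zip(reflections, outcomes))
    return hits / len(outcomes)


def p_valid(reflections: Sequence[Optional[int]]) -> float:
    """Fraction of samples with a valid reflection."""
    return sum(s is not None for s in reflections) / len(reflections)


# Toy data (hypothetical): half of the reflections are invalid (None).
outcomes = [1, 0, 1, 0]
reflections = [1, None, 1, None]

print(p_valid(reflections))                          # share of valid reflections
print(acc_ref_conditional(reflections, outcomes))    # invalid samples are ignored
print(acc_ref_unconditional(reflections, outcomes))  # invalid samples count as errors
```

In this toy case the conditional variant is twice the unconditional one, which is the sense in which the normalization favors models with a low rate of valid reflections.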
Table 5: Probability of generating a valid reflection \($\mathsf{P}_{\mathsf{valid}}$\) for the base and external models\. We omit our trained models \(GRPO\+ and RefGRPO\), whose $\mathsf{P}_{\mathsf{valid}}$ is almost always $100\%$ \(and $99.9\%$ in the rare exception\)\. All values are percentages \(*higher is better*\)\. Results are averaged across 5 benchmarks\.
### A\.2 Per\-Dataset Results for the Main Results
[Tables 6](https://arxiv.org/html/2606.14211#A1.T6) and [7](https://arxiv.org/html/2606.14211#A1.T7) provide per\-dataset results for the main results presented in [Section 4\.2](https://arxiv.org/html/2606.14211#S4.SS2)\.

Table 6: Per\-dataset single\-turn results across five metrics: task accuracy, reflection accuracy, overconfidence rate, underconfidence rate, and Chow score\. All values are percentages\. Each block reports one metric across the five benchmarks plus the average; arrows in the block headers indicate the desired direction\. Best per\-dataset results within each scale block are highlighted in **bold**\.

Table 7: Per\-dataset multi\-turn results across five metrics: task accuracy, reflection accuracy, overconfidence rate, underconfidence rate, and Chow score\. All values are percentages\. Each block reports one metric across the five benchmarks plus the average; arrows in the block headers indicate the desired direction\. Best per\-dataset results within each scale block are highlighted in **bold**\.
### A\.3 Per\-Dataset External 7B Comparison
[Table 8](https://arxiv.org/html/2606.14211#A1.T8) reports per\-dataset results for the two external 7B SQL specialists discussed in [Figure 4](https://arxiv.org/html/2606.14211#S4.F4)\.

Table 8: Per\-dataset results for two external 7B SQL specialists \(OmniSQL\-7B, SQL\-R1\-7B\) across two metrics: task accuracy and reflection accuracy\. All values are percentages\. Per\-dataset results for our RefGRPO 7B are reported in [Table 7](https://arxiv.org/html/2606.14211#A1.T7)\.
### A\.4 Per\-Dataset Self\-Improvement Results
[Table 9](https://arxiv.org/html/2606.14211#A1.T9) reports per\-dataset results for the self\-improvement results presented in [Section 4\.3](https://arxiv.org/html/2606.14211#S4.SS3)\.

Table 9: Per\-dataset task accuracy before \($\mathsf{Acc}_{\mathrm{Ckpt}}$\) and after \($\mathsf{Acc}_{\mathrm{Self}}$\) self\-improvement in the single\-turn setting using Qwen2\.5\-Coder\-3B\-Instruct as the base model\. $\Delta = \mathsf{Acc}_{\mathrm{Self}} - \mathsf{Acc}_{\mathrm{Ckpt}}$ denotes the change from checkpoint to self\-improvement\. All values are percentages\. Best per\-dataset $\Delta$ is highlighted in **bold**\.
### A\.5 Per\-Dataset Selective Prediction Results
[Table 10](https://arxiv.org/html/2606.14211#A1.T10) provides per\-dataset results for the selective prediction results presented in [Section 4\.4](https://arxiv.org/html/2606.14211#S4.SS4)\.
Table 10: Per\-dataset selective\-prediction results: average pass rate \($\mathrm{Avg}@8$\), accuracy of selective prediction \($\mathsf{Acc}_{\mathsf{sel}}@8$\), and the gain $\Delta = \mathsf{Acc}_{\mathsf{sel}}@8 - \mathrm{Avg}@8$ from selective prediction\. Evaluation protocol follows [Section 4\.4](https://arxiv.org/html/2606.14211#S4.SS4), and both single\-turn and multi\-turn evaluations use Qwen2\.5\-Coder\-3B\-Instruct as the base model\. All values are percentages\. Best per\-dataset $\Delta$ between the two methods is highlighted in **bold**\.

| Setting | Dataset | GRPO\+ $\mathrm{Avg}@8$ | GRPO\+ $\mathsf{Acc}_{\mathsf{sel}}@8$ | GRPO\+ $\Delta$ | RefGRPO \(Ours\) $\mathrm{Avg}@8$ | RefGRPO \(Ours\) $\mathsf{Acc}_{\mathsf{sel}}@8$ | RefGRPO \(Ours\) $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Single\-turn | Spider\-Dev | 79\.3 | 79\.1 | −0\.2 | 80\.5 | 81\.7 | **\+1\.2** |
| Single\-turn | Spider\-DK | 67\.0 | 65\.2 | −1\.8 | 69\.9 | 71\.2 | **\+1\.3** |
| Single\-turn | Spider\-Realistic | 74\.3 | 74\.7 | \+0\.4 | 75\.3 | 76\.9 | **\+1\.6** |
| Single\-turn | Spider\-Test | 79\.0 | 79\.7 | \+0\.7 | 79\.8 | 80\.9 | **\+1\.1** |
| Single\-turn | Bird\-Dev | 46\.6 | 50\.2 | **\+3\.6** | 45\.9 | 48\.6 | \+2\.7 |
| Single\-turn | Average | 69\.2 | 69\.8 | \+0\.6 | 70\.3 | 71\.9 | **\+1\.6** |
| Multi\-turn | Spider\-Dev | 80\.7 | 80\.9 | \+0\.2 | 81\.1 | 82\.0 | **\+0\.9** |
| Multi\-turn | Spider\-DK | 70\.8 | 71\.7 | \+0\.9 | 71\.1 | 72\.3 | **\+1\.2** |
| Multi\-turn | Spider\-Realistic | 76\.6 | 76\.7 | \+0\.1 | 76\.5 | 77\.4 | **\+0\.9** |
| Multi\-turn | Spider\-Test | 80\.8 | 81\.3 | \+0\.5 | 81\.1 | 81\.7 | **\+0\.6** |
| Multi\-turn | Bird\-Dev | 52\.7 | 55\.5 | **\+2\.8** | 52\.7 | 54\.8 | \+2\.1 |
| Multi\-turn | Average | 72\.3 | 73\.2 | \+0\.9 | 72\.5 | 73\.6 | **\+1\.1** |
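As a rough illustration of how the three quantities in Table 10 relate for a single question, the sketch below computes an average pass rate, a selective\-prediction accuracy, and their difference from 8 rollouts\. It is a sketch under stated assumptions rather than the protocol of Section 4\.4: it assumes that the agent commits uniformly at random to one of the rollouts its reflection flags as correct, and that it falls back to all rollouts when none is flagged\. The helper names `avg_at_k` and `acc_sel_at_k` and the toy data are hypothetical\.

```python
from typing import Sequence


def avg_at_k(correct: Sequence[bool]) -> float:
    """Average pass rate: fraction of the k rollouts that are actually correct."""
    return sum(correct) / len(correct)


def acc_sel_at_k(correct: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Expected accuracy when committing only to rollouts flagged as correct.

    Assumption (not from the paper): one flagged rollout is picked uniformly
    at random, so the expected accuracy is the pass rate within the flagged
    pool; if nothing is flagged, the pool falls back to all rollouts.
    """
    if len(correct) != len(flagged):
        raise ValueError("inputs must have the same length")
    pool = [c for c, f in zip(correct, flagged) if f]
    if not pool:
        pool = list(correct)
    return sum(pool) / len(pool)


# Toy question with 8 rollouts: 4 are correct, and the reflection flags 4
# rollouts as correct, 3 of which really are.
toy_correct = [True, True, False, False, True, False, True, False]
toy_flagged = [True, True, False, False, True, True, False, False]

toy_avg = avg_at_k(toy_correct)                    # 0.5
toy_sel = acc_sel_at_k(toy_correct, toy_flagged)   # 0.75
toy_delta = toy_sel - toy_avg                      # 0.25
```

Under these assumptions the gain is positive whenever the flagged pool is purer than the full set of rollouts, which is why better\-calibrated reflections would be expected to yield a larger $\Delta$\.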