CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning
Summary
This paper introduces CurveRL, a principled distribution-aware prompt reweighting approach for reinforcement learning with verifiable rewards (RLVR) that improves LLM reasoning by assigning weights based on the rank and density of pass rates rather than their absolute values, consistently outperforming GRPO and other baselines.
# CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning
Source: [https://arxiv.org/html/2605.24331](https://arxiv.org/html/2605.24331)
Ke Sun∗1, Yizhou Zhao∗1, Jiayi Xin1, Qi Long†1, Weijie Su†1

1University of Pennsylvania. ∗Equal contribution. †Co-corresponding authors.

{kesun6, qlong}@upenn.edu, yzzhao@sas.upenn.edu, jiayixin@seas.upenn.edu, suw@wharton.upenn.edu
###### Abstract
Context- or prompt-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verifiable Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. We address this gap by formulating prompt reweighting as a functional derivative of a utility functional defined in the pass-rate function space, yielding a unified optimality framework that accommodates existing schemes, including REINFORCE and GRPO. Building on this optimality framework, we propose a distribution-aware prompt reweighting approach, called CurveRL, based on a quantile coordinate transform, in which the weight assigned to each prompt depends not on the absolute value of its pass rate but on its rank and density, reflecting the distributional structure of the pass rates in the learning dynamics. Extensive experiments across multiple benchmarks demonstrate that CurveRL consistently outperforms GRPO and other RLVR baselines. Our study identifies context-distribution control as a principled axis for analyzing and designing prompt-reweighted RLVR algorithms. The code is released at [https://github.com/zhyzmath/CurveRL](https://github.com/zhyzmath/CurveRL).
### 1 Introduction
Reinforcement learning with verifiable rewards (RLVR) has been the primary driver behind the recent emergence of reasoning models (Jaech et al., [2024](https://arxiv.org/html/2605.24331#bib.bib44); Guo et al., [2025](https://arxiv.org/html/2605.24331#bib.bib38)). As outcome-based rewards become more widely adopted, the token-level MDP (Puterman, [2014](https://arxiv.org/html/2605.24331#bib.bib57)) effectively collapses to a contextual bandit (Lattimore and Szepesvári, [2020](https://arxiv.org/html/2605.24331#bib.bib43)), in which the entire reasoning trace is absorbed into the response as a one-step decision. Among modern RLVR methods, Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.24331#bib.bib37); Guo et al., [2025](https://arxiv.org/html/2605.24331#bib.bib38)) and its variants (Yu et al., [2025](https://arxiv.org/html/2605.24331#bib.bib14); Liu et al., [2025](https://arxiv.org/html/2605.24331#bib.bib46); Chu et al., [2026](https://arxiv.org/html/2605.24331#bib.bib47); Zhang et al., [2025b](https://arxiv.org/html/2605.24331#bib.bib48); Xiong et al., [2025](https://arxiv.org/html/2605.24331#bib.bib10); Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7)) have emerged as the dominant family, offering competitive performance with low memory overhead. The mechanisms underlying their success, however, remain poorly understood.
Understanding why these algorithms work requires recognizing a distinctive feature of RLVR that has no direct counterpart in standard RL: the ability to directly shape the context or prompt distribution from which training samples are drawn. In standard RL with an external environment, the state-visitation distribution is mainly shaped indirectly through the agent's actions and exploration strategy (Thrun, [1992](https://arxiv.org/html/2605.24331#bib.bib62); Ladosz et al., [2022](https://arxiv.org/html/2605.24331#bib.bib63)), as the state or context distribution is often treated as exogenous and cannot be directly manipulated. The contextual-bandit structure of outcome-based RLVR removes this constraint, opening up a new and orthogonal axis of information acquisition, which we refer to as *context distribution control*: the prompt distribution is explicitly controllable during training, and the algorithm can decide which prompts to sample and how heavily to weight their gradients. Recently, a growing body of work exploits this freedom through a wide range of mechanisms, such as sample selection (Yu et al., [2025](https://arxiv.org/html/2605.24331#bib.bib14); Mao et al., [2026](https://arxiv.org/html/2605.24331#bib.bib15); Xiong et al., [2025](https://arxiv.org/html/2605.24331#bib.bib10)), curriculum strategies (Parashar et al., [2025](https://arxiv.org/html/2605.24331#bib.bib11); Rajaraman et al., [2026](https://arxiv.org/html/2605.24331#bib.bib66); Chen et al., [2025a](https://arxiv.org/html/2605.24331#bib.bib69)), and prompt reweighting (Davis and Recht, [2025](https://arxiv.org/html/2605.24331#bib.bib9); Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7)). These approaches are generally grounded in separate heuristics about which prompts deserve more gradient signal at each training stage. Yet none provides a principled answer to why its particular intervention on the prompt training distribution is the right one.
Recent work (Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7)) introduces the maximum likelihood principle into the RLVR objective by maximizing the log-likelihood of pass rates. While maximum likelihood estimation (MLE) enjoys well-established optimality properties in classical statistics (Shao, [1999](https://arxiv.org/html/2605.24331#bib.bib136); Casella and Berger, [2024](https://arxiv.org/html/2605.24331#bib.bib55)), we argue that these guarantees do not transfer to RLVR, as policy optimization is structurally different from statistical estimation. As also briefly mentioned in (Davis and Recht, [2025](https://arxiv.org/html/2605.24331#bib.bib9)), classical MLE optimality rests on a fixed, exogenous probability measure that characterizes the data-generating population, against which the estimator is evaluated. In RLVR, by contrast, the objective is evaluated under a policy-dependent measure or data distribution that co-evolves with the policy throughout training. There is no fixed population to estimate, and the measure itself is a policy-dependent endogenous object of the optimization. Thus, the classical optimality argument for MLE no longer applies.
##### Motivation: Prompt Reweighting and Its Optimality\.
Prompt reweighting offers a concrete way to implement context distribution control in RLVR. To learn the LLM policy $\pi_\theta$, we consider RLVR algorithms whose policy gradient update assigns a policy-dependent prompt weight $w_\theta(x)$ to each prompt $x$. Let $r(x,y)$ denote the rule-based binary reward function for prompt $x$ and response $y$, and let $d_0$ denote the initial prompt distribution. Define the *pass rate* $p_\theta(x)=\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\left[r(x,y)\right]$. A broad class of prompt-reweighted RLVR algorithms employs the following policy gradient update:

$$\nabla_\theta J(\theta)=\mathbb{E}_{x\sim d_0}\left[w_\theta(x)\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\left[r(x,y)\nabla_\theta\log\pi_\theta(y\mid x)\right]\right]=\mathbb{E}_{x\sim d_0}\left[w_\theta(x)\nabla_\theta p_\theta(x)\right]. \tag{1}$$

For instance, the population counterpart of GRPO corresponds to $w_\theta(x)=1/\sqrt{p_\theta(x)(1-p_\theta(x))}$. A detailed explanation is given in Section [2](https://arxiv.org/html/2605.24331#S2). Importantly, this generic form in Eq. ([1](https://arxiv.org/html/2605.24331#S1.E1)) raises the central question of our study:

What are the principle and the optimality to determine $w_\theta(x)$ in the prompt-reweighted RLVR?
##### Our Contributions\.
In this paper, we cast prompt reweighting as context distribution control, where the algorithm directly reshapes the effective prompt distribution. Under this view, we formulate the optimal weight as a functional derivative of a utility functional over the pass-rate function space. The resulting optimality framework subsumes existing pointwise weighting rules and reveals their *weight collapse* limitation. We then instantiate the principle with CurveRL, whose optimal weight is derived from a distribution-aware utility functional in pass-rate quantile space and uses rank and density information of the evolving pass-rate distribution. Empirically, CurveRL consistently improves the pass@1 and pass@$k$ trade-off across multiple reasoning benchmarks. Our contributions are summarized as follows:
- We formulate prompt reweighting in RLVR as context distribution control and define optimal weights through utility-dependent functional derivatives.
- We instantiate this principle with a distribution-aware utility in pass-rate quantile space and propose CurveRL, which characterizes the distributional structure of the pass rates.
- We perform extensive experiments, showing that CurveRL improves the pass@1 and pass@$k$ Pareto frontier over standard baselines. The underlying mechanism of CurveRL is also analyzed.
### 2 Preliminaries and Technical Background
##### REINFORCE Objective in the Pass\-Rate Space\.
RLVR is often formulated as a contextual bandit problem, where the entire reasoning trace is absorbed into the response as a one-step decision. We assume a rule-based binary reward $r(x,y)\in\{0,1\}$ for prompt $x$ and response $y$. To learn $\pi_\theta$ guided by these rewards, a policy gradient method (Sutton et al., [1998](https://arxiv.org/html/2605.24331#bib.bib29)), e.g., REINFORCE (Williams, [1992](https://arxiv.org/html/2605.24331#bib.bib45); Sutton et al., [1999](https://arxiv.org/html/2605.24331#bib.bib58)), is generally used to maximize $J_{\text{RL}}(\theta)$, which is defined by the following:

$$J_{\text{RL}}(\theta)=\mathbb{E}_{x\sim d_0,\,y\sim\pi_\theta(\cdot\mid x)}\left[r(x,y)\right]=\mathbb{E}_{x\sim d_0}\left[\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\left[\mathbb{I}\{y\in C(x)\}\right]\right]:=\mathbb{E}_{x\sim d_0}\left[p_\theta(x)\right], \tag{2}$$

where $C(x)$ is a feasible set in the response space, which is often determined by the domain-specific verifier. The score function trick is pivotal in the derivation of REINFORCE (Williams, [1992](https://arxiv.org/html/2605.24331#bib.bib45)), offering a generic tool for optimization under decision-dependent sampling distributions for the LLM-based generative model $\pi_\theta$. Next, we can derive its gradient:

$$\nabla_\theta J_{\text{RL}}(\theta)=\mathbb{E}_{x\sim d_0,\,y\sim\pi_\theta(\cdot\mid x)}\left[r(x,y)\nabla_\theta\log\pi_\theta(y\mid x)\right]=\mathbb{E}_{x\sim d_0}\left[\nabla_\theta p_\theta(x)\right]. \tag{3}$$

In particular, for the generic prompt reweighting form in Eq. ([1](https://arxiv.org/html/2605.24331#S1.E1)) in the pass-rate space, REINFORCE corresponds to a constant prompt weight $w_\theta(x)=1$.
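The score-function identity behind Eq. (3) can be checked exactly on a toy problem. The sketch below is not from the paper: it assumes a single prompt with a few enumerable responses and a softmax policy over hypothetical logits, and compares the score-function gradient with a finite-difference gradient of the pass rate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a small list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pass_rate(logits, rewards):
    """p_theta(x) = sum_y pi_theta(y|x) * r(x, y) for one prompt."""
    probs = softmax(logits)
    return sum(pi * r for pi, r in zip(probs, rewards))


def score_function_gradient(logits, rewards):
    """E_y[ r(x, y) * grad log pi(y|x) ], computed by enumerating every y."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for y in range(n):
        for j in range(n):
            # For a softmax policy: d log pi(y) / d logit_j = 1[y == j] - pi(j).
            dlogp = (1.0 if y == j else 0.0) - probs[j]
            grad[j] += probs[y] * rewards[y] * dlogp
    return grad


def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference gradient of the pass rate w.r.t. each logit."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += eps
        down[j] -= eps
        grad.append((pass_rate(up, rewards) - pass_rate(down, rewards)) / (2 * eps))
    return grad


# Hypothetical toy prompt: four candidate responses, two of them correct.
toy_logits = [0.3, -1.2, 0.8, 0.0]
toy_rewards = [1, 0, 0, 1]
print(score_function_gradient(toy_logits, toy_rewards))
print(finite_difference_gradient(toy_logits, toy_rewards))
```

The two printed vectors should agree up to finite-difference error. In a real LLM the responses cannot be enumerated, so the expectation over $y$ is replaced by sampled rollouts.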
##### Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.24331#bib.bib37); Guo et al., [2025](https://arxiv.org/html/2605.24331#bib.bib38)).
We next rewrite the gradient of GRPO in the same prompt-reweighting form. For each prompt $x$, GRPO performs $n$ rollouts to generate responses $\{y_i\}_{i=1}^{n}$. We denote a reference policy by $\pi_{\text{ref}}$ and an old policy by $\pi_{\theta_{\text{old}}}$. Under the contextual bandit or one-step decision abstraction adopted in (Davis and Recht, [2025](https://arxiv.org/html/2605.24331#bib.bib9)), GRPO can be written in a sequence-level form, which is closely related to Group Sequence Policy Optimization (GSPO) (Zheng et al., [2025](https://arxiv.org/html/2605.24331#bib.bib137)). As such, GRPO maximizes the following objective:

$$J_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{x\sim d_0,\,\{y_i\}_{i=1}^{n}\sim\pi_{\theta_{\text{old}}}(\cdot\mid x)}\left[\frac{1}{n}\sum_{i=1}^{n}\left(\min\left(\frac{\pi_\theta(y_i\mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i\mid x)}\hat{A}_i^x,\ \operatorname{clip}\left(\frac{\pi_\theta(y_i\mid x)}{\pi_{\theta_{\text{old}}}(y_i\mid x)},1-\epsilon,1+\epsilon\right)\hat{A}_i^x\right)-\beta D_{\mathrm{KL}}\left(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)\right)\right], \tag{4}$$

where $\epsilon$ is the clipping parameter and $\hat{A}_i^x$ is the advantage estimate. Clipping the policy ratio $\frac{\pi_\theta(y_i\mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i\mid x)}$ prevents $\pi_\theta$ from deviating dramatically from the previous policy $\pi_{\theta_{\text{old}}}$, and the regularization hyperparameter $\beta$ penalizes the divergence from the reference policy $\pi_{\text{ref}}$. Notably, a group-based normalization scheme is employed to estimate the advantage $\hat{A}_i^x$ in the gradient update rule:

$$\hat{A}_i^x=\frac{r(x,y_i)-\text{mean}\left(\{r(x,y_j)\}_{j=1}^{n}\right)}{\text{std}\left(\{r(x,y_j)\}_{j=1}^{n}\right)}. \tag{5}$$

Following (Davis and Recht, [2025](https://arxiv.org/html/2605.24331#bib.bib9)), when we replace the group-wise empirical baseline and normalization by their population counterparts (i.e., an infinite group size) and ignore the clipping and the policy-ratio term, the gradient of GRPO can be approximated by the following simple form:

$$\begin{aligned}\nabla_\theta J_{\mathrm{GRPO}}(\theta)&\approx\mathbb{E}_{x\sim d_0,\,y\sim\pi_\theta(\cdot\mid x)}\left[\frac{r(x,y)-\mathbb{E}_{\tilde{y}\sim\pi_\theta(\cdot\mid x)}\left[r(x,\tilde{y})\right]}{\sqrt{\text{Var}_{\tilde{y}\sim\pi_\theta(\cdot\mid x)}\left[r(x,\tilde{y})\right]}}\nabla_\theta\log\pi_\theta(y\mid x)\right]\\&=\mathbb{E}_{x\sim d_0}\left[\frac{1}{\sqrt{p_\theta(x)(1-p_\theta(x))}}\nabla_\theta p_\theta(x)\right],\end{aligned} \tag{6}$$

which ideally corresponds to a population-level objective $J_{\text{GRPO}}(\theta)=\mathbb{E}_{x\sim d_0}\left[2\arcsin\sqrt{p_\theta(x)}\right]$ (Davis and Recht, [2025](https://arxiv.org/html/2605.24331#bib.bib9)). The second line of Eq. ([6](https://arxiv.org/html/2605.24331#S2.E6)) uses $\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\left[\nabla_\theta\log\pi_\theta(y\mid x)\right]=0$, so the baseline term vanishes, together with the Bernoulli variance $p_\theta(x)(1-p_\theta(x))$ of a binary reward. Notably, the prompt weighting function of the population-level GRPO in Eq. ([6](https://arxiv.org/html/2605.24331#S2.E6)) is $1/\sqrt{p_\theta(x)(1-p_\theta(x))}$, which emphasizes prompts with either very small (i.e., $p_\theta(x)\rightarrow 0$) or very large (i.e., $p_\theta(x)\rightarrow 1$) pass rates in the learning dynamics.
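The group normalization in Eq. (5) and the population-level weight in Eq. (6) can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, the population standard deviation (no Bessel correction) is an assumption, and so is returning zero advantages for a group whose rewards are all identical. Practical implementations often add a small constant to the denominator instead.

```python
import math


def grpo_advantages(rewards):
    """Group-normalized advantages (r_i - mean) / std for one prompt.

    Assumption: population standard deviation (divide by n). When every
    reward in the group is identical the std is zero, and this sketch
    returns all-zero advantages rather than dividing by zero.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def grpo_population_weight(p):
    """Population-level GRPO prompt weight 1 / sqrt(p (1 - p)), for 0 < p < 1."""
    return 1.0 / math.sqrt(p * (1.0 - p))


# Hypothetical group of four rollouts with one correct response.
print(grpo_advantages([1, 0, 0, 0]))

# The weight grows as the pass rate approaches 0 or 1 and is smallest at 0.5.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, grpo_population_weight(p))
```

For binary rewards the empirical standard deviation equals $\sqrt{\hat{p}(1-\hat{p})}$, where $\hat{p}$ is the fraction of correct rollouts, which is the finite-sample analogue of the denominator in Eq. (6).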
##### Advanced Prompt Reweighting RL Objectives\.
Several recent RLVR variants can also be interpreted as pointwise transformations of the pass rate (Walder and Karkhanis, [2025](https://arxiv.org/html/2605.24331#bib.bib65); Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7); Xiong et al., [2025](https://arxiv.org/html/2605.24331#bib.bib10)). A representative example is Maximum Likelihood RL (MaxRL) (Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7)), where the Maximum Likelihood (ML) principle (Bishop and Nasrabadi, [2006](https://arxiv.org/html/2605.24331#bib.bib56); Casella and Berger, [2024](https://arxiv.org/html/2605.24331#bib.bib55)) is heuristically introduced into the RL objective:

$$J_{\text{ML}}(\theta)=\mathbb{E}_{x\sim d_0}\left[\log p_\theta(x)\right],\quad\nabla_\theta J_{\text{ML}}=\mathbb{E}_{x\sim d_0}\left[\frac{1}{p_\theta(x)}\nabla_\theta p_\theta(x)\right]. \tag{7}$$

An interesting property of MaxRL is that maximum likelihood optimizes an infinite harmonic mixture of pass@$k$ gradients, and REINFORCE can therefore be viewed as optimizing a first-order approximation to MaxRL. In practice, MaxRL approximates the maximum likelihood gradient in Eq. ([7](https://arxiv.org/html/2605.24331#S2.E7)) by truncating the Maclaurin expansion of the logarithmic function to a finite order. In our study, we use the oracle maximum likelihood form for theoretical analysis and its practical approximation for algorithmic comparisons. In parallel, Xiong et al. ([2025](https://arxiv.org/html/2605.24331#bib.bib10)) studied a prompt reweighting framework that adaptively allocates the number of sampled responses, implicitly prioritizing difficult prompts to mitigate vanishing pass-rate estimates. In contrast, we focus on identifying what constitutes an optimal weighting scheme itself and characterizing its role in shaping the learning dynamics, rather than approaching the problem from the perspective of sampling budget allocation.
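The harmonic-mixture statement can be made concrete at the population level. Writing pass@$k$ as $1-(1-p)^k$ and expanding $\log p=-\sum_{k\ge 1}(1-p)^k/k$, the derivative with respect to $p$ is $\sum_{k\ge 1}\tfrac{1}{k}\,\tfrac{d}{dp}\text{pass@}k=\sum_{k\ge 1}(1-p)^{k-1}=1/p$. The sketch below evaluates the truncated sum. It is an illustration under these assumptions rather than the MaxRL estimator itself, which works with sampled rollouts, and the function names are hypothetical.

```python
def pass_at_k_slope(p, k):
    """d/dp of pass@k = 1 - (1 - p)^k, i.e. k (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)


def truncated_ml_weight(p, order):
    """Harmonic mixture sum_{k=1..order} (1/k) * d pass@k / dp.

    order = 1 gives the REINFORCE weight 1; as order grows the sum
    approaches the maximum-likelihood weight 1 / p (for 0 < p <= 1).
    """
    return sum(pass_at_k_slope(p, k) / k for k in range(1, order + 1))


# Hypothetical pass rate of a hard prompt.
p = 0.2
for order in (1, 2, 8, 64):
    print(order, truncated_ml_weight(p, order), 1.0 / p)
```

The truncated sum has the closed form $(1-(1-p)^K)/p$ for order $K$, so low-order truncations stay bounded as $p\rightarrow 0$ while the oracle weight $1/p$ diverges.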
In summary, the RLVR algorithms considered in this section can be interpreted as assigning different pointwise weights to prompts over the pass-rate function space. This perspective offers a technical foundation for the utility-functional framework developed in the remainder of the paper, particularly in [Section 3](https://arxiv.org/html/2605.24331#S3).
### 3 Prompt Reweighting as Utility-Dependent Context Distribution Control
#### 3.1 Prompt Reweighting as Context Distribution Control
##### Problem Setting: Policy\-Reweighted Contextual Bandit \(PRCB\)\.
Introducing prompt reweighting into the RLVR policy-gradient update makes the *effective context distribution* policy-dependent in the learning dynamics. Given a generic non-negative weight $w_\theta(x)$, define $d_\theta(x)=d_0(x)w_\theta(x)/Z_\theta$, where $Z_\theta=\int d_0(x)w_\theta(x)\,dx$ is the normalization factor. Under this framework, the general gradient update can be written as

$$\nabla_\theta J(\theta)=\mathbb{E}_{x\sim d_0}\left[w_\theta(x)\nabla_\theta p_\theta(x)\right]=Z_\theta\,\mathbb{E}_{x\sim d_\theta}\left[\nabla_\theta p_\theta(x)\right], \tag{8}$$

where the positive scalar $Z_\theta$ only changes the update scale rather than the gradient update direction at a fixed policy iteration. Thus, prompt reweighting can be interpreted as optimizing under an effective prompt measure $d_\theta$ rather than the fixed base measure $d_0$. A special case is the *pointwise objective* $J_g(\theta)$ with its gradient form:

$$J_g(\theta)=\mathbb{E}_{x\sim d_0}\left[g\left(p_\theta(x)\right)\right],\quad\nabla_\theta J_g(\theta)=\mathbb{E}_{x\sim d_0}\left[g'\left(p_\theta(x)\right)\nabla_\theta p_\theta(x)\right], \tag{9}$$

which induces the pointwise weight $w_\theta(x)=g'(p_\theta(x))$ for a deterministic function $g$, and includes REINFORCE, GRPO, and MaxRL as examples from Section [2](https://arxiv.org/html/2605.24331#S2). We call this setting a *Policy-Reweighted Contextual Bandit (PRCB)*: the contextual-bandit interaction remains one-step and outcome-based, but the optimization measure over contexts becomes policy-dependent through reweighting. Unlike MDPs and non-Markovian environments, which extend the classical contextual bandit along the temporal axis, PRCB highlights a distinctive axis of information acquisition in RLVR: measure adaptivity via direct context distribution control. Notably, direct context distribution control is typically unavailable in standard RL with an external environment.
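To see how the pointwise weights reshape the effective prompt measure $d_\theta$ in Eq. (8), the following sketch works on a finite prompt set. Everything here is illustrative: the dictionary of weight functions mirrors the three pointwise rules from Section 2, while the function name `effective_distribution`, the uniform base distribution, and the pass rates are hypothetical.

```python
import math

# Pointwise weights w(p) = g'(p) for the three rules discussed in Section 2.
# The GRPO and MaxRL weights diverge at the boundary, so 0 < p < 1 is assumed.
POINTWISE_WEIGHTS = {
    "reinforce": lambda p: 1.0,
    "grpo": lambda p: 1.0 / math.sqrt(p * (1.0 - p)),
    "maxrl": lambda p: 1.0 / p,
}


def effective_distribution(base_probs, pass_rates, weight_fn):
    """Return (d_theta, Z_theta) with d_theta(x) = d_0(x) w(p(x)) / Z_theta."""
    unnormalized = [d * weight_fn(p) for d, p in zip(base_probs, pass_rates)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized], z


# Hypothetical setup: four prompts, uniform d_0, assorted pass rates.
base_probs = [0.25, 0.25, 0.25, 0.25]
pass_rates = [0.1, 0.5, 0.9, 0.5]
for name, weight_fn in POINTWISE_WEIGHTS.items():
    d_theta, z_theta = effective_distribution(base_probs, pass_rates, weight_fn)
    print(name, [round(v, 4) for v in d_theta], round(z_theta, 4))
```

REINFORCE leaves $d_0$ unchanged, the GRPO weight shifts mass toward both extremes of the pass rate, and the MaxRL weight shifts mass toward the hardest prompt. In all three cases the weight of a prompt depends only on its own pass rate.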
##### New Information Acquisition in RLVR\.
In standard RL with an external environment, information acquisition is primarily achieved through action-level exploration (Thrun, [1992](https://arxiv.org/html/2605.24331#bib.bib62); Ladosz et al., [2022](https://arxiv.org/html/2605.24331#bib.bib63)), where the policy affects future observations only indirectly through the environment dynamics. In contrast, RLVR under the PRCB formulation introduces a fundamentally new control axis: the learner can directly reshape the effective context distribution through policy-dependent prompt reweighting. This mechanism is closely related in spirit to active learning (Liu et al., [2024](https://arxiv.org/html/2605.24331#bib.bib110); Ménard et al., [2021](https://arxiv.org/html/2605.24331#bib.bib111)), adaptive experimental design (Mehta et al., [2022](https://arxiv.org/html/2605.24331#bib.bib39); Blau et al., [2022](https://arxiv.org/html/2605.24331#bib.bib40)), and Bayesian optimization (Frazier, [2018](https://arxiv.org/html/2605.24331#bib.bib107); Brochu et al., [2010](https://arxiv.org/html/2605.24331#bib.bib108); Balakrishnan et al., [2020](https://arxiv.org/html/2605.24331#bib.bib109)), where learning efficiency is improved by adaptively allocating sampling effort to more informative inputs. With a slight abuse of notation, we let $d_t(x)$ denote the context distribution and $d_t(x,y)$ the joint context-response distribution at time $t$. The following schematic contrast highlights the distinction between the two paradigms:

$$\begin{aligned}\text{Standard RL:}\quad&\cdots\rightarrow d_t(x)\overset{\pi_{\theta_t}}{\rightarrow}d_t(x,y)\overset{\text{update}}{\rightarrow}\pi_{\theta_{t+1}}\overset{\textbf{exploration}}{\rightarrow}d_{t+1}(x)\rightarrow\cdots,\\\text{RLVR under PRCB:}\quad&\cdots\rightarrow d_t(x)\overset{\pi_{\theta_t}}{\rightarrow}d_t(x,y)\overset{\text{update}}{\rightarrow}\pi_{\theta_{t+1}}\overset{\boldsymbol{w_{\theta_{t+1}}(x)}}{\rightarrow}d_{t+1}(x)\rightarrow\cdots.\end{aligned}$$
This perspective highlights that RLVR differs fundamentally from standard RL by enabling direct control over the state or context distribution, rather than relying solely on action\-level exploration\.
##### New Challenges under PRCB\.
The PRCB viewpoint also introduces two conceptual challenges. (1) Coupled decision-dependent sampling structure. RLVR already involves decision-dependent sampling in the response space, where the data distribution in the optimization depends on $\pi_\theta$. Under PRCB, this dependence further extends to the context space through $d_\theta(x)$. As such, the learning dynamics are shaped jointly by response-level policy dependence via $\pi_\theta$ and context-level reweighting via $w_\theta$. (2) Breakdown of classical optimality principles. Many optimality criteria for contextual bandits, such as simple and cumulative regret minimization (Deshmukh et al., [2018](https://arxiv.org/html/2605.24331#bib.bib41); Zhou et al., [2020](https://arxiv.org/html/2605.24331#bib.bib42); Lattimore and Szepesvári, [2020](https://arxiv.org/html/2605.24331#bib.bib43)), are formulated under a fixed context distribution $d_0$. In PRCB, however, the effective training measure changes with the policy through $w_\theta$. This policy-dependent measure adaptivity falls outside the standard analysis and motivates our utility-based formulation as one principled notion of optimality for prompt reweighting, introduced in Section [3.2](https://arxiv.org/html/2605.24331#S3.SS2).
#### 3.2 Optimality of Prompt Reweighting is Utility-Dependent
##### Policy Optimization in RLVR as Pass\-rate Distribution Transport\.
Let $\mathcal{X}$ be the prompt space and $\pi^{*}$ be an oracle policy. Define an oracle pass-rate function $p^{*}:\mathcal{X}\rightarrow[0,1]$ by $p^{*}(x):=\mathbb{E}_{y\sim\pi^{*}(\cdot\mid x)}\left[r(x,y)\right]$. Let $f_\theta$ and $f^{*}$ be the corresponding density functions of the random variables $p_\theta(x)$ and $p^{*}(x)$ under $x\sim d_0$. The learning dynamics when performing policy optimization in RLVR can be interpreted as a distribution transport problem from $f_\theta$ to $f^{*}$ in the pass-rate space:

$$f_\theta(t)\rightarrow f^{*}(t),\quad t\in[0,1]. \tag{10}$$

Importantly, perturbations to $p_\theta(x)$ at individual prompts induce heterogeneous effects on the resulting distribution transport, which are governed not only by the local sensitivity of pass rates, but also by cross-prompt interactions, coupling, and interference in the induced distributional dynamics (He and Su, [2019](https://arxiv.org/html/2605.24331#bib.bib4); Barakat et al., [2026](https://arxiv.org/html/2605.24331#bib.bib6)).
##### Optimal Weight Function as Marginal Contribution of a Utility Functional\.
Assume $p_\theta\in L^{2}(d_0)$, the space of square-integrable functions under the prompt distribution $d_0$. We define a policy-dependent functional $\mathcal{U}_\theta:p_\theta(\cdot)\rightarrow\mathbb{R}$ on the pass-rate function space. Note that $\mathcal{U}_\theta$ integrates a broad class of transformations of $p_\theta(\cdot)$ over $x\sim d_0$. Define the pushforward measure $\mu_\theta=\left(p_\theta\right)_{\#}d_0$ (equivalently, $\mu_\theta=\text{Law}_{x\sim d_0}(p_\theta(x))$), i.e., the distribution of $p_\theta(x)$ obtained by mapping prompts $x$ through the pass-rate function $p_\theta(\cdot)$, with the property that $\mu_\theta([0,1])=\mathbb{P}_{x\sim d_0}\left(p_\theta(x)\in[0,1]\right)=1$. The measure $\mu_\theta$ induces the cumulative distribution function (CDF) $F_\theta$ defined by $F_\theta(t)=\mu_\theta([0,t])=\mathbb{P}_{x\sim d_0}\left(p_\theta(x)\leq t\right)$. We use a derivative to define the optimal prompt reweighting $w_\theta^{\star}$, which characterizes the direction in which an upward shift of $p_\theta(x)$ increases $\mathcal{U}_\theta$ fastest, i.e., the marginal contribution of each prompt. Concretely, we define the optimal weight function $w^{\star}_\theta(x)$ as the functional derivative (Lions, [1971](https://arxiv.org/html/2605.24331#bib.bib90); Evans, [2022](https://arxiv.org/html/2605.24331#bib.bib91)) of the utility functional $\mathcal{U}_\theta$ with respect to the pass-rate function $p_\theta(\cdot)$, evaluated at the prompt $x$, in Definition [1](https://arxiv.org/html/2605.24331#Thmdefinition1).
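On a finite batch of prompts drawn from $d_0$, the pushforward measure $\mu_\theta$ and its CDF $F_\theta$ reduce to the empirical distribution of the estimated pass rates. The sketch below illustrates that object only. It is not the CurveRL weighting rule, and the function name and the sample pass rates are hypothetical.

```python
from bisect import bisect_right


def empirical_cdf(pass_rates):
    """Return F(t) = fraction of prompts whose pass rate is <= t.

    Assumes the prompts are an i.i.d. sample from d_0, so each one
    carries mass 1 / n under the empirical pushforward measure.
    """
    sorted_rates = sorted(pass_rates)
    n = len(sorted_rates)

    def cdf(t):
        # bisect_right counts the entries that are <= t.
        return bisect_right(sorted_rates, t) / n

    return cdf


# Hypothetical per-prompt pass rates, e.g. estimated from a few rollouts each.
batch_pass_rates = [0.0, 0.25, 0.25, 0.5, 1.0]
F = empirical_cdf(batch_pass_rates)
for t in (0.0, 0.25, 0.3, 1.0):
    print(t, F(t))
```

Evaluating `F` at a prompt's own pass rate gives its rank, in quantile terms, within the batch. With few rollouts per prompt the estimated pass rates take only a handful of distinct values, so ties are common and the empirical CDF is a coarse step function.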
###### Definition 1 (Utility-dependent Optimal Prompt Weight $w_\theta^{\star}$).
For any perturbation $h\in L^{2}(d_0)$, according to the Riesz representation, the first variation of $\mathcal{U}_\theta$ is given by

$$\lim_{\epsilon\to 0}\frac{\mathcal{U}_\theta(p_\theta+\epsilon h)-\mathcal{U}_\theta(p_\theta)}{\epsilon}=\int w_\theta^{\star}(x)\,h(x)\,d_0(x)\,dx, \tag{11}$$

where $w_\theta^{\star}(x)$ is the functional derivative denoted by

$$w_\theta^{\star}(x):=\frac{\delta\mathcal{U}_\theta}{\delta p_\theta}(x), \tag{12}$$

which indicates the optimal prompt weighting function for the prompt $x$ and the current policy $\pi_\theta$.
As the Gateaux derivative defines a linear functional in the perturbation direction $h$ (cf. Eq. ([11](https://arxiv.org/html/2605.24331#S3.E11))), the Riesz representation theorem (Brezis and Brézis, [2011](https://arxiv.org/html/2605.24331#bib.bib93)) ensures that, under mild regularity conditions, this functional admits a representation in the Hilbert space $L^{2}(d_0)$. In particular, there exists $w_\theta^{\star}\in L^{2}(d_0)$ such that the first variation of the objective can be expressed as an inner product with respect to $d_0$, which naturally defines the optimal weight function $w_\theta^{\star}(x)$. Our definition is reminiscent of the notion of influence functions widely studied in statistics (Hampel, [1974](https://arxiv.org/html/2605.24331#bib.bib94); Ronchetti and Huber, [2009](https://arxiv.org/html/2605.24331#bib.bib97)) and recently revisited in modern machine learning (Koh and Liang, [2017](https://arxiv.org/html/2605.24331#bib.bib92); Grosse et al., [2023](https://arxiv.org/html/2605.24331#bib.bib95); Min et al., [2025](https://arxiv.org/html/2605.24331#bib.bib96)), where one quantifies the impact of infinitesimal perturbations on a global objective. More broadly, the functional derivative generalizes the ordinary partial derivative from finite-dimensional variables to infinite-dimensional function spaces (Lions, [1971](https://arxiv.org/html/2605.24331#bib.bib90); Evans, [2022](https://arxiv.org/html/2605.24331#bib.bib91)). Partial derivatives with respect to $p_\theta(x)$ only capture the local sensitivity at each $x$, but *functional derivatives can characterize how the objective responds to perturbations of the entire pass-rate function* $p_\theta(\cdot)$.
The functional-derivative-based definition generalizes partial derivatives, enabling distribution-aware weighting that reflects the data geometry, akin to the geometry-aware formulations in optimal transport (Villani and others, [2009](https://arxiv.org/html/2605.24331#bib.bib98); Peyré and Cuturi, [2019](https://arxiv.org/html/2605.24331#bib.bib99)). This perspective is also the theoretical foundation of our approach in Section [4](https://arxiv.org/html/2605.24331#S4).
###### Example 1 (Pointwise Utility Functional).
For the pointwise utility functional with $\mathcal{U}_{\theta}=J_{g}(\theta)$, the prompt weight depends only on $p_{\theta}(x)$ itself, implicitly assuming independence among prompts. Therefore, the functional derivative degenerates to the partial derivative:
$$\frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)=g^{\prime}\left(p_{\theta}(x)\right)=w^{\star}_{\theta}(x). \tag{13}$$
This reduction encompasses the aforementioned RLVR algorithms in a straightforward way, including (1) $w^{\star}_{\theta}(x)=1$ in REINFORCE, (2) $w^{\star}_{\theta}(x)=1/\sqrt{p_{\theta}(x)(1-p_{\theta}(x))}$ in GRPO, and (3) $w^{\star}_{\theta}(x)=1/p_{\theta}(x)$ in MaxRL, by taking the functional (partial) derivative (see Appendix [A.1](https://arxiv.org/html/2605.24331#A1.SS1)).
###### Example 2 (Variance-Based Utility Functional).
Functional derivatives over the pass-rate function are strictly more expressive than pointwise derivatives, i.e., in general $\frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)\neq\frac{d\,\mathcal{U}_{\theta}(p_{\theta}(x))}{d p_{\theta}(x)}$. Unlike pointwise utility functionals, functional derivatives can capture the global distributional structure of $p_{\theta}(\cdot)$. As an illustrative example, consider the variance-based utility functional $\mathcal{U}_{\theta}=\operatorname{Var}_{x\sim d_{0}}(p_{\theta}(x))=\mathbb{E}_{x\sim d_{0}}\left[p_{\theta}(x)^{2}\right]-(\mathbb{E}_{x\sim d_{0}}[p_{\theta}(x)])^{2}$. We can derive
$$\frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)=2p_{\theta}(x)-2\,\mathbb{E}_{x\sim d_{0}}[p_{\theta}(x)], \tag{14}$$
where the second term introduces a global coupling through the expectation, reflecting the dependence on the entire pass-rate distribution rather than the local value $p_{\theta}(x)$ alone. This serves as a motivating example of the general distribution-aware utility functionals introduced in Section [4](https://arxiv.org/html/2605.24331#S4).
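The step from the variance functional to Eq. (14) can be spelled out by perturbing $p_{\theta}$ along a direction $h$ and reading off the $L^{2}(d_{0})$ representer. The following is a short sketch of that calculation, with all expectations taken over $x\sim d_{0}$ and $h\in L^{2}(d_{0})$ assumed:

```latex
\begin{aligned}
\mathcal{U}[p+\epsilon h]
  &= \mathbb{E}\big[(p+\epsilon h)^{2}\big]-\big(\mathbb{E}[p+\epsilon h]\big)^{2} \\
  &= \mathcal{U}[p]
     + 2\epsilon\,\big(\mathbb{E}[p\,h]-\mathbb{E}[p]\,\mathbb{E}[h]\big)
     + \epsilon^{2}\operatorname{Var}(h), \\
\frac{d}{d\epsilon}\,\mathcal{U}[p+\epsilon h]\Big|_{\epsilon=0}
  &= \mathbb{E}_{x\sim d_{0}}\Big[\big(2p(x)-2\,\mathbb{E}[p]\big)\,h(x)\Big]
   = \big\langle 2p-2\,\mathbb{E}[p],\; h\big\rangle_{L^{2}(d_{0})}.
\end{aligned}
```

The inner-product form on the last line identifies $2p_{\theta}(x)-2\,\mathbb{E}_{x\sim d_{0}}[p_{\theta}(x)]$ as the functional derivative, and the $\mathbb{E}[p]$ term is exactly the global coupling that no pointwise $g^{\prime}(p_{\theta}(x))$ can produce.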
#### 3.3 The Optimality of Utility Functionals is Risk-Dependent
##### Optimal Choice of the Utility Functional: A Risk\-Sensitive Control Perspective\.
According to Definition [1](https://arxiv.org/html/2605.24331#Thmdefinition1), the optimal prompt reweighting is determined by the choice of a utility functional $\mathcal{U}_{\theta}$ that influences the performance of the RLVR algorithm. A natural question is whether there exists a single "optimal" utility functional $\mathcal{U}_{\theta}$ that uniformly dominates the alternatives. We argue that the answer is negative, because the preferred functional depends on the risk preference. The RL literature has long recognized that utility selection reflects the designer's risk attitude rather than a universal principle. Specifically, risk-sensitive RL (Mihatsch and Neuneier, [2002](https://arxiv.org/html/2605.24331#bib.bib87)), which optimizes a risk measure of the cumulative rewards, has not converged on a single optimal risk measure. A typical strategy is to focus on concrete criteria such as exponential utility and CVaR, treating selection as a tradeoff among interpretability, safety semantics, and tractability (Bäuerle and Jaśkiewicz, [2024](https://arxiv.org/html/2605.24331#bib.bib120); Wang and Chapman, [2022](https://arxiv.org/html/2605.24331#bib.bib119); Smith and Chapman, [2023](https://arxiv.org/html/2605.24331#bib.bib121); Majumdar et al., [2017](https://arxiv.org/html/2605.24331#bib.bib122)). Distributional RL (Dabney et al., [2018](https://arxiv.org/html/2605.24331#bib.bib89); Bellemare et al., [2023](https://arxiv.org/html/2605.24331#bib.bib116)) makes this dependence explicit by allowing arbitrary distortion functions. From this perspective, REINFORCE, GRPO, and MaxRL are different points on a continuum of risk preferences.
A useful summary of this continuum is the widely adopted entropic risk family (Howard and Matheson, [1972](https://arxiv.org/html/2605.24331#bib.bib8)), in which the agent optimizes an exponential utility $\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)$ that is lower-bounded by $J_{\mathrm{RL}}$ via Jensen's inequality:
$$\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\mathbb{E}_{x\sim d_{0}}\left[\frac{1}{\eta}\log\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}e^{\eta r(x,y)}\right]\geq\mathbb{E}_{x\sim d_{0}}\left[\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[r(x,y)\right]\right]=J_{\mathrm{RL}}(\theta), \tag{15}$$
where $\eta>0$ controls the degree of risk-seeking: larger values of $\eta$ place exponentially more weight on high-reward responses. As shown in Proposition [1](https://arxiv.org/html/2605.24331#Thmproposition1), this entropic-risk family provides a utility-functional interpolation between standard RL (i.e., REINFORCE) and MaxRL. Increasing $\eta$ induces stronger reweighting toward low-pass-rate prompts through the resulting weight function $w_{\theta}^{\star}$. The proof of Proposition [1](https://arxiv.org/html/2605.24331#Thmproposition1) is provided in Appendix [A.2](https://arxiv.org/html/2605.24331#A1.SS2). In summary, we argue that no single utility functional $\mathcal{U}_{\theta}$ is inherently preferred outside of a specific design objective.
###### Proposition 1 (Gradient of Entropic Risk RL Interpolates Standard RL and MaxRL).
Assume $p_{\theta}(x)>0$ for $d_{0}$-almost every $x$ and $\frac{\left\|\nabla_{\theta}p_{\theta}(x)\right\|}{p_{\theta}(x)}\in L^{1}\left(d_{0}\right)$. Then
$$\lim_{\eta\rightarrow 0^{+}}\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\nabla_{\theta}J_{\mathrm{RL}}(\theta),\qquad\lim_{\eta\rightarrow\infty}\eta\,\nabla_{\theta}\mathcal{U}^{\mathrm{risk}}_{\theta}(\eta)=\nabla_{\theta}J_{\mathrm{ML}}(\theta). \tag{16}$$
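The two limits in Proposition 1 can be illustrated numerically. The sketch below assumes a binary verifiable reward $r\in\{0,1\}$, so that $\mathbb{E}_{y}e^{\eta r}=1+p\,(e^{\eta}-1)$ with $p=p_{\theta}(x)$ and the weight multiplying $\nabla_{\theta}p_{\theta}(x)$ is $(e^{\eta}-1)/\big(\eta\,(1+p\,(e^{\eta}-1))\big)$. The function name `entropic_weight` is introduced here for illustration only.

```python
import math


def entropic_weight(p: float, eta: float) -> float:
    """Weight on grad p_theta(x) induced by the entropic-risk utility.

    Assumes a binary reward r in {0, 1}, so E[exp(eta * r)] = 1 + p * (exp(eta) - 1).
    """
    growth = math.expm1(eta)  # exp(eta) - 1, accurate for small eta
    return growth / (eta * (1.0 + p * growth))


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        small = entropic_weight(p, 1e-6)        # close to 1: REINFORCE-like weight
        large = 50.0 * entropic_weight(p, 50.0)  # close to 1/p: MaxRL-like weight
        print(f"p={p:.2f}  eta->0: {small:.4f}  eta*w at eta=50: {large:.4f}  1/p: {1 / p:.4f}")
```

Under this assumption the weight is decreasing in $p$ for any fixed $\eta>0$, which matches the statement that larger $\eta$ shifts emphasis toward low-pass-rate prompts.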
### 4 CurveRL: Distribution-Aware Reweighting in Pass-Rate Quantile Space
#### 4.1 From Pointwise Utility to Distribution-Aware Utility in the Quantile Space
Although no utility functional is universally optimal for general prompt reweighting, as analyzed in [Section 3.3](https://arxiv.org/html/2605.24331#S3.SS3), we can still systematically improve on the existing pointwise utility functional family in Eq. ([9](https://arxiv.org/html/2605.24331#S3.E9)).
##### Motivation: Weight Collapse of Pointwise Utility Functionals\.
Previous approaches (Davis and Recht, [2025](https://arxiv.org/html/2605.24331#bib.bib9); Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7); Xiong et al., [2025](https://arxiv.org/html/2605.24331#bib.bib10)) define utility as a pointwise transformation of the pass rate, i.e., $g(p_{\theta}(x))$, based solely on the absolute magnitude $p_{\theta}(x)$. However, such strategies suffer from a systematic limitation that we term *weight collapse*: the induced weights fail to adequately differentiate prompts with distinct learning potential. This issue manifests in both early and late training phases. (1) In the early stage, most prompts have $p_{\theta}(x)\approx 0$, leading to nearly indistinguishable weights despite substantial heterogeneity: some prompts lie near the learning frontier, while others are intrinsically difficult and less likely to benefit from further training. (2) In the late stage, many prompts have $p_{\theta}(x)\approx 1$ and thus yield similar weights, although certain prompts still admit meaningful improvement. The root cause is that pointwise utility functions depend solely on local information in the pass-rate space, namely the *absolute* value of $p_{\theta}(x)$, and ignore the *geometry* of the pass-rate space (i.e., the distributional structure of $p_{\theta}(x)$), including rank, density, and spacing information. To address this limitation, we propose a fundamentally different, distribution-aware utility functional family. This motivation is analogous to the distinction between geometry-aware distances, such as the Wasserstein distance, and pointwise divergences, such as the Kullback–Leibler (KL) divergence (see Appendix [D.4](https://arxiv.org/html/2605.24331#A4.SS4) for more discussion).
##### Distribution\-Aware Utility Functional\.
We aim to incorporate the distributional structure of $p_{\theta}(x)$ into the prompt weighting $w_{\theta}(x)$. Consider a reference distribution (i.e., probability measure) $\mu_{\mathrm{ref}}$, which induces a CDF $F_{\mathrm{ref}}(t)=\mu_{\mathrm{ref}}([0,t])$ with density $f_{\mathrm{ref}}$. We apply a *quantile coordinate transform* via $F_{\mathrm{ref}}$ to $p_{\theta}(x)$ to develop a new utility functional:
$$\mathcal{U}_{\theta}(F_{\mathrm{ref}})=\mathbb{E}_{x\sim d_{0}}\left[\psi\left(F_{\mathrm{ref}}\left(p_{\theta}(x)\right)\right)\right], \tag{17}$$
where $\psi$ is an increasing function, often called a *distortion function*, a notion rooted in distortion risk measures in economics and risk theory (Yaari, [1987](https://arxiv.org/html/2605.24331#bib.bib100); Acerbi, [2002](https://arxiv.org/html/2605.24331#bib.bib101); Wang, [1996](https://arxiv.org/html/2605.24331#bib.bib102); Dhaene et al., [2012](https://arxiv.org/html/2605.24331#bib.bib104); Balbás et al., [2009](https://arxiv.org/html/2605.24331#bib.bib103)). In principle, $F_{\mathrm{ref}}$ should approximate the distribution of $p_{\theta}(x)$ to capture meaningful distributional geometry. Nonetheless, it should not coincide with the exact policy-induced CDF $F_{\theta}$ of the same random variable $p_{\theta}(x)$. This is because, by the probability integral transform (Casella and Berger, [2024](https://arxiv.org/html/2605.24331#bib.bib55)), we have $F_{\theta}(p_{\theta}(x))\sim\mathrm{Uniform}(0,1)$, which removes any dependency on $\theta$ in expectation for $\mathcal{U}_{\theta}(F_{\theta})$. Consequently, the utility degenerates to a constant: $\mathbb{E}_{x\sim d_{0}}\left[\psi\left(F_{\theta}\left(p_{\theta}(x)\right)\right)\right]=\int_{0}^{1}\psi(z)\,dz$, yielding a non-informative $\mathcal{U}_{\theta}$. Proposition [2](https://arxiv.org/html/2605.24331#Thmproposition2) further quantifies this effect by bounding the deviation between $\mathcal{U}_{\theta}(F_{\mathrm{ref}})$ and $\mathcal{U}_{\theta}(F_{\theta})$ in terms of the 1-Wasserstein distance between $\mu_{\mathrm{ref}}$ and $\mu_{\theta}$. The proof is provided in Appendix [A.3](https://arxiv.org/html/2605.24331#A1.SS3).
###### Proposition 2.
Let $W_{1}$ denote the 1-Wasserstein distance. If $\psi$ is $L_{\psi}$-Lipschitz and $\|f_{\theta}\|_{\infty}<\infty$, then
$$\left|\mathcal{U}_{\theta}(F_{\mathrm{ref}})-\mathcal{U}_{\theta}(F_{\theta})\right|\leq L_{\psi}\|f_{\theta}\|_{\infty}W_{1}(\mu_{\mathrm{ref}},\mu_{\theta}). \tag{18}$$
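The degeneracy that motivates Proposition 2 (a utility built on the policy's own CDF is constant in $\theta$) can be checked numerically. The sketch below is illustrative only: the two Beta-distributed samples stand in for hypothetical early and late pass-rate distributions, the empirical CDF is computed from ranks, and `utility_with_own_cdf` is a name introduced here.

```python
import numpy as np


def utility_with_own_cdf(pass_rates, psi=np.log):
    """Estimate E[psi(F_theta(p))] with F_theta the empirical CDF of the same sample."""
    pass_rates = np.asarray(pass_rates, dtype=float)
    n = len(pass_rates)
    # Double argsort gives each value's rank in 1..n (ties broken arbitrarily).
    ranks = np.argsort(np.argsort(pass_rates)) + 1
    return float(np.mean(psi(ranks / n)))


rng = np.random.default_rng(0)
early = rng.beta(1.0, 5.0, size=10_000)  # hypothetical early training: mass near 0
late = rng.beta(5.0, 1.0, size=10_000)   # hypothetical late training: mass near 1

u_early = utility_with_own_cdf(early)
u_late = utility_with_own_cdf(late)
print(u_early, u_late)  # both close to the integral of log(z) over (0, 1), i.e. -1
```

Because the ranks are always a permutation of $1,\dots,n$, the two values agree up to floating-point summation error no matter how different the underlying pass-rate distributions are, which is why a lagged $F_{\mathrm{ref}}\neq F_{\theta}$ is needed for an informative gradient.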
##### Implementation and Interpretation\.
In practice, $F_{\mathrm{ref}}$ is instantiated using a lagged policy (e.g., $F_{\mathrm{ref}}=F_{\theta_{\mathrm{old}}}$) and updated in a sliding-window manner, akin to the target network trick adopted in deep RL algorithms (Mnih et al., [2015](https://arxiv.org/html/2605.24331#bib.bib105); Lillicrap et al., [2015](https://arxiv.org/html/2605.24331#bib.bib106)). The gradient is taken with respect to the current pass rate $p_{\theta}(x)$ with $F_{\mathrm{ref}}$ held fixed within each gradient update, analogous to a standard pointwise derivative. Theoretically, $F_{\mathrm{ref}}$ encodes global information about the pass rate function $p_{\theta}(\cdot)$ and is thus policy-dependent. Consequently, although the gradient computation appears pointwise in implementation because $F_{\mathrm{ref}}$ is fixed, the resulting utility functional $\mathcal{U}_{\theta}$ in Eq. ([17](https://arxiv.org/html/2605.24331#S4.E17)) is inherently distribution-aware and cannot be reduced to a deterministic pointwise transformation $g(p_{\theta}(\cdot))$. From the perspective of distribution transport illustrated in Section [3.2](https://arxiv.org/html/2605.24331#S3.SS2), the quantile-based representation via $F_{\mathrm{ref}}$ aligns with a Wasserstein-type geometry, rather than relying solely on pointwise information as the KL divergence does. Recall that the 1-Wasserstein distance is defined as $W_{1}(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\int\|x-y\|_{1}\,d\gamma(x,y)=\int_{\mathbb{R}}\left|F_{\mu}(t)-F_{\nu}(t)\right|dt$ for two measures $\mu$ and $\nu$, where $\Pi(\mu,\nu)$ is the set of joint couplings with marginal distributions $\mu$ and $\nu$. Wasserstein geometry is thus likewise expressed through the CDFs $F_{\mu}$ and $F_{\nu}$.
We defer a detailed discussion of the induced geometry to Appendix [D.4](https://arxiv.org/html/2605.24331#A4.SS4).
##### Our Method: CurveRL\.
In our main approach, we adopt $\psi(u)=\log u$ as the distortion function. This choice corresponds to a risk-seeking preference that emphasizes hard prompts, a preference well aligned with RLVR's empirical goal of improving worst-case reasoning performance. The log distortion is preferable for designers who care about tail performance, but it is not universally optimal, as discussed in Section [3.3](https://arxiv.org/html/2605.24331#S3.SS3). Consequently, a concise and interpretable prompt reweighting is derived in the gradient form:
$$\nabla_{\theta}\mathcal{U}_{\theta}(F_{\mathrm{ref}})=\mathbb{E}_{x\sim d_{0}}\left[\frac{f_{\mathrm{ref}}\left(p_{\theta}(x)\right)}{F_{\mathrm{ref}}\left(p_{\theta}(x)\right)}\nabla_{\theta}p_{\theta}(x)\right], \tag{19}$$
where the weight emphasizes prompts with *lower-quantile* pass rates via $1/F_{\mathrm{ref}}(p_{\theta}(x))$ and *higher density* via $f_{\mathrm{ref}}(p_{\theta}(x))$. Choosing the logarithm as $\psi$ explicitly exposes the geometry of the pass rate function space of $p_{\theta}(\cdot)$, including the density and rank information of $p_{\theta}(x)$, to reflect the distributional structure. This weight has the same form as the *reverse hazard rate* in probability and survival analysis (Block et al., [1998](https://arxiv.org/html/2605.24331#bib.bib112); Finkelstein, [2002](https://arxiv.org/html/2605.24331#bib.bib113)), albeit in a fundamentally different context.
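Eq. (19) follows from the chain rule applied to Eq. (17) with $F_{\mathrm{ref}}$ held fixed. A brief sketch for a general differentiable distortion $\psi$, assuming $F_{\mathrm{ref}}$ is differentiable with density $f_{\mathrm{ref}}$ and that the gradient and the expectation over $d_{0}$ can be exchanged:

```latex
\begin{aligned}
\nabla_{\theta}\,\mathcal{U}_{\theta}(F_{\mathrm{ref}})
  &= \mathbb{E}_{x\sim d_{0}}\Big[\nabla_{\theta}\,\psi\big(F_{\mathrm{ref}}(p_{\theta}(x))\big)\Big] \\
  &= \mathbb{E}_{x\sim d_{0}}\Big[\psi^{\prime}\big(F_{\mathrm{ref}}(p_{\theta}(x))\big)\,
       f_{\mathrm{ref}}\big(p_{\theta}(x)\big)\,\nabla_{\theta}p_{\theta}(x)\Big], \\
\text{and for } \psi(u)=\log u:\quad
\psi^{\prime}(u) &= \frac{1}{u}
  \;\;\Longrightarrow\;\;
  w_{\theta}(x)=\frac{f_{\mathrm{ref}}\big(p_{\theta}(x)\big)}{F_{\mathrm{ref}}\big(p_{\theta}(x)\big)}.
\end{aligned}
```

For a general $\psi$, the weight $\psi^{\prime}(F_{\mathrm{ref}}(p))\,f_{\mathrm{ref}}(p)$ on the second line is the numerator of the relative multiplier $R_{\psi}(p)$ used in Section 4.3.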
#### 4.2 Algorithm: CurveRL
Algorithm 1: CurveRL Update at Step $t$

Require: Batch $\mathcal{B}$ of inputs, number of rollouts $N$, sliding window $W$ with the size $t_{0}\times|\mathcal{B}|$

1. \# Weight Estimation
2. for each $p\in[\frac{1}{N},\frac{2}{N},\ldots,\frac{N-1}{N}]$ do
3. Evaluate density $\hat{f}_{\mathrm{ref}}(p)$ and CDF $\hat{F}_{\mathrm{ref}}(p)$
4. end for
5. \# Gradient Estimation
6. for each input $x\in\mathcal{B}$ do
7. Sample $y_{1},\dots,y_{N}\sim\pi_{\theta}(\cdot\mid x)$
8. for $i=1$ to $N$ do
9. $r_{i}\leftarrow r(x,y_{i})$, $S_{i}\leftarrow\nabla_{\theta}\log\pi_{\theta}(y_{i}\mid x)$
10. end for
11. Estimate $\hat{p}\leftarrow\tfrac{1}{N}\sum_{i=1}^{N}r_{i}$
12. if $\hat{p}\in(0,1)$ then
13. $w_{t}(\hat{p})\leftarrow\hat{f}_{\mathrm{ref}}(\hat{p})/\hat{F}_{\mathrm{ref}}(\hat{p})$
14. $\hat{g}(x)\leftarrow\frac{1}{N}\sum_{i=1}^{N}w_{t}(\hat{p})\left(r_{i}-\hat{p}\right)S_{i}$
15. Append $\hat{p}$ to $W$
16. end if
17. end for
18. Remove pass rates collected at step $t-t_{0}$ from $W$
19. $\hat{g}\leftarrow\frac{1}{|\mathcal{B}|}\sum_{x\in\mathcal{B}}\hat{g}(x)$
20. return $\hat{g}$

In Algorithm [1](https://arxiv.org/html/2605.24331#alg1), we elaborate on the update details of CurveRL at each training step $t$.
##### Sliding Window Estimate.
Denote the input batch $\mathcal{B}$ and number of rollouts $N$. We initialize a queue $W$ with the size $t_{0}\times|\mathcal{B}|$, which stores the pass rates collected from step $t-t_{0}$ to $t-1$. At step $t$, we estimate $F_{\mathrm{ref}}(p)$ and $f_{\mathrm{ref}}(p)$ via a histogram estimator from this lagged window for each $p\in[\frac{1}{N},\frac{2}{N},\ldots,\frac{N-1}{N}]$. Therefore, $F_{\mathrm{ref}}$ is determined by old policies in the last $t_{0}$ steps.
##### Weight and Gradient Estimation.
Next, for each prompt $x$ in the batch $\mathcal{B}$, we perform $N$ response attempts and estimate the pass rate $\hat{p}$. We evaluate $\hat{F}_{\mathrm{ref}}(\hat{p})$ and $\hat{f}_{\mathrm{ref}}(\hat{p})$ in the lagged window. We further use $\hat{p}$ as the group-level baseline, which reduces the variance of the policy gradient estimator. Finally, we maintain $W$ as an active-pass-rate window, appending only prompts with non-vanishing gradients, i.e., $\hat{p}\in(0,1)$. We also remove pass rates collected at step $t-t_{0}$, ensuring a length-$t_{0}$ sliding window estimation for $F_{\mathrm{ref}}$ in $W$. Thus, $t_{0}$ is the main hyperparameter in CurveRL.
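The window and weight estimation described above can be sketched as follows. This is a minimal illustration, not the released implementation: the class `PassRateWindow`, the function `curve_advantages`, the use of histogram mass (rather than a bin-width-normalized density) for $\hat{f}_{\mathrm{ref}}$, and the uniform fallback for an empty window are all assumptions made here.

```python
from collections import deque

import numpy as np


class PassRateWindow:
    """Sliding window over the last t0 steps of active pass rates (hypothetical helper)."""

    def __init__(self, t0: int, num_rollouts: int):
        self.n = num_rollouts
        self.steps = deque(maxlen=t0)  # one list of grid indices per training step

    def push_step(self, pass_rates):
        # Keep only prompts with non-vanishing gradient, i.e. 0 < p_hat < 1,
        # and store them as grid indices k with p_hat = k / N.
        ks = [min(max(round(p * self.n), 1), self.n - 1)
              for p in pass_rates if 0.0 < p < 1.0]
        self.steps.append(ks)  # the deque drops the oldest step automatically

    def weights(self):
        """Return w with w[k] = f_hat(k/N) / F_hat(k/N) for k = 1..N-1 (w[0] is unused)."""
        counts = np.zeros(self.n)
        for ks in self.steps:
            for k in ks:
                counts[k] += 1.0
        total = counts.sum()
        if total == 0:
            return np.ones(self.n)  # assumption: uniform weights before any data arrives
        f_hat = counts / total      # histogram mass at each grid point
        F_hat = np.cumsum(f_hat)    # empirical CDF on the grid
        w = np.zeros(self.n)
        mask = F_hat > 0
        w[mask] = f_hat[mask] / F_hat[mask]
        return w


def curve_advantages(rewards, w):
    """Per-response coefficients w(p_hat) * (r_i - p_hat) that multiply grad log pi."""
    r = np.asarray(rewards, dtype=float)
    p_hat = r.mean()
    k = int(round(p_hat * len(r)))
    if k == 0 or k == len(r):
        return np.zeros_like(r)  # all-wrong or all-correct groups carry no gradient
    return w[k] * (r - p_hat)
```

With a raw histogram, a pass-rate level that never appeared in the window receives weight zero, so a practical implementation may want some smoothing; whether and how the released code does this is not stated in this section.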
#### 4.3 Theoretical Interpretation and Advantages
##### A Unified Explanation of Prompt Reweighting as a Prior $F_{\mathrm{ref}}$.
The distribution-aware utility in Eq. ([17](https://arxiv.org/html/2605.24331#S4.E17)) provides a unified interpretation of pointwise prompt-reweighted algorithms. Specializing to the log distortion $\psi(u)=\log u$ in Eq. ([19](https://arxiv.org/html/2605.24331#S4.E19)), Theorem [1](https://arxiv.org/html/2605.24331#Thmtheorem1) shows that any pointwise weight $w_{\theta}(x)$, e.g., those in REINFORCE, GRPO, and MaxRL, can be equivalently represented by a prior distribution $F_{\mathrm{ref}}$ in the gradient update. In contrast, CurveRL estimates $F_{\mathrm{ref}}$ from the evolving pass-rate distribution and is thus adaptive and data-driven. The proof of Theorem [1](https://arxiv.org/html/2605.24331#Thmtheorem1) is given in Appendix [A.4](https://arxiv.org/html/2605.24331#A1.SS4).
###### Theorem 1 (Pointwise Weight Induces a Prior $F_{\mathrm{ref}}$).
Let $p=p_{\theta}(x)$ and write the pointwise weight as $w_{\theta}(x)=g^{\prime}(p_{\theta}(x))=:w(p)$ in the pass-rate space. Assume $\int_{p}^{1}w(t)\,dt<\infty$ for any $p\in(0,1]$. Under the distortion $\psi(u)=\log u$ in Eq. ([19](https://arxiv.org/html/2605.24331#S4.E19)), $w(p)=f_{\mathrm{ref}}\left(p\right)/F_{\mathrm{ref}}\left(p\right)$ admits a unique $F_{\mathrm{ref}}$:
$$F_{\mathrm{ref}}(p)=\exp\left(-\int_{p}^{1}w(t)\,dt\right). \tag{20}$$
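As a worked illustration of Eq. (20), the implied priors for the three pointwise weights listed in Example 1 can be computed in closed form. This is a direct calculation from Eq. (20) and is not reproduced from the paper's appendix:

```latex
\begin{aligned}
\text{REINFORCE: } w(t)=1
  &\;\Longrightarrow\; F_{\mathrm{ref}}(p)=\exp\!\big(-(1-p)\big)=e^{\,p-1}, \\
\text{MaxRL: } w(t)=\frac{1}{t}
  &\;\Longrightarrow\; F_{\mathrm{ref}}(p)=\exp\!\big(\log p\big)=p, \\
\text{GRPO: } w(t)=\frac{1}{\sqrt{t(1-t)}}
  &\;\Longrightarrow\; \int_{p}^{1}\frac{dt}{\sqrt{t(1-t)}}=\pi-2\arcsin\sqrt{p},
  \quad F_{\mathrm{ref}}(p)=\exp\!\big(2\arcsin\sqrt{p}-\pi\big).
\end{aligned}
```

In each case $\frac{d}{dp}\log F_{\mathrm{ref}}(p)=w(p)$ and $F_{\mathrm{ref}}(1)=1$. The MaxRL weight corresponds exactly to the $\mathrm{Uniform}(0,1)$ prior, while for REINFORCE and GRPO $F_{\mathrm{ref}}(0)$ equals $e^{-1}$ and $e^{-\pi}$ respectively, so the implied prior places a point mass at $p=0$.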
Next, we consider the degenerate case of our distribution-aware prompt reweighting in terms of $F_{\mathrm{ref}}$. In [Corollary 1](https://arxiv.org/html/2605.24331#Thmcorollary1), we show that $F_{\mathrm{ref}}$ is $\mathrm{Uniform}(0,1)$, i.e., $F_{\mathrm{ref}}(t)=t$, if and only if the gradients of the distribution-aware and pointwise utility functionals coincide under the same distortion function $\psi$, i.e., $\nabla_{\theta}\mathcal{U}_{\theta}(F_{\mathrm{ref}})=\nabla_{\theta}J_{\psi}(\theta)$.
###### Corollary 1 (Degeneration of Distribution-Aware Prompt Reweighting).
Consider the distribution-aware utility functional $\mathcal{U}_{\theta}(F_{\mathrm{ref}})$ and the pointwise utility functional $J_{\psi}(\theta)$ with the same distortion function $\psi$. Then,
$$\nabla_{\theta}\mathcal{U}_{\theta}(F_{\mathrm{ref}})=\nabla_{\theta}J_{\psi}(\theta)\quad\text{if and only if}\quad F_{\mathrm{ref}}(t)=t,\quad t\in[0,1]. \tag{21}$$
The proof of [Corollary 1](https://arxiv.org/html/2605.24331#Thmcorollary1) is provided in Appendix [D.2](https://arxiv.org/html/2605.24331#A4.SS2). This identifies the uniform reference distribution as the boundary case in which the quantile transform does not change the gradient direction. Away from this boundary, the effect of distribution-aware reweighting can be characterized by a relative multiplier $R_{\psi}(p)=\psi^{\prime}(F_{\mathrm{ref}}(p))f_{\mathrm{ref}}(p)/\psi^{\prime}(p)$.
##### Sufficient Conditions for the Comparison of Weight Magnitude\.
We further derive sufficient conditions to compare the weight magnitude between distribution-aware and pointwise prompt weighting methods based on $R_{\psi}(p)$. In particular, when $R^{\prime}_{\psi}(p)<0$, our distribution-aware method assigns relatively more normalized weight to low-pass-rate prompts than its pointwise counterpart without $F_{\mathrm{ref}}$, reflecting more risk-seeking behavior, i.e., more aggressive reweighting toward low-pass-rate prompts. The detailed discussion is provided in Appendix [D.3](https://arxiv.org/html/2605.24331#A4.SS3).
##### Advantage 1: Distribution\-aware and Data\-Driven Reweighting\.
The classical weight function derived from a pointwise utility functional relies solely on the local information, i.e., the raw value of $p_{\theta}(x)$ at each prompt. By contrast, our weight function is derived from a distribution-aware utility functional that characterizes the geometry (e.g., rank and density in Eq. ([19](https://arxiv.org/html/2605.24331#S4.E19))) of the pass-rate function space, based on the functional derivative in Definition [1](https://arxiv.org/html/2605.24331#Thmdefinition1). More importantly, as shown in Theorem [1](https://arxiv.org/html/2605.24331#Thmtheorem1), pointwise weights can be treated as special fixed priors $F_{\mathrm{ref}}$, while the weight function of CurveRL in Eq. ([19](https://arxiv.org/html/2605.24331#S4.E19)) is *adaptive and data-driven*: it is updated periodically via non-parametric statistical estimators, e.g., a histogram (see Section [4.2](https://arxiv.org/html/2605.24331#S4.SS2)). Our method is also more adaptive than the exogenous prompt schedule in curriculum learning (see Appendix [D.1](https://arxiv.org/html/2605.24331#A4.SS1)).
##### Advantage 2: Mitigating Weight Collapse by Introducing Quantile Coordinate\.
As mentioned in Section [4.1](https://arxiv.org/html/2605.24331#S4.SS1), the pointwise utility function suffers from weight collapse when $p_{\theta}(x)\approx 0$ or $1$, so the weights barely distinguish prompts with different learnability. By mapping the absolute value of pass rates to the CDF/quantile coordinate via $F_{\mathrm{ref}}$ in Eq. ([17](https://arxiv.org/html/2605.24331#S4.E17)), CurveRL becomes robust to miscalibration of pass rates and enjoys a *monotone calibration invariance* property (see Appendix [A.5](https://arxiv.org/html/2605.24331#A1.SS5)). Unlike the pass-rate space, which evolves non-stationarily during learning, the CDF/quantile space via $F_{\mathrm{ref}}(p_{\theta}(x))$ in CurveRL induces a more uniform difficulty spectrum and is thus more stationary, since $F_{\mathrm{ref}}$ is close to $F_{\theta}$, as analyzed in Proposition [2](https://arxiv.org/html/2605.24331#Thmproposition2). This potentially leads to more stable optimization.
### 5 Experiments
##### Experimental Setups.
We train Qwen3-1.7B-Base and Qwen3-4B-Base on POLARIS-53K (An et al., [2025](https://arxiv.org/html/2605.24331#bib.bib125)), a corpus of approximately 53K mathematical reasoning prompts, using the verl framework (Sheng et al., [2025](https://arxiv.org/html/2605.24331#bib.bib73)). We compare CurveRL with GRPO (Shao et al., [2024](https://arxiv.org/html/2605.24331#bib.bib37)) and MaxRL (Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7)). All methods use a shared training loop and differ only in the prompt-weighting rule. We adopt the batch size $|\mathcal{B}|=256$, the number of rollouts $N=8$ for all algorithms, and $t_{0}=10$ in Algorithm [1](https://arxiv.org/html/2605.24331#alg1) for CurveRL. Training-time correctness is verified using Math-Verify (Kydlíček, [2025](https://arxiv.org/html/2605.24331#bib.bib126)). In evaluation, we use $\mathrm{pass@}k$ (Chen et al., [2021](https://arxiv.org/html/2605.24331#bib.bib127)) as the primary metric, with $k\in\{1,2,4,\dots,1024\}$ for Qwen3-1.7B-Base and $k\in\{1,2,4,\dots,512\}$ for Qwen3-4B-Base. $\mathrm{pass@}k$ measures the probability that at least one of $k$ independently sampled responses is correct, serving as a proxy for a model's capability under a fixed sampling budget. In practice, $\mathrm{pass@}1$ is the raw mean accuracy, while $\mathrm{pass@}k$ for $k\geq 2$ uses $1000$ best-of-$k$ bootstrap resamples (with replacement) of size $k$, averaged across prompts. We perform $2048$ rollouts per prompt for Qwen3-1.7B-Base and $1024$ for Qwen3-4B-Base to reduce estimation variance. More implementation details are provided in Appendix [B](https://arxiv.org/html/2605.24331#A2).
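The bootstrap evaluation of $\mathrm{pass@}k$ described above can be sketched as follows. This is an illustrative reading of the protocol rather than the authors' evaluation script: the function name `bootstrap_pass_at_k`, the array layout (one row of 0/1 correctness flags per prompt), and the fixed seed are assumptions.

```python
import numpy as np


def bootstrap_pass_at_k(correct, k, num_resamples=1000, seed=0):
    """Bootstrap pass@k from a (num_prompts, num_rollouts) array of 0/1 correctness flags."""
    correct = np.asarray(correct, dtype=bool)
    if k == 1:
        return float(correct.mean())  # pass@1 is the raw mean accuracy
    rng = np.random.default_rng(seed)
    num_prompts, num_rollouts = correct.shape
    per_prompt = np.empty(num_prompts)
    for j in range(num_prompts):
        # Draw num_resamples groups of k rollouts, with replacement.
        idx = rng.integers(0, num_rollouts, size=(num_resamples, k))
        # A resample counts as a success if any of its k rollouts is correct.
        per_prompt[j] = correct[j][idx].any(axis=1).mean()
    return float(per_prompt.mean())  # average across prompts
```

For a prompt with empirical pass rate $\hat{p}$, the expected value of this with-replacement estimate is $1-(1-\hat{p})^{k}$, so the bootstrap noise shrinks as the number of resamples grows.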
##### Evaluation\.
We evaluate trained models on eight challenging mathematical reasoning benchmarks: AIME 2025 (Balunović et al., [2025](https://arxiv.org/html/2605.24331#bib.bib128)), BeyondAIME (Seed et al., [2025](https://arxiv.org/html/2605.24331#bib.bib135)), HMMT 02/25 (Balunović et al., [2025](https://arxiv.org/html/2605.24331#bib.bib128)), HMMT 02/26 (Balunović et al., [2025](https://arxiv.org/html/2605.24331#bib.bib128)), and MATH-500 (Hendrycks et al., [2021](https://arxiv.org/html/2605.24331#bib.bib74)) in the main content, and BRUMO 2025 (Balunović et al., [2025](https://arxiv.org/html/2605.24331#bib.bib128)), HMMT 11/25 (Balunović et al., [2025](https://arxiv.org/html/2605.24331#bib.bib128)), and Minerva (Lewkowycz et al., [2022](https://arxiv.org/html/2605.24331#bib.bib85)) in Appendix [C](https://arxiv.org/html/2605.24331#A3).
#### 5.1 Main Results
Table 1: Main results on five math reasoning benchmarks. We report pass@1 and pass@64 (%) for each method on Qwen3-1.7B-Base and Qwen3-4B-Base. The best per column is in bold.
##### Improved Trade-off Frontier.
[Table 1](https://arxiv.org/html/2605.24331#S5.T1) reports evaluation results on five reasoning benchmarks, with similar results on the remaining three benchmarks deferred to Appendix [C.1](https://arxiv.org/html/2605.24331#A3.SS1). Across both Qwen3-1.7B-Base and Qwen3-4B-Base, CurveRL achieves the highest pass@$64$ on all benchmarks, indicating stronger performance on hard prompts. Compared with MaxRL, the strongest baseline, it improves average pass@$64$ by $\mathbf{+5.9\%}$ and $\mathbf{+9.7\%}$, respectively. Importantly, CurveRL also achieves a higher average pass@$1$ on both model sizes, indicating that its improvement in Best-of-$N$ performance does not come at the expense of single-shot accuracy. This mitigates the pass@$k$ degradation issue observed in prior work (Yue et al., [2025](https://arxiv.org/html/2605.24331#bib.bib86); Chen et al., [2025c](https://arxiv.org/html/2605.24331#bib.bib124)). Overall, our results show that the distribution-aware reweighting of CurveRL improves the Pareto frontier between pass@$1$ and pass@$k$ over the existing pointwise reweighting methods.
Figure 1: Pass@$k$ scaling on five representative benchmarks. Top row: Qwen3-1.7B-Base, $k\in\{1,\dots,1024\}$. Bottom row: Qwen3-4B-Base, $k\in\{1,\dots,512\}$. CurveRL outperforms GRPO and MaxRL across the full range of $k$ on both model sizes, and exceeds the pretrained base model on most panels.
##### Analysis of Pass@$k$ Scaling.
Following prior work (Yue et al., [2025](https://arxiv.org/html/2605.24331#bib.bib86); Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7); Wu et al., [2025](https://arxiv.org/html/2605.24331#bib.bib134)), we further examine the capability boundary of each algorithm through pass@$k$ scaling over a wide range of $k$. [Figure 1](https://arxiv.org/html/2605.24331#S5.F1) reports results on five representative benchmarks, while the remaining three benchmarks in [Figure 5](https://arxiv.org/html/2605.24331#A3.F5) of Appendix [C.2](https://arxiv.org/html/2605.24331#A3.SS2) follow similar trends. CurveRL consistently outperforms the other algorithms across a wide range of pass@$k$ values in most settings, indicating a higher upper bound on reasoning capability. Consistent with prior observations (Yue et al., [2025](https://arxiv.org/html/2605.24331#bib.bib86)), GRPO and MaxRL exhibit varying degrees of pass@$k$ degradation relative to the pretrained base model, whereas CurveRL exceeds the pretrained base model in $9$ out of $10$ panels of [Figure 1](https://arxiv.org/html/2605.24331#S5.F1) across both model sizes and all values of $k$. Meanwhile, CurveRL's advantage over the strongest baseline, MaxRL, grows as both the model scale and $k$ increase on challenging reasoning benchmarks, reaching approximately $\mathbf{+7.3\%}$ on HMMT Feb 2026 at $k=1024$ with Qwen3-1.7B-Base and $\mathbf{+26.8\%}$ on HMMT Feb 2025 at $k=512$ with Qwen3-4B-Base. This widening advantage suggests that CurveRL effectively broadens the search space of reasoning trajectories, enabling the policy to discover rare correct solutions.
This property may further benefit both reward-free and reward-guided test-time scaling methods, including scalable Best-of-$N$ selection with strong statistical signals (Kang et al., [2025](https://arxiv.org/html/2605.24331#bib.bib131); Fu et al., [2026](https://arxiv.org/html/2605.24331#bib.bib132)), lightweight probing (Guo et al., [2026](https://arxiv.org/html/2605.24331#bib.bib133)), and process reward models (PRMs) (Lightman et al., [2023](https://arxiv.org/html/2605.24331#bib.bib130)).
#### 5\.2 Mechanism Analysis of CurveRL
##### Difficulty Distribution Analysis\.
We investigate how each post\-training algorithm changes the resulting model’s capability over prompts with distinct difficulty levels\. We partition prompts by empirical pass rate $\hat{p}$ over 2048 rollouts into four groups with different prior difficulty levels: *unsolvable* \($\hat{p}=0$\), *hard* \($\hat{p}\in(0,1/2]$\), *medium* \($\hat{p}\in(1/2,1)$\), and *easy* \($\hat{p}=1$\)\. As illustrated in [Figure 2](https://arxiv.org/html/2605.24331#S5.F2) on two benchmarks, BRUMO 2025 and MATH\-500, CurveRL consistently reduces the fraction of *unsolvable* prompts, especially on BRUMO 2025\. This indicates that CurveRL has a larger potential to expand the real solvability boundary by making more unsolvable problems tractable\.
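The four-way partition above can be written down directly. The sketch below assumes each prompt is summarized by its number of correct rollouts out of 2048; the helper names `difficulty_group` and `group_fractions` are hypothetical.

```python
from collections import Counter

GROUPS = ("unsolvable", "hard", "medium", "easy")


def difficulty_group(p_hat: float) -> str:
    """Map an empirical pass rate to one of the four difficulty groups."""
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    if p_hat == 0.0:
        return "unsolvable"
    if p_hat == 1.0:
        return "easy"
    # (0, 1/2] is hard, (1/2, 1) is medium.
    return "hard" if p_hat <= 0.5 else "medium"


def group_fractions(correct_counts, n_rollouts: int = 2048) -> dict:
    """Fraction of prompts falling into each difficulty group."""
    counts = Counter(difficulty_group(c / n_rollouts) for c in correct_counts)
    total = len(correct_counts)
    return {g: counts.get(g, 0) / total for g in GROUPS}
```

Comparing `group_fractions` across two trained models on the same benchmark gives the kind of per-group comparison described in the text.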
Figure 2: Prompt Distribution across Difficulty in Qwen3\-1\.7B\-Base\.

Table 2: Majority@$2048$ on Qwen3\-1\.7B\-Base\.

Moreover, CurveRL does not merely convert unsolvable prompts into barely solvable ones\. It shifts a larger fraction of prompts into the *medium* regime, where $\hat{p}>1/2$ and majority voting \(Wang et al\., [2023](https://arxiv.org/html/2605.24331#bib.bib129)\) becomes statistically reliable: repeated sampling can amplify individual correctness without requiring additional reward signals\. This explains the stronger reward\-free test\-time scaling of CurveRL, as reflected by the higher majority@$2048$ scores on BRUMO 2025 and MATH\-500 in [Table 2](https://arxiv.org/html/2605.24331#S5.T2)\.
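To see why $\hat{p}>1/2$ is the threshold at which majority voting amplifies correctness, consider a deliberately simplified model that is our assumption and not the paper's: each rollout is independently correct with probability $p$, and all incorrect rollouts agree on a single wrong answer. The vote is then correct exactly when more than half of the $n$ rollouts are correct.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """P(strictly more than n/2 of n i.i.d. rollouts are correct).

    Simplified two-answer model; for very large n the binomial terms
    should be evaluated in log space to avoid overflow.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return sum(
        comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(n // 2 + 1, n + 1)
    )
```

Under this model, `majority_correct_prob(0.6, n)` grows toward 1 as `n` increases, while `majority_correct_prob(0.4, n)` shrinks toward 0. In practice wrong answers are usually spread over several candidates, so this two-answer model is conservative for prompts with $p<1/2$.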
##### Distribution\-Aware and Data\-Driven Weighting\.
[Figure 3](https://arxiv.org/html/2605.24331#S5.F3) visualizes how CurveRL differs from pointwise weighting algorithms from the perspective of prompt weights in the pass\-rate space\.
1. The left panel shows the evolution of the empirical reference density $\hat{f}_{\mathrm{ref}}(\hat{p})$ during training: the pass\-rate distribution shifts from a sharp concentration near the lowest pass\-rate bins toward higher\-pass\-rate regions, becoming smoother and more spread out\. This provides empirical evidence for the distribution\-transport view in Eq\. \([10](https://arxiv.org/html/2605.24331#S3.E10)\) of [Section 3\.2](https://arxiv.org/html/2605.24331#S3.SS2)\.
2. The center panel shows the resulting weight dynamics $w_t(\hat{p})=\hat{f}_{\mathrm{ref}}(\hat{p})/\hat{F}_{\mathrm{ref}}(\hat{p})$ in CurveRL, which adapt to the evolving estimates of the reference distribution $F_{\mathrm{ref}}$ and its density $f_{\mathrm{ref}}$\. The CDF denominator preserves an emphasis on lower\-quantile prompts, while the density numerator makes the allocation data\-driven by assigning weights to pass\-rate regions that are populated under the current policy\.
3. The right panel contrasts this adaptive behavior with the static pointwise weights of GRPO and MaxRL, which can also be interpreted as a fixed prior in [Theorem 1](https://arxiv.org/html/2605.24331#Thmtheorem1)\. With a symmetric weight $1/\sqrt{\hat{p}(1-\hat{p})}$, GRPO assigns large weights to both low\- and high\-pass\-rate bins, while MaxRL employs $1/\hat{p}$ and monotonically emphasizes prompts with smaller pass rates\.
Figure 3: Distribution\-aware and data\-driven weighting of CurveRL on Qwen3\-4B\-Base\. *Left*: dynamics of CurveRL’s empirical pass\-rate density $\hat{f}_{\mathrm{ref}}(\hat{p})$, smoothed by interpolation\. *Center*: dynamics of CurveRL’s adaptive prompt weight $w_t(\hat{p})=\hat{f}_{\mathrm{ref}}(\hat{p})/\hat{F}_{\mathrm{ref}}(\hat{p})$\. *Right*: static GRPO weight $1/\sqrt{\hat{p}(1-\hat{p})}$ and MaxRL weight $1/\hat{p}$, rescaled to $[0,1]$\.
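The contrast between the three panels can be illustrated with a small histogram-based sketch. It computes the ratio $\hat{f}_{\mathrm{ref}}/\hat{F}_{\mathrm{ref}}$ on fixed-width bins and places it next to the static GRPO and MaxRL weights quoted above. The binning scheme, the `eps` guard, and all function names are assumptions made for illustration; the paper's Algorithm 1 and released code may estimate the density and CDF differently (for example with smoothing).

```python
import numpy as np


def density_over_cdf_weights(pass_rates, n_bins=8, eps=1e-8):
    """Histogram estimate of w(p) = f_ref(p) / F_ref(p).

    Returns (per-prompt weights, per-bin weights).
    """
    p = np.asarray(pass_rates, dtype=float)
    if p.size == 0:
        raise ValueError("need at least one pass rate")
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, _ = np.histogram(p, bins=edges)
    mass = counts / counts.sum()       # probability mass per bin
    density = mass * n_bins            # mass divided by bin width 1/n_bins
    cdf = np.cumsum(mass)              # CDF at the right edge of each bin
    bin_weights = density / np.maximum(cdf, eps)
    # Look up each prompt's bin; p = 1.0 falls into the last bin.
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    return bin_weights[idx], bin_weights


def grpo_weight(p, eps=1e-8):
    """Static symmetric weight 1 / sqrt(p (1 - p))."""
    p = np.asarray(p, dtype=float)
    return 1.0 / np.sqrt(np.maximum(p * (1.0 - p), eps))


def maxrl_weight(p, eps=1e-8):
    """Static monotone weight 1 / p."""
    p = np.asarray(p, dtype=float)
    return 1.0 / np.maximum(p, eps)
```

Unlike `grpo_weight` and `maxrl_weight`, the output of `density_over_cdf_weights` for a given pass rate depends on where the other prompts sit: empty bins receive zero weight, and the weight of a populated bin shrinks as more mass accumulates below it.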
##### Sensitivity Analysis of Sliding Window Size $t_0$\.
We further compare our selected sliding\-window size $t_0=10$ batches in Algorithm [1](https://arxiv.org/html/2605.24331#alg1) with two extreme choices: a single\-batch window \($t_0=1$\) and a much wider window \($t_0=50$\)\. [Table 3](https://arxiv.org/html/2605.24331#S5.T3) suggests that CurveRL with $t_0=10$ achieves a better overall trade\-off, obtaining the largest pass@$64$ on all benchmarks while remaining competitive on pass@$1$ relative to the alternatives\. Specifically, a smaller window size can introduce larger variance when estimating the pass\-rate distribution, while a larger window size relies more heavily on older policies and thus introduces bias in estimating the current pass\-rate distribution\. Notably, even with $t_0=1$ or $50$, CurveRL still matches or exceeds the strongest pointwise baseline, MaxRL, on most benchmarks\. Pass@$k$ scaling curves for CurveRL under these two values of $t_0$ are provided in [Figure 9](https://arxiv.org/html/2605.24331#A3.F9) in Appendix [C\.5](https://arxiv.org/html/2605.24331#A3.SS5)\.
Table 3: Sensitivity to sliding window size $t_0$ regarding the pass@1 and pass@64 \(%\) performance on Qwen3\-1\.7B\-Base\.
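The variance and bias trade-off in the choice of $t_0$ comes from pooling pass rates across recent batches. A minimal sketch of such a window is given below, assuming the reference distribution is estimated by a histogram over the pooled pass rates of the last $t_0$ batches; the class `SlidingPassRateWindow` is hypothetical and is not the implementation in Algorithm 1.

```python
from collections import deque

import numpy as np


class SlidingPassRateWindow:
    """Keeps per-prompt pass rates of the most recent t0 batches."""

    def __init__(self, t0: int = 10, n_bins: int = 8):
        if t0 < 1:
            raise ValueError("t0 must be at least 1")
        self.batches = deque(maxlen=t0)  # oldest batch is evicted automatically
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)

    def update(self, batch_pass_rates) -> None:
        """Add the pass rates observed in the newest training batch."""
        self.batches.append(np.asarray(batch_pass_rates, dtype=float))

    def estimate(self):
        """Return (mass per bin, CDF at right bin edges) over the window."""
        if not self.batches:
            raise ValueError("no batches observed yet")
        pooled = np.concatenate(list(self.batches))
        counts, _ = np.histogram(pooled, bins=self.edges)
        mass = counts / counts.sum()
        return mass, np.cumsum(mass)
```

With `t0=1` the estimate uses only the current batch and is therefore noisy, while a large `t0` pools pass rates produced by older policies, which matches the variance and bias explanation in the text.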
### 6 Related Work
##### Action\-level Exploration in Standard RL vs Context\-level Distribution Control in RLVR\.
In standard RL, the state\-visitation distribution is often reshaped indirectly through: \(i\) importance weighting in off\-policy correction \(Sutton et al\., [1998](https://arxiv.org/html/2605.24331#bib.bib29); Mahmood et al\., [2015](https://arxiv.org/html/2605.24331#bib.bib28)\), \(ii\) experiment\-design\-style transition selection \(Mehta et al\., [2022](https://arxiv.org/html/2605.24331#bib.bib39); Blau et al\., [2022](https://arxiv.org/html/2605.24331#bib.bib40)\), \(iii\) goal\-conditioning \(Liu et al\., [2022](https://arxiv.org/html/2605.24331#bib.bib30); Eysenbach et al\., [2022](https://arxiv.org/html/2605.24331#bib.bib31)\), and \(iv\) replay variants such as prioritized experience replay \(Lin, [1992](https://arxiv.org/html/2605.24331#bib.bib33); Schaul et al\., [2015](https://arxiv.org/html/2605.24331#bib.bib34); Andrychowicz et al\., [2017](https://arxiv.org/html/2605.24331#bib.bib35); Kapturowski et al\., [2018](https://arxiv.org/html/2605.24331#bib.bib36); Pan et al\., [2022](https://arxiv.org/html/2605.24331#bib.bib32)\), all of which act on *previously collected* data\. RLVR with prompt reweighting \(Shao et al\., [2024](https://arxiv.org/html/2605.24331#bib.bib37); Guo et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib38); Yu et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib14); Tajwar et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib7); Xiong et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib10); Parashar et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib11)\) instead manipulates the *online* prompt distribution directly\. Action\-level exploration becomes more challenging in this setting, as the trajectory\-level action space is combinatorially large and the contextual\-bandit structure provides no state\-visitation dynamics to leverage \(Cui et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib64); Yue et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib86); Dai et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib16)\)\.
Direct context\-distribution control through prompt reweighting therefore supplies a new axis of information acquisition, which is formalized by the PRCB framework we develop in Section [3\.1](https://arxiv.org/html/2605.24331#S3.SS1)\.
##### Existing Strategies of Context\-level Distribution Control in RLVR\.
There are mainly three classes of methods that exploit the freedom to shape the prompt distribution in RLVR\. Our work casts all three mechanisms inside a single effective context distribution $d_\theta(x)\propto d_0(x)\,w_\theta(x)$ and supplies the optimality principle for $w_\theta$ that has been missing from this line of work\.
- *Sample selection and dynamic filtering*: Yu et al\. \([2025](https://arxiv.org/html/2605.24331#bib.bib14)\) and Mao et al\. \([2026](https://arxiv.org/html/2605.24331#bib.bib15)\) discard prompts whose empirical pass rate equals $0$ or $1$, serving as a heuristic yet effective way to select the most learnable prompts\. Xiong et al\. \([2025](https://arxiv.org/html/2605.24331#bib.bib10)\) focus on adaptively adjusting the number of samples for each prompt and prioritize the harder prompts based on the prompt reweighting framework with non\-linear RL objectives\. We instead view reweighting as a direct context distribution control mechanism that strengthens policy improvement by shaping learning dynamics, and establish the reweighting optimality from the perspective of utility functionals, which are risk\-dependent\.
- *Curriculum strategies*: Curriculum learning in RLVR, rooted in classical curriculum learning \(Bengio et al\., [2009](https://arxiv.org/html/2605.24331#bib.bib12); Narvekar et al\., [2020](https://arxiv.org/html/2605.24331#bib.bib13)\), schedules prompts from easy to hard \(Parashar et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib11); Rajaraman et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib66); Chen et al\., [2025a](https://arxiv.org/html/2605.24331#bib.bib69)\)\. As analyzed in Appendix [D\.1](https://arxiv.org/html/2605.24331#A4.SS1), curriculum learning relies on an exogenous data schedule to help policy optimization chase the moving learnable window, while our method transforms the learnable window into a stationary and nearly fixed one by quantile reparameterization, thus enhancing the learning dynamics\.
- *Prompt reweighting*: Davis and Recht \([2025](https://arxiv.org/html/2605.24331#bib.bib9)\) first formalize the prompt weighting objective for LLM reasoning\. This motivates MaxRL \(Tajwar et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib7)\), which introduces the maximum likelihood principle into the RL objective, and the adaptive sampling framework of Xiong et al\. \([2025](https://arxiv.org/html/2605.24331#bib.bib10)\), which approaches the problem through the lens of adaptive sample selection\. This prompt reweighting framework is general enough to accommodate a flurry of RLVR algorithms, including the pass@K optimization in Walder and Karkhanis \([2025](https://arxiv.org/html/2605.24331#bib.bib65)\)\. From the perspective of the model’s edge of competence, Zhang et al\. \([2025a](https://arxiv.org/html/2605.24331#bib.bib68)\) and Huang et al\. \([2026](https://arxiv.org/html/2605.24331#bib.bib67)\) also show that selecting prompts or manipulating the training distribution is crucial for effective learning\. Our study focuses on this line of work in context distribution control of RLVR, as it allows more principled analysis than sample selection and curriculum learning, which can themselves be considered special cases of prompt reweighting\.
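The effective context distribution $d_\theta(x)\propto d_0(x)\,w_\theta(x)$ referred to above can be made concrete on a finite prompt set. The sketch below normalizes the product of a base distribution and a weight vector, and expresses dynamic filtering as a 0/1 weight on the pass rate. Both function names are hypothetical, and treating filtering as an indicator weight is our reading of the unified view rather than code from the paper.

```python
import numpy as np


def effective_distribution(d0, w):
    """Normalize d0(x) * w(x) into a distribution over a finite prompt set."""
    d0 = np.asarray(d0, dtype=float)
    w = np.asarray(w, dtype=float)
    unnormalized = d0 * w
    total = unnormalized.sum()
    if total <= 0.0:
        raise ValueError("weights remove all probability mass")
    return unnormalized / total


def filter_weight(p_hat):
    """Dynamic filtering as a 0/1 weight: drop prompts with pass rate 0 or 1."""
    p = np.asarray(p_hat, dtype=float)
    return ((p > 0.0) & (p < 1.0)).astype(float)
```

For example, with a uniform `d0` over four prompts whose pass rates are `[0, 0.25, 0.5, 1]`, `filter_weight` places all mass on the two middle prompts, whereas a pointwise weight such as $1/\hat{p}$ on those two prompts would tilt the mass further toward the harder one.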
##### Theoretical Understanding of GRPO and Its Variants\.
A growing GRPO family modifies the algorithm primarily through the denominator and group\-level normalization: Dr\. GRPO \(Liu et al\., [2025](https://arxiv.org/html/2605.24331#bib.bib46)\) removes the standard\-deviation factor to debias the gradient; GVPO \(Zhang et al\., [2025b](https://arxiv.org/html/2605.24331#bib.bib48)\), GPG \(Chu et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib47)\), and follow\-ups \(Ge et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib81); Yang et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib82); Chen et al\., [2025b](https://arxiv.org/html/2605.24331#bib.bib84)\) explore alternative group\-level rescalings\. These modifications are typically motivated by variance reduction or empirical fixes, leaving the form of the denominator unjustified at the population level\. A separate strand analyzes GRPO and REINFORCE through the lens of U\-statistics \(Zhou et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib83)\) and optimization dynamics \(Suk and Duan, [2026](https://arxiv.org/html/2605.24331#bib.bib50); Mroueh, [2025](https://arxiv.org/html/2605.24331#bib.bib49); Huang et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib67)\)\. In particular, Huang et al\. \([2026](https://arxiv.org/html/2605.24331#bib.bib67)\) identify a *replay effect* when the difficulty spectrum of $p_\theta(x)$ is sufficiently smooth\. MaxRL \(Tajwar et al\., [2026](https://arxiv.org/html/2605.24331#bib.bib7)\) imports the maximum likelihood principle into the policy gradient, which seemingly inherits the optimality of maximum likelihood in statistics\. However, we argue this is essentially a coincidence: maximum likelihood enjoys optimality under certain regularity conditions in statistical estimation, whereas policy optimization is mode\-seeking rather than statistical estimation, so the optimality of $\log p_\theta$ does not transfer in a straightforward way\.
Our utility\-functional view recovers MaxRL as one specific pointwise utility with a risk\-seeking preference, and broadly provides a principled framework to explain the optimal prompt reweighting based on the choice of utility functional\.
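The role of the denominator discussed above is easy to see for binary rewards. The sketch below compares group-normalized advantages with and without the standard-deviation factor. It illustrates only that single change, omits clipping, KL terms, and length normalization, and uses hypothetical function names.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def mean_centered_advantages(rewards):
    """Advantages without the standard-deviation factor: r - mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()
```

For 0/1 rewards with empirical pass rate $\hat{p}$, the group standard deviation equals $\sqrt{\hat{p}(1-\hat{p})}$, so the first function is the second rescaled by $1/\sqrt{\hat{p}(1-\hat{p})}$. This is the per-prompt GRPO weight quoted in the discussion of Figure 3, and removing the factor corresponds to a constant weight in $\hat{p}$.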
##### Distributional Structure in RLVR\.
The distributional structure between prompts and their rewards is crucial in algorithm analysis and design\. Barakat et al\. \([2026](https://arxiv.org/html/2605.24331#bib.bib6)\) and Walder and Karkhanis \([2025](https://arxiv.org/html/2605.24331#bib.bib65)\) reveal the conflict between improving pass@$K$ and pass@$1$ induced by prompt interference, which is also supported by a fundamental property of deep nets called local elasticity \(He and Su, [2019](https://arxiv.org/html/2605.24331#bib.bib4); Su, [2024](https://arxiv.org/html/2605.24331#bib.bib5)\)\. This trade\-off also mirrors the well\-known tension between adversarial robustness and clean accuracy \(Tsipras et al\., [2019](https://arxiv.org/html/2605.24331#bib.bib3); Zhang et al\., [2019](https://arxiv.org/html/2605.24331#bib.bib2)\): pushing harder prompts \(akin to adversarial examples\) trades off accuracy on easier ones\. Therefore, it is beneficial to include the correlations between prompts in the algorithm design instead of treating them independently\. Our distribution\-aware method operates across the prompt distribution in the pass\-rate space by leveraging the rank/quantile and density information of $p_\theta(x)$, offering an explicit knob for navigating this trade\-off between hard and easy prompts\. Wu et al\. \([2026](https://arxiv.org/html/2605.24331#bib.bib123)\) propose quantile advantage estimation by replacing the mean with a group\-wise $K$\-quantile baseline to address entropy collapse and entropy explosion, which provides further evidence that leveraging the distributional structure of prompts can be beneficial\.
##### Optimality of Utility Functional and Risk\-sensitive Control\.
Risk\-sensitive RL \(Howard and Matheson, [1972](https://arxiv.org/html/2605.24331#bib.bib8); Mihatsch and Neuneier, [2002](https://arxiv.org/html/2605.24331#bib.bib87); Majumdar et al\., [2017](https://arxiv.org/html/2605.24331#bib.bib122); Wang and Chapman, [2022](https://arxiv.org/html/2605.24331#bib.bib119); Smith and Chapman, [2023](https://arxiv.org/html/2605.24331#bib.bib121); Bäuerle and Jaśkiewicz, [2024](https://arxiv.org/html/2605.24331#bib.bib120)\) replaces the expectation of cumulative rewards with other utility functionals or risk measures, reflecting the principle that there is no universal optimality in this choice, which mainly depends on the designer’s risk attitude\. Distributional RL \(Dabney et al\., [2018](https://arxiv.org/html/2605.24331#bib.bib89); Bellemare et al\., [2023](https://arxiv.org/html/2605.24331#bib.bib116)\) makes this dependence explicit by allowing arbitrary distortion functions\. Our distribution\-aware utility instead consumes the distributional structure of the pass rates, not their absolute magnitudes\. The CDF/quantile coordinate transform with the associated distortion function connects directly to dual utility and spectral risk measures \(Yaari, [1987](https://arxiv.org/html/2605.24331#bib.bib100); Wang, [1996](https://arxiv.org/html/2605.24331#bib.bib102); Acerbi, [2002](https://arxiv.org/html/2605.24331#bib.bib101); Balbás et al\., [2009](https://arxiv.org/html/2605.24331#bib.bib103)\), which provides the risk\-control rationale behind the quantile coordinate transform of Section [4](https://arxiv.org/html/2605.24331#S4)\.
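As a generic illustration of how a distortion function reweights quantiles, the sketch below computes an empirical distorted expectation of a sample: the order statistics $x_{(1)}\le\dots\le x_{(n)}$ receive weights $g(i/n)-g((i-1)/n)$ for a distortion $g$ with $g(0)=0$ and $g(1)=1$. The convention of distorting the CDF, the example distortions, and the function name are assumptions chosen for illustration. This is not the specific utility functional used by CurveRL, which is defined in Section 4 of the paper.

```python
import numpy as np


def distorted_expectation(samples, g):
    """Empirical distorted expectation: sum_i [g(i/n) - g((i-1)/n)] * x_(i).

    g is a nondecreasing distortion on [0, 1] with g(0) = 0 and g(1) = 1;
    the identity distortion recovers the ordinary sample mean.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    if n == 0:
        raise ValueError("need at least one sample")
    u = np.arange(n + 1) / n
    quantile_weights = np.diff(g(u))  # weight given to each order statistic
    return float(np.dot(quantile_weights, x))
```

Under this convention a concave distortion such as `np.sqrt` places more weight on the lower quantiles, a convex one such as `lambda u: u**2` places more weight on the upper quantiles, and the result depends on the samples only through their ranks and values, not through any pointwise transformation chosen in advance.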
### 7 Discussion and Conclusion
Our study suggests that the prompt reweighting in RLVR should be understood as principled context distribution control rather than heuristic modifications of weights in policy gradients\. Under this view, there is no universally optimal prompt weight independent of the training objective\. Instead, the optimal weight is utility\-dependent and determined by the marginal value of improving each prompt’s pass rate, which is formulated by the functional derivative\. By using the quantile coordinate transform, CurveRL instantiates context distribution control with a distribution\-aware utility functional in the pass\-rate quantile space, allowing the prompt measure to adapt to the model’s current competence profile through the distributional structure of pass rates independent of existing pointwise rules\.
##### Limitations and Future Work\.
A natural extension of our method is to integrate pointwise and distribution\-aware utility functionals into more expressive prompt weighting schemes\. We explored two such integrated strategies in Appendix [D\.5](https://arxiv.org/html/2605.24331#A4.SS5), but did not observe empirical improvements\. This suggests that the two utility functional classes may induce different geometries in the pass\-rate space, and simply adding or multiplying them may not reconcile their optimization effects\. A rigorous characterization of these geometries and their impact on learning dynamics remains an important direction for future work\. Moreover, defining weight optimality directly in the prompt space, rather than in the pass\-rate space, would additionally capture the effect of parameter updates, although computing functional derivatives over high\-dimensional prompt representations is generally intractable\.
### Acknowledgments
This work was supported in part by NIH grants R01EB036016, R01EB037101, and R01MH143267, NSF grant DMS\-2310679, a Meta Faculty Research Award, and Wharton AI for Business\. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH\.
### References
- C\. Acerbi \(2002\) Spectral measures of risk: a coherent representation of subjective risk aversion\. Journal of banking & finance 26\(7\), pp\. 1505–1518\. Cited by: [§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px2.p1.20), [§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- C\. An, Z\. Xie, X\. Li, L\. Li, J\. Zhang, S\. Gong, M\. Zhong, J\. Xu, X\. Qiu, M\. Wang, and L\. Kong \(2025\)Polaris: a post\-training recipe for scaling reinforcement learning on advanced reasoning models\.Note:[https://hkunlp\.github\.io/blog/2025/Polaris](https://hkunlp.github.io/blog/2025/Polaris)Cited by:[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px1.p1.16)\.
- M\. Andrychowicz, F\. Wolski, A\. Ray, J\. Schneider, R\. Fong, P\. Welinder, B\. McGrew, J\. Tobin, O\. Pieter Abbeel, and W\. Zaremba \(2017\)Hindsight experience replay\.Advances in neural information processing systems30\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- S\. Balakrishnan, Q\. P\. Nguyen, B\. K\. H\. Low, and H\. Soh \(2020\)Efficient exploration of reward functions in inverse reinforcement learning via bayesian optimization\.Advances in Neural Information Processing Systems33,pp\. 4187–4198\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3)\.
- A\. Balbás, J\. Garrido, and S\. Mayoral \(2009\)Properties of distortion risk measures\.Methodology and Computing in Applied Probability11\(3\),pp\. 385–399\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px2.p1.20),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- M\. Balunović, J\. Dekoninck, I\. Petrov, N\. Jovanović, and M\. Vechev \(2025\)MathArena: evaluating llms on uncontaminated math competitions\.arXiv preprint arXiv:2505\.23281\.Cited by:[§C\.1](https://arxiv.org/html/2605.24331#A3.SS1.p1.3),[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Barakat, S\. Chakraborty, K\. Pahwa, and A\. S\. Bedi \(2026\)Why pass@ k optimization can degrade pass@ 1: prompt interference in llm post\-training\.arXiv preprint arXiv:2602\.21189\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px1.p1.12),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px4.p1.4)\.
- N\. Bäuerle and A\. Jaśkiewicz \(2024\)Markov decision processes with risk\-sensitive criteria: an overview\.Mathematical Methods of Operations Research99\(1\),pp\. 141–178\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- M\. G\. Bellemare, W\. Dabney, and M\. Rowland \(2023\)Distributional reinforcement learning\.MIT Press\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- Y\. Bengio, J\. Louradour, R\. Collobert, and J\. Weston \(2009\)Curriculum learning\.InProceedings of the 26th annual international conference on machine learning,pp\. 41–48\.Cited by:[2nd item](https://arxiv.org/html/2605.24331#S6.I1.i2.p1.1)\.
- C\. M\. Bishop and N\. M\. Nasrabadi \(2006\)Pattern recognition and machine learning\.Vol\.4,Springer\.Cited by:[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Blau, E\. V\. Bonilla, I\. Chades, and A\. Dezfouli \(2022\)Optimizing sequential experimental design with deep reinforcement learning\.InInternational conference on machine learning,pp\. 2107–2128\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- H\. W\. Block, T\. H\. Savits, and H\. Singh \(1998\)The reversed hazard rate function\.Probability in the Engineering and informational Sciences12\(1\),pp\. 69–90\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px4.p1.6)\.
- H\. Brezis and H\. Brézis \(2011\)Functional analysis, sobolev spaces and partial differential equations\.Vol\.2,Springer\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- E\. Brochu, V\. M\. Cora, and N\. De Freitas \(2010\)A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning\.arXiv preprint arXiv:1012\.2599\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3)\.
- G\. Casella and R\. Berger \(2024\)Statistical inference\.Chapman and Hall/CRC\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px2.p1.20)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px1.p1.18),[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px1.p1.16)\.
- X\. Chen, J\. Lu, M\. Kim, D\. Zhang, J\. Tang, A\. Piché, N\. Gontier, Y\. Bengio, and E\. Kamalloo \(2025a\)Self\-evolving curriculum for llm reasoning\.arXiv preprint arXiv:2505\.14970\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[2nd item](https://arxiv.org/html/2605.24331#S6.I1.i2.p1.1)\.
- X\. Chen, X\. Li, Z\. Sun, and W\. Yu \(2025b\)Beyond high\-entropy exploration: correctness\-aware low\-entropy segment\-based advantage shaping for reasoning llms\.arXiv preprint arXiv:2512\.00908\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- Z\. Chen, X\. Qin, Y\. Wu, Y\. Ling, Q\. Ye, W\. X\. Zhao, and G\. Shi \(2025c\)Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models\.arXiv preprint arXiv:2508\.10751\.Cited by:[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px1.p1.9)\.
- X\. Chu, H\. Huang, X\. Zhang, F\. Wei, and Y\. Wang \(2026\)GPG: a simple and strong reinforcement learning baseline for model reasoning\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- G\. Cui, Y\. Zhang, J\. Chen, L\. Yuan, Z\. Wang, Y\. Zuo, H\. Li, Y\. Fan, H\. Chen, W\. Chen,et al\.\(2025\)The entropy mechanism of reinforcement learning for reasoning language models\.arXiv preprint arXiv:2505\.22617\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- W\. Dabney, G\. Ostrovski, D\. Silver, and R\. Munos \(2018\)Implicit quantile networks for distributional reinforcement learning\.InInternational conference on machine learning,pp\. 1096–1105\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- R\. Dai, L\. Song, H\. Liu, Z\. Liang, D\. Yu, H\. Mi, Z\. Tu, R\. Liu, T\. Zheng, H\. Zhu,et al\.\(2026\)CDE: curiosity\-driven exploration for efficient reinforcement learning in large language models\.InInternational Conference on Learning Representations,Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- D\. Davis and B\. Recht \(2025\)What is the objective of reasoning with reinforcement learning?\.arXiv preprint arXiv:2510\.13651\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px2.p1.17),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px2.p1.18),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px2.p1.5),[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px1.p1.6),[3rd item](https://arxiv.org/html/2605.24331#S6.I1.i3.p1.1)\.
- A\. A\. Deshmukh, S\. Sharma, J\. W\. Cutler, M\. Moldwin, and C\. Scott \(2018\)Simple regret minimization for contextual bandits\.arXiv preprint arXiv:1810\.07371\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px3.p1.6)\.
- J\. Dhaene, A\. Kukush, D\. Linders, and Q\. Tang \(2012\)Remarks on quantiles and distortion risk measures\.European Actuarial Journal2\(2\),pp\. 319–328\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px2.p1.20)\.
- L\. C\. Evans \(2022\)Partial differential equations\.Vol\.19,American mathematical society\.Cited by:[§A\.1](https://arxiv.org/html/2605.24331#A1.SS1.p1.2),[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p1.22),[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- L\. C\. Evans \(2025\)Measure theory and fine properties of functions\.Chapman and Hall/CRC\.Cited by:[§A\.2](https://arxiv.org/html/2605.24331#A1.SS2.SSS0.Px2.p1.10),[§A\.2](https://arxiv.org/html/2605.24331#A1.SS2.SSS0.Px2.p1.8)\.
- B\. Eysenbach, T\. Zhang, S\. Levine, and R\. R\. Salakhutdinov \(2022\)Contrastive learning as goal\-conditioned reinforcement learning\.Advances in Neural Information Processing Systems35,pp\. 35603–35620\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- M\. S\. Finkelstein \(2002\)On the reversed hazard rate\.Reliability Engineering & System Safety78\(1\),pp\. 71–75\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px4.p1.6)\.
- P\. I\. Frazier \(2018\)A tutorial on bayesian optimization\.arXiv preprint arXiv:1807\.02811\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3)\.
- Y\. Fu, X\. Wang, H\. Zhang, Y\. Tian, and J\. Zhao \(2026\)Deep think with confidence\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=8LqHs0KIM7)Cited by:[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px2.p1.13)\.
- C\. Ge, C\. H\. Yin, H\. Liang, and J\. Zhang \(2026\)Why grpo needs normalization: a local\-curvature perspective on adaptive gradients\.arXiv preprint arXiv:2601\.23135\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- I\. M\. Gelfand, R\. A\. Silverman,et al\.\(2000\)Calculus of variations\.Courier Corporation\.Cited by:[§A\.1](https://arxiv.org/html/2605.24331#A1.SS1.p1.2)\.
- R\. Grosse, J\. Bae, C\. Anil, N\. Elhage, A\. Tamkin, A\. Tajdini, B\. Steiner, D\. Li, E\. Durmus, E\. Perez,et al\.\(2023\)Studying large language model generalization with influence functions\.arXiv preprint arXiv:2308\.03296\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px2),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- J\. Guo, Z\. Wu, H\. Yang, and P\. S\. Yu \(2026\)Mining intrinsic rewards from llm hidden states for efficient best\-of\-n sampling\.InProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 1,pp\. 371–382\.Cited by:[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px2.p1.13)\.
- F\. R\. Hampel \(1974\)The influence curve and its role in robust estimation\.Journal of the american statistical association69\(346\),pp\. 383–393\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- H\. He and W\. J\. Su \(2019\)The local elasticity of neural networks\.International Conference on Learning Representations\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px1.p1.12),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px4.p1.4)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px2.p1.1)\.
- R\. A\. Howard and J\. E\. Matheson \(1972\)Risk\-sensitive markov decision processes\.Management science18\(7\),pp\. 356–369\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- Y\. Huang, Z\. Wen, Y\. Chi, Y\. Wei, A\. Singh, Y\. Liang, and Y\. Chen \(2026\)On the learning dynamics of rlvr at the edge of competence\.arXiv preprint arXiv:2602\.14872\.Cited by:[3rd item](https://arxiv.org/html/2605.24331#S6.I1.i3.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- A\. Jaech, A\. Kalai, A\. Lerer, A\. Richardson, A\. El\-Kishky, A\. Low, A\. Helyar, A\. Madry, A\. Beutel, A\. Carney,et al\.\(2024\)Openai o1 system card\.arXiv preprint arXiv:2412\.16720\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1)\.
- Z\. Kang, X\. Zhao, and D\. Song \(2025\)Scalable best\-of\-n selection for large language models via self\-certainty\.Advances in neural information processing systems38,pp\. 19720–19745\.Cited by:[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px2.p1.13)\.
- S\. Kapturowski, G\. Ostrovski, J\. Quan, R\. Munos, and W\. Dabney \(2018\)Recurrent experience replay in distributed reinforcement learning\.InInternational conference on learning representations,Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- P\. W\. Koh and P\. Liang \(2017\)Understanding black\-box predictions via influence functions\.InInternational conference on machine learning,pp\. 1885–1894\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- H\. Kydlíček \(2025\)Math\-Verify\.Note:[https://github\.com/huggingface/Math\-Verify](https://github.com/huggingface/Math-Verify)version 0\.8\.0Cited by:[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px1.p1.16)\.
- P\. Ladosz, L\. Weng, M\. Kim, and H\. Oh \(2022\)Exploration in deep reinforcement learning: a survey\.Information Fusion85,pp\. 1–22\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3)\.
- T\. Lattimore and C\. Szepesvári \(2020\)Bandit algorithms\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px3.p1.6)\.
- A\. Lewkowycz, A\. Andreassen, D\. Dohan, E\. Dyer, H\. Michalewski, V\. Ramasesh, A\. Slone, C\. Anil, I\. Schlag, and T\. Gutman\-Solo \(2022\)Solving quantitative reasoning problems with language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§C\.1](https://arxiv.org/html/2605.24331#A3.SS1.p1.3),[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px2.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.InInternational Conference on Learning Representations,Cited by:[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px2.p1.13)\.
- T\. P\. Lillicrap, J\. J\. Hunt, A\. Pritzel, N\. Heess, T\. Erez, Y\. Tassa, D\. Silver, and D\. Wierstra \(2015\)Continuous control with deep reinforcement learning\.arXiv preprint arXiv:1509\.02971\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px3.p1.18)\.
- L\. Lin \(1992\)Self\-improving reactive agents based on reinforcement learning, planning and teaching\.Machine learning8\(3\),pp\. 293–321\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- J\. L\. Lions \(1971\)Optimal control of systems governed by partial differential equations\.Vol\.170,Springer\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p1.22),[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- M\. Liu, M\. Zhu, and W\. Zhang \(2022\)Goal\-conditioned reinforcement learning: problems and solutions\.arXiv preprint arXiv:2201\.08299\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- P\. Liu, C\. Shi, and W\. W\. Sun \(2024\)Dual active learning for reinforcement learning from human feedback\.arXiv preprint arXiv:2410\.02504\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3)\.
- Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025\)Understanding r1\-zero\-like training: a critical perspective\.InConference on Language Modeling,Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- A\. R\. Mahmood, H\. Yu, M\. White, and R\. S\. Sutton \(2015\)Emphatic temporal\-difference learning\.arXiv preprint arXiv:1507\.01569\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- A\. Majumdar, S\. Singh, and M\. Pavone \(2017\)Risk\-sensitive inverse reinforcement learning via coherent risk models\.InRobotics: Science and Systems \(RSS\),Vol\.16\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- Y\. Mao, Y\. Qu, Q\. Wang, H\. Zou, and X\. Ji \(2026\)Dynamics\-predictive sampling for active rl finetuning of large reasoning models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[1st item](https://arxiv.org/html/2605.24331#S6.I1.i1.p1.2)\.
- V\. Mehta, B\. Paria, J\. Schneider, S\. Ermon, and W\. Neiswanger \(2022\)An experimental design perspective on model\-based reinforcement learning\.InInternational conference on learning representations,Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- P\. Ménard, O\. D\. Domingues, A\. Jonsson, E\. Kaufmann, E\. Leurent, and M\. Valko \(2021\)Fast active learning for pure exploration in reinforcement learning\.InInternational Conference on Machine Learning,pp\. 7599–7608\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3)\.
- O\. Mihatsch and R\. Neuneier \(2002\)Risk\-sensitive reinforcement learning\.Machine Learning49\(2\-3\),pp\. 267–290\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- T\. Min, H\. Lee, Y\. Kwon, and K\. Lee \(2025\)Understanding impact of human feedback via influence functions\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 27471–27500\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski, S\. Petersen, C\. Beattie, A\. Sadik, I\. Antonoglou, H\. King, D\. Kumaran, D\. Wierstra, S\. Legg, and D\. Hassabis \(2015\)Human\-level control through deep reinforcement learning\.Nature518\(7540\),pp\. 529–533\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px3.p1.18)\.
- Y\. Mroueh \(2025\)Reinforcement learning with verifiable rewards: grpo’s effective loss, dynamics, and success amplification\.arXiv preprint arXiv:2503\.06639\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- S\. Narvekar, B\. Peng, M\. Leonetti, J\. Sinapov, M\. E\. Taylor, and P\. Stone \(2020\)Curriculum learning for reinforcement learning domains: a framework and survey\.Journal of Machine Learning Research21\(181\),pp\. 1–50\.Cited by:[2nd item](https://arxiv.org/html/2605.24331#S6.I1.i2.p1.1)\.
- T\. Olmo, A\. Ettinger, A\. Bertsch, B\. Kuehl, D\. Graham, D\. Heineman, D\. Groeneveld, F\. Brahman, F\. Timbers, H\. Ivison,et al\.\(2025\)Olmo 3\.arXiv preprint arXiv:2512\.13961\.Cited by:[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px1.p1.18)\.
- Y\. Pan, J\. Mei, A\. Farahmand, M\. White, H\. Yao, M\. Rohani, and J\. Luo \(2022\)Understanding and mitigating the limitations of prioritized experience replay\.InUncertainty in Artificial Intelligence,pp\. 1561–1571\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- S\. Parashar, S\. Gui, X\. Li, H\. Ling, S\. Vemuri, B\. Olson, E\. Li, Y\. Zhang, J\. Caverlee, D\. Kalathil,et al\.\(2025\)Curriculum reinforcement learning from easy to hard tasks improves llm reasoning\.arXiv preprint arXiv:2506\.06632\.Cited by:[§D\.1](https://arxiv.org/html/2605.24331#A4.SS1.SSS0.Px1.p1.5),[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[2nd item](https://arxiv.org/html/2605.24331#S6.I1.i2.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- G\. Peyré and M\. Cuturi \(2019\)Computational optimal transport: with applications to data science\.Now Foundations and Trends\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- M\. L\. Puterman \(2014\)Markov decision processes: discrete stochastic dynamic programming\.John Wiley & Sons\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1)\.
- N\. Rajaraman, A\. Huang, M\. Dudik, R\. Schapire, D\. J\. Foster, and A\. Krishnamurthy \(2026\)Learning to reason with curriculum i: provable benefits of autocurriculum\.arXiv preprint arXiv:2603\.18325\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[2nd item](https://arxiv.org/html/2605.24331#S6.I1.i2.p1.1)\.
- E\. M\. Ronchetti and P\. J\. Huber \(2009\)Robust statistics\.John Wiley & Sons Hoboken, NJ, USA\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- T\. Schaul, J\. Quan, I\. Antonoglou, and D\. Silver \(2015\)Prioritized experience replay\.arXiv preprint arXiv:1511\.05952\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- B\. Seed, J\. Chen, T\. Fan, X\. Liu, L\. Liu, Z\. Lin, M\. Wang, C\. Wang, X\. Wei, W\. Xu,et al\.\(2025\)Seed1\.5\-thinking: advancing superb reasoning models with reinforcement learning\.arXiv preprint arXiv:2504\.13914\.Cited by:[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px2.p1.1)\.
- J\. Shao \(1999\)Mathematical statistics\.Springer\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px2),[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px1.p1.16),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\)Hybridflow: a flexible and efficient rlhf framework\.InProceedings of the Twentieth European Conference on Computer Systems,pp\. 1279–1297\.Cited by:[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px1.p1.16)\.
- K\. M\. Smith and M\. P\. Chapman \(2023\)On exponential utility and conditional value\-at\-risk as risk\-averse performance criteria\.IEEE Transactions on Control Systems Technology31\(6\),pp\. 2555–2570\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- W\. J\. Su \(2024\)Envisioning future deep learning theories: some basic concepts and characteristics\.Science China Information Sciences67\(10\),pp\. 203101\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px4.p1.4)\.
- J\. Suk and Y\. Duan \(2026\)On the optimization dynamics of rlvr: gradient gap and step size thresholds\.arXiv\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- R\. S\. Sutton and A\. G\. Barto \(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px1.p1.5),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- R\. S\. Sutton, D\. McAllester, S\. Singh, and Y\. Mansour \(1999\)Policy gradient methods for reinforcement learning with function approximation\.Advances in neural information processing systems12\.Cited by:[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px1.p1.5)\.
- F\. Tajwar, G\. Zeng, Y\. Zhou, Y\. Song, D\. Arora, Y\. Jiang, J\. Schneider, R\. Salakhutdinov, H\. Feng, and A\. Zanette \(2026\)Maximum likelihood reinforcement learning\.arXiv preprint arXiv:2602\.02710\.Cited by:[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px1.p1.18),[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px4.p1.2),[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px5.p1.5),[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px1.p1.6),[§5](https://arxiv.org/html/2605.24331#S5.SS0.SSS0.Px1.p1.16),[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px2.p1.13),[3rd item](https://arxiv.org/html/2605.24331#S6.I1.i3.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- Y\. Tang, L\. Dong, Y\. Hao, Q\. Dong, F\. Wei, and J\. Gu \(2026\)Multiplex thinking: reasoning via token\-wise branch\-and\-merge\.arXiv preprint arXiv:2601\.08808\.Cited by:[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px1.p1.18)\.
- S\. B\. Thrun \(1992\)Efficient exploration in reinforcement learning\.Carnegie Mellon University\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px2.p1.3)\.
- D\. Tsipras, S\. Santurkar, L\. Engstrom, A\. Turner, and A\. Madry \(2019\)Robustness may be at odds with accuracy\.International Conference on Learning Representations\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px4.p1.4)\.
- C\. Villani \(2009\)Optimal transport: old and new\.Vol\.338,Springer\.Cited by:[§3\.2](https://arxiv.org/html/2605.24331#S3.SS2.SSS0.Px2.p2.8)\.
- C\. Walder and D\. Karkhanis \(2025\)Pass@ k policy optimization: solving harder reinforcement learning problems\.Advances in neural information processing systems\.Cited by:[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px3.p1.1),[3rd item](https://arxiv.org/html/2605.24331#S6.I1.i3.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px4.p1.4)\.
- S\. Wang \(1996\)Premium calculation by transforming the layer premium density\.ASTIN Bulletin: The Journal of the IAA26\(1\),pp\. 71–92\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px2.p1.20),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.InInternational Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2605.24331#S5.SS2.SSS0.Px1.p2.2)\.
- Y\. Wang and M\. P\. Chapman \(2022\)Risk\-averse autonomous systems: a brief history and recent developments from the perspective of optimal control\.Artificial Intelligence311,pp\. 103743\.Cited by:[§3\.3](https://arxiv.org/html/2605.24331#S3.SS3.SSS0.Px1.p1.4),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine learning8\(3\),pp\. 229–256\.Cited by:[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px1.p1.5),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px1.p1.7)\.
- F\. Wu, W\. Xuan, X\. Lu, M\. Liu, Y\. Dong, Z\. Harchaoui, and Y\. Choi \(2025\)The invisible leash: why rlvr may or may not escape its origin\.arXiv preprint arXiv:2507\.14843\.Cited by:[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px2.p1.13)\.
- J\. Wu, K\. Huang, J\. Wu, A\. Zhang, X\. Wang, and X\. He \(2026\)Quantile advantage estimation: stabilizing RLVR for LLM reasoning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=WDP5b3mtFV)Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px4.p1.4)\.
- W\. Xiong, C\. Ye, B\. Liao, H\. Dong, X\. Xu, C\. Monz, J\. Bian, N\. Jiang, and T\. Zhang \(2025\)Reinforce\-ada: an adaptive sampling framework under non\-linear rl objectives\.arXiv preprint arXiv:2510\.04996\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px3.p1.2),[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px1.p1.6),[1st item](https://arxiv.org/html/2605.24331#S6.I1.i1.p1.2),[3rd item](https://arxiv.org/html/2605.24331#S6.I1.i3.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- M\. E\. Yaari \(1987\)The dual theory of choice under risk\.Econometrica: Journal of the Econometric Society,pp\. 95–115\.Cited by:[§4\.1](https://arxiv.org/html/2605.24331#S4.SS1.SSS0.Px2.p1.20),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px5.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px3.p1.1)\.
- F\. Yang, Z\. Chen, X\. Wang, X\. Lu, J\. Chai, G\. Yin, W\. Lin, S\. Ma, F\. Zhuang, D\. Wang,et al\.\(2026\)Your group\-relative advantage is biased\.arXiv preprint arXiv:2601\.08521\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)Dapo: an open\-source llm reinforcement learning system at scale\.arXiv preprint arXiv:2503\.14476\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§1](https://arxiv.org/html/2605.24331#S1.p2.1),[1st item](https://arxiv.org/html/2605.24331#S6.I1.i1.p1.2),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang,et al\.\(2025\)Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px1.p1.9),[§5\.1](https://arxiv.org/html/2605.24331#S5.SS1.SSS0.Px2.p1.13),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px1.p1.1)\.
- C\. Zhang, G\. Neubig, and X\. Yue \(2025a\)On the interplay of pre\-training, mid\-training, and rl on reasoning language models\.arXiv preprint arXiv:2512\.07783\.Cited by:[3rd item](https://arxiv.org/html/2605.24331#S6.I1.i3.p1.1)\.
- H\. Zhang, Y\. Yu, J\. Jiao, E\. Xing, L\. El Ghaoui, and M\. Jordan \(2019\)Theoretically principled trade\-off between robustness and accuracy\.InInternational conference on machine learning,pp\. 7472–7482\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px4.p1.4)\.
- K\. Zhang, Y\. Hong, J\. Bao, H\. Jiang, Y\. Song, D\. Hong, and H\. Xiong \(2025b\)Gvpo: group variance policy optimization for large language model post\-training\.Advances in neural information processing systems\.Cited by:[§1](https://arxiv.org/html/2605.24331#S1.p1.1),[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
- X\. Zhang, X\. Yuan, D\. Huang, W\. You, C\. Hu, J\. Ruan, A\. Jian, K\. Chen, and X\. Hu \(2025c\)Revisiting entropy regularization: adaptive coefficient unlocks its potential for llm reinforcement learning\.arXiv preprint arXiv:2510\.10959\.Cited by:[Appendix B](https://arxiv.org/html/2605.24331#A2.SS0.SSS0.Px1.p1.18)\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang,et al\.\(2025\)Group sequence policy optimization\.arXiv preprint arXiv:2507\.18071\.Cited by:[§2](https://arxiv.org/html/2605.24331#S2.SS0.SSS0.Px2.p1.5)\.
- D\. Zhou, L\. Li, and Q\. Gu \(2020\)Neural contextual bandits with ucb\-based exploration\.InInternational conference on machine learning,pp\. 11492–11502\.Cited by:[§3\.1](https://arxiv.org/html/2605.24331#S3.SS1.SSS0.Px3.p1.6)\.
- H\. Zhou, K\. Ye, E\. Xu, J\. Zhu, S\. Gong, and C\. Shi \(2026\)Demystifying group relative policy optimization: its policy gradient is a u\-statistic\.arXiv preprint arXiv:2603\.01162\.Cited by:[§6](https://arxiv.org/html/2605.24331#S6.SS0.SSS0.Px3.p1.2)\.
## Appendix
### Appendix A Theoretical Results
#### A.1 Functional Derivative under Pointwise Utility Reduces to Partial Derivative
If the utility functional $\mathcal{U}_{\theta}$ is a pointwise utility, its functional derivative reduces to an ordinary partial derivative, i.e.,

$$\mathcal{U}_{\theta}=\int_{\mathcal{X}}g\left(p_{\theta}(x)\right)d_{0}(x)\,dx\;\Rightarrow\;\frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)=g^{\prime}\left(p_{\theta}(x)\right)=\frac{\partial g\left(p_{\theta}(x)\right)}{\partial p_{\theta}(x)}.$$

Although this is a classical result in the calculus of variations [Gelfand et al., [2000](https://arxiv.org/html/2605.24331#bib.bib114)] and partial differential equations [Evans, [2022](https://arxiv.org/html/2605.24331#bib.bib91)], we provide a short proof for completeness, since the remaining theoretical results in our paper all rely on it.
###### Proof\.
First, the perturbed utility functional is

$$\mathcal{U}_{\theta}\left(p_{\theta}+\epsilon h\right)=\int_{\mathcal{X}}g\left(p_{\theta}(x)+\epsilon h(x)\right)d_{0}(x)\,dx.$$

By a Taylor expansion of $g$ at $t=p_{\theta}(x)$, we have

$$g\left(p_{\theta}(x)+\epsilon h(x)\right)=g\left(p_{\theta}(x)\right)+\epsilon\cdot g^{\prime}\left(p_{\theta}(x)\right)\cdot h(x)+r_{\epsilon}(x),$$

where the remainder term satisfies $r_{\epsilon}(x)/\epsilon\rightarrow 0$ pointwise as $\epsilon\rightarrow 0$. Therefore, the utility difference is

$$\mathcal{U}_{\theta}\left(p_{\theta}+\epsilon h\right)-\mathcal{U}_{\theta}\left(p_{\theta}\right)=\epsilon\int_{\mathcal{X}}g^{\prime}\left(p_{\theta}(x)\right)h(x)d_{0}(x)\,dx+\int_{\mathcal{X}}r_{\epsilon}(x)d_{0}(x)\,dx.$$

Next, under regularity conditions that permit dominated convergence, the remainder term vanishes after division by $\epsilon$, so that

$$\begin{aligned}\frac{\mathcal{U}_{\theta}\left(p_{\theta}+\epsilon h\right)-\mathcal{U}_{\theta}\left(p_{\theta}\right)}{\epsilon}&=\int_{\mathcal{X}}g^{\prime}\left(p_{\theta}(x)\right)h(x)d_{0}(x)\,dx+\int_{\mathcal{X}}\frac{r_{\epsilon}(x)}{\epsilon}d_{0}(x)\,dx\\&\rightarrow\int_{\mathcal{X}}g^{\prime}\left(p_{\theta}(x)\right)h(x)d_{0}(x)\,dx\quad\text{as}\ \epsilon\rightarrow 0.\end{aligned}$$

By Definition [1](https://arxiv.org/html/2605.24331#Thmdefinition1) and the uniqueness of the Riesz representation, we obtain

$$\int_{\mathcal{X}}\frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)\cdot h(x)d_{0}(x)\,dx=\int_{\mathcal{X}}g^{\prime}\left(p_{\theta}(x)\right)\cdot h(x)d_{0}(x)\,dx,\quad\forall h\in L^{2}(d_{0}).$$

Consequently, $\frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)=g^{\prime}\left(p_{\theta}(x)\right)$ for $d_{0}$-almost every $x$.
∎
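As an illustrative sanity check of the result above (not part of the original derivation), the following minimal Python sketch compares a finite-difference directional derivative of a discretized pointwise utility with the closed form $\sum_i g'(p_i)h_i d_{0,i}$. The function names, the four-prompt discretization, and the choice $g=\log$ are all hypothetical assumptions made only for this example.

```python
import math

def utility(p, g, d0):
    """Discretized pointwise utility: sum_i g(p_i) * d0_i over a finite prompt set."""
    return sum(g(pi) * di for pi, di in zip(p, d0))

def directional_derivative(p, h, g, d0, eps=1e-6):
    """Central finite difference of the utility along the perturbation direction h."""
    plus = [pi + eps * hi for pi, hi in zip(p, h)]
    minus = [pi - eps * hi for pi, hi in zip(p, h)]
    return (utility(plus, g, d0) - utility(minus, g, d0)) / (2 * eps)

def predicted_derivative(p, h, g_prime, d0):
    """Closed form suggested by A.1: sum_i g'(p_i) * h_i * d0_i."""
    return sum(g_prime(pi) * hi * di for pi, hi, di in zip(p, h, d0))

# Hypothetical example: g = log with four equally weighted prompts.
p = [0.2, 0.5, 0.7, 0.9]           # pass rates (assumed values)
d0 = [0.25, 0.25, 0.25, 0.25]      # prompt distribution (assumed uniform)
h = [1.0, -0.5, 0.3, 0.0]          # arbitrary perturbation direction

numeric = directional_derivative(p, h, math.log, d0)
closed = predicted_derivative(p, h, lambda t: 1.0 / t, d0)
print(numeric, closed)
```

The two printed values should agree up to finite-difference error, which is what the identification of the functional derivative with $g'(p_\theta(x))$ predicts in this discretized setting.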
#### A.2 Proof of Proposition [1](https://arxiv.org/html/2605.24331#Thmproposition1)
Proposition [1](https://arxiv.org/html/2605.24331#Thmproposition1) (Entropic Risk RL Interpolates Classical RL and MaxRL). Assume $p_{\theta}(x)>0$ for $d_{0}$-almost every $x$ and $\frac{\left\|\nabla_{\theta}p_{\theta}(x)\right\|}{p_{\theta}(x)}\in L^{1}\left(d_{0}\right)$. Then

$$\lim_{\eta\rightarrow 0^{+}}\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\nabla_{\theta}J_{\mathrm{RL}}(\theta),\qquad\lim_{\eta\rightarrow\infty}\eta\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\nabla_{\theta}J_{\mathrm{ML}}(\theta).\tag{22}$$
###### Proof\.
Since the reward $r(x,y)$ is a Bernoulli random variable, we have

$$\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}e^{\eta r(x,y)}=\left(1-p_{\theta}(x)\right)e^{0}+p_{\theta}(x)e^{\eta}=1-p_{\theta}(x)+p_{\theta}(x)e^{\eta}.$$

Therefore,

$$\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\mathbb{E}_{x\sim d_{0}}\left[\frac{1}{\eta}\log\left(1-p_{\theta}(x)+p_{\theta}(x)e^{\eta}\right)\right]:=\mathbb{E}_{x\sim d_{0}}\left[h_{\eta}(p_{\theta}(x))\right].$$

Denoting the weight function by $w_{\theta}^{\eta}(x)=h_{\eta}^{\prime}(p_{\theta}(x))$, the chain rule gives

$$\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\mathbb{E}_{x\sim d_{0}}\left[w_{\theta}^{\eta}(x)\nabla_{\theta}p_{\theta}(x)\right]=\mathbb{E}_{x\sim d_{0}}\left[\frac{e^{\eta}-1}{\eta\left[1+\left(e^{\eta}-1\right)p_{\theta}(x)\right]}\nabla_{\theta}p_{\theta}(x)\right].$$

Since $\eta>0$, the weight function $w_{\theta}^{\eta}(x)$ above is always positive, so $h_{\eta}$ is an increasing function of $p_{\theta}(x)$, which leads to a meaningful utility function. Next, we discuss the two limiting behaviors of $\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)$.
##### Case 1: $\eta\rightarrow 0^{+}$.
By Taylor expansion, $e^{\eta}=1+\eta+o(\eta)$ as $\eta\rightarrow 0^{+}$. Consequently, we have the pointwise limit

$$\frac{e^{\eta}-1}{\eta\left[1+\left(e^{\eta}-1\right)p_{\theta}(x)\right]}=\frac{\eta+o(\eta)}{\eta\left[1+\left(\eta+o(\eta)\right)p_{\theta}(x)\right]}=\frac{1+o(1)}{1+\eta p_{\theta}(x)+o(\eta)p_{\theta}(x)}\rightarrow 1\quad(\eta\rightarrow 0^{+}).$$

Since $w_{\theta}^{\eta}(x)\to 1$ and is uniformly bounded, the limit can be exchanged with the expectation. This implies that $\lim_{\eta\rightarrow 0^{+}}\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\mathbb{E}_{x\sim d_{0}}\left[1\cdot\nabla_{\theta}p_{\theta}(x)\right]=\nabla_{\theta}J_{\mathrm{RL}}(\theta)$.
##### Case 2: $\eta\rightarrow+\infty$.
Write $a_{\eta}\sim b_{\eta}$ as $\eta\rightarrow+\infty$ if $\frac{a_{\eta}}{b_{\eta}}\rightarrow 1$. First, for each prompt $x$, we have the pointwise limit

$$\eta w_{\theta}^{\eta}(x)=\frac{e^{\eta}-1}{1+\left(e^{\eta}-1\right)p_{\theta}(x)}=\frac{1}{\frac{1}{e^{\eta}-1}+p_{\theta}(x)}\rightarrow\frac{1}{p_{\theta}(x)}\quad(\eta\rightarrow+\infty).$$

This implies $w_{\theta}^{\eta}(x)\sim\frac{1}{\eta p_{\theta}(x)}$ as $\eta\rightarrow+\infty$. However, pointwise convergence alone does not ensure that the limit of the integral equals the integral of the limit, i.e., $\lim\mathbb{E}=\mathbb{E}\lim$; this additionally requires mild conditions, namely those of the dominated convergence theorem [Evans, [2025](https://arxiv.org/html/2605.24331#bib.bib88)]. Since $\eta w_{\theta}^{\eta}(x)=\frac{1}{\frac{1}{e^{\eta}-1}+p_{\theta}(x)}\leq\frac{1}{p_{\theta}(x)}$, the assumption in Proposition [1](https://arxiv.org/html/2605.24331#Thmproposition1) gives

$$\left\|\eta w_{\theta}^{\eta}(x)\nabla_{\theta}p_{\theta}(x)\right\|\leq\frac{\left\|\nabla_{\theta}p_{\theta}(x)\right\|}{p_{\theta}(x)}\in L^{1}\left(d_{0}\right),$$

where $g\in L^{1}\left(d_{0}\right)$ means that $\mathbb{E}_{x\sim d_{0}}[|g(x)|]<\infty$. In other words, the family of functions indexed by $\eta$ is dominated by a single integrable function. Therefore, the dominated convergence theorem [Evans, [2025](https://arxiv.org/html/2605.24331#bib.bib88)] yields

$$\lim_{\eta\rightarrow\infty}\eta\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)=\lim_{\eta\rightarrow\infty}\mathbb{E}_{x\sim d_{0}}\left[\eta w_{\theta}^{\eta}(x)\nabla_{\theta}p_{\theta}(x)\right]=\mathbb{E}_{x\sim d_{0}}\left[\frac{1}{p_{\theta}(x)}\nabla_{\theta}p_{\theta}(x)\right]=\nabla_{\theta}J_{\mathrm{ML}}(\theta).$$

∎
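The two limits of the entropic-risk weight can also be checked numerically. The sketch below is only an illustration under assumed pass rates; the helper name `entropic_weight` and the specific values of $\eta$ are hypothetical choices, not part of the paper's implementation.

```python
import math

def entropic_weight(p, eta):
    """Weight from the proof: (e^eta - 1) / (eta * (1 + (e^eta - 1) * p))."""
    em1 = math.expm1(eta)  # e^eta - 1, numerically accurate for small eta
    return em1 / (eta * (1.0 + em1 * p))

pass_rates = [0.05, 0.3, 0.6, 0.95]  # assumed pass rates for illustration

# Small eta: the weight approaches 1 for every pass rate (classical RL limit).
small = [entropic_weight(p, 1e-6) for p in pass_rates]

# Large eta: eta * weight approaches 1 / p (MaxRL limit).
eta_large = 50.0
large = [eta_large * entropic_weight(p, eta_large) for p in pass_rates]

for p, ws, wl in zip(pass_rates, small, large):
    print(f"p={p:.2f}  w(eta->0)={ws:.6f}  eta*w(eta large)={wl:.6f}  1/p={1.0 / p:.6f}")
```

In this sketch the rescaling by $\eta$ in the second limit matters: without it, the weight itself decays to zero as $\eta$ grows, which is why Eq. (22) states the limit for $\eta\nabla_{\theta}\mathcal{U}_{\theta}^{\mathrm{risk}}(\eta)$.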
#### A.3 Proof of Proposition [2](https://arxiv.org/html/2605.24331#Thmproposition2)
Proposition [2](https://arxiv.org/html/2605.24331#Thmproposition2). Let $W_{1}$ denote the 1-Wasserstein distance. If $\psi$ is $L_{\psi}$-Lipschitz and $\|f_{\theta}\|_{\infty}<\infty$, then

$$\left|\mathcal{U}_{\theta}(F_{\mathrm{ref}})-\mathcal{U}_{\theta}(F_{\theta})\right|\leq L_{\psi}\|f_{\theta}\|_{\infty}W_{1}(\mu_{\mathrm{ref}},\mu_{\theta}).\tag{23}$$
###### Proof\.
Since $\psi$ is $L_{\psi}$-Lipschitz, we have $\left|\psi(a)-\psi(b)\right|\leq L_{\psi}\left|a-b\right|$. The condition $\|f_{\theta}\|_{\infty}<\infty$ means that the density has a bounded essential supremum, i.e., $f_{\theta}(t)\leq\|f_{\theta}\|_{\infty}$ almost everywhere (a.e.), so the inequality holds except on a set of measure zero. Recall the definition of the 1-Wasserstein distance $W_{1}(\mu,\nu)$ between two measures $\mu$ and $\nu$:

$$W_{1}(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\int\|x-y\|_{1}\,d\gamma(x,y)=\int_{\mathbb{R}}\left|F_{\mu}(t)-F_{\nu}(t)\right|dt,$$

where $\gamma$ is a coupling (transport plan) and $\Pi(\mu,\nu)$ is the set of joint distributions with marginals $\mu$ and $\nu$. The right-hand side is the equivalent CDF form of the 1-Wasserstein distance in one dimension, which is pivotal for our upper bound. We also write down the two utility functionals:

$$\begin{aligned}\mathcal{U}_{\theta}(F_{\mathrm{ref}})&=\mathbb{E}_{x\sim d_{0}}\left[\psi\left(F_{\mathrm{ref}}\left(p_{\theta}(x)\right)\right)\right]=\int\psi(F_{\mathrm{ref}}(t))\,d\mu_{\theta}(t),\\\mathcal{U}_{\theta}(F_{\theta})&=\mathbb{E}_{x\sim d_{0}}\left[\psi\left(F_{\theta}\left(p_{\theta}(x)\right)\right)\right]=\int\psi(F_{\theta}(t))\,d\mu_{\theta}(t)=\int_{0}^{1}\psi(z)\,dz=\text{const}.\end{aligned}$$

Putting everything together, we derive

$$\begin{aligned}\left|\mathcal{U}_{\theta}(F_{\mathrm{ref}})-\mathcal{U}_{\theta}(F_{\theta})\right|&=\left|\int\left[\psi(F_{\mathrm{ref}}(t))-\psi(F_{\theta}(t))\right]d\mu_{\theta}(t)\right|\\&\leq\int\left|\psi(F_{\mathrm{ref}}(t))-\psi(F_{\theta}(t))\right|d\mu_{\theta}(t)\\&\overset{(a)}{\leq}L_{\psi}\int\left|F_{\mathrm{ref}}(t)-F_{\theta}(t)\right|d\mu_{\theta}(t)\\&=L_{\psi}\int\left|F_{\mathrm{ref}}(t)-F_{\theta}(t)\right|f_{\theta}(t)\,dt\\&\overset{(b)}{\leq}L_{\psi}\|f_{\theta}\|_{\infty}\int\left|F_{\mathrm{ref}}(t)-F_{\theta}(t)\right|dt\\&\overset{(c)}{=}L_{\psi}\|f_{\theta}\|_{\infty}W_{1}(\mu_{\mathrm{ref}},\mu_{\theta}),\end{aligned}$$

where $(a)$ holds by the $L_{\psi}$-Lipschitz condition on $\psi$, $(b)$ holds by $\|f_{\theta}\|_{\infty}<\infty$, and $(c)$ follows from the CDF form of the 1-Wasserstein distance.
##### Interpretation\.
The upper bound in Proposition [2](https://arxiv.org/html/2605.24331#Thmproposition2) strengthens the connection between our distribution-aware utility and the distribution- or geometry-aware Wasserstein distance. Specifically, the bound shows that the discrepancy between the two utility functionals is controlled by the Wasserstein distance between the reference and current pass-rate distributions. In particular, it formally justifies calling our method distribution-aware: the utility depends on the global structure of the pass-rate distribution rather than on pointwise values.
∎
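As an illustrative numerical check of this bound, the sketch below assumes two hypothetical pass-rate distributions, $\mu_{\theta}=\mathrm{Beta}(2,2)$ and $\mu_{\mathrm{ref}}=\mathrm{Beta}(2,3)$, and the distortion $\psi(u)=u^{2}$, which is $2$-Lipschitz on $[0,1]$. None of these choices come from the paper; they are picked only because their CDFs have simple closed forms.

```python
# Hypothetical setup (not from the paper): mu_theta = Beta(2,2), mu_ref = Beta(2,3).
def f_theta(t):
    """Density of Beta(2,2)."""
    return 6.0 * t * (1.0 - t)

def F_theta(t):
    """CDF of Beta(2,2)."""
    return 3.0 * t**2 - 2.0 * t**3

def F_ref(t):
    """CDF of Beta(2,3)."""
    return 6.0 * t**2 - 8.0 * t**3 + 3.0 * t**4

def psi(u):
    """Assumed distortion; 2-Lipschitz on [0, 1]."""
    return u * u

def integrate(fn, n=20000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / n
    return h * sum(fn((i + 0.5) * h) for i in range(n))

L_PSI = 2.0   # Lipschitz constant of psi on [0, 1]
F_SUP = 1.5   # sup of the Beta(2,2) density, attained at t = 0.5

u_ref = integrate(lambda t: psi(F_ref(t)) * f_theta(t))      # U_theta(F_ref)
u_self = integrate(lambda t: psi(F_theta(t)) * f_theta(t))   # U_theta(F_theta)
w1 = integrate(lambda t: abs(F_ref(t) - F_theta(t)))         # W_1 via CDFs

gap = abs(u_ref - u_self)
bound = L_PSI * F_SUP * w1
print(f"gap={gap:.4f}  bound={bound:.4f}")
```

In this toy setting $\mathcal{U}_{\theta}(F_{\theta})$ equals $\int_{0}^{1}z^{2}\,dz=1/3$, matching the constant in the derivation, and the printed gap should not exceed the printed bound.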
#### A.4 Proof of Theorem [1](https://arxiv.org/html/2605.24331#Thmtheorem1)
Theorem [1](https://arxiv.org/html/2605.24331#Thmtheorem1) (Pointwise Weight Induces a Prior $F_{\mathrm{ref}}$). Denote $p=p_{\theta}(x)$ and the pointwise weight $w_{\theta}(x)=g'(p_{\theta}(x)):=w(p)$ in the pass-rate space. Assume $\int_{p}^{1}w(t)\,dt<\infty$ for any $p\in(0,1]$. Under the distortion $\psi(u)=\log u$ in Eq. ([19](https://arxiv.org/html/2605.24331#S4.E19)), $w(p)=f_{\mathrm{ref}}(p)/F_{\mathrm{ref}}(p)$ admits a unique $F_{\mathrm{ref}}$:
$$
F_{\mathrm{ref}}(p)=\exp\left(-\int_{p}^{1}w(t)\,dt\right).
$$
###### Proof\.
Recall that $w_{\theta}:(0,1]\rightarrow\mathbb{R}_{\geq 0}$ is a non-negative weight function that satisfies
$$
\int_{p}^{1}w(t)\,dt<\infty
$$
for any $p\in(0,1]$. Connecting the existing pointwise weight $w_{\theta}(\cdot)$ with the distribution-aware weight $\frac{f_{\mathrm{ref}}(p)}{F_{\mathrm{ref}}(p)}$ requires
$$
w(p)=\frac{f_{\mathrm{ref}}(p)}{F_{\mathrm{ref}}(p)}.
$$
##### Example 1: REINFORCE with $w(p)=1$.
This requires $(\log F_{\mathrm{ref}}(p))'=1$, so $\log F_{\mathrm{ref}}(p)=p+C$, i.e.,
$$
F_{\mathrm{ref}}(p)=e^{p+C}.
$$
Setting $F_{\mathrm{ref}}(1)=1$ gives $C=-1$ and $f_{\mathrm{ref}}(p)=e^{p-1}$. Notably, there is a point mass at $p=0$, since $F_{\mathrm{ref}}(0)=e^{-1}$. This corresponds to a reflected and truncated exponential distribution. Let $Z\sim\mathrm{Exp}(1)$ with density $p_{Z}(z)=e^{-z}$ for $z\in[0,+\infty)$, and define $Y=1-Z$, so that $Y\in(-\infty,1]$. Then $P(Y\leq y)=P(1-Z\leq y)=1-F_{Z}(1-y)=e^{y-1}$. Furthermore, moving all the mass on $(-\infty,0)$ to the point $y=0$ yields the sub-distribution $F_{\mathrm{ref}}(p)=e^{p-1},\ p\in[0,1]$, with the sub-density $f_{\mathrm{ref}}(p)=e^{p-1}$.
##### Example 2: GRPO with $w(p)=1/\sqrt{p(1-p)}$.
This requires $\left(\log F_{\mathrm{ref}}\right)'(p)=1/\sqrt{p(1-p)}$, implying
$$
\log F_{\mathrm{ref}}(p)=\int\frac{dp}{\sqrt{p(1-p)}}=2\arcsin(\sqrt{p})+C\ \Rightarrow\ F_{\mathrm{ref}}(p)=\exp\left(2\arcsin(\sqrt{p})+C\right).
$$
Since $\arcsin(1)=\frac{\pi}{2}$, imposing $F_{\mathrm{ref}}(1)=1$ gives $C=-\pi$. Consequently, we have
$$
F_{\mathrm{ref}}(p)=\exp\left(2\arcsin(\sqrt{p})-\pi\right),\quad f_{\mathrm{ref}}(p)=\frac{\exp\left(2\arcsin(\sqrt{p})-\pi\right)}{\sqrt{p(1-p)}}.
$$
Note that this is not a common distribution; it is a normalized arcsine distribution after an exponential transformation.
##### Example 3: MaxRL with $w(p)=1/p$.
This requires $\left(\log F_{\mathrm{ref}}\right)'(p)=1/p$, which implies $\log F_{\mathrm{ref}}(p)=\log p+C$ and $F_{\mathrm{ref}}(p)=\exp(\log p+C)$. Setting $F_{\mathrm{ref}}(1)=1$ gives $C=0$. Therefore, $F_{\mathrm{ref}}(p)=p$ and $f_{\mathrm{ref}}(p)=1$, which is the uniform distribution on $[0,1]$. We illustrate the density and CDF in Figure [4](https://arxiv.org/html/2605.24331#A1.F4).
Figure 4: $f_{\mathrm{ref}}$ and $F_{\mathrm{ref}}$ for REINFORCE, GRPO, and MaxRL.
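The closed forms in Examples 1 to 3 can be sanity-checked against the general formula $F_{\mathrm{ref}}(p)=\exp\left(-\int_{p}^{1}w(t)\,dt\right)$ from the theorem statement. The following is a minimal sketch using a plain midpoint rule; the helper name `induced_cdf` and the grid size are our own illustrative choices.

```python
import math

def induced_cdf(w, p, n=100000):
    """F_ref(p) = exp(-int_p^1 w(t) dt), integral approximated by a midpoint rule.

    The midpoint rule never evaluates w at t = 1, so the integrable
    singularity of the GRPO weight there is tolerated (convergence is slow).
    """
    h = (1.0 - p) / n
    integral = h * sum(w(p + (i + 0.5) * h) for i in range(n))
    return math.exp(-integral)

weights = {
    "REINFORCE": lambda t: 1.0,
    "GRPO": lambda t: 1.0 / math.sqrt(t * (1.0 - t)),
    "MaxRL": lambda t: 1.0 / t,
}
closed_forms = {
    "REINFORCE": lambda p: math.exp(p - 1.0),
    "GRPO": lambda p: math.exp(2.0 * math.asin(math.sqrt(p)) - math.pi),
    "MaxRL": lambda p: p,
}

for name in weights:
    for p in (0.25, 0.5, 0.75):
        numeric = induced_cdf(weights[name], p)
        exact = closed_forms[name](p)
        print(f"{name:9s} p={p:.2f} numeric={numeric:.4f} closed-form={exact:.4f}")
```

The numeric and closed-form columns should agree to a few decimal places; the GRPO row is the least accurate because of the endpoint singularity.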
##### General Case\.
In general, the key connecting equation is
$$
\int_{p}^{1}w(t)\,dt=\int_{p}^{1}\frac{f_{\mathrm{ref}}(t)}{F_{\mathrm{ref}}(t)}\,dt=\log F_{\mathrm{ref}}(1)-\log F_{\mathrm{ref}}(p)=-\log F_{\mathrm{ref}}(p).
$$
This implies that
$$
F_{\mathrm{ref}}(p)=\exp\left(-\int_{p}^{1}w(t)\,dt\right).
$$
Since $w$ is non-negative, $F_{\mathrm{ref}}$ is a non-decreasing function, i.e., $F_{\mathrm{ref}}(p_{1})\leq F_{\mathrm{ref}}(p_{2})$ if $p_{1}<p_{2}$, and $F_{\mathrm{ref}}(1)=1$. Therefore, $F_{\mathrm{ref}}$ is a valid CDF (possibly with a point mass at $p=0$, as in Example 1). Since $\frac{d}{dp}\left(-\int_{p}^{1}w(t)\,dt\right)=w(p)$, the chain rule gives
$$
f_{\mathrm{ref}}(p)=\frac{d}{dp}\left(\exp\left(-\int_{p}^{1}w(t)\,dt\right)\right)=F_{\mathrm{ref}}(p)\,w(p),
$$
which is exactly the relation we set out to prove.
##### Uniqueness\.
Assume $\tilde{F}$ also satisfies $\frac{\tilde{f}(p)}{\tilde{F}(p)}=w(p)$. This implies that
$$
\frac{d}{dp}\log\tilde{F}(p)=w(p).
$$
Integrating both sides over $[p,1]$ and using $\tilde{F}(1)=1$, we derive
$$
\log\tilde{F}(1)-\log\tilde{F}(p)=\int_{p}^{1}w(t)\,dt\ \Rightarrow\ \log\tilde{F}(p)=-\int_{p}^{1}w(t)\,dt\ \Rightarrow\ \tilde{F}(p)=\exp\left(-\int_{p}^{1}w(t)\,dt\right).
$$
Therefore, $\tilde{F}=F_{\mathrm{ref}}$ and $F_{\mathrm{ref}}$ is unique.
##### Remark\.
The induced $F_{\mathrm{ref}}$ depends only on the pointwise weight $w(p)$ evaluated at the pass-rate value $p$ in the pass-rate space; it is independent of the prompt $x\sim d_{0}$ and the policy parameter $\theta$.
∎
#### A.5 Monotone Calibration Invariance Property
For any increasing function $G:[0,1]\rightarrow[0,1]$ applied to the absolute value of the pass rate $p_{\theta}(x)$, the gradient of our distribution-aware method keeps the same form. We call this the monotone calibration invariance property and state it in Proposition [3](https://arxiv.org/html/2605.24331#Thmproposition3):
###### Proposition 3\(Monotone Calibration Invariance\)\.
Assume $F_{\mathrm{ref}}=F_{\theta_{\text{old}}}$ and denote $\tilde{p}_{\theta}(x)=G(p_{\theta}(x))$, i.e., $\tilde{p}_{\theta}=G\circ p_{\theta}$. Denote the induced reference CDF and density by $\tilde{F}_{\mathrm{ref}}$ and $\tilde{f}_{\mathrm{ref}}$. Then
$$
\nabla_{\theta}J_{\text{curve}}\left(\theta;p_{\theta},F_{\mathrm{ref}}\right)=\nabla_{\theta}J_{\text{curve}}\left(\theta;\tilde{p}_{\theta},\tilde{F}_{\mathrm{ref}}\right).
$$
###### Proof\.
After the transformation to $\tilde{p}_{\theta}$ (assuming $G$ is differentiable and strictly increasing, so that $G^{-1}$ exists), we have
$$
\tilde{F}_{\mathrm{ref}}(u)=\tilde{F}_{\theta_{\text{old}}}(u)=\mathbb{P}_{x\sim d_{0}}\left(\tilde{p}_{\theta_{\text{old}}}(x)\leq u\right)=\mathbb{P}_{x\sim d_{0}}\left(p_{\theta_{\text{old}}}(x)\leq G^{-1}(u)\right)=F_{\mathrm{ref}}(G^{-1}(u)).
$$
Then, we can derive
$$
\tilde{f}_{\mathrm{ref}}(u)=\frac{d\tilde{F}_{\mathrm{ref}}(u)}{du}=f_{\mathrm{ref}}(G^{-1}(u))\frac{dG^{-1}(u)}{du}.
$$
By the inverse function rule with $u=G(v)$, we know $(G^{-1}(u))'=\frac{1}{G'(v)}$. Therefore, we have
$$
\tilde{f}_{\mathrm{ref}}(u)=f_{\mathrm{ref}}(G^{-1}(u))\frac{dG^{-1}(u)}{du}=f_{\mathrm{ref}}(G^{-1}(u))\frac{1}{G'(G^{-1}(u))}.
$$
Recall that the gradients before and after the transformation have the following forms:
$$
\begin{aligned}
\nabla_{\theta}J_{\text{curve}}(\theta;p_{\theta},F_{\mathrm{ref}}) &= \mathbb{E}_{x\sim d_{0}}\left[\frac{f_{\mathrm{ref}}\left(p_{\theta}(x)\right)}{F_{\mathrm{ref}}\left(p_{\theta}(x)\right)}\nabla_{\theta}p_{\theta}(x)\right], \\
\nabla_{\theta}J_{\text{curve}}(\theta;\tilde{p}_{\theta},\tilde{F}_{\mathrm{ref}}) &= \mathbb{E}_{x\sim d_{0}}\left[\frac{\tilde{f}_{\mathrm{ref}}\left(\tilde{p}_{\theta}(x)\right)}{\tilde{F}_{\mathrm{ref}}\left(\tilde{p}_{\theta}(x)\right)}\nabla_{\theta}\tilde{p}_{\theta}(x)\right].
\end{aligned}
$$
Next, by the chain rule, $\nabla_{\theta}\tilde{p}_{\theta}(x)=G'(p_{\theta}(x))\nabla_{\theta}p_{\theta}(x)$. Putting everything together, we obtain the following key result:
$$
\begin{aligned}
\frac{\tilde{f}_{\mathrm{ref}}\left(\tilde{p}_{\theta}(x)\right)}{\tilde{F}_{\mathrm{ref}}\left(\tilde{p}_{\theta}(x)\right)}\nabla_{\theta}\tilde{p}_{\theta}(x)
&= \frac{f_{\mathrm{ref}}(G^{-1}(\tilde{p}_{\theta}(x)))\frac{1}{G'(G^{-1}(\tilde{p}_{\theta}(x)))}}{F_{\mathrm{ref}}(G^{-1}(\tilde{p}_{\theta}(x)))}\nabla_{\theta}\tilde{p}_{\theta}(x) \\
&= \frac{f_{\mathrm{ref}}(G^{-1}(G(p_{\theta}(x))))\frac{1}{G'(G^{-1}(G(p_{\theta}(x))))}}{F_{\mathrm{ref}}(G^{-1}(G(p_{\theta}(x))))}\nabla_{\theta}\tilde{p}_{\theta}(x) \\
&= \frac{f_{\mathrm{ref}}(p_{\theta}(x))\frac{1}{G'(p_{\theta}(x))}}{F_{\mathrm{ref}}(p_{\theta}(x))}G'(p_{\theta}(x))\nabla_{\theta}p_{\theta}(x) \\
&= \frac{f_{\mathrm{ref}}(p_{\theta}(x))}{F_{\mathrm{ref}}(p_{\theta}(x))}\nabla_{\theta}p_{\theta}(x).
\end{aligned}
$$
This implies that
$$
\begin{aligned}
\nabla_{\theta}J_{\text{curve}}(\theta;\tilde{p}_{\theta},\tilde{F}_{\mathrm{ref}})
&= \mathbb{E}_{x\sim d_{0}}\left[\frac{\tilde{f}_{\mathrm{ref}}\left(\tilde{p}_{\theta}(x)\right)}{\tilde{F}_{\mathrm{ref}}\left(\tilde{p}_{\theta}(x)\right)}\nabla_{\theta}\tilde{p}_{\theta}(x)\right] \\
&= \mathbb{E}_{x\sim d_{0}}\left[\frac{f_{\mathrm{ref}}\left(p_{\theta}(x)\right)}{F_{\mathrm{ref}}\left(p_{\theta}(x)\right)}\nabla_{\theta}p_{\theta}(x)\right]=\nabla_{\theta}J_{\text{curve}}\left(\theta;p_{\theta},F_{\mathrm{ref}}\right).
\end{aligned}
$$
∎
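The invariance can be illustrated on a toy case. The sketch below assumes a hypothetical reference CDF $F_{\mathrm{ref}}(p)=p^{2}$ (density $2p$) and a hypothetical calibration map $G(t)=t^{2}$; neither is taken from the paper. It checks that the coefficient multiplying $\nabla_{\theta}p_{\theta}(x)$ is unchanged, and that empirical ranks (quantile coordinates) of a sample are preserved under $G$.

```python
import random

def G(t):
    """Hypothetical monotone calibration map."""
    return t * t

def G_prime(t):
    return 2.0 * t

def weight(p):
    """f_ref(p) / F_ref(p) for the assumed F_ref(p) = p^2, f_ref(p) = 2p."""
    return (2.0 * p) / (p * p)

def weight_tilde(u):
    """Transformed weight: F~(u) = F_ref(G^{-1}(u)) = u, so f~(u) = 1."""
    return 1.0 / u

def coeff_original(p):
    """Coefficient multiplying grad p_theta(x) before calibration."""
    return weight(p)

def coeff_transformed(p):
    """Transformed weight at G(p), times the chain-rule factor G'(p)."""
    return weight_tilde(G(p)) * G_prime(p)

def empirical_cdf(sample, value):
    return sum(s <= value for s in sample) / len(sample)

random.seed(0)
ref_sample = [random.random() for _ in range(1000)]
ref_sample_tilde = [G(s) for s in ref_sample]

for p in (0.1, 0.4, 0.9):
    same_rank = empirical_cdf(ref_sample, p) == empirical_cdf(ref_sample_tilde, G(p))
    print(p, coeff_original(p), coeff_transformed(p), same_rank)
```

Both coefficient columns should agree up to floating-point rounding, since each equals $2/p$ in this toy setting.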
For the pointwise objective function, this monotone calibration invariance property does not hold\. Immediately, we have the following corollary\.
###### Corollary 2\.
Denote $J_{g}(\theta;p_{\theta})=\mathbb{E}_{x\sim d_{0}}\left[g\left(p_{\theta}(x)\right)\right]$ and $J_{g}(\theta;\tilde{p}_{\theta})=\mathbb{E}_{x\sim d_{0}}\left[g\left(\tilde{p}_{\theta}(x)\right)\right]$ with $\tilde{p}_{\theta}=G\circ p_{\theta}$. There exists some $g$ such that
$$
\nabla_{\theta}J_{g}(\theta;p_{\theta})\neq\nabla_{\theta}J_{g}(\theta;\tilde{p}_{\theta}).
$$
###### Proof\.
First, we have
$$
\nabla_{\theta}J_{g}\left(\theta;\tilde{p}_{\theta}\right)=\mathbb{E}_{x\sim d_{0}}\left[g'\left(\tilde{p}_{\theta}(x)\right)\nabla_{\theta}\tilde{p}_{\theta}(x)\right]=\mathbb{E}_{x\sim d_{0}}\left[g'\left(G\left(p_{\theta}(x)\right)\right)\cdot G'\left(p_{\theta}(x)\right)\cdot\nabla_{\theta}p_{\theta}(x)\right].
$$
If the two gradients were equal, i.e.,
$$
\nabla_{\theta}J_{g}(\theta)=\mathbb{E}_{x\sim d_{0}}\left[g'\left(p_{\theta}(x)\right)\nabla_{\theta}p_{\theta}(x)\right]=\mathbb{E}_{x\sim d_{0}}\left[g'\left(G\left(p_{\theta}(x)\right)\right)\cdot G'\left(p_{\theta}(x)\right)\cdot\nabla_{\theta}p_{\theta}(x)\right],
$$
this would imply the following functional equation for all $p_{\theta}(x)$ and $G$:
$$
g'\left(p_{\theta}(x)\right)=g'\left(G\left(p_{\theta}(x)\right)\right)G'\left(p_{\theta}(x)\right).
$$
A counterexample is $g'(t)=\frac{1}{t}$ (as in MaxRL) with $G(t)=t^{2}$, for which
$$
\frac{1}{p_{\theta}(x)}\neq\frac{1}{p_{\theta}(x)^{2}}\cdot 2p_{\theta}(x)=\frac{2}{p_{\theta}(x)}.
$$
In summary, our distribution-aware prompt reweighting enjoys monotone calibration invariance and is thus robust to miscalibration of the absolute values of pass rates. By contrast, pointwise prompt reweighting does not satisfy this property in general. ∎
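The counterexample can be reproduced numerically. This sketch assumes a toy one-parameter pass-rate model $p_{\theta}=\sigma(\theta)$ (our own illustrative choice) and uses $g=\log$, so that $g'(t)=1/t$, together with $G(t)=t^{2}$. The gradient of the pointwise objective is estimated by central finite differences before and after calibration.

```python
import math

def pass_rate(theta):
    """Toy one-parameter pass-rate model (an assumption for illustration)."""
    return 1.0 / (1.0 + math.exp(-theta))

def G(t):
    """Calibration map from the counterexample."""
    return t * t

def J_pointwise(theta, calibrate):
    """Pointwise objective with g = log, i.e. g'(t) = 1/t."""
    p = pass_rate(theta)
    return math.log(G(p) if calibrate else p)

def grad(fn, theta, eps=1e-6):
    """Central finite difference."""
    return (fn(theta + eps) - fn(theta - eps)) / (2.0 * eps)

theta = 0.3
g_raw = grad(lambda t: J_pointwise(t, False), theta)
g_cal = grad(lambda t: J_pointwise(t, True), theta)
print(g_raw, g_cal, g_cal / g_raw)
```

The printed ratio should be close to 2, matching $\frac{2}{p_{\theta}(x)}$ versus $\frac{1}{p_{\theta}(x)}$: the pointwise gradient is rescaled by the calibration map rather than left invariant.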
### Appendix B Implementation Details
##### More Details of Experimental Setup.
Built on the verl framework, our training applies a fully on-policy policy gradient update, meaning that there is no importance ratio or associated clipping. We also disable both the KL penalty and the entropy bonus. As such, the effect of the prompt-weighting function is not entangled with auxiliary regularizers, which avoids complex tuning. This practice is also adopted in prior work on RLVR for LLM reasoning [Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7), Olmo et al., [2025](https://arxiv.org/html/2605.24331#bib.bib51), Tang et al., [2026](https://arxiv.org/html/2605.24331#bib.bib53), Zhang et al., [2025c](https://arxiv.org/html/2605.24331#bib.bib52)]. During training across $1000$ RL steps, we use temperature $1.0$ and learning rate $1\times 10^{-6}$. The maximum prompt length is capped at $1024$ tokens and the maximum response length at $4096$ tokens for both Qwen3-1.7B-Base and Qwen3-4B-Base. In evaluation, we use $\mathrm{pass@}k$ [Chen et al., [2021](https://arxiv.org/html/2605.24331#bib.bib127)] as the primary metric, with $k\in\{1,2,4,\dots,1024\}$ for Qwen3-1.7B-Base and $k\in\{1,2,4,\dots,512\}$ for Qwen3-4B-Base. $\mathrm{pass@}k$ measures the probability that at least one of $k$ independently sampled responses is correct, serving as a proxy for a model’s exploration capability under a fixed sampling budget. In practice, $\mathrm{pass@}1$ is the raw mean accuracy, while $\mathrm{pass@}k$ for $k\geq 2$ uses $1000$ best-of-$k$ bootstrap resamples (with replacement) of size $k$, averaged across prompts. We sample $2048$ rollouts per prompt for Qwen3-1.7B-Base and $1024$ for Qwen3-4B-Base to reduce estimation variance.
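The bootstrap $\mathrm{pass@}k$ procedure described above can be sketched as follows. This is one plausible reading of the description, not the authors' evaluation code: the function names, the fixed seed, and the tiny 8-rollout example are illustrative assumptions.

```python
import random

def pass_at_k(correct, k, n_boot=1000, seed=0):
    """Bootstrap pass@k for one prompt from a list of 0/1 rollout outcomes."""
    if k == 1:
        return sum(correct) / len(correct)      # raw mean accuracy
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        draw = rng.choices(correct, k=k)        # size-k resample, with replacement
        hits += 1 if any(draw) else 0           # best-of-k: success if any is correct
    return hits / n_boot

def mean_pass_at_k(per_prompt_outcomes, k):
    """Average the per-prompt estimates across prompts."""
    estimates = [pass_at_k(outcomes, k) for outcomes in per_prompt_outcomes]
    return sum(estimates) / len(estimates)

# Hypothetical outcomes: 3 prompts with 8 rollouts each (the paper samples far more).
outcomes = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
for k in (1, 2, 4):
    print(k, mean_pass_at_k(outcomes, k))
```

Because resampling is with replacement, the per-prompt estimate for a prompt with $c$ correct rollouts out of $n$ concentrates around $1-(1-c/n)^{k}$ as the number of bootstrap resamples grows.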
##### RL Stack and Computational Devices\.
All experiments are run within the verl post-training framework [Sheng et al., [2025](https://arxiv.org/html/2605.24331#bib.bib73)], which couples a policy-gradient trainer with a vLLM-backed rollout engine. Using a single inference stack for both training rollouts and downstream evaluation removes a common source of distribution shift between the two stages. Both backbones are optimized with the AdamW optimizer using its default first- and second-moment coefficients, under bfloat16 mixed precision and FSDP-style sharding of parameters, gradients, and optimizer states. Each experiment runs on a single node of $8\times$ NVIDIA B200 GPUs interconnected by NVLink.
Qwen-math prompt template
```
<|im_start|>system
Please reason step by step and put the final answer in \boxed{}.
<|im_end|>
<|im_start|>user
{problem statement} Let’s think step by step and put the final answer within \boxed{}.
<|im_end|>
<|im_start|>assistant
```
##### Prompt Template\.
Every training and evaluation prompt is rendered with the Qwen-math chat template [Yang et al., [2025](https://arxiv.org/html/2605.24331#bib.bib71)]. We append the instruction “Let’s think step by step and put the final answer within \boxed{}.” to the problem statement and route the result through the tokenizer’s chat template, yielding inputs of the form shown in the Qwen-math prompt template box. At inference time, the verifier extracts the final boxed expression from the generated chain of thought and grades it with Math-Verify.
##### On\-policy Update\.
Each RL step generates a fresh batch of rollouts from the current policy and consumes it with a single optimizer update. Therefore, the data-generating policy and the target policy coincide. The token-level importance ratio is identically one along the entire response,
$$
\rho_{i,t}(\theta)=\frac{\pi_{\theta}(y_{i,t}\mid x,\,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid x,\,y_{i,<t})}\equiv 1,
$$
which implies that the PPO clipping range $[1-\epsilon,\,1+\epsilon]$ is never active and the threshold $\epsilon$ has no effect on the optimization trajectory. The KL penalty and the entropy bonus are also disabled, isolating the contribution of the prompt-weighting rule from auxiliary regularizers and matching the simplified objective adopted by [Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7)]. This supports the concise formulations in the preliminaries of [Section 2](https://arxiv.org/html/2605.24331#S2).
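To see why the clipping range is inactive at ratio one, the following sketch evaluates a standard PPO-style clipped surrogate for a single token. The function name and the value $\epsilon=0.2$ are illustrative assumptions; the paper does not state a clipping threshold, precisely because it has no effect here.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-token PPO-style clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Fully on-policy: the ratio is 1, so the surrogate equals the advantage for any eps.
on_policy = [clipped_surrogate(1.0, adv) for adv in (-1.5, 0.0, 2.0)]

# Off-policy contrast (not the paper's setting): a ratio outside the range is clipped.
off_policy = clipped_surrogate(1.5, 2.0)

print(on_policy, off_policy)
```

With ratio one the clip is the identity, so the objective reduces to the plain policy-gradient form used in the preliminaries.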
##### Evaluation Decoding\.
Following [Tajwar et al., [2026](https://arxiv.org/html/2605.24331#bib.bib7)], evaluation generations are sampled at temperature $0.6$ with top-$p$ $0.95$, and both top-$k$ and min-$p$ truncation are disabled. We do not employ adaptive sampling and do not apply any logit correction for the mismatch between inference and training runtimes. Reported metrics are computed on the final checkpoint of each run.
Table 4: Training hyperparameters shared by both Qwen3 backbones.
##### Hyperparameter Summary.
[Table 4](https://arxiv.org/html/2605.24331#A2.T4) consolidates the training-side hyperparameters, which are shared across both Qwen3-1.7B-Base and Qwen3-4B-Base.
### Appendix C Supplemental Experimental Results
This appendix contains experimental results that are not shown in the main text due to space limits. The training setup, evaluation protocol, and bootstrap procedure are identical to those described in [Section 5](https://arxiv.org/html/2605.24331#S5).
#### C.1 Pass@1 and Pass@64 on Three Additional Benchmarks
Table 5: Supplemental results on three additional math reasoning benchmarks. pass@1 and pass@64 (%), best per column in bold.

[Table 5](https://arxiv.org/html/2605.24331#A3.T5) reports pass@$1$ and pass@$64$ on three additional benchmarks: BRUMO 2025 [Balunović et al., [2025](https://arxiv.org/html/2605.24331#bib.bib128)], HMMT 11/25 [Balunović et al., [2025](https://arxiv.org/html/2605.24331#bib.bib128)], and Minerva Math [Lewkowycz et al., [2022](https://arxiv.org/html/2605.24331#bib.bib85)]. The qualitative picture mirrors [Table 1](https://arxiv.org/html/2605.24331#S5.T1): CurveRL matches or exceeds MaxRL at pass@$64$ on all benchmarks across both model sizes (except BRUMO 2025 under Qwen3-4B-Base).
#### C.2 Pass@$k$ Curves on the Additional Benchmarks
Figure 5: Pass@$k$ scaling on the additional three benchmarks. Top row: Qwen3-1.7B-Base, $k\in\{1,\dots,1024\}$. Bottom row: Qwen3-4B-Base, $k\in\{1,\dots,512\}$. CurveRL tracks or dominates the GRPO and MaxRL baselines on most benchmarks.
#### C.3 Additional Difficulty Distribution
Figure 6: Qwen3-1.7B-Base post-training prompt-difficulty distribution on the remaining six benchmarks. Across benchmarks, CurveRL’s *unsolvable* fraction is mostly no higher than that of the strongest pointwise-weighted baseline, while the *hard*/*medium* mass is preserved or enlarged.

Figure 7: Qwen3-4B-Base post-training prompt-difficulty distribution on all eight benchmarks. CurveRL’s *unsolvable* fraction tracks or improves on the strongest pointwise-weighted baseline at the 4B scale as well, mirroring the 1.7B trend in [Figure 6](https://arxiv.org/html/2605.24331#A3.F6).
#### C.4 Learning Signals Across Training Dynamics
[Figure 8](https://arxiv.org/html/2605.24331#A3.F8) tracks the fraction of training prompts with strictly positive empirical pass rate $\hat{p}$ at each RL step on Qwen3-1.7B-Base and Qwen3-4B-Base. Larger values indicate a richer pool of prompts that provide non-vanishing gradient signals. While all three methods start from a similar pool, CurveRL and MaxRL remain consistently above GRPO throughout training, with the gap widening after the early phase. This suggests that they slow down the collapse of prompts into the zero-gradient $\hat{p}=0$ region, preserving useful learning signals for more steps and broadening the reasoning boundary.
Figure 8: Fraction of prompts where the model generates at least one correct rollout out of 8 samples.
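The quantity plotted in Figure 8 can be computed from a batch of rollout outcomes as in the sketch below. The batch is made up for illustration, and `prob_no_signal` additionally assumes rollouts are independent draws with a fixed true pass rate.

```python
def signal_fraction(rollout_outcomes):
    """Fraction of prompts with at least one correct rollout (empirical pass rate > 0)."""
    with_signal = sum(1 for outcomes in rollout_outcomes if any(outcomes))
    return with_signal / len(rollout_outcomes)

def prob_no_signal(p, n_rollouts=8):
    """Chance that all n independent rollouts fail when the true pass rate is p."""
    return (1.0 - p) ** n_rollouts

# Hypothetical batch: 4 prompts, 8 rollouts each (1 = correct, 0 = incorrect).
batch = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(signal_fraction(batch))
print(prob_no_signal(0.05))
```

Under the independence assumption, a prompt with a low true pass rate often yields $\hat{p}=0$ with only 8 rollouts, which is why this fraction is a useful proxy for how many prompts still contribute gradient signal.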
#### C.5 Supplemental Results on Sensitivity Analysis
Figure 9: Pass@$k$ scaling for the three sliding-window sizes. Five benchmarks from [Table 3](https://arxiv.org/html/2605.24331#S5.T3); each panel overlays CurveRL with $t_{0}\in\{1,10,50\}$ batches on Qwen3-1.7B-Base. The default $t_{0}=10$ tracks or improves on both alternatives at every $k$ on most benchmarks.
#### C.6 Stable Training Dynamics
[Figure 10](https://arxiv.org/html/2605.24331#A3.F10) reports additional training dynamics on Qwen3-1.7B-Base (top row) and Qwen3-4B-Base (bottom row), including mean response length, policy entropy, and gradient norm. CurveRL generally produces longer chains of thought and maintains higher actor entropy, supporting greater reasoning diversity and a wider capability boundary, as shown in [Section 5](https://arxiv.org/html/2605.24331#S5). CurveRL also exhibits a flatter gradient-norm profile, indicating more stable training.
Figure 10: Additional training dynamics metrics for Qwen3-1.7B-Base (top row) and Qwen3-4B-Base (bottom row). Mean response length, policy entropy, and gradient norm over RL steps for three RLVR algorithms.
#### C.7 Validation Accuracy During Training
[Figures 11](https://arxiv.org/html/2605.24331#A3.F11) and [12](https://arxiv.org/html/2605.24331#A3.F12) report the validation accuracy on MATH-500 during training. CurveRL generally outperforms the other baselines on both pass@$1$ and pass@$k$ during training.
Figure 11: Qwen3-1.7B-Base validation accuracy during training.

Figure 12: Qwen3-4B-Base validation accuracy during training.
#### C.8 Distribution-Aware Weighting on Qwen3-1.7B-Base
The results in [Figure 13](https://arxiv.org/html/2605.24331#A3.F13) on Qwen3-1.7B-Base are similar to those in [Figure 3](https://arxiv.org/html/2605.24331#S5.F3) on Qwen3-4B-Base. The qualitative pattern matches the 4B case in the main text: the empirical pass-rate density $\hat{f}_{\mathrm{ref}}$ drifts toward higher pass-rate bins as training proceeds, while the data-driven weight $w_{t}=\hat{f}_{\mathrm{ref}}/\hat{F}_{\mathrm{ref}}$ continues to emphasize the low-pass-rate prompts. The static GRPO and MaxRL weights are model-size invariant and therefore identical to the 4B panels.
Figure 13: Weighting schemes on Qwen3-1.7B-Base. Same layout and scale as [Figure 3](https://arxiv.org/html/2605.24331#S5.F3). Only CurveRL’s panels (left two) reflect 1.7B training, since the GRPO and MaxRL weights are static.
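A data-driven weight of the form $w_{t}=\hat{f}_{\mathrm{ref}}/\hat{F}_{\mathrm{ref}}$ could be estimated from a window of empirical pass rates roughly as follows. This is a minimal sketch under our own assumptions: $\hat{f}_{\mathrm{ref}}$ is taken as the empirical probability mass on the grid $\{0,1/8,\dots,1\}$ and $\hat{F}_{\mathrm{ref}}$ as its running sum. The estimator in the released code may differ, for example in smoothing or bin-width normalization.

```python
from collections import Counter

def empirical_weights(pass_rates, n_rollouts=8):
    """Return {grid value: f_hat / F_hat} on the grid {0, 1/n, ..., 1}.

    Assumption: f_hat is the empirical mass of each grid value and F_hat is
    the cumulative mass up to and including that value.
    """
    counts = Counter(round(p * n_rollouts) for p in pass_rates)
    total = len(pass_rates)
    weights, cumulative = {}, 0.0
    for j in range(n_rollouts + 1):
        mass = counts.get(j, 0) / total
        cumulative += mass
        if mass > 0:
            weights[j / n_rollouts] = mass / cumulative
    return weights

# Hypothetical window of empirical pass rates, skewed toward hard prompts.
window = [0.0] * 5 + [0.125] * 3 + [0.5] * 1 + [1.0] * 1
for value, w in empirical_weights(window).items():
    print(value, round(w, 3))
```

In this toy window the weight decreases as the pass rate increases, consistent with the qualitative statement that the data-driven weight emphasizes low-pass-rate prompts.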
### Appendix D More Discussions
#### D.1 Curriculum Learning under Our Framework
##### Curriculum Learning is a Time-Varying Pointwise Prompt Reweighting Method.
The curriculum learning strategy in RLVR, such as [Parashar et al., [2025](https://arxiv.org/html/2605.24331#bib.bib11)], schedules data from easy to hard prompts, thereby facilitating policy improvement. Within our framework in Definition [1](https://arxiv.org/html/2605.24331#Thmdefinition1), it effectively changes the measure from a fixed $d_{0}$ to a manually designed, time-varying measure $d_{0}^{(t)}$, which is orthogonal to the policy-dependent prompt reweighting $w_{\theta}(x)$. The utility function $\mathcal{U}_{\theta}$ based on the pointwise objective $J_{g}(\theta)$ in Eq. ([9](https://arxiv.org/html/2605.24331#S3.E9)) becomes a time-varying version:
$$
\mathcal{U}_{\theta}^{(t)}=\mathbb{E}_{x\sim d_{0}^{(t)}}\left[g\left(p_{\theta}(x)\right)\right]=\mathbb{E}_{x\sim d_{0}}\left[\frac{d_{0}^{(t)}(x)}{d_{0}(x)}g\left(p_{\theta}(x)\right)\right].
$$
By the functional derivative defined in Definition [1](https://arxiv.org/html/2605.24331#Thmdefinition1), for any perturbation $h\in L^{2}(d_{0}^{(t)})$, according to the Riesz representation, the first variation of $\mathcal{U}^{(t)}_{\theta}$ is given by
$$
\lim_{\epsilon\to 0}\frac{\mathcal{U}^{(t)}_{\theta}(p_{\theta}+\epsilon h)-\mathcal{U}^{(t)}_{\theta}(p_{\theta})}{\epsilon}=\int w^{*}_{\theta}(x)\,h(x)\,d^{(t)}_{0}(x)\,dx=\int w^{*}_{\theta}(x)\,\frac{d^{(t)}_{0}(x)}{d_{0}(x)}\,h(x)\,d_{0}(x)\,dx,
$$
where $\frac{d_{0}^{(t)}(x)}{d_{0}(x)}$ is the Radon-Nikodym derivative. Immediately, we can derive a time-varying optimal weight function for the pointwise objective $J_{g}(\theta)$ as
$$
w_{\theta}^{*}(x,t)=\frac{\delta\mathcal{U}_{\theta}^{(t)}}{\delta p_{\theta}}(x)=\frac{d_{0}^{(t)}(x)}{d_{0}(x)}g'(p_{\theta}(x)).
$$
This illuminates the role of curriculum learning in RLVR: it incorporates a pointwise time-varying prompt reweighting by manually designing $d_{0}^{(t)}$ in a curriculum fashion, on top of the policy-dependent prompt reweighting function $w_{\theta}(x)$.
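The time-varying weight $w_{\theta}^{*}(x,t)=\frac{d_{0}^{(t)}(x)}{d_{0}(x)}g'(p_{\theta}(x))$ can be made concrete on a tiny discrete prompt set. All numbers and names below are hypothetical and serve only to illustrate how the curriculum factor multiplies the policy-dependent factor; $g'(p)=1/p$ is used as the pointwise weight.

```python
def curriculum_weight(d0_t, d0, pass_rate, g_prime):
    """w*(x, t) = (d0_t(x) / d0(x)) * g'(p_theta(x)) for each prompt x."""
    return {
        x: (d0_t[x] / d0[x]) * g_prime(pass_rate[x])
        for x in d0
    }

# Hypothetical three-prompt dataset.
d0 = {"easy": 1 / 3, "medium": 1 / 3, "hard": 1 / 3}      # base prompt measure
d0_early = {"easy": 0.6, "medium": 0.3, "hard": 0.1}      # early curriculum stage
pass_rate = {"easy": 0.8, "medium": 0.4, "hard": 0.05}    # current pass rates

def maxrl_weight(p):
    """Pointwise weight g'(p) = 1/p."""
    return 1.0 / p

weights = curriculum_weight(d0_early, d0, pass_rate, maxrl_weight)
for x, w in weights.items():
    print(x, round(w, 3))
```

When $d_{0}^{(t)}=d_{0}$ the density ratio is one and the weight reduces to the purely policy-dependent $g'(p_{\theta}(x))$.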
##### Similarity and Difference Between Curriculum Learning and Our Distribution\-aware Method\.
Given the aforementioned mechanism of curriculum learning in RLVR, a natural question is why curriculum learning and our distribution-aware method are both effective in RLVR, and what fundamentally distinguishes them. We argue that both methods use prompt reweighting to reshape the learning dynamics of RLVR so that training focuses on prompts near the learning frontier, i.e., in the current learnable window at each training step, although they do so in different ways:
- The easy-to-hard schedule employed in curriculum learning incorporates a non-stationary sample allocation mechanism that helps policy optimization *chase* the moving learnable window. However, to the best of our knowledge, the endogenous easy-to-hard schedule is typically designed using prior knowledge and is based on the absolute values of pass rates.
- By introducing the quantile coordinate transform, i.e., quantile reparameterization, our distribution-aware method flattens the difficulty distribution into a nearly uniform and stationary spectrum, transforming the moving learnable window into a stationary, fixed one. This strategy implicitly yet adaptively helps the policy concentrate on prompts at the learning frontier, while also exploiting the distributional information of pass rates in the pass-rate function space.
#### D.2 When Gradients of Distribution-aware and Pointwise Utility Functionals Equal
###### Proof.
Under the same distortion function, the distribution-aware and pointwise utility functions are
$$
\mathcal{U}_{\theta}(F_{\mathrm{ref}})=\mathbb{E}_{x\sim d_{0}}\left[\psi\left(F_{\mathrm{ref}}\left(p_{\theta}(x)\right)\right)\right],\quad \mathcal{U}_{\theta}:=J_{\psi}(\theta)=\mathbb{E}_{x\sim d_{0}}\left[\psi\left(p_{\theta}(x)\right)\right],
$$
with their respective gradients:
$$
\begin{aligned}
\nabla_{\theta}\mathcal{U}_{\theta}(F_{\mathrm{ref}}) &= \mathbb{E}_{x\sim d_{0}}\left[\psi'\left(F_{\mathrm{ref}}\left(p_{\theta}(x)\right)\right)f_{\mathrm{ref}}\left(p_{\theta}(x)\right)\nabla_{\theta}p_{\theta}(x)\right], \\
\nabla_{\theta}J_{\psi}(\theta) &= \mathbb{E}_{x\sim d_{0}}\left[\psi'\left(p_{\theta}(x)\right)\nabla_{\theta}p_{\theta}(x)\right].
\end{aligned}
$$
Denote $p=p_{\theta}(x)$. For a general increasing function $\psi$, the equivalence of the two gradients implies
$$
\psi'(p)=\psi'(F_{\mathrm{ref}}(p))f_{\mathrm{ref}}(p)\ \Rightarrow\ \frac{d}{dp}\psi\left(F_{\mathrm{ref}}(p)\right)=\frac{d}{dp}\psi(p).
$$
Integrating both sides from $p$ to $1$, we have
$$
\psi(F_{\mathrm{ref}}(1))-\psi(F_{\mathrm{ref}}(p))=\psi(1)-\psi(p).
$$
Since $F_{\mathrm{ref}}(1)=1$, this implies that
$$
\psi(F_{\mathrm{ref}}(p))=\psi(p),
$$
which holds for each $p$. As $\psi$ is an increasing function, we can derive
$$
F_{\mathrm{ref}}(p)=p,
$$
which implies that $F_{\mathrm{ref}}$ is $\mathrm{Uniform}(0,1)$ for the pass-rate distribution of $p_{\theta}(x)$.
Conversely, assume $F_{\mathrm{ref}}$ is $\mathrm{Uniform}(0,1)$, which immediately implies $\psi(F_{\mathrm{ref}}(p))=\psi(p)$. Consequently, the distribution-aware utility function reduces to the pointwise one, and so does the prompt weighting function.
##### Remark\.
This connection with the uniform distribution reveals a fundamental source of the difference between distribution-aware prompt reweighting and its pointwise counterpart: the mismatch between model capability and data difficulty. Specifically, in general learning dynamics, the pass-rate distribution can be concentrated around 0 if the model struggles to achieve high pass rates for most prompts, especially in the early training phase. This is when the *weight collapse* issue occurs and distribution-aware prompt reweighting differs from the pointwise counterpart. For example, if we increase the model size on a fixed training dataset, the advantage may diminish as the pass-rate distribution becomes more spread out and closer to a uniform distribution. However, this does not happen in general learning dynamics. Therefore, the advantage of distribution-aware prompt reweighting over the pointwise version fundamentally depends on the capability-difficulty match between the model and the dataset. ∎
#### D.3 Sufficient Conditions for Prompt Weight Comparison and Risk Preference
Under the same risk measure $\psi$, we define the pointwise and distribution-aware prompt weights by

$$w_{0}(p)=\psi'(p),\qquad w_{F}(p)=\psi'\left(F_{\mathrm{ref}}(p)\right)f_{\mathrm{ref}}(p).$$

Following the analysis in Appendix [D.2](https://arxiv.org/html/2605.24331#A4.SS2), we want to know when $w_{0}(p)>w_{F}(p)$ for the pass rates of interest, e.g., low pass rates in early training. Answering this question reveals a fundamental relationship between model capability and data difficulty.
Figure 14: The probability density function, CDF, and weight function in terms of the pass rate $p$ for MaxRL and the two considered examples. The blue line shows a more aggressive weight update on the low-pass-rate prompts, while the red line suggests a more conservative update relative to MaxRL.

##### For $\psi(t)=\log(t)$.

In this case, $w_{0}(p)=\frac{1}{p}$ and $w_{F}(p)=\frac{f_{\mathrm{ref}}(p)}{F_{\mathrm{ref}}(p)}$. As the two weights have different magnitudes, comparing them requires normalization. For $N$ rollouts, the pass rate satisfies $p\in\{0,\frac{1}{N},\ldots,\frac{N}{N}\}$. Denote $p_{i}=\frac{i}{N}$; the normalized weights are then

$$\bar{w}_{0}(p_{i})=\frac{w_{0}(p_{i})}{\sum_{j=0}^{N}w_{0}(p_{j})},\qquad \bar{w}_{F}(p_{i})=\frac{w_{F}(p_{i})}{\sum_{j=0}^{N}w_{F}(p_{j})}.$$

We now analyze when the distribution-aware weight $\bar{w}_{F}$ is more aggressive than the pointwise weight $\bar{w}_{0}$. We first define:

$$r(p)=\frac{w_{F}(p)}{w_{0}(p)}=\frac{p\,f_{\mathrm{ref}}(p)}{F_{\mathrm{ref}}(p)}=\frac{d\log F_{\mathrm{ref}}(p)}{d\log p}=\frac{f_{\mathrm{ref}}(p)}{\frac{1}{p}\int_{0}^{p}f_{\mathrm{ref}}(t)\,dt}.$$

Given any $p_{l}<p_{u}$, $\bar{w}_{F}$ being more aggressive means that

$$\frac{\bar{w}_{F}(p_{l})}{\bar{w}_{F}(p_{u})}>\frac{\bar{w}_{0}(p_{l})}{\bar{w}_{0}(p_{u})}\ \Leftrightarrow\ \frac{w_{F}(p_{l})}{w_{F}(p_{u})}>\frac{w_{0}(p_{l})}{w_{0}(p_{u})}\ \Leftrightarrow\ r(p_{l})>r(p_{u}).$$

Therefore, *the sufficient condition* is

$$r'(p)<0$$

in the considered range of $p$, which guarantees $r(p_{l})>r(p_{u})$ whenever $p_{l}<p_{u}$. In the boundary case $r'(p)=0$, we have:

$$\frac{d\log F_{\mathrm{ref}}(p)}{d\log p}=\alpha\ \Rightarrow\ \log F_{\mathrm{ref}}(p)=\alpha\log p+C,$$

where $\alpha$ is a constant. Imposing $F_{\mathrm{ref}}(1)=1$ gives $C=0$ and

$$F_{\mathrm{ref}}(p)=p^{\alpha},\qquad f_{\mathrm{ref}}(p)=\alpha p^{\alpha-1},\qquad w_{F}(p)=\frac{\alpha}{p},$$

which is a power function. In our case, $\alpha=1$. This result indicates that if $r(p)$ is a decreasing function of $p$, i.e., $w_{F}(p)$ decreases faster than $w_{0}(p)=\frac{1}{p}$, we will observe a more aggressive prompt reweighting on the low-pass-rate prompts than MaxRL.
For example, consider a random variable $Z$ with a truncated exponential distribution on $[0,1]$:

$$f_{Z}(p)=\frac{\lambda e^{-\lambda p}}{1-e^{-\lambda}},\qquad F_{Z}(p)=\frac{1-e^{-\lambda p}}{1-e^{-\lambda}},\qquad p\in[0,1].$$

If $F_{\mathrm{ref}}$ follows the truncated exponential distribution above, we have

$$w_{F}(p)=\frac{\lambda e^{-\lambda p}}{1-e^{-\lambda p}}=\frac{\lambda}{e^{\lambda p}-1},\qquad r(p)=\frac{\lambda p}{e^{\lambda p}-1}\ \Rightarrow\ r'(p)=\lambda\,\frac{e^{\lambda p}-1-\lambda pe^{\lambda p}}{\left(e^{\lambda p}-1\right)^{2}}<0,$$

as we can easily prove that $g(z)=e^{z}-1-ze^{z}$ with $z=\lambda p>0$ is a decreasing function (its derivative $g'(z)=-ze^{z}$ is negative), so that $g(z)<g(0)=0$. Another example is the random variable $Y$ that mirrors $Z$ on $[0,1]$, i.e., $Y=1-Z$. Therefore, we have

$$f_{Y}(p)=\frac{\lambda e^{\lambda p}}{e^{\lambda}-1},\qquad F_{Y}(p)=\frac{e^{\lambda p}-1}{e^{\lambda}-1},\qquad p\in[0,1].$$

Similarly, this distribution implies that

$$w_{F}(p)=\frac{\lambda}{1-e^{-\lambda p}},\qquad r(p)=\frac{\lambda p}{1-e^{-\lambda p}}\ \Rightarrow\ r'(p)=\lambda\,\frac{1-e^{-\lambda p}-\lambda pe^{-\lambda p}}{\left(1-e^{-\lambda p}\right)^{2}}>0,$$

which indicates that this strategy is more conservative than MaxRL, with $w_{F}(p)$ decreasing more slowly than $\frac{1}{p}$. Figure [14](https://arxiv.org/html/2605.24331#A4.F14) shows the density function, CDF, and weight functions in terms of the pass rate $p$ for MaxRL and the two considered examples, which differ in how aggressively they weight the gradient update.
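The monotonicity of $r(p)$ in the two examples above can be checked numerically. The following is a minimal sketch, not part of the released code: the rate `LAM`, the rollout count `N_ROLLOUTS`, and all function names are illustrative assumptions. It evaluates $r(p)$ on the rollout grid $p_i = i/N$ (skipping $p=0$, where $1/p$ is undefined) and compares the low-to-high weight ratio against MaxRL.

```python
import math


def r_truncated_exp(p: float, lam: float) -> float:
    """r(p) = p * f_Z(p) / F_Z(p) = lam*p / (exp(lam*p) - 1)."""
    return lam * p / math.expm1(lam * p)


def r_mirrored(p: float, lam: float) -> float:
    """r(p) = p * f_Y(p) / F_Y(p) = lam*p / (1 - exp(-lam*p)), with Y = 1 - Z."""
    return lam * p / (-math.expm1(-lam * p))


def relative_low_to_high_ratio(r, grid, lam):
    """[w_F(p_l)/w_F(p_u)] / [w_0(p_l)/w_0(p_u)], which simplifies to r(p_l)/r(p_u)."""
    return r(grid[0], lam) / r(grid[-1], lam)


LAM = 5.0          # illustrative rate parameter (assumption)
N_ROLLOUTS = 16    # illustrative number of rollouts (assumption)
GRID = [i / N_ROLLOUTS for i in range(1, N_ROLLOUTS + 1)]  # p = 1/N, ..., 1

r_z = [r_truncated_exp(p, LAM) for p in GRID]
r_y = [r_mirrored(p, LAM) for p in GRID]

# A decreasing r means a more aggressive weight on low pass rates than 1/p.
print("Z: r decreasing:", all(a > b for a, b in zip(r_z, r_z[1:])))
print("Y: r increasing:", all(a < b for a, b in zip(r_y, r_y[1:])))
print("Z ratio vs MaxRL:", relative_low_to_high_ratio(r_truncated_exp, GRID, LAM))
print("Y ratio vs MaxRL:", relative_low_to_high_ratio(r_mirrored, GRID, LAM))
```

A ratio above 1 corresponds to the more aggressive case and a ratio below 1 to the more conservative case, since normalization constants cancel in the comparison.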
Empirically, we compare the dynamics of normalized weights among GRPO, MaxRL, and CurveRL in [Figure 15](https://arxiv.org/html/2605.24331#A4.F15). The comparison suggests that CurveRL behaves more aggressively than MaxRL: it puts larger weights on low-pass-rate prompts, and its weights decrease faster as $p$ increases. We argue that the current model capability is still insufficient for the POLARIS-53K dataset, so that a more risk-seeking utility function is preferable, as guided in principle by the relationship between data difficulty and the model's capability. This result demonstrates that our method is data-driven, with an adaptive weight function in contrast to the static ones in GRPO and MaxRL. Conversely, a conservative weight should be observed when we deploy a larger model on an easier dataset. We leave the empirical investigation of this case as future work.
Figure 15: Dynamics of normalized weights among GRPO, MaxRL, and CurveRL on Qwen3-4B-Base. (a) Normalized weight dynamics on six steps. (b) Combined normalized weight dynamics.
##### For the General $\psi$.
Our sufficient condition can be extended to a general risk function $\psi$, where the relevant quantity is

$$R_{\psi}(p):=\frac{w_{F}(p)}{w_{0}(p)}=\frac{\psi'\left(F_{\mathrm{ref}}(p)\right)f_{\mathrm{ref}}(p)}{\psi'(p)}.$$

If $R_{\psi}'(p)<0$, our distribution-aware weight function $w_{F}(p)$ is more aggressive on the low-pass-rate prompts than the pointwise counterpart in the learning dynamics.
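As a short worked illustration of the general condition, consider the identity utility $\psi(t)=t$. This choice is ours for illustration only and is not one of the cases analyzed above. Since $\psi'\equiv 1$, the ratio reduces to the reference density itself:

$$\begin{aligned}
R_{\psi}(p)&=\frac{\psi'\left(F_{\mathrm{ref}}(p)\right)f_{\mathrm{ref}}(p)}{\psi'(p)}=f_{\mathrm{ref}}(p),\\
R_{\psi}'(p)&=f_{\mathrm{ref}}'(p).
\end{aligned}$$

Under this assumption, $R_{\psi}'(p)<0$ holds exactly where the pass-rate density is decreasing. For the truncated exponential $f_{Z}(p)=\frac{\lambda e^{-\lambda p}}{1-e^{-\lambda}}$ with $\lambda>0$, we get $f_{Z}'(p)=-\frac{\lambda^{2}e^{-\lambda p}}{1-e^{-\lambda}}<0$, so the distribution-aware weight is again the more aggressive one on low-pass-rate prompts, in agreement with the $\psi(t)=\log(t)$ case.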
##### Relationship between Model Capacity and Data Difficulty\.
A key observation is that our distribution-aware weight is data-driven and depends on both the model $\pi_{\theta}$ and the dataset. The mismatch between model capability and data difficulty fundamentally determines the dynamics of the distribution-aware weights, including whether the weight is more aggressive than the pointwise version. In particular, if a simpler model is trained on a hard dataset, the pass rates of more prompts concentrate around 0, leading to a more aggressive weight on the low-pass-rate prompts. On the other hand, if a large model is trained on an easy dataset, the pass-rate distribution is more spread out and tends to be more uniform. In that case, our distribution-aware weight likely behaves similarly to the pointwise counterpart, e.g., MaxRL, or even more conservatively on low-pass-rate prompts. We leave more empirical demonstration as future work.
#### D.4 Induced Wasserstein-type Geometry in Our Method
The transform $F_{\mathrm{ref}}$ provides a quantile-based difficulty coordinate: it measures the relative position of the pass rate $p_{\theta}(x)$ under a fixed reference distribution. This representation depends on the ordering structure of pass rates rather than their raw scale. In this sense, it is closer in spirit to one-dimensional Wasserstein-type quantile representations than to pointwise density-ratio-based objectives such as Kullback–Leibler (KL) divergence.
##### Kullback–Leibler \(KL\) divergence with pointwise density geometry\.
When we compare two one-dimensional distributions $\mu$ and $\nu$ that are absolutely continuous with respect to the Lebesgue measure, the densities $p_{\mu}$ and $p_{\nu}$ exist. Consequently, the Kullback–Leibler (KL) divergence is defined using the pointwise probability ratio:

$$D_{\mathrm{KL}}(\mu,\nu)=\int_{\mathbb{R}}p_{\mu}(x)\log\frac{p_{\mu}(x)}{p_{\nu}(x)}\,dx,$$

which induces a pointwise geometry in the density space.
##### Optimal Transport \(OT\)\-based Distance with expressive geometry\.
As a typical instance of optimal transport, the $p$-Wasserstein distance on one-dimensional random variables is defined by

$$W_{p}^{p}(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\int\|x-y\|^{p}\,d\gamma(x,y),$$

where $\gamma$ is the coupling, or transport plan, and $\Pi(\mu,\nu)$ is the set of joint distributions with marginals $\mu$ and $\nu$. By definition, the Wasserstein distance explicitly accounts for the underlying geometry of the data space, as opposed to the KL divergence. In one dimension, the Wasserstein distance is equivalent to *quantile matching*:

$$W_{p}^{p}(\mu,\nu)=\int_{0}^{1}\left|F_{\mu}^{-1}(u)-F_{\nu}^{-1}(u)\right|^{p}du,$$

which evaluates the distance in the quantile space with respect to the quantile function $F^{-1}$. When $p=1$, the 1-Wasserstein distance has an equivalent form based on the CDF:

$$W_{1}(\mu,\nu)=\int_{\mathbb{R}}\left|F_{\mu}(t)-F_{\nu}(t)\right|dt,$$

which induces a transport or ordering geometry.
In summary, our method operates on $F_{\mathrm{ref}}(p_{\theta}(x))$ in the distribution-aware utility functional, a surrogate of the one obtained from the perspective of distribution transport (moving the mass from $p_{\theta}(x)$ to $p^{*}(x)$) as introduced in Section [3.2](https://arxiv.org/html/2605.24331#S3.SS2). Since optimal transport in one dimension reduces to quantile matching and CDF matching, our method effectively operates in the quantile space (equivalent to the CDF space), inducing a Wasserstein-type geometry.
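The equivalence between the quantile form and the CDF form of the one-dimensional 1-Wasserstein distance can be illustrated on empirical samples. The sketch below is only an illustration of that standard identity, not of the CurveRL implementation: the function names and the sample values are hypothetical, and it assumes two samples of equal size so that the empirical quantile functions align index by index.

```python
import bisect


def w1_quantile(xs, ys):
    """Quantile form: average gap between order statistics (equal sample sizes)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def w1_cdf(xs, ys):
    """CDF form: integrate |F_x(t) - F_y(t)| over t, piecewise between sample points."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDFs are constant on [left, right).
        f_x = bisect.bisect_right(xs, left) / len(xs)
        f_y = bisect.bisect_right(ys, left) / len(ys)
        total += abs(f_x - f_y) * (right - left)
    return total


# Hypothetical pass rates of the same prompts under two policies.
pass_rates_a = [0.0, 0.1, 0.1, 0.4]
pass_rates_b = [0.2, 0.5, 0.7, 0.9]

print(w1_quantile(pass_rates_a, pass_rates_b))
print(w1_cdf(pass_rates_a, pass_rates_b))
```

Both functions return the same value up to floating-point rounding, which is the sense in which working in the quantile space and working in the CDF space are interchangeable in one dimension.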
#### D.5 Integrating both Pointwise and Distribution-aware Utility Functionals
##### Strategy 1: Single Function Transformation\.
We integrate both $p_{\theta}(x)$ and $F_{\mathrm{ref}}\left(p_{\theta}(x)\right)$ into a single function transformation $\phi$:

$$\mathcal{U}_{\theta}=\mathbb{E}_{x\sim d_{0}}\left[\phi\left(p_{\theta}(x),F_{\mathrm{ref}}\left(p_{\theta}(x)\right)\right)\right].$$

We choose $\phi(a,b)=\log\left(a^{1-\lambda}b^{\lambda}\right)$ for computational convenience, which leads to

$$\mathcal{U}_{\theta}=\mathbb{E}_{x\sim d_{0}}\left[\log\left(p_{\theta}(x)^{1-\lambda}F_{\mathrm{ref}}\left(p_{\theta}(x)\right)^{\lambda}\right)\right]=(1-\lambda)\,\mathbb{E}_{x\sim d_{0}}\left[\log p_{\theta}(x)\right]+\lambda\,\mathbb{E}_{x\sim d_{0}}\left[\log F_{\mathrm{ref}}\left(p_{\theta}(x)\right)\right].$$

Notably, this utility functional is naturally meaningful, as it is a monotonically increasing function of $p_{\theta}(x)$. Therefore, by applying the functional derivative in Definition [1](https://arxiv.org/html/2605.24331#Thmdefinition1), the optimal weight function and gradient update rule are:

$$\nabla_{\theta}J_{\mathrm{integrated}}(\theta)=\mathbb{E}_{x\sim d_{0}}\left[\left((1-\lambda)\frac{1}{p_{\theta}(x)}+\lambda\frac{f_{\mathrm{ref}}\left(p_{\theta}(x)\right)}{F_{\mathrm{ref}}\left(p_{\theta}(x)\right)}\right)\nabla_{\theta}p_{\theta}(x)\right].$$
##### Strategy 2: Multiplication\.
Alternatively, we can directly multiply the two function transformations $\psi$ and $g$ under a monotonicity constraint to ensure a meaningful utility function over $p_{\theta}(x)$:

$$\begin{aligned}
\mathcal{U}_{\theta}&=\mathbb{E}_{x\sim d_{0}}\left[\psi\left(F_{\mathrm{ref}}\left(p_{\theta}(x)\right)\right)g\left(p_{\theta}(x)\right)\right],\\
\text{s.t.}\quad \frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)&=\psi\left(F_{\mathrm{ref}}(p_{\theta}(x))\right)g'(p_{\theta}(x))+\psi'\left(F_{\mathrm{ref}}(p_{\theta}(x))\right)f_{\mathrm{ref}}(p_{\theta}(x))\,g(p_{\theta}(x))>0,
\end{aligned}$$

where the monotonicity constraint is derived by requiring a positive first-order derivative with respect to $p_{\theta}(x)$. When we employ $\psi(\mu)=-\log(\mu)$ and $g(\nu)=\log(\nu)$, this guarantees $\frac{\delta\mathcal{U}_{\theta}}{\delta p_{\theta}}(x)>0$. Consequently, the integrated gradient update rule is:

$$\nabla_{\theta}J_{\mathrm{integrated}}(\theta)=\mathbb{E}_{x\sim d_{0}}\left[-\left(\frac{\log F_{\mathrm{ref}}\left(p_{\theta}(x)\right)}{p_{\theta}(x)}+\frac{f_{\mathrm{ref}}\left(p_{\theta}(x)\right)\log p_{\theta}(x)}{F_{\mathrm{ref}}\left(p_{\theta}(x)\right)}\right)\nabla_{\theta}p_{\theta}(x)\right].\tag{24}$$
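The two integrated weight functions can be evaluated directly once a reference distribution is fixed. The sketch below is a hedged illustration, not the authors' implementation: it plugs in the power-law reference $F_{\mathrm{ref}}(p)=p^{\alpha}$ from Appendix D.3, and `ALPHA`, `MIX`, the grid, and the function names are all illustrative assumptions. Here `mix` plays the role of the mixing coefficient $\lambda$ in Strategy 1.

```python
import math


def strategy1_weight(p, cdf, pdf, mix):
    """Strategy 1 weight: (1 - mix) / p + mix * f_ref(p) / F_ref(p)."""
    return (1.0 - mix) / p + mix * pdf(p) / cdf(p)


def strategy2_weight(p, cdf, pdf):
    """Strategy 2 weight with psi = -log and g = log:
    -(log F_ref(p) / p + f_ref(p) * log p / F_ref(p))."""
    return -(math.log(cdf(p)) / p + pdf(p) * math.log(p) / cdf(p))


ALPHA = 0.5  # hypothetical power-law exponent for F_ref(p) = p**ALPHA
MIX = 0.3    # hypothetical mixing coefficient (lambda in Strategy 1)


def cdf(p):
    return p ** ALPHA


def pdf(p):
    return ALPHA * p ** (ALPHA - 1.0)


# Interior pass rates only: 1/p and log(p) are undefined at p = 0,
# and the Strategy 2 weight vanishes at p = 1.
GRID = [i / 16 for i in range(1, 16)]

s1 = [strategy1_weight(p, cdf, pdf, MIX) for p in GRID]
s2 = [strategy2_weight(p, cdf, pdf) for p in GRID]

print("Strategy 1 weights positive:", all(w > 0 for w in s1))
print("Strategy 2 weights positive:", all(w > 0 for w in s2))
```

For this particular reference distribution the weights have closed forms, $(1-\lambda+\lambda\alpha)/p$ for Strategy 1 and $-2\alpha\log p/p$ for Strategy 2, which makes the positivity required by the monotonicity constraint easy to verify on $(0,1)$.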
##### Remark\.
Although the above integrated formulations provide a natural way to combine local sensitivity and distributional information, we do not observe empirical improvements in practice. This suggests that simply combining pointwise and distribution-aware terms does not necessarily lead to more effective learning signals, due to the *geometry mismatch or gradient conflicts*. This observation indicates that the benefit of distribution-aware objectives does not arise from a direct additive or multiplicative fusion with pointwise terms, but rather from explicitly modeling the geometry of the pass-rate distribution. It therefore helps to position our contribution: purely distribution-aware utility functionals, with their clearer geometric interpretation, should not be regarded as a regularization of or supplement to pointwise ones, but as a fundamentally different objective class that yields a separate family of weight functions.