# Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Source: [https://arxiv.org/html/2605.06785](https://arxiv.org/html/2605.06785)
Rachel Ma (MIT CSAIL, rachelm8@mit.edu), Dylan Hadfield-Menell (MIT CSAIL), Kristjan Greenewald (IBM Research)
###### Abstract
Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning (Bunne et al., [2022](https://arxiv.org/html/2605.06785#bib.bib2)) to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)). We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.
## 1 Introduction
Scaling inference-time compute has emerged as a powerful paradigm for improving large language model (LLM) performance on reasoning tasks (Snell et al., [2024](https://arxiv.org/html/2605.06785#bib.bib8); Brown et al., [2024](https://arxiv.org/html/2605.06785#bib.bib16)). Rather than relying on a single response from a fixed model, inference-time scaling methods generate multiple candidate reasoning trajectories and use a scoring model to select among them. Process Reward Models (PRMs), which score intermediate reasoning steps with respect to a task (Cobbe et al., [2021](https://arxiv.org/html/2605.06785#bib.bib5); Uesato et al., [2022](https://arxiv.org/html/2605.06785#bib.bib6); Lightman et al., [2023](https://arxiv.org/html/2605.06785#bib.bib7)), can provide a per-step signal that guides search, selection, and budget allocation. The quality of these decisions depends directly on how well PRM scores reflect true success probabilities.

Figure 1: Estimated success probability for a math reasoning trajectory: uncalibrated base PRMs typically overestimate, while quantile regression provides limited flexibility and can produce crossing violations. Our conditional OT method guarantees a monotonic quantile function and enables flexible uncertainty estimation at arbitrary confidence levels.

In practice, however, state-of-the-art PRMs are often poorly calibrated and optimistic (Park et al., [2025](https://arxiv.org/html/2605.06785#bib.bib1)). This is particularly damaging in inference-time scaling, where budget allocation strategies such as Best-of-N sampling (Brown et al., [2024](https://arxiv.org/html/2605.06785#bib.bib16)) and instance-adaptive scaling (IAS) (Park et al., [2025](https://arxiv.org/html/2605.06785#bib.bib1)) treat PRM scores as proxies for success probability. An overconfident PRM assigns inflated scores to incorrect trajectories, causing the algorithm to under-sample hard problems and commit to wrong solutions. Improving PRM calibration is therefore a prerequisite for inference-time scaling to work as intended.
Recent work by Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)) addresses this through quantile regression, fitting a model to predict a fixed set of quantile levels of the success probability distribution given PRM representations. While effective, this approach has a structural limitation: the quantile levels must be fixed at training time, so the model cannot be queried at arbitrary confidence levels without retraining. Moreover, quantile regression treats each quantile independently, so a higher quantile may produce a lower predicted value than a lower one, violating the basic properties of a valid quantile function. In settings where inference-time scaling decisions depend on the shape of the uncertainty distribution, these limitations constrain both flexibility and reliability.
We propose to address these limitations using conditional optimal transport (OT). Building on the dual network architecture of Bunne et al. ([2022](https://arxiv.org/html/2605.06785#bib.bib2)), we modify the CondOT architecture to condition on PRM hidden states, learning a calibrated optimal transport mapping from PRM representations to the distribution of empirical success outcomes. Calibrated uncertainty estimates can directly inform per-question compute allocation within the IAS framework of Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)). Rather than collapsing the calibrated posterior to a point estimate, our method propagates the full predictive distribution into the allocation decision, increasing sample budgets for questions where success probability is uncertain and reducing them where it is confidently high.
We summarize our specific contributions:
1. A PRM calibration method based on conditional optimal transport that learns a full monotone conditional quantile function from PRM hidden states, enabling flexible uncertainty estimates at arbitrary confidence levels from a single model without retraining.
2. Empirical evidence that OT calibration substantially improves Brier score, ECE, and weighted quantile loss over uncalibrated PRMs and quantile regression for well-specified PRMs, on both in-distribution (MATH-500) and out-of-distribution (AIME24-25) benchmarks.
3. An analysis of downstream Best-of-N IAS (Park et al., [2025](https://arxiv.org/html/2605.06785#bib.bib1)) performance showing that OT's flexible predictive distribution enables accuracy improvements over the uncalibrated base PRM.
## 2 Related Work
Process Reward Models (PRMs): Process Reward Models (PRMs) score intermediate reasoning steps, estimating their contribution to producing a correct final solution (Cobbe et al., [2021](https://arxiv.org/html/2605.06785#bib.bib5); Lightman et al., [2023](https://arxiv.org/html/2605.06785#bib.bib7); Uesato et al., [2022](https://arxiv.org/html/2605.06785#bib.bib6)). They are widely used in inference-time scaling algorithms, which often rely on Best-of-N (BoN) sampling: generating multiple candidate responses and selecting the highest-scoring output with a reward model (Chow et al., [2024](https://arxiv.org/html/2605.06785#bib.bib15); Cobbe et al., [2021](https://arxiv.org/html/2605.06785#bib.bib5); Brown et al., [2024](https://arxiv.org/html/2605.06785#bib.bib16)). Recent PRMs such as Qwen-PRM (Zhang et al., [2025](https://arxiv.org/html/2605.06785#bib.bib17)), Shepherd-PRM (Wang et al., [2024](https://arxiv.org/html/2605.06785#bib.bib14)), and ReasonEval (Xia et al., [2025](https://arxiv.org/html/2605.06785#bib.bib18)) demonstrate strong performance on reasoning benchmarks. Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)) also propose instance-adaptive sampling strategies for LLM inference-time scaling, which we employ in this work.
However, PRMs are often poorly calibrated, producing overconfident estimates of success that can lead to suboptimal search decisions. Prior work has addressed this issue using quantile regression to model uncertainty over PRM outputs (Park et al., [2025](https://arxiv.org/html/2605.06785#bib.bib1)), but such approaches require fixing quantile levels during training, limiting flexibility at inference time. We instead propose calibrating PRMs by learning the full conditional quantile function via conditional optimal transport, which allows flexible and efficient uncertainty estimation without retraining for different quantile levels.
Uncertainty Quantification for LLMs: Uncertainty quantification (UQ) for large language models has been studied across a range of settings, including token-level prediction, sequence-level generation, and decision-making over structured outputs. Approaches include predictive likelihood-based measures (Kadavath et al., [2022](https://arxiv.org/html/2605.06785#bib.bib22)), self-consistency (Wang et al., [2022](https://arxiv.org/html/2605.06785#bib.bib23)), and conformal prediction (Ye et al., [2024](https://arxiv.org/html/2605.06785#bib.bib25)). These methods primarily focus on estimating uncertainty over final outputs or token predictions, while we focus on intermediate steps. Various methods for calibrating uncertainty predictions for LLMs have been proposed, such as temperature scaling (Guo et al., [2017](https://arxiv.org/html/2605.06785#bib.bib26)), reinforcement learning with rewards (Damani et al., [2025](https://arxiv.org/html/2605.06785#bib.bib27)), learning a mapping from semantic meaning to confidence scores (Cox et al., [2025](https://arxiv.org/html/2605.06785#bib.bib24)), or probing techniques (Liu et al., [2024](https://arxiv.org/html/2605.06785#bib.bib28)). In contrast to these approaches, which typically produce pointwise or task-specific uncertainty estimates, our method learns a structured, representation-conditioned uncertainty model over PRM outputs, enabling consistent estimation across confidence levels.
Conditional Optimal Transport with Neural Networks: Optimal transport provides a principled framework for mapping between probability distributions (Villani et al., [2009](https://arxiv.org/html/2605.06785#bib.bib9); Peyré and Cuturi, [2019](https://arxiv.org/html/2605.06785#bib.bib10)), and has been applied to generative modeling (Arjovsky et al., [2017](https://arxiv.org/html/2605.06785#bib.bib11)) and domain adaptation (Courty et al., [2016](https://arxiv.org/html/2605.06785#bib.bib12), [2017](https://arxiv.org/html/2605.06785#bib.bib13)). Recent work has explored learning optimal transport maps conditioned on context using neural networks (Rodriguez-Pardo et al., [2025](https://arxiv.org/html/2605.06785#bib.bib3); Bunne et al., [2022](https://arxiv.org/html/2605.06785#bib.bib2); Wang et al., [2025](https://arxiv.org/html/2605.06785#bib.bib4)). In this work, we adapt the dual network of Bunne et al. ([2022](https://arxiv.org/html/2605.06785#bib.bib2)) to condition on large language model (LLM) hidden states. While prior work primarily uses conditional optimal transport for distribution modeling and generative modeling, we instead use it to learn calibrated mappings between predicted scores and outcomes, with the goal of learning a consistent conditional quantile function for uncertainty estimation.
## 3 Preliminaries
We use inference-time scaling (Park et al., [2025](https://arxiv.org/html/2605.06785#bib.bib1)) for large language models (LLMs), where multiple reasoning trajectories are generated and evaluated using a Process Reward Model (PRM). We introduce notation for trajectories, success probabilities, and instance-adaptive inference-time scaling used in this setting.
Best-of-N (Brown et al., [2024](https://arxiv.org/html/2605.06785#bib.bib16)):

$$\mathbf{x}^{(i)}=\left(x^{(i)}_{1},\ x^{(i)}_{2},\ \ldots,\ x^{(i)}_{T^{(i)}}\right)\sim\text{LLM}(q),\quad\text{for }i=1,\ldots,N,$$

where $q$ is the query, $\mathbf{x}^{(i)}$ is a reasoning trajectory generated by the LLM ($N$ complete trajectories are generated in total), $x^{(i)}_{t}$ is the $t$-th reasoning step, and $T^{(i)}$ is the total length of the trajectory.

Each trajectory is assigned a score by the PRM, $r^{(i)}=\text{PRM}(q,\mathbf{x}^{(i)})$, and the final output is the trajectory awarded the highest reward.
Success Probability: A key quantity in inference-time scaling is the success probability of a partial trajectory,

$$p\triangleq\Pr\left(x_{t+1:T}\text{ generated by the LLM yields a correct answer}\mid q,\mathbf{x}_{0:t}\right),$$

where $q$ is the query and $\mathbf{x}_{0:t}$ is the portion of the trajectory generated so far, from step 1 to step $t$ (with $\mathbf{x}_{0:0}$ the empty sequence). This quantity captures the likelihood that continuing the current reasoning path leads to a correct solution. In practice, $p$ is unknown and must be estimated, introducing uncertainty that directly affects downstream decision-making.
Instance-Adaptive Inference-Time Scaling (IAS): IAS allocates computational budget based on the estimated success probability. Given a target confidence $C\in(0,1)$, the number of samples required to achieve this confidence is

$$N^{\star}(p,C)\triangleq\min\left\{n\in\mathbb{N}:\Pr(\text{at least one out of }n\text{ trajectories is correct})\geq C\right\}.$$

In practice, Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)) approximate this with PRMs, where $\hat{r}^{(\beta)}$ is the PRM's estimated success probability at quantile level $\beta$ and $N_{\max}$ is a maximum budget constraint:

$$N_{\text{IAS}}(p,C)\triangleq\frac{\log(1-C)}{\log(1-p)},\qquad N_{\text{IAS}}=\min\left\{\left\lceil N_{\text{IAS}}(\hat{r}^{(\beta)},C)\right\rceil,\ N_{\max}\right\}.\tag{1}$$
For IAS to be effective, it is important to have well-calibrated estimates of $p$.
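For concreteness, Eq. (1) amounts to a few lines of arithmetic. The sketch below is a minimal illustration in Python; the function and variable names are ours, not from any released implementation.

```python
import math

def ias_budget(p_hat: float, C: float = 0.99, n_max: int = 64) -> int:
    """Number of samples so that P(at least one success) >= C, assuming
    independent trials with estimated success probability p_hat (Eq. 1)."""
    if p_hat <= 0.0:
        return n_max                      # no signal: fall back to the maximum budget
    if p_hat >= 1.0 or C <= 0.0:
        return 1
    n = math.log(1.0 - C) / math.log(1.0 - p_hat)
    return min(math.ceil(n), n_max)

# e.g. a calibrated estimate of 0.3 at C = 0.99 requires ceil(12.9) = 13 samples
print(ias_budget(0.3, C=0.99, n_max=64))
```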
## 4 Calibrating PRMs with Conditional Optimal Transport
Our goal is to learn a calibrated estimate of the success probability $p$ given a PRM. Rather than predicting a single scalar or a fixed set of quantiles (as done for quantile regression calibration), we aim to learn the full conditional distribution of outcomes, enabling consistent uncertainty estimation across all quantile levels.
Optimal Transport for Calibration: Optimal transport (OT) provides a principled framework for mapping between probability distributions. In the Monge formulation, OT seeks a map $T^{\star}$ that pushes a source distribution $\mu$ to a target distribution $\nu$ while minimizing transport cost:

$$T^{\star}:=\arg\inf_{T_{\#}\mu=\nu}\int_{\mathbb{R}^{d}}\|x-T(x)\|^{2}\,d\mu(x).$$

In our setting, $\mu$ corresponds to the distribution of PRM predictions conditioned on PRM hidden states, while $\nu$ corresponds to the empirical distribution of success outcomes. The learned transport map aligns predicted scores with calibrated outcome distributions.
Conditional Optimal Transport (CondOT): We adapt CondOT (Bunne et al., [2022](https://arxiv.org/html/2605.06785#bib.bib2)). CondOT learns context-conditioned optimal transport maps via the dual formulation of optimal transport. Given context $h$, CondOT learns a map that transports a source distribution to a target distribution. In the dual formulation, optimal transport is expressed in terms of two scalar potentials $f$ and $g$. CondOT parameterizes these dual potentials using two partially input-convex neural networks (PICNNs), $g:\text{PICNN}_{\theta_{g}}(\cdot,h)$ and $f:\text{PICNN}_{\theta_{f}}(\cdot,h)$, enabling flexible function approximation while enforcing convexity in the transported variable. The dual potentials $f$ and $g$ can be learned via the following min-max objectives:
$$\ell^{f}_{\mathrm{DOT}}(\mu,\nu,c;\theta_{f})=\mathbb{E}_{x\sim\mu}\!\left[\mathrm{PICNN}_{\theta_{g}}(x,c)\right]-\mathbb{E}_{y\sim\nu}\!\left[\mathrm{PICNN}_{\theta_{f}}\!\left(\nabla_{y}\mathrm{PICNN}_{\theta_{g}}(y,c),\,c\right)\right],\tag{2}$$

$$\ell^{g}_{\mathrm{DOT}}(\mu,\nu,c;\theta_{g})=-\mathbb{E}_{y\sim\nu}\!\left[\left\langle y,\,\nabla_{y}\mathrm{PICNN}_{\theta_{g}}(y,c)\right\rangle-\mathrm{PICNN}_{\theta_{f}}\!\left(\nabla_{y}\mathrm{PICNN}_{\theta_{g}}(y,c),\,c\right)\right].\tag{3}$$
The resulting transport map is given by the gradient of the target potential:
$$T_{\theta}(x,h)=\nabla_{x}g_{\theta}(x,h).$$
This parameterization induces a structured and monotone mapping between predicted scores and outcome distributions\. In particular, monotonicity with respect to the source variable ensures a globally consistent relationship between confidence levels and predicted success probabilities, preventing quantile crossing\.
In our setting, the context $h$ corresponds to PRM hidden states, the source distribution represents uncalibrated PRM-derived scores, and the target distribution corresponds to empirical success outcomes. The learned conditional transport map therefore defines a calibrated mapping from PRM representations to outcome distributions.
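To make the objectives in Eqs. (2)-(3) concrete, the sketch below shows how the two dual losses could be computed with automatic differentiation. It is a deliberate simplification of our setup: plain MLPs stand in for the PICNN potentials (so convexity in the transported variable is not enforced), the context is assumed to already be a low-dimensional embedding, and the alternating update schedule is omitted.

```python
import torch
import torch.nn as nn

class Potential(nn.Module):
    """Stand-in for a PICNN dual potential: maps (value, context) -> scalar."""
    def __init__(self, ctx_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + ctx_dim, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, 1),
        )
    def forward(self, v, c):
        return self.net(torch.cat([v, c], dim=-1))

def dual_losses(f_net, g_net, x, y, c):
    """x ~ mu (source scores), y ~ nu (success outcomes), c = context embedding."""
    y = y.detach().clone().requires_grad_(True)
    g_y = g_net(y, c)
    # transported point T(y) = grad_y g(y, c), kept in the graph for backprop
    grad_y = torch.autograd.grad(g_y.sum(), y, create_graph=True)[0]
    f_of_grad = f_net(grad_y, c)
    loss_f = g_net(x, c).mean() - f_of_grad.mean()                         # Eq. (2)
    loss_g = -((y * grad_y).sum(dim=-1, keepdim=True) - f_of_grad).mean()  # Eq. (3)
    return loss_f, loss_g

# toy usage: 1-D scores/outcomes with an 8-dimensional context embedding
f_net, g_net = Potential(8), Potential(8)
x, y, c = torch.rand(32, 1), torch.rand(32, 1), torch.randn(32, 8)
loss_f, loss_g = dual_losses(f_net, g_net, x, y, c)
```

In the full method the potentials are PICNNs and $g$ is updated more frequently than $f$, as described in Appendix A.1.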
Learning Conditional Quantiles: A key property of one-dimensional optimal transport is that the transport map corresponds to the quantile function of the target distribution. Leveraging this, we use the learned conditional transport map to recover the full conditional quantile function
$$Q_{\theta}(\beta\mid h),$$
enabling calibrated uncertainty estimates at arbitrary confidence levels without retraining.
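The underlying fact is standard: in one dimension, the monotone optimal transport map from the uniform distribution on $[0,1]$ to a target distribution is exactly the target's quantile function. A tiny NumPy check of this fact (illustrative only, not part of our pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.beta(2.0, 5.0, size=10_000)   # stand-in for empirical success rates

# In 1-D, the monotone (OT) map pushing Uniform[0,1] onto `outcomes`
# is the empirical quantile function of `outcomes`.
taus = np.linspace(0.0, 1.0, 11)
quantile_map = np.quantile(outcomes, taus)   # discretized transport map
print(quantile_map.round(2))

# Pushing uniform samples through this map recovers the target distribution.
pushed = np.quantile(outcomes, rng.uniform(size=10_000))
print(np.allclose(pushed.mean(), outcomes.mean(), atol=0.01))
```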
Modification to CondOT conditioning. We keep the PICNN architecture used for the dual potentials, but simplify the conditioning mechanism. Whereas CondOT uses separate context embedding and combinator modules, we instead embed the PRM hidden state directly within each PICNN. Concretely, before the PICNN recurrence, we pass the conditioning variable $h$ (the PRM hidden state) through a small MLP that maps it to a lower-dimensional representation,
$$u_{0}=\phi_{\eta}(h),$$
where $\phi_{\eta}$ is a learned MLP, and use $u_{0}$ as the initial context representation for the PICNN. This dimensionality reduction is important in our setting, as PRM hidden states are high-dimensional (e.g., thousands of dimensions). Directly conditioning on these large representations can lead to unstable training and inefficient parameterization. The learned embedding provides a compact, task-adaptive representation of context, improving optimization while preserving the conditional transport formulation. As a result, the separate embedding and combinator modules from CondOT are no longer necessary.
This design preserves convexity with respect to the transported variable while allowing flexible dependence on the context\. As a result, the learned transport map adapts to different reasoning trajectories while maintaining the structural guarantees required for optimal transport\.
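A minimal sketch of this conditioning change follows; the dimensions and layer sizes below are illustrative placeholders rather than the values used in our experiments.

```python
import torch
import torch.nn as nn

class ContextEmbed(nn.Module):
    """phi_eta: compress a high-dimensional PRM hidden state h into u0."""
    def __init__(self, prm_hidden_dim: int = 3584, ctx_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(prm_hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, ctx_dim),
        )
    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # u0 is fed to both PICNN potentials as the initial context representation
        return self.mlp(h)

phi = ContextEmbed()
u0 = phi(torch.randn(32, 3584))   # (batch, ctx_dim)
```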
Training: We construct calibration data following a procedure similar to Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)). For each query $q$, we first generate multiple reasoning trajectories using the target LLM. From each trajectory, for each prefix $\mathbf{x}_{0:t}^{(i)}$, additional trajectories are generated, and we compute the empirical success probability
$$\tilde{p}=\frac{\#\text{ correct completions}}{\#\text{ rollouts}}.$$
This yields training triples $(r,h,\tilde{p})$, where $r$ is the uncalibrated PRM score, $h$ denotes the PRM hidden state corresponding to the prefix, and $\tilde{p}$ is the empirical success probability. We generate training samples from a subset of the MATH500 benchmark (Hendrycks et al., [2021](https://arxiv.org/html/2605.06785#bib.bib19)).
These data provide supervision for learning a calibrated mapping from PRM representations to outcome distributions\. In contrast to prior work that fits quantile regression targets directly, we use these samples to train a conditional optimal transport map that aligns the distribution of predicted scores with empirical success outcomes conditioned onhh\. In practice, we optimize the transport objective with mini\-batches sampled from the conditional source and target distributions\.
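Schematically, the data construction can be sketched as follows, where `generate_rollouts`, `is_correct`, and `prm_score_and_hidden` are hypothetical placeholders for the actual generation, answer-checking, and PRM-scoring code:

```python
def build_calibration_triples(question, prefixes, n_rollouts=8):
    """For each trajectory prefix, estimate p~ by Monte Carlo rollouts and
    pair it with the PRM score r and hidden state h of the prefix."""
    triples = []
    for prefix in prefixes:
        completions = generate_rollouts(question, prefix, n=n_rollouts)   # placeholder
        p_tilde = sum(is_correct(question, c) for c in completions) / n_rollouts
        r, h = prm_score_and_hidden(question, prefix)                     # placeholder
        triples.append((r, h, p_tilde))
    return triples
```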
We perform early stopping based on calibration performance on a held-out validation set. At each checkpoint, we evaluate the learned conditional quantile function by computing the empirical calibration curve $\mathbb{P}\left(y\leq Q_{\theta}(\beta\mid h)\right)$, where $Q_{\theta}(\beta\mid h)$ denotes the predicted $\beta$-quantile conditioned on the PRM hidden state $h$. In practice, this is approximated over a discrete grid of quantile levels. We select the checkpoint that minimizes the area between the empirical calibration curve and the ideal calibration curve, which provides a measure of aggregate miscalibration across quantile levels. Additional training details can be found in Appendix [A.1](https://arxiv.org/html/2605.06785#A1.SS1).
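A simplified sketch of this checkpoint-selection criterion, where `q_fn(beta, h)` is a placeholder for the learned conditional quantile function $Q_{\theta}(\beta\mid h)$:

```python
import numpy as np

def calibration_area(q_fn, hidden_states, outcomes, n_levels=21):
    """Area between the empirical calibration curve P(y <= Q(beta|h)) and the
    ideal diagonal, approximated on a grid of quantile levels (lower is better)."""
    betas = np.linspace(0.0, 1.0, n_levels)
    coverage = np.array([
        np.mean([y <= q_fn(b, h) for h, y in zip(hidden_states, outcomes)])
        for b in betas
    ])
    return np.trapz(np.abs(coverage - betas), betas)
```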
## 5 Experiments
We evaluate on MATH500 (Hendrycks et al., [2021](https://arxiv.org/html/2605.06785#bib.bib19)) and AIME24-25, consisting of AIME 2024 (Zhang and Math-AI, [2024](https://arxiv.org/html/2605.06785#bib.bib29); [https://huggingface.co/datasets/HuggingFaceH4/aime_2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024)) and AIME 2025 ([https://huggingface.co/datasets/opencompass/AIME2025](https://huggingface.co/datasets/opencompass/AIME2025)). We consider six LLMs: Llama-3.2-1B and Llama-3.1-8B-Instruct (Touvron et al., [2023](https://arxiv.org/html/2605.06785#bib.bib30)), Qwen2.5-Math-1.5B and 7B-Instruct (Yang et al., [2025](https://arxiv.org/html/2605.06785#bib.bib31)), and DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B (Guo et al., [2025](https://arxiv.org/html/2605.06785#bib.bib32)). We focus primarily on Qwen2.5-Math-PRM-7B (Zhang et al., [2025](https://arxiv.org/html/2605.06785#bib.bib17)) as the scoring PRM, as it is the strongest-performing small open-source PRM (Song et al., [2025](https://arxiv.org/html/2605.06785#bib.bib33)), and include additional PRMs (ReasonEval-7B (Xia et al., [2025](https://arxiv.org/html/2605.06785#bib.bib18)) and Math-Shepherd-Mistral-7B (Wang et al., [2024](https://arxiv.org/html/2605.06785#bib.bib14))) in the Appendix.
Our experiments evaluate three aspects: (1) the flexibility of the learned quantile function, (2) calibration performance relative to uncalibrated base PRMs and quantile regression baselines, and (3) downstream performance when integrated into instance-adaptive inference-time scaling (Park et al., [2025](https://arxiv.org/html/2605.06785#bib.bib1)).
Quantile Regression (QR) Baseline. As a baseline, we adapt the quantile regression (QR) calibration approach of Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)) to predict a fixed set of quantiles from PRM hidden states offline. Specifically, given a hidden representation $h$, the model uses a linear layer with $M$ outputs, each corresponding to a target quantile level. The model is trained using the standard pinball loss (Koenker and Bassett Jr, [1978](https://arxiv.org/html/2605.06785#bib.bib36)), which independently penalizes deviations between predicted quantiles and empirical success probabilities at each quantile level. While this approach provides direct estimates of uncertainty, it requires pre-specifying the set of quantile levels $\{\beta_{m}\}$ during training and treats each quantile independently. By default, we train on 11 quantiles instead of the 3 quantiles used in Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)).
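For reference, a minimal sketch of the pinball loss used to train this baseline (shapes and names are ours):

```python
import torch

def pinball_loss(pred_quantiles, target, levels):
    """pred_quantiles: (batch, M) predicted quantiles; target: (batch,) empirical p~;
    levels: (M,) quantile levels, e.g. torch.linspace(0, 1, 11)."""
    diff = target.unsqueeze(1) - pred_quantiles              # (batch, M)
    loss = torch.maximum(levels * diff, (levels - 1.0) * diff)
    return loss.mean()
```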
### 5.1 Flexibility of Quantile Estimates
Figure [2](https://arxiv.org/html/2605.06785#S5.F2) illustrates the flexibility of our conditional optimal transport (OT) approach compared to a quantile regression (QR) baseline on a single example. The quantile level $\tau$ on the x-axis indexes uncertainty in $p$: at each $\tau$, the curve gives the value such that a fraction $\tau$ of the predictive distribution over $p$ falls at or below it, so higher $\tau$ corresponds to a more optimistic estimate of success. OT learns a continuous, globally consistent quantile function, allowing the success probability to be queried at arbitrary confidence levels $\tau\in[0,1]$ without retraining. This results in a smooth and monotone curve across quantiles, reflecting a coherent underlying distribution over success probabilities. In contrast, QR estimates quantiles independently at a fixed set of levels chosen during training, limiting flexibility at inference time. As a result, QR can only be evaluated at those discrete points and may exhibit inconsistencies such as quantile crossing, where higher quantiles produce lower predicted values. This violates the defining properties of a valid quantile function and can lead to unreliable uncertainty estimates. Overall, this comparison highlights that OT provides both greater flexibility and stronger structural guarantees, enabling more reliable uncertainty quantification for inference-time scaling.
Figure 2: Estimated success probability for one question (DeepSeek-R1-Distill-Qwen-7B, Qwen2.5-Math-PRM-7B scorer). OT allows any quantile to be queried freely at inference; OT (blue) produces a smooth, monotonic curve (100 levels). QR (orange) can only be evaluated at prefixed values (11 levels) and exhibits quantile crossing.

Figure [6](https://arxiv.org/html/2605.06785#A1.F6) illustrates how downstream performance varies as a function of the quantile level $\tau$ used to query the learned distribution over success probabilities. The quantile level effectively indexes different points along the predictive distribution, with lower $\tau$ corresponding to more conservative estimates and higher $\tau$ to more optimistic ones. Our OT-based method supports continuous queries over $\tau\in[0,1]$, enabling smooth and stable behavior across quantile levels. In contrast, the QR baseline is restricted to a fixed grid of quantiles determined at training time, limiting its ability to adapt at inference. As a result, QR exhibits unstable performance, particularly at higher $\tau$, where it relies on poorly calibrated tail estimates. This effect is especially pronounced on the more challenging AIME benchmarks, as shown in the graph for DeepSeek-R1-Distill-Qwen-7B scored by QwenPRM-7B, where accuracy drops sharply beyond $\tau>0.8$. Additional figures for other LLMs scored by QwenPRM-7B can be found in Appendix [A.2](https://arxiv.org/html/2605.06785#A1.SS2).
These results highlight that OT provides a flexible, continuous representation of uncertainty that supports consistent performance across a wide range of quantile queries, whereas QR’s discretized estimates lead to brittle behavior\.
Figure 3: Accuracy as a function of the quantile level $\tau$ used in the IAS stopping criterion, at fixed target $C=0.99$, for R1-Qwen-7B scored by QwenPRM-7B on MATH500 (left) and AIME24-25 (right). OT (blue) sweeps $\tau$ over 100 continuous levels; QR (orange) is restricted to 11 fixed quantiles. OT maintains higher accuracy across all levels and degrades more gracefully at large $\tau$, while QR drops sharply on AIME at $\tau>0.8$, reflecting miscalibration at the tails of the QR quantile grid.
### 5.2 Calibration Evaluations
We evaluate calibration by comparing three score variants: uncalibrated base PRM scores, OT-calibrated scores, and regression-calibrated (QR) scores. For point-estimate metrics, OT uses the integrated expected success probability $\mathbb{E}[\hat{p}\mid h]=\int_{0}^{1}Q(\tau\mid h)\,d\tau$ (approximated via the trapezoid rule over 11 quantile levels), while QR uses the median quantile prediction ($\tau=0.5$). For each (dataset, model, PRM) triple we report four metrics. The Brier score (Glenn et al., [1950](https://arxiv.org/html/2605.06785#bib.bib34)) and positive-class Brier score (PosBrier) are both means over all $N$ (problem, response) pairs. ECE (Naeini et al., [2015](https://arxiv.org/html/2605.06785#bib.bib35)) uses 12 equal-width bins whose range is extended slightly beyond $[0,1]$ to avoid clipping predictions at the boundary. WQL (Koenker and Bassett Jr, [1978](https://arxiv.org/html/2605.06785#bib.bib36)) is reported only for OT and QR, as the base PRM has no predictive distribution and receives NaN for this metric; it averages the pinball loss over all $N$ samples at each of 11 quantile levels $\tau\in\{0.0,0.1,\ldots,1.0\}$, then averages equally across quantile levels. We use 11 quantile levels for OT to match the QR baseline for fair comparison, while keeping evaluation computationally efficient.
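The point-estimate reduction and the per-sample metrics can be sketched as follows. This is a simplified illustration: it uses 11 evenly spaced quantile levels, and unlike the evaluation described above, the ECE helper does not extend the bin range beyond $[0,1]$.

```python
import numpy as np

TAUS = np.linspace(0.0, 1.0, 11)

def ot_point_estimate(quantile_preds):
    """E[p|h] via the trapezoid rule over quantile predictions of shape (n, 11)."""
    return np.trapz(quantile_preds, TAUS, axis=1)

def brier(p_hat, y):
    """Mean squared error between predictions and outcomes (binary or empirical)."""
    return np.mean((p_hat - y) ** 2)

def weighted_quantile_loss(quantile_preds, y):
    """Mean pinball loss over samples, averaged equally across quantile levels."""
    diff = y[:, None] - quantile_preds                     # (n, 11)
    return np.mean(np.maximum(TAUS * diff, (TAUS - 1.0) * diff))

def ece(p_hat, y, n_bins=12):
    """Expected calibration error with equal-width bins over [0, 1]."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_hat, bins) - 1, 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            err += mask.mean() * abs(p_hat[mask].mean() - y[mask].mean())
    return err
```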
For MATH500, calibration metrics are computed over the full PRM calibration dataset created by Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)), including both training and validation splits, since calibration is assessed as a property of both the learned scoring function and its generalization to unseen data. In contrast, all AIME24-25 results are evaluated entirely out-of-distribution, using held-out problems not seen during calibration. As shown in Table [1](https://arxiv.org/html/2605.06785#S5.T1), across PRMs, OT calibration reduces calibration error except in cases where the base PRM is already well calibrated, and is overall the best variant across LLMs, with particularly dramatic improvements for the weakest models, e.g., Llama-3.2-1B on AIME24-25. Calibration tables for ReasonEval-7B and Math-Shepherd-Mistral-7B can be found in Appendix [A.3](https://arxiv.org/html/2605.06785#A1.SS3). Further details about training and evaluation data for all experiments can be found in Appendix [A.1](https://arxiv.org/html/2605.06785#A1.SS1).
Table 1: Calibration metrics (Brier, PosBrier, ECE, WQL; all lower is better) for Base, OT, and QR across six generator models paired with the Qwen2.5-Math-PRM-7B scorer, on MATH500 (in-distribution) and AIME24-25 (out-of-distribution). Cells are colored per (Dataset, Model) group: pink = worst variant, white = mid, blue = best variant; bold indicates the best variant per (Model, metric) pair. †OT's WQL improvement is reported relative to QR, since Base produces no quantile predictions.
### 5.3 Instance-Adaptive BoN
We compare instance-adaptive Best-of-N (BoN) sampling for the uncalibrated base PRM and the conditional OT calibrated method. In instance-adaptive sampling, each question $i$ is assigned a sample budget $N_{i}\in\{1,\ldots,N_{\max}\}$ based on a target success probability $C\in(0,1)$ and a per-question success probability estimate. The final prediction is the candidate with the highest *uncalibrated* PRM score among the $N_{i}$ generated samples. We sweep $C$ over 10 levels from 0.5 to 0.999, set $N_{\max}$ to 64, and report accuracy averaged over 100 Monte Carlo trials and all questions. Figure [4](https://arxiv.org/html/2605.06785#S5.F4) plots accuracy against average normalized cost $\bar{N}/N_{\max}$.
Figure 4: Best-of-N with IAS (QwenPRM-7B scorer). Each panel shows accuracy (y-axis) against normalized sampling budget $\bar{N}/N_{\max}$ (x-axis) for six generator models on MATH500 (left) and AIME24-25 (right), using the Qwen2.5-Math-PRM-7B scorer. Under Base IAS, the budget is confined to a narrow range and accuracy is flat across models. Under OT IAS, the budget spans a larger range and traces a smooth, monotonically increasing cost-accuracy frontier.

Figure 5: Calibrated BoN via $\beta$-threshold selection (QwenPRM-7B scorer). Each curve traces the cost-accuracy Pareto frontier obtained by sweeping the $\beta$ stopping threshold (11 values) at fixed confidence level $C=0.9$. For each question, the OT predictive distribution over success probability determines the per-question sample budget $N$; $\beta$ controls how aggressively early stopping is applied. Lower $\beta$ halts sampling sooner (small $\bar{N}/N_{\max}$, lower accuracy), while higher $\beta$ requires a stronger PRM signal before stopping (large $\bar{N}/N_{\max}$, higher accuracy).

For the uncalibrated base PRM, we produce a scalar estimate of the per-question success probability $p_{i}$ and assume independent Bernoulli trials. The allocation is
$$N_{i}=\min\bigl\{N\in\{1,\ldots,N_{\max}\}:1-(1-r_{i})^{N}\geq C\bigr\},\tag{4}$$
with $N_{i}=N_{\max}$ if no such $N$ exists, where $r_{i}$ is the uncalibrated PRM score for the question.
For OT, rather than reducing the calibrated posterior to a point estimate, we retain an $M$-point discrete approximation via a fixed grid of quantile levels $\{\beta_{m}\}_{m=1}^{M}$, computing $\hat{p}_{i}^{(m)}=Q_{\theta}(\beta_{m}\mid h_{i})$ for each $m$. The allocation is then:
$$N_{i}=\min\!\left\{N\in\{1,\ldots,N_{\max}\}:\frac{1}{M}\sum_{m=1}^{M}\Bigl[1-\bigl(1-\hat{p}_{i}^{(m)}\bigr)^{N}\Bigr]\geq C\right\},\tag{5}$$
with $N_{i}=N_{\max}$ if no such $N$ exists. The average $\frac{1}{M}\sum_{m=1}^{M}\bigl[1-(1-\hat{p}_{i}^{(m)})^{N}\bigr]$ is a Monte Carlo estimate of $\mathbb{E}_{p\sim\Pi_{i}}[1-(1-p)^{N}]$, the expected probability of at least one success in $N$ trials under the calibrated posterior $\Pi_{i}$ over $p_{i}$. Compared with point-estimate methods, this formulation explicitly propagates uncertainty in $p_{i}$ into the allocation: the budget increases for questions where $p_{i}$ is uncertain and decreases when it is confidently high.
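A minimal sketch of the two allocation rules in Eqs. (4) and (5); the names and the toy example are ours:

```python
import numpy as np

def allocate_base(r_i: float, C: float, n_max: int = 64) -> int:
    """Eq. (4): smallest N with 1 - (1 - r_i)^N >= C, capped at n_max."""
    for n in range(1, n_max + 1):
        if 1.0 - (1.0 - r_i) ** n >= C:
            return n
    return n_max

def allocate_ot(quantile_preds: np.ndarray, C: float, n_max: int = 64) -> int:
    """Eq. (5): average the at-least-one-success probability over the M quantile
    predictions of the calibrated posterior, then threshold at C."""
    for n in range(1, n_max + 1):
        if np.mean(1.0 - (1.0 - quantile_preds) ** n) >= C:
            return n
    return n_max

# A wide posterior receives more budget than a confident point estimate:
# prints "4 12" for these toy inputs.
print(allocate_base(0.5, C=0.9), allocate_ot(np.array([0.1, 0.5, 0.9]), C=0.9))
```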
Figure [4](https://arxiv.org/html/2605.06785#S5.F4) compares Best-of-N with IAS selection under two probability estimators: uncalibrated Base scores and OT-calibrated posteriors. For the uncalibrated Base PRM, the normalized budget $\bar{N}/N_{\max}$ is virtually constant across all quantile levels, and accuracy is flat. The uncalibrated PRM scores concentrate nearly all probability mass in the same narrow range, providing no useful signal for adaptive allocation. For OT IAS, the budget spans a larger portion of the full possible range $[0,1]$ and every model traces a smooth, monotonically increasing cost-accuracy frontier. Stronger models again benefit most at low budget, as R1-Qwen-7B and Qwen-2.5-7B reach near-ceiling accuracy on MATH500 by $\bar{N}/N_{\max}\approx 0.1$. Additional graphs for ReasonEval-7B and Math-Shepherd-Mistral-7B can be found in Appendix [A.4](https://arxiv.org/html/2605.06785#A1.SS4). We focus on OT and the uncalibrated base PRM as the primary comparison; results for the QR baseline under IAS are included in Appendix [A.4](https://arxiv.org/html/2605.06785#A1.SS4) for completeness.
Figure [5](https://arxiv.org/html/2605.06785#S5.F5) shows the cost-accuracy tradeoff curves produced by sweeping the $\beta$ stopping threshold under OT calibration at $C=0.9$. The curves are monotonically increasing in budget, confirming that the OT predictive distribution provides a consistent selection signal: allocating more samples to questions where the model is estimated to be uncertain yields measurable accuracy gains. By contrast, Llama-3.2-1B collapses to near-zero accuracy on AIME24-25 regardless of $\beta$, indicating a hard model capability floor that additional sampling cannot overcome. Its $\beta$-sweep curves flatten immediately, confirming that the method correctly identifies these problems as unsolvable under the given model and does not wastefully over-sample them. Taken together, these results show that $\beta$-threshold selection under OT calibration smoothly interpolates between aggressive early stopping and near-full-budget evaluation, providing a practical knob for compute-accuracy tradeoffs without retraining.
## 6 Conclusion
We proposed a PRM calibration method based on conditional optimal transport, adapting the CondOT framework (Bunne et al., [2022](https://arxiv.org/html/2605.06785#bib.bib2)) to learn a full monotonic conditional quantile function over success probabilities from PRM hidden states. Unlike quantile regression, our method produces structurally valid, crossing-free quantile estimates at arbitrary confidence levels without retraining, integrates naturally into the instance-adaptive scaling (IAS) framework of Park et al. ([2025](https://arxiv.org/html/2605.06785#bib.bib1)), and improves Best-of-N performance. Empirically, our conditional OT method generally improves calibration over both uncalibrated PRMs and quantile regression baselines on MATH500 and on harder out-of-distribution problems (AIME). Together, these results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation that complement existing inference-time scaling methods.
Limitations: Calibration quality inherits the ranking quality of the base PRM. When the underlying model produces unreliable signals, OT calibration cannot recover meaningful uncertainty estimates, and downstream performance degrades. OT performance is also sensitive to distribution shift between training and evaluation problems. Training targets are estimated from a finite number of rollouts, introducing noise that can degrade calibration at the tails, where success rates are near zero. Our method operates as an offline calibration layer and does not update the PRM itself; a promising direction for future work is to incorporate calibration objectives directly into PRM fine-tuning. Due to compute constraints, we evaluate only Best-of-N sampling, although other inference-time scaling methods such as beam search are compatible. Finally, our evaluation is limited to mathematical reasoning benchmarks; whether these results generalize to other domains or other models remains an open question.
## 7 Impact Statement
This work improves the reliability of inference-time scaling for LLMs by producing better-calibrated uncertainty estimates from PRMs. More accurate uncertainty quantification could reduce wasted compute in reasoning pipelines and, through explicit uncertainty estimation, enable more trustworthy and safer deployment of LLMs in high-stakes settings, where overconfident models can silently commit to wrong solutions. On the risk side, more efficient inference-time scaling could accelerate the deployment of powerful reasoning systems in contexts where additional human oversight would be prudent.
## Acknowledgments
We thank Young\-Jin Park for answering questions about his prior work, and Kaveh Alim, Hao Wang, and Navid Azizan for helpful conversations\. This work was supported in part by the MIT\-IBM Watson AI Lab\. Rachel also thanks Dylan and Kristjan for finding "quantity uncertification" funny, an accidental name that became an inside joke throughout the project\.
## References
- [1] M. Arjovsky, S. Chintala, and L. Bottou (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
- [2] B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini (2024). Large language monkeys: scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.
- [3] C. Bunne, A. Krause, and M. Cuturi (2022). Supervised training of conditional Monge maps. Advances in Neural Information Processing Systems 35, pp. 6859–6872.
- [4] Y. Chow, G. Tennenholtz, I. Gur, V. Zhuang, B. Dai, S. Thiagarajan, C. Boutilier, R. Agarwal, A. Kumar, and A. Faust (2024). Inference-aware fine-tuning for best-of-n sampling in large language models. arXiv preprint arXiv:2412.15287.
- [5] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [6] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy (2017). Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems 30.
- [7] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(9), pp. 1853–1865.
- [8] K. Cox, J. Xu, Y. Han, R. Xu, T. Li, C. Hsu, T. Chen, W. Gerych, and Y. Ding (2025). Mapping from meaning: addressing the miscalibration of prompt-sensitive language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 23696–23703.
- [9] M. Damani, I. Puri, S. Slocum, I. Shenfeld, L. Choshen, Y. Kim, and J. Andreas (2025). Beyond binary rewards: training LMs to reason about their uncertainty. arXiv preprint arXiv:2507.16806.
- [10] W. B. Glenn et al. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78(1), pp. 1–3.
- [11] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017). On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
- [12] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [13] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- [14] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [15] D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [16] R. Koenker and G. Bassett Jr (1978). Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50.
- [17] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let's verify step by step. In The Twelfth International Conference on Learning Representations.
- [18] H. Liu, Z. Dou, Y. Wang, N. Peng, and Y. Yue (2024). Uncertainty calibration for tool-using language agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 16781–16805.
- [19] M. P. Naeini, G. Cooper, and M. Hauskrecht (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.
- [20] Y. Park, K. Greenewald, K. Alim, H. Wang, and N. Azizan (2025). Know what you don't know: uncertainty calibration of process reward models. arXiv preprint arXiv:2506.09338.
- [21] G. Peyré and M. Cuturi (2019). Computational optimal transport: with applications to data science. Now Foundations and Trends.
- [22] C. Rodriguez-Pardo, L. Chiani, E. Borgonovo, and M. Tavoni (2025). Neural conditional transport maps. arXiv preprint arXiv:2505.15808.
- [23] C. Snell, J. Lee, K. Xu, and A. Kumar (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
- [24] M. Song, Z. Su, X. Qu, J. Zhou, and Y. Cheng (2025). PRMBench: a fine-grained and challenging benchmark for process-level reward models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 25299–25346.
- [25] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023). Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- [26] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins (2022). Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
- [27] C. Villani et al. (2009). Optimal transport: old and new. Vol. 338, Springer.
- [28] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024). Math-Shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439.
- [29] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- [30] Z. O. Wang, R. Baptista, Y. Marzouk, L. Ruthotto, and D. Verma (2025). Efficient neural network approaches for conditional optimal transport with applications in Bayesian inference. SIAM Journal on Scientific Computing 47(4), pp. C979–C1005.
- [31] S. Xia, X. Li, Y. Liu, T. Wu, and P. Liu (2025). Evaluating mathematical reasoning beyond accuracy. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 27723–27730.
- [32] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [33] F. Ye, M. Yang, J. Pang, L. Wang, D. F. Wong, E. Yilmaz, S. Shi, and Z. Tu (2024). Benchmarking LLMs via uncertainty quantification. Advances in Neural Information Processing Systems 37, pp. 15356–15385.
- [34] Y. Zhang and T. Math-AI (2024). American Invitational Mathematics Examination (AIME) 2024.
- [35] Z. Zhang, C. Zheng, Y. Wu, B. Zhang, R. Lin, B. Yu, D. Liu, J. Zhou, and J. Lin (2025). The lessons of developing process reward models in mathematical reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 10495–10516.
## Appendix A Technical appendices and supplementary material
### A.1 Training Hyperparameters and Details
The training and validation datasets for the conditional optimal transport method and the quantile regression baseline are constructed from a subset of 500 problems from the MATH500 train split [[13](https://arxiv.org/html/2605.06785#bib.bib19)], generated following the instructions of [[20](https://arxiv.org/html/2605.06785#bib.bib1)], with the addition of the corresponding PRM hidden states for all six models and three PRMs ($N_{\max}=64$ with 8 generations at each stage). We split this set into 80% training and 20% test at the *question* level using GroupShuffleSplit (random_seed=42 for reproducibility), ensuring that all samples associated with a given question appear exclusively in one split and preventing question-level leakage between train and test. Inputs are precomputed embeddings loaded from disk rather than raw text. The training loader shuffles data randomly each epoch.
For Best-of-N IAS experiments, the test data for MATH500 is a subset of problems taken from the test split [[13](https://arxiv.org/html/2605.06785#bib.bib19)]. For out-of-distribution evaluation, we use AIME 2024 (30 problems) at [https://huggingface.co/datasets/HuggingFaceH4/aime_2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) and both parts of AIME 2025 (30 problems each, 60 total) at [https://huggingface.co/datasets/opencompass/AIME2025](https://huggingface.co/datasets/opencompass/AIME2025), which we concatenate into a single combined AIME benchmark of 90 problems to report aggregate competition-level performance.
The calibration experiments are evaluated on all prefixes, generations, and questions from the calibration dataset created by [[20](https://arxiv.org/html/2605.06785#bib.bib1)] at [https://huggingface.co/datasets/young-j-park/prm_calibration](https://huggingface.co/datasets/young-j-park/prm_calibration). This includes both the questions we use for training and validation of our QR baseline and OT method and the out-of-distribution unseen questions from AIME24-25, since calibration is assessed both as a property of the learned scoring function and as a measure of its generalization to unseen data (AIME24-25).
The OT calibration map is parameterized by two partially input-convex neural networks (PICNNs), $f$ and $g$. Both networks are trained with the Adam optimizer [[15](https://arxiv.org/html/2605.06785#bib.bib37)]. $f$ and $g$ use learning rates that vary by model, as well as different hidden-state layers for the PICNNs and for the MLP reduction of the PRM hidden states. Learning rates are decayed via step schedulers. $g$ is updated at every minibatch step while $f$ is updated once every $x$ steps, following the standard alternating schedule for Kantorovich-dual training. Gradients are clipped to unit norm (max_norm=1.0) for both networks. Early stopping is applied with a patience of 175 evaluation steps and a minimum improvement threshold $\delta=10^{-4}$; the checkpoint achieving the lowest calibration area score is selected.
We provide descriptions of all the best parameters (learning rate, step scheduler and decay, the rate at which $g$ is updated, hidden-state layer information, and MLP parameters) in shell scripts in the code provided as supplementary material.
A single NVIDIA H100 80GB SXM5 or NVIDIA A100 80GB SXM4 GPU, with a single worker, 4 CPUs per task, and a default memory of 16G, was requested for each (model, PRM) training run.
### A.2 Flexibility of Quantile Estimates for BoN+IAS
Figure 6: Accuracy as a function of the quantile level $\tau$ used in the IAS stopping criterion, at fixed threshold $C=0.99$, for MATH500 (left) and AIME24-25 (right). At each $\tau$, the IAS rule stops sampling when the $\tau$-th quantile of the calibrated posterior over $p$ exceeds $C$; lower $\tau$ is more conservative (requires even the pessimistic tail to be high) and higher $\tau$ is more lenient. OT (blue) sweeps $\tau$ over 100 continuous levels; QR (orange) is restricted to 11 fixed quantiles. OT maintains higher accuracy across all levels and degrades more gracefully at large $\tau$, while QR drops sharply on AIME at $\tau>0.8$, reflecting miscalibration at the tails of the QR quantile grid.
### A.3 Calibration Metrics for ReasonEval and Shepherd
Table 2: Calibration metrics for ReasonEval-7B. Colored per (Dataset, Model): pink = worst variant, white = mid, blue = best variant.

Table 3: Calibration metrics for Shepherd-7B. Colored per (Dataset, Model): pink = worst variant, white = mid, blue = best variant.
### A.4 Additional BoN+IAS Results
Figure 7: Best-of-N with IAS Selection: Base (all PRMs, all models). Each panel shows accuracy (y-axis) against normalized sampling budget $\bar{N}/N_{\max}$ (x-axis, the mean number of candidates drawn per question relative to the maximum) for six generator models on MATH500 (left) and AIME24-25 (right), with rows corresponding to three PRMs: QwenPRM-7B (top), ReasonEval-7B (middle), and Shepherd-7B (bottom). The Base method uses raw PRM scores to rank and select candidates via IAS; because these scores are uncalibrated, the effective budget range is extremely narrow and accuracy curves are essentially flat across all models and PRMs.

Figure 8: Best-of-N with IAS Selection: OT (all PRMs, all models). Each panel shows accuracy (y-axis) against normalized sampling budget $\bar{N}/N_{\max}$ (x-axis) for six generator models on MATH500 (left) and AIME24-25 (right), across three PRMs (rows: QwenPRM-7B, ReasonEval-7B, Shepherd-7B). IAS selects candidates whose OT-predicted success probability exceeds a threshold; sweeping this threshold traces the full budget range $[0,1]$. OT produces a smooth, monotonically increasing accuracy-budget frontier for virtually all model-PRM combinations: accuracy rises steadily as more budget is allocated, reaching up to $\approx$0.9 for the strongest models (R1-Qwen-7B, R1-Llama-8B) on MATH500 with QwenPRM-7B. On AIME24-25, absolute accuracies are lower (up to $\approx$0.25) but the upward trend is preserved. Across PRMs, QwenPRM-7B yields the cleanest, best-separated frontiers.

Figure 9: Best-of-N with IAS Selection: QR at $\beta=0.1$ (all PRMs, all models). Each panel shows accuracy (y-axis) against normalized sampling budget $\bar{N}/N_{\max}$ (x-axis) for six generator models on MATH500 (left) and AIME24-25 (right), across three PRMs (rows: QwenPRM-7B, ReasonEval-7B, Shepherd-7B).