Process Rewards with Learned Reliability
Summary
BetaPRM is a process reward model that predicts both a step-level success probability and the reliability of that prediction using a Beta belief from Monte Carlo continuations, enabling adaptive computation allocation that reduces token usage by up to 33.57% while improving accuracy.
# Process Rewards with Learned Reliability
Source: [https://arxiv.org/html/2605.15529](https://arxiv.org/html/2605.15529)
Jinyuan Li¹, Langlin Huang¹, Chengsong Huang¹, Shaoyang Xu², Donghong Cai¹, Yuyi Yang¹, Wenxuan Zhang², Jiaxin Huang¹

¹Washington University in St. Louis, ²Singapore University of Technology and Design

{ljinyuan,jiaxinh}@wustl.edu
###### Abstract
Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-$N$ reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-$N$ selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy–token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
## 1 Introduction
Process Reward Models \(PRMs\)\[[14](https://arxiv.org/html/2605.15529#bib.bib14),[19](https://arxiv.org/html/2605.15529#bib.bib19),[31](https://arxiv.org/html/2605.15529#bib.bib31),[41](https://arxiv.org/html/2605.15529#bib.bib41),[43](https://arxiv.org/html/2605.15529#bib.bib43),[59](https://arxiv.org/html/2605.15529#bib.bib59),[61](https://arxiv.org/html/2605.15529#bib.bib61),[72](https://arxiv.org/html/2605.15529#bib.bib72),[73](https://arxiv.org/html/2605.15529#bib.bib73)\]provide step\-level feedback for reasoning by scoring the intermediate steps of a solution\. Because these step\-level scores can guide candidate selection\[[5](https://arxiv.org/html/2605.15529#bib.bib5),[21](https://arxiv.org/html/2605.15529#bib.bib21),[35](https://arxiv.org/html/2605.15529#bib.bib35)\]and policy optimization\[[12](https://arxiv.org/html/2605.15529#bib.bib12),[36](https://arxiv.org/html/2605.15529#bib.bib36)\], PRMs have become a useful interface for both test\-time scaling\[[2](https://arxiv.org/html/2605.15529#bib.bib2),[10](https://arxiv.org/html/2605.15529#bib.bib10),[30](https://arxiv.org/html/2605.15529#bib.bib30)\]and reinforcement learning\[[34](https://arxiv.org/html/2605.15529#bib.bib34),[70](https://arxiv.org/html/2605.15529#bib.bib70)\]\. However, existing PRMs typically expose this interface as a single point estimate of step correctness, such as the probability that a step is correct\. Downstream methods\[[17](https://arxiv.org/html/2605.15529#bib.bib17),[42](https://arxiv.org/html/2605.15529#bib.bib42),[71](https://arxiv.org/html/2605.15529#bib.bib71)\]often have to treat this imperfect score as a reliable decision signal, because no additional signal is available\. A single PRM score tells us which step or candidate the model prefers, but not whether that preference should be trusted\. As a result, an unreliable score can directly affect downstream decisions without being identified as uncertain\.
As shown in Fig\.[1](https://arxiv.org/html/2605.15529#S1.F1), this classic interface mismatches both test\-time usage and training supervision:
First, a single scalar reward cannot capture the predictive uncertainty of intermediate steps\. At inference time, a causal PRM judges a step from the problem and current prefix, without seeing future continuations\[[54](https://arxiv.org/html/2605.15529#bib.bib54),[57](https://arxiv.org/html/2605.15529#bib.bib57),[48](https://arxiv.org/html/2605.15529#bib.bib48)\]\. Even when no local error is obvious, it is uncertain whether a seemingly correct prefix will lead to a correct final answer\. A more natural PRM output should capture both the estimated probability of success and the uncertainty of that estimate\.
Figure 1: Motivation of BetaPRM. Repeated Monte Carlo continuations from the same prefix can produce different empirical success ratios. Standard PRMs treat these ratios as point targets, whereas BetaPRM models the prefix success probability as a Beta belief. The Beta mean $\mu$ gives the process reward, while the concentration $\kappa$ captures the reliability of the estimate, allowing the model to assign likelihood to the observed count $K$ out of $N$ rather than treating $K/N$ as an exact point label.

Second, step-level PRM labels are often noisy finite-sample estimates. A common source of supervision \[[61](https://arxiv.org/html/2605.15529#bib.bib61),[63](https://arxiv.org/html/2605.15529#bib.bib63),[65](https://arxiv.org/html/2605.15529#bib.bib65),[72](https://arxiv.org/html/2605.15529#bib.bib72)\] samples $N$ continuations from a reasoning prefix and counts how many reach the correct final answer. If $K$ continuations succeed, the empirical ratio $K/N$ is only a Monte Carlo estimate of the prefix success probability, not the true underlying probability. Repeating the procedure from the same prefix could yield a different $K$ due to sampling randomness. Standard PRM training \[[13](https://arxiv.org/html/2605.15529#bib.bib13),[14](https://arxiv.org/html/2605.15529#bib.bib14),[31](https://arxiv.org/html/2605.15529#bib.bib31)\] nevertheless regresses to this observed ratio as a point label, forcing the model to fit a noisy finite-sample outcome with a single scalar prediction. A better objective should keep the supervision in counting form: the model should assign high probability to observing $K$ successes out of $N$ continuations, rather than only regress to the single ratio $K/N$.
In this paper, we address both limitations by giving the PRM a way to express uncertainty about its own prediction. A step-level reward supported by a confident belief should not be treated the same as one produced under ambiguity. This motivates BetaPRM, a distributional PRM that predicts both how promising a reasoning prefix is and how reliable that prediction is. As illustrated in Figure [2](https://arxiv.org/html/2605.15529#S4.F2), BetaPRM predicts a Beta distribution over the prefix success probability, and is trained so that this distribution can explain the Monte Carlo observations from sampled continuations. This distribution is parameterized by (1) the predicted success probability $\mu$, which serves as the usual PRM score, and (2) the concentration $\kappa$, which controls how tightly the belief is centered around that prediction. High concentration gives a sharp belief, while low concentration gives a flattened belief that can explain a wider range of Monte Carlo observations.
The learned concentration changes how PRM scores can be used. Rather than treating every scalar reward as equally trustworthy, downstream algorithms can distinguish confident rewards from uncertain ones. This reliability signal is broadly useful for PRM-guided decision making; in this paper, we demonstrate one concrete test-time use case: Adaptive Computation Allocation (ACA) for Best-of-$N$ reasoning. Fixed-budget Best-of-$N$ \[[11](https://arxiv.org/html/2605.15529#bib.bib11),[33](https://arxiv.org/html/2605.15529#bib.bib33)\] spends the same rollout budget on every problem, even when the current pool already contains a high-scoring candidate whose PRM judgment is reliable. ACA spends the budget through progressive batches: it stops when the selected answer is reliably ahead, and otherwise continues from uncertain prefixes where more computation may change the decision.
Empirically, BetaPRM improves PRM-guided Best-of-$N$ selection across four backbones and four benchmarks (e.g., $+3.37$ points on average over standard PRM on InternVL2.5-8B), while preserving standard step-level error detection ability. Further analyses show that the learned concentration provides a nontrivial reliability signal. Built on this reliability signal, ACA improves the inference-time accuracy–token tradeoff compared with vanilla Best-of-16, reducing token usage by up to 33.57% while also improving final-answer accuracy.
## 2 Related Work
#### Process Reward Models\.
PRMs\[[13](https://arxiv.org/html/2605.15529#bib.bib13),[56](https://arxiv.org/html/2605.15529#bib.bib56),[47](https://arxiv.org/html/2605.15529#bib.bib47),[29](https://arxiv.org/html/2605.15529#bib.bib29)\]provide step\-level feedback for reasoning, unlike outcome reward models\[[11](https://arxiv.org/html/2605.15529#bib.bib11),[67](https://arxiv.org/html/2605.15529#bib.bib67)\]that score only final answers\. Prior work trains PRMs either as step judges for local error detection\[[63](https://arxiv.org/html/2605.15529#bib.bib63),[14](https://arxiv.org/html/2605.15529#bib.bib14)\], or as Q\-value\-style models that estimate whether a prefix can be completed correctly\[[13](https://arxiv.org/html/2605.15529#bib.bib13),[31](https://arxiv.org/html/2605.15529#bib.bib31)\]\. We focus on a limitation of the latter view: Monte Carlo continuations provide finite\-sample evidence about prefix success, yet existing methods often collapse this evidence into a single point label\. Our approach instead makes reliability part of the PRM output, so downstream methods can use not only the predicted reward but also how trustworthy it is\.
#### Test\-Time Scaling\.
Test\-time scaling\[[22](https://arxiv.org/html/2605.15529#bib.bib22),[53](https://arxiv.org/html/2605.15529#bib.bib53),[3](https://arxiv.org/html/2605.15529#bib.bib3),[58](https://arxiv.org/html/2605.15529#bib.bib58),[66](https://arxiv.org/html/2605.15529#bib.bib66)\]improves reasoning by spending more inference compute, including voting\[[64](https://arxiv.org/html/2605.15529#bib.bib64)\], verifier\-guided selection\[[74](https://arxiv.org/html/2605.15529#bib.bib74)\], and search over reasoning paths\[[17](https://arxiv.org/html/2605.15529#bib.bib17)\]\. A common and simple instance is Best\-of\-NN\[[33](https://arxiv.org/html/2605.15529#bib.bib33)\]: sample multiple candidate solutions and select one using a verifier or reward model\. Most Best\-of\-NNmethods use a fixed budget\[[3](https://arxiv.org/html/2605.15529#bib.bib3)\], allocating the same number of samples to every problem despite large variation in difficulty\. Recent methods\[[49](https://arxiv.org/html/2605.15529#bib.bib49)\]calibrate PRM success estimates to choose instance\-specific budgets for sampling complete solutions\. In contrast, our method usesBetaPRM’s reward and learned reliability during generation to decide when to stop and which uncertain prefix to continue\.
## 3 Preliminaries
### 3.1 Prefix-Conditioned Process Rewards
Given an input problem $x$, let $s_{1:T}=(s_1,\ldots,s_T)$ denote a step-by-step solution. We insert a special process marker `<prm>` after each step, and the PRM produces a score at each marker position:

$$x,\; s_1,\; \texttt{<prm>},\; s_2,\; \texttt{<prm>},\; \ldots,\; s_T,\; \texttt{<prm>}.$$

Since the reward model is a causal language model, the score at the $t$-th marker is computed from the prefix $c_t=(x,s_{\leq t})$, without access to future steps $s_{t+1:T}$. This matches the online use of PRMs in generation or search, where a partial reasoning state is evaluated before its continuation is observed.

We therefore interpret process rewards as prefix-level quantities. Instead of assigning an isolated correctness label to step $t$, we define its quality as the prefix success probability $q_t=\Pr(\text{final answer is correct}\mid x,s_{\leq t})$. Since $q_t$ is a latent variable, the next subsection describes how finite continuation samples provide supervision to learn this variable.
### 3.2 Monte Carlo Step Supervision
The prefix success probability $q_t$ is an unobserved latent variable. A widely used way to construct step-level supervision is to sample $N$ continuations from a prefix $c_t=(x,s_{\leq t})$ and count how many reach the correct final answer. Let $K_t$ denote the number of successful continuations. The empirical ratio $\hat{q}_t=K_t/N$ is a Monte Carlo estimate of $q_t$.

Standard PRM objectives \[[13](https://arxiv.org/html/2605.15529#bib.bib13),[31](https://arxiv.org/html/2605.15529#bib.bib31),[61](https://arxiv.org/html/2605.15529#bib.bib61),[62](https://arxiv.org/html/2605.15529#bib.bib62),[65](https://arxiv.org/html/2605.15529#bib.bib65),[72](https://arxiv.org/html/2605.15529#bib.bib72)\] often reduce this observation to a single point target by optimizing cross-entropy against $\hat{q}_t$:

$$\mathcal{L}_{\mathrm{CE}}=-\hat{q}_t\log p_t-(1-\hat{q}_t)\log(1-p_t),$$

where $p_t$ is the predicted step score. This treats the empirical ratio as if it were the latent prefix success probability itself. Because $\hat{q}_t$ is computed from a small number of continuations, repeating the same procedure could produce a different $K_t$. Thus, forcing the model to learn the single point estimate $\hat{q}_t$ might lead to overfitting to sample noise. Instead, it is more natural to treat the supervision as a count observation ($K_t$ successes out of $N$ trials).
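To make the size of this sampling noise concrete, the following minimal sketch assumes the Binomial view of the continuations, under which the standard deviation of $K_t/N$ is $\sqrt{q_t(1-q_t)/N}$. The function names are hypothetical and the simulation is only an illustration, not part of the training pipeline.

```python
import math
import random


def ratio_std(q: float, n: int) -> float:
    """Standard deviation of the empirical ratio K/N when K ~ Binomial(n, q)."""
    return math.sqrt(q * (1.0 - q) / n)


def sample_ratio(q: float, n: int, rng: random.Random) -> float:
    """Simulate n independent continuations that each succeed with probability q,
    and return the resulting Monte Carlo label K/N."""
    k = sum(1 for _ in range(n) if rng.random() < q)
    return k / n


# With a true success probability of 0.5 and N = 16 continuations,
# the label K/N has a standard deviation of 0.125.
noise = ratio_std(0.5, 16)

# Repeating the procedure from the same prefix generally yields different labels.
rng = random.Random(0)
labels = [sample_ratio(0.5, 16, rng) for _ in range(5)]
```

With $N=16$, a prefix whose true success probability is $0.5$ can plausibly receive labels anywhere between roughly $0.25$ and $0.75$ (two standard deviations), which is the variability a point-target loss is asked to fit.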
## 4 BetaPRM
Figure 2: Intuition of Beta-Binomial supervision. A predicted Beta belief over prefix success induces a distribution over possible observed success ratios $K/N$. The green curve is concentrated and aligned with the observed count, the orange curve is concentrated but misaligned and thus penalized, and the gray curve has lower concentration and allows a wider range of finite-sample observations.

### 4.1 Beta-Binomial Count Model
To formalize the count-based supervision, we assume a binomial generative process for the successful continuations: $K_t\mid q_t\sim\mathrm{Binomial}(N,q_t)$. Because $q_t$ is an unknown latent success probability in $[0,1]$, we model it with a Beta belief, $q_t\sim\mathrm{Beta}(\alpha_t,\beta_t)$, which naturally pairs with the Binomial count observation above. For better interpretability, we reparameterize the Beta distribution by its mean $\mu_t=\alpha_t/(\alpha_t+\beta_t)$ and concentration $\kappa_t=\alpha_t+\beta_t$. Under this formulation, $\mu_t$ acts as the expected success probability (the standard PRM output score), while $\kappa_t$ controls how sharply the belief is concentrated around that mean. Marginalizing out the latent $q_t$ yields a Beta-Binomial distribution over $K_t$, providing a likelihood for count observations rather than a point target for $\hat{q}_t$.
### 4.2 BetaPRM Parameterization
BetaPRM instantiates the Beta belief by predicting its mean and concentration at each process marker. At the $t$-th `<prm>` marker, the language model produces a hidden state $h_t$ and vocabulary logits $z_t$. Let $z_t^{\mathrm{Yes}}$ and $z_t^{\mathrm{No}}$ denote the logits of the two reward tokens Yes and No. We define the predicted success probability by applying a softmax only over these two logits:

$$\mu_t=\frac{\exp(z_t^{\mathrm{Yes}})}{\exp(z_t^{\mathrm{Yes}})+\exp(z_t^{\mathrm{No}})}.$$

This preserves the standard PRM interpretation of the Yes probability as the scalar reward.

To estimate reliability, BetaPRM predicts a separate concentration parameter $\kappa_t$:

$$\kappa_t=\mathrm{softplus}(g_\phi(h_t))+\kappa_{\mathrm{min}},$$

where $g_\phi$ is a lightweight linear head and $\kappa_{\mathrm{min}}$ is a small fixed lower bound for numerical stability. This separates the reward from the reliability channel: the reward-token logits determine $\mu_t$, while the additional head determines how concentrated the model's belief should be.

The Beta parameters are then derived using $\alpha_t=\mu_t\kappa_t$ and $\beta_t=(1-\mu_t)\kappa_t$. Here $\mu_t$ centers the belief over prefix success and serves as the scalar PRM score, while $\kappa_t$ controls the concentration, allowing prefixes with similar scores to carry different reliability estimates.
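The mapping from model outputs to Beta parameters can be sketched in a few lines. This is an illustration under stated assumptions, not the released implementation: the scalar `g` stands for the output of the linear head $g_\phi(h_t)$, the function names are hypothetical, and `KAPPA_MIN = 1e-3` is an assumed value, since the text only specifies a small fixed lower bound.

```python
import math

KAPPA_MIN = 1e-3  # assumed value; the paper only says "a small fixed lower bound"


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def two_way_softmax(z_yes: float, z_no: float) -> float:
    """Softmax restricted to the Yes/No logits; returns the Yes probability mu_t."""
    m = max(z_yes, z_no)  # subtract the max logit to avoid overflow
    e_yes = math.exp(z_yes - m)
    e_no = math.exp(z_no - m)
    return e_yes / (e_yes + e_no)


def beta_params(z_yes: float, z_no: float, g: float):
    """Return (mu, kappa, alpha, beta) for one process marker.

    g is the scalar output of the concentration head g_phi(h_t).
    """
    mu = two_way_softmax(z_yes, z_no)
    kappa = softplus(g) + KAPPA_MIN
    alpha = mu * kappa
    beta = (1.0 - mu) * kappa
    return mu, kappa, alpha, beta


# Two markers with the same reward but different reliability:
confident = beta_params(2.0, 0.0, g=5.0)
uncertain = beta_params(2.0, 0.0, g=-5.0)
```

Both calls return the same $\mu_t$ because the reward logits are identical, while the concentration differs because only the head output `g` changed. This is the separation between the reward and reliability channels described above.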
### 4.3 Beta-Binomial Training Objective
We train the predicted Beta belief by maximizing the likelihood of the observed count $K_t$. As shown in Figure [2](https://arxiv.org/html/2605.15529#S4.F2), a concentrated belief centered near the observed ratio assigns high probability to the count, while a concentrated but misaligned belief receives a large loss. A lower-concentration belief spreads probability mass over a wider range of possible finite-sample observations, reflecting lower confidence.

Using the Beta-Binomial formulation, the predictive probability of the observed count is

$$p(K_t\mid N,\alpha_t,\beta_t)=\binom{N}{K_t}\frac{B(K_t+\alpha_t,\,N-K_t+\beta_t)}{B(\alpha_t,\beta_t)},$$

where $B(\cdot,\cdot)$ is the Beta function. Let $\mathcal{P}$ be the set of supervised process markers in a mini-batch. We define the Beta-Binomial loss, $\mathcal{L}_{\mathrm{Beta\text{-}Binomial}}$, as the negative log-likelihood of the observed counts:

$$\mathcal{L}_{\mathrm{Beta\text{-}Binomial}}=-\frac{1}{|\mathcal{P}|}\sum_{t\in\mathcal{P}}\log p(K_t\mid N,\alpha_t,\beta_t).$$

Minimizing this loss encourages the model to assign high probability to the observed count.
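The likelihood above can be evaluated stably in log space through the log-Gamma function, using $\log B(a,b)=\log\Gamma(a)+\log\Gamma(b)-\log\Gamma(a+b)$. The sketch below is a plain-Python illustration with hypothetical function names; an actual training loop would compute the same quantity on tensors inside an autodiff framework.

```python
import math


def log_beta(a: float, b: float) -> float:
    """log B(a, b) computed via log-Gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k: int, n: int, alpha: float, beta: float) -> float:
    """log p(K = k | N = n, alpha, beta) under the Beta-Binomial distribution."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def beta_binomial_nll(counts, n, mus, kappas) -> float:
    """Mean negative log-likelihood over a batch of supervised markers.

    Each mu must lie strictly inside (0, 1) so that alpha and beta are positive.
    """
    total = 0.0
    for k, mu, kappa in zip(counts, mus, kappas):
        total += beta_binomial_logpmf(k, n, mu * kappa, (1.0 - mu) * kappa)
    return -total / len(counts)


# Observed count K = 8 out of N = 16, scored under two concentrated beliefs.
aligned = beta_binomial_nll([8], 16, mus=[0.5], kappas=[50.0])
misaligned = beta_binomial_nll([8], 16, mus=[0.9], kappas=[50.0])
```

Here `aligned` is smaller than `misaligned`, mirroring the green and orange curves in Figure 2: a concentrated belief is rewarded only when its mean agrees with the count evidence.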
We add an auxiliary regularization loss to explicitly encourage calibrated reliability estimates. If $\mu_t$ disagrees with the observed ratio $K_t/N$, a large $\kappa_t$, which signals high confidence, is unwarranted. We therefore penalize the product of disagreement and concentration:

$$\mathcal{L}_{\mathrm{reg}}=\lambda_{\mathrm{reg}}\frac{1}{|\mathcal{P}|}\sum_{t\in\mathcal{P}}\left|\mathrm{sg}(\mu_t)-\frac{K_t}{N}\right|\kappa_t,$$

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation. The stop-gradient operation prevents this auxiliary term from pulling $\mu_t$ toward the noisy ratio, which would make it another point-label regression loss. Instead, it mainly calibrates the concentration parameter: high $\kappa_t$ is discouraged when $\mu_t$ disagrees with the count evidence, and encouraged when they are consistent.

The overall training objective is

$$\mathcal{L}=\mathcal{L}_{\mathrm{Beta\text{-}Binomial}}+\mathcal{L}_{\mathrm{reg}}.$$
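As a rough illustration of the regularizer, the sketch below computes its forward value and the gradient it sends to each $\kappa_t$. The names are hypothetical and `lambda_reg=0.1` is an assumed value chosen only for the example. Plain Python has no autodiff, so the stop-gradient appears only as a comment: it leaves the forward value unchanged and affects only which parameters receive gradient. In an autodiff framework, $\mu_t$ would be wrapped in that framework's stop-gradient operation (for example `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX).

```python
def evidence_regularizer(counts, n, mus, kappas, lambda_reg):
    """Forward value of L_reg = lambda_reg * mean_t |sg(mu_t) - K_t / N| * kappa_t.

    mu_t is treated as a constant here, which is what the stop-gradient means.
    """
    terms = [abs(mu - k / n) * kappa for k, mu, kappa in zip(counts, mus, kappas)]
    return lambda_reg * sum(terms) / len(terms)


def regularizer_grad_kappa(counts, n, mus, lambda_reg):
    """Gradient of L_reg with respect to each kappa_t.

    Because of the stop-gradient, this is the only gradient the term produces:
    d L_reg / d kappa_t = lambda_reg * |mu_t - K_t / N| / |P|.
    """
    size = len(counts)
    return [lambda_reg * abs(mu - k / n) / size for k, mu in zip(counts, mus)]


# Marker 0 agrees with its count (8/16 = 0.5); marker 1 disagrees (4/16 vs 0.75).
counts, n = [8, 4], 16
mus, kappas = [0.5, 0.75], [10.0, 4.0]
loss = evidence_regularizer(counts, n, mus, kappas, lambda_reg=0.1)
grads = regularizer_grad_kappa(counts, n, mus, lambda_reg=0.1)
```

In this example the consistent marker contributes zero penalty and zero gradient regardless of its concentration, while the disagreeing marker receives a positive gradient on $\kappa_t$, so gradient descent lowers its concentration.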
## 5 Reliability-Aware Inference: Adaptive Computation Allocation
Figure 3: Overview of Adaptive Computation Allocation (ACA). ACA generates candidates in batches, uses BetaPRM scores and reliability estimates to test whether the current winner remains reliably ahead, and otherwise samples new continuations from uncertain prefixes.

BetaPRM outputs both a reward mean and a reliability estimate. As shown in Figure [3](https://arxiv.org/html/2605.15529#S5.F3), we study a straightforward inference-time use case: allocating computation in PRM-guided Best-of-$N$ reasoning. In standard practice \[[11](https://arxiv.org/html/2605.15529#bib.bib11),[33](https://arxiv.org/html/2605.15529#bib.bib33)\], Best-of-$N$ improves inference by sampling multiple candidate solutions and selecting one according to a scoring rule, which can be a process reward model, and every query receives the same number of sampled rollouts. We introduce Adaptive Computation Allocation (ACA), which saves computation when the current sampled pool may already contain a high-scoring answer. ACA uses BetaPRM to estimate uncertainty and relies on two mechanisms: (1) stopping early when a reliable answer is found, and (2) redirecting computation to uncertain prefixes.
#### Risk\-Adjusted Candidate Score\.
ACA compares complete candidates using both reward and reliability. We convert the Beta belief into a step-level uncertainty, $\sigma_t=\sqrt{\mu_t(1-\mu_t)/(\kappa_t+1)}$, the standard deviation of the predicted Beta distribution. Larger $\kappa_t$ gives smaller $\sigma_t$, indicating a more reliable reward estimate. We then define a risk-adjusted step score $r_t=\mu_t-\lambda\sigma_t$, where $\lambda$ controls the uncertainty penalty, and aggregate the step scores into a candidate-level score for $y=s_{1:T}$ as

$$S(y)=\frac{1}{T}\sum_{t=1}^{T}(\mu_t-\lambda\sigma_t).$$

Thus, candidates are ranked by predicted process quality discounted by uncertainty.
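The expression for $\sigma_t$ follows from the Beta variance $\alpha\beta/\big((\alpha+\beta)^2(\alpha+\beta+1)\big)$ after substituting $\alpha_t=\mu_t\kappa_t$ and $\beta_t=(1-\mu_t)\kappa_t$, which gives $\mu_t(1-\mu_t)/(\kappa_t+1)$. A minimal sketch of the candidate score, with hypothetical function names and an illustrative $\lambda$, is:

```python
import math


def step_sigma(mu: float, kappa: float) -> float:
    """Standard deviation of Beta(mu * kappa, (1 - mu) * kappa)."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


def risk_adjusted_score(mus, kappas, lam: float) -> float:
    """S(y): mean over steps of the risk-adjusted step score mu_t - lam * sigma_t."""
    step_scores = [mu - lam * step_sigma(mu, kappa) for mu, kappa in zip(mus, kappas)]
    return sum(step_scores) / len(step_scores)


# Two candidates with identical rewards but different reliability.
reliable = risk_adjusted_score([0.8, 0.8], [99.0, 99.0], lam=1.0)
unreliable = risk_adjusted_score([0.8, 0.8], [3.0, 3.0], lam=1.0)
```

The two candidates would tie under a mean-reward selector, but `reliable` scores higher here because its steps carry larger concentration and therefore a smaller uncertainty discount.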
#### Progressive Batch Generation and Early Stopping\.
Standard Best-of-$N$ generates all $N$ candidates in one shot. ACA instead spends the budget in a progressive way: it first samples a small pool of $n_0$ candidates, scores them with BetaPRM, and then either stops or allocates another batch, up to the maximum budget $N$.

At each stage, ACA selects the highest-scoring candidate $y^\star=\arg\max_y S(y)$ for the stopping test, where we construct lower and upper confidence bounds ($\mathrm{LCB}$ and $\mathrm{UCB}$):

$$\mathrm{LCB}(y)=S(y)-c_{\mathrm{stop}}U(y),\qquad \mathrm{UCB}(y)=S(y)+c_{\mathrm{stop}}U(y),\qquad U(y)=\frac{1}{T}\sum_{t=1}^{T}\sigma_t,$$

where $c_{\mathrm{stop}}$ scales the width of the confidence bounds. ACA terminates the allocation process for the current problem and returns $y^\star$ if

$$\mathrm{LCB}(y^\star)>\max_{y\neq y^\star}\mathrm{UCB}(y).$$

This criterion means that the highest-scoring candidate dominates the current pool: even its pessimistic score exceeds the optimistic score of every competitor. In this case, further expanding the pool with additional continuations is unlikely to change the PRM-guided selection.
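The stopping test can be sketched as follows. Each candidate is represented as a list of per-step `(mu, kappa)` pairs, which is a representation chosen for this example. The function names are hypothetical, and declining to stop when the pool has fewer than two candidates is an assumption, since the text does not cover that case.

```python
import math


def step_sigma(mu: float, kappa: float) -> float:
    """Standard deviation of the predicted Beta belief at one step."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


def score_and_width(steps, lam: float):
    """Return (S(y), U(y)) for one candidate given its (mu, kappa) steps."""
    sigmas = [step_sigma(mu, kappa) for mu, kappa in steps]
    s = sum(mu - lam * sg for (mu, _), sg in zip(steps, sigmas)) / len(steps)
    u = sum(sigmas) / len(steps)
    return s, u


def stopping_test(candidates, lam: float, c_stop: float):
    """Return (stop, best_index) for the current candidate pool.

    stop is True when LCB(y*) exceeds the UCB of every other candidate.
    """
    stats = [score_and_width(steps, lam) for steps in candidates]
    best = max(range(len(stats)), key=lambda i: stats[i][0])
    if len(candidates) < 2:
        return False, best  # assumption: never stop on a single candidate
    s_best, u_best = stats[best]
    lcb_best = s_best - c_stop * u_best
    max_ucb_rest = max(
        s + c_stop * u for i, (s, u) in enumerate(stats) if i != best
    )
    return lcb_best > max_ucb_rest, best


winner = [(0.9, 99.0), (0.9, 99.0)]
weak_rival = [(0.5, 99.0)]       # clearly lower reward, reliable
uncertain_rival = [(0.88, 3.0)]  # similar reward, low concentration

stop_a, best_a = stopping_test([winner, weak_rival], lam=1.0, c_stop=1.0)
stop_b, best_b = stopping_test([winner, uncertain_rival], lam=1.0, c_stop=1.0)
```

In the first pool the winner's pessimistic score exceeds the rival's optimistic score, so allocation stops. In the second pool the rival's wide interval overlaps the winner's, so the test asks for more computation.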
#### Uncertainty\-Guided Prefix Repair\.
If the stopping criterion is not met, ACA spends the next batch on a competitive existing response, chosen as the non-winner candidate with the highest UCB, where additional computation is most likely to change the current decision. To choose where to repair this response, ACA uses a deterministic cutpoint rule over reasoning steps. It first computes a conservative step score $\mu_t-c_{\mathrm{cut}}\sigma_t$ and selects the earliest step whose value falls below a low-quality threshold $p_{\mathrm{bad}}$. If no such step exists, ACA falls back to the most uncertain eligible reasoning step, i.e., the step with the largest $\sigma_t$. The selected step is treated as a cutpoint: ACA keeps the prefix before the cutpoint, discards the subsequent generation, and samples new continuations from that prefix. The procedure repeats until the confidence condition holds or the budget $N$ is reached.
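The cutpoint rule can be sketched as below. The function names and threshold values are hypothetical, step indices are 0-based, and every step is treated as eligible, which is an assumption because eligibility is not defined in this section.

```python
import math


def step_sigma(mu: float, kappa: float) -> float:
    """Standard deviation of the predicted Beta belief at one step."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


def choose_cutpoint(mus, kappas, c_cut: float, p_bad: float) -> int:
    """Return the 0-based index of the step to cut at.

    Rule 1: earliest step whose conservative score mu_t - c_cut * sigma_t
            falls below p_bad.
    Rule 2 (fallback): the step with the largest sigma_t.
    """
    sigmas = [step_sigma(mu, kappa) for mu, kappa in zip(mus, kappas)]
    for t, (mu, sg) in enumerate(zip(mus, sigmas)):
        if mu - c_cut * sg < p_bad:
            return t
    return max(range(len(sigmas)), key=lambda t: sigmas[t])


def kept_prefix(steps, cut: int):
    """Keep the steps before the cutpoint; the rest is discarded and resampled."""
    return steps[:cut]


# Step 1 has a low conservative score, so it becomes the cutpoint.
cut_low = choose_cutpoint([0.9, 0.4, 0.8], [99.0, 99.0, 99.0], c_cut=1.0, p_bad=0.5)

# No step is below the threshold, so the most uncertain step (index 1) is chosen.
cut_fallback = choose_cutpoint([0.9, 0.8], [99.0, 3.0], c_cut=1.0, p_bad=0.2)

prefix = kept_prefix(["s1", "s2", "s3"], cut_low)
```

Choosing the earliest low-scoring step keeps only the part of the response that precedes the first suspected problem, so the new continuations do not inherit it.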
## 6 Experiments
### 6.1 Experimental Setup
We evaluate our proposed methods from two aspects. First, we evaluate BetaPRM as a PRM on PRM-guided Best-of-$N$ selection and step-level error detection. Second, we evaluate whether its uncertainty estimates improve Adaptive Computation Allocation (ACA) in Best-of-$N$ reasoning.

We train on VisualPRM400K-v1.1 ([https://huggingface.co/datasets/OpenGVLab/VisualPRM400K-v1.1-Raw](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K-v1.1-Raw)) \[[63](https://arxiv.org/html/2605.15529#bib.bib63)\], a publicly available dataset that reports $K$ successful continuations out of $N=16$ Monte Carlo samples for each prefix. The standard PRM baseline is trained with cross-entropy using the empirical ratio $K/N$ as a single-point target, while BetaPRM uses the Beta-Binomial objective on $(K,N)$. We evaluate BetaPRM as a PRM with four backbones: InternVL2.5-8B \[[9](https://arxiv.org/html/2605.15529#bib.bib9)\], InternVL3-8B \[[75](https://arxiv.org/html/2605.15529#bib.bib75)\], InternVL3-14B \[[75](https://arxiv.org/html/2605.15529#bib.bib75)\], and Qwen2.5-VL-7B \[[1](https://arxiv.org/html/2605.15529#bib.bib1)\]. Best-of-$N$ selection uses candidate pools generated by InternVL2.5-8B \[[9](https://arxiv.org/html/2605.15529#bib.bib9)\] and reports final-answer accuracy on MathVision \[[60](https://arxiv.org/html/2605.15529#bib.bib60)\], OlympiadBench \[[18](https://arxiv.org/html/2605.15529#bib.bib18)\], MathVerse \[[68](https://arxiv.org/html/2605.15529#bib.bib68)\], and MathVista \[[40](https://arxiv.org/html/2605.15529#bib.bib40)\]. Step-level error detection is evaluated on VisualProcessBench \[[63](https://arxiv.org/html/2605.15529#bib.bib63)\]. ACA is evaluated on two representative backbones, InternVL2.5-8B \[[9](https://arxiv.org/html/2605.15529#bib.bib9)\] and Qwen2.5-VL-7B \[[1](https://arxiv.org/html/2605.15529#bib.bib1)\], against fixed-budget Best-of-$N$ under the same maximum budget $N=16$, reporting accuracy and generated tokens. Full training, evaluation, and ACA implementation details are provided in Appendix [A](https://arxiv.org/html/2605.15529#A1).
Table 1: PRM-guided Best-of-16 final-answer accuracy. All PRMs select from the same candidate pools generated by InternVL2.5-8B. Values marked ↑/↓ indicate improvement/decline over the single-pass baseline, and Avg. Δ averages this improvement/decline over the four benchmarks.

| Selector | MathVision | OlympiadBench | MathVerse | MathVista | Avg. Δ |
|---|---|---|---|---|---|
| Single Pass | 18.08 | 8.65 | 35.31 | 52.77 | – |
| *InternVL3-14B* | | | | | |
| + Base (w/o training) | 19.74 (↑1.66) | 11.33 (↑2.68) | 36.17 (↑0.86) | 52.50 (↓0.27) | ↑1.23 |
| + Standard PRM | 23.03 (↑4.95) | 16.67 (↑8.02) | 45.41 (↑10.10) | 60.70 (↑7.93) | ↑7.75 |
| + BetaPRM | 25.66 (↑7.58) | 16.67 (↑8.02) | 46.35 (↑11.04) | 62.30 (↑9.53) | ↑9.04 (+1.29) |
| *InternVL3-8B* | | | | | |
| + Base (w/o training) | 18.75 (↑0.67) | 13.33 (↑4.68) | 37.21 (↑1.90) | 52.40 (↓0.37) | ↑1.72 |
| + Standard PRM | 22.69 (↑4.61) | 15.33 (↑6.68) | 44.80 (↑9.49) | 60.00 (↑7.23) | ↑7.00 |
| + BetaPRM | 24.34 (↑6.26) | 18.00 (↑9.35) | 45.20 (↑9.89) | 61.10 (↑8.33) | ↑8.46 (+1.46) |
| *InternVL2.5-8B* | | | | | |
| + Base (w/o training) | 20.72 (↑2.64) | 9.33 (↑0.68) | 36.83 (↑1.52) | 51.90 (↓0.87) | ↑0.99 |
| + Standard PRM | 21.38 (↑3.30) | 11.33 (↑2.68) | 42.81 (↑7.50) | 57.60 (↑4.83) | ↑4.58 |
| + BetaPRM | 25.66 (↑7.58) | 15.33 (↑6.68) | 44.31 (↑9.00) | 61.30 (↑8.53) | ↑7.95 (+3.37) |
| *Qwen2.5-VL-7B* | | | | | |
| + Base (w/o training) | 15.46 (↓2.62) | 8.00 (↓0.65) | 35.84 (↑0.53) | 50.70 (↓2.07) | ↓1.20 |
| + Standard PRM | 21.38 (↑3.30) | 14.00 (↑5.35) | 44.92 (↑9.61) | 60.30 (↑7.53) | ↑6.45 |
| + BetaPRM | 24.34 (↑6.26) | 17.33 (↑8.68) | 45.99 (↑10.68) | 63.60 (↑10.83) | ↑9.11 (+2.66) |
### 6.2 BetaPRM Evaluation
#### BetaPRM improves Best-of-$N$ selection across four backbones and four benchmarks.
Table [1](https://arxiv.org/html/2605.15529#S6.SS1) evaluates PRMs as solution selectors under the same candidate pools. Standard PRM selects the candidate with the highest average process reward. BetaPRM exposes both a reward mean and a learned reliability estimate, so we use its full output through a risk-budget selector:

$$S_{\mathrm{RB}}(y)=\frac{1}{T}\sum_{t=1}^{T}\mu_t-\lambda\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}[\sigma_t>\tau],\qquad \sigma_t=\sqrt{\frac{\mu_t(1-\mu_t)}{\kappa_t+1}}.$$

It keeps the average reward term, but discounts rollouts that contain many high-uncertainty steps.
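A minimal sketch of the risk-budget selector follows. The function names are hypothetical, and the values of $\lambda$ and $\tau$ are illustrative only, since this section does not report them.

```python
import math


def step_sigma(mu: float, kappa: float) -> float:
    """Standard deviation of the predicted Beta belief at one step."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


def risk_budget_score(mus, kappas, lam: float, tau: float) -> float:
    """S_RB(y): mean reward minus lam times the fraction of steps with sigma_t > tau."""
    num_steps = len(mus)
    mean_reward = sum(mus) / num_steps
    num_uncertain = sum(
        1 for mu, kappa in zip(mus, kappas) if step_sigma(mu, kappa) > tau
    )
    return mean_reward - lam * num_uncertain / num_steps


def select_best(candidates, lam: float, tau: float) -> int:
    """Index of the candidate with the highest risk-budget score.

    Each candidate is a (mus, kappas) pair of equal-length lists.
    """
    scores = [risk_budget_score(mus, kappas, lam, tau) for mus, kappas in candidates]
    return max(range(len(scores)), key=lambda i: scores[i])


# Candidate 0 has a slightly higher mean reward but two uncertain steps;
# candidate 1 has slightly lower reward and only reliable steps.
pool = [
    ([0.85, 0.85], [3.0, 3.0]),
    ([0.80, 0.80], [99.0, 99.0]),
]
chosen = select_best(pool, lam=0.2, tau=0.1)
```

A mean-reward selector would pick candidate 0 in this pool, whereas the risk-budget selector picks candidate 1 because both steps of candidate 0 exceed the uncertainty threshold.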
BetaPRM achieves the best accuracy in every backbone–benchmark block, with one tie against standard PRM (16.67 on OlympiadBench with InternVL3-14B). Its average gains over standard PRM are $+1.29$, $+1.46$, $+3.37$, and $+2.66$ points for InternVL3-14B, InternVL3-8B, InternVL2.5-8B, and Qwen2.5-VL-7B, respectively. These gains reflect that our Beta-Binomial objective effectively learns $\mu_t$ to explain the success ratio $K_t/N$ as a deterministic soft label and concentration $\kappa_t$ to down-weight high-scoring traces whose rewards are uncertain.
Table 2: Step-level error detection on VisualProcessBench. Overall denotes micro-F1 over all annotated steps, while each source column reports macro-F1 on that subset.

| Model | Overall | MathVision | MathVerse | MMMU | DynaMath | WeMath |
|---|---|---|---|---|---|---|
| *InternVL3-14B* | | | | | | |
| Base (w/o training) | 49.40 | 47.61 | 50.97 | 48.86 | 50.19 | 47.34 |
| Standard PRM | 61.90 | 59.90 | 61.93 | 63.12 | 62.93 | 63.79 |
| BetaPRM | 61.90 | 60.94 | 62.74 | 59.18 | 61.67 | 64.59 |
| *InternVL3-8B* | | | | | | |
| Base (w/o training) | 48.39 | 47.21 | 48.91 | 47.01 | 49.41 | 49.00 |
| Standard PRM | 60.69 | 60.20 | 60.41 | 58.73 | 61.63 | 63.23 |
| BetaPRM | 61.85 | 59.65 | 62.17 | 62.42 | 62.71 | 64.23 |
| *InternVL2.5-8B* | | | | | | |
| Base (w/o training) | 52.28 | 52.40 | 52.04 | 50.21 | 54.85 | 49.95 |
| Standard PRM | 61.54 | 60.78 | 60.47 | 62.05 | 62.91 | 64.38 |
| BetaPRM | 60.97 | 60.43 | 60.70 | 60.48 | 63.00 | 59.98 |
| *Qwen2.5-VL-7B* | | | | | | |
| Base (w/o training) | 49.68 | 50.22 | 49.58 | 49.85 | 49.62 | 48.51 |
| Standard PRM | 62.23 | 62.17 | 61.25 | 61.44 | 62.88 | 65.55 |
| BetaPRM | 62.91 | 62.19 | 62.91 | 59.49 | 63.75 | 66.69 |
#### BetaPRMpreserves standard PRM error\-detection ability\.
Table[6\.2](https://arxiv.org/html/2605.15529#S6.SS2.SSS0.Px1)reports results on VisualProcessBench\[[63](https://arxiv.org/html/2605.15529#bib.bib63)\], a benchmark for step\-level error detection\. Each reasoning trace has human step\-wise correctness labels, and a PRM score is thresholded into a binary prediction of whether each step is correct or erroneous\.
BetaPRM remains competitive with standard PRM under this thresholding setting. Its overall micro-F1 stays comparable to that of standard PRM on every evaluated backbone: it matches standard PRM on InternVL3-14B (61.90 for both), improves slightly on InternVL3-8B and Qwen2.5-VL-7B, and is slightly lower on InternVL2.5-8B. Together with the Best-of-16 results, this shows that Beta-Binomial training improves the relative ranking of candidate solutions without degrading the PRM's ability to separate correct and erroneous steps under a standard decision threshold.
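As a rough illustration of this thresholding protocol, the sketch below converts per-step PRM scores into binary predictions and computes macro-F1 over the two classes (correct and erroneous). The threshold of 0.5, the function names, and the toy scores are assumptions made for this example; the benchmark's official evaluation script may differ in its details.

```python
def threshold_predictions(scores, threshold=0.5):
    """Map step scores to 1 (correct) or 0 (erroneous).

    ASSUMPTION: a fixed threshold of 0.5; the source does not state the value.
    """
    return [1 if s >= threshold else 0 for s in scores]


def f1_for_class(labels, preds, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    return (f1_for_class(labels, preds, 1) + f1_for_class(labels, preds, 0)) / 2


# Toy example (hypothetical data): five annotated steps from one trace.
human_labels = [1, 1, 0, 1, 0]
prm_scores = [0.9, 0.4, 0.2, 0.8, 0.7]
step_preds = threshold_predictions(prm_scores)
toy_macro_f1 = macro_f1(human_labels, step_preds)
```

Because only the thresholded mean reward enters this metric, a model can keep the same F1 while differing in how it ranks full solutions, which is consistent with the comparison drawn above.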
#### The auxiliary evidence regularizer improves concentration calibration\.
Table [3](https://arxiv.org/html/2605.15529#S6.T3) isolates the effect of $L_{\mathrm{reg}}$. Adding it to the Beta-Binomial likelihood improves all four Best-of-16 benchmarks, with an average gain of +1.02 points. This matches its intended role: when the predicted mean $\mu_t$ disagrees with the observed Monte Carlo ratio $K_t/N$, the regularizer penalizes high concentration. The stop-gradient is crucial here: it avoids pulling $\mu_t$ toward $K_t/N$, which would turn the auxiliary term into another soft-label regression objective. With stop-gradient, the term instead focuses on updating the concentration parameter $\kappa_t$. The consistent gains suggest that explicitly calibrating concentration improves the reliability signal used in candidate ranking.
Table 3: Ablation of the auxiliary evidence regularizer on InternVL2.5-8B under PRM-guided Best-of-16 selection. Removing $L_{\mathrm{reg}}$ consistently reduces accuracy.

| Method | MathVision | OlympiadBench | MathVerse | MathVista | Avg. |
|---|---|---|---|---|---|
| BetaPRM | 25.66 | 15.33 | 44.31 | 61.30 | 36.65 |
| BetaPRM w/o $L_{\mathrm{reg}}$ | 24.67 (↓0.99) | 14.00 (↓1.33) | 43.63 (↓0.68) | 60.20 (↓1.10) | 35.63 (↓1.02) |
#### BetaPRM learns adaptive confidence.
Figure 4: Training dynamics of the learned concentration $\kappa_t$. The mean and the 90th percentile both decrease early in training and later recover, showing that BetaPRM first becomes conservative and then learns to assign higher confidence to prefixes whose reward estimates are better supported.

Figure [4](https://arxiv.org/html/2605.15529#S6.F4) tracks the learned concentration $\kappa_t$ during training. Across all four backbones, both the mean and the 90th percentile of $\kappa_t$ drop sharply at the beginning and then gradually increase. This is the expected behavior: early in training, the reward mean $\mu_t$ is still unreliable, so the model lowers its confidence instead of making sharp predictions. As training progresses, the model assigns higher concentration to prefixes whose predicted reward is better supported by the observed number of successful continuations.
The upper-tail behavior is also important. After the initial drop, the 90th percentile recovers more strongly than the mean and remains clearly separated from it. This suggests that the model is not simply raising $\kappa_t$ uniformly; it forms an upper tail of prefixes with substantially higher confidence. This separation is useful for reliability-aware use: if all prefixes had similar concentration, $\kappa_t$ would provide little guidance about which rewards are trustworthy. A high-confidence upper tail lets downstream methods treat some predictions as more strongly supported while staying conservative on ordinary or low-confidence predictions.
Table 4: ACA improves the accuracy–token tradeoff in PRM-guided Best-of-16. Token counts are reported in thousands. Percentages indicate token reduction relative to Vanilla BoN.

| Method | Adaptive Allocation | Early Stopping | MathVision Acc. ↑ | MathVision Tokens ↓ | OlympiadBench Acc. ↑ | OlympiadBench Tokens ↓ | MathVerse Acc. ↑ | MathVerse Tokens ↓ | MathVista Acc. ↑ | MathVista Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *InternVL2.5-8B* | | | | | | | | | | |
| Vanilla BoN | × | × | 25.00 | 1383k | 15.33 | 1151k | 44.47 | 17932k | 60.90 | 2790k |
| ACA w/o EarlyStop | ✓ | × | 24.01 | 1237k (↓10.56%) | 15.33 | 1028k (↓10.69%) | 42.99 | 14692k (↓18.07%) | 60.20 | 2462k (↓11.76%) |
| ACA | ✓ | ✓ | 26.32 | 965k (↓30.24%) | 16.67 | 958k (↓16.76%) | 45.58 | 11912k (↓33.57%) | 62.20 | 1949k (↓30.14%) |
| *Qwen2.5-VL-7B* | | | | | | | | | | |
| Vanilla BoN | × | × | 24.67 | 1383k | 16.67 | 1151k | 45.74 | 17932k | 63.30 | 2790k |
| ACA w/o EarlyStop | ✓ | × | 24.34 | 1205k (↓12.87%) | 16.67 | 1084k (↓5.82%) | 44.34 | 16075k (↓10.36%) | 62.80 | 2551k (↓8.57%) |
| ACA | ✓ | ✓ | 26.65 | 988k (↓28.57%) | 18.00 | 928k (↓19.39%) | 46.40 | 12015k (↓33.00%) | 64.00 | 2030k (↓27.22%) |
Table 5: ACA ablation under a Best-of-16 budget. Learned uncertainty from BetaPRM gives a stronger accuracy–token tradeoff than Standard PRM with proxy uncertainty or reward-only allocation.

| Method | MathVision Acc. | MathVision Tokens | OlympiadBench Acc. | OlympiadBench Tokens | MathVerse Acc. | MathVerse Tokens | MathVista Acc. | MathVista Tokens |
|---|---|---|---|---|---|---|---|---|
| *InternVL2.5-8B* | | | | | | | | |
| ACA w. BetaPRM (Learned Uncertainty) | 25.99 | 965k | 16.67 | 958k | 45.58 | 11912k | 62.10 | 1949k |
| ACA w. Standard PRM (Proxy Uncertainty) | 22.37 | 1225k | 14.00 | 994k | 44.29 | 14551k | 61.40 | 2304k |
| ACA w. Standard PRM (Reward-Only) | 21.38 | 738k | 14.67 | 527k | 43.02 | 10783k | 58.80 | 1799k |
| *Qwen2.5-VL-7B* | | | | | | | | |
| ACA w. BetaPRM (Learned Uncertainty) | 26.65 | 988k | 18.00 | 928k | 46.40 | 12015k | 64.00 | 2030k |
| ACA w. Standard PRM (Proxy Uncertainty) | 24.67 | 1133k | 15.33 | 915k | 45.41 | 13595k | 62.30 | 2072k |
| ACA w. Standard PRM (Reward-Only) | 21.38 | 604k | 14.67 | 499k | 44.47 | 9138k | 60.30 | 1618k |
### 6.3 ACA Improves the Inference-Time Accuracy–Token Tradeoff
#### ACA uses fewer tokens while improving Best-of-$N$ accuracy.
Table [4](https://arxiv.org/html/2605.15529#S6.SS2.SSS0.Px4) compares ACA with fixed-budget PRM-guided Best-of-$N$. All methods use the same maximum budget of $N=16$ candidate generations and the same BetaPRM risk-budget selector $S_{\mathrm{RB}}$ for final selection; they differ only in how the budget is spent. Vanilla Best-of-$N$ generates all candidates from scratch, whereas ACA spends the budget in stages and uses BetaPRM uncertainty to decide whether to stop or repair uncertain prefixes.
ACA improves the accuracy–token tradeoff on both backbones. It raises accuracy on all four benchmarks while saving 16.76%–33.57% of tokens on InternVL2.5-8B and 19.39%–33.00% on Qwen2.5-VL-7B. The ablation without early stopping shows that adaptive expansion alone mainly reduces computation without improving accuracy: it can keep spending budget even after the currently selected answer is already reliable, introducing additional candidates that may distract an imperfect PRM selector. The full ACA gives the strongest tradeoff by combining uncertainty-guided expansion with confidence-based stopping.
#### BetaPRM provides a distinct learned uncertainty signal.
We investigate whether the explicit uncertainty modeling in BetaPRM is actually necessary for ACA, or whether a standard PRM would suffice to achieve the same efficiency. We compare BetaPRM with ACA variants that use a standard PRM. The first baseline, ACA with Standard PRM (Reward-Only), removes uncertainty entirely: it ranks candidates by the average process reward from the standard PRM, uses a score-margin stopping rule, and repairs the lowest-scoring step when allocating more computation. The second baseline, ACA with Standard PRM (Proxy Uncertainty), uses $\sigma_t = \sqrt{\mu_t(1-\mu_t)}$ as an uncertainty proxy for ACA. Our full variant, ACA with BetaPRM (Learned Uncertainty), uses the uncertainty induced by its learned concentration $\kappa_t$. For fair comparison, all variants use the same linear risk-adjusted score

$$S_{\mathrm{lin}}(y) = \frac{1}{T}\sum_{t=1}^{T}\left(\mu_t - \lambda\sigma_t\right).$$

For the reward-only baseline, we set $\sigma_t = 0$, so this reduces to the average process reward. This shared form keeps the comparison well-defined across variants and focuses the ablation on the source of uncertainty.
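The three variants can be compared on toy numbers with a minimal sketch of the shared score $S_{\mathrm{lin}}$. The proxy $\sigma_t$ follows the formula in the text. The learned $\sigma_t$ is an assumption here: it is taken to be the standard deviation of a Beta distribution with $\alpha_t = \mu_t\kappa_t$ and $\beta_t = (1-\mu_t)\kappa_t$, which may not match the exact uncertainty used by BetaPRM. The value of $\lambda$, the step rewards, and the concentrations are hypothetical.

```python
import math


def proxy_sigma(mu):
    """Proxy uncertainty from the text: sqrt(mu * (1 - mu))."""
    return math.sqrt(mu * (1.0 - mu))


def beta_sigma(mu, kappa):
    """ASSUMPTION: std of Beta(mu * kappa, (1 - mu) * kappa),
    i.e. sqrt(mu * (1 - mu) / (kappa + 1))."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


def s_lin(mus, sigmas, lam):
    """S_lin(y) = (1/T) * sum_t (mu_t - lam * sigma_t)."""
    return sum(m - lam * s for m, s in zip(mus, sigmas)) / len(mus)


# Two hypothetical traces with identical step rewards but different concentration.
mus = [0.8, 0.7, 0.9]
kappa_confident = [50.0, 40.0, 60.0]
kappa_uncertain = [2.0, 3.0, 2.0]
lam = 0.5  # hypothetical risk weight

reward_only = s_lin(mus, [0.0] * len(mus), lam)
proxy = s_lin(mus, [proxy_sigma(m) for m in mus], lam)
learned_confident = s_lin(
    mus, [beta_sigma(m, k) for m, k in zip(mus, kappa_confident)], lam
)
learned_uncertain = s_lin(
    mus, [beta_sigma(m, k) for m, k in zip(mus, kappa_uncertain)], lam
)
```

Because the proxy depends on $\mu_t$ alone, it assigns both traces the same score. Under the assumed Beta-based uncertainty, the trace with low concentration is penalized more, which is the kind of distinction the ablation is designed to test.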
As shown in Table [5](https://arxiv.org/html/2605.15529#S6.SS2.SSS0.Px4), BetaPRM with learned uncertainty gives the best accuracy–token tradeoff. It achieves higher accuracy than proxy uncertainty in all eight backbone–benchmark settings and uses fewer tokens in seven of them; the exception is OlympiadBench on Qwen2.5-VL-7B (928k vs. 915k tokens). In contrast, the reward-only ($\mu$-only) variant uses the fewest tokens in every setting, but at a clear accuracy cost. Without a reliability estimate, it treats low reward as the only reason to repair and large reward margins as sufficient evidence to stop. This misses the key cases ACA is designed for: prefixes whose score is not necessarily low, but whose PRM score is uncertain enough that additional continuations could change the selected answer. Thus, reward-only allocation saves computation by reducing exploration, but loses accuracy because it cannot identify where uncertainty still matters.
## 7 Conclusion
We study how PRMs can score reasoning steps while also indicating when those scores should be trusted. We propose BetaPRM, a distributional PRM that represents each reasoning prefix with a Beta belief over its success probability and trains it from Monte Carlo observations using a Beta-Binomial objective. This gives the model both a predicted prefix success probability and a learned reliability estimate for that prediction. Experiments show that BetaPRM improves PRM-guided Best-of-$N$ selection without sacrificing step-level error detection. Using this reliability signal, Adaptive Computation Allocation further improves final-answer accuracy while reducing inference tokens by up to 33.57%. Overall, BetaPRM turns scalar process rewards into reliability-aware signals for test-time selection and computation allocation.
## Acknowledgement
This research was supported in part by the NVIDIA Academic Grant Program and WashU Ignite Interdisciplinary Grants\.
## References
- Bai et al. \[2025\] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. *arXiv e-prints*, pages arXiv–2502, 2025.
- Bilal et al. \[2026\] Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin, and Dean Hougen. What if we allocate test-time compute adaptively? *arXiv preprint arXiv:2602.01070*, 2026.
- Brown et al. \[2024\] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. *arXiv preprint arXiv:2407.21787*, 2024.
- Cao and Xiao \[2022\] Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In *Proceedings of the 29th international conference on computational linguistics*, pages 1511–1520, 2022.
- Chae et al. \[2026\] Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong woo Kwak, Dongjin Kang, and Jinyoung Yeo. Web-shepherd: Advancing PRMs for reinforcing web agents. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*, 2026. URL [https://openreview.net/forum?id=G2kMroO9UV](https://openreview.net/forum?id=G2kMroO9UV).
- Chang et al. \[2022\] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. MapQA: A dataset for question answering on choropleth maps. In *NeurIPS 2022 First Table Representation Workshop*, 2022. URL [https://openreview.net/forum?id=znKbVjeR0yI](https://openreview.net/forum?id=znKbVjeR0yI).
- Chen et al. \[2022\] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. In *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 3313–3323, 2022.
- Chen et al. \[2024a\] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 8199–8221, 2024a.
- Chen et al. \[2024b\] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. *arXiv preprint arXiv:2412.05271*, 2024b.
- Chen et al. \[2026\] Zhengyu Chen, Yudong Wang, Teng Xiao, Ruochen Zhou, Xusheng Yang, Wei Wang, Zhifang Sui, and Jingang Wang. From mathematical reasoning to code: Generalization of process reward models in test-time scaling. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 30368–30376, 2026.
- Cobbe et al. \[2021\] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*, 2021.
- Dai et al. \[2024\] Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation. *arXiv preprint arXiv:2410.17621*, 2024.
- Du et al. \[2025\] Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, and Wenqi Shao. Mm-prm: Enhancing multimodal mathematical reasoning with scalable step-level supervision. *arXiv preprint arXiv:2505.13427*, 2025.
- Duan et al. \[2025\] Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, and Longxu Dou. Efficient process reward model training via active learning. In *Second Conference on Language Modeling*, 2025. URL [https://openreview.net/forum?id=CJ2FmPmoDE](https://openreview.net/forum?id=CJ2FmPmoDE).
- Gao et al. \[2025\] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing HONG, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-LLaVA: Solving geometric problem with multi-modal large language model. In *The Thirteenth International Conference on Learning Representations*, 2025. URL [https://openreview.net/forum?id=px1674Wp3C](https://openreview.net/forum?id=px1674Wp3C).
- Goyal et al. \[2017\] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In *Proceedings of the IEEE conference on computer vision and pattern recognition*, pages 6904–6913, 2017.
- Guan et al. \[2025\] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small LLMs can master math reasoning with self-evolved deep thinking. In *Forty-second International Conference on Machine Learning*, 2025. URL [https://openreview.net/forum?id=5zwF1GizFa](https://openreview.net/forum?id=5zwF1GizFa).
- He et al. \[2024a\] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 3828–3850, 2024a.
- He et al. \[2024b\] Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, and Weiming Lu. Advancing process verification for large language models via tree-based preference learning. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 2086–2099, 2024b.
- Hosu et al. \[2020\] Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe. Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. *IEEE Transactions on Image Processing*, 29:4041–4056, 2020.
- Hu et al. \[2025\] Pengfei Hu, Zhenrong Zhang, Qikai Chang, Shuhang Liu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma, et al. Prm-bas: Enhancing multimodal reasoning through prm-guided beam annealing search. *arXiv preprint arXiv:2504.10222*, 2025.
- Huang et al. \[2025\] Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Efficient test-time scaling via self-calibration. *arXiv preprint arXiv:2503.00031*, 2025.
- Huang et al. \[2019\] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 competition on scanned receipt ocr and information extraction. In *2019 International Conference on Document Analysis and Recognition (ICDAR)*, pages 1516–1520. IEEE, 2019.
- Johnson et al. \[2017\] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In *Proceedings of the IEEE conference on computer vision and pattern recognition*, pages 2901–2910, 2017.
- Kafle et al. \[2018\] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In *Proceedings of the IEEE conference on computer vision and pattern recognition*, pages 5648–5656, 2018.
- Kahou et al. \[2017\] Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. *arXiv preprint arXiv:1710.07300*, 2017.
- Kazemi et al. \[2024\] Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. In *AI for Math Workshop@ ICML 2024*, 2024.
- Kembhavi et al. \[2016\] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In *European conference on computer vision*, pages 235–251. Springer, 2016.
- Khalifa et al. \[2025\] Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. *arXiv preprint arXiv:2504.16828*, 2025.
- Kim et al. \[2025\] Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, et al. Scaling evaluation-time compute with reasoning models as process evaluators. *arXiv preprint arXiv:2503.19877*, 2025.
- Li et al. \[2026\] Jinyuan Li, Chengsong Huang, Langlin Huang, Shaoyang Xu, Haolin Liu, Wenxuan Zhang, and Jiaxin Huang. Training data efficiency in multimodal process reward models. *arXiv preprint arXiv:2602.04145*, 2026.
- Li et al. \[2023\] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 14963–14973, 2023.
- Lightman et al. \[2024\] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In *The Twelfth International Conference on Learning Representations*, 2024. URL [https://openreview.net/forum?id=v8L0pN6EOi](https://openreview.net/forum?id=v8L0pN6EOi).
- Liu et al. \[2026\] Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen-Yu Wei, and Dong Yu. Save the good prefix: Precise error penalization via process-supervised rl to enhance llm reasoning. *arXiv preprint arXiv:2601.18984*, 2026.
- Liu et al. \[2025a\] Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b LLM surpass 405b LLM? rethinking compute-optimal test-time scaling. In *Workshop on Reasoning and Planning for Large Language Models*, 2025a. URL [https://openreview.net/forum?id=CvjX9Lhpze](https://openreview.net/forum?id=CvjX9Lhpze).
- Liu et al. \[2025b\] Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, and Junxian He. Diving into self-evolving training for multimodal reasoning. In *Forty-second International Conference on Machine Learning*, 2025b. URL [https://openreview.net/forum?id=X3ikghfWwD](https://openreview.net/forum?id=X3ikghfWwD).
- Lu et al. \[2021a\] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)*, pages 6774–6786, Online, August 2021a. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.528. URL [https://aclanthology.org/2021.acl-long.528/](https://aclanthology.org/2021.acl-long.528/).
- Lu et al. \[2021b\] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. In *Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)*, 2021b. URL [https://openreview.net/forum?id=uXa9oBDZ9V1](https://openreview.net/forum?id=uXa9oBDZ9V1).
- Lu et al. \[2022\] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. *Advances in Neural Information Processing Systems*, 35:2507–2521, 2022.
- Lu et al. \[2024\] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In *The Twelfth International Conference on Learning Representations*, 2024. URL [https://openreview.net/forum?id=KUNzEQMWU7](https://openreview.net/forum?id=KUNzEQMWU7).
- Luo et al. \[2024\] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision. *arXiv preprint arXiv:2406.06592*, 2024.
- Luo et al. \[2025\] Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Lei Wang, Ruihang Chu, et al. Unlocking multimodal mathematical reasoning via process reward model. *arXiv preprint arXiv:2501.04686*, 2025.
- Ma et al. \[2023\] Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let's reward step by step: Step-level reward model as the navigators for reasoning. *arXiv preprint arXiv:2310.10080*, 2023.
- Masry et al. \[2022\] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In *Findings of the association for computational linguistics: ACL 2022*, pages 2263–2279, 2022.
- Mathew et al. \[2021\] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In *Proceedings of the IEEE/CVF winter conference on applications of computer vision*, pages 2200–2209, 2021.
- Mathew et al. \[2022\] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C.V. Jawahar. Infographicvqa. In *Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)*, pages 1697–1706, January 2022.
- Pala et al. \[2025\] Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, and Soujanya Poria. Error typing for smarter rewards: Improving process reward models with error-aware hierarchical supervision. *arXiv preprint arXiv:2505.19706*, 2025.
- Pan et al. \[2025\] xu Zhao Pan, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, and Kaipeng Zhang. MPBench: A comprehensive multimodal reasoning benchmark for process errors identification. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, *Findings of the Association for Computational Linguistics: ACL 2025*, pages 21586–21606, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi:10.18653/v1/2025.findings-acl.1112. URL [https://aclanthology.org/2025.findings-acl.1112/](https://aclanthology.org/2025.findings-acl.1112/).
- Park et al. \[2026\] Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, and Navid Azizan. Know what you don't know: Uncertainty calibration of process reward models. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*, 2026. URL [https://openreview.net/forum?id=hzMkfIrdDT](https://openreview.net/forum?id=hzMkfIrdDT).
- Seo et al. \[2015\] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In *Proceedings of the 2015 conference on empirical methods in natural language processing*, pages 1466–1476, 2015.
- Shi et al. \[2024\] Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 4663–4680, 2024.
- Singh et al. \[2024\] Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, and Karan Desai. Benchmarking object detectors with coco: A new path forward. In *European Conference on Computer Vision*, pages 279–295. Springer, 2024.
- Snell et al. \[2024\] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. *arXiv preprint arXiv:2408.03314*, 2024.
- Song et al. \[2025\] Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. Prmbench: A fine-grained and challenging benchmark for process-level reward models. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 25299–25346, 2025.
- Suhr et al. \[2019\] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In *Proceedings of the 57th annual meeting of the association for computational linguistics*, pages 6418–6428, 2019.
- Sun et al. \[2025\] Lin Sun, Chuang Liu, Xiaofeng Ma, Tao Yang, Weijia Lu, and Ning Wu. Freeprm: Training process reward models without ground truth process labels. *arXiv preprint arXiv:2506.03570*, 2025.
- Tu et al. \[2025\] Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, and Cihang Xie. Vilbench: A suite for vision-language process reward modeling. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 6775–6790, 2025.
- Uesato et al. \[2022\] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. *arXiv preprint arXiv:2211.14275*, 2022.
- Wang et al. \[2024a\] Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M Ni, et al. Openr: An open source framework for advanced reasoning with large language models. *arXiv preprint arXiv:2410.09671*, 2024a.
- Wang et al. \[2024b\] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. *Advances in Neural Information Processing Systems*, 37:95095–95169, 2024b.
- Wang et al. \[2024c\] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 9426–9439, 2024c.
- Wang et al. \[2025\] Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, and Emad Barsoum. Athena: Enhancing multimodal reasoning with data-efficient process reward models. *arXiv preprint arXiv:2506.09532*, 2025.
- Wang et al. \[2026\] Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, and Wenhai Wang. VisualPRM400k: An effective dataset for training multimodal process reward models. In *The Fourteenth International Conference on Learning Representations*, 2026. URL [https://openreview.net/forum?id=IHyY6vdYZw](https://openreview.net/forum?id=IHyY6vdYZw).
- Wang et al. \[2023\] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In *The Eleventh International Conference on Learning Representations*, 2023. URL [https://openreview.net/forum?id=1PL1NIMMrw](https://openreview.net/forum?id=1PL1NIMMrw).
- Xiong et al. \[2025\] Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, and Sainbayar Sukhbaatar. Stepwiser: Stepwise generative judges for wiser reasoning. *arXiv preprint arXiv:2508.19229*, 2025.
- You et al. \[2025\] Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, and Wenjie Li. Parallel test-time scaling for latent reasoning models. *arXiv preprint arXiv:2510.07745*, 2025.
- Yu et al. \[2024\] Fei Yu, Anningzhe Gao, and Benyou Wang. Ovm, outcome-supervised value models for planning in mathematical reasoning. In *Findings of the Association for Computational Linguistics: NAACL 2024*, pages 858–875, 2024.
- Zhang et al. \[2024\] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In *European Conference on Computer Vision*, pages 169–186. Springer, 2024.
- Zhang et al. \[2025a\] Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Peng Gao, and Hongsheng Li. MAVIS: Mathematical visual instruction tuning with an automatic data engine. In *The Thirteenth International Conference on Learning Representations*, 2025a. URL [https://openreview.net/forum?id=MnJzJ2gvuf](https://openreview.net/forum?id=MnJzJ2gvuf).
- Zhang et al. \[2026\] Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, Ruiming Tang, and Xiangyu Zhao. Process vs. outcome reward: Which is better for agentic RAG reinforcement learning. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*, 2026. URL [https://openreview.net/forum?id=h3LlJ6Bh4S](https://openreview.net/forum?id=h3LlJ6Bh4S).
- Zhang et al. \[2025b\] Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, and Guoliang Li. Reward-sql: Boosting text-to-sql via stepwise reasoning and process-supervised rewards. *arXiv preprint arXiv:2505.04671*, 2025b.
- Zhang et al. \[2025c\] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 10495–10516, 2025c.
- Zheng et al. \[2025\] Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 1009–1024, 2025.
- Zheng et al. \[2026\] Tong Zheng, Chengsong Huang, Runpeng Dai, Yun He, Rui Liu, Xin Ni, Huiwen Bao, Kaishen Wang, Hongtu Zhu, Jiaxin Huang, et al. Parallel-probe: Towards efficient parallel thinking via 2d probing. *arXiv preprint arXiv:2602.03845*, 2026.
- Zhu et al. \[2025\] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. *arXiv preprint arXiv:2504.10479*, 2025.
## Appendix A Experimental Setup and Implementation Details
### A\.1 Training Data and Backbones
We train all PRMs on VisualPRM400K\-v1\.1 \[[63](https://arxiv.org/html/2605.15529#bib.bib63)\]\. We use this version because it exposes the raw Monte Carlo supervision needed by our objective: for each supervised reasoning prefix, the dataset reports the number $K$ of successful continuations among $N=16$ sampled continuations\. This allows BetaPRM to train on the count pair $(K,N)$, while the standard PRM baseline is trained with cross\-entropy using the empirical ratio $K/N$ as a soft label\.
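To make the difference between the two supervision signals concrete, the following is a minimal sketch of the two per\-step losses. It assumes the Beta belief is parameterized as $\alpha=\mu\kappa+\epsilon$ and $\beta=(1-\mu)\kappa+\epsilon$, which is consistent with the Beta standard deviation used in A\.3 but is not spelled out in this appendix. The function names are hypothetical, and the regularizer $L_{\mathrm{reg}}$ is omitted.

```python
import math


def beta_binomial_nll(k, n, mu, kappa, eps=1e-6):
    """Negative log-likelihood of k successes out of n under a Beta-Binomial.

    Assumed parameterization (not confirmed by the source):
    alpha = mu * kappa + eps, beta = (1 - mu) * kappa + eps.
    """
    alpha = mu * kappa + eps
    beta = (1.0 - mu) * kappa + eps
    # log C(n, k)
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    # log B(k + alpha, n - k + beta)
    log_b_post = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
                  - math.lgamma(n + alpha + beta))
    # log B(alpha, beta)
    log_b_prior = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return -(log_binom + log_b_post - log_b_prior)


def soft_label_cross_entropy(k, n, mu, eps=1e-6):
    """Cross-entropy against the empirical ratio k / n used as a soft label."""
    p = k / n
    mu = min(max(mu, eps), 1.0 - eps)  # clamp to keep the logs finite
    return -(p * math.log(mu) + (1.0 - p) * math.log(1.0 - mu))


if __name__ == "__main__":
    # Same observed counts, same predicted mean, different concentrations.
    for kappa in (1.0, 4.0, 64.0):
        print(kappa, beta_binomial_nll(k=12, n=16, mu=0.75, kappa=kappa))
    print(soft_label_cross_entropy(k=12, n=16, mu=0.75))
```

The cross\-entropy term depends only on the ratio $K/N$, whereas the Beta\-Binomial term also depends on $\kappa$, so it can reward or penalize how concentrated the predicted belief is for a given count.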
After validity filtering, the training split contains 565,096 rollouts and 3,174,394 annotated steps\. The average solution contains 5\.62 reasoning steps, with 27\.8 words per step on average\. The dataset is broad in source coverage, containing 38 subsets across diagram understanding, chart and document QA, general visual question answering, science reasoning, and mathematical/geometry reasoning\.
We instantiate both the standard PRM and BetaPRM with four multimodal backbones: InternVL2\.5\-8B \[[9](https://arxiv.org/html/2605.15529#bib.bib9)\], InternVL3\-8B \[[75](https://arxiv.org/html/2605.15529#bib.bib75)\], InternVL3\-14B \[[75](https://arxiv.org/html/2605.15529#bib.bib75)\], and Qwen2\.5\-VL\-7B \[[1](https://arxiv.org/html/2605.15529#bib.bib1)\]\. For each backbone, we insert a `<prm>` marker after every reasoning step and supervise only the marker positions\. The reward mean $\mu_t$ is computed from the `Yes`/`No` reward\-token logits\. For BetaPRM, we additionally attach a lightweight linear head on the marker hidden state to predict the concentration $\kappa_t$\. Unless otherwise specified, we freeze the vision encoder and fine\-tune the language model together with the multimodal projection modules\. Training takes about 48 hours on 4 A100 GPUs for the 7B/8B backbones and about 48 hours on 8 A100 GPUs for the 14B backbone\.
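As an illustration of the two outputs read at each marker position, here is a small sketch in plain Python. The reward mean is the softmax over the two reward\-token logits, which reduces to a sigmoid of their difference. The concentration head is shown as a linear map followed by a softplus and a floor at $\kappa_{\min}$; the softplus link is an assumption made for this sketch, since the text only states that the head is linear, and the function names are hypothetical.

```python
import math


def reward_mean(yes_logit, no_logit):
    """Normalized Yes probability over the two reward tokens (stable sigmoid)."""
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def concentration(hidden, weight, bias, kappa_min=1e-3):
    """Linear head on the marker hidden state, mapped to a positive value.

    The softplus link is an assumption of this sketch; kappa_min follows Table 7.
    """
    raw = sum(h * w for h, w in zip(hidden, weight)) + bias
    softplus = max(raw, 0.0) + math.log1p(math.exp(-abs(raw)))
    return kappa_min + softplus


if __name__ == "__main__":
    mu = reward_mean(yes_logit=2.0, no_logit=0.0)
    kappa = concentration(hidden=[0.5, -1.0], weight=[1.0, 0.25], bias=3.0)
    print(mu, kappa)
```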
Table 6: Source coverage of VisualPRM400K\-v1\.1 used for PRM training\.

| Group | Representative sources |
| --- | --- |
| Diagram / Synthetic Reasoning | AI2D \[[28](https://arxiv.org/html/2605.15529#bib.bib28)\], CLEVR \[[24](https://arxiv.org/html/2605.15529#bib.bib24)\], Super\-CLEVR \[[32](https://arxiv.org/html/2605.15529#bib.bib32)\], NLVR2 \[[55](https://arxiv.org/html/2605.15529#bib.bib55)\], FigureQA \[[26](https://arxiv.org/html/2605.15529#bib.bib26)\], IconQA \[[38](https://arxiv.org/html/2605.15529#bib.bib38)\] |
| Chart / Document / OCR QA | ChartQA \[[44](https://arxiv.org/html/2605.15529#bib.bib44)\], DocVQA \[[45](https://arxiv.org/html/2605.15529#bib.bib45)\], DVQA \[[25](https://arxiv.org/html/2605.15529#bib.bib25)\], InfographicVQA \[[46](https://arxiv.org/html/2605.15529#bib.bib46)\], SROIE \[[23](https://arxiv.org/html/2605.15529#bib.bib23)\] |
| General VQA and Visual Reasoning | VQAv2 \[[16](https://arxiv.org/html/2605.15529#bib.bib16)\], COCO\-ReM \[[52](https://arxiv.org/html/2605.15529#bib.bib52)\], KonIQ\-10k \[[20](https://arxiv.org/html/2605.15529#bib.bib20)\], M3CoT \[[8](https://arxiv.org/html/2605.15529#bib.bib8)\], MAPQA\-SUV \[[6](https://arxiv.org/html/2605.15529#bib.bib6)\] |
| Science and Math Reasoning | ScienceQA \[[39](https://arxiv.org/html/2605.15529#bib.bib39)\], MathV360K \[[51](https://arxiv.org/html/2605.15529#bib.bib51)\], MAVIS variants \[[69](https://arxiv.org/html/2605.15529#bib.bib69)\] |
| Geometry Reasoning | Geo170K \[[15](https://arxiv.org/html/2605.15529#bib.bib15)\], Geometry3K \[[37](https://arxiv.org/html/2605.15529#bib.bib37)\], GeoQA\+ \[[4](https://arxiv.org/html/2605.15529#bib.bib4)\], GEOS \[[50](https://arxiv.org/html/2605.15529#bib.bib50)\], GeomVerse \[[27](https://arxiv.org/html/2605.15529#bib.bib27)\], UniGeo \[[7](https://arxiv.org/html/2605.15529#bib.bib7)\] |
### A\.2 Optimization and Model Hyperparameters
We use the same optimization recipe for the standard PRM baseline and BetaPRM whenever they share the same backbone\. The standard PRM is trained with the cross\-entropy objective over the `Yes`/`No` reward tokens, while BetaPRM replaces this objective with the Beta\-Binomial loss and adds the concentration head\. The InternVL2\.5\-8B, InternVL3\-8B, and InternVL3\-14B experiments use the same hyperparameters; Qwen2\.5\-VL\-7B uses the same optimization settings with its native image preprocessing\. Table [7](https://arxiv.org/html/2605.15529#A1.T7) summarizes the hyperparameters needed to reproduce training\.
Table 7: Optimization and model hyperparameters\. Beta\-Binomial\-specific rows apply only to BetaPRM\.

| Item | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning Rate | $1\times 10^{-5}$ |
| Weight Decay | 0\.05 |
| LR Schedule | Cosine Decay with Warmup |
| Warmup Ratio | 0\.05 |
| Epochs | 1 |
| Global Batch Size | 512 |
| Max Sequence Length | 8192 |
| Trainable Modules | LLM \+ multimodal projector; vision encoder frozen |
| $\epsilon$ in Beta Parameters | $1\times 10^{-6}$ |
| $\kappa_{\min}$ | $1\times 10^{-3}$ |
| Initial $\kappa$ | 4\.0 |
| $L_{\mathrm{reg}}$ Coefficient | $5\times 10^{-2}$ |
| Concentration\-head LR Multiplier | 10\.0 |
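The learning\-rate rows of Table 7 can be read as the following schedule. This is only a sketch: it assumes linear warmup, decay to zero at the end of training, and that the concentration\-head multiplier scales the same schedule, none of which is stated in the table. The helper name `lr_at` is hypothetical.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-5, warmup_ratio=0.05, multiplier=1.0):
    """Cosine decay with warmup, using the peak LR and warmup ratio from Table 7.

    Assumptions of this sketch: linear warmup and a final learning rate of zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        scale = (step + 1) / warmup_steps
    else:
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        scale = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * multiplier * scale


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the paper
    for step in (0, 49, 50, 500, 1000):
        backbone_lr = lr_at(step, total)
        head_lr = lr_at(step, total, multiplier=10.0)  # concentration head
        print(step, backbone_lr, head_lr)
```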
For InternVL backbones, we use dynamic image resolution with image size 448, at most 6 image patches, down\-sampling ratio 0\.5, and drop\-path rate 0\.4\. For Qwen2\.5\-VL\-7B, we use the native Qwen2\.5\-VL preprocessing with minimum and maximum pixel counts of 784 and 200,704, respectively\. All models insert `<prm>` after each reasoning step and compute $\mu_t$ from the `Yes` and `No` logits at these marker positions\.
### A\.3 Best\-of\-N Evaluation Protocol
All PRM selectors use the same candidate pools, so the comparison isolates the effect of the reward model and selection rule\.
Each candidate $y=s_{1:T}$ is formatted by inserting a `<prm>` marker after every reasoning step:

$$\texttt{Question: }x\quad\texttt{Process: }s_{1},\texttt{<prm>},\ldots,s_{T},\texttt{<prm>}.$$

At each marker, the model scores the step using the normalized `Yes` probability over the reward tokens `Yes` and `No`\. For the Standard PRM baseline, candidates are ranked by the average reward,

$$S_{\mathrm{PRM}}(y)=\frac{1}{T}\sum_{t=1}^{T}\mu_{t}.$$

For BetaPRM, we additionally extract the concentration $\kappa_t$ and compute the Beta standard deviation

$$\sigma_{t}=\sqrt{\frac{\mu_{t}(1-\mu_{t})}{\kappa_{t}+1}}.$$

We rank candidates with the risk\-budget selector used in the main experiments:

$$S_{\mathrm{RB}}(y)=\frac{1}{T}\sum_{t=1}^{T}\mu_{t}-\lambda\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}[\sigma_{t}>\tau].$$

The penalty weight $\lambda$ and uncertainty threshold $\tau$ are selected from the same small fixed grid for all reported BetaPRM runs: $\lambda\in\{0.2,0.5,0.7,1.0,1.5\}$, with $\tau$ set by the $q$\-th percentile of step\-level $\sigma_t$, $q\in\{0.7,0.8,0.9\}$\.
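The selector above can be sketched in a few lines of Python. Each candidate is represented as a pair of per\-step lists (reward means, concentrations). The pool over which the percentile of the step\-level standard deviations is taken is not specified in this appendix; the sketch assumes it is the set of all steps in the candidate pool passed in, and the function names are hypothetical.

```python
import math


def beta_std(mu, kappa):
    """Beta standard deviation for mean mu and concentration kappa."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


def quantile(values, q):
    """q-quantile (0 <= q <= 1) with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def risk_budget_score(mus, kappas, lam, tau):
    """Mean reward minus lam times the fraction of steps with sigma_t > tau."""
    T = len(mus)
    mean_reward = sum(mus) / T
    frac_uncertain = sum(beta_std(m, k) > tau for m, k in zip(mus, kappas)) / T
    return mean_reward - lam * frac_uncertain


def select_best(candidates, lam=0.5, q=0.8):
    """Return (index of the best candidate, all scores).

    Assumption of this sketch: tau is the q-quantile of sigma_t over every
    step of every candidate in the given pool.
    """
    sigmas = [beta_std(m, k) for mus, kappas in candidates
              for m, k in zip(mus, kappas)]
    tau = quantile(sigmas, q)
    scores = [risk_budget_score(mus, kappas, lam, tau) for mus, kappas in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    pool = [
        ([0.90, 0.90], [50.0, 50.0]),  # slightly lower reward, high concentration
        ([0.92, 0.92], [0.5, 0.5]),    # slightly higher reward, low concentration
    ]
    print(select_best(pool, lam=0.5, q=0.5))
```

In this toy pool, ranking by the mean reward alone would prefer the second candidate, while the risk\-budget score prefers the first because every step of the second candidate exceeds the uncertainty threshold.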
### A\.4 VisualProcessBench Evaluation Protocol
We evaluate step\-level error detection on VisualProcessBench \[[63](https://arxiv.org/html/2605.15529#bib.bib63)\]\. For each instance, we concatenate the question with the provided step\-by\-step rationale and insert a `<prm>` marker after every step, using the same input format as PRM training\. The model produces one score at each marker, which is then converted into a binary prediction of whether the corresponding step is correct or erroneous\. Neutral labels are ignored when computing metrics\.
For the Standard PRM baseline, the step score is the normalized `Yes` probability $\mu_t$\. For BetaPRM, we use the same reward mean together with the learned concentration to compute $\sigma_t=\sqrt{\mu_t(1-\mu_t)/(\kappa_t+1)}$, and evaluate the risk\-adjusted step score $s_t=\mu_t-\lambda\sigma_t$, with $\lambda=0.5$\. This uses the reliability signal in the same direction as our downstream selection experiments: uncertain positive\-looking steps are scored more conservatively\.
Given a threshold $\tau_{\mathrm{cls}}$, steps with $s_t\geq\tau_{\mathrm{cls}}$ are classified as correct and those below the threshold are classified as erroneous\. Following the benchmark protocol \[[63](https://arxiv.org/html/2605.15529#bib.bib63)\], we choose a single global threshold per model by sweeping $\tau_{\mathrm{cls}}$ and maximizing the overall validation F1\. We report the overall score and the per\-source macro\-F1 breakdown on VisualProcessBench\.
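The thresholding procedure can be sketched as follows. The sketch assumes that the swept metric is the macro\-F1 over the two classes (correct and erroneous) and that neutral steps are marked with `None`; both are illustrative choices, and the function names are hypothetical.

```python
def risk_adjusted_scores(mus, sigmas, lam=0.5):
    """Step scores s_t = mu_t - lam * sigma_t."""
    return [m - lam * s for m, s in zip(mus, sigmas)]


def f1_for_class(preds, labels, positive):
    """F1 of one class, treating `positive` as the positive label."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(preds, labels):
    """Average of the F1 for the correct class (1) and the erroneous class (0)."""
    return 0.5 * (f1_for_class(preds, labels, 1) + f1_for_class(preds, labels, 0))


def best_threshold(scores, labels, grid):
    """Sweep a global threshold and return the one with the highest macro-F1.

    Steps whose label is None (neutral) are dropped before scoring.
    """
    kept = [(s, y) for s, y in zip(scores, labels) if y is not None]
    kept_scores = [s for s, _ in kept]
    kept_labels = [y for _, y in kept]

    def metric(tau):
        preds = [int(s >= tau) for s in kept_scores]
        return macro_f1(preds, kept_labels)

    return max(grid, key=metric)


if __name__ == "__main__":
    scores = risk_adjusted_scores([0.95, 0.85, 0.40, 0.30, 0.50],
                                  [0.10, 0.10, 0.20, 0.20, 0.30])
    labels = [1, 1, 0, 0, None]  # the last step is neutral and is ignored
    print(best_threshold(scores, labels, grid=[0.1, 0.5, 0.95]))
```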
### A\.5 Adaptive Computation Allocation Details
ACA is evaluated under the same maximum Best\-of\-16 budget as the fixed\-budget baseline\. For each problem, ACA first samples $n_0=4$ complete candidates from scratch\. If the stopping criterion is not satisfied, it allocates another batch of $m=4$ candidates, up to the maximum budget $N=16$\. All new candidates are generated by InternVL2\.5\-8B with the same decoding parameters as the fixed\-budget Best\-of\-16 baseline: temperature 0\.7, top\-$p=0.9$, top\-$k=30$, and maximum new tokens 2048\.
At each stage, candidates are scored by the linear risk\-adjusted score used in the ACA stopping rule,

$$S_{\mathrm{lin}}(y)=\frac{1}{T}\sum_{t=1}^{T}(\mu_{t}-\lambda\sigma_{t}),$$

with $\lambda=0.5$\. The lower and upper confidence bounds use $U(y)=T^{-1}\sum_{t}\sigma_{t}$ with $c_{\mathrm{stop}}=0.3$\. When ACA continues, it expands the highest\-UCB non\-winner competitor\. For prefix repair, we use $p_{\mathrm{bad}}=0.3$ as the low\-quality threshold and $c_{\mathrm{cut}}=1.0$ in the conservative step score $\mu_{t}-c_{\mathrm{cut}}\sigma_{t}$\.
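The staged budget can be illustrated with a simplified loop. This sketch makes several assumptions that are not stated in this appendix: the confidence bounds are taken as the linear score minus or plus $c_{\mathrm{stop}}$ times the mean step standard deviation, and sampling stops once the leader's lower bound is at least every competitor's upper bound. It also draws fresh candidates at every stage and leaves out competitor expansion and prefix repair. `sample_batch` is a hypothetical callable that returns a list of (reward means, standard deviations) pairs.

```python
def lin_score(mus, sigmas, lam=0.5):
    """Linear risk-adjusted score: mean over steps of mu_t - lam * sigma_t."""
    return sum(m - lam * s for m, s in zip(mus, sigmas)) / len(mus)


def mean_uncertainty(sigmas):
    """U(y): mean of the step-level standard deviations."""
    return sum(sigmas) / len(sigmas)


def should_stop(pool, lam=0.5, c_stop=0.3):
    """Assumed rule: the leader's LCB is at least every competitor's UCB."""
    stats = [(lin_score(m, s, lam), mean_uncertainty(s)) for m, s in pool]
    leader = max(range(len(pool)), key=lambda i: stats[i][0])
    leader_lcb = stats[leader][0] - c_stop * stats[leader][1]
    return all(leader_lcb >= score + c_stop * unc
               for i, (score, unc) in enumerate(stats) if i != leader)


def adaptive_allocation(sample_batch, n0=4, m=4, max_budget=16):
    """Sample n0 candidates, then add batches of m until stopping or the budget."""
    pool = list(sample_batch(n0))
    while len(pool) < max_budget and not should_stop(pool):
        pool.extend(sample_batch(min(m, max_budget - len(pool))))
    return pool


if __name__ == "__main__":
    # One confident, clearly better candidate: the loop stops after the first batch.
    clear = lambda n: [([0.95], [0.01])] + [([0.20], [0.01])] * (n - 1)
    print(len(adaptive_allocation(clear)))
    # Identical uncertain candidates: the loop uses the full budget.
    tied = lambda n: [([0.60], [0.30])] * n
    print(len(adaptive_allocation(tied)))
```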
For the main ACA results, final candidate selection uses the same risk\-budget selector $S_{\mathrm{RB}}$ as the Best\-of\-16 evaluation\. For the ACA ablation in Table [6\.2](https://arxiv.org/html/2605.15529#S6.SS2.SSS0.Px4), all variants instead use the shared linear score $S_{\mathrm{lin}}$, with $\sigma_t=0$ for the reward\-only Standard PRM baseline\. This keeps the ablation focused on the source of uncertainty\.
## Appendix B Limitations
BetaPRM requires supervision that preserves the Monte Carlo count used to estimate prefix success, rather than only binarized step labels\. Our experiments therefore use VisualPRM400K\-v1\.1 \[[63](https://arxiv.org/html/2605.15529#bib.bib63)\], which, to our knowledge, is the only publicly available PRM training dataset that reports the number of successful continuations for each prefix\. This availability constraint is why our experiments focus on multimodal PRMs, although the Beta\-Binomial formulation itself is not tied to multimodal inputs\.
## Appendix C Broader Societal Impact
BetaPRM improves the reliability and efficiency of reasoning systems by enabling PRMs to report both reward estimates and learned reliability\. Such signals can help downstream methods avoid over\-trusting uncertain judgments, allocate computation more adaptively, and reduce unnecessary inference cost\. More broadly, reliability\-aware reward modeling may make AI reasoning systems easier to audit and more useful for research, education, and other reasoning\-intensive applications\.
Care should still be taken when applying BetaPRM beyond the evaluated benchmarks\. Learned reliability is an additional signal rather than a guarantee of correctness, so high\-stakes uses should involve human oversight, calibration checks, and domain\-specific evaluation\.