The Weakest Link Tells It All: Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment
Summary
This paper proposes Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment (LCA), a framework that jointly learns credit assignment and reward modeling under a weakest-link principle, formulated as a Multiple Instance Learning problem with Softmax-Weighted-Sum pooling. Experiments show it outperforms existing outcome-supervised PRMs across multiple tasks.
View Cached Full Text
Cached at: 06/29/26, 05:25 AM
# The Weakest Link Tells It All: Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment
Source: [https://arxiv.org/html/2606.27739](https://arxiv.org/html/2606.27739)
Tianyu Jia1,2,3\*, Yue Fang1,2,3\*, Hongxin Ding1,2,3, Rihong Qiu1,2,3, Zhibang Yang1,2,3, Zhijing Wu4, Xu Chu2,3,5†, Junfeng Zhao2,3†, Yasha Wang1,3,6† 1National Engineering Research Center of Software Engineering, Peking University, Beijing, China 2School of Computer Science, Peking University, Beijing, China 3Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China 4GRG Banking Equipment Co\., Ltd\., Guangzhou, China 5Center on Frontiers of Computing Studies, Peking University, Beijing, China 6Peking University Information Technology Institute \(Tianjin Binhai\), Tianjin, China \{jiatianyu,yuefang25\}@stu\.pku\.edu\.cn, \{chu\_xu, zhaojf, wangyasha\}@pku\.edu\.cn
###### Abstract
Process reward models \(PRMs\) enhance the reasoning capabilities of large language models \(LLMs\) by providing fine\-grained feedback, yet training PRMs typically requires expensive stepwise annotations\. Outcome\-supervised PRMs offer a scalable alternative by learning from final\-answer correctness alone, but this introduces a fundamental*credit assignment*challenge, i\.e\., attributing outcomes to responsible reasoning steps\. Existing approaches rely on either uniform or causal assignment, both of which fail to anchor credit in step correctness and thus hinder process error identification\.
In this work, we propose Outcome\-Supervised Process Reward Modeling viaLearnableCreditAssignment \(LCA\), an outcome\-supervised PRM framework that jointly learns credit assignment and reward modeling under the principle of*Weakest Link Assignment: a reasoning chain is as strong as its weakest link*\. To address mutual dependence between credit assignment and reward modeling, we formalize outcome\-supervised PRM as a Multiple Instance Learning \(MIL\) problem and introduce Softmax\-Weighted\-Sum \(SWS\) pooling, an MIL pooling technique tailored for strong dependence and redundancy among reasoning states\. We prove Bayes consistency of our algorithm under mild assumptions\. Extensive experiments demonstrate thatLCAconsistently outperforms state\-of\-the\-art outcome\-supervised PRMs across multiple tasks and backbones\. Code is available at[https://anonymous\.4open\.science/r/LCA](https://anonymous.4open.science/r/LCA)\.
The Weakest Link Tells It All: Outcome\-Supervised Process Reward Modeling via Learnable Credit Assignment
Tianyu Jia1,2,3\*, Yue Fang1,2,3\*, Hongxin Ding1,2,3, Rihong Qiu1,2,3, Zhibang Yang1,2,3,Zhijing Wu4, Xu Chu2,3,5†, Junfeng Zhao2,3†, Yasha Wang1,3,6†1National Engineering Research Center of Software Engineering, Peking University, Beijing, China2School of Computer Science, Peking University, Beijing, China3Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China4GRG Banking Equipment Co\., Ltd\., Guangzhou, China5Center on Frontiers of Computing Studies, Peking University, Beijing, China6Peking University Information Technology Institute \(Tianjin Binhai\), Tianjin, China\{jiatianyu,yuefang25\}@stu\.pku\.edu\.cn, \{chu\_xu, zhaojf, wangyasha\}@pku\.edu\.cn
11footnotetext:Equal contribution\.22footnotetext:Corresponding authors\.## 1Introduction
Process reward modeling \(PRM\) plays a crucial role in reasoning\-intensive tasks for large language models \(LLMs\), such as mathematical problem\-solvingShaoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib43)\); Yuet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib44)\)and code generationLiet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib57)\); Zhanget al\.\([2025a](https://arxiv.org/html/2606.27739#bib.bib58)\)\. Unlike outcome reward models \(ORMs\)Cobbeet al\.\([2021](https://arxiv.org/html/2606.27739#bib.bib40)\), which evaluate only the final answer , PRMsWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\); Lightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\); Luoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib3)\)assess the correctness of each intermediate reasoning step, providing dense, step\-level signals that enhance test\-time scalingWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\); Snellet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib41)\)and reinforcement learningSetluret al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib25)\); Chenget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib24)\)\.
However, training PRMs requires step\-level correctness labels, which are prohibitively expensive to collect from humansLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)\. A natural alternative is to use final\-answer correctness, cheap to verify and widely available, as ground\-truth supervision\. To recover step\-level rewards from this trajectory\-level signal, one must solve a*credit assignment*problem: attributing trajectory\-level correctness to the steps that cause it\. This setting is termed*outcome\-supervised PRM*\.
Figure 1:Illustration of different credit assignment strategies\.Existing methods mainly fall into two perspectives\.
- •Uniform Assignmentvia Additive Aggregation: As illustrated in Fig\.[1](https://arxiv.org/html/2606.27739#S1.F1)\(a\), this lineChenet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib42)\); Yuanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib28)\)aggregates step\-level scores by summation or averaging into a single trajectory score for optimization\. Every step thus receives equal credit, regardless of which is actually responsible for the outcome\.
- •Causal Assignmentvia Rollout Estimation: As illustrated in Fig\.[1](https://arxiv.org/html/2606.27739#S1.F1)\(b\), this lineWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\); Luoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib3)\); Guanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib23)\)pseudo\-labels each step via Monte Carlo estimation of the probability of reaching a correct final answer\. Each step’s score thus reflects the expected success of its downstream rollouts rather than its own correctness, making the actual error hard to localize\. The estimation further depends on rollout policy and sampling design, which risks introducing bias and noise\.
Both approaches share a common deficiency: neitherattributes credit based on whether each step is actually correct\. Uniform Assignment treats every step the same; Causal Assignment lets a step’s score depend on what comes after it\. We propose to anchor credit assignment directly in step correctness, drawing on a long\-standing intuition: a reasoning chain is only as strong as its weakest linkPrasadet al\.\([2023](https://arxiv.org/html/2606.27739#bib.bib47)\); Jacoviet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib49)\); Liet al\.\([2023](https://arxiv.org/html/2606.27739#bib.bib48)\); Gilda and Gilda \([2026](https://arxiv.org/html/2606.27739#bib.bib46)\)\. A chain is correct if and only if every step is correct, and incorrect if and only if any step is wrong\. We term this*Weakest Link Assignment*\(Figure[1](https://arxiv.org/html/2606.27739#S1.F1)\(c\)\)\. Training under this principle directly targets the same first\-error criterion widely adopted in PRM evaluation and applicationsChenget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib24)\); Lightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\); Zhenget al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib35)\), thereby closing the training–test gap and producing a PRM that accurately attributes credit to the outcome\-determining steps\.
Yet applying this principle to training poses a chicken\-and\-egg dilemma: assigning credit to specific steps requires knowing which step is wrong, while identifying wrong steps is precisely what the PRM is being trained to do\.Credit assignmentandreward modelingare thus mutually dependent\.
To resolve this challenge, we propose Outcome\-Supervised Process Reward Modeling viaLearnableCreditAssignment \(LCA\), an outcome\-supervised PRM framework that jointly learns credit assignment and reward modeling\. Through a structural reduction, we formalize outcome\-supervised PRM as a Multiple Instance Learning \(MIL\) problemWaqaset al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib55)\): each trajectory prefix is an “instance”, the whole trajectory is a “bag” \(i\.e\., a set of instances\), and the Weakest Link principle is the Standard Multi\-Instance Assumption verbatim\. This formalization places outcome\-supervised PRM within the principled MIL framework, which jointly learns instance classification and the bag\-to\-instance attribution under bag\-level supervision, thereby dissolving the chicken\-and\-egg dilemma between credit assignment and reward modeling\. Through Softmax\-Weighted\-Sum \(SWS\) pooling, we resolve the degeneration of mainstream MIL methods under inter\-instance dependence and redundancy, and embed a smoothed Weakest Link Assignment as a structural prior, yielding a learning objective that is Bayes consistent at the instance level\.
To summarize, our contributions are threefold:
- •Insightfully, through a systematic analysis of existing outcome\-supervised PRM methods, we identify a shared deficiency: neitherUniform AssignmentnorCausal Assignmentanchors credit in step correctness\. We proposeWeakest Link Assignmentas a principled alternative, yielding PRMs that accurately attribute credit to the steps that cause the outcome\.
- •Technically, we proposeLCA, an outcome\-supervised PRM framework that \(1\)formalizes outcome\-supervised credit assignment as a Multiple Instance Learning \(MIL\) problemunder the Weakest Link principle; \(2\)introduces Softmax\-Weighted\-Sum \(SWS\) pooling, an MIL pooling technique that adaptively reweights reasoning steps and embeds a smoothed Weakest Link Assignment as a structural prior, addressing the degeneration of mainstream MIL methods under inter\-instance dependence and redundancy; and \(3\)proves Bayes consistencyof the resulting learning objective at the instance level\.
- •Experimentally, on error identification and best\-of\-N test\-time scaling,LCAconsistently outperforms current state\-of\-the\-art outcome\-supervised PRM baselines across multiple LLM backbones and generator models\. Ablation studies further validate SWS pooling against widely used MIL pooling techniques and characterize the role of the temperature parameter\.
## 2Preliminaries
### 2\.1Multiple Instance Learning
Multiple Instance Learning \(MIL\) is a weakly\-supervised learning method designed for settings where only coarse\-grained labels are available but fine\-grained classification is required\. MIL assumes a hierarchical structure of instances and bags\. Let𝒳\\mathcal\{X\}be the instance space\. A bagBBis a finite set of instancesxi∈𝒳x\_\{i\}\\in\\mathcal\{X\}\. The bag space isℬ\\mathcal\{B\}, with a data distributionℙ\\mathbb\{P\}overℬ\\mathcal\{B\}\.
In the standard binary MIL setting, each instance and each bag has a labely∈\{0,1\}y\\in\\\{0,1\\\}\(0 = negative, 1 = positive\)\. Instance labels are unobserved; only bag labels are given\. Given a training setD=\{⟨Bk,y\(Bk\)⟩\}k=1KD=\\\{\\langle B^\{k\},y\(B^\{k\}\)\\rangle\\\}\_\{k=1\}^\{K\}drawn fromℙ\\mathbb\{P\}, the goal is to learn an instance\-level classifierf:𝒳→\[0,1\]f:\\mathcal\{X\}\\to\[0,1\]\.
Most MIL algorithms rely on the*Standard Multi\-Instance Assumption*Foulds and Frank \([2010](https://arxiv.org/html/2606.27739#bib.bib53)\):
###### Assumption 2\.1\(Standard Multi\-Instance Assumption\)\.
A bag is positive if and only if it contains at least one positive instance:
y\(B\)=max\{y\(x1\),…,y\(xn\)\},\\displaystyle y\(B\)=\\max\\\{y\(x\_\{1\}\),\\dots,y\(x\_\{n\}\)\\\},∀B=\{x1,…,xn\}∈ℬ\.\\displaystyle\\quad\\forall B=\\\{x\_\{1\},\\dots,x\_\{n\}\\\}\\in\\mathcal\{B\}\.
This assumption holds in many real\-world applications, such as anomaly detection in long time series or whole\-slide images \(WSIs\)Xuet al\.\([2014](https://arxiv.org/html/2606.27739#bib.bib6)\); Earlyet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib7)\)\. Anomalies typically appear in local regions, yet the label is assigned to the entire input\. The goal of MIL in such cases is to pinpoint the anomaly location\.
Modern MIL methods are predominantly pooling\-based\. These methods develop instance\-level classifiers and use a pooling function to aggregate instance predictions into a bag\-level prediction\. Standard supervised learning optimizes the bag\-level prediction, typically using a cross\-entropy loss:
LCE\(y\(B\),Pool\(\{p1,…,pn\}\)\),\\displaystyle L\_\{CE\}\\Big\(y\(B\),\\text\{Pool\}\(\\\{p\_\{1\},\\dots,p\_\{n\}\\\}\)\\Big\),wherep1,…,pnp\_\{1\},\\dots,p\_\{n\}are instance predictions andPoolis the pooling function\. In this way, instance classification and bag\-to\-instance attribution are jointly optimized in an end\-to\-end manner\.
A more detailed discussion of MIL pooling techniques is deferred to Appendix[B](https://arxiv.org/html/2606.27739#A2)\.
### 2\.2Process Reward Model
#### Reasoning trajectory\.
A reasoning trajectoryτ\\tauconsists of a questionqqfollowed by a sequence of stepss1,…,sTs\_\{1\},\\dots,s\_\{T\}:
τ:=⟨q,s1,…,sT⟩,\\tau:=\\langle q,s\_\{1\},\\dots,s\_\{T\}\\rangle,where each step advances the reasoning based on the question and all previous steps\.111For clarity, we consider only sequential reasoning without self\-correction in the main text\. Our method naturally applies to reasoning with reflection and backtracking through a simple conceptual transformation; see Appendix[E](https://arxiv.org/html/2606.27739#A5)\.An incomplete trajectory is termed a*prefix*:
τ:t:=⟨q,s1,…,st⟩,t≤T\.\\tau\_\{:t\}:=\\langle q,s\_\{1\},\\dots,s\_\{t\}\\rangle,\\quad t\\leq T\.
#### Correctness of reasoning prefixes\.
A reasoning step is logically erroneous if its conclusion is false or does not follow from the preceding context\. Once such an error occurs, subsequent steps inherit the error, making their individual correctness ill\-definedLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)\. We therefore assign correctness labels to prefixes rather than to isolated steps\.
Formally, define correctness of a prefixτ:t\\tau\_\{:t\}as
y\(τ:t\)∈\{0,1\},y\(\\tau\_\{:t\}\)\\in\\\{0,1\\\},where11denotes thatτ:t\\tau\_\{:t\}contains logical error\(s\) \(positive\) and0denotes no error \(negative\), matching the standard MIL convention\. By this definition, error monotonicity follows immediately:
y\(τ:t\+1\)≥y\(τ:t\),y\(\\tau\_\{:t\+1\}\)\\geq y\(\\tau\_\{:t\}\),\(1\)because any extension of an erroneous prefix remains erroneous\. The correctness of the full trajectory is thus
y\(τ\)=y\(τ:T\)=max\{y\(τ:1\),…,y\(τ:T\)\},y\(\\tau\)=y\(\\tau\_\{:T\}\)=\\max\\\{y\(\\tau\_\{:1\}\),\\dots,y\(\\tau\_\{:T\}\)\\\},\(2\)which is precisely the Weakest Link Assignment\.
#### Process reward modeling\.
Process reward modeling learns a classifierffthat maps each prefix to a correctness score:f\(τ:t\)∈\[0,1\]f\(\\tau\_\{:t\}\)\\in\[0,1\], with the goal thatf\(τ:t\)≈y\(τ:t\)f\(\\tau\_\{:t\}\)\\approx y\(\\tau\_\{:t\}\)\. This is equivalent to*first\-error detection*Lightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\); Zhenget al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib35)\): the first prefix for whichffsignals an error identifies the earliest step that caused the failure\.
## 3Analysis
Figure 2:Overview ofLCA\. The PRM outputs per\-prefix error probabilities𝒑\\bm\{p\}, which are then aggregated via Softmax\-Weighted\-Sum \(SWS\) pooling\. The PRM is trained end‑to‑end with a trajectory‑level cross‑entropy loss\.### 3\.1Process Reward Modeling as Multiple Instance Learning
Under the Weakest Link principle, we formalize outcome\-supervised process reward modeling as a Multiple Instance Learning \(MIL\) problem:
- •Data structure\.A full trajectoryτ\\taucorresponds to a bag, and each prefixτ:t\\tau\_\{:t\}to an instance\. We thus use “bag” and “trajectory” \(resp\. “instance” and “prefix”\) interchangeably when the context is clear\.
- •Assumption\.By the definition of prefix correctness, the bag\-level label is the maximum over instance labels \(Eq\.[2](https://arxiv.org/html/2606.27739#S2.E2)\); this exactly matches the Standard Multi\-Instance Assumption\.
- •Learning objective\.Given a dataset with trajectory\-level labels:D=\{⟨τk,y\(τk\)⟩\}k=1KD=\\\{\\langle\\tau^\{k\},y\(\\tau^\{k\}\)\\rangle\\\}\_\{k=1\}^\{K\}, our goal is to learn a prefix\-level classifierffsuch thatf\(τ:t\)≈y\(τ:t\)f\(\\tau\_\{:t\}\)\\approx y\(\\tau\_\{:t\}\), consistent with the paradigm of MIL\.
Despite this formal alignment, outcome\-supervised PRM raises a unique challenge absent in classic MIL:prefixes within a trajectory are both highly dependent and redundant, violating the common assumption of instance independence or weak dependenceWaqaset al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib55)\)\. This violation stems from three structural properties of reasoning trajectories:
- •Autoregressive: each reasoning step is conditioned on all previous steps\.
- •Nested: each prefix includes previous ones\.
- •Monotonic: prefix labels are monotonic \(Eq\.[1](https://arxiv.org/html/2606.27739#S2.E1)\)\.
Together, these three properties render each instance highly informative about the others\. Most critically, the last prefix alone suffices to determine the bag label, making all earlier prefixes ignorable for bag\-level classification\.
Consequently, conventional MIL pooling techniques struggle in our setting\. In preliminary experiments, we applied max\-pooling\-based MIL, the most direct realization of the Standard Multi\-Instance Assumption\. Yet the PRM rapidly degenerated into an ORM, producing meaningful predictions only at the final prefix while outputting a trivial value for all preceding prefixes\. Detailed setup and results are provided in Appendix[D\.1](https://arxiv.org/html/2606.27739#A4.SS1)\. Attention\-based MIL methods, though most prevalent in the deep learning era, are prone to a similar failure mode\. The model could focus its attention exclusively on the last prefix, thus offering no guarantee of instance\-level classification performance\.
### 3\.2Multiple Instance Learning with Softmax\-Weighted\-Sum Pooling
To handle the strong dependence and redundancy among instances, we propose Softmax\-Weighted\-Sum \(SWS\) pooling\. Given instance\-level predictions𝒑=\(p1,…,pT\)∈\[0,1\]T\\bm\{p\}=\(p\_\{1\},\\dots,p\_\{T\}\)\\in\[0,1\]^\{T\}, the bag\-level prediction is
SWSα\(𝒑\)=∑t=1Twtpt,wt=ept/α∑j=1Tepj/α,\\text\{SWS\}\_\{\\alpha\}\(\\bm\{p\}\)=\\sum\_\{t=1\}^\{T\}w\_\{t\}p\_\{t\},\\quad w\_\{t\}=\\frac\{e^\{p\_\{t\}/\\alpha\}\}\{\\sum\_\{j=1\}^\{T\}e^\{p\_\{j\}/\\alpha\}\},\(3\)where the temperatureα\>0\\alpha\>0controls the entropy of the weight distribution\. Asα→0\\alpha\\to 0, SWS approaches max pooling; asα→∞\\alpha\\to\\infty, it approaches average pooling\.
SWS pooling offers two key properties for our setting\. First, because every weightwtw\_\{t\}admits a strictly positive lower bound under bounded bag sizes, each instance is guaranteed to receive a non\-negligible credit\. This prevents the model from basing its predictions on an overly narrow subset of instances and thus promotes more generalizable instance classification\. Second, the softmax weighting naturally encodes the Standard Multi\-Instance Assumption as a structural prior: the weight of each instance grows with its predicted positivity, aligning with the Weakest Link Assignment where positive instances dominate the bag prediction\.
SWS\-pooling\-based MIL provides an adaptive yet principled credit assignment mechanism\. The model learns instance predictionsptp\_\{t\}, which in turn determine the softmax weightswtw\_\{t\}and thus each instance’s contribution to the bag\-level loss\. In this way, the model*learns*which instances drive the optimization objective under pure bag\-level supervision, while the softmax\-based architecture embeds the Standard MIL assumption to constrain the solution space\.
### 3\.3Theoretical Analysis
We provide theoretical results about the Bayes consistency of SWS pooling\. The analysis applies not only to our settings of outcome\-supervised PRM but to any MIL setting that satisfies Assumption[3\.1](https://arxiv.org/html/2606.27739#S3.Thmassumption1)\. We therefore adopt the general MIL notation introduced in Section[2\.1](https://arxiv.org/html/2606.27739#S2.SS1)\.
###### Definition 3\.1\(Bag cover set\)\.
LetS⊆𝒳S\\subseteq\\mathcal\{X\}be a set of instances\. The*bag cover set*ofSSis the set of all bags that contain at least one instance fromSS:
𝒞\(S\):=\{B∈ℬ∣B∩S≠∅\}\.\\mathcal\{C\}\(S\):=\\\{B\\in\\mathcal\{B\}\\mid B\\cap S\\neq\\emptyset\\\}\.Similarly, the*positive bag cover set*𝒞\+\(S\)\\mathcal\{C\}^\{\+\}\(S\)is the set of all positive bags containing at least one instance fromSS; the*negative bag cover set*𝒞−\(S\)\\mathcal\{C\}^\{\-\}\(S\)is defined analogously\.
###### Assumption 3\.1\(MM\-Assumption\)\.
There exists a constantM\>0M\>0such that for any setSSof negative instances, if
ℙ\(𝒞\(S\)\)=ℙ\(𝒞\+\(S\)\)\+ℙ\(𝒞−\(S\)\)\>0,\\mathbb\{P\}\(\\mathcal\{C\}\(S\)\)=\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S\)\)\+\\mathbb\{P\}\(\\mathcal\{C\}^\{\-\}\(S\)\)\>0,then
ℙ\(𝒞\+\(S\)\)ℙ\(𝒞−\(S\)\)≤M\.\\frac\{\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S\)\)\}\{\\mathbb\{P\}\(\\mathcal\{C\}^\{\-\}\(S\)\)\}\\leq M\.
Assumption[3\.1](https://arxiv.org/html/2606.27739#S3.Thmassumption1)is a mild assumption that prevents the probability ratio of a negative instance appearing in positive bags versus negative bags from being arbitrarily large\.
###### Theorem 3\.1\(Unique global minimizer\)\.
Let Assumption[3\.1](https://arxiv.org/html/2606.27739#S3.Thmassumption1)hold\. LetNNbe the maximum allowed bag size\. Letf\(x\)∈\[0,1\]f\(x\)\\in\[0,1\]be an instance classifier\. For a bagB=\(x1,…,xn\)B=\(x\_\{1\},\\dots,x\_\{n\}\)with labely\(B\)∈\{0,1\}y\(B\)\\in\\\{0,1\\\}, define the softmax weighted sum
p^\(B\)=SWSα\(f\(x1\),⋯,f\(xn\)\),\\hat\{p\}\(B\)=\\text\{SWS\}\_\{\\alpha\}\(f\(x\_\{1\}\),\\cdots,f\(x\_\{n\}\)\),and the expected loss
ℒ\(f\)=𝔼B∼ℙLCE\(p^\(B\),y\(B\)\)\.\\mathcal\{L\}\(f\)=\\mathbb\{E\}\_\{B\\sim\\mathbb\{P\}\}L\_\{\\text\{CE\}\}\(\\hat\{p\}\(B\),y\(B\)\)\.Then there exists a sufficiently small temperatureα\>0\\alpha\>0such that, up to a set of instancesSSwhose bag cover set has probability zero,ℒ\\mathcal\{L\}is uniquely minimized by the true label function:
f∗\(x\)=y\(x\)∈\{0,1\},∀x∉S\.f^\{\*\}\(x\)=y\(x\)\\in\\\{0,1\\\},\\qquad\\forall x\\notin S\.
###### Theorem 3\.2\(Convergence in probability\)\.
Let Assumption[3\.1](https://arxiv.org/html/2606.27739#S3.Thmassumption1)hold\. LetNNbe the maximum allowed bag size\. Letf∗\(x\)=y\(x\)f^\{\*\}\(x\)=y\(x\)be the Bayes optimal classifier, and let\{fl\}\\\{f\_\{l\}\\\}be a sequence of instance classifiers\. Define the expected lossℒ\\mathcal\{L\}as in Theorem[3\.1](https://arxiv.org/html/2606.27739#S3.Thmtheorem1)\. Then there exists a sufficiently small temperatureα\>0\\alpha\>0such that, if\{ℒ\(fl\)\}\\\{\\mathcal\{L\}\(f\_\{l\}\)\\\}converges toℒ\(f∗\)\\mathcal\{L\}\(f^\{\*\}\):
liml→∞ℒ\(fl\)=argminfℒ\(f\)=ℒ\(f∗\),\\lim\_\{l\\to\\infty\}\\mathcal\{L\}\(f\_\{l\}\)=\\arg\\min\_\{f\}\\mathcal\{L\}\(f\)=\\mathcal\{L\}\(f^\{\*\}\),then\{fl\}\\\{f\_\{l\}\\\}converges tof∗f^\{\*\}in probability\.
Together, Theorem[3\.1](https://arxiv.org/html/2606.27739#S3.Thmtheorem1)and Theorem[3\.2](https://arxiv.org/html/2606.27739#S3.Thmtheorem2)guarantee a desirable property of SWS\-pooling\-based MIL algorithms: by minimizing the expected lossℒ\\mathcal\{L\}, the instance classifierffapproximates the Bayes optimal classifierf∗f^\{\*\}\. Consequently, even without instance\-level supervision, the learned classifier is expected to perform well on instance classification\.
Our theoretical analysis reveals the central role of temperatureα\\alphain SWS pooling\. Intuitively, a largerα\\alphamakes the credit assignment more uniform, risking misclassification of negative instances that frequently appear in positive bags\. So Bayes consistency requiresα\\alphato be sufficiently small, as stated in Theorem[3\.1](https://arxiv.org/html/2606.27739#S3.Thmtheorem1)and Theorem[3\.2](https://arxiv.org/html/2606.27739#S3.Thmtheorem2)\. On the other hand, however, an excessively smallα\\alphaconcentrates the credit assignment on a single instance, thereby starving co\-occurring positive instances of adequate optimization signals\. Consequently, the choice ofα\\alphainvolves a trade\-off between theoretical soundness and empirical effectiveness\.
A more formal statement of the definitions, assumptions, theorems, and their detailed proofs can be found in Appendix[F](https://arxiv.org/html/2606.27739#A6)\.
## 4Method
Building on the above analysis, we proposeLCA, an outcome\-supervised PRM framework that jointly learns credit assignment and reward modeling via SWS\-pooling\-based MIL algorithm\. Figure[2](https://arxiv.org/html/2606.27739#S3.F2)illustrates the overall pipeline\.
We adopt the standard discriminative PRM architectureZhenget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib54)\), which appends a binary classification head to a pretrained LLM backbone\. Given a full reasoning trajectoryτ=⟨q,s1,…,sT⟩\\tau=\\langle q,s\_\{1\},\\dots,s\_\{T\}\\rangle, the LLM backboneℳ\\mathcal\{M\}extracts per\-prefix representations in a single forward pass and the classification headψ\\psioutputs the predicted error probability for each prefix:
f\(τ\)=ψ\(ℳ\(τ\)\)=\(p1,…,pT\)∈\[0,1\]T,f\(\\tau\)=\\psi\(\\mathcal\{M\}\(\\tau\)\)=\(p\_\{1\},\\dots,p\_\{T\}\)\\in\[0,1\]^\{T\},\(4\)whereptp\_\{t\}denotes the probability that prefixτ:t\\tau\_\{:t\}is*incorrect*\.
We then aggregate these prefix\-level predictions into a trajectory\-level score using SWS pooling:
p^\(τ\)=SWSα\(𝒑\)=SWSα\(p1,⋯,pT\)\.\\hat\{p\}\(\\tau\)=\\text\{SWS\}\_\{\\alpha\}\(\\bm\{p\}\)=\\text\{SWS\}\_\{\\alpha\}\(p\_\{1\},\\cdots,p\_\{T\}\)\.\(5\)
We train the PRM end\-to\-end by minimizing the empirical cross\-entropy loss at trajectory level:
ℒ^K\\displaystyle\\hat\{\\mathcal\{L\}\}\_\{K\}=1K∑k=1KLCE\(p^\(τk\),y\(τk\)\)\\displaystyle=\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}L\_\{CE\}\\bigl\(\\hat\{p\}\(\\tau^\{k\}\),y\(\\tau^\{k\}\)\\bigr\)\(6\)=−1K∑k=1K\(y\(τk\)logp^\(τk\)\\displaystyle=\-\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}\\Bigl\(y\(\\tau^\{k\}\)\\log\\hat\{p\}\(\\tau^\{k\}\)\+\(1−y\(τk\)\)log\(1−p^\(τk\)\)\)\.\\displaystyle\\qquad\\qquad\+\(1\-y\(\\tau^\{k\}\)\)\\log\(1\-\\hat\{p\}\(\\tau^\{k\}\)\)\\Bigr\)\.wherey\(τk\)∈\{0,1\}y\(\\tau^\{k\}\)\\in\\\{0,1\\\}indicates whether the final answer is correct \(0\) or incorrect \(11\)\. No step\-level supervision is required\. In practice, we apply a stop\-gradient operation to the softmax weightswtw\_\{t\}to stabilize training\.
ModelGSM8KMATHOlympiadBenchOmni\-MATHAvg\.Large Language Models as CriticLlama\-3\-8B\-Instruct13\.113\.84\.812\.611\.1Llama\-3\.1\-8B\-Instruct10\.95\.12\.81\.65\.1Qwen2\.5\-7B\-Instruct36\.536\.629\.727\.432\.6Qwen2\.5\-Math\-7B\-Instruct26\.825\.714\.212\.719\.9Qwen2\.5\-Coder\-7B\-Instruct14\.36\.54\.11\.86\.7Open Source Process Reward ModelsMath\-Shepherd\-PRM\-7B47\.929\.524\.823\.831\.5RLHFlow\-PRM\-Mistral\-8B50\.433\.413\.815\.828\.4RLHFlow\-PRM\-Deepseek\-8B38\.833\.816\.916\.926\.6EurusPRM\-Stage144\.335\.621\.723\.131\.2EurusPRM\-Stage247\.335\.721\.220\.931\.3Qwen2\.5\-Math\-7B\-Math\-Shepherd\-PRM62\.531\.613\.77\.728\.9Process Reward Models Trained onLlama3\.2\-3B\-InstructImplicitPRM54\.834\.122\.212\.430\.9MathShepherd38\.822\.714\.511\.421\.8SCAN47\.627\.616\.212\.225\.9OmegaPRM38\.730\.118\.214\.125\.3PQM34\.438\.320\.227\.930\.2\\rowcolorblue\!10LCA\(Ours,α=1\\alpha=1\)48\.442\.130\.329\.437\.6Process Reward Models Trained onQwen2\.5\-Math\-7B\-InstructImplicitPRM45\.236\.323\.525\.132\.5MathShepherd59\.231\.112\.78\.127\.8SCAN60\.633\.915\.06\.829\.1OmegaPRM44\.837\.122\.918\.430\.8PQM55\.743\.928\.524\.538\.1\\rowcolorblue\!10LCA\(Ours,α=1\\alpha=1\)65\.748\.726\.219\.940\.2
Table 1:F1 scores of LLM\-as\-Critic models and PRMs on ProcessBench\.Reward ModelMistral\-7B\-Inst\-v0\.2Pass@1: 9\.6Llama\-3\.1\-8B\-InstPass@1: 44\.6Qwen\-2\.5\-7B\-InstPass@1: 65\.8Avg\.@8@32@64@8@32@64@8@32@64Majority Vote18\.221\.022\.856\.662\.263\.267\.870\.270\.650\.2ImplicitPRM19\.018\.417\.659\.461\.062\.870\.468\.269\.249\.6MathShepherd24\.630\.033\.259\.065\.665\.467\.269\.669\.653\.8SCAN26\.431\.033\.659\.865\.266\.467\.469\.269\.254\.2OmegaPRM13\.415\.217\.253\.455\.457\.466\.268\.068\.446\.1PQM21\.827\.028\.858\.260\.660\.469\.870\.670\.452\.0\\rowcolorblue\!10LCA\(Ours,α=1\\alpha=1\)25\.031\.235\.459\.466\.067\.869\.071\.471\.455\.2
Table 2:Different PRMs’ best\-of\-N sampling performance on MATH\-500 with three different generation models\.
## 5Experiments
We conduct a series of experiments to answer four research questions:
- •RQ1: Can PRMs trained withLCAreliably locate errors in reasoning trajectories?
- •RQ2: Can PRMs trained withLCAenhance LLM reasoning capability via test\-time scaling?
- •RQ3: How does the temperatureα\\alphaaffect SWS pooling’s performance?
- •RQ4: Does SWS pooling outperform other pooling techniques on process reward modeling?
Our experiments focus on math reasoning, where the correctness of intermediate steps and final answers is relatively well\-defined and easy to verify\. Unless otherwise stated, we use the correctness of the final answer as the trajectory\-level label, without any process\-level annotation\.
Implementation details for all experiments are given in Appendix[C](https://arxiv.org/html/2606.27739#A3)\.
### 5\.1Experiments on Error Identification \(RQ1\)
Datasets and Models\.We use the publicly available Math\-Shepherd datasetWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\)as training set\. Though this dataset provides step\-level labels generated by Monte Carlo estimation, we discard these process labels and use only the problems and their final answers forLCA\. We use Qwen2\.5\-Math\-7B\-InstructYanget al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib18)\)and Llama3\.2\-3B\-InstructGrattafioriet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib17)\)as the LLM backbone for training PRMs\.
Evaluation Metrics\.We use ProcessBenchZhenget al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib35)\)to measure the PRM’s ability to identify the first error location in a response\. The benchmark evaluates recognition of fully correct responses and precise error identification in incorrect responses\. We calculate accuracies on correct and erroneous samples and report their harmonic mean as the final F1 score\.
Baselines\.We compare our method with the current state\-of\-the\-art baselines:
- •ImplicitPRMYuanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib28)\)parameterizes the reward model as a log\-likelihood ratio of policy and reference models, training with trajectory\-level correctness labels\.
- •MathShepherdWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\)uses brute\-force Monte Carlo estimation to generate hard pseudo labels for every step\.
- •SCANDinget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib26)\)denoises Monte Carlo pseudo labels using the confidence of the data generator\.
- •OmegaPRMLuoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib3)\)uses Monte Carlo tree search to selectively generate high\-quality reasoning trajectories with soft pseudo labels\.
- •PQMLi and Li \([2025](https://arxiv.org/html/2606.27739#bib.bib27)\)formulates process reward modeling as a Q\-value ranking problem and trains PRMs with a ranking loss, which requires estimation of the first error location\.
We also incorporate several 7B\-scale open\-source PRMs and LLM\-as\-critic models as baselines; their results are taken directly from previous researchZhanget al\.\([2025b](https://arxiv.org/html/2606.27739#bib.bib22)\)\.
#### Main Results\.
Table[1](https://arxiv.org/html/2606.27739#S4.T1)reports evaluation results on ProcessBench\.Our method achieves strong performance, outperforming the second\-best by 21\.7% on the 3B backbone and 5\.5% on the 7B backbone\.On the 7B backbone,LCAsurpasses all baselines and establishes a new state of the art\. On the 3B backbone,LCAoutperforms all competing 3B methods by a substantial margin\. Notably, our method avoids the heavy computational cost of Monte Carlo sampling required by MathShepherd, SCAN, OmegaPRM and PQM, yet consistently surpasses them in performance\.
More detailed results can be found in Table[4](https://arxiv.org/html/2606.27739#A3.T4)in Appendix[D\.2](https://arxiv.org/html/2606.27739#A4.SS2)\.
### 5\.2Experiments on Test\-Time Scaling \(RQ2\)
Datasets and Models\.We train PRMs on the Math\-Shepherd dataset, using Qwen2\.5\-Math\-7B\-Instruct as the backbone\. We evaluate test\-time scaling on MATH\-500Lightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\), a subset of MATH dataset\. To test generalizability, we select three generator models of different sizes, families, and reasoning capabilities: Mistral\-7B\-Instruct\-v0\.2Jianget al\.\([2023](https://arxiv.org/html/2606.27739#bib.bib36)\), Meta\-Llama\-3\-8B\-InstructGrattafioriet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib17)\), and Qwen2\.5\-7B\-InstructQwenet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib38)\)\.
Evaluation Metrics\.Following prior workWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\); Lightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\); Luoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib3)\), we evaluate PRMs by their verification ability using best\-of\-N sampling\. The metric BON@n measures the correctness of the most preferred trajectory selected by the PRM fromnncandidates per question\.
Baselines\.We compare our method with the same baselines as in Section[5\.1](https://arxiv.org/html/2606.27739#S5.SS1): ImplicitPRM, MathShepherd, SCAN, OmegaPRM, and PQM\.
Main Results\.Table[2](https://arxiv.org/html/2606.27739#S4.T2)reports the best\-of\-N performance forN=8N=8,3232, and6464\.Our method achieves the highest average score across generators\.Moreover,LCAattains the best results on all three generation models atN=32N=32andN=64N=64, demonstrating strong scalability and generalization\. Although our method is primarily designed for intra\-trajectory credit assignment rather than for validating the final answer, it still delivers consistent gains on best\-of\-N accuracy\. We attribute this to the Weakest Link Assignment, which anchors credit assignment directly in step correctness\. Therefore, the PRM learns to focus on per\-step correctness instead of relying on spurious correlations or shortcuts, which mitigates overfitting and improves generalization\.
### 5\.3Ablation Study \(RQ3&4\)
We conduct ablation studies to examine the effect of temperatureα\\alphaand the advantage of SWS pooling over other common MIL pooling techniques\.
#### Effect of Temperature\(RQ3\)\.
Following the settings of our main experiments, we train PRMs with differentα\\alphausing Qwen2\.5\-Math\-7B\-Instruct as backbone and Math\-Shepherd as training set, and report their F1 scores on ProcessBench splits\.
Figure 3:Performance on ProcessBench splits with varying temperatureα\\alpha\.Figure[3](https://arxiv.org/html/2606.27739#S5.F3)shows the results\. Asα\\alphaincreases, performance on GSM8K decreases, while performance on OlympiadBench and OmniMath increases\. This finding aligns with our theoretical insights\. A largerα\\alphamakes credit assignment more uniform across instances, which encourages the model to identify as many positive instances as possible within a positive bag, and thus benefits process error detection\. However, this also risks misclassifying negative instances that appear in positive bags, e\.g\., correct prefixes of eventually wrong trajectories\. Consequently, asα\\alphaincreases, performance drops on simple datasets with fewer process errors \(e\.g\., GSM8K\) but improves on complex datasets with substantial process errors \(e\.g\., OlympiadBench and OmniMath\)Zhanget al\.\([2025b](https://arxiv.org/html/2606.27739#bib.bib22)\)\. This again highlights that choosingα\\alpharequires a trade\-off between theoretical consistency and practical effectiveness\.
Figure 4:Average F1 scores of different pooling techniques on ProcessBench\.
#### Comparing SWS Pooling with Other Pooling Techniques \(RQ4\)\.
We compare SWS pooling against commonly used MIL pooling functions: instance poolingWanget al\.\([2018](https://arxiv.org/html/2606.27739#bib.bib10)\), embedding poolingTiboet al\.\([2018](https://arxiv.org/html/2606.27739#bib.bib11)\), attention poolingIlseet al\.\([2018](https://arxiv.org/html/2606.27739#bib.bib12)\), and conjunctive poolingJang and Kwon \([2025](https://arxiv.org/html/2606.27739#bib.bib15)\)\. Appendix[B](https://arxiv.org/html/2606.27739#A2)describes these techniques in detail\. We also include a fully supervised PRM that uses all step\-level ground\-truth labels as an upper bound baseline\. To avoid potential label noise and ensure consistency with standard MIL settings, we curate a training set from the human\-annotated PRM800K datasetLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)and aggregate step labels to obtain trajectory labels \(see Appendix[C\.5](https://arxiv.org/html/2606.27739#A3.SS5)for details\)\. We use Qwen3\-4BYanget al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib39)\)as the backbone to train PRMs\.
Figure[4](https://arxiv.org/html/2606.27739#S5.F4)reports the average F1 scores on ProcessBench\. SWS pooling with temperatureα=0\.2\\alpha=0\.2achieves the best score, surpassing other pooling techniques and closely matching the stepwise\-supervised baseline\. Temperatures higher or lower than 0\.2 yield suboptimal performance, while the optimal temperature depends on data and model\.
## 6Conclusion
In this paper, we present a systematic analysis of outcome\-supervised process reward modeling from the perspective of credit assignment, identifying the fundamental limitation of existing uniform and causal assignment strategies\. To address this, we propose*Weakest Link Assignment*as a principled alternative and introduceLCA, an outcome\-supervised PRM framework that jointly learns credit assignment and reward modeling\. By formulating outcome\-supervised PRM as a Multiple Instance Learning \(MIL\) problem and introducing Softmax\-Weighted\-Sum pooling, we resolve the chicken\-and\-egg dilemma between credit assignment and reward modeling under strong inter\-instance dependence and redundancy\. Both theoretical analysis and extensive experiments demonstrate the superiority ofLCA, offering a principled alternative for fine\-grained process supervision in complex reasoning tasks\.
## Limitations
Despite achieving state\-of\-the\-art performance among outcome\-supervised process reward modeling methods, our approach still lags behind PRMs trained on human\-annotated step\-level labels\. This gap highlights room for further improvement in weakly supervised reward modeling\.
Furthermore, using final answer correctness as a proxy for trajectory\-level labels may introduce inherent noise\. LLMs sometimes arrive at the correct answer via flawed reasoning, in which case the Standard Multi\-Instance Assumption does not strictly hold\. While our method proves robust in practice, integrating techniques for learning with noisy labels into the MIL framework is a promising direction for future work\.
Beyond the above, integrating PRMs into reinforcement learning \(RL\) remains another important open problem\. PRM\-guided RL is susceptible to reward hacking, making this a non\-trivial challenge\. Existing explorations include reward design, optimization algorithms, and co\-evolution of the reward model with the policy model\. Notably, our method requires only outcome supervision to train PRMs and incurs no additional Monte Carlo sampling overhead, rendering it particularly suitable for reward\-policy co\-evolution\. We see this as a valuable avenue for future investigation\.
## Ethical Considerations
All experiments in this work were carried out on publicly available datasets, models, and benchmarks, following their respective licenses and usage terms\. The study does not involve human or animal subjects, and we did not collect, use, or disclose any personally identifiable information\.
In this work, Large Language Models \(LLMs\) were used for language polishing and coding assistance\. Specifically, LLMs supported refining the clarity and grammar of the manuscript, improving stylistic quality, and suggesting code snippets or troubleshooting strategies\. All content generated by LLMs was carefully reviewed and verified by the authors before inclusion\. The research design, critical analyses, and all final decisions were independently conducted by the authors\. LLMs were not involved in generating new research ideas or conclusions\.
## References
- Multiple instance learning networks for fine\-grained sentiment analysis\.Transactions of the Association for Computational Linguistics6,pp\. 17–31\.External Links:[Link](https://aclanthology.org/Q18-1002/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00002)Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1)\.
- H\. Chen, T\. Yang, S\. Gao, R\. Chen, X\. Quan, H\. Tian, and T\. Yao \(2025\)Discriminative policy optimization for token\-level reward models\.External Links:[Link](https://openreview.net/forum?id=aq3YxKPZBk)Cited by:[1st item](https://arxiv.org/html/2606.27739#S1.I1.i1.p1.1)\.
- J\. Cheng, G\. Xiong, R\. Qiao, L\. Li, C\. Guo, J\. Wang, Y\. Lv, and F\. Wang \(2026\)Stop summation: min\-form credit assignment is all process reward model needs for reasoning\.External Links:[Link](https://openreview.net/forum?id=3Sxby0hH1q)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p3.1),[§1](https://arxiv.org/html/2606.27739#S1.p1.1),[§1](https://arxiv.org/html/2606.27739#S1.p4.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p1.1)\.
- G\. Cui, L\. Yuan, Z\. Wang, H\. Wang, Y\. Zhang, J\. Chen, W\. Li, B\. He, Y\. Fan, T\. Yu, Q\. Xu, W\. Chen, J\. Yuan, H\. Chen, K\. Zhang, X\. Lv, S\. Wang, Y\. Yao, X\. Han, H\. Peng, Y\. Cheng, Z\. Liu, M\. Sun, B\. Zhou, and N\. Ding \(2025\)Process reinforcement through implicit rewards\.External Links:2502\.01456,[Link](https://arxiv.org/abs/2502.01456)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1)\.
- Y\. Ding, X\. Shi, J\. Li, xiaobo liang, Z\. Tu, and M\. Zhang \(2026\)SCAN: self\-denoising monte carlo annotation for robust process reward learning\.External Links:[Link](https://openreview.net/forum?id=ifsyZYYDNs)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1),[§C\.2](https://arxiv.org/html/2606.27739#A3.SS2.SSS0.Px1.p1.1),[§C\.3](https://arxiv.org/html/2606.27739#A3.SS3.p4.4),[3rd item](https://arxiv.org/html/2606.27739#S5.I2.i3.p1.1)\.
- J\. Early, G\. Cheung, K\. Cutajar, H\. Xie, J\. Kandola, and N\. Twomey \(2024\)Inherently interpretable time series classification via multiple instance learning\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=xriGRsoAza)Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1),[§2\.1](https://arxiv.org/html/2606.27739#S2.SS1.p4.1)\.
- J\. Foulds and E\. Frank \(2010\)A review of multi\-instance learning assumptions\.The Knowledge Engineering Review25\(1\),pp\. 1–25\.External Links:[Document](https://dx.doi.org/10.1017/S026988890999035X)Cited by:[§2\.1](https://arxiv.org/html/2606.27739#S2.SS1.p3.1)\.
- S\. Gilda and S\. Gilda \(2026\)Structured abductive\-deductive\-inductive reasoning for LLMs via algebraic invariants\.External Links:[Link](https://openreview.net/forum?id=zy6HdcsJ9V)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p4.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§5\.1](https://arxiv.org/html/2606.27739#S5.SS1.p1.1),[§5\.2](https://arxiv.org/html/2606.27739#S5.SS2.p1.1)\.
- X\. Guan, L\. L\. Zhang, Y\. Liu, N\. Shang, Y\. Sun, Y\. Zhu, F\. Yang, and M\. Yang \(2025\)RStar\-math: small LLMs can master math reasoning with self\-evolved deep thinking\.InProceedings of the 42nd International Conference on Machine LearningThe Thirty\-ninth Annual Conference on Neural Information Processing SystemsThe Thirteenth International Conference on Learning RepresentationsThe Thirty\-ninth Annual Conference on Neural Information Processing SystemsThe Thirteenth International Conference on Learning RepresentationsProceedings of the 42nd International Conference on Machine LearningThirty\-seventh Conference on Neural Information Processing SystemsThe Fourteenth International Conference on Learning RepresentationsThirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 2\)Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)The Thirteenth International Conference on Learning RepresentationsForty\-second International Conference on Machine LearningThe Twelfth International Conference on Learning RepresentationsICLR 2024 Workshop on Large Language Model \(LLM\) AgentsICLR 2026 Workshop on Logical Reasoning of Large Language ModelsThe 2023 Conference on Empirical Methods in Natural Language ProcessingProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)The Thirty\-eighth Annual Conference on Neural Information Processing SystemsProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningFindings of the Association for Computational Linguistics: ACL 2025,A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, J\. Zhu, A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, J\. Zhu, W\. Che, J\. Nabende, E\. Shutova, M\. T\. Pilehvar, A\. Rogers, J\. Boyd\-Graber, N\. Okazaki, L\. Ku, A\. Martins, V\. Srikumar, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Proceedings of Machine Learning ResearchProceedings of Machine Learning ResearchKDD ’20, Vol\.267267,pp\. 20640–20661\.External Links:[Link](https://proceedings.mlr.press/v267/guan25f.html)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1),[2nd item](https://arxiv.org/html/2606.27739#S1.I1.i2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.External Links:[Link](https://openreview.net/forum?id=7Bywt2mQsCe)Cited by:[3rd item](https://arxiv.org/html/2606.27739#A3.I1.i3.p1.1)\.
- B\. Hui, J\. Yang, Z\. Cui, J\. Yang, D\. Liu, L\. Zhang, T\. Liu, J\. Zhang, B\. Yu, K\. Lu, K\. Dang, Y\. Fan, Y\. Zhang, A\. Yang, R\. Men, F\. Huang, B\. Zheng, Y\. Miao, S\. Quan, Y\. Feng, X\. Ren, X\. Ren, J\. Zhou, and J\. Lin \(2024\)Qwen2\.5\-coder technical report\.External Links:2409\.12186,[Link](https://arxiv.org/abs/2409.12186)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1)\.
- M\. Ilse, J\. Tomczak, and M\. Welling \(2018\)Attention\-based deep multiple instance learning\.InProceedings of the 35th International Conference on Machine Learning,J\. Dy and A\. Krause \(Eds\.\),Proceedings of Machine Learning Research, Vol\.80,pp\. 2127–2136\.External Links:[Link](https://proceedings.mlr.press/v80/ilse18a.html)Cited by:[§5\.3](https://arxiv.org/html/2606.27739#S5.SS3.SSS0.Px2.p1.1)\.
- A\. Jacovi, Y\. Bitton, B\. Bohnet, J\. Herzig, O\. Honovich, M\. Tseng, M\. Collins, R\. Aharoni, and M\. Geva \(2024\)A chain\-of\-thought is as strong as its weakest link: a benchmark for verifiers of reasoning chains\.Bangkok, Thailand,pp\. 4615–4634\.External Links:[Link](https://aclanthology.org/2024.acl-long.254/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.254)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p4.1)\.
- J\. Jang and H\. KWON \(2024\)Are multiple instance learning algorithms learnable for instances?\.External Links:[Link](https://openreview.net/forum?id=cUcvlgkQxP)Cited by:[§B\.7](https://arxiv.org/html/2606.27739#A2.SS7.p1.1)\.
- J\. Jang and H\. Kwon \(2025\)TAIL\-mil: time\-aware and instance\-learnable multiple instance learning for multivariate time series anomaly detection\.Proceedings of the AAAI Conference on Artificial Intelligence39,pp\. 17582–17589\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v39i17.33933)Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1),[§5\.3](https://arxiv.org/html/2606.27739#S5.SS3.SSS0.Px2.p1.1)\.
- S\. A\. Javed, D\. Juyal, H\. Padigela, A\. Taylor\-Weiner, L\. Yu, and A\. Prakash \(2022\)Additive mil: intrinsically interpretable multiple instance learning for pathology\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[§5\.2](https://arxiv.org/html/2606.27739#S5.SS2.p1.1)\.
- Q\. Li, X\. Dai, X\. Li, W\. Zhang, Y\. Wang, R\. Tang, and Y\. Yu \(2025\)CodePRM: execution feedback\-enhanced process reward model for code generation\.Vienna, Austria,pp\. 8169–8182\.External Links:[Link](https://aclanthology.org/2025.findings-acl.428/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.428),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p1.1)\.
- W\. Li and Y\. Li \(2025\)Process reward model with q\-value rankings\.External Links:[Link](https://openreview.net/forum?id=wQEdh2cgEk)Cited by:[§C\.2](https://arxiv.org/html/2606.27739#A3.SS2.SSS0.Px1.p1.1),[§C\.2](https://arxiv.org/html/2606.27739#A3.SS2.SSS0.Px2.p1.2),[5th item](https://arxiv.org/html/2606.27739#S5.I2.i5.p1.1)\.
- Y\. Li, Z\. Lin, S\. Zhang, Q\. Fu, B\. Chen, J\. Lou, and W\. Chen \(2023\)Making language models better reasoners with step\-aware verifier\.Toronto, Canada,pp\. 5315–5333\.External Links:[Link](https://aclanthology.org/2023.acl-long.291/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.291)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p4.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=v8L0pN6EOi)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1),[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p3.1),[2nd item](https://arxiv.org/html/2606.27739#A3.I1.i2.p1.1),[3rd item](https://arxiv.org/html/2606.27739#A3.I1.i3.p1.1),[§C\.4](https://arxiv.org/html/2606.27739#A3.SS4.SSS0.Px2.p1.4),[§C\.5](https://arxiv.org/html/2606.27739#A3.SS5.p2.1),[§1](https://arxiv.org/html/2606.27739#S1.p1.1),[§1](https://arxiv.org/html/2606.27739#S1.p2.1),[§1](https://arxiv.org/html/2606.27739#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.27739#S2.SS2.SSS0.Px2.p1.1),[§2\.2](https://arxiv.org/html/2606.27739#S2.SS2.SSS0.Px3.p1.4),[§5\.2](https://arxiv.org/html/2606.27739#S5.SS2.p1.1),[§5\.2](https://arxiv.org/html/2606.27739#S5.SS2.p2.1),[§5\.3](https://arxiv.org/html/2606.27739#S5.SS3.SSS0.Px2.p1.1)\.
- X\. Liu, W\. Zhang, W\. Tang, T\. D\. Le, J\. Li, L\. Liu, and M\. Zhang \(2025\)From correlation to causation: max\-pooling\-based multi\-instance learning leads to more robust whole slide image classification\.External Links:2408\.09449,[Link](https://arxiv.org/abs/2408.09449)Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1)\.
- L\. Luo, Y\. Liu, R\. Liu, S\. Phatale, M\. Guo, H\. Lara, Y\. Li, L\. Shu, Y\. Zhu, L\. Meng, J\. Sun, and A\. Rastogi \(2024\)Improve mathematical reasoning in language models by automated process supervision\.External Links:2406\.06592,[Link](https://arxiv.org/abs/2406.06592)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1),[§C\.2](https://arxiv.org/html/2606.27739#A3.SS2.SSS0.Px1.p1.1),[2nd item](https://arxiv.org/html/2606.27739#S1.I1.i2.p1.1),[§1](https://arxiv.org/html/2606.27739#S1.p1.1),[4th item](https://arxiv.org/html/2606.27739#S5.I2.i4.p1.1),[§5\.2](https://arxiv.org/html/2606.27739#S5.SS2.p2.1)\.
- A\. Prasad, S\. Saha, X\. Zhou, and M\. Bansal \(2023\)ReCEval: evaluating reasoning chains via correctness and informativeness\.External Links:[Link](https://openreview.net/forum?id=NtHfJrjkiv)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p4.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§5\.2](https://arxiv.org/html/2606.27739#S5.SS2.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.External Links:[Link](https://openreview.net/forum?id=HPuSIXJaa9)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1)\.
- J\. Rasley, S\. Rajbhandari, O\. Ruwase, and Y\. He \(2020\)DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters\.New York, NY, USA,pp\. 3505–3506\.External Links:ISBN 9781450379984,[Link](https://doi.org/10.1145/3394486.3406703),[Document](https://dx.doi.org/10.1145/3394486.3406703)Cited by:[§C\.2](https://arxiv.org/html/2606.27739#A3.SS2.SSS0.Px3.p1.1)\.
- N\. Razin, Y\. Lin, J\. Yao, and S\. Arora \(2026\)Why is your language model a poor implicit reward model?\.External Links:[Link](https://openreview.net/forum?id=qW05yhm3Zo)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1)\.
- J\. Salamon, B\. McFee, and P\. Li \(2017\)DCASE 2017 submission: multiple instance learning for sound event detection\.Technical ReportDCASE2017 Challenge\.Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1)\.
- A\. Setlur, C\. Nagpal, A\. Fisch, X\. Geng, J\. Eisenstein, R\. Agarwal, A\. Agarwal, J\. Berant, and A\. Kumar \(2025\)Rewarding progress: scaling automated process verifiers for LLM reasoning\.External Links:[Link](https://openreview.net/forum?id=A6Y7AqlzLW)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.27739#S1.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p1.1)\.
- Z\. Shao, H\. Bian, Y\. Chen, Y\. Wang, J\. Zhang, X\. Ji, and Y\. Zhang \(2021\)TransMIL: transformer based correlated multiple instance learning for whole slide image classification\.InProceedings of the 35th International Conference on Neural Information Processing Systems,NIPS ’21,Red Hook, NY, USA\.External Links:ISBN 9781713845393Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1)\.
- C\. V\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2025\)Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\.External Links:[Link](https://openreview.net/forum?id=4FWAwZtd2n)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p1.1)\.
- Y\. Tian, G\. Pang, Y\. Chen, R\. Singh, J\. Verjans, and G\. Carneiro \(2021\)Weakly\-supervised video anomaly detection with robust temporal feature magnitude learning\.2021 IEEE/CVF International Conference on Computer Vision \(ICCV\),pp\. 4955–4966\.External Links:[Link](https://api.semanticscholar.org/CorpusID:236950792)Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p1.1)\.
- A\. Tibo, M\. Jaeger, and P\. Frasconi \(2018\)Learning and interpreting multi\-multi\-instance learning networks\.CoRRabs/1810\.11514\.External Links:[Link](http://arxiv.org/abs/1810.11514),1810\.11514Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1),[§5\.3](https://arxiv.org/html/2606.27739#S5.SS3.SSS0.Px2.p1.1)\.
- J\. Uesato, N\. Kushman, R\. Kumar, F\. Song, N\. Siegel, L\. Wang, A\. Creswell, G\. Irving, and I\. Higgins \(2022\)Solving math word problems with process\- and outcome\-based feedback\.External Links:2211\.14275,[Link](https://arxiv.org/abs/2211.14275)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1)\.
- C\. Wang, Y\. Deng, Z\. Lyu, L\. Zeng, J\. He, S\. Yan, and B\. An \(2024a\)Q\*: improving multi\-step reasoning for llms with deliberative planning\.External Links:2406\.14283,[Link](https://arxiv.org/abs/2406.14283)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§C\.3](https://arxiv.org/html/2606.27739#A3.SS3.p3.5)\.
- J\. Wang, M\. Fang, Z\. Wan, M\. Wen, J\. Zhu, A\. Liu, Z\. Gong, Y\. Song, L\. Chen, L\. M\. Ni, L\. Yang, Y\. Wen, and W\. Zhang \(2024b\)OpenR: an open source framework for advanced reasoning with large language models\.External Links:2410\.09671,[Link](https://arxiv.org/abs/2410.09671)Cited by:[§C\.3](https://arxiv.org/html/2606.27739#A3.SS3.p3.5)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024c\)Math\-shepherd: verify and reinforce LLMs step\-by\-step without human annotations\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 9426–9439\.External Links:[Link](https://aclanthology.org/2024.acl-long.510/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.510)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1),[1st item](https://arxiv.org/html/2606.27739#A3.I1.i1.p1.1),[§C\.2](https://arxiv.org/html/2606.27739#A3.SS2.SSS0.Px1.p1.1),[2nd item](https://arxiv.org/html/2606.27739#S1.I1.i2.p1.1),[§1](https://arxiv.org/html/2606.27739#S1.p1.1),[2nd item](https://arxiv.org/html/2606.27739#S5.I2.i2.p1.1),[§5\.1](https://arxiv.org/html/2606.27739#S5.SS1.p1.1),[§5\.2](https://arxiv.org/html/2606.27739#S5.SS2.p2.1)\.
- X\. Wang, Y\. Yan, P\. Tang, X\. Bai, and W\. Liu \(2018\)Revisiting multiple instance neural networks\.Pattern Recogn\.74\(C\),pp\. 15–24\.External Links:ISSN 0031\-3203,[Link](https://doi.org/10.1016/j.patcog.2017.08.026),[Document](https://dx.doi.org/10.1016/j.patcog.2017.08.026)Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p2.1),[§5\.3](https://arxiv.org/html/2606.27739#S5.SS3.SSS0.Px2.p1.1)\.
- M\. Waqas, S\. U\. Ahmed, M\. A\. Tahir, J\. Wu, and R\. Qureshi \(2024\)Exploring multiple instance learning \(mil\): a brief survey\.Expert Syst\. Appl\.250\(C\)\.External Links:ISSN 0957\-4174,[Link](https://doi.org/10.1016/j.eswa.2024.123893),[Document](https://dx.doi.org/10.1016/j.eswa.2024.123893)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p6.1),[§3\.1](https://arxiv.org/html/2606.27739#S3.SS1.p2.1)\.
- Y\. Xu, T\. Mo, Q\. Feng, P\. Zhong, M\. Lai, and E\. I\. Chang \(2014\)Deep learning of feature representation with multiple instance learning for medical image analysis\.In2014 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),Vol\.,pp\. 1626–1630\.External Links:[Document](https://dx.doi.org/10.1109/ICASSP.2014.6853873)Cited by:[§A\.1](https://arxiv.org/html/2606.27739#A1.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.27739#S2.SS1.p4.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§5\.3](https://arxiv.org/html/2606.27739#S5.SS3.SSS0.Px2.p1.1)\.
- A\. Yang, B\. Zhang, B\. Hui, B\. Gao, B\. Yu, C\. Li, D\. Liu, J\. Tu, J\. Zhou, J\. Lin, K\. Lu, M\. Xue, R\. Lin, T\. Liu, X\. Ren, and Z\. Zhang \(2024\)Qwen2\.5\-math technical report: toward mathematical expert model via self\-improvement\.External Links:2409\.12122,[Link](https://arxiv.org/abs/2409.12122)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§5\.1](https://arxiv.org/html/2606.27739#S5.SS1.p1.1)\.
- L\. Yu, W\. Jiang, H\. Shi, J\. YU, Z\. Liu, Y\. Zhang, J\. Kwok, Z\. Li, A\. Weller, and W\. Liu \(2024\)MetaMath: bootstrap your own mathematical questions for large language models\.External Links:[Link](https://openreview.net/forum?id=N8N0hgNDRt)Cited by:[§C\.3](https://arxiv.org/html/2606.27739#A3.SS3.p2.1),[§1](https://arxiv.org/html/2606.27739#S1.p1.1)\.
- L\. Yuan, W\. Li, H\. Chen, G\. Cui, N\. Ding, K\. Zhang, B\. Zhou, Z\. Liu, and H\. Peng \(2025\)Free process rewards without process labels\.pp\. 73511–73525\.External Links:[Link](https://proceedings.mlr.press/v267/yuan25c.html)Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p2.1),[§C\.2](https://arxiv.org/html/2606.27739#A3.SS2.SSS0.Px1.p1.1),[§C\.4](https://arxiv.org/html/2606.27739#A3.SS4.SSS0.Px2.p1.4),[1st item](https://arxiv.org/html/2606.27739#S1.I1.i1.p1.1),[1st item](https://arxiv.org/html/2606.27739#S5.I2.i1.p1.1)\.
- R\. Zhang, P\. Qin, Q\. Cao, and P\. Xie \(2025a\)DreamPRM\-code: function\-as\-step process reward model with label correction for llm coding\.External Links:2512\.15000,[Link](https://arxiv.org/abs/2512.15000)Cited by:[§1](https://arxiv.org/html/2606.27739#S1.p1.1)\.
- Z\. Zhang, C\. Zheng, Y\. Wu, B\. Zhang, R\. Lin, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025b\)The lessons of developing process reward models in mathematical reasoning\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 10495–10516\.External Links:[Link](https://aclanthology.org/2025.findings-acl.547/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.547),ISBN 979\-8\-89176\-256\-5Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p1.1),[§C\.4](https://arxiv.org/html/2606.27739#A3.SS4.SSS0.Px2.p1.4),[§5\.1](https://arxiv.org/html/2606.27739#S5.SS1.p3.2),[§5\.3](https://arxiv.org/html/2606.27739#S5.SS3.SSS0.Px1.p2.4)\.
- C\. Zheng, Z\. Zhang, B\. Zhang, R\. Lin, K\. Lu, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025\)ProcessBench: identifying process errors in mathematical reasoning\.Vienna, Austria,pp\. 1009–1024\.External Links:[Link](https://aclanthology.org/2025.acl-long.50/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.50),ISBN 979\-8\-89176\-251\-0Cited by:[§A\.2](https://arxiv.org/html/2606.27739#A1.SS2.p3.1),[4th item](https://arxiv.org/html/2606.27739#A3.I1.i4.p1.1),[§C\.4](https://arxiv.org/html/2606.27739#A3.SS4.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.27739#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.27739#S2.SS2.SSS0.Px3.p1.4),[§5\.1](https://arxiv.org/html/2606.27739#S5.SS1.p2.1)\.
- C\. Zheng, J\. Zhu, Z\. Ou, Y\. Chen, K\. Zhang, R\. Shan, Z\. Zheng, M\. Yang, J\. Lin, Y\. Yu, and W\. Zhang \(2026\)A survey of process reward models: from outcome signals to process supervisions for large language models\.External Links:2510\.08049,[Link](https://arxiv.org/abs/2510.08049)Cited by:[§4](https://arxiv.org/html/2606.27739#S4.p2.3)\.
- T\. Zheng, K\. Jiang, and H\. Yao \(2024\)Dynamic policy\-driven adaptive multi\-instance learning for whole slide image classification\.External Links:2403\.07939,[Link](https://arxiv.org/abs/2403.07939)Cited by:[§B\.7](https://arxiv.org/html/2606.27739#A2.SS7.p2.1)\.
## Appendix ARelated Work
### A\.1Multiple Instance Learning
Multiple Instance Learning \(MIL\) is a weakly supervised learning paradigm in which labels are assigned to bags of instances rather than to individual instances\. The task is to predict instance labels from bag labels alone\. MIL has been applied to medical image analysis, text processing, time\-series analysis, and video anomaly detectionXuet al\.\([2014](https://arxiv.org/html/2606.27739#bib.bib6)\); Angelidis and Lapata \([2018](https://arxiv.org/html/2606.27739#bib.bib9)\); Earlyet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib7)\); Tianet al\.\([2021](https://arxiv.org/html/2606.27739#bib.bib8)\); Jang and Kwon \([2025](https://arxiv.org/html/2606.27739#bib.bib15)\)\. To the best of our knowledge, we are the first to apply MIL to process reward modeling\.
In the deep learning era, most MIL methods are pooling\-based\. Proposed pooling techniques include instance poolingWanget al\.\([2018](https://arxiv.org/html/2606.27739#bib.bib10)\), embedding poolingTiboet al\.\([2018](https://arxiv.org/html/2606.27739#bib.bib11)\), attention poolingJavedet al\.\([2022](https://arxiv.org/html/2606.27739#bib.bib14)\); Shaoet al\.\([2021](https://arxiv.org/html/2606.27739#bib.bib13)\), additive poolingJavedet al\.\([2022](https://arxiv.org/html/2606.27739#bib.bib14)\), max poolingLiuet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib37)\), and conjunctive poolingEarlyet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib7)\); Angelidis and Lapata \([2018](https://arxiv.org/html/2606.27739#bib.bib9)\); Jang and Kwon \([2025](https://arxiv.org/html/2606.27739#bib.bib15)\)\. The use of softmax operation as pooling functions remains largely unexplored\. The only prior work, an early study on sound event detectionSalamonet al\.\([2017](https://arxiv.org/html/2606.27739#bib.bib16)\), employs a non\-parametric softmax and has seen little subsequent adoption in the MIL literature\. In this work, we propose Softmax\-Weighted\-Sum \(SWS\) pooling, parameterized by a temperatureα\\alphaand tailored for bags with interdependent instances\. Our theoretical analysis and experiments both demonstrate the crucial role of this hyperparameter for instance\-level learnability\.
### A\.2Process Reward Models
Reward models are essential for improving large language models \(LLMs\) in complex reasoning tasks such as mathematical problem\-solvingGrattafioriet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib17)\); Yanget al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib18)\)and competitive programmingHuiet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib19)\)\. Unlike outcome reward models \(ORMs\), which evaluate only the final answer, process reward models \(PRMs\)Uesatoet al\.\([2022](https://arxiv.org/html/2606.27739#bib.bib20)\); Lightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)assess each reasoning step, providing finer\-grained feedback\. Recent advancementsWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\); Luoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib3)\); Wanget al\.\([2024a](https://arxiv.org/html/2606.27739#bib.bib21)\)have demonstrated the significant potential of PRMs in scaling test\-time computeSetluret al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib25)\); Zhanget al\.\([2025b](https://arxiv.org/html/2606.27739#bib.bib22)\)and post\-trainingGuanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib23)\); Chenget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib24)\)\.
The main difficulty in training a PRM is obtaining reliable step\-level annotations\. Human annotationLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)is the primary source but is expensive and does not scale\. Monte Carlo methodsWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\); Luoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib3)\); Guanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib23)\); Dinget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib26)\)offer a promising alternative, estimating a step’s quality by the empirical probability of reaching a correct final answer from that step\. We refer to this as*causal credit assignment*: a step’s score reflects the expected success of its downstream completions rather than its own logical validity\. This coupling causes errors from later steps to be incorrectly attributed to earlier ones, making it difficult to localize the true error source\. Moreover, the estimates depend crucially on the choice of rollout policy and sampling strategy, which introduces noise and hampers generalization\. Another line of researchYuanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib28)\); Cuiet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib29)\)adopts DPO\-style implicit reward functionsRafailovet al\.\([2023](https://arxiv.org/html/2606.27739#bib.bib30)\), which are trained as ORMs but used as PRMs\. Because the trajectory\-level score decomposes into a sum over steps, this approach implicitly enforces*uniform credit assignment*: every step within the same trajectory receives the same optimization pressure, regardless of its actual contribution to the outcome\. As a consequence, the model cannot differentiate genuinely erroneous steps from correct precursor steps, limiting its capacity for precise error identification\. Moreover, these implicit reward functions are known to suffer from limited generalizationRazinet al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib31)\)\.
In this work, we advocate*Weakest Link Assignment*as a principled alternative, drawing on the long\-standing intuition: a reasoning chain is correct if and only if every step is correct, and incorrect if and only if any step is wrong\. This principle already underlies standard PRM evaluation and downstream usage: ProcessBenchZhenget al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib35)\)evaluates PRMs via first\-error detection; best\-of\-N samplingLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)typically aggregates step scores by taking the minimum \(for scalar rewards\) or the product \(for probabilities\); and recent work has shown that a min\-form reward aggregation in reinforcement learning substantially reduces reward hackingChenget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib24)\)\. Building on this, we propose an outcome\-supervised process reward modeling framework that jointly learns credit assignment and reward modeling under the Weakest Link principle, thereby closing the training–test gap and producing a PRM that accurately attributes credit to the outcome\-determining steps\.
## Appendix BPooling Techniques in Multiple Instance Learning
This appendix provides a concise overview of the pooling techniques mentioned in this paper\. We assume a standard deep MIL architecture: given a bagB=\{x1,…,xn\}B=\\\{x\_\{1\},\\dots,x\_\{n\}\\\}, a feature extractorϕ\\phi\(e\.g\., a neural network backbone\) maps each instance to add\-dimensional embeddingzi=ϕ\(xi\)∈ℝdz\_\{i\}=\\phi\(x\_\{i\}\)\\in\\mathbb\{R\}^\{d\}\. A classification headψ\\psithen produces instance\-level predictionspi∈\[0,1\]p\_\{i\}\\in\[0,1\]\. The pooling function aggregates the instance\-level information to yield a bag\-level predictiony^\(B\)\\hat\{y\}\(B\)\.
We describe six pooling methods commonly used in the MIL literature\.
### B\.1Embedding Pooling
Embedding pooling aggregates instance embeddings before classification\. The bag embedding is obtained by averaging the instance embeddings:
zbag=1n∑i=1nzi,z\_\{\\text\{bag\}\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}z\_\{i\},and the bag\-level prediction is theny^\(B\)=ψ\(zbag\)\\hat\{y\}\(B\)=\\psi\(z\_\{\\text\{bag\}\}\)\. This approach is simple and widely used for bag\-level classification, but it does not produce instance\-level predictions and can lose discriminative instance information\.
### B\.2Instance Pooling
Instance pooling performs classification on each instance independently, then aggregates the resulting instance predictions\. The bag\-level prediction is the average of the instance probabilities:
pi=ψ\(zi\),y^\(B\)=1n∑i=1npi\.p\_\{i\}=\\psi\(z\_\{i\}\),\\qquad\\hat\{y\}\(B\)=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}p\_\{i\}\.Instance pooling naturally yields instance\-level predictions, but the uniform averaging does not discriminate between more and less relevant instances\.
### B\.3Max Pooling
Max pooling selects the maximum instance prediction as the bag\-level prediction:
y^\(B\)=max\{p1,…,pn\}\.\\hat\{y\}\(B\)=\\max\\\{p\_\{1\},\\dots,p\_\{n\}\\\}\.This directly implements the standard multi\-instance assumption: a bag is positive if at least one instance is positive\. However, it assigns all credit to a single instance, which can lead to unstable training and poor generalization\.
### B\.4Attention Pooling
Attention pooling learns instance\-specific weights via an attention mechanism, then aggregates the instance embeddings accordingly\. An attention headω\\omegacomputes normalized attention weights:
ai=exp\(ω\(zi\)\)∑j=1nexp\(ω\(zj\)\),a\_\{i\}=\\frac\{\\exp\(\\omega\(z\_\{i\}\)\)\}\{\\sum\_\{j=1\}^\{n\}\\exp\(\\omega\(z\_\{j\}\)\)\},and the bag\-level prediction isy^\(B\)=ψ\(∑i=1naizi\)\\hat\{y\}\(B\)=\\psi\\left\(\\sum\_\{i=1\}^\{n\}a\_\{i\}z\_\{i\}\\right\)\. This method allows learnable, instance\-specific weighting, but the attention weights are not directly constrained by the standard multi\-instance assumption, which can lead to degenerate solutions\.
### B\.5Additive Pooling
Additive pooling combines attention pooling and instance pooling\. Instance embeddings are first scaled by their attention weights before classification:
ai=exp\(ω\(zi\)\)∑j=1nexp\(ω\(zj\)\),pi=ψ\(aizi\)\.a\_\{i\}=\\frac\{\\exp\(\\omega\(z\_\{i\}\)\)\}\{\\sum\_\{j=1\}^\{n\}\\exp\(\\omega\(z\_\{j\}\)\)\},\\qquad p\_\{i\}=\\psi\(a\_\{i\}z\_\{i\}\)\.The bag\-level prediction is then:
y^\(B\)=1n∑i=1npi\.\\hat\{y\}\(B\)=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}p\_\{i\}\.This hybrid approach encourages the model to attend to relevant instances while still performing instance\-level classification\.
### B\.6Conjunctive Pooling
Conjunctive pooling computes attention weights and instance predictions independently, then combines them multiplicatively:
ai=exp\(ω\(zi\)\)∑j=1nexp\(ω\(zj\)\),pi=ψ\(zi\)\.a\_\{i\}=\\frac\{\\exp\(\\omega\(z\_\{i\}\)\)\}\{\\sum\_\{j=1\}^\{n\}\\exp\(\\omega\(z\_\{j\}\)\)\},\\qquad p\_\{i\}=\\psi\(z\_\{i\}\)\.The bag\-level prediction is:
y^\(B\)=1n∑i=1naipi\.\\hat\{y\}\(B\)=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}a\_\{i\}p\_\{i\}\.By training the attention and classification heads in parallel, this method avoids the potential degeneracy where the classifier relies on the attention head to pre\-process the input, making the model more robust\.
### B\.7Discussion
Despite the empirical success of MIL across diverse applications, most common pooling techniques lack theoretical guarantees for instance\-level classification accuracy\. Attention\-based pooling, currently the most widely used approach, has been shown not to be PAC\-learnableJang and KWON \([2024](https://arxiv.org/html/2606.27739#bib.bib50)\)\. Average\-pooling methods do not align with the multi\-instance assumption\. Max\-pooling, while most consistent with the assumption, suffers from low sample efficiency and training instability, which limits its practical adoption\.
In contrast, our proposed temperature\-parameterized SWS pooling exhibits a favorable property—instance\-level Bayes consistency of the learning objective—under mild assumptions, and performs well in outcome\-supervised PRM tasks\. However, we note that the bag sizes in our setting are modest \(typically under a few dozen\), whereas some MIL applications involve bags with hundreds to thousands of instancesZhenget al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib56)\)\. Whether SWS pooling remains effective in such large\-bag regimes is uncertain and lies outside the scope of this work\.
## Appendix CDataset and Implementation Details
### C\.1Dataset Details
We provide details of the datasets and benchmarks used in this work\.
- •Math\-ShepherdWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\)222[https://huggingface\.co/datasets/peiyi9979/Math\-Shepherd](https://huggingface.co/datasets/peiyi9979/Math-Shepherd)serves as the training set in our main experiments \(Sections[5\.1](https://arxiv.org/html/2606.27739#S5.SS1)and[5\.2](https://arxiv.org/html/2606.27739#S5.SS2)\)\. It contains 445k English math reasoning trajectories with their associated problems\. Each reasoning step is labeled via Monte Carlo \(MC\) estimation without human annotation; the MC rollouts are not included\. We use the entire dataset for training and do not perform evaluation on it\.
- •PRM800KLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)333[https://github\.com/openai/prm800k](https://github.com/openai/prm800k)is used as the training set for the ablation study on pooling techniques \(Section[5\.3](https://arxiv.org/html/2606.27739#S5.SS3)\)\. The dataset comprises roughly 40k English math reasoning trajectories and their problems\. Human annotators label the location of the first error in each trajectory\. We adopt the original train/test split, yielding 37\.5k training and 3\.7k test trajectories\. The training split is further processed as described in Appendix[C\.5](https://arxiv.org/html/2606.27739#A3.SS5); the test split is not used for evaluation\.
- •MATH\-500Lightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)444[https://huggingface\.co/datasets/HuggingFaceH4/MATH\-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)provides the problem set for best\-of\-N evaluation \(Section[5\.2](https://arxiv.org/html/2606.27739#S5.SS2)\)\. It is a subset of the MATH datasetHendryckset al\.\([2021](https://arxiv.org/html/2606.27739#bib.bib32)\)comprising 500 English math problems\. We use all 500 problems as the test set\.
- •ProcessBenchZhenget al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib35)\)555[https://huggingface\.co/datasets/Qwen/ProcessBench](https://huggingface.co/datasets/Qwen/ProcessBench)is the evaluation benchmark for our main experiments \(Section[5\.1](https://arxiv.org/html/2606.27739#S5.SS1)\)\. It consists of four splits—GSM8K \(0\.4k\), MATH \(1k\), OlympiadBench \(1k\), and Omni\-MATH \(1k\)—which differ in difficulty\. Each problem comes with a reasoning chain, and all text is in English\. Every trajectory is manually annotated with the first error location\. We evaluate on all four splits\. Evaluation details are given in Appendix[C\.4](https://arxiv.org/html/2606.27739#A3.SS4)\.
### C\.2PRM Training Details
#### Model architecture\.
For our method\(LCA\), MathShepherdWanget al\.\([2024c](https://arxiv.org/html/2606.27739#bib.bib2)\), OmegaPRMLuoet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib3)\), and SCANDinget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib26)\), we remove the language modeling head from the LLM backbone and initialize a randomly initialized classification head, implemented as a two\-layer MLP with the same hidden dimension as the LLM backbone\. ImplicitPRMYuanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib28)\)uses the log\-probability ratio between the policy model and the reference model \(the same LLM backbone without fine\-tuning\) as an implicit reward model\. For PQMLi and Li \([2025](https://arxiv.org/html/2606.27739#bib.bib27)\), we replace the language modeling head with a linear value head following the original paper\.
#### Hyperparameters\.
For Qwen2\.5\-Math\-7B\-Instruct and Llama3\.2\-3B\-Instruct backbones, we set the learning rate to1×10−61\\times 10^\{\-6\}; for the Qwen3\-4B backbone, we use1×10−51\\times 10^\{\-5\}\. All experiments employ a linear learning rate scheduler\. For all methods except PQM, we use a global batch size \(GPU count × gradient accumulation steps × per\-GPU batch size\) of 384 on Qwen2\.5\-Math\-7B\-Instruct and Llama3\.2\-3B\-Instruct backbones, and 512 on Qwen3\-4B\. For PQM, we adopt the more fine\-grained batching strategy described in its original paperLi and Li \([2025](https://arxiv.org/html/2606.27739#bib.bib27)\)\. All models are trained for a single epoch\.
#### Infrastructure\.
All experiments are conducted on four 80GB A800 GPUs using DeepSpeed ZeRO\-3 parallel trainingRasleyet al\.\([2020](https://arxiv.org/html/2606.27739#bib.bib51)\)\.
### C\.3Baseline Implementation Details
MathShepherd, PQM, and ImplicitPRM are trained directly on the Math\-Shepherd dataset\. For ImplicitPRM, we use cross\-entropy loss withβ=0\.05\\beta=0\.05\. For PQM, we setζ=2\\zeta=2\. Other implement details follow the original papers\.
OmegaPRM and SCAN improve the data collection and labeling strategy on top of MathShepherd and thus cannot be trained directly on the Math\-Shepherd dataset\. We therefore generate training data following their respective proposals\. To ensure a fair comparison, we use the same set of problems as in Math\-Shepherd dataset and adopt MetaMath\-Llemma\-7BYuet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib44)\)as the generator model, which is used for Monte Carlo rollouts in Math\-Shepherd dataset\. For both methods, the amount of generated data is comparable to that of Math\-Shepherd\.
For OmegaPRM, as no official code is available, we use the open\-source implementation openr666[https://github\.com/openreasoner/openr](https://github.com/openreasoner/openr)Wanget al\.\([2024b](https://arxiv.org/html/2606.27739#bib.bib52)\)for MCTS\-based data generation\. We set the search limit to 200 per question,α=0\.5\\alpha=0\.5,β=0\.9\\beta=0\.9,L=500L=500, andcpuct=0\.125c\_\{\\text\{puct\}\}=0\.125\. For each Monte Carlo estimation, we samplek=16k=16rollouts\. See the original paper for detailsWanget al\.\([2024a](https://arxiv.org/html/2606.27739#bib.bib21)\)\.
For SCAN, we follow the proposed denoising procedure on top of the Math\-Shepherd dataset, with a tolerant distanced=2d=2\. For each problem, we first generatek=16k=16rollouts to estimate the confidence of the completor model\. For each trajectory in the dataset, we locate the first error \(with label=0=0\), then consider the first and second steps before that error, generatek=8k=8completions for each, compute the Monte Carlo value, and combine it with the problem\-level confidence to produce denoised labels\. See the original paper for detailsDinget al\.\([2026](https://arxiv.org/html/2606.27739#bib.bib26)\)\.
### C\.4Evaluation Details
#### ProcessBench evaluation\.
We follow the evaluation protocol of ProcessBenchZhenget al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib35)\)\. For our method\(LCA\), MathShepherd, and SCAN, PRMs are trained as standard binary classifiers; we therefore extract the earliest erroneous step directly from its correctness predictions for reasoning steps\. OmegaPRM, ImplicitPRM, and PQM produce scalar scores for each step; we first transform these scores into binary correctness predictions using a threshold\. The threshold is chosen as the one that yields the highest F1 score on the MATH subset \(see Table[3](https://arxiv.org/html/2606.27739#A3.T3)\)\.
MethodQwen2\.5\-Math7B\-InstructLlama3\.23B\-InstructImplicitPRM0\.50\.4OmegePRM0\.40\.2PQM0\.20\.2Table 3:Thresholds for evaluating real\-valued PRMs on ProcessBench across different backbones\.
#### Best\-of\-N evaluation\.
We use Mistral\-7B\-Instruct\-v0\.2, Meta\-Llama\-3\-8B\-Instruct, and Qwen2\.5\-7B\-Instruct as the generator models to produce evaluation data for best\-of\-N sampling on the MATH500 problem set\. The data for Mistral\-7B\-Instruct\-v0\.2 and Meta\-Llama\-3\-8B\-Instruct are taken directly fromYuanet al\.\([2025](https://arxiv.org/html/2606.27739#bib.bib28)\)\. For Qwen2\.5\-7B\-Instruct, we collect 64 rollouts per problem in MATH500 using temperature0\.70\.7, top\-pp0\.950\.95, and a maximum generation length of10241024\. For binary classification models \(LCA, MathShepherd, SCAN\), we take the product of the step\-wise scores as the trajectory score, consistent with earlier studiesLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\); Zhanget al\.\([2025b](https://arxiv.org/html/2606.27739#bib.bib22)\)\. For models that output real\-valued scores \(OmegaPRM, ImplicitPRM, PQM\), we take the minimum of the step\-wise scores, following their original papers\.
### C\.5Detailed Settings of Ablation on Pooling Techniques
We implement embedding pooling, instance pooling, attention pooling, and conjunctive pooling following the descriptions in Appendix[B](https://arxiv.org/html/2606.27739#A2)\. For attention and conjunctive pooling, the attention head is realized as a simple linear value head, as the attention‑based LLM backbone already provides sufficient nonlinearity\.
As noted in[Limitations](https://arxiv.org/html/2606.27739#Sx1), using final answer correctness as bag‑level labels may introduce label noise when correct answers are reached through flawed reasoning, violating the Standard Multi‑Instance Assumption\. To eliminate this confounding factor and adhere strictly to the standard MIL setting, we construct a training set from the human‑annotated PRM800K datasetLightmanet al\.\([2024](https://arxiv.org/html/2606.27739#bib.bib1)\)\. This dataset marks the first error location for each trajectory \(−1\-1for error‑free trajectories\)\. We derive trajectory‑level labels by checking whether the trajectory contains any error, and then discard the original step‑level annotations\. The resulting dataset is further balanced to ensure equal proportions of erroneous and correct trajectories, yielding a final set of 30k trajectories \(50% containing errors\), which we refer to asPRM800K\-balanced\. We then train PRMs on this dataset with different pooling techniques, using Qwen3‑4B as the backbone\. Training details are provided in Appendix[C\.2](https://arxiv.org/html/2606.27739#A3.SS2)\.
For the stepwise\-supervised baseline, we retain the step\-level labels fromPRM800K\-balanced\. Steps before the annotated first error are labeled correct; all others are labeled incorrect\. The PRM is trained with step\-level cross\-entropy loss, with all other settings unchanged\.
ModelGSM8KMATHOlympiadBenchOmni\-MATHAvg\.F1Err\.Corr\.F1Err\.Corr\.F1Err\.Corr\.F1Err\.Corr\.F1Large Language Models as CriticLlama\-3\-8B\-Instruct42\.57\.813\.128\.69\.113\.827\.12\.74\.826\.18\.312\.611\.1Llama\-3\.1\-8B\-Instruct44\.46\.210\.941\.92\.75\.132\.41\.52\.832\.00\.81\.65\.1Qwen2\.5\-7B\-Instruct40\.633\.236\.530\.845\.136\.626\.533\.929\.726\.228\.627\.432\.6Qwen2\.5\-Math\-7B\-Instruct15\.5100\.026\.814\.896\.825\.77\.791\.714\.26\.988\.012\.719\.9Qwen2\.5\-Coder\-7B\-Instruct7\.7100\.014\.33\.498\.36\.52\.199\.14\.10\.998\.31\.86\.7Open Source Process Reward ModelsMath\-Shepherd\-PRM\-7B32\.491\.747\.918\.082\.029\.515\.071\.124\.814\.273\.023\.831\.5RLHFlow\-PRM\-Mistral\-8B33\.899\.050\.421\.772\.233\.48\.243\.113\.89\.645\.215\.828\.4RLHFlow\-PRM\-Deepseek\-8B24\.298\.438\.821\.480\.033\.810\.151\.016\.910\.951\.916\.926\.6EurusPRM\-Stage146\.942\.044\.333\.338\.235\.623\.919\.821\.721\.924\.523\.131\.2EurusPRM\-Stage251\.244\.047\.336\.435\.035\.725\.718\.021\.223\.119\.120\.931\.3Qwen2\.5\-Math\-7B\-Math\-Shepherd\-PRM46\.495\.962\.518\.996\.631\.67\.493\.813\.74\.095\.07\.728\.9Process Reward Models Trained onLlama3\.2\-3B\-InstructImplicitPRM41\.182\.454\.822\.966\.534\.113\.955\.222\.27\.057\.712\.430\.9MathShepherd24\.298\.438\.813\.090\.922\.77\.990\.014\.56\.190\.011\.421\.8SCAN31\.497\.947\.616\.388\.927\.68\.987\.016\.26\.686\.312\.225\.9OmegaPRM37\.240\.438\.721\.948\.030\.110\.954\.918\.27\.963\.514\.125\.3PQM21\.390\.734\.426\.370\.738\.314\.234\.820\.220\.046\.127\.930\.2\\rowcolorblue\!10LCA\(Ours,α=1\\alpha=1\)35\.775\.148\.430\.667\.542\.123\.442\.830\.320\.750\.629\.437\.6Process Reward Models Trained onQwen2\.5\-Math\-7B\-InstructImplicitPRM51\.240\.445\.241\.132\.536\.326\.820\.923\.524\.126\.125\.132\.5MathShepherd42\.597\.459\.218\.596\.331\.16\.893\.812\.74\.291\.38\.127\.8SCAN95\.344\.460\.620\.596\.633\.98\.292\.615\.03\.690\.06\.829\.1OmegaPRM41\.149\.244\.829\.151\.037\.115\.049\.022\.911\.546\.118\.430\.8PQM39\.196\.455\.729\.387\.943\.917\.773\.228\.514\.871\.424\.538\.1\\rowcolorblue\!10LCA\(Ours,α=1\\alpha=1\)51\.291\.765\.733\.490\.648\.715\.486\.126\.211\.384\.619\.940\.2
Table 4:Full results of ProcessBench evaluation\.Table 5:Case study \(1\)\. The first error occurs in step 3, which uses repeated numbers \(2, 2\) despite the requirement for distinct integers\. Blue values indicate the probability of*incorrect*predicted byLCA\.Question
The product of a set of distinct positive integers is 84\. What is the least possible sum of these integers? \(first wrong step: 3\)
Step 1
To find the least possible sum of distinct positive integers whose product is 84, let’s start by prime factorizing 84:84=22×3×784=2^\{2\}\\times 3\\times 7Our goal is to minimize the sum of these factors while ensuring they multiply to 84\. To do this, we want to use as few numbers as possible because larger numbers generally lead to larger sums\. \(0\.18\)
Step 2
Start with the largest prime factor: The largest prime factor here is 7\. If we include 7 in our set, we have84/7=1284/7=12\. Now, we need to factorize 12:12=22×312=2^\{2\}\\times 3\. \(0\.03\)
Step 3
Include the necessary factors: We already have a222^\{2\}from our prime factorization\. Since we’re trying to minimize the sum, we should include the smallest possible numbers that multiply to the remaining factors\. First, for222^\{2\}, we already have one 2 \(from222^\{2\}\) included\. Second, for the remaining212^\{1\}and33, adding these directly gives us the necessary components\. So, the set of integers that multiply to 84 and have the least possible sum would be7,2,2,37,2,2,3\. \(0\.84\)
Step 4
Calculating the sum:7\+2\+2\+3=147\+2\+2\+3=14Therefore, the least possible sum of these integers is \*\*14\*\*\. \(0\.94\)
Table 6:Case study \(2\)\. The first error occurs in step 3 when multiplying the numbers: 6×518400=3110400, not 3104400\. Blue values indicate the probability of*incorrect*predicted byLCA\.Question
The Coventry School’s European debate club has 6 German delegates, 5 French delegates, and 3 Italian delegates\. In how many ways can these 14 delegates sit in a row of 14 chairs, if each country’s delegates insist on all sitting next to each other? \(first wrong step: 3\)
Step 1
To find the total number of ways the delegates can sit in a row of 14 chairs with each country’s delegates sitting together, we can break down the problem into smaller steps\. First, consider the three groups of delegates as single units\. Since there are three countries, we have 3 units \(groups\) that need to be arranged in a row\. The number of ways to arrange these 3 units is 3\!\. This is because for the first position, we have 3 choices, for the second position, we have 2 choices left, and for the last position, we have only 1 choice left\. 3\! = 3 \* 2 \* 1 = 6\. \(0\.14\)
Step 2
Next, let’s consider the internal arrangement of each group\. For the 6 German delegates, they can be arranged in 6\! ways\. For the 5 French delegates, they can be arranged in 5\! ways\. And for the 3 Italian delegates, they can be arranged in 3\! ways\. 6\! = 6 \* 5 \* 4 \* 3 \* 2 \* 1 = 720, 5\! = 5 \* 4 \* 3 \* 2 \* 1 = 120, and 3\! = 3 \* 2 \* 1 = 6\. \(0\.07\)
Step 3
To find the total number of arrangements, we multiply the number of ways to arrange the units \(Step 2\) by the number of ways to arrange each group internally \(Step 3\)\. Total number of arrangements = \(number of ways to arrange units\) \* \(number of ways to arrange Germans\) \* \(number of ways to arrange French\) \* \(number of ways to arrange Italians\) = 6 \* 720 \* 120 \* 6 = 6 \* 518400 = 3104400\. \(0\.99\)
Step 4
Therefore, the final answer is:3104400\. \(0\.99\)
## Appendix DAdditional Results
Figure 5:Average predicted error probability and ground‑truth error rate across relative step positions, computed over 1024 incorrect trajectories sampled from MATH split of Math\-Shepherd\. The PRM is trained on Math‑Shepherd with max pooling for 1 epoch\.Figure 6:F1 score on MATH split of ProcessBench during training the PRM on Math‑Shepherd with max pooling\.### D\.1Preliminary Experiments
In early experiments, we applied max pooling as the MIL aggregation mechanism, since it directly instantiates the Standard Multi\-Instance Assumption\. Using Qwen2\.5\-Math\-7B\-Instruct as the backbone, we trained a PRM on the Math\-Shepherd dataset\. However, the PRM rapidly degenerated and performed poorly on downstream evaluations\. Figure[5](https://arxiv.org/html/2606.27739#A4.F5)shows the average prediction values of the resulting PRM across different relative positions in erroneous reasoning trajectories \(0 denotes the first step, 1 the last step\)\. The predictions remain near zero for all early positions but rise sharply at the final step\. Figure[6](https://arxiv.org/html/2606.27739#A4.F6)shows that the performance first improves and then collapses during training\. This confirms our hypothesis that the strong dependence and redundancy among prefixes cause the classifier to quickly settle into degenerate solutions, failing to learn robust instance\-level classification\. In contrast, our proposed SWS pooling is provably Bayes consistent under mild assumptions and thus avoids such degenerate optima\.
### D\.2Full Results of ProcessBench Evaluation
We provide full results of ProcessBench evaluation in Table[4](https://arxiv.org/html/2606.27739#A3.T4)\.
### D\.3Case Studies
We show some qualatative examples taken from ProcessBench in Table 6\. The blue values represent the predicted probability of*incorrect*provided by the PRM trained withLCA\. The probability is high once encountering the first wrong step\.
## Appendix EExtension to Reasoning with Reflection and Backtracking
In the main text, we restrict our discussion to sequential reasoning without reflection or backtracking\. Under this setting, any logical error, once introduced, is inherited by all subsequent steps and ultimately leads to an incorrect final answer\. The Weakest Link principle therefore holds: a chain is correct iff every step is correct, and incorrect iff any step contains an error\.
For reasoning that allows reflection and backtracking, however, a later step may correct an earlier mistake\. Consequently, a trajectory that arrives at a correct final answer may temporarily contain intermediate errors, violating the Weakest Link principle in its strict sequential form\. Below, we show that by lifting the formulation from linear*prefixes*to graph\-structured*reasoning states*, sequential reasoning and reasoning with reflection can be treated within a unified framework, and our method applies seamlessly to both\.
#### Reasoning state\.
A*reasoning state*is a directed acyclic graph \(DAG\)
G=\(𝒱,ℰ\),G=\(\\mathcal\{V\},\\mathcal\{E\}\),where each nodev∈𝒱v\\in\\mathcal\{V\}represents a premise, an intermediate conclusion, or the final answer, and each directed edge\(u→v\)∈ℰ\(u\\to v\)\\in\\mathcal\{E\}indicates thatvvis logically derived fromuu\.
#### Correctness of reasoning states\.
A reasoning stateGGis said to contain an error if there exists a node whose stated result is factually false or cannot be validly inferred from its parent nodes\. We define the correctness label ofGGas
y\(G\)∈\{0,1\},y\(G\)\\in\\\{0,1\\\},where11denotes the presence of at least one logical error and0denotes an error‑free state\. Under this definition, correctness is monotone with respect to graph extension:
G1⊆G2⟹y\(G2\)≥y\(G1\),G\_\{1\}\\subseteq G\_\{2\}\\;\\Longrightarrow\\;y\(G\_\{2\}\)\\geq y\(G\_\{1\}\),\(7\)since any error present in a subgraphG1G\_\{1\}is inherited by every supergraphG2G\_\{2\}that extends it\.
#### Reasoning as graph construction\.
A reasoning process can be viewed as the step\-by\-step construction of a reasoning state, starting from the initial premises and adding, modifying, or deleting nodes and edges until a node representing the final answer is produced\.
For sequential reasoning without reflection, each step adds new nodes and edges to the current graph without removing any existing ones\. Hence, every prefix of the trajectory corresponds to a subgraph of the subsequent prefix, and the monotonicity of prefix correctness \(Eq\.[1](https://arxiv.org/html/2606.27739#S2.E1)\) follows immediately as a special case of Eq\.[7](https://arxiv.org/html/2606.27739#A5.E7)\.
For reasoning with reflection or backtracking, a step may revise or retract earlier conclusions, i\.e\., it may delete nodes or edges\. In this case, earlier reasoning states are not necessarily subgraphs of later ones\. Nevertheless, the monotonicity relationG1⊆G2⇒y\(G2\)≥y\(G1\)G\_\{1\}\\subseteq G\_\{2\}\\Rightarrow y\(G\_\{2\}\)\\geq y\(G\_\{1\}\)still holds for any pair whereG1G\_\{1\}is a subgraph ofG2G\_\{2\}\.
#### Applying our PRM to reasoning with reflection\.
Our method,LCA, trains the PRM on sequential reasoning trajectories and thus learns a classifierffthat maps a reasoning state to its error probability:
f\(G\)∈\[0,1\],f\(G\)\\in\[0,1\],with the objectivef\(G\)≈y\(G\)f\(G\)\\approx y\(G\)\. Given a trajectory with reflectionτ=⟨q,s1,…,sT⟩\\tau=\\langle q,s\_\{1\},\\dots,s\_\{T\}\\rangle, we first convert each prefixτ:t\\tau\_\{:t\}into its corresponding reasoning stateGtG\_\{t\}\(the DAG constructed up to steptt\) and linearizeGtG\_\{t\}into a sequential form, i\.e\., a prefix without backtracking\. We then evaluatef\(Gt\)f\(G\_\{t\}\)for each step\.
The step‑wise quality of reasoning can be assessed by the change in the corretness of reasoning states:
- •Iff\(Gt\)<f\(Gt−1\)f\(G\_\{t\}\)<f\(G\_\{t\-1\}\), stepttimproves the reasoning state \(e\.g\., by correcting an earlier mistake\)\.
- •Iff\(Gt\)\>f\(Gt−1\)f\(G\_\{t\}\)\>f\(G\_\{t\-1\}\), stepttintroduces or amplifies an error\.
- •Iff\(Gt\)≈f\(Gt−1\)f\(G\_\{t\}\)\\approx f\(G\_\{t\-1\}\), stepttis neutral\.
Thus, although trained solely on sequential trajectories without reflection, our PRM provides fine‑grained, per‑step feedback for reasoning with reflection and backtracking without any architectural modification\.
## Appendix FProof of Theorems
### F\.1Formulations and Assumptions
###### Definition F\.1\(Instance space\)\.
Let𝒳\\mathcal\{X\}denote the*instance space*\. Each instancex∈𝒳x\\in\\mathcal\{X\}is associated with a label
y\(x\)∈\{0,1\},y\(x\)\\in\\\{0,1\\\},which induces a partition of𝒳\\mathcal\{X\}into the*positive instance set*and the*negative instance set*:
𝒳\+:=\{x∈𝒳∣y\(x\)=1\},\\displaystyle\\mathcal\{X\}^\{\+\}:=\\\{x\\in\\mathcal\{X\}\\mid y\(x\)=1\\\},𝒳−:=\{x∈𝒳∣y\(x\)=0\}\.\\displaystyle\\mathcal\{X\}^\{\-\}:=\\\{x\\in\\mathcal\{X\}\\mid y\(x\)=0\\\}\.
###### Definition F\.2\(Bag space\)\.
A*bag*\(or multi\-instance\)BBis defined as a finite set of instances from𝒳\\mathcal\{X\}:
B=\(x1,x2,…,xn\),xi∈𝒳,1≤n≤N,B=\(x\_\{1\},x\_\{2\},\\dots,x\_\{n\}\),\\quad x\_\{i\}\\in\\mathcal\{X\},\\quad 1\\leq n\\leq N,whereNNis the maximum allowed bag size\. The*bag space*ℬ\\mathcal\{B\}is the set of all possible bags:
ℬ=⋃n=1N𝒳n\.\\mathcal\{B\}=\\bigcup\_\{n=1\}^\{N\}\\mathcal\{X\}^\{n\}\.Each bagBBis associated with a labely\(B\)∈\{0,1\}y\(B\)\\in\\\{0,1\\\}, defined via the instance labels:
y\(B\)=𝟏\(∃i∈\{1,…,n\}s\.t\.y\(xi\)=1\),y\(B\)=\\mathbf\{1\}\\Big\(\\exists i\\in\\\{1,\\dots,n\\\}\\text\{ s\.t\. \}y\(x\_\{i\}\)=1\\Big\),i\.e\., a bag is labeled positive if it contains at least one positive instance, and negative otherwise\.
###### Definition F\.3\(Bag distribution\)\.
Letℬ\\mathcal\{B\}be the bag space as defined above\. A*bag distribution*is a probability measureℙ\\mathbb\{P\}onℬ\\mathcal\{B\}that describes how bags are generated\. In general, the generation process can be described in two steps:
1. 1\.The bag sizennis drawn from a discrete distribution over\{1,2,…,N\}\\\{1,2,\\dots,N\\\}: whereNNis the maximum allowed bag size\.
2. 2\.Conditional on the bag sizenn, the instances\(x1,…,xn\)\(x\_\{1\},\\dots,x\_\{n\}\)are jointly drawn according to some distributionℙn\\mathbb\{P\}\_\{n\}on𝒳n\\mathcal\{X\}^\{n\}: \(x1,…,xn\)∼ℙn\.\(x\_\{1\},\\dots,x\_\{n\}\)\\sim\\mathbb\{P\}\_\{n\}\.
Therefore, the overall distribution of a bagB=\(x1,…,xn\)B=\(x\_\{1\},\\dots,x\_\{n\}\)is
ℙ\(B\)=PN\(n\)⋅ℙn\(x1,…,xn\),\\mathbb\{P\}\(B\)=P\_\{N\}\(n\)\\cdot\\mathbb\{P\}\_\{n\}\(x\_\{1\},\\dots,x\_\{n\}\),where no independence assumption on the instances is made; they may follow an arbitrary joint distribution\.
###### Definition F\.4\(Bag cover set \(Restated\)\)\.
LetS⊆𝒳S\\subseteq\\mathcal\{X\}be a set of instances\. The*bag cover set*ofSSis the set of all bags inℬ\\mathcal\{B\}that contain at least one instance fromSS:
𝒞\(S\):=\{B∈ℬ∣B∩S≠∅\}\.\\mathcal\{C\}\(S\):=\\\{B\\in\\mathcal\{B\}\\mid B\\cap S\\neq\\emptyset\\\}\.That is, a bagBBbelongs to𝒞\(S\)\\mathcal\{C\}\(S\)if and only if it contains at least one instance fromSS\.
Similarly, we define the*positive bag cover set*and*negative bag cover set*ofSSas follows:
𝒞\+\(S\):=\{B∈ℬ∣B∩S≠∅∧y\(B\)=1\};\\mathcal\{C\}^\{\+\}\(S\):=\\\{B\\in\\mathcal\{B\}\\mid B\\cap S\\neq\\emptyset\\wedge y\(B\)=1\\\};𝒞−\(S\):=\{B∈ℬ∣B∩S≠∅∧y\(B\)=0\}\.\\mathcal\{C\}^\{\-\}\(S\):=\\\{B\\in\\mathcal\{B\}\\mid B\\cap S\\neq\\emptyset\\wedge y\(B\)=0\\\}\.
A bag cover set𝒞\(S\)\\mathcal\{C\}\(S\)\(or𝒞\+\(S\),𝒞−\(S\)\\mathcal\{C\}^\{\+\}\(S\),\\mathcal\{C\}^\{\-\}\(S\)\) is called*non\-trivial*if its probability under the bag distributionℙ\\mathbb\{P\}is strictly positive:
ℙ\(𝒞\(S\)\)\>0\(orℙ\(𝒞\+\(S\)\)\>0,ℙ\(𝒞−\(S\)\)\>0\)\.\\mathbb\{P\}\(\\mathcal\{C\}\(S\)\)\>0\\quad\(\\text\{or \}\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S\)\)\>0,\\,\\mathbb\{P\}\(\\mathcal\{C\}^\{\-\}\(S\)\)\>0\)\.
###### Assumption F\.1\(MM\-Assumption \(Restated\)\)\.
There exists a constantM\>0M\>0such that for any set of negative instancesS⊆𝒳−S\\subseteq\\mathcal\{X\}^\{\-\}, if
ℙ\(𝒞\(S\)\)=ℙ\(𝒞\+\(S\)\)\+ℙ\(𝒞−\(S\)\)\>0,\\mathbb\{P\}\(\\mathcal\{C\}\(S\)\)=\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S\)\)\+\\mathbb\{P\}\(\\mathcal\{C\}^\{\-\}\(S\)\)\>0,then
ℙ\(𝒞\+\(S\)\)ℙ\(𝒞−\(S\)\)≤M,\\frac\{\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S\)\)\}\{\\mathbb\{P\}\(\\mathcal\{C\}^\{\-\}\(S\)\)\}\\leq M,whereℙ\(𝒞\+\(S\)\)\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S\)\)andℙ\(𝒞−\(S\)\)\\mathbb\{P\}\(\\mathcal\{C\}^\{\-\}\(S\)\)denote the probabilities of the positive and negative bag cover sets under the bag distribution, respectively\.
### F\.2Mathematical Properties of Softmax Weight Sum
We first give a few lemmas about the mathematical properties of the Softmax Weight Sum \(SWS\) function\.
###### Lemma F\.1\(Partial derivative of SWS\)\.
Let the temperature parameter beα\>0\\alpha\>0and let𝐩=\(p1,…,pn\)∈\[0,1\]n\\bm\{p\}=\(p\_\{1\},\\dots,p\_\{n\}\)\\in\[0,1\]^\{n\}\. Define the softmax weightswiw\_\{i\}and the softmax weighted sump^\\hat\{p\}as
p^\(𝒑\)=∑i=1nwipi,wi=epi/α∑j=1nepj/α\.\\hat\{p\}\(\\bm\{p\}\)=\\sum\_\{i=1\}^\{n\}w\_\{i\}p\_\{i\},\\qquad w\_\{i\}=\\frac\{e^\{p\_\{i\}/\\alpha\}\}\{\\sum\_\{j=1\}^\{n\}e^\{p\_\{j\}/\\alpha\}\}\.Then for anyi∈\{1,…,n\}i\\in\\\{1,\\dots,n\\\}the partial derivative admits the representation
∂p^∂pi=wiα\(α\+pi−p^\)\.\\frac\{\\partial\\hat\{p\}\}\{\\partial p\_\{i\}\}=\\frac\{w\_\{i\}\}\{\\alpha\}\\bigl\(\\alpha\+p\_\{i\}\-\\hat\{p\}\\bigr\)\.In particular, the sign of the derivative is completely determined by the sign of\(α\+pi−p^\)\(\\alpha\+p\_\{i\}\-\\hat\{p\}\):
sgn\(∂p^∂pi\)=sgn\(α\+pi−p^\)\.\\operatorname\{sgn\}\\left\(\\frac\{\\partial\\hat\{p\}\}\{\\partial p\_\{i\}\}\\right\)=\\operatorname\{sgn\}\\bigl\(\\alpha\+p\_\{i\}\-\\hat\{p\}\\bigr\)\.
###### Lemma F\.2\(Minimum of SWS underp1=εp\_\{1\}=\\varepsilon\)\.
Letα\>0\\alpha\>0,ε∈\(0,1\]\\varepsilon\\in\(0,1\], and letn∈ℕn\\in\\mathbb\{N\},n≥1n\\geq 1\. For any𝐩=\(p1,…,pn\)∈\[0,1\]n\\bm\{p\}=\(p\_\{1\},\\dots,p\_\{n\}\)\\in\[0,1\]^\{n\}withp1=εp\_\{1\}=\\varepsilon, define the softmax weighted sump^\(𝐩\)\\hat\{p\}\(\\bm\{p\}\)as in Lemma[F\.1](https://arxiv.org/html/2606.27739#A6.Thmtheorem1)\. Then the minimum ofp^\\hat\{p\}over all such𝐩\\bm\{p\}is
p^min=\{εeε/αeε/α\+n−1,ifε≤α\(u\+1\),ε−αu,ifε\>α\(u\+1\),\\hat\{p\}\_\{\\min\}=\\begin\{cases\}\\dfrac\{\\varepsilon\\,e^\{\\varepsilon/\\alpha\}\}\{e^\{\\varepsilon/\\alpha\}\+n\-1\},&\\text\{if \}\\varepsilon\\leq\\alpha\(u\+1\),\\\\\[16\.00008pt\] \\varepsilon\-\\alpha\\,u,&\\text\{if \}\\varepsilon\>\\alpha\(u\+1\),\\end\{cases\}whereu=W0\(n−1e\)u=W\_\{0\}\\\!\\big\(\\frac\{n\-1\}\{e\}\\big\)andW0W\_\{0\}is the principal branch of the LambertWWfunction\. Forn=1n=1the expression reduces top^min=ε\\hat\{p\}\_\{\\min\}=\\varepsilon\.
###### Proof\.
Fixp1=εp\_\{1\}=\\varepsilonand write
p^\(𝒑\)=εeε/α\+∑i=2npiepi/αeε/α\+∑i=2nepi/α\.\\hat\{p\}\(\\bm\{p\}\)=\\frac\{\\varepsilon e^\{\\varepsilon/\\alpha\}\+\\sum\_\{i=2\}^\{n\}p\_\{i\}e^\{p\_\{i\}/\\alpha\}\}\{e^\{\\varepsilon/\\alpha\}\+\\sum\_\{i=2\}^\{n\}e^\{p\_\{i\}/\\alpha\}\}\.By Lemma[F\.1](https://arxiv.org/html/2606.27739#A6.Thmtheorem1), for everyi≥2i\\geq 2,
∂p^∂pi=wiα\(α\+pi−p^\)\.\\frac\{\\partial\\hat\{p\}\}\{\\partial p\_\{i\}\}=\\frac\{w\_\{i\}\}\{\\alpha\}\(\\alpha\+p\_\{i\}\-\\hat\{p\}\)\.Sincewi\>0w\_\{i\}\>0, a minimizer has eachpip\_\{i\}either satisfyingpi=p^−αp\_\{i\}=\\hat\{p\}\-\\alphaor lying on the boundary\{0,1\}\\\{0,1\\\}\. If somepi=1p\_\{i\}=1, thenα\+1−p^\>0\\alpha\+1\-\\hat\{p\}\>0, so the derivative is positive; decreasing that coordinate would lowerp^\\hat\{p\}, contradicting minimality\. Hence everypi\(i≥2\)p\_\{i\}\\;\(i\\geq 2\)is either0or satisfiespi=p^−αp\_\{i\}=\\hat\{p\}\-\\alpha\.
Suppose a minimizer contains both0and an interior point\. Forpi=0p\_\{i\}=0the inward derivative must be non‑negative, givingp^≤α\\hat\{p\}\\leq\\alpha\. For an interior pointpjp\_\{j\}we havepj=p^−α\>0p\_\{j\}=\\hat\{p\}\-\\alpha\>0, yieldingp^\>α\\hat\{p\}\>\\alpha, a contradiction\. Therefore a minimizer consists either entirely of zeros or entirely of a common valuec∈\(0,1\)c\\in\(0,1\)obeyingc=p^−αc=\\hat\{p\}\-\\alpha\.
Case 1:p2=⋯=pn=0p\_\{2\}=\\dots=p\_\{n\}=0\.
p^I=εeε/αeε/α\+n−1\.\\hat\{p\}\_\{\\text\{I\}\}=\\frac\{\\varepsilon e^\{\\varepsilon/\\alpha\}\}\{e^\{\\varepsilon/\\alpha\}\+n\-1\}\.
Case 2:p2=⋯=pn=cp\_\{2\}=\\dots=p\_\{n\}=cwithc=p^−αc=\\hat\{p\}\-\\alpha\. Substituting into the definition ofp^\\hat\{p\}and rearranging gives
ε−c−ααe\(ε−c\)/α=n−1\.\\frac\{\\varepsilon\-c\-\\alpha\}\{\\alpha\}\\,e^\{\(\\varepsilon\-c\)/\\alpha\}=n\-1\.Letu=W0\(n−1e\)u=W\_\{0\}\\\!\\bigl\(\\frac\{n\-1\}\{e\}\\bigr\), i\.e\.ueu=n−1eue^\{u\}=\\frac\{n\-1\}\{e\}\. Then
c=ε−α\(u\+1\),p^II=ε−αu\.c=\\varepsilon\-\\alpha\(u\+1\),\\qquad\\hat\{p\}\_\{\\text\{II\}\}=\\varepsilon\-\\alpha u\.Feasibility of the interior solution requiresc\>0c\>0, which is equivalent toε\>α\(u\+1\)\\varepsilon\>\\alpha\(u\+1\)\.
Now define the function
φ\(ε\)=\{p^I,ε≤α\(u\+1\),p^II,ε\>α\(u\+1\)\.\\varphi\(\\varepsilon\)=\\begin\{cases\}\\hat\{p\}\_\{\\text\{I\}\},&\\varepsilon\\leq\\alpha\(u\+1\),\\\\\[2\.0pt\] \\hat\{p\}\_\{\\text\{II\}\},&\\varepsilon\>\\alpha\(u\+1\)\.\\end\{cases\}Both expressions equalα\\alphaatε=α\(u\+1\)\\varepsilon=\\alpha\(u\+1\), andφ\\varphiis strictly increasing on each interval \(the derivative ofp^I\\hat\{p\}\_\{\\text\{I\}\}is positive, whilep^II\\hat\{p\}\_\{\\text\{II\}\}has derivative11\)\. Thusφ\\varphiis continuous and increasing on\(0,1\]\(0,1\]\. Forε≤α\(u\+1\)\\varepsilon\\leq\\alpha\(u\+1\)the interior candidate is infeasible, while forε\>α\(u\+1\)\\varepsilon\>\\alpha\(u\+1\)the all‑zero candidate violates the necessary boundary conditionp^≤α\\hat\{p\}\\leq\\alpha, hence cannot be a local minimum\. Consequently the global minimum is preciselyφ\(ε\)\\varphi\(\\varepsilon\), which yields the claimed formula\. The casen=1n=1givesu=0u=0and both branches reduce top^min=ε\\hat\{p\}\_\{\\min\}=\\varepsilon\. ∎
###### Lemma F\.3\(Minimum of SWS undermaxipi≥ε\\max\_\{i\}p\_\{i\}\\geq\\varepsilon\)\.
Letα\>0\\alpha\>0,ε∈\(0,1\]\\varepsilon\\in\(0,1\], andn∈ℕn\\in\\mathbb\{N\},n≥1n\\geq 1\. For any𝐩=\(p1,…,pn\)∈\[0,1\]n\\bm\{p\}=\(p\_\{1\},\\dots,p\_\{n\}\)\\in\[0,1\]^\{n\}withmaxipi≥ε\\max\_\{i\}p\_\{i\}\\geq\\varepsilon, define the softmax weighted sump^\(𝐩\)\\hat\{p\}\(\\bm\{p\}\)as in Lemma[F\.1](https://arxiv.org/html/2606.27739#A6.Thmtheorem1)\. Then the minimum ofp^\\hat\{p\}over all such𝐩\\bm\{p\}is
p^min=\{εeε/αeε/α\+n−1,ifε≤α\(u\+1\),ε−αu,ifε\>α\(u\+1\),\\hat\{p\}\_\{\\min\}=\\begin\{cases\}\\dfrac\{\\varepsilon\\,e^\{\\varepsilon/\\alpha\}\}\{e^\{\\varepsilon/\\alpha\}\+n\-1\},&\\text\{if \}\\varepsilon\\leq\\alpha\(u\+1\),\\\\\[16\.00008pt\] \\varepsilon\-\\alpha\\,u,&\\text\{if \}\\varepsilon\>\\alpha\(u\+1\),\\end\{cases\}whereu=W0\(n−1e\)u=W\_\{0\}\\\!\\bigl\(\\frac\{n\-1\}\{e\}\\bigr\)andW0W\_\{0\}is the principal branch of the LambertWWfunction\. Forn=1n=1this reduces top^min=ε\\hat\{p\}\_\{\\min\}=\\varepsilon\.
###### Proof\.
Let’s first consider the minimum ofp^\\hat\{p\}under the constraintp1≥εp\_\{1\}\\geq\\varepsilon, i\.e\.,
min𝒑:p1≥εp^\(𝒑\)\.\\min\_\{\\bm\{p\}\\,:\\,p\_\{1\}\\geq\\varepsilon\}\\hat\{p\}\(\\bm\{p\}\)\.For any admissible𝒑\\bm\{p\}, setm=p1≥εm=p\_\{1\}\\geq\\varepsilon\. Consider the family of vectors with a fixed valuemmin the first coordinate and arbitrary remaining coordinates in\[0,1\]\[0,1\]\. The minimum ofp^\\hat\{p\}over this family is a functionφ\(m\)\\varphi\(m\), which by Lemma[F\.2](https://arxiv.org/html/2606.27739#A6.Thmtheorem2)equals
φ\(m\)\\displaystyle\\varphi\(m\)=min𝒑:p1=mp^\(𝒑\)\\displaystyle=\\min\_\{\\bm\{p\}\\,:\\,p\_\{1\}=m\}\\hat\{p\}\(\\bm\{p\}\)=\{mem/αem/α\+n−1,m≤α\(u\+1\),m−αu,m\>α\(u\+1\),\\displaystyle=\\begin\{cases\}\\dfrac\{m\\,e^\{m/\\alpha\}\}\{e^\{m/\\alpha\}\+n\-1\},&m\\leq\\alpha\(u\+1\),\\\\\[11\.99998pt\] m\-\\alpha\\,u,&m\>\\alpha\(u\+1\),\\end\{cases\}whereu=W0\(n−1e\)u=W\_\{0\}\\bigl\(\\frac\{n\-1\}\{e\}\\bigr\)\.
Nowφ\(m\)\\varphi\(m\)is strictly increasing form∈\(0,1\]m\\in\(0,1\]\. Indeed, form<α\(u\+1\)m<\\alpha\(u\+1\),
φ′\(m\)=w\[1\+mα\(1−w\)\]\>0,\\displaystyle\\varphi^\{\\prime\}\(m\)=w\\bigl\[1\+\\tfrac\{m\}\{\\alpha\}\(1\-w\)\\bigr\]\>0,w=em/αem/α\+n−1,\\displaystyle w=\\frac\{e^\{m/\\alpha\}\}\{e^\{m/\\alpha\}\+n\-1\},
while form\>α\(u\+1\)m\>\\alpha\(u\+1\)we haveφ′\(m\)=1\>0\\varphi^\{\\prime\}\(m\)=1\>0; the two branches meet continuously atm=α\(u\+1\)m=\\alpha\(u\+1\)where both giveφ=α\\varphi=\\alpha\. Henceφ\\varphiis non‑decreasing, and its minimum on the interval\[ε,1\]\[\\varepsilon,1\]is attained at the left endpointm=εm=\\varepsilon\. Consequently,
min𝒑:p1≥εp^\(𝒑\)=minm≥εφ\(m\)=φ\(ε\)\.\\min\_\{\\bm\{p\}\\,:\\,p\_\{1\}\\geq\\varepsilon\}\\hat\{p\}\(\\bm\{p\}\)=\\min\_\{m\\geq\\varepsilon\}\\varphi\(m\)=\\varphi\(\\varepsilon\)\.
By symmetry we note that the constraintp1≥εp\_\{1\}\\geq\\varepsilonandmaxpi≥ε\\max p\_\{i\}\\geq\\varepsilonare equivalent\. Therefore,
min𝒑:maxpi≥εp^\(𝒑\)=min𝒑:p1≥εp^\(𝒑\)=φ\(ε\)\.\\min\_\{\\bm\{p\}\\,:\\,\\max p\_\{i\}\\geq\\varepsilon\}\\hat\{p\}\(\\bm\{p\}\)=\\min\_\{\\bm\{p\}\\,:\\,p\_\{1\}\\geq\\varepsilon\}\\hat\{p\}\(\\bm\{p\}\)=\\varphi\(\\varepsilon\)\.This completes the proof\. ∎
###### Lemma F\.4\(Increment lower bound for SWS\)\.
Letα\>0\\alpha\>0,ε∈\(0,1\]\\varepsilon\\in\(0,1\], andN∈ℕN\\in\\mathbb\{N\}\(N≥1N\\geq 1\)\. For any𝐩=\(p1,…,pn\)∈\[0,1\]n\\bm\{p\}=\(p\_\{1\},\\dots,p\_\{n\}\)\\in\[0,1\]^\{n\}withn≤Nn\\leq Nandp1≤1−εp\_\{1\}\\leq 1\-\\varepsilon, fixp2,…,pnp\_\{2\},\\dots,p\_\{n\}and define𝐩′=\(1,p2,…,pn\)\\bm\{p^\{\\prime\}\}=\(1,p\_\{2\},\\dots,p\_\{n\}\)\. Define the softmax weighted sump^\(⋅\)\\hat\{p\}\(\\cdot\)as in Lemma[F\.1](https://arxiv.org/html/2606.27739#A6.Thmtheorem1)\. Then the incrementΔ=p^\(𝐩′\)−p^\(𝐩\)\\Delta=\\hat\{p\}\(\\bm\{p^\{\\prime\}\}\)\-\\hat\{p\}\(\\bm\{p\}\)admits the uniform positive lower bound
Δ≥δ\(ε,α,N\)\>0,\\Delta\\;\\geq\\;\\delta\(\\varepsilon,\\alpha,N\)\\;\>\\;0,where
δ\(ε,α,N\)\\displaystyle\\delta\(\\varepsilon,\\alpha,N\)=\\displaystyle=min\{11\+\(N−1\)e1/α,ε1\+\(N−1\)eε/α\}\.\\displaystyle\\min\\\!\\left\\\{\\frac\{1\}\{1\+\(N\-1\)e^\{1/\\alpha\}\},\\;\\frac\{\\varepsilon\}\{1\+\(N\-1\)e^\{\\varepsilon/\\alpha\}\}\\right\\\}\.
###### Proof\.
Writea=p1∈\[0,1−ε\]a=p\_\{1\}\\in\[0,1\-\\varepsilon\]and set
A=∑i=2npiepi/α,B=∑i=2nepi/αA=\\sum\_\{i=2\}^\{n\}p\_\{i\}e^\{p\_\{i\}/\\alpha\},\\qquad B=\\sum\_\{i=2\}^\{n\}e^\{p\_\{i\}/\\alpha\}\(withA=B=0A=B=0whenn=1n=1\)\. For anyx∈\[0,1\]x\\in\[0,1\]define the auxiliary function
h\(x\)=xex/α\+Aex/α\+B,h\(x\)=\\frac\{xe^\{x/\\alpha\}\+A\}\{e^\{x/\\alpha\}\+B\},so thatp^\(𝒑\)=h\(a\)\\hat\{p\}\(\\bm\{p\}\)=h\(a\)andp^\(𝒑′\)=h\(1\)\\hat\{p\}\(\\bm\{p\}^\{\\prime\}\)=h\(1\)\.
From Lemma[F\.1](https://arxiv.org/html/2606.27739#A6.Thmtheorem1),h′\(x\)=w1\(x\)α\(α\+x−h\(x\)\)h^\{\\prime\}\(x\)=\\frac\{w\_\{1\}\(x\)\}\{\\alpha\}\(\\alpha\+x\-h\(x\)\)wherew1\(x\)=ex/α/\(ex/α\+B\)\>0w\_\{1\}\(x\)=e^\{x/\\alpha\}/\(e^\{x/\\alpha\}\+B\)\>0\. Every interior critical point satisfiesα\+x=h\(x\)\\alpha\+x=h\(x\), and evaluating the second derivative there givesh′′=w1/α\>0h^\{\\prime\\prime\}=w\_\{1\}/\\alpha\>0; hence all interior critical points are strict local minima\. Hence the maximum ofhhon any subinterval must occur at an endpoint\.
Consequently, fora∈\[0,1−ε\]a\\in\[0,1\-\\varepsilon\]we haveh\(a\)≤max\{h\(0\),h\(1−ε\)\}h\(a\)\\leq\\max\\\{h\(0\),h\(1\-\\varepsilon\)\\\}, and therefore
Δ\\displaystyle\\Delta=h\(1\)−h\(a\)\\displaystyle=h\(1\)\-h\(a\)≥min\{h\(1\)−h\(0\),h\(1\)−h\(1−ε\)\}\.\\displaystyle\\geq\\min\\\!\\bigl\\\{h\(1\)\-h\(0\),\\;h\(1\)\-h\(1\-\\varepsilon\)\\bigr\\\}\.
We now bound the two differences from below\.
Boundingh\(1\)−h\(0\)h\(1\)\-h\(0\)\.A direct computation gives
h\(1\)−h\(0\)=e1/α\(1\+B−A\)\+A\(e1/α\+B\)\(1\+B\)\.h\(1\)\-h\(0\)=\\frac\{e^\{1/\\alpha\}\(1\+B\-A\)\+A\}\{\(e^\{1/\\alpha\}\+B\)\(1\+B\)\}\.Sincepi∈\[0,1\]p\_\{i\}\\in\[0,1\]impliesA≤BA\\leq Bande1/α\>1e^\{1/\\alpha\}\>1, the numerator decreases withAA\. Its minimum on the admissible region is attained atA=BA=B, yielding
h\(1\)−h\(0\)≥11\+B\.h\(1\)\-h\(0\)\\geq\\frac\{1\}\{1\+B\}\.Frompi≤1p\_\{i\}\\leq 1we haveB≤\(n−1\)e1/α≤\(N−1\)e1/αB\\leq\(n\-1\)e^\{1/\\alpha\}\\leq\(N\-1\)e^\{1/\\alpha\}, hence
h\(1\)−h\(0\)≥11\+\(N−1\)e1/α\.h\(1\)\-h\(0\)\\geq\\frac\{1\}\{1\+\(N\-1\)e^\{1/\\alpha\}\}\.
Boundingh\(1\)−h\(1−ε\)h\(1\)\-h\(1\-\\varepsilon\)\.Setx=1−εx=1\-\\varepsilon\. Algebraic simplification yieldsh\(1\)−h\(x\)=h\(1\)\-h\(x\)=
εex/α\(e1/α\+B\)\+\(B−A\)\(e1/α−ex/α\)\(e1/α\+B\)\(ex/α\+B\)\.\\frac\{\\varepsilon e^\{x/\\alpha\}\(e^\{1/\\alpha\}\+B\)\+\(B\-A\)\(e^\{1/\\alpha\}\-e^\{x/\\alpha\}\)\}\{\(e^\{1/\\alpha\}\+B\)\(e^\{x/\\alpha\}\+B\)\}\.SinceA≤BA\\leq Bande1/α\>ex/αe^\{1/\\alpha\}\>e^\{x/\\alpha\}, the second term in the numerator is non‑negative\. Hence the numerator is at leastεex/α\(e1/α\+B\)\\varepsilon e^\{x/\\alpha\}\(e^\{1/\\alpha\}\+B\), and we obtain the lower bound
h\(1\)−h\(x\)≥εex/αex/α\+B\.h\(1\)\-h\(x\)\\geq\\frac\{\\varepsilon e^\{x/\\alpha\}\}\{e^\{x/\\alpha\}\+B\}\.UsingB≤\(N−1\)e1/αB\\leq\(N\-1\)e^\{1/\\alpha\}andex/α=e\(1−ε\)/αe^\{x/\\alpha\}=e^\{\(1\-\\varepsilon\)/\\alpha\},
h\(1\)−h\(x\)≥ε1\+\(N−1\)eε/α\.h\(1\)\-h\(x\)\\geq\\frac\{\\varepsilon\}\{1\+\(N\-1\)e^\{\\varepsilon/\\alpha\}\}\.
Taking the minimum of these two bounds establishes the claimed inequality\. The strict positivity follows fromε\>0\\varepsilon\>0and the boundedness of all terms\. ∎
###### Corollary F\.5\(Maximum of SWS\)\.
Letα\>0\\alpha\>0,n≥1n\\geq 1, and fix arbitraryp2,…,pn∈\[0,1\]p\_\{2\},\\dots,p\_\{n\}\\in\[0,1\]\. Consider the function
φ\(x\)=p^\(x,p2,…,pn\),x∈\[0,1\],\\varphi\(x\)=\\hat\{p\}\(x,p\_\{2\},\\dots,p\_\{n\}\),\\qquad x\\in\[0,1\],wherep^\\hat\{p\}is the softmax weighted sum defined in Lemma[F\.1](https://arxiv.org/html/2606.27739#A6.Thmtheorem1)\. Thenφ\\varphiattains its maximum on\[0,1\]\[0,1\]uniquely atx=1x=1\.
###### Proof\.
For anyx∈\[0,1\)x\\in\[0,1\), setε=1−x\>0\\varepsilon=1\-x\>0\. Applying Lemma[F\.4](https://arxiv.org/html/2606.27739#A6.Thmtheorem4)withN=nN=n\(the actual size of the bag\) to the vector𝒑=\(x,p2,…,pn\)\\bm\{p\}=\(x,p\_\{2\},\\dots,p\_\{n\}\)gives
φ\(1\)−φ\(x\)\\displaystyle\\varphi\(1\)\-\\varphi\(x\)=p^\(1,p2,…,pn\)−p^\(x,p2,…,pn\)\\displaystyle=\\hat\{p\}\(1,p\_\{2\},\\dots,p\_\{n\}\)\-\\hat\{p\}\(x,p\_\{2\},\\dots,p\_\{n\}\)≥δ\(ε,α,n\)\>0\.\\displaystyle\\geq\\delta\(\\varepsilon,\\alpha,n\)\\;\>\\;0\.
Henceφ\(1\)\>φ\(x\)\\varphi\(1\)\>\\varphi\(x\)for everyx<1x<1, which meansx=1x=1is the strict global maximum\. ∎
### F\.3Proof of Theorem[3\.1](https://arxiv.org/html/2606.27739#S3.Thmtheorem1)
###### Lemma F\.6\(Loss decrease for correcting false negatives\)\.
Let𝒳\\mathcal\{X\}be the instance space,ℬ\\mathcal\{B\}the bag space, andℙ\\mathbb\{P\}a distribution overℬ\\mathcal\{B\}\. Letf:𝒳→\[0,1\]f:\\mathcal\{X\}\\to\[0,1\]be an instance‑level classifier\. For a bagB=\(x1,…,xn\)B=\(x\_\{1\},\\dots,x\_\{n\}\)with labely\(B\)∈\{0,1\}y\(B\)\\in\\\{0,1\\\}, define the softmax weighted sum and the expected loss as
p^\(B\)\\displaystyle\\hat\{p\}\(B\)=SWSα\(f\(x1\),…,f\(xn\)\),\\displaystyle=\\mathrm\{SWS\}\_\{\\alpha\}\\bigl\(f\(x\_\{1\}\),\\dots,f\(x\_\{n\}\)\\bigr\),ℒ\(f\)\\displaystyle\\mathcal\{L\}\(f\)=𝔼B∼ℙ\[LCE\(p^\(B\),y\(B\)\)\]\.\\displaystyle=\\mathbb\{E\}\_\{B\\sim\\mathbb\{P\}\}\\bigl\[L\_\{\\mathrm\{CE\}\}\\bigl\(\\hat\{p\}\(B\),y\(B\)\\bigr\)\\bigr\]\.LetS\+⊆𝒳\+S^\{\+\}\\subseteq\\mathcal\{X\}^\{\+\}be a set of positive instances satisfyingℙ\(𝒞\+\(S\+\)\)\>0\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\)\>0, and assumef\(x\)<1f\(x\)<1for allx∈S\+x\\in S^\{\+\}\. Define a new classifierf′f^\{\\prime\}by
f′\(x\)=\{1,x∈S\+,f\(x\),otherwise\.f^\{\\prime\}\(x\)=\\begin\{cases\}1,&x\\in S^\{\+\},\\\\ f\(x\),&\\text\{otherwise\}\.\\end\{cases\}Then the expected loss strictly decreases, i\.e\.ℒ\(f′\)<ℒ\(f\)\\mathcal\{L\}\(f^\{\\prime\}\)<\\mathcal\{L\}\(f\)\.
###### Proof\.
By Corollary[F\.5](https://arxiv.org/html/2606.27739#A6.Thmtheorem5), replacing anypi<1p\_\{i\}<1by11strictly increasesp^\\hat\{p\}\. Applying this repeatedly to allx∈S\+x\\in S^\{\+\}contained in a bagB∈𝒞\+\(S\+\)B\\in\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)givesp^′\(B\)\>p^\(B\)\\hat\{p\}^\{\\prime\}\(B\)\>\\hat\{p\}\(B\)\. Hence−logp^′\(B\)<−logp^\(B\)\-\\log\\hat\{p\}^\{\\prime\}\(B\)<\-\\log\\hat\{p\}\(B\)on𝒞\+\(S\+\)\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\. All other bags are unaffected becausef′f^\{\\prime\}differs fromffonly onS\+S^\{\+\}, which appears only in bags of𝒞\+\(S\+\)\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\. Therefore
ℒ\(f\)−ℒ\(f′\)=𝔼\[𝟏𝒞\+\(S\+\)\(logp^′−logp^\)\]\>0,\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)=\\mathbb\{E\}\\bigl\[\\mathbf\{1\}\_\{\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\}\\,\(\\log\\hat\{p\}^\{\\prime\}\-\\log\\hat\{p\}\)\\bigr\]\>0,since the integrand is positive on a set of positive probability\. Henceℒ\(f′\)<ℒ\(f\)\\mathcal\{L\}\(f^\{\\prime\}\)<\\mathcal\{L\}\(f\)\. ∎
###### Lemma F\.7\(Loss decrease for correcting false positives\)\.
Let𝒳,ℬ,ℙ,p^,ℒ\\mathcal\{X\},\\mathcal\{B\},\\mathbb\{P\},\\hat\{p\},\\mathcal\{L\}be as defined in Lemma[F\.6](https://arxiv.org/html/2606.27739#A6.Thmtheorem6)\. Suppose that Assumption[F\.1](https://arxiv.org/html/2606.27739#A6.Thmassumption1)\(theMM\-assumption\) holds\. LetNNbe the maximum bag size,u=W0\(N−1e\)u=W\_\{0\}\\\!\\bigl\(\\frac\{N\-1\}\{e\}\\bigr\), andMMthe constant from that assumption\. LetS−⊆𝒳−S^\{\-\}\\subseteq\\mathcal\{X\}^\{\-\}be a set of negative instances withℙ\(𝒞\(S−\)\)\>0\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\-\}\)\)\>0, and assume that classifierffsatisfies
f\(x\)\\displaystyle f\(x\)=1\\displaystyle=1\\;\\;∀x∈𝒳\+,\\displaystyle\\forall x\\in\\mathcal\{X\}^\{\+\},f\(x\)\\displaystyle f\(x\)∈\(0,1\]\\displaystyle\\in\(0,1\]\\;\\;∀x∈S−,\\displaystyle\\forall x\\in S^\{\-\},f\(x\)\\displaystyle f\(x\)=0\\displaystyle=0\\;\\;for all otherx∈𝒳−\.\\displaystyle\\text\{for all other \}x\\in\\mathcal\{X\}^\{\-\}\.
Definef′f^\{\\prime\}by
f′\(x\)=\{0,x∈S−,f\(x\),otherwise\.f^\{\\prime\}\(x\)=\\begin\{cases\}0,&x\\in S^\{\-\},\\\\\[2\.0pt\] f\(x\),&\\text\{otherwise\}\.\\end\{cases\}Then there exists a temperatureα\>0\\alpha\>0such thatℒ\(f′\)<ℒ\(f\)\\mathcal\{L\}\(f^\{\\prime\}\)<\\mathcal\{L\}\(f\)\.
###### Proof\.
Chooseα\>0\\alpha\>0small enough so that
12\(u\+1\)\\displaystyle\\frac\{1\}\{2\(u\+1\)\}\>α,\\displaystyle\>\\alpha,\(8\)−log\(α\(2u\+1\)\)\+Mlog\(1−αu\)\\displaystyle\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\+M\\log\(1\-\\alpha u\)\>0\.\\displaystyle\>0\.Suchα\\alphaexists because asα→0\+\\alpha\\to 0^\{\+\}the second term tends to0while the first diverges to\+∞\+\\infty\. Setε1=1−αu−α\\varepsilon\_\{1\}=1\-\\alpha u\-\\alphaand partitionS−S^\{\-\}according to the current predictions:
S1=\{x∈S−:f\(x\)\>ε1\},\\displaystyle S\_\{1\}=\\\{x\\in S^\{\-\}:f\(x\)\>\\varepsilon\_\{1\}\\\},S2=S−∖S1\(sof\(x\)≤ε1onS2\)\.\\displaystyle S\_\{2\}=S^\{\-\}\\setminus S\_\{1\}\\;\\;\(\\text\{so \}f\(x\)\\leq\\varepsilon\_\{1\}\\text\{ on \}S\_\{2\}\)\.
Only bags containing instances fromS−S^\{\-\}change whenffis replaced byf′f^\{\\prime\}\. We split those bags into four families:
A1\\displaystyle A\_\{1\}=𝒞\+\(S1\),\\displaystyle=\\mathcal\{C\}^\{\+\}\(S\_\{1\}\),A2\\displaystyle A\_\{2\}=𝒞−\(S1\),\\displaystyle=\\mathcal\{C\}^\{\-\}\(S\_\{1\}\),A3\\displaystyle A\_\{3\}=𝒞\+\(S2\)∖𝒞\+\(S1\),\\displaystyle=\\mathcal\{C\}^\{\+\}\(S\_\{2\}\)\\setminus\\mathcal\{C\}^\{\+\}\(S\_\{1\}\),A4\\displaystyle A\_\{4\}=𝒞−\(S2\)∖𝒞−\(S1\)\.\\displaystyle=\\mathcal\{C\}^\{\-\}\(S\_\{2\}\)\\setminus\\mathcal\{C\}^\{\-\}\(S\_\{1\}\)\.Denote byp^\(B\),p^′\(B\)\\hat\{p\}\(B\),\\hat\{p\}^\{\\prime\}\(B\)the softmax weighted sum underf,f′f,f^\{\\prime\}\. The difference of expected losses decomposes as
ℒ\(f\)−ℒ\(f′\)=Δ1\+Δ2\+Δ3\+Δ4,\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)=\\Delta\_\{1\}\+\\Delta\_\{2\}\+\\Delta\_\{3\}\+\\Delta\_\{4\},where
Δ1\\displaystyle\\Delta\_\{1\}=𝔼\[𝟏A1\(logp^′−logp^\)\],\\displaystyle=\\mathbb\{E\}\\bigl\[\\mathbf\{1\}\_\{A\_\{1\}\}\(\\log\\hat\{p\}^\{\\prime\}\-\\log\\hat\{p\}\)\\bigr\],Δ2\\displaystyle\\Delta\_\{2\}=𝔼\[𝟏A2\(log\(1−p^′\)−log\(1−p^\)\)\],\\displaystyle=\\mathbb\{E\}\\bigl\[\\mathbf\{1\}\_\{A\_\{2\}\}\\bigl\(\\log\(1\-\\hat\{p\}^\{\\prime\}\)\-\\log\(1\-\\hat\{p\}\)\\bigr\)\\bigr\],Δ3\\displaystyle\\Delta\_\{3\}=𝔼\[𝟏A3\(logp^′−logp^\)\],\\displaystyle=\\mathbb\{E\}\\bigl\[\\mathbf\{1\}\_\{A\_\{3\}\}\(\\log\\hat\{p\}^\{\\prime\}\-\\log\\hat\{p\}\)\\bigr\],Δ4\\displaystyle\\Delta\_\{4\}=𝔼\[𝟏A4\(log\(1−p^′\)−log\(1−p^\)\)\]\.\\displaystyle=\\mathbb\{E\}\\bigl\[\\mathbf\{1\}\_\{A\_\{4\}\}\\bigl\(\\log\(1\-\\hat\{p\}^\{\\prime\}\)\-\\log\(1\-\\hat\{p\}\)\\bigr\)\\bigr\]\.
BoundingΔ1\+Δ2\\Delta\_\{1\}\+\\Delta\_\{2\}\.Ifℙ\(A1∪A2\)=0\\mathbb\{P\}\(A\_\{1\}\\cup A\_\{2\}\)=0, thenΔ1=Δ2=0\\Delta\_\{1\}=\\Delta\_\{2\}=0\. Otherwise theMM\-assumption givesℙ\(A2\)\>0\\mathbb\{P\}\(A\_\{2\}\)\>0\. ForB∈A2B\\in A\_\{2\}, by Lemma[F\.3](https://arxiv.org/html/2606.27739#A6.Thmtheorem3)andn≤Nn\\leq N,
p^\(B\)≥ε1−αu≥1−2αu−α,\\hat\{p\}\(B\)\\geq\\varepsilon\_\{1\}\-\\alpha u\\geq 1\-2\\alpha u\-\\alpha,so1−p^\(B\)≤α\(2u\+1\)1\-\\hat\{p\}\(B\)\\leq\\alpha\(2u\+1\)\. Sincef′f^\{\\prime\}assigns0to all instances ofBB,p^′\(B\)=0\\hat\{p\}^\{\\prime\}\(B\)=0, yielding
log\(1−p^′\(B\)\)−log\(1−p^\(B\)\)\\displaystyle\\log\(1\-\\hat\{p\}^\{\\prime\}\(B\)\)\-\\log\(1\-\\hat\{p\}\(B\)\)=−log\(1−p^\(B\)\)\\displaystyle=\-\\log\(1\-\\hat\{p\}\(B\)\)≥−log\(α\(2u\+1\)\)\.\\displaystyle\\geq\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\.
Taking expectations gives
Δ2≥ℙ\(A2\)\(−log\(α\(2u\+1\)\)\)\.\\Delta\_\{2\}\\geq\\mathbb\{P\}\(A\_\{2\}\)\\bigl\(\-\\log\(\\alpha\(2u\+1\)\)\\bigr\)\.
ForB∈A1B\\in A\_\{1\}, Lemma[F\.3](https://arxiv.org/html/2606.27739#A6.Thmtheorem3)withε=1\\varepsilon=1impliesp^\(B\),p^′\(B\)≥1−αu\\hat\{p\}\(B\),\\hat\{p\}^\{\\prime\}\(B\)\\geq 1\-\\alpha u, hence
logp^′\(B\)−logp^\(B\)≥log\(1−αu\)\\log\\hat\{p\}^\{\\prime\}\(B\)\-\\log\\hat\{p\}\(B\)\\geq\\log\(1\-\\alpha u\)\(note thatlog\(1−αu\)<0\\log\(1\-\\alpha u\)<0\)\. Thus
Δ1≥ℙ\(A1\)log\(1−αu\)\.\\Delta\_\{1\}\\geq\\mathbb\{P\}\(A\_\{1\}\)\\log\(1\-\\alpha u\)\.
Combining the two bounds and using theMM\-assumptionℙ\(A1\)/ℙ\(A2\)≤M\\mathbb\{P\}\(A\_\{1\}\)/\\mathbb\{P\}\(A\_\{2\}\)\\leq Mtogether withlog\(1−αu\)<0\\log\(1\-\\alpha u\)<0, we obtain
Δ1\+Δ2\\displaystyle\\Delta\_\{1\}\+\\Delta\_\{2\}≥\\displaystyle\\geqℙ\(A2\)\[−log\(α\(2u\+1\)\)\+ℙ\(A1\)ℙ\(A2\)log\(1−αu\)\]\\displaystyle\\mathbb\{P\}\(A\_\{2\}\)\\Bigl\[\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\+\\tfrac\{\\mathbb\{P\}\(A\_\{1\}\)\}\{\\mathbb\{P\}\(A\_\{2\}\)\}\\log\(1\-\\alpha u\)\\Bigr\]≥\\displaystyle\\geqℙ\(A2\)\[−log\(α\(2u\+1\)\)\+Mlog\(1−αu\)\]\\displaystyle\\mathbb\{P\}\(A\_\{2\}\)\\Bigl\[\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\+M\\log\(1\-\\alpha u\)\\Bigr\]\>\\displaystyle\>0,\\displaystyle 0,where the final positivity is exactly the second condition in \([9](https://arxiv.org/html/2606.27739#A6.E9)\)\.
Positivity ofΔ3\\Delta\_\{3\}\.ForB∈A3B\\in A\_\{3\}, the bag contains a positive instance \(predicted as11\) and at least one instance fromS2S\_\{2\}withpx≤ε1p\_\{x\}\\leq\\varepsilon\_\{1\}\. Lemma[F\.3](https://arxiv.org/html/2606.27739#A6.Thmtheorem3)again givesp^\(B\)≥1−αu\\hat\{p\}\(B\)\\geq 1\-\\alpha u\. Using Lemma[F\.1](https://arxiv.org/html/2606.27739#A6.Thmtheorem1), for any suchpxp\_\{x\},
∂p^∂px=wα\(α\+px−p^\)≤wα\(α\+ε1−\(1−αu\)\)=0\.\\frac\{\\partial\\hat\{p\}\}\{\\partial p\_\{x\}\}=\\frac\{w\}\{\\alpha\}\\bigl\(\\alpha\+p\_\{x\}\-\\hat\{p\}\\bigr\)\\leq\\frac\{w\}\{\\alpha\}\\bigl\(\\alpha\+\\varepsilon\_\{1\}\-\(1\-\\alpha u\)\\bigr\)=0\.Hence loweringpxp\_\{x\}to0does not decreasep^\\hat\{p\}\. Repeating for allS2S\_\{2\}instances inBByieldsp^′\(B\)≥p^\(B\)\\hat\{p\}^\{\\prime\}\(B\)\\geq\\hat\{p\}\(B\), soΔ3≥0\\Delta\_\{3\}\\geq 0\.
Positivity ofΔ4\\Delta\_\{4\}\.ForB∈A4B\\in A\_\{4\},f′f^\{\\prime\}sets all predictions to0, sop^′\(B\)=0\\hat\{p\}^\{\\prime\}\(B\)=0\. Underff, at least oneS2S\_\{2\}instance has a positive prediction, thereforep^\(B\)\>0\\hat\{p\}\(B\)\>0\. Consequently,
log\(1−p^′\(B\)\)−log\(1−p^\(B\)\)\\displaystyle\\log\(1\-\\hat\{p\}^\{\\prime\}\(B\)\)\-\\log\(1\-\\hat\{p\}\(B\)\)=−log\(1−p^\(B\)\)\\displaystyle=\-\\log\(1\-\\hat\{p\}\(B\)\)\>0,\\displaystyle\>0,
which givesΔ4≥0\\Delta\_\{4\}\\geq 0, andΔ4\>0\\Delta\_\{4\}\>0wheneverℙ\(A4\)\>0\\mathbb\{P\}\(A\_\{4\}\)\>0\.
Combining the four terms\.Ifℙ\(A1∪A2\)\>0\\mathbb\{P\}\(A\_\{1\}\\cup A\_\{2\}\)\>0, we already haveΔ1\+Δ2\>0\\Delta\_\{1\}\+\\Delta\_\{2\}\>0whileΔ3,Δ4≥0\\Delta\_\{3\},\\Delta\_\{4\}\\geq 0, so the sum is strictly positive\. Ifℙ\(A1∪A2\)=0\\mathbb\{P\}\(A\_\{1\}\\cup A\_\{2\}\)=0, then all bags affected by the change belong toA3∪A4A\_\{3\}\\cup A\_\{4\}\. The non‑triviality ofS−S^\{\-\}impliesℙ\(A3∪A4\)=ℙ\(𝒞\(S−\)\)\>0\\mathbb\{P\}\(A\_\{3\}\\cup A\_\{4\}\)=\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\-\}\)\)\>0\. Applying theMM\-assumption toS2⊆𝒳−S\_\{2\}\\subseteq\\mathcal\{X\}^\{\-\}yieldsℙ\(A3\)/ℙ\(A4\)≤M\\mathbb\{P\}\(A\_\{3\}\)/\\mathbb\{P\}\(A\_\{4\}\)\\leq M\. Ifℙ\(A4\)=0\\mathbb\{P\}\(A\_\{4\}\)=0, this ratio would be infinite, a contradiction\. Henceℙ\(A4\)\>0\\mathbb\{P\}\(A\_\{4\}\)\>0and consequentlyΔ4\>0\\Delta\_\{4\}\>0, making the total difference positive\.
In either caseℒ\(f\)−ℒ\(f′\)\>0\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)\>0, i\.e\.ℒ\(f′\)<ℒ\(f\)\\mathcal\{L\}\(f^\{\\prime\}\)<\\mathcal\{L\}\(f\)\. ∎
###### Theorem F\.8\(Unique global minimizer \(Restated\)\)\.
Let𝒳,ℬ,ℙ,p^,ℒ\\mathcal\{X\},\\mathcal\{B\},\\mathbb\{P\},\\hat\{p\},\\mathcal\{L\}be as defined in Lemma[F\.6](https://arxiv.org/html/2606.27739#A6.Thmtheorem6)\. Suppose that Assumption[F\.1](https://arxiv.org/html/2606.27739#A6.Thmassumption1)\(theMM\-assumption\) holds\. LetNNbe the maximum bag size,u=W0\(N−1e\)u=W\_\{0\}\\\!\\bigl\(\\frac\{N\-1\}\{e\}\\bigr\), andMMthe constant from that assumption\. Then there exists a sufficiently small temperatureα\>0\\alpha\>0such that, up to a set of instancesSSwhose bag cover set has probability zero,ℒ\\mathcal\{L\}is uniquely minimized by the true label function:
f∗\(x\)=y\(x\)∈\{0,1\},∀x∉S\.f^\{\*\}\(x\)=y\(x\)\\in\\\{0,1\\\},\\qquad\\forall x\\notin S\.Equivalently, any global minimiserf∗f^\{\*\}must satisfyf∗\(x\)=y\(x\)f^\{\*\}\(x\)=y\(x\)for allx∈𝒳x\\in\\mathcal\{X\}, except possibly on a setS⊆𝒳S\\subseteq\\mathcal\{X\}withℙ\(𝒞\(S\)\)=0\\mathbb\{P\}\(\\mathcal\{C\}\(S\)\)=0\.
###### Proof\.
Choose the temperatureα\>0\\alpha\>0such that
α<12\(u\+1\),−log\(α\(2u\+1\)\)\+Mlog\(1−αu\)\>0\.\\begin\{split\}&\\alpha<\\frac\{1\}\{2\(u\+1\)\},\\\\ &\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\+M\\log\(1\-\\alpha u\)\>0\.\\end\{split\}\(9\)Suchα\\alphaexists because the left‑hand side of the second inequality tends to\+∞\+\\inftyasα→0\+\\alpha\\to 0^\{\+\}, while the first is an explicit positive upper bound\. We fix thisα\\alphafor the remainder of the proof; note that it depends only onNNandMM\.
Letffbe any global minimiser ofℒ\\mathcal\{L\}\. We show thatffmust coincide with the true labels everywhere except on a set with zero bag‑cover probability\.
Step 1\. Correction of false negatives\.SetS\+:=\{x∈𝒳\+∣f\(x\)<1\}S^\{\+\}:=\\\{x\\in\\mathcal\{X\}^\{\+\}\\mid f\(x\)<1\\\}\. Ifℙ\(𝒞\(S\+\)\)\>0\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\+\}\)\)\>0, then Lemma[F\.6](https://arxiv.org/html/2606.27739#A6.Thmtheorem6)\(which holds for anyα\>0\\alpha\>0, in particular for our chosenα\\alpha\) applied toS\+S^\{\+\}producesf1\(x\)=1f\_\{1\}\(x\)=1forx∈S\+x\\in S^\{\+\},f1\(x\)=f\(x\)f\_\{1\}\(x\)=f\(x\)otherwise, withℒ\(f1\)<ℒ\(f\)\\mathcal\{L\}\(f\_\{1\}\)<\\mathcal\{L\}\(f\), contradicting optimality\. Henceℙ\(𝒞\(S\+\)\)=0\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\+\}\)\)=0\.
Because the bag cover ofS\+S^\{\+\}has probability zero, modifyingffonS\+S^\{\+\}does not change the expected loss\. Define
f~\(x\):=\{1,x∈𝒳\+,f\(x\),x∈𝒳−\.\\tilde\{f\}\(x\):=\\begin\{cases\}1,&x\\in\\mathcal\{X\}^\{\+\},\\\\ f\(x\),&x\\in\\mathcal\{X\}^\{\-\}\.\\end\{cases\}Thenℒ\(f~\)=ℒ\(f\)\\mathcal\{L\}\(\\tilde\{f\}\)=\\mathcal\{L\}\(f\), sof~\\tilde\{f\}is also a global minimiser\.
Step 2\. Correction of false positives\.Consider the negative instances wheref~\\tilde\{f\}is strictly positive:S−:=\{x∈𝒳−∣f~\(x\)\>0\}S^\{\-\}:=\\\{x\\in\\mathcal\{X\}^\{\-\}\\mid\\tilde\{f\}\(x\)\>0\\\}\. Ifℙ\(𝒞\(S−\)\)=0\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\-\}\)\)=0, the claim already holds withS=S\+∪S−S=S^\{\+\}\\cup S^\{\-\}\. Assume thereforeℙ\(𝒞\(S−\)\)\>0\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\-\}\)\)\>0\. By construction,f~\\tilde\{f\}satisfies exactly the hypotheses of Lemma[F\.7](https://arxiv.org/html/2606.27739#A6.Thmtheorem7):f~\(x\)=1\\tilde\{f\}\(x\)=1on𝒳\+\\mathcal\{X\}^\{\+\},f~\(x\)∈\(0,1\]\\tilde\{f\}\(x\)\\in\(0,1\]onS−S^\{\-\}, andf~\(x\)=0\\tilde\{f\}\(x\)=0on𝒳−∖S−\\mathcal\{X\}^\{\-\}\\setminus S^\{\-\}\. Moreover, our chosenα\\alphafulfills the two inequalities that are shown \(in the proof of Lemma[F\.7](https://arxiv.org/html/2606.27739#A6.Thmtheorem7)\) to be sufficient for the strict decrease of the loss when the false positives inS−S^\{\-\}are set to0\. Consequently, the same construction applied tof~\\tilde\{f\}andS−S^\{\-\}with temperatureα\\alphayields a classifier
f~′\(x\)=\{0,x∈S−,f~\(x\),otherwise,\\tilde\{f\}^\{\\prime\}\(x\)=\\begin\{cases\}0,&x\\in S^\{\-\},\\\\ \\tilde\{f\}\(x\),&\\text\{otherwise\},\\end\{cases\}satisfyingℒ\(f~′\)<ℒ\(f~\)\\mathcal\{L\}\(\\tilde\{f\}^\{\\prime\}\)<\\mathcal\{L\}\(\\tilde\{f\}\)\. This contradicts the global optimality off~\\tilde\{f\}\(and hence offf\)\. Thusℙ\(𝒞\(S−\)\)=0\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\-\}\)\)=0as well\.
Conclusion\.LetS=S\+∪S−S=S^\{\+\}\\cup S^\{\-\}\. We haveℙ\(𝒞\(S\)\)=0\\mathbb\{P\}\(\\mathcal\{C\}\(S\)\)=0\. Forx∉Sx\\notin S, ifx∈𝒳\+x\\in\\mathcal\{X\}^\{\+\}thenf\(x\)=1f\(x\)=1\(otherwisexxwould belong toS\+S^\{\+\}\), and ifx∈𝒳−x\\in\\mathcal\{X\}^\{\-\}thenf\(x\)=0f\(x\)=0\(otherwisex∈S−x\\in S^\{\-\}\)\. Thusf\(x\)=y\(x\)f\(x\)=y\(x\)for allx∉Sx\\notin S, which is exactly the statement of the theorem\. ∎
### F\.4Proof of Theorem[3\.2](https://arxiv.org/html/2606.27739#S3.Thmtheorem2)
###### Lemma F\.9\(Quantitative loss decrease for correcting false negatives\)\.
Letα\>0\\alpha\>0andN∈ℕN\\in\\mathbb\{N\}\(N≥1N\\geq 1\) be the temperature and maximum bag size\. Let𝒳\\mathcal\{X\},ℬ\\mathcal\{B\},ℙ\\mathbb\{P\},p^\(⋅\)\\hat\{p\}\(\\cdot\)andℒ\\mathcal\{L\}be as in Lemma[F\.6](https://arxiv.org/html/2606.27739#A6.Thmtheorem6)\. Letffbe a classifier\. For a givenε∈\(0,1\]\\varepsilon\\in\(0,1\], letS\+⊆𝒳\+S^\{\+\}\\subseteq\\mathcal\{X\}^\{\+\}be a set of positive instances such thatf\(x\)≤1−εf\(x\)\\leq 1\-\\varepsilonfor allx∈S\+x\\in S^\{\+\}, andℙ\(𝒞\+\(S\+\)\)≥δ\>0\\mathbb\{P\}\\bigl\(\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\\bigr\)\\geq\\delta\>0\. Define a new classifierf′f^\{\\prime\}by
f′\(x\)=\{1,x∈S\+,f\(x\),otherwise\.f^\{\\prime\}\(x\)=\\begin\{cases\}1,&x\\in S^\{\+\},\\\\ f\(x\),&\\text\{otherwise\}\.\\end\{cases\}Then the expected loss decreases by at least a constant timesδ\\delta:
ℒ\(f\)−ℒ\(f′\)≥δ⋅ΛFN\(ε,α,N\)\>0,\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)\\;\\geq\\;\\delta\\cdot\\Lambda\_\{\\mathrm\{FN\}\}\(\\varepsilon,\\alpha,N\)\\;\>\\;0,where
ΛFN\(ε,α,N\)=log\\displaystyle\\Lambda\_\{\\mathrm\{FN\}\}\(\\varepsilon,\\alpha,N\)=\\log\(1\+min\{11\+\(N−1\)e1/α,ε1\+\(N−1\)eε/α\}\)\.\\displaystyle\\left\(1\+\\min\\\!\\left\\\{\\frac\{1\}\{1\+\(N\-1\)e^\{1/\\alpha\}\},\\;\\frac\{\\varepsilon\}\{1\+\(N\-1\)e^\{\\varepsilon/\\alpha\}\}\\right\\\}\\right\)\.
###### Proof\.
Denote byδ0=δ0\(ε,α,N\)\\delta\_\{0\}=\\delta\_\{0\}\(\\varepsilon,\\alpha,N\)the increment lower bound from Lemma[F\.4](https://arxiv.org/html/2606.27739#A6.Thmtheorem4), i\.e\.
δ0\\displaystyle\\delta\_\{0\}=min\{11\+\(N−1\)e1/α,ε1\+\(N−1\)eε/α\}\\displaystyle=\\min\\\!\\left\\\{\\frac\{1\}\{1\+\(N\-1\)e^\{1/\\alpha\}\},\\;\\frac\{\\varepsilon\}\{1\+\(N\-1\)e^\{\\varepsilon/\\alpha\}\}\\right\\\}\>0\.\\displaystyle\>0\.LetB=\(x1,…,xn\)∈𝒞\+\(S\+\)B=\(x\_\{1\},\\dots,x\_\{n\}\)\\in\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\. By definitionBBis a positive bag \(y\(B\)=1y\(B\)=1\) and contains at least one instancexi∈S\+x\_\{i\}\\in S^\{\+\}withf\(xi\)≤1−εf\(x\_\{i\}\)\\leq 1\-\\varepsilon\. Write𝒑=\(f\(x1\),…,f\(xn\)\)\\bm\{p\}=\(f\(x\_\{1\}\),\\dots,f\(x\_\{n\}\)\)and let𝒑′\\bm\{p\}^\{\\prime\}be the vector obtained after replacing every suchf\(xi\)f\(x\_\{i\}\)by11\. Repeated application of Lemma[F\.4](https://arxiv.org/html/2606.27739#A6.Thmtheorem4)shows that
p^′\(B\)−p^\(B\)≥δ0,\\hat\{p\}^\{\\prime\}\(B\)\-\\hat\{p\}\(B\)\\;\\geq\\;\\delta\_\{0\},because each single replacement increases the softmax weighted sum by at leastδ0\\delta\_\{0\}, and the sum is non‑decreasing under further replacements\.
Since the bag is positive, its contribution to the loss is−logp^\(B\)\-\\log\\hat\{p\}\(B\)underffand−logp^′\(B\)\-\\log\\hat\{p\}^\{\\prime\}\(B\)underf′f^\{\\prime\}\. The decrease in loss for this bag is
logp^′\(B\)p^\(B\)\\displaystyle\\log\\frac\{\\hat\{p\}^\{\\prime\}\(B\)\}\{\\hat\{p\}\(B\)\}=log\(1\+p^′\(B\)−p^\(B\)p^\(B\)\)\\displaystyle=\\log\\\!\\left\(1\+\\frac\{\\hat\{p\}^\{\\prime\}\(B\)\-\\hat\{p\}\(B\)\}\{\\hat\{p\}\(B\)\}\\right\)≥log\(1\+δ0\),\\displaystyle\\geq\\log\\\!\\left\(1\+\\delta\_\{0\}\\right\),where we usedp^\(B\)≤1\\hat\{p\}\(B\)\\leq 1and the monotonicity of the logarithm\.
Bags outside𝒞\+\(S\+\)\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)are left unchanged byf′f^\{\\prime\}, hence their loss contribution is identical\. Taking expectations,
ℒ\(f\)−ℒ\(f′\)\\displaystyle\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)=𝔼\[𝟏𝒞\+\(S\+\)\(B\)\(logp^′\(B\)p^\(B\)\)\]\\displaystyle=\\mathbb\{E\}\\Bigl\[\\mathbf\{1\}\_\{\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\}\(B\)\\,\\bigl\(\\log\\frac\{\\hat\{p\}^\{\\prime\}\(B\)\}\{\\hat\{p\}\(B\)\}\\bigr\)\\Bigr\]≥ℙ\(𝒞\+\(S\+\)\)⋅log\(1\+δ0\)\.\\displaystyle\\geq\\mathbb\{P\}\\bigl\(\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\\bigr\)\\cdot\\log\(1\+\\delta\_\{0\}\)\.
By hypothesisℙ\(𝒞\+\(S\+\)\)≥δ\\mathbb\{P\}\(\\mathcal\{C\}^\{\+\}\(S^\{\+\}\)\)\\geq\\delta, so
ℒ\(f\)−ℒ\(f′\)≥δ⋅log\(1\+δ0\)=δ⋅ΛFN\(ε,α,N\)\.\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)\\;\\geq\\;\\delta\\cdot\\log\(1\+\\delta\_\{0\}\)=\\delta\\cdot\\Lambda\_\{\\mathrm\{FN\}\}\(\\varepsilon,\\alpha,N\)\.The strict positivity follows fromδ0\>0\\delta\_\{0\}\>0andδ\>0\\delta\>0\. ∎
###### Lemma F\.10\(Quantitative loss decrease for correcting false positives\)\.
Let𝒳,ℬ,ℙ,p^,ℒ\\mathcal\{X\},\\mathcal\{B\},\\mathbb\{P\},\\hat\{p\},\\mathcal\{L\}be as in Lemma[F\.6](https://arxiv.org/html/2606.27739#A6.Thmtheorem6)\. Suppose Assumption[F\.1](https://arxiv.org/html/2606.27739#A6.Thmassumption1)\(theMM\-assumption\) holds\. LetNNbe the maximum bag size,u=W0\(N−1e\)u=W\_\{0\}\\bigl\(\\frac\{N\-1\}\{e\}\\bigr\), andMMthe constant from that assumption\. LetS−⊆𝒳−S^\{\-\}\\subseteq\\mathcal\{X\}^\{\-\}be a set of negative instances withℙ\(𝒞\(S−\)\)\>0\\mathbb\{P\}\(\\mathcal\{C\}\(S^\{\-\}\)\)\>0, and assume that the classifierffsatisfies
f\(x\)\\displaystyle f\(x\)=1\\displaystyle=1∀x∈𝒳\+,\\displaystyle\\forall x\\in\\mathcal\{X\}^\{\+\},f\(x\)\\displaystyle f\(x\)∈\(0,1\]\\displaystyle\\in\(0,1\]∀x∈S−,\\displaystyle\\forall x\\in S^\{\-\},f\(x\)\\displaystyle f\(x\)=0\\displaystyle=0for all otherx∈𝒳−\.\\displaystyle\\text\{for all other \}x\\in\\mathcal\{X\}^\{\-\}\.For a givenε∈\(0,1\]\\varepsilon\\in\(0,1\], define the subsetS~:=\{x∈S−∣f\(x\)≥ε\}\\tilde\{S\}:=\\\{x\\in S^\{\-\}\\mid f\(x\)\\geq\\varepsilon\\\}, and suppose that its bag cover set has probabilityδ:=ℙ\(𝒞\(S~\)\)\>0\\delta:=\\mathbb\{P\}\\bigl\(\\mathcal\{C\}\(\\tilde\{S\}\)\\bigr\)\>0\. Define the corrected classifier
f′\(x\)=\{0,x∈S−,f\(x\),otherwise\.f^\{\\prime\}\(x\)=\\begin\{cases\}0,&x\\in S^\{\-\},\\\\ f\(x\),&\\text\{otherwise\}\.\\end\{cases\}Then there exists a temperatureα\>0\\alpha\>0depending only onNNandMMsuch that
ℒ\(f\)−ℒ\(f′\)≥δ⋅ΛFP\(ε,α,N,M\)\>0,\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)\\;\\geq\\;\\delta\\cdot\\Lambda\_\{\\mathrm\{FP\}\}\(\\varepsilon,\\alpha,N,M\)\\;\>\\;0,whereΛFP\(ε,α,N,M\)=\\Lambda\_\{\\mathrm\{FP\}\}\(\\varepsilon,\\alpha,N,M\)=
11\+M⋅min\{\\displaystyle\\frac\{1\}\{1\+M\}\\cdot\\min\\Bigl\\\{−log\(α\(2u\+1\)\)\+Mlog\(1−αu\),\\displaystyle\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\+M\\log\(1\-\\alpha u\),−log\(1−p^min\(ε,α,N\)\)\},\\displaystyle\-\\log\\\!\\bigl\(1\-\\hat\{p\}\_\{\\min\}\(\\varepsilon,\\alpha,N\)\\bigr\)\\Bigr\\\},andp^min\(ε,α,N\)\\hat\{p\}\_\{\\min\}\(\\varepsilon,\\alpha,N\)is the minimum softmax weighted sum under the constraintmaxipi≥ε\\max\_\{i\}p\_\{i\}\\geq\\varepsilon, given by Lemma[F\.3](https://arxiv.org/html/2606.27739#A6.Thmtheorem3)\.
###### Proof\.
Chooseα\>0\\alpha\>0small enough so that
α<12\(u\+1\),\\displaystyle\\alpha<\\frac\{1\}\{2\(u\+1\)\},Λ0:=−log\(α\(2u\+1\)\)\+Mlog\(1−αu\)\>0\.\\displaystyle\\Lambda\_\{0\}:=\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\+M\\log\(1\-\\alpha u\)\>0\.Such anα\\alphaexists because asα→0\+\\alpha\\to 0^\{\+\}the second term tends to0while the first diverges to\+∞\+\\infty\. Setε1=1−αu−α\\varepsilon\_\{1\}=1\-\\alpha u\-\\alphaand partitionS−S^\{\-\}intoS1=\{x∈S−:f\(x\)\>ε1\}S\_\{1\}=\\\{x\\in S^\{\-\}:f\(x\)\>\\varepsilon\_\{1\}\\\}andS2=S−∖S1S\_\{2\}=S^\{\-\}\\setminus S\_\{1\}\.
Following the decomposition in Lemma[F\.7](https://arxiv.org/html/2606.27739#A6.Thmtheorem7), the change in expected risk is written as
ℒ\(f\)−ℒ\(f′\)=Δ1\+Δ2\+Δ3\+Δ4,\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)=\\Delta\_\{1\}\+\\Delta\_\{2\}\+\\Delta\_\{3\}\+\\Delta\_\{4\},where the four terms correspond to the bag families
A1\\displaystyle A\_\{1\}=𝒞\+\(S1\),\\displaystyle=\\mathcal\{C\}^\{\+\}\(S\_\{1\}\),A2\\displaystyle A\_\{2\}=𝒞−\(S1\),\\displaystyle=\\mathcal\{C\}^\{\-\}\(S\_\{1\}\),A3\\displaystyle A\_\{3\}=𝒞\+\(S2\)∖𝒞\+\(S1\),\\displaystyle=\\mathcal\{C\}^\{\+\}\(S\_\{2\}\)\\setminus\\mathcal\{C\}^\{\+\}\(S\_\{1\}\),A4\\displaystyle A\_\{4\}=𝒞−\(S2\)∖𝒞−\(S1\),\\displaystyle=\\mathcal\{C\}^\{\-\}\(S\_\{2\}\)\\setminus\\mathcal\{C\}^\{\-\}\(S\_\{1\}\),defined with respect toS1,S2S\_\{1\},S\_\{2\}\.
Bounds forΔ1,Δ2,Δ3\\Delta\_\{1\},\\Delta\_\{2\},\\Delta\_\{3\}\.Exactly as in the proof of Lemma[F\.7](https://arxiv.org/html/2606.27739#A6.Thmtheorem7), the conditions onffguarantee
Δ1\+Δ2\\displaystyle\\Delta\_\{1\}\+\\Delta\_\{2\}≥ℙ\(A2\)Λ0,\\displaystyle\\geq\\mathbb\{P\}\(A\_\{2\}\)\\,\\Lambda\_\{0\},\(10\)Δ3\\displaystyle\\Delta\_\{3\}≥0\.\\displaystyle\\geq 0\.\(11\)Sinceℙ\(A2\)≥ℙ\(A2∩𝒞−\(S~\)\)\\mathbb\{P\}\(A\_\{2\}\)\\geq\\mathbb\{P\}\(A\_\{2\}\\cap\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\)andΛ0\>0\\Lambda\_\{0\}\>0, we further have
Δ1\+Δ2≥ℙ\(A2∩𝒞−\(S~\)\)Λ0\.\\Delta\_\{1\}\+\\Delta\_\{2\}\\;\\geq\\;\\mathbb\{P\}\\bigl\(A\_\{2\}\\cap\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\\bigr\)\\,\\Lambda\_\{0\}\.\(12\)
Bound forΔ4\\Delta\_\{4\}\.As analysed in Lemma[F\.7](https://arxiv.org/html/2606.27739#A6.Thmtheorem7), for anyB∈A4B\\in A\_\{4\},
log\(1−p^′\(B\)\)−log\(1−p^\(B\)\)\\displaystyle\\log\(1\-\\hat\{p\}^\{\\prime\}\(B\)\)\-\\log\(1\-\\hat\{p\}\(B\)\)=−log\(1−p^\(B\)\)\\displaystyle=\-\\log\(1\-\\hat\{p\}\(B\)\)\>0\.\\displaystyle\>0\.HenceΔ4=𝔼\[𝟏A4\(B\)\(−log\(1−p^\(B\)\)\)\]≥0\\Delta\_\{4\}=\\mathbb\{E\}\[\\mathbf\{1\}\_\{A\_\{4\}\}\(B\)\\,\(\-\\log\(1\-\\hat\{p\}\(B\)\)\)\]\\geq 0\.
Now consider anyB∈A4∩𝒞−\(S~\)B\\in A\_\{4\}\\cap\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\. Such a bag additionally contains at least one instance fromS~\\tilde\{S\}, somaxx∈Bf\(x\)≥ε\\max\_\{x\\in B\}f\(x\)\\geq\\varepsilon\. Applying Lemma[F\.3](https://arxiv.org/html/2606.27739#A6.Thmtheorem3)to the predictions onBBgives
p^\(B\)≥p^min\(ε,α,n\)≥p^min\(ε,α,N\),\\hat\{p\}\(B\)\\;\\geq\\;\\hat\{p\}\_\{\\min\}\(\\varepsilon,\\alpha,n\)\\;\\geq\\;\\hat\{p\}\_\{\\min\}\(\\varepsilon,\\alpha,N\),wheren≤Nn\\leq Nis the size ofBBandp^min\\hat\{p\}\_\{\\min\}is non‑increasing innn\. Consequently,
−log\(1−p^\(B\)\)≥−log\(1−p^min\(ε,α,N\)\)\>0\.\-\\log\(1\-\\hat\{p\}\(B\)\)\\;\\geq\\;\-\\log\\\!\\bigl\(1\-\\hat\{p\}\_\{\\min\}\(\\varepsilon,\\alpha,N\)\\bigr\)\>0\.Restricting the expectation definingΔ4\\Delta\_\{4\}to the subsetA4∩𝒞−\(S~\)A\_\{4\}\\cap\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)yields
Δ4≥ℙ\(A4∩𝒞−\(S~\)\)Λ1,\\displaystyle\\Delta\_\{4\}\\;\\geq\\;\\mathbb\{P\}\(A\_\{4\}\\cap\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\)\\,\\Lambda\_\{1\},\(13\)Λ1:=−log\(1−p^min\(ε,α,N\)\)\.\\displaystyle\\Lambda\_\{1\}=\-\\log\\\!\\bigl\(1\-\\hat\{p\}\_\{\\min\}\(\\varepsilon,\\alpha,N\)\\bigr\)\.
Combining the estimates\.From \([12](https://arxiv.org/html/2606.27739#A6.E12)\), \([11](https://arxiv.org/html/2606.27739#A6.E11)\) and \([13](https://arxiv.org/html/2606.27739#A6.E13)\) we obtain
ℒ\(f\)−ℒ\(f′\)\\displaystyle\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\\prime\}\)≥\\displaystyle\\geqmin\{Λ0,Λ1\}\(ℙ\(A2∩𝒞−\(S~\)\)\+ℙ\(A4∩𝒞−\(S~\)\)\)\\displaystyle\\min\\\{\\Lambda\_\{0\},\\Lambda\_\{1\}\\\}\\,\\bigl\(\\mathbb\{P\}\(A\_\{2\}\\cap\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\)\+\\mathbb\{P\}\(A\_\{4\}\\cap\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\)\\bigr\)=\\displaystyle=min\{Λ0,Λ1\}ℙ\(𝒞−\(S~\)\)\.\\displaystyle\\min\\\{\\Lambda\_\{0\},\\Lambda\_\{1\}\\\}\\,\\mathbb\{P\}\\bigl\(\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\\bigr\)\.
Finally, theMM\-assumption applied toS~⊆𝒳−\\tilde\{S\}\\subseteq\\mathcal\{X\}^\{\-\}gives
ℙ\(𝒞\(S~\)\)≤\(1\+M\)ℙ\(𝒞−\(S~\)\),\\mathbb\{P\}\\bigl\(\\mathcal\{C\}\(\\tilde\{S\}\)\\bigr\)\\;\\leq\\;\(1\+M\)\\,\\mathbb\{P\}\\bigl\(\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\\bigr\),i\.e\.ℙ\(𝒞−\(S~\)\)≥δ/\(1\+M\)\\mathbb\{P\}\(\\mathcal\{C\}^\{\-\}\(\\tilde\{S\}\)\)\\geq\\delta/\(1\+M\)\. Substituting back,
ℒ\(f\)−ℒ\(f∗\)\\displaystyle\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\*\}\)≥δ1\+Mmin\{Λ0,Λ1\}\\displaystyle\\geq\\frac\{\\delta\}\{1\+M\}\\,\\min\\\{\\Lambda\_\{0\},\\Lambda\_\{1\}\\\}=δ⋅ΛFP\(ε,α,N,M\)\>0,\\displaystyle=\\delta\\cdot\\Lambda\_\{\\mathrm\{FP\}\}\(\\varepsilon,\\alpha,N,M\)\\;\>\\;0,which completes the proof\. ∎
###### Theorem F\.11\(Convergence in probability \(Restated\)\)\.
Let Assumption[3\.1](https://arxiv.org/html/2606.27739#S3.Thmassumption1)hold\. LetNNbe the maximum allowed bag size,u=W0\(N−1e\)u=W\_\{0\}\\bigl\(\\frac\{N\-1\}\{e\}\\bigr\), andMMthe constant from theMM\-assumption\. Letf∗\(x\)=y\(x\)f^\{\*\}\(x\)=y\(x\)be the Bayes optimal classifier, and let\{fl\}\\\{f\_\{l\}\\\}be a sequence of instance classifiers\. Define the expected lossℒ\(f\)\\mathcal\{L\}\(f\)as in Lemma[F\.6](https://arxiv.org/html/2606.27739#A6.Thmtheorem6)\. Then there exists a sufficiently small temperatureα\>0\\alpha\>0such that, ifliml→∞ℒ\(fl\)=ℒ\(f∗\)\\lim\_\{l\\to\\infty\}\\mathcal\{L\}\(f\_\{l\}\)=\\mathcal\{L\}\(f^\{\*\}\), then\{fl\}\\\{f\_\{l\}\\\}converges tof∗f^\{\*\}in probability\. Equivalently, for anyε,δ\>0\\varepsilon,\\delta\>0,
liml→∞ℙ\(𝒞\(\{x:\|fl\(x\)−f∗\(x\)\|≥ε\}\)\)=0\.\\lim\_\{l\\to\\infty\}\\mathbb\{P\}\\Bigl\(\\mathcal\{C\}\\bigl\(\\\{x:\|f\_\{l\}\(x\)\-f^\{\*\}\(x\)\|\\geq\\varepsilon\\\}\\bigr\)\\Bigr\)=0\.
###### Proof\.
Choose a temperatureα\>0\\alpha\>0small enough to satisfy both
α\\displaystyle\\alpha<12\(u\+1\),\\displaystyle<\\frac\{1\}\{2\(u\+1\)\},\(14\)Λ0\\displaystyle\\Lambda\_\{0\}:=−log\(α\(2u\+1\)\)\+Mlog\(1−αu\)\>0\.\\displaystyle:=\-\\log\\\!\\bigl\(\\alpha\(2u\+1\)\\bigr\)\+M\\log\(1\-\\alpha u\)\>0\.Suchα\\alphaexists becauseΛ0→\+∞\\Lambda\_\{0\}\\to\+\\inftyasα→0\+\\alpha\\to 0^\{\+\}while the first bound is an explicit positive constant\. In the sequel we work with this fixedα\\alpha\. Define the two positive constants
Λ1\(ε\)\\displaystyle\\Lambda\_\{1\}\(\\varepsilon\):=ΛFN\(ε,α,N\),\\displaystyle:=\\Lambda\_\{\\mathrm\{FN\}\}\(\\varepsilon,\\alpha,N\),Λ2\(ε\)\\displaystyle\\Lambda\_\{2\}\(\\varepsilon\):=ΛFP\(ε,α,N,M\),\\displaystyle:=\\Lambda\_\{\\mathrm\{FP\}\}\(\\varepsilon,\\alpha,N,M\),whereΛFN\\Lambda\_\{\\mathrm\{FN\}\}andΛFP\\Lambda\_\{\\mathrm\{FP\}\}are given by Lemma[F\.9](https://arxiv.org/html/2606.27739#A6.Thmtheorem9)and Lemma[F\.10](https://arxiv.org/html/2606.27739#A6.Thmtheorem10)respectively\. BothΛ1\(ε\)\\Lambda\_\{1\}\(\\varepsilon\)andΛ2\(ε\)\\Lambda\_\{2\}\(\\varepsilon\)are strictly positive for everyε∈\(0,1\]\\varepsilon\\in\(0,1\]\.
Fix arbitraryε,δ\>0\\varepsilon,\\delta\>0and set
ξ:=δ⋅min\{Λ1\(ε\),Λ2\(ε\)\}\>0\.\\xi:=\\delta\\cdot\\min\\\{\\Lambda\_\{1\}\(\\varepsilon\),\\,\\Lambda\_\{2\}\(\\varepsilon\)\\\}\>0\.We show that for any classifierff,
ℙ\(𝒞\(\{x:\|f\(x\)−f∗\(x\)\|≥ε\}\)\)≥δ\\displaystyle\\mathbb\{P\}\\Bigl\(\\mathcal\{C\}\\bigl\(\\\{x:\|f\(x\)\-f^\{\*\}\(x\)\|\\geq\\varepsilon\\\}\\bigr\)\\Bigr\)\\geq\\delta\(15\)⟹ℒ\(f\)−ℒ\(f∗\)≥ξ\.\\displaystyle\\Longrightarrow\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\*\}\)\\geq\\xi\.Once this implication is proved, the convergence follows immediately: ifℒ\(fl\)→ℒ\(f∗\)\\mathcal\{L\}\(f\_\{l\}\)\\to\\mathcal\{L\}\(f^\{\*\}\), then for sufficiently largellwe haveℒ\(fl\)−ℒ\(f∗\)<ξ\\mathcal\{L\}\(f\_\{l\}\)\-\\mathcal\{L\}\(f^\{\*\}\)<\\xi, which forces the left‑hand side of \([15](https://arxiv.org/html/2606.27739#A6.E15)\) to be smaller thanδ\\delta\. Hence the bag‑cover probability of theε\\varepsilon\-deviating instances tends to0\.
Proof of \([15](https://arxiv.org/html/2606.27739#A6.E15)\)\.Letffbe a classifier withℙ\(𝒞\(S\)\)≥δ\\mathbb\{P\}\(\\mathcal\{C\}\(S\)\)\\geq\\delta, whereS:=\{x:\|f\(x\)−f∗\(x\)\|≥ε\}S:=\\\{x:\|f\(x\)\-f^\{\*\}\(x\)\|\\geq\\varepsilon\\\}\. DecomposeSSinto false negatives and false positives:
SFN\\displaystyle S\_\{\\mathrm\{FN\}\}:=\{x∈𝒳\+:f\(x\)≤1−ε\},\\displaystyle:=\\\{x\\in\\mathcal\{X\}^\{\+\}:f\(x\)\\leq 1\-\\varepsilon\\\},SFP\\displaystyle S\_\{\\mathrm\{FP\}\}:=\{x∈𝒳−:f\(x\)≥ε\}\.\\displaystyle:=\\\{x\\in\\mathcal\{X\}^\{\-\}:f\(x\)\\geq\\varepsilon\\\}\.ClearlyS=SFN∪SFPS=S\_\{\\mathrm\{FN\}\}\\cup S\_\{\\mathrm\{FP\}\}and the union is disjoint\. Denoteδ1=ℙ\(𝒞\(SFN\)\)\\delta\_\{1\}=\\mathbb\{P\}\(\\mathcal\{C\}\(S\_\{\\mathrm\{FN\}\}\)\)andδ2=ℙ\(𝒞\(SFP\)\)\\delta\_\{2\}=\\mathbb\{P\}\(\\mathcal\{C\}\(S\_\{\\mathrm\{FP\}\}\)\)\. By subadditivity ofℙ\(𝒞\(⋅\)\)\\mathbb\{P\}\(\\mathcal\{C\}\(\\cdot\)\),
δ≤ℙ\(𝒞\(S\)\)≤δ1\+δ2\.\\delta\\leq\\mathbb\{P\}\(\\mathcal\{C\}\(S\)\)\\leq\\delta\_\{1\}\+\\delta\_\{2\}\.
We now modifyffto the Bayes classifierf∗f^\{\*\}in three steps and track the loss decrease\.
Step 1: Correct significant false negatives\.Definef1f\_\{1\}by settingf1\(x\)=1f\_\{1\}\(x\)=1forx∈SFNx\\in S\_\{\\mathrm\{FN\}\}andf1\(x\)=f\(x\)f\_\{1\}\(x\)=f\(x\)otherwise\. Ifδ1\>0\\delta\_\{1\}\>0, Lemma[F\.9](https://arxiv.org/html/2606.27739#A6.Thmtheorem9)\(applicable becausef\(x\)≤1−εf\(x\)\\leq 1\-\\varepsilononSFNS\_\{\\mathrm\{FN\}\}\) yields
ℒ\(f\)−ℒ\(f1\)≥δ1Λ1\(ε\)\.\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f\_\{1\}\)\\;\\geq\\;\\delta\_\{1\}\\,\\Lambda\_\{1\}\(\\varepsilon\)\.Ifδ1=0\\delta\_\{1\}=0the inequality trivially holds as0≥00\\geq 0\.
Step 2: Correct remaining false negatives\.Definef2f\_\{2\}byf2\(x\)=1f\_\{2\}\(x\)=1for allx∈𝒳\+x\\in\\mathcal\{X\}^\{\+\}andf2\(x\)=f1\(x\)f\_\{2\}\(x\)=f\_\{1\}\(x\)forx∈𝒳−x\\in\\mathcal\{X\}^\{\-\}\. Sincef2f\_\{2\}only raises predictions of positive instances from possibly values\>1−ε\>1\-\\varepsilonto11, Lemma[F\.6](https://arxiv.org/html/2606.27739#A6.Thmtheorem6)impliesℒ\(f2\)≤ℒ\(f1\)\\mathcal\{L\}\(f\_\{2\}\)\\leq\\mathcal\{L\}\(f\_\{1\}\)\. Hence
ℒ\(f\)−ℒ\(f2\)≥δ1Λ1\(ε\)\.\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f\_\{2\}\)\\;\\geq\\;\\delta\_\{1\}\\,\\Lambda\_\{1\}\(\\varepsilon\)\.
Step 3: Correct false positives\.Nowf2f\_\{2\}satisfies precisely the prerequisites of Lemma[F\.10](https://arxiv.org/html/2606.27739#A6.Thmtheorem10):f2\(x\)=1f\_\{2\}\(x\)=1on𝒳\+\\mathcal\{X\}^\{\+\},f2\(x\)∈\(0,1\]f\_\{2\}\(x\)\\in\(0,1\]onS−:=\{x∈𝒳−:f2\(x\)\>0\}S^\{\-\}:=\\\{x\\in\\mathcal\{X\}^\{\-\}:f\_\{2\}\(x\)\>0\\\}, andf2\(x\)=0f\_\{2\}\(x\)=0on the rest of𝒳−\\mathcal\{X\}^\{\-\}\. The setSFPS\_\{\\mathrm\{FP\}\}is exactlyS~\\tilde\{S\}in that lemma, and its bag‑cover probability isδ2\\delta\_\{2\}\. Applying Lemma[F\.10](https://arxiv.org/html/2606.27739#A6.Thmtheorem10)with the already chosenα\\alpha\(which satisfies \([14](https://arxiv.org/html/2606.27739#A6.E14)\)\) gives
ℒ\(f2\)−ℒ\(f′\)≥δ2Λ2\(ε\),\\mathcal\{L\}\(f\_\{2\}\)\-\\mathcal\{L\}\(f^\{\\prime\}\)\\;\\geq\\;\\delta\_\{2\}\\,\\Lambda\_\{2\}\(\\varepsilon\),wheref′f^\{\\prime\}is the classifier obtained by setting all instances inS−S^\{\-\}to0\. Note thatf′f^\{\\prime\}assigns11to all positive instances and0to all negative ones, i\.e\.f′=f∗f^\{\\prime\}=f^\{\*\}\. Henceℒ\(f′\)=ℒ\(f∗\)\\mathcal\{L\}\(f^\{\\prime\}\)=\\mathcal\{L\}\(f^\{\*\}\)\.
Combining the three steps\.Summing the inequalities,
ℒ\(f\)−ℒ\(f∗\)\\displaystyle\\mathcal\{L\}\(f\)\-\\mathcal\{L\}\(f^\{\*\}\)≥δ1Λ1\(ε\)\+δ2Λ2\(ε\)\\displaystyle\\;\\geq\\;\\delta\_\{1\}\\,\\Lambda\_\{1\}\(\\varepsilon\)\+\\delta\_\{2\}\\,\\Lambda\_\{2\}\(\\varepsilon\)≥\(δ1\+δ2\)min\{Λ1\(ε\),Λ2\(ε\)\}\\displaystyle\\;\\geq\\;\(\\delta\_\{1\}\+\\delta\_\{2\}\)\\,\\min\\\{\\Lambda\_\{1\}\(\\varepsilon\),\\Lambda\_\{2\}\(\\varepsilon\)\\\}≥δmin\{Λ1\(ε\),Λ2\(ε\)\}\\displaystyle\\;\\geq\\;\\delta\\,\\min\\\{\\Lambda\_\{1\}\(\\varepsilon\),\\Lambda\_\{2\}\(\\varepsilon\)\\\}=ξ\.\\displaystyle=\\xi\.This establishes \([15](https://arxiv.org/html/2606.27739#A6.E15)\) and completes the proof\. ∎Similar Articles
Unsupervised Process Reward Models
This paper proposes unsupervised Process Reward Models (uPRM) that eliminate the need for human annotations by using LLM next-token probabilities to identify erroneous reasoning steps, achieving up to 15% accuracy improvements over LLM-as-a-Judge and performing comparably to supervised PRMs as verifiers and reward signals.
ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
ProcessThinker introduces a practical post-training pipeline that provides step-level process rewards without training an explicit process reward model. It uses rollout-based rewards to give dense credit assignment for multi-step reasoning in multimodal LLMs, consistently improving performance on video benchmarks.
Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
Introduces Implicit Behavior Policy Optimization (IBPO), a counterfactual comparison-based credit assignment framework that improves training stability and performance in multi-step reasoning tasks for large language models by converting sparse terminal rewards into step-sensitive learning signals.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
Introduces IOP, a framework that internalizes outcome supervision into process supervision for reasoning reinforcement learning, enabling fine-grained credit assignment without external annotations.
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
This paper introduces Distributional Process Reward Models, using conditional optimal transport to calibrate PRMs for more accurate success probability estimates in inference-time scaling. It demonstrates improved calibration and downstream performance on mathematical reasoning benchmarks like MATH-500 and AIME.