
# Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
Source: [https://arxiv.org/html/2605.07353](https://arxiv.org/html/2605.07353)
Kejia Chen¹, Jiawen Zhang¹, Yihong Wu², Kewei Gao¹, Jian Lou³, Zunlei Feng¹, Mingli Song¹, Ruoxi Jia⁴
¹Zhejiang University, ²Université de Montréal, ³Sun Yat-sen University, ⁴Virginia Tech

###### Abstract

Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that aligns token-level confidence with step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence-aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible $O(V)$ latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. Notably, CASPO scales to Qwen3-8B-Base and surpasses tree-search baselines on AIME'24 and AIME'25 without using reward-model data. We also release a step-wise dataset with confidence annotations to support fine-grained analysis of reasoning reliability. Code is available at [https://github.com/Thecommonirin/CASPO](https://github.com/Thecommonirin/CASPO).

## 1 Introduction

Large reasoning models (LRMs) such as OpenAI-o1 ([jaech2024openai](https://arxiv.org/html/2605.07353#bib.bib20)) and Qwen-3 ([yang2025qwen3](https://arxiv.org/html/2605.07353#bib.bib49)) have substantially advanced mathematical and scientific problem-solving through detailed step-by-step generation. However, optimizing these models purely for final-answer correctness masks a critical vulnerability: they frequently arrive at correct conclusions via logically flawed intermediate steps ([arcuschin2503chain](https://arxiv.org/html/2605.07353#bib.bib1)). In high-stakes domains such as medicine and finance ([fadeeva2024fact](https://arxiv.org/html/2605.07353#bib.bib8); [zhang2025towards](https://arxiv.org/html/2605.07353#bib.bib57)), relying on invalid reasoning traces poses significant risks. Therefore, reliable LRM deployment demands not only accurate final outputs but also verifiably sound reasoning trajectories.

The root cause of this vulnerability is a fundamental misalignment between a model's internal confidence and logical correctness. In current LRMs, token-level probabilities reflect superficial string fluency and pattern frequency rather than true deductive validity ([arcuschin2503chain](https://arxiv.org/html/2605.07353#bib.bib1); [yang2025probability](https://arxiv.org/html/2605.07353#bib.bib51)). Consequently, a model might confidently hallucinate a syntactically valid but logically incorrect step, while exhibiting low confidence when executing a rigorous but unfamiliar derivation. This pervasive miscalibration prevents internal confidence from serving as a reliable metric for self-verification.

Current efforts to improve reliability mainly operate at the trajectory level. Chain-of-Thought (CoT) ([wei2022chain](https://arxiv.org/html/2605.07353#bib.bib46)) elicits intermediate steps through prompting, Self-Consistency ([wang2022self](https://arxiv.org/html/2605.07353#bib.bib44)) aggregates multiple paths via majority voting, and reinforcement learning frameworks such as Group Relative Policy Optimization (GRPO) align models with preferred trajectories using verifiable rewards ([guo2025deepseek](https://arxiv.org/html/2605.07353#bib.bib13)). Even scaling methods such as rStar-Math ([guan2025rstar](https://arxiv.org/html/2605.07353#bib.bib12)) largely treat the reasoning process as a monolithic output. This trajectory-centric paradigm presents a dilemma: trajectory-level methods overlook the reliability of individual steps, while search-intensive approaches incur computational costs that limit scalability.

To address this granularity gap, recent work introduces step-wise supervision to improve intermediate reasoning quality. Step-wise preference optimization ([razghandi2025cer](https://arxiv.org/html/2605.07353#bib.bib38)) and process-based self-rewarding frameworks ([tu2025enhancing](https://arxiv.org/html/2605.07353#bib.bib42)) integrate intermediate feedback into training, and weakness-driven augmentation strategies such as SwS ([liang2025sws](https://arxiv.org/html/2605.07353#bib.bib29)) diagnose systematic failures. However, these methods typically rely on heuristic feedback or external verifiers and do not explicitly model the model's own uncertainty. Parallel efforts on confidence estimation via token probabilities ([xu2024genarm](https://arxiv.org/html/2605.07353#bib.bib48)) face a further obstacle: empirical evidence ([arcuschin2503chain](https://arxiv.org/html/2605.07353#bib.bib1); [yang2025probability](https://arxiv.org/html/2605.07353#bib.bib51); [hu2025open](https://arxiv.org/html/2605.07353#bib.bib19)) indicates that token-level confidence reflects surface fluency or frequent patterns rather than reasoning reliability. Models often assign high probability to syntactically correct but logically irrelevant steps, and underestimate uncertainty in complex derivations. Closing this gap requires a principled way to synchronize internal confidence with reasoning correctness.

Our core insight is that reliable reasoning requires calibration, where high predictive confidence is reserved for valid logical steps. Aligning internal probability with external correctness allows the model's own entropy to serve as a high-fidelity, zero-cost signal for guiding generation, removing the dependency on external evaluators during inference. Building on this principle, we propose CASPO (Confidence-Aware Step-wise Preference Optimization), a unified framework that operationalizes step-wise confidence across both training and inference.

During training, CASPO calibrates the model by constructing preference pairs that contrast correct but uncertain steps with confidently wrong predictions. These pairs are optimized via iterative DPO, aligning the model's probability distribution with logical validity. During inference, we introduce the Confidence-aware Thought (CaT) strategy, which uses cumulative step-wise confidence to dynamically expand promising paths and prune uncertain trajectories. This two-stage design propagates step-wise improvements into faithful final answers with negligible computational overhead.

In summary, our contributions are as follows. We propose CASPO, a unified framework that uses intrinsic model confidence to achieve reliable reasoning without external verifiers. By aligning token-level entropy with logical correctness during training, the method enables self-calibration and addresses the tension between exploration and reliability. This calibration supports our CaT strategy, which prunes uncertain reasoning branches at inference with $O(V)$ latency overhead. Extensive experiments across ten benchmarks show consistent improvements with strong data and compute efficiency: CASPO raises the average accuracy of Qwen2.5-7B-Instruct from 44.4% to 50.6% and reaches 56.1% with CaT at inference. On Qwen3-8B-Base, it surpasses tree-search baselines such as rStar-Math ([guan2025rstar](https://arxiv.org/html/2605.07353#bib.bib12)) and Satori ([shen2025satori](https://arxiv.org/html/2605.07353#bib.bib41)) on AIME2024 and AIME2025 without using any reward-model data.

## 2 Related Work

Large Reasoning Models. The evolution of LRMs has progressed from simple prompting to more sophisticated strategies. CoT showed that explicit step-by-step reasoning improves performance on complex tasks, while Self-Consistency ([wang2022self](https://arxiv.org/html/2605.07353#bib.bib44)) enhanced robustness by aggregating multiple reasoning paths. Recent systems such as OpenAI's o1 ([jaech2024openai](https://arxiv.org/html/2605.07353#bib.bib20)) and DeepSeek-R1 ([guo2025deepseek](https://arxiv.org/html/2605.07353#bib.bib13)) leverage post-training to elicit extended reasoning traces for superior transparency and accuracy. In parallel, distillation techniques ([hsieh2023distilling](https://arxiv.org/html/2605.07353#bib.bib17)) transfer high-quality reasoning trajectories to smaller models for efficiency. For instance, rStar-Math ([guan2025rstar](https://arxiv.org/html/2605.07353#bib.bib12)) explicitly utilizes rationales from large teacher models to supervise smaller students, reducing data requirements while maintaining performance. Structured approaches such as Tree-of-Thoughts ([yao2023tree](https://arxiv.org/html/2605.07353#bib.bib52)), Graph-of-Thoughts ([besta2024graph](https://arxiv.org/html/2605.07353#bib.bib2)), and reinforcement learning ([zhang2024rest](https://arxiv.org/html/2605.07353#bib.bib54); [zhang2025process](https://arxiv.org/html/2605.07353#bib.bib56); [li2025treepo](https://arxiv.org/html/2605.07353#bib.bib27)) further expand the reasoning space, albeit often at considerable computational cost.

Reasoning Process Verification. As reasoning traces lengthen, ensuring their faithfulness becomes paramount. One prominent direction involves Process Reward Models (PRMs) ([lightman2023let](https://arxiv.org/html/2605.07353#bib.bib30); [wang2023math](https://arxiv.org/html/2605.07353#bib.bib43)), trained on datasets such as PRM800K ([lightman2023let](https://arxiv.org/html/2605.07353#bib.bib30)), to score intermediate reasoning steps. Subsequent works such as PURE ([cheng2025stop](https://arxiv.org/html/2605.07353#bib.bib5)) refine step-wise credit assignment in reinforcement learning. Beyond direct scoring, collaborative deliberation ([patnaik2025helps](https://arxiv.org/html/2605.07353#bib.bib35); [patnaik2025learning](https://arxiv.org/html/2605.07353#bib.bib36)) and selective rationale optimization ([lightman2023let](https://arxiv.org/html/2605.07353#bib.bib30); [du2023improving](https://arxiv.org/html/2605.07353#bib.bib7); [qu2024recursive](https://arxiv.org/html/2605.07353#bib.bib37)) demonstrate that models can enhance reliability through mutual verification and preference ranking. While autonomous self-correction remains difficult, combining self-verification with lightweight external supervision offers a promising path toward reliability without the prohibitive cost of massive reward models.

Verification-Enhanced Reasoning. Beyond evaluation, recent work integrates verification directly into reasoning. Test-time scaling generates multiple candidate solutions and selects the most reliable one, improving accuracy at high computational cost. At training time, reinforcement learning with verifiable rewards (e.g., SimpleRL ([zeng2025simplerl](https://arxiv.org/html/2605.07353#bib.bib53)), PURE ([cheng2025stop](https://arxiv.org/html/2605.07353#bib.bib5))) iteratively refines reasoning by rewarding faithful traces. To reduce reliance on explicit reward models, DPO-based methods approximate reward signals via likelihood estimation. While co-training generators and verifiers ([ouyang2022training](https://arxiv.org/html/2605.07353#bib.bib34)) has also been explored, scalability and stability issues persist. Grounded in these directions, CASPO differs from these collaborative or external-distillation approaches: rather than relying on multi-model collaboration or mimicking teacher preferences, it unifies training and inference through *intrinsic step-wise confidence calibration*, using the student's own token entropy to guide reliable reasoning paths.

## 3 Method

![Refer to caption](https://arxiv.org/html/2605.07353v1/x1.png)

Figure 1: Overview of CASPO, a unified framework for calibrated reasoning. CASPO first aligns intrinsic uncertainty with step-wise correctness through iterative preference optimization, then uses this calibrated confidence to dynamically prune reasoning trees during inference.

CASPO integrates intrinsic confidence estimation into a unified pipeline for both training and inference. As illustrated in Figure [1](https://arxiv.org/html/2605.07353#S3.F1), our framework operates in two interconnected phases: (i) Confidence-Aware Preference Optimization, which aligns model uncertainty with step-wise correctness through iterative DPO, and (ii) Confidence-aware Thought (CaT) inference, which leverages this calibrated uncertainty to dynamically navigate and prune the reasoning tree.

### 3.1 Motivation and Problem Formulation

Recent progress in LRMs ([li2025treepo](https://arxiv.org/html/2605.07353#bib.bib27); [wang2022self](https://arxiv.org/html/2605.07353#bib.bib44); [zuo2025ttrl](https://arxiv.org/html/2605.07353#bib.bib59)) has highlighted a critical tension: sampling multiple reasoning paths boosts performance via diversity, but often introduces plausible yet hallucinated steps. Existing paradigms primarily rely on compute-intensive external verifiers or large-scale sampling, which introduce substantial inference overhead and provide limited insight into the model's intrinsic assessment of its own reasoning process.

Our goal is to equip the model with the ability to *self-evaluate* the quality of each reasoning step $s_t$ conditioned on the current context $q_t$. We posit that genuine reasoning competence requires more than eventually arriving at the correct answer; it should also be reflected in the model's confidence when taking valid reasoning steps. In other words, correct reasoning should correspond to concentrated probability mass, or equivalently, low predictive entropy. CASPO therefore explicitly aligns the model's predicted probability distribution with the correctness of its reasoning steps, encouraging valid steps to be generated with high confidence while suppressing invalid or unreliable ones.

### 3.2 CASPO: Confidence-Aware Step-wise Preference Optimization

Notations. We consider an auto-regressive language model $\pi_\theta$, which defines a next-token distribution $\pi_\theta(\cdot \mid x)$ given an input prompt $x$. For each query $x$ in the dataset $\mathcal{D}_{\text{math}}$, we view the reasoning process as a sequence of $m$ steps $s_{1:m} = (s_1, s_2, \ldots, s_m)$, leading to a final answer $a$. Each step $s_j$ is generated conditioned on a specific context, which we define as the *sub-question* $q_j$. This context concatenates the original query and the preceding reasoning history:

$$q_j = [x, s_1, s_2, \ldots, s_{j-1}], \qquad (1)$$

The model then generates the current step $s_j \sim \pi_\theta(\cdot \mid q_j)$. This formulation allows us to evaluate the quality of intermediate reasoning in a fine-grained manner.
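For plain-text reasoning traces, Eq. (1) amounts to simple string concatenation. A minimal sketch follows; the function name and the newline separator are our own illustrative choices, not prescribed by the paper:

```python
def sub_question(x: str, steps: list[str], j: int) -> str:
    """Build the context q_j = [x, s_1, ..., s_{j-1}] of Eq. (1).

    The query and the preceding steps are joined with newlines; the paper
    does not specify a separator, so this is an assumption.
    """
    return "\n".join([x] + steps[: j - 1])
```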

Step-wise Confidence Estimation. To quantify the model's intrinsic uncertainty without external supervision, we utilize token-level entropy. Let the step answer $s_j$ generated by the model consist of a sequence of tokens $\{t_l\}_{l=1}^{L}$. The confidence of this specific step $s_j$ given context $q_j$ is computed as the negative average entropy:

$$\text{confidence}(s_j \mid q_j) = \frac{1}{L}\sum_{l=1}^{L}\sum_{v \in \mathcal{V}} \pi_\theta(v \mid q_j, t_{<l}) \log \pi_\theta(v \mid q_j, t_{<l}), \qquad (2)$$

where $L$ is the length of the step answer, $\mathcal{V}$ is the vocabulary, and $\pi_\theta(v \mid q_j, t_{<l})$ denotes the predictive distribution over tokens $v$. Since $\sum_v \pi \log \pi$ is the negated per-token entropy, higher cumulative entropy indicates greater uncertainty and, hence, lower confidence in the generation. We adopt token-level entropy as our uncertainty metric because it captures the model's intrinsic uncertainty during generation, avoiding the overconfidence bias and hallucination sensitivity inherent in frequency-based diversity measures. This reference-free criterion evaluates each candidate's confidence independently of the ground truth.
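Concretely, the score of Eq. (2) can be read off the logits a decoder already produces while generating the step. The following PyTorch sketch assumes an `[L, V]` logits tensor for the step's tokens; the function name is our own:

```python
import torch
import torch.nn.functional as F

def step_confidence(step_logits: torch.Tensor) -> float:
    """Eq. (2): confidence of one reasoning step as negative average entropy.

    step_logits: [L, V] next-token logits for the L tokens of the step,
    taken from the forward pass that generated them (no extra model call).
    """
    log_probs = F.log_softmax(step_logits, dim=-1)        # log pi(v | q_j, t_<l)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-token entropy H_l
    return -entropy.mean().item()                         # higher => more confident
```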

Confidence-Aware Step-wise Data Collection. To obtain reliable supervision, we employ a strong model (e.g., Qwen2.5-Math-7B-Instruct) as an offline external evaluator. The evaluator verifies whether the step-wise answer $s_j$ is correct, and the model $\pi_\theta$ assigns a confidence to its answer for the corresponding sub-question $q_j$:

- If $s_j$ is correct and has high confidence, omit it.
- If $s_j$ is correct but has low confidence, set $y_w = s_j$ and choose $y_l$ as a high-probability competing candidate step from $\pi_\theta(\cdot \mid q_j)$.
- If $s_j$ is incorrect, set $y_w$ to the correct answer and $y_l$ to $s_j$.

This selection strategy ensures that the preference dataset $\mathcal{D}$ consists exclusively of signals that drive the model towards calibrated correctness.
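To make the three collection rules concrete, here is a minimal sketch. The threshold `tau_conf` and the helpers `sample_competing_step` and `correct_reference` are hypothetical placeholders; the paper does not specify how the high/low confidence split or the competing candidate are chosen beyond sampling from $\pi_\theta(\cdot \mid q_j)$:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PreferencePair:
    context: str   # sub-question q_j (query plus reasoning prefix)
    chosen: str    # y_w
    rejected: str  # y_l

def build_pair(q_j: str, s_j: str, is_correct: bool, confidence: float,
               tau_conf: float,
               sample_competing_step: Callable[[str], str],  # hypothetical helper
               correct_reference: str) -> Optional[PreferencePair]:
    """Apply the three collection rules to one (step, verdict, confidence) triple."""
    if is_correct and confidence >= tau_conf:
        return None  # correct and already confident: omit
    if is_correct:
        # correct but uncertain: prefer s_j over a high-probability rival step
        return PreferencePair(q_j, chosen=s_j,
                              rejected=sample_competing_step(q_j))
    # incorrect: prefer the verified correct answer over s_j
    return PreferencePair(q_j, chosen=correct_reference, rejected=s_j)
```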

Training: Confidence-Aware Preference Optimization. Based on the step-centric dataset $\mathcal{D}$ constructed in Algorithm [1](https://arxiv.org/html/2605.07353#alg1), we form preference pairs $(q_j, y_j^w, y_j^l)$. This design ensures that both reliable-but-uncertain predictions and confidently wrong predictions contribute to preference learning.

The training objective follows the DPO formulation, which encourages the target model $\pi_\theta$ to increase the relative likelihood of the preferred answer compared to the dispreferred one:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_j^w \mid q_j)}{\pi_{\text{ref}}(y_j^w \mid q_j)} - \log\frac{\pi_\theta(y_j^l \mid q_j)}{\pi_{\text{ref}}(y_j^l \mid q_j)}\right]\right), \qquad (3)$$

where $\beta$ controls the strength of preference alignment. To achieve continuous improvement, we adopt an iterative DPO scheme: at each iteration $k$, the target model $\pi_{\theta_k}$ is optimized using the above loss with the previous model $\pi_{\text{ref}} = \pi_{\theta_{k-1}}$ as the reference. After optimization, we set $\pi_{\text{ref}} \leftarrow \pi_{\theta_k}$ for the next step. This allows the model to bootstrap its own reasoning capabilities, progressively refining both its accuracy and its confidence calibration.
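Given sequence log-likelihoods of the chosen and rejected steps under the policy and the frozen reference (the previous iterate), Eq. (3) reduces to a few lines. A minimal sketch, assuming those log-likelihoods are precomputed; the default `beta` is illustrative, not a value from the paper:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Eq. (3) on a batch of step-level preference pairs.

    Each argument is a [B] tensor of sequence log-probabilities
    log pi(y | q_j), summed over the tokens of y.
    """
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```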

Inference: Confidence-aware Thought (CaT). After iterative preference optimization, the model not only learns to prefer correct reasoning steps but also calibrates its confidence estimation at each step. This enables the CaT inference strategy: instead of committing to a single linear chain, the model explores a reasoning tree where each node corresponds to a partial reasoning trajectory $z_{1:t} = (z_1, \ldots, z_t)$ with an associated confidence score

$$C(z_{1:t}) = \prod_{i=1}^{t} \text{confidence}(z_i \mid z_{1:i-1}), \qquad (4)$$

where $\text{confidence}(z_i \mid z_{1:i-1})$ denotes the normalized confidence of reasoning step $z_i$ given the previous context. During inference, a path is expanded only if its cumulative confidence $C(z_{1:t})$ exceeds a threshold $\tau$. Low-confidence branches are pruned early, reallocating computational budget to more promising reasoning paths. This mechanism acts as an intrinsic *self-correction* filter, ensuring that the final output is the result of a chain of high-confidence, valid reasoning steps.
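A minimal sketch of this pruning loop follows. It assumes per-step confidences are normalized to [0, 1] so that the product in Eq. (4) can be compared against a threshold; `expand`, `tau`, `width`, and `max_depth` are illustrative stand-ins, since the paper does not fix these implementation details here:

```python
import heapq
from typing import Callable, Iterable, Optional, Tuple

def cat_search(expand: Callable[[str], Iterable[Tuple[str, bool]]],
               step_confidence: Callable[[str, str], float],
               tau: float = 0.5, width: int = 4,
               max_depth: int = 8) -> Optional[Tuple[str, float]]:
    """Confidence-aware Thought: grow a reasoning tree, pruning branches
    whose cumulative confidence C(z_{1:t}) (Eq. 4) falls below tau.

    expand(traj) yields candidate (next_step, is_final) continuations;
    step_confidence(traj, step) returns a score normalized to [0, 1].
    """
    frontier = [("", 1.0)]   # (partial trajectory, cumulative confidence)
    finished = []
    for _ in range(max_depth):
        candidates = []
        for traj, conf in frontier:
            for step, is_final in expand(traj):
                c = conf * step_confidence(traj, step)
                if c < tau:
                    continue  # prune this low-confidence branch early
                (finished if is_final else candidates).append((traj + step, c))
        frontier = heapq.nlargest(width, candidates, key=lambda x: x[1])
        if not frontier:
            break
    return max(finished, key=lambda x: x[1], default=None)
```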

## 4 Experiments

We evaluate CASPO across multiple dimensions to verify its effectiveness in aligning reasoning confidence with correctness. Our analysis encompasses training comparisons, inference strategy scaling, out-of-domain generalization, and calibration quality. Comprehensive details are in Appx. [A](https://arxiv.org/html/2605.07353#A1).

### 4.1 Settings

Models. We employ Llama-3.1-8B-Instruct ([grattafiori2024llama](https://arxiv.org/html/2605.07353#bib.bib10)), Qwen2.5-Math-7B, and Qwen2.5-7B-Instruct ([yang2024qwen25](https://arxiv.org/html/2605.07353#bib.bib50)) as our primary base models. To verify scalability to stronger base models, we additionally conduct experiments on Qwen3-8B-Base ([yang2025qwen3](https://arxiv.org/html/2605.07353#bib.bib49)). For answer calibration during training data construction, Qwen2.5-Math-7B-Instruct serves as the evaluator.

Baselines. We compare CASPO against two categories of methods: (i) training-based methods (Table [1](https://arxiv.org/html/2605.07353#S4.T1)), which update model parameters using verifiable self-improvement signals. We select six representative methods: GRPO ([shao2024deepseekmath](https://arxiv.org/html/2605.07353#bib.bib40)), Simple-RL-Zero ([zeng2025simplerl](https://arxiv.org/html/2605.07353#bib.bib53)), PURE-VR ([cheng2025stop](https://arxiv.org/html/2605.07353#bib.bib5)), rStar-Math ([guan2025rstar](https://arxiv.org/html/2605.07353#bib.bib12)), PCPO ([yang2025probability](https://arxiv.org/html/2605.07353#bib.bib51)), and DPO-VP ([tu2025enhancing](https://arxiv.org/html/2605.07353#bib.bib42)). For scalability comparisons, we also include the tree-search-based methods rStar-Math ([guan2025rstar](https://arxiv.org/html/2605.07353#bib.bib12)) and Satori ([shen2025satori](https://arxiv.org/html/2605.07353#bib.bib41)). (ii) Inference-time methods (Table [2](https://arxiv.org/html/2605.07353#S4.T2)), which modify the decoding process without parameter updates. We compare against CoT ([kojima2022large](https://arxiv.org/html/2605.07353#bib.bib24)), Self-Consistency ([wang2022self](https://arxiv.org/html/2605.07353#bib.bib44)), and DiPT ([just2024dipt](https://arxiv.org/html/2605.07353#bib.bib22)). Detailed descriptions of these baselines are deferred to Appendix [A.2](https://arxiv.org/html/2605.07353#A1.SS2).

Evaluation Benchmarks. Our main evaluation focuses on mathematical reasoning benchmarks widely used in prior research ([bi2024forest](https://arxiv.org/html/2605.07353#bib.bib3); [li2025system](https://arxiv.org/html/2605.07353#bib.bib28); [lin2025cppo](https://arxiv.org/html/2605.07353#bib.bib31)): MATH500 ([lightman2023let](https://arxiv.org/html/2605.07353#bib.bib30)), Minerva-Math ([lewkowycz2022solving](https://arxiv.org/html/2605.07353#bib.bib26)), OlympiadBench ([he2024olympiadbench](https://arxiv.org/html/2605.07353#bib.bib14)), AMC2023 ([amc](https://arxiv.org/html/2605.07353#bib.bib32)), and AIME2024 ([aime](https://arxiv.org/html/2605.07353#bib.bib33)). To assess generalizability, we extend our evaluation to BoardgameQA (BGQA) ([kazemi2023boardgameqa](https://arxiv.org/html/2605.07353#bib.bib23)), CRUXEval (CRUX) ([gu2024cruxeval](https://arxiv.org/html/2605.07353#bib.bib11)), StrategyQA (STGQA) ([geva2021did](https://arxiv.org/html/2605.07353#bib.bib9)), TableBench ([wu2025tablebench](https://arxiv.org/html/2605.07353#bib.bib47)), and STEM subsets of MMLU-Pro ([wang2024mmlu](https://arxiv.org/html/2605.07353#bib.bib45)). Furthermore, we test code generation and language understanding capabilities using HumanEval ([chen2021evaluating](https://arxiv.org/html/2605.07353#bib.bib4)), LiveCodeBench ([jain2024livecodebench](https://arxiv.org/html/2605.07353#bib.bib21)), and RACE ([lai2017race](https://arxiv.org/html/2605.07353#bib.bib25)).

### 4.2 Main Results

Table 1: Comprehensive performance comparison. CASPO consistently outperforms trajectory-level optimization baselines across both in-domain mathematical reasoning (first six columns) and out-of-domain generalization tasks (last six columns).

| Models | Math500 | Minerva-Math | OlympiadBench | AIME24 (Avg@1/32) | AMC23 | Avg | BGQA | CRUX | STGQA | TableBench | MMLU STEM | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 64.8 | 15.4 | 25.6 | 16.7 | 37.5 | 32.0 | 48.0 | 50.0 | 88.0 | 38.0 | 40.0 | 52.8 |
| + GRPO | 76.2 | 32.7 | 38.1 | 16.7 | 55.0 | 43.7 | 50.5 | 53.0 | 89.5 | 39.0 | 42.0 | 54.8 |
| + Simple-RL-Zero | 78.0 | 33.1 | 36.6 | 20.0 | 57.5 | 45.0 | 51.5 | 53.5 | 90.0 | 40.0 | 42.5 | 55.5 |
| + PURE-VR | 79.8 | 36.8 | 41.9 | 20.0 | 57.5 | 47.5 | 52.0 | 54.0 | 90.5 | 40.5 | 43.0 | 56.0 |
| + DPO-VP | 74.8 | 35.3 | 36.9 | 23.3 | 60.0 | 46.1 | 52.5 | 54.5 | 91.0 | 41.0 | 43.5 | 56.5 |
| + CASPO | 76.6 | 37.8 | 43.8 | 23.3 | 62.5 | 48.8 | 53.5 | 55.5 | 91.5 | 41.5 | 44.0 | 57.2 |
| + CASPO + CaT | 81.9 | 40.5 | 46.9 | 26.7 | 67.5 | 52.7 | 56.2 | 58.3 | 96.1 | 43.6 | 46.2 | 60.1 |
| Qwen2.5-7B-Instruct | 76.2 | 37.6 | 43.0 | 13.3 | 52.5 | 44.4 | 53.0 | 58.1 | 91.3 | 43.2 | 45.2 | 58.2 |
| + GRPO | 79.0 | 41.0 | 46.5 | 13.3 | 55.0 | 46.6 | 54.5 | 59.9 | 92.1 | 44.0 | 46.2 | 59.3 |
| + Simple-RL-Zero | 80.2 | 41.5 | 45.8 | 16.7 | 57.5 | 47.8 | 55.5 | 60.9 | 92.5 | 44.4 | 46.7 | 60.0 |
| + PURE-VR | 81.5 | 43.0 | 47.5 | 16.7 | 57.5 | 48.9 | 56.0 | 61.4 | 92.8 | 44.7 | 47.0 | 60.4 |
| + DPO-VP | 79.8 | 42.5 | 46.2 | 20.0 | 60.0 | 49.1 | 56.8 | 62.1 | 93.3 | 45.2 | 47.4 | 61.0 |
| + CASPO | 82.0 | 44.0 | 48.3 | 20.0 | 62.5 | 50.6 | 57.5 | 62.9 | 93.8 | 45.7 | 48.0 | 61.6 |
| + CASPO + CaT | 87.7 | 47.1 | 51.7 | 26.7 | 67.5 | 56.1 | 60.4 | 66.0 | 98.5 | 48.0 | 50.4 | 64.7 |
| Llama-3.1-8B-Instruct | 49.6 | 13.2 | 23.5 | 6.7 | 27.5 | 24.1 | 40.0 | 45.0 | 82.0 | 35.0 | 36.0 | 47.6 |
| + GRPO | 52.0 | 15.0 | 25.0 | 6.7 | 30.0 | 25.5 | 41.5 | 46.5 | 83.0 | 35.8 | 37.0 | 48.8 |
| + Simple-RL-Zero | 53.2 | 15.5 | 25.6 | 10.0 | 30.0 | 26.9 | 42.0 | 47.0 | 83.5 | 36.2 | 37.2 | 49.2 |
| + PURE-VR | 54.0 | 16.0 | 26.8 | 10.0 | 32.5 | 27.6 | 42.5 | 47.5 | 83.8 | 36.5 | 37.5 | 49.6 |
| + DPO-VP | 54.8 | 16.5 | 27.2 | 13.3 | 32.5 | 28.8 | 43.0 | 48.0 | 84.2 | 37.0 | 38.0 | 50.0 |
| + CASPO | 55.2 | 15.6 | 27.6 | 13.3 | 35.0 | 29.1 | 43.5 | 48.5 | 84.5 | 37.5 | 38.5 | 50.5 |
| + CASPO + CaT | 59.1 | 16.7 | 29.5 | 20.0 | 40.0 | 33.1 | 45.7 | 51.0 | 88.8 | 39.4 | 40.4 | 53.1 |

Training-Based Comparison. Table [1](https://arxiv.org/html/2605.07353#S4.T1) presents the comparison between CASPO and baseline methods under matched training and inference budgets. CASPO delivers consistent gains across all three base models. On Qwen2.5-7B-Instruct, it achieves an average score of 50.6, surpassing GRPO, Simple-RL-Zero, PURE-VR, and DPO-VP. These improvements stem from our *step-wise confidence-aware preference learning*, which aligns token probabilities with intermediate-step correctness more effectively than trajectory-level rewards. The monotonic accuracy growth in Appendix Figure [6a](https://arxiv.org/html/2605.07353#A2.F6.sf1) further corroborates this, signifying stable self-improvement as calibration accumulates.

Table 2: Comparison of inference strategies. M500 denotes MATH500, MM Minerva-Math, OB OlympiadBench, A24 AIME2024, and A23 AMC2023.

| Models | M500 | MM | OB | A24 | A23 | Avg |
|---|---|---|---|---|---|---|
| Qwen-Math-CASPO | 76.6 | 37.8 | 43.8 | 23.3 | 62.5 | 48.8 |
| + CoT | 78.2 | 38.6 | 44.7 | 23.3 | 63.8 | 49.7 |
| + Self-Consistency | 79.6 | 39.3 | 45.6 | 26.7 | 65.0 | 51.2 |
| + DiPT | 80.0 | 39.5 | 45.8 | 23.3 | 65.0 | 50.7 |
| + CaT (Ours) | 81.9 | 40.5 | 46.9 | 26.7 | 67.5 | 52.7 |
| Qwen-Ins-CASPO | 82.0 | 44.0 | 48.3 | 20.0 | 62.5 | 50.6 |
| + CoT | 83.6 | 44.9 | 49.3 | 20.0 | 63.8 | 52.3 |
| + Self-Consistency | 85.3 | 45.8 | 50.2 | 23.3 | 65.0 | 53.9 |
| + DiPT | 85.7 | 46.0 | 50.5 | 20.0 | 65.0 | 53.4 |
| + CaT (Ours) | 87.7 | 47.1 | 51.7 | 26.7 | 67.5 | 56.1 |
| Llama-Ins-CASPO | 55.2 | 15.6 | 27.6 | 13.3 | 35.0 | 29.1 |
| + CoT | 56.3 | 15.9 | 28.1 | 13.3 | 36.3 | 30.0 |
| + Self-Consistency | 57.4 | 16.2 | 28.7 | 16.7 | 37.5 | 31.3 |
| + DiPT | 57.7 | 16.3 | 28.8 | 13.3 | 37.5 | 30.7 |
| + CaT (Ours) | 59.1 | 16.7 | 29.5 | 20.0 | 40.0 | 33.1 |

Inference-Time Comparison. Table [2](https://arxiv.org/html/2605.07353#S4.T2) evaluates various inference strategies applied to CASPO-trained models. All methods use an identical sampling budget ($K = 10$) to ensure fair comparison ([zhang2023sac3](https://arxiv.org/html/2605.07353#bib.bib55)). We observe that both Self-Consistency and CaT yield larger performance deltas on CASPO models than on the original instruct-tuned counterparts. This indicates that the calibration learned during training transfers effectively to inference-time search. Specifically, our CaT strategy achieves the highest average performance across all base models while maintaining the fixed sampling budget, validating the efficacy of pruning low-confidence paths.

Table 3: Scalability on strong instruction-tuned models.

| Method | Data Budget | Math500 (Pass@1) | Math500 (Maj@8) | OlympiadBench |
|---|---|---|---|---|
| Base | 2.5M (SFT) | 83.6 | 87.1 | 41.6 |
| + DPO-VP | +8K | 80.9 | 82.1 | 44.0 |
| + PCPO | +30K | 81.4 | 83.8 | 44.3 |
| + CASPO (Ours) | +8K | 85.1 | 90.4 | 49.0 |

Scalability to Strong Instruction-Tuned Models. We investigate whether CASPO provides marginal gains for models already optimized through extensive SFT and alignment. Using Qwen2.5-Math-7B-Instruct (trained on 2.5M samples) as a baseline, Table [3](https://arxiv.org/html/2605.07353#S4.T3) shows that CASPO yields substantial improvements with only 8K seed samples, elevating MATH500 Pass@1 from 83.6% to 85.1% and Maj@8 to 90.4%, surpassing strong baselines such as DPO-VP and PCPO. These results position CASPO as a complementary stage that corrects confidence miscalibration after large-scale SFT.

To verify scalability, we evaluate CASPO on Qwen3-8B-Base against tree-search baselines. As shown in Table [9](https://arxiv.org/html/2605.07353#A2.T9), CASPO outperforms rStar-Math and Satori on all benchmarks while using zero reward-model data, compared with the 3.64M and 240K samples required by the two baselines. The gains on AIME'24 and AIME'25 further show that calibrated intrinsic uncertainty scales effectively to stronger models without external supervision.

### 4.3 Generalization and Transferability

Out-of-Domain Transferability. Although trained exclusively on mathematical data, CASPO demonstrates robust transfer to non-mathematical reasoning tasks (Table [1](https://arxiv.org/html/2605.07353#S4.T1)). It consistently improves performance across diverse benchmarks, including commonsense (STGQA), code (CRUX), tabular reasoning (TableBench), and STEM knowledge (MMLU-Pro STEM). Specifically, on the aggregated MMLU-Pro subsets (spanning physics, chemistry, CS, engineering, biology, and economics; 5,371 problems), CASPO improves Qwen2.5-Math-7B from 52.8% to 57.2% and Qwen2.5-7B-Instruct from 58.2% to 61.6%. These gains indicate that our step-wise aggregation strategy generalizes beyond the math domain, offering a lightweight yet robust mechanism for diverse reasoning.

Generalization to Code and Language Tasks. To verify that CASPO captures general reasoning consistency rather than overfitting to mathematical patterns, we extended our evaluation to strictly non-mathematical domains: code generation (HumanEval ([chen2021evaluating](https://arxiv.org/html/2605.07353#bib.bib4)), LiveCodeBench ([jain2024livecodebench](https://arxiv.org/html/2605.07353#bib.bib21))) and reading comprehension (RACE ([lai2017race](https://arxiv.org/html/2605.07353#bib.bib25))). As detailed in Figure [4](https://arxiv.org/html/2605.07353#S4.F4), CASPO consistently outperforms baselines across these diverse tasks. On HumanEval, it improves the base model's Pass@1 from 40.9% to 51.9%. These results confirm that identifying and pruning low-confidence steps is a fundamental reasoning capability that transfers effectively across task domains.

Computational Efficiency and Signal Complexity. A critical advantage of CASPO is its computational frugality compared to methods that rely on external verifiers or extensive sampling. Prior approaches such as Process Reward Models (PRM) or Process Preference Models (PPM) ([guan2025rstar](https://arxiv.org/html/2605.07353#bib.bib12)) typically require a full model forward pass to evaluate each intermediate step, resulting in a computational complexity of $O(L^2 d)$, where $L$ is the sequence length. In contrast, CASPO computes the verification signal directly from the output logits already generated by the policy model. As shown in Table [4](https://arxiv.org/html/2605.07353#S4.T4), the complexity of our entropy calculation is $O(V)$ ($V$ being the vocabulary size) and is independent of the sequence length. This reduces verifier latency by two orders of magnitude (from 2.9 s to 0.03 s per step), making intrinsic entropy a negligible cost for scalable oversight.

Table 4: Complexity and latency comparison on Qwen2.5-Math-7B.

| Model | Math500 (Acc) | Latency (s/step) | Complexity |
|---|---|---|---|
| Base | 64.8 | – | – |
| + PRM | 76.3 | 2.9 | $O(L^2 d)$ |
| + PPM ([guan2025rstar](https://arxiv.org/html/2605.07353#bib.bib12)) | 78.4 | 1.4 | $O(L^2 d)$ |
| + Ours | 81.9 | 0.03 | $O(V)$ |

Latency Overhead of CaT Inference. We further evaluate the inference overhead introduced by CaT. Unlike Self-Consistency, which relies on repeated sampling and multiple full forward passes, CaT uses the token-level entropy already available during generation to guide a small number of candidate continuations. This design focuses computation on uncertain reasoning regions while pruning low-confidence paths early. As shown in Table [5](https://arxiv.org/html/2605.07353#S4.T5), CaT achieves higher accuracy with modest additional latency: on Qwen2.5-7B-Instruct, its end-to-end latency is 2.8 s/query, close to greedy decoding at 1.2 s/query and much lower than Self-Consistency at 12.5 s/query.

Table 5: Inference latency analysis. On Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, CaT achieves gains with marginal latency overhead over greedy decoding, whereas Self-Consistency is computationally costly.

| Method | Qwen2.5 Math | Qwen2.5 Latency (s/query) | Llama-3.1 Math | Llama-3.1 Latency (s/query) |
|---|---|---|---|---|
| Greedy Decoding | 82.0 | 1.2 | 55.2 | 1.5 |
| Chain-of-Thought | 83.6 | 4.6 | 56.3 | 5.9 |
| Self-Consistency | 85.3 | 12.5 | 57.7 | 18.0 |
| CaT (Ours) | 87.7 | 2.8 | 59.1 | 3.1 |

Table 6: Calibration quality on MATH-500. Base denotes Qwen2.5-Math-7B.

| Model | Acc. (%) | ECE ↓ | BS ↓ |
|---|---|---|---|
| Base | 64.8 | 0.184 | 0.215 |
| + DPO | 71.2 | 0.142 | 0.188 |
| + CASPO | 76.6 | 0.081 | 0.142 |

Calibration Quality. To verify that CASPO achieves genuine calibration rather than merely improving accuracy, we compute the Expected Calibration Error (ECE) and Brier Score (BS) on MATH-500 (Table [6](https://arxiv.org/html/2605.07353#S4.T6)). While standard DPO improves accuracy, it remains poorly calibrated (ECE = 0.142). CASPO substantially reduces ECE to 0.081 and Brier Score to 0.142, indicating that the model's confidence becomes a more reliable indicator of correctness after alignment. This result directly supports the core motivation of our framework: high predictive confidence should be strictly reserved for valid logical steps.
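For reference, both calibration metrics in Table 6 can be computed from per-sample (confidence, correctness) pairs. A minimal NumPy sketch, assuming an equal-width 10-bin ECE, which is a common convention rather than one the paper specifies:

```python
import numpy as np

def ece_and_brier(confidences, correct, n_bins: int = 10):
    """Expected Calibration Error and Brier Score for correctness prediction.

    confidences: predicted probabilities in [0, 1]; correct: 0/1 outcomes.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    brier = np.mean((confidences - correct) ** 2)
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # bin weight times |mean confidence - empirical accuracy|
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece, brier
```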

Table 7: Step correctness AUC-ROC.

| Uncertainty Signal | AUC-ROC |
|---|---|
| Continuation Length | 0.54 |
| Max Token Probability | 0.68 |
| Perplexity (PPL) | 0.72 |
| Shannon Entropy (Ours) | 0.86 |

Entropy as a Step-Correctness Signal. To justify the use of Shannon entropy over alternative uncertainty signals, we compare their predictive power for step-wise correctness via AUC-ROC (Table [7](https://arxiv.org/html/2605.07353#S4.T7)). Shannon entropy achieves an AUC-ROC of 0.86, outperforming continuation length (0.54), max token probability (0.68), and perplexity (0.72). The key distinction is that perplexity reflects the probability of the chosen token sequence, whereas entropy measures the competitiveness of the entire vocabulary distribution, capturing the model's confusion between diverging logical paths even when the top-1 token has high probability. Continuation length is a noisier post-hoc signal that conflates rigorous derivations with hallucination loops.
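Such a comparison can be reproduced with scikit-learn once per-step signals and evaluator verdicts are collected. A minimal sketch; the function name and the sign-handling convention (lower entropy or perplexity should predict a correct step, so those signals are negated) are our own:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def signal_auc(signal_values, step_correct, higher_means_correct: bool = False):
    """AUC-ROC of an uncertainty signal for predicting step correctness.

    For entropy, perplexity, or length, lower values should indicate a
    correct step, so the scores are negated; for max token probability,
    pass higher_means_correct=True.
    """
    scores = np.asarray(signal_values, dtype=float)
    if not higher_means_correct:
        scores = -scores
    return roc_auc_score(np.asarray(step_correct, dtype=int), scores)
```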

Table 8: Token-level entropy gap. C.S. denotes Correct Step, I.S. Incorrect Step, and E.G. Entropy Gap.

| Stage | C.S. | I.S. | E.G. |
|---|---|---|---|
| Qwen2.5-7B | 0.38 | 0.42 | 0.04 |
| + CASPO | 0.22 | 0.88 | 0.66 |

Table [8](https://arxiv.org/html/2605.07353#S4.T8) further illustrates why entropy is an effective pruning signal. Before training, the model is often confidently wrong (the entropy gap between correct and incorrect steps is only 0.04). After CASPO, incorrect steps exhibit a sharp entropy increase (gap = 0.66), providing a clear and reliable signal for CaT to prune logically flawed branches.

Training Dynamics. Following prior work on DPO training dynamics ([ren2024learning](https://arxiv.org/html/2605.07353#bib.bib39)), we examine reward evolution during optimization. As shown in Figure [2](https://arxiv.org/html/2605.07353#S4.F2), all models exhibit the expected dual-pressure mechanism: chosen rewards initially drop before recovering to near zero, while rejected rewards decrease monotonically, confirming the theoretical framework of simultaneous upward and downward pressures. For completeness, we examine the evolution of training accuracy and loss, with corresponding curves provided in Appendix [B.3](https://arxiv.org/html/2605.07353#A2.SS3).

![Refer to caption](https://arxiv.org/html/2605.07353v1/x2.png)

Figure 2: Training dynamics. Reward evolution during DPO training across Qwen2.5-Math-7B, Qwen2.5-7B-Instruct, and Llama-3.1-8B-Instruct.

Our results reveal clear model-specific patterns. Qwen2.5-Math-7B converges the fastest and with the greatest stability, achieving the largest reward separation of about 6.0; this large separation reflects its strong alignment with mathematical reasoning preferences, strengthened by domain-specific pre-training. Qwen2.5-7B-Instruct converges efficiently within 200 steps, reaching a moderate separation of about 1.5, which indicates a balance between training efficiency and preference learning. In contrast, Llama-3.1-8B-Instruct shows higher volatility during optimization but achieves a separation comparable to the Math model, although this requires more careful hyperparameter tuning.

![Refer to caption](https://arxiv.org/html/2605.07353v1/x3.png)

Figure 3: Evolution of token length and self-correction. Pass@1 accuracy improves consistently across DPO rounds without a substantial increase in token length. Meanwhile, the use of self-talk triggers declines or stabilizes, suggesting that DPO guides models toward more concise reasoning.

#### Token Length and Reasoning Pattern Evolution

To examine whether the observed performance improvements stem merely from generating longer reasoning chains, we analyze both the token length and the reasoning patterns of Qwen2.5-7B-Instruct. As shown in Figure [3](https://arxiv.org/html/2605.07353#S4.F3), Pass@1 accuracy improves consistently across DPO rounds while the average token length remains stable. Furthermore, we use the frequency of the self-correction triggers "Wait" and "Let's" as a proxy for explicit self-checking ([tu2025enhancing](https://arxiv.org/html/2605.07353#bib.bib42); [zhou2025r1](https://arxiv.org/html/2605.07353#bib.bib58)). The observed decline in these triggers suggests that CASPO does not teach the model to mimic reflective phrasing; instead, it internalizes the verification process. The model learns to rely on the optimized preference signals to produce correct answers directly.

### 4.4 Discussion

#### Decoupling Confidence from External Supervision

We decouple the model's intrinsic logical confidence from the role of external supervision. In CASPO, the external evaluator serves only to verify final-answer correctness during the offline data-collection phase. This procedure mirrors the established paradigm in mathematical reasoning research, where datasets such as GSM8K ([cobbe2021training](https://arxiv.org/html/2605.07353#bib.bib6)) or MATH500 ([lightman2023let](https://arxiv.org/html/2605.07353#bib.bib30)) use automated ground-truth verification to filter training trajectories. This one-time investment during dataset construction ensures that the model requires no external guidance during deployment. More importantly, the core learning signal in our framework originates from the model's own token-level entropy rather than the evaluator's feedback. By extracting correctness and confidence from these two independent channels, we ensure that the supervision remains stable even if the evaluator occasionally mislabels a reasoning path. The confidence-aware signal acts as an internal anchor that prioritizes stable, mastered reasoning over accidental success. Ultimately, this separation allows the model to internalize the verification process, enabling efficient and autonomous inference without the computational burden of an external judge.

### 4.5 Ablations

Iterative Training. We evaluate iterative training by applying CASPO over three epochs, where training data is regenerated by the current policy at each stage. Unlike standard fine-tuning on static datasets, this allows supervision to track the model's evolving reasoning distribution. Figure [5](https://arxiv.org/html/2605.07353#A2.F5) shows monotonic improvements: the first epoch yields the largest gains, with Math500 improving from 64.8% to 76.6% and AMC23 increasing from 37.5% to 60.0%. This surge indicates the rapid rectification of primary calibration errors. Subsequent iterations induce more granular refinements, pushing AMC23 further to 62.5% and OlympiadBench from 37.8% to 38.7%. This reflects progressive optimization in which early stages establish a confidence baseline while later ones refine boundary handling, validating a positive feedback loop where superior policies generate higher-fidelity supervision.

![Refer to caption](https://arxiv.org/html/2605.07353v1/fig/caspocode.png)

Figure 4: Generalization performance of CASPO on Qwen2.5-Math-7B across the HumanEval, LiveCodeBench, and RACE benchmarks.

Balance between Diversity and Reliability. Results in Table [2](https://arxiv.org/html/2605.07353#S4.T2) show that CASPO improves the balance between diversity and reliability by providing more accurate confidence signals. This makes pass@$k$ sampling less noisy and better aligned with the model's calibrated preferences. Aggregation methods such as majority voting or Self-Consistency further benefit from these higher-quality candidates, which in turn explains why CaT achieves stronger and more stable gains.

## 5 Conclusion

This work addresses the critical discrepancy between final-answer accuracy and the logical integrity of intermediate reasoning steps. We demonstrate that reliance on external verifiers or exhaustive sampling is not the only path to reliable reasoning. By introducing CASPO, we show that a model's intrinsic token-level uncertainty provides a powerful and efficient signal for alignment. Building on this insight, our framework bridges the gap between training and inference, using confidence-aware preference optimization to calibrate the model and the CaT strategy to dynamically refine reasoning trajectories with minimal latency. Experimental results confirm that CASPO fundamentally enhances the faithfulness of the reasoning process rather than merely inflating benchmark scores. By leveraging intrinsic uncertainty, the model learns to identify and correct logical inconsistencies without heavy external supervision. As a result, CASPO enables a scalable and transparent framework for improving reasoning reliability. Our released dataset and analysis support future work on fine-grained, step-wise alignment and the diagnosis of hidden reasoning failures.

## References

- [1] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful, 2025. URL https://arxiv.org/abs/2503.08679.
- [2] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024.
- [3] Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning. arXiv preprint arXiv:2412.09078, 2024.
- [4] Mark Chen et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [5] Jie Cheng, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, and Fei-Yue Wang. Stop summation: Min-form credit assignment is all process reward model needs for reasoning. arXiv preprint arXiv:2504.15275, 2025.
- [6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [7] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2023.
- [8] Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. Fact-checking the output of large language models via token-level uncertainty quantification. arXiv preprint arXiv:2403.04696, 2024.
- [9] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
- [10] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [11] Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024.
- [12] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025.
- [13] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [14] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
- [15] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- [16] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [17] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017, 2023.
- [18] Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024.
- [19] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.
- [20] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [21] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- [22] Hoang Anh Just, Mahavir Dabas, Lifu Huang, Ming Jin, and Ruoxi Jia. DiPT: Enhancing LLM reasoning through diversified perspective-taking. arXiv preprint arXiv:2409.06241, 2024.
- [23] Mehran Kazemi, Quan Yuan, Deepti Bhatia, Najoung Kim, Xin Xu, Vaiva Imbrasaite, and Deepak Ramachandran. BoardgameQA: A dataset for natural language reasoning with contradictory information. Advances in Neural Information Processing Systems, 36:39052–39074, 2023.
- [24] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [25] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
- [26] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- [27] Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, et al. TreePO: Bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tree-based modeling. arXiv preprint arXiv:2508.17445, 2025.
- [28] Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From System 1 to System 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025.
- [29] Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, and Weizhu Chen. SwS: Self-aware weakness-driven problem synthesis in reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.08989, 2025.
- [30] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- [31] Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342, 2025.
- [32] American Mathematics Competitions (AMC 10/12). Mathematics Competition Series, 2023.
- [33] American Invitational Mathematics Examination (AIME). Mathematics Competition Series, 2024.
- [34] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [35] Sohan Patnaik, Milan Aggarwal, Sumit Bhatia, and Balaji Krishnamurthy. It helps to take a second opinion: Teaching smaller LLMs to deliberate mutually via selective rationale optimisation. arXiv preprint arXiv:2503.02463, 2025.
- [36] Sohan Patnaik, Milan Aggarwal, Sumit Bhatia, and Balaji Krishnamurthy. Learning together to perform better: Teaching small-scale LLMs to collaborate via preferential rationale tuning. arXiv preprint arXiv:2506.02519, 2025.
- [37] Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37:55249–55285, 2024.
- [38] Ali Razghandi, Seyed Mohammad Hadi Hosseini, and Mahdieh Soleymani Baghshah. CER: Confidence enhanced reasoning in LLMs. arXiv preprint arXiv:2502.14634, 2025.
- [39] Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. arXiv preprint arXiv:2407.10490, 2024.
- [40] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [41] Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, and Chuang Gan. Satori: Reinforcement learning with chain-of-action-thought enhances LLM reasoning via autoregressive search. arXiv preprint arXiv:2502.02508, 2025.
- [42] Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, et al. Enhancing LLM reasoning with iterative DPO: A comprehensive empirical investigation. arXiv preprint arXiv:2503.12854, 2025.
- [43] Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023.
- [44] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- [45] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024.
- [46] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [47] Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xeron Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, et al. TableBench: A comprehensive and complex benchmark for table question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25497–25506, 2025.
- [48] Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, and Sumitra Ganesh. GenARM: Reward guided generation with autoregressive reward model for test-time alignment. arXiv preprint arXiv:2410.08193, 2024.
- [49] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [50] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [51] Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, and Hongsheng Li. Probability-consistent preference optimization for enhanced LLM reasoning. arXiv preprint arXiv:2505.23540, 2025.
- [52] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023.
- [53] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025.
- [54] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. ReST-MCTS*: LLM self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37:64735–64772, 2024.
- [55] Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, and Sricharan Kumar. SAC3: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. arXiv preprint arXiv:2311.01740, 2023.
- [56] Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, and Yeyun Gong. Process-based self-rewarding language models. arXiv preprint arXiv:2503.03746, 2025.
- [57] Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, and Jun Zhu. Towards safe reasoning in large reasoning models via corrective intervention. arXiv preprint arXiv:2509.24393, 2025.
- [58] Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-Zero's "aha moment" in visual reasoning on a 2B non-SFT model. arXiv preprint arXiv:2503.05132, 2025.
- [59] Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. TTRL: Test-time reinforcement learning, 2025.

## Limitations

Although CASPO achieves consistent gains across different benchmarks and model families, two limitations are worth noting. First, our definition of confidence is based on Shannon entropy. This provides a simple and effective way to reduce logical hallucinations and improve calibration, but it is only one choice among many uncertainty measures. Other signals, such as model self-reflection or internal representations, may capture different aspects of uncertainty; a more systematic comparison of these alternatives is a useful direction for future work. Second, our data-construction pipeline relies on an offline step-wise evaluator, which may introduce evaluator-specific biases when the evaluator shares reasoning patterns with the target model. Although the evaluator is used only during offline data construction, future work could explore self-contained or jointly trained verification mechanisms to make the framework more robust and scalable.

## Broader Impact

CASPO encourages models to optimize reasoning step by step and to better align confidence with correctness. In doing so, it can improve the reliability and transparency of language-model reasoning, especially in tasks where intermediate steps matter. However, stronger reasoning ability can also increase risks in high-stakes or dual-use scenarios: models may produce more convincing outputs even when they are wrong, or be used for harmful purposes. We therefore emphasize the need for careful evaluation before deployment, particularly in downstream applications where errors may have serious consequences, as well as appropriate safeguards against misuse.

We provide additional experimental details, supplementary results, and implementation analysis in the appendix. Specifically, Appendix [A](https://arxiv.org/html/2605.07353#A1) details the training setup, evaluation protocol, baselines, and benchmarks, while Appendix [B](https://arxiv.org/html/2605.07353#A2) reports scalability results, training dynamics, aggregation-function ablations, and the full CASPO algorithm.

## Appendix A: Experimental Setup

### A.1 Details of Training and Evaluation

The base models include Qwen2.5-Math-7B, Qwen2.5-7B-Instruct, and Llama-3.1-8B-Instruct. All models are trained with full-parameter fine-tuning using the Open-RLHF framework [[18](https://arxiv.org/html/2605.07353#bib.bib18)]. The random seed is fixed at 42 for reproducibility, and all experiments run on 4 NVIDIA A800 (80GB) GPUs with mixed-precision (FP16) training.

#### Optimization hyperparameters.

The SFT stage uses a learning rate of $5\times 10^{-6}$, while the Direct Preference Optimization (DPO) stage adopts $5\times 10^{-7}$ to stabilize preference-based updates. Both stages share a maximum sequence length of 2048 tokens and a batch size of 64. The DPO loss coefficient $\beta$ is fixed at 0.1. For each DPO round, candidate responses are sampled with temperature $t=0.7$, and preference pairs are filtered according to the verifiable-pair criterion in Section [3.2](https://arxiv.org/html/2605.07353#S3.SS2).
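For concreteness, these settings can be collected into a single configuration object. The sketch below is illustrative only: the key names are ours and do not mirror the exact Open-RLHF config schema.

```python
# Hypothetical summary of the CASPO training configuration described above;
# key names are illustrative, not the exact Open-RLHF config format.
CASPO_TRAINING_CONFIG = {
    "sft": {"learning_rate": 5e-6},
    "dpo": {"learning_rate": 5e-7, "beta": 0.1},  # beta: DPO loss coefficient
    "max_seq_len": 2048,              # shared by the SFT and DPO stages
    "batch_size": 64,                 # shared by the SFT and DPO stages
    "dpo_sampling_temperature": 0.7,  # candidate sampling per DPO round
    "seed": 42,
    "precision": "fp16",
    "gpus": 4,                        # NVIDIA A800 (80GB)
}
```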

#### Training schedule.

Each training run lasts six epochs. For the first three epochs, the sampling temperature is fixed at $t=0.7$ to keep the data distribution close to the initial policy. For epochs four and five, it is increased to $t=1.0$, and it is raised further to $t=1.2$ in the final epoch. This annealed schedule reflects the observation that performance plateaus after three epochs, while higher temperatures promote exploration of novel reasoning paths without destabilizing optimization.
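The annealed schedule amounts to a simple step function of the epoch index; a minimal sketch (the function name is ours):

```python
def sampling_temperature(epoch: int) -> float:
    """Annealed sampling temperature over the six-epoch CASPO run (1-indexed):
    0.7 for epochs 1-3, 1.0 for epochs 4-5, and 1.2 for the final epoch."""
    if epoch <= 3:
        return 0.7   # stay close to the initial policy distribution
    if epoch <= 5:
        return 1.0   # begin exploring alternative reasoning paths
    return 1.2       # maximum exploration in the last epoch
```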

#### Evaluation protocol.

We follow the Qwen-Math evaluation suite. For every benchmark and model, generations are produced with greedy decoding ($t=0$), one output per input (no sampling, no self-consistency), and a 2048-token generation limit. All models use the same zero-shot CoT prompt template (shown below) to avoid prompt-engineering confounds. We report Pass@1. For datasets that provide official scoring scripts, we use those scripts; otherwise, answers are extracted from the boxed span (see below) and matched after standard normalization.
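For benchmarks without official scorers, the boxed-span extraction and matching step can be approximated as below. This is a simplified stand-in for the Qwen-Math scoring logic, and both helper names are ours.

```python
import re

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} span, tracking nested braces.
    A simplified stand-in for the Qwen-Math extraction scripts."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)

def normalize(answer: str) -> str:
    """Toy normalization: drop whitespace and surrounding dollar signs."""
    return re.sub(r"\s+", "", answer).strip("$")

# Example: normalize(extract_boxed(r"... so the answer is \boxed{\frac{1}{2}}."))
# evaluates to "\frac{1}{2}".
```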

### A.2 Details of Baselines

#### Training-based methods.

We compare CASPO with representative training-based self-improvement methods that update model parameters using verifiable feedback. GRPO [[40](https://arxiv.org/html/2605.07353#bib.bib40)] and Simple-RL-Zero [[53](https://arxiv.org/html/2605.07353#bib.bib53)] perform on-policy reinforcement learning with verifiable rewards. PURE-VR [[5](https://arxiv.org/html/2605.07353#bib.bib5)] propagates verifiable rewards across reasoning steps, while DPO-VP [[42](https://arxiv.org/html/2605.07353#bib.bib42)] applies iterative DPO to verifiable correct–incorrect output pairs. These methods improve reasoning performance but mainly optimize complete trajectories or final-answer preferences rather than explicitly calibrating step-wise confidence.

#### Inference-time methods.

We also compare with inference-time methods that modify decoding without updating model parameters. Chain-of-Thought prompting [[24](https://arxiv.org/html/2605.07353#bib.bib24)] elicits intermediate reasoning steps, Self-Consistency [[44](https://arxiv.org/html/2605.07353#bib.bib44)] aggregates multiple sampled chains by majority voting, and DiPT [[22](https://arxiv.org/html/2605.07353#bib.bib22)] uses diverse prompts to improve reasoning coverage. These methods enhance robustness but do not explicitly calibrate confidence or prune unreliable reasoning paths.

#### Distinction from prior methods.

Unlike the above methods, CASPO aligns confidence with correctness at the reasoning-step level. During training, it uses correct-but-uncertain and confidently incorrect steps to construct preference pairs; during inference, CaT expands or prunes trajectories according to cumulative step-wise confidence. This unified design enables more reliable supervision and more efficient search.

### A.3 Details of Benchmarks

**MATH500** [[30](https://arxiv.org/html/2605.07353#bib.bib30)] is a 500-problem subset of the MATH benchmark [[16](https://arxiv.org/html/2605.07353#bib.bib16)], uniformly sampled across subjects and difficulty levels, making it widely used for evaluating mathematical reasoning.

**Minerva-Math** [[26](https://arxiv.org/html/2605.07353#bib.bib26)] consists of 272 challenging mathematical problems. Some questions also involve scientific reasoning in related domains such as physics.

**OlympiadBench** [[14](https://arxiv.org/html/2605.07353#bib.bib14)] is a bilingual benchmark containing 8,476 Olympiad-level mathematics and physics problems, including problems adapted from the Chinese college entrance examination. We use its text-only, open-ended mathematics competition subset, which contains 674 problems.

**AMC2023** and **AIME2024** are competition-style mathematical reasoning benchmarks. AMC2023 contains 40 text-only problems from the 2023 American Mathematics Competitions, while AIME2024 contains 30 text-only problems from the 2024 American Invitational Mathematics Examination.

**BoardgameQA (BGQA)** [[23](https://arxiv.org/html/2605.07353#bib.bib23)] is a logical reasoning dataset with 15K unique problems designed to evaluate LLMs' ability to perform defeasible reasoning, where contradictions must be resolved using credibility or recency cues.

**CRUXEval** [[11](https://arxiv.org/html/2605.07353#bib.bib11)] evaluates code reasoning and execution. It contains 800 short Python functions, each paired with input-output examples; models must predict the correct output given a function snippet and an input.

**StrategyQA** [[9](https://arxiv.org/html/2605.07353#bib.bib9)] contains 2,780 multi-hop reasoning questions whose reasoning steps are implicit and must be inferred. Each example is paired with a decomposition into sub-steps and supporting evidence from Wikipedia.

**TableBench** [[47](https://arxiv.org/html/2605.07353#bib.bib47)] evaluates tabular reasoning in real-world data analysis tasks across 18 domains. We use the fact-checking and numerical reasoning subsets, resulting in 491 unique problems that cover fact verification, numerical calculation, and reasoning over structured tables.

**MMLUPro-STEM** [[45](https://arxiv.org/html/2605.07353#bib.bib45)] is a STEM-focused subset of MMLU-Pro, an enhanced version of MMLU [[15](https://arxiv.org/html/2605.07353#bib.bib15)] with more reasoning-intensive questions and expanded answer choices. We select six STEM domains (physics, chemistry, computer science, engineering, biology, and economics) and exclude mathematics to avoid overlap with the in-domain mathematical reasoning benchmarks.

## Appendix B: Additional Results

### B.1 Scalability to Stronger Base Models

As shown in Table [9](https://arxiv.org/html/2605.07353#A2.T9), CASPO achieves the best results across MATH500, AIME'24, and AIME'25 while using zero reward-model data. In comparison, rStar-Math and Satori rely on 3.64M and 240K reward-model samples, respectively. The gains are especially clear on the more challenging AIME benchmarks, where CASPO improves AIME'24 to 36.7 and AIME'25 to 33.3. These results indicate that calibrated intrinsic confidence can serve as an efficient alternative to external reward-model supervision and remains effective on stronger base models.

Table 9: Scalability to Qwen3 and comparison with tree-search baselines. CASPO achieves superior performance on Qwen3-8B-Base using zero reward-model data, outperforming rStar-Math [[12](https://arxiv.org/html/2605.07353#bib.bib12)] (3.64M RM samples) and Satori [[41](https://arxiv.org/html/2605.07353#bib.bib41)] (240K RM samples).

| Method | RM Data | MATH500 (Pass@1) | AIME'24 | AIME'25 |
| --- | --- | --- | --- | --- |
| Qwen3-8B-Base | – | 87.4 | 23.3 | 20.0 |
| + rStar-Math | 3.64M | 88.2 | 30.0 | 23.3 |
| + Satori | 240K | 88.6 | 30.0 | 26.7 |
| + CASPO (Ours) | 0 | 89.0 | 36.7 | 33.3 |
### B.2 Effect of Iterative Training

Figure [5](https://arxiv.org/html/2605.07353#A2.F5) shows greedy evaluation scores across training epochs under iterative CASPO training. Both Qwen2.5-7B-Math and LLaMA3.1-8B-Instruct achieve the largest gains in the first epoch, indicating that the initial round corrects major confidence miscalibrations. Subsequent epochs bring smaller but consistent improvements on most benchmarks, suggesting that iterative data regeneration helps refine harder reasoning cases and progressively improves confidence calibration.

![Refer to caption](https://arxiv.org/html/2605.07353v1/fig/iteration.png)

Figure 5: Greedy evaluation scores across iterative CASPO training epochs on Qwen2.5-7B-Math (left) and LLaMA3.1-8B-Instruct (right). Both models achieve the largest gains in the first epoch and continue to improve gradually in later epochs.
### B.3 Accuracy and Loss Dynamics

Figures [6a](https://arxiv.org/html/2605.07353#A2.F6.sf1) and [6b](https://arxiv.org/html/2605.07353#A2.F6.sf2) show the accuracy and loss dynamics during training. Across all models, accuracy increases as reward separation emerges, while loss decreases steadily, indicating that preference learning improves both reward discrimination and prediction reliability. Qwen2.5-Math-7B shows the smoothest convergence, with accuracy quickly approaching high levels and loss declining consistently. Qwen2.5-7B-Instruct stabilizes within about 200 steps, while Llama-3.1-8B-Instruct converges more slowly with larger loss fluctuations but still reaches strong final accuracy.

![Refer to caption](https://arxiv.org/html/2605.07353v1/x4.png)

(a) Training accuracy trajectories.

![Refer to caption](https://arxiv.org/html/2605.07353v1/x5.png)

(b) Loss reduction patterns.

Figure 6: Training dynamics during DPO optimization across Qwen2.5-Math-7B, Qwen2.5-7B-Instruct, and Llama-3.1-8B-Instruct.
### B.4 Step-wise Aggregation Function

We study two choices for aggregating token-level confidence into a step-wise score. Mean entropy measures the model's average uncertainty over the generated tokens:

$$f_{\text{entropy}}(s) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{v\in\mathcal{V}} p(t_i = v)\log p(t_i = v). \tag{5}$$

Multiplicative probability estimates the likelihood of the whole step:

$$f_{\text{mult}}(s) = \prod_{i=1}^{n} p(t_i). \tag{6}$$

Table [10](https://arxiv.org/html/2605.07353#A2.T10) compares the two choices: mean entropy better captures uncertainty calibration, while multiplicative probability favors high-likelihood steps and penalizes any low-confidence token. Results show that both signals are useful, but entropy provides stronger overall performance.

Table 10: Accuracy of LRMs using multiplicative and entropy aggregation on math benchmarks (MATH500, Minerva Math, OlympiadBench) and an open-domain benchmark (MMLU-STEM).

| Aggregation | Model | MATH500 | Minerva Math | OlympiadBench | MMLU-STEM |
| --- | --- | --- | --- | --- | --- |
| Multiplicative | Qwen2.5-Math-7B | 63.2 | 14.7 | 24.9 | 41.9 |
| Multiplicative | Qwen2.5-7B-Instruct | 80.5 | 32.7 | 38.1 | 45.2 |
| Multiplicative | Llama3.1-8B-Instruct | 48.7 | 12.8 | 22.6 | 36.1 |
| Entropy | Qwen2.5-Math-7B | 64.8 | 15.4 | 25.6 | 42.5 |
| Entropy | Qwen2.5-7B-Instruct | 83.2 | 33.5 | 38.4 | 45.6 |
| Entropy | Llama3.1-8B-Instruct | 49.6 | 13.6 | 23.5 | 36.0 |
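For reference, both aggregators can be computed directly from a step's token-level logits. Below is a minimal PyTorch sketch of Eqs. (5) and (6); the function name and tensor layout are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def step_score(logits: torch.Tensor, token_ids: torch.Tensor, mode: str = "entropy") -> float:
    """Aggregate token-level confidence into a step score.

    logits:    (n, V) pre-softmax scores for the n tokens of one reasoning step.
    token_ids: (n,) ids of the tokens actually generated.
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (n, V)
    if mode == "entropy":                                  # Eq. (5): mean Shannon entropy
        probs = log_probs.exp()
        token_entropy = -(probs * log_probs).sum(dim=-1)   # (n,) per-token entropy
        return token_entropy.mean().item()                 # lower = more confident
    if mode == "mult":                                     # Eq. (6): product of token probs
        chosen = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
        return chosen.sum().exp().item()                   # summed in log space for stability
    raise ValueError(f"unknown mode: {mode}")
```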
### B.5 CASPO Training Procedure

Algorithm [1](https://arxiv.org/html/2605.07353#alg1) summarizes the training procedure of CASPO. It first constructs step-wise preference pairs by comparing the model's confidence with the correctness signal from an offline critic. Correct but low-confidence steps are treated as preferred over competing alternatives, while incorrect steps are paired against the critic-provided correct step.

Algorithm 1: CASPO Training

1: Input: math dataset $\mathcal{D}_{\text{math}}$, target model $\pi_{\theta}$, critic $\pi_{\text{critic}}$, confidence threshold $\tau$, iterations $K$.
2: Initialize: preference dataset $\mathcal{D} \leftarrow \{\}$, reference model $\pi_{\text{ref}} \leftarrow \pi_{\theta}$.
3: for each question $x \in \mathcal{D}_{\text{math}}$ do
4:   for each sub-question $q_j$ of $x$ do
5:     Generate answer $s_j \sim \pi_{\theta}(\cdot \mid q_j)$ and confidence $c_j \leftarrow \mathrm{confidence}(s_j \mid q_j)$.
6:     Obtain reference step $g_j \leftarrow \pi_{\text{critic}}(\cdot \mid q_j)$.
7:     Set
$$(y_w, y_l) \leftarrow \begin{cases} (s_j,\ \text{competing candidate}) & \text{if } s_j = g_j \text{ and } c_j \leq \tau,\\ (g_j,\ s_j) & \text{if } s_j \neq g_j,\\ \text{skip} & \text{otherwise.} \end{cases}$$
8:     Add $(q_j, y_w, y_l)$ to $\mathcal{D}$ if not skipped.
9:   end for
10: end for
11: for $k = 1$ to $K$ do
12:   $\pi_{\theta_k} \leftarrow \arg\min_{\theta} \mathcal{L}_{\mathrm{DPO}}(\pi_{\theta}, \pi_{\text{ref}}; \mathcal{D})$.
13:   $\pi_{\text{ref}} \leftarrow \pi_{\theta_k}$.
14: end for
15: Return: optimized model $\pi_{\theta_K}$.
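The pair-construction loop (lines 3–10 of Algorithm 1) can be rendered schematically in Python; `generate`, `confidence`, and `critic_step` are hypothetical helpers standing in for the policy and critic calls, not the released API.

```python
# Schematic rendering of lines 3-10 of Algorithm 1. `generate`, `confidence`,
# and `critic_step` are hypothetical helpers, not the released API.
def build_preference_pairs(math_dataset, tau):
    pairs = []
    for x in math_dataset:
        for q in x.sub_questions:
            s, competing = generate(q)   # sampled step plus a competing candidate
            c = confidence(s, q)         # entropy-based step confidence
            g = critic_step(q)           # reference step from the offline critic
            if s == g and c <= tau:
                pairs.append((q, s, competing))  # correct but under-confident: prefer s
            elif s != g:
                pairs.append((q, g, s))          # incorrect: prefer the critic's step
            # confident and correct steps are skipped
    return pairs
```

The $K$ subsequent DPO rounds (lines 11–14) then optimize $\mathcal{L}_{\mathrm{DPO}}$ on these pairs, resetting the reference model to the latest policy after each round.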

### B.6 CaT Inference Procedure

Algorithm [2](https://arxiv.org/html/2605.07353#alg2) summarizes the inference-time procedure of CaT. CaT does not impose a predefined reasoning format; steps are segmented by natural delimiters such as line breaks or final-answer markers. At each step, CaT evaluates the calibrated entropy-based confidence of each active branch, prunes branches whose cumulative confidence falls below $\tau$, and reallocates the remaining budget to more promising branches.

Algorithm 2: CaT Inference

1: Input: query $x$, calibrated model $\pi_{\theta}$, threshold $\tau$, branch budget $K$, maximum steps $T$.
2: Initialize: active branches $\mathcal{B} \leftarrow \{(\emptyset, 1.0)\}$, completed answers $\mathcal{A} \leftarrow \emptyset$.
3: for $t = 1$ to $T$ do
4:   $\mathcal{B}_{\mathrm{new}} \leftarrow \emptyset$.
5:   for each branch $(z_{1:t-1}, C_{1:t-1}) \in \mathcal{B}$ do
6:     Generate candidate next steps $\{z_t^{(k)}\}_{k=1}^{K}$ from $\pi_{\theta}(\cdot \mid x, z_{1:t-1})$.
7:     for each candidate step $z_t^{(k)}$ do
8:       Compute step confidence $c_t^{(k)}$ from calibrated token-level entropy.
9:       $C_{1:t}^{(k)} \leftarrow C_{1:t-1} \cdot c_t^{(k)}$.
10:      if $z_t^{(k)}$ contains a final answer then
11:        Add $(z_{1:t-1}, z_t^{(k)}, C_{1:t}^{(k)})$ to $\mathcal{A}$.
12:      else if $C_{1:t}^{(k)} \geq \tau$ then
13:        Add $(z_{1:t-1}, z_t^{(k)}, C_{1:t}^{(k)})$ to $\mathcal{B}_{\mathrm{new}}$.
14:      end if
15:    end for
16:  end for
17:  Keep the top-$K$ branches in $\mathcal{B}_{\mathrm{new}}$ by cumulative confidence.
18:  $\mathcal{B} \leftarrow \mathcal{B}_{\mathrm{new}}$.
19:  if $\mathcal{B} = \emptyset$ then
20:    break
21:  end if
22: end for
23: if $\mathcal{A} \neq \emptyset$ then
24:   Return the completed answer with the highest cumulative confidence.
25: else
26:   Return failure and mark the query as incorrect under Pass@1.
27: end if
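The branch-expansion and pruning logic of Algorithm 2 can likewise be sketched in Python. Here `propose_steps`, `step_confidence`, and `contains_final_answer` are hypothetical helpers standing in for the model call, the calibrated entropy-based confidence, and final-answer detection.

```python
import heapq

# Schematic rendering of Algorithm 2; helper functions are hypothetical.
def cat_inference(x, model, tau, K, T):
    branches = [((), 1.0)]     # (partial reasoning trace, cumulative confidence)
    completed = []             # finished traces containing a final answer
    for _ in range(T):
        new_branches = []
        for trace, conf in branches:
            for step in propose_steps(model, x, trace, K):   # K candidates per branch
                c = conf * step_confidence(model, x, trace, step)
                if contains_final_answer(step):
                    completed.append((trace + (step,), c))
                elif c >= tau:   # prune branches whose cumulative confidence drops below tau
                    new_branches.append((trace + (step,), c))
        # reallocate the budget: keep only the top-K branches by confidence
        branches = heapq.nlargest(K, new_branches, key=lambda b: b[1])
        if not branches:
            break
    if completed:
        return max(completed, key=lambda a: a[1])[0]   # most confident completed trace
    return None                                        # failure: incorrect under Pass@1
```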
