CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning
Summary
This paper introduces CoRA, a GRPO-based reinforcement learning framework that aligns LLM confidence with generated rationales to improve the reliability of chain-of-thought reasoning, achieving up to 26.51% reduction in misalignment error across multiple benchmarks.
# CoRA: Confidence–Rationale Alignment for Reliable Chain-of-Thought Reasoning
Source: [https://arxiv.org/html/2606.14961](https://arxiv.org/html/2606.14961)
Juming Xiong¹, Weixin Liu¹, Kevin Guo¹, Congning Ni², Junchao Zhu¹, Chongyu Qu¹, Chao Yan², Katherine Brown², Avinash Baidya³, Xiang Gao³, Bradley Malin¹,², Zhijun Yin¹,²
¹ Vanderbilt University, ² Vanderbilt University Medical Center, ³ Intuit AI Research
###### Abstract
Chain\-of\-thought \(CoT\) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported\. We study confidence–rationale alignment: whether a model’s confidence in its committed answer is justified by its generated rationale\. We introduce a GRPO\-based reinforcement learning framework that jointly rewards answer correctness, committed\-answer probability, and rubric\-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge\. Across MedQA, MathQA, and OpenBookQA using three open\-weight LLMs, our method reduces the confidence–rationale alignment error by up to 26\.51% compared with untuned checkpoints, SFT, and correctness\-only GRPO, while maintaining competitive accuracy and often improving calibration\. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them\.
## 1 Introduction
Figure 1: An example of confidence–rationale misalignment\. Both the base model and a correctness\-oriented method choose the correct answer with near\-perfect confidence but include an unsupported claim that bats lay eggs\. Our method produces a rationale that better supports the committed answer by connecting the selected option, swallow, to hatched offspring\.

Chain\-of\-thought \(CoT\) reasoning has been shown to improve large language model \(LLM\) performance on arithmetic, commonsense, symbolic, and other reasoning tasks \(Wei et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib1); Kojima et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib2)\)\. In this paper, we use *rationale* to refer to the generated CoT explanation that accompanies the final answer\. In many high\-stakes user\-facing settings, the reliability of a response is judged not only by whether the final answer is correct, but also by whether the model appears confident and whether its rationale justifies the answer\. This motivates *confidence–rationale alignment*: the extent to which a model’s confidence in its committed answer is justified by the rationale it generates\.
This problem is important because a model may produce a fluent rationale, choose an incorrect answer, and still assign high confidence to that answer\. Such failures can be subtle: when an answer is accompanied by both a persuasive rationale and a strong confidence signal, users may find the error difficult to recognize or correct\. Prior work in human\-AI interaction shows that rationales and explanations can strongly shape user reliance on model outputs, sometimes increasing reliance even when the response is incorrect \(Vasconcelos et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib30); Steyvers et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib28); Kim et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib29)\)\. A trustworthy reasoning model should therefore be confident only when its rationale adequately supports the answer it commits to\.
Existing work addresses related parts of this problem, but not their interaction\. CoT prompting, self\-consistency, and tree\-structured search improve reasoning accuracy by eliciting or searching over intermediate reasoning traces \(Wei et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib1); Wang et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib3); Yao et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib4)\)\. However, generated rationales may not faithfully reflect the factors that determine the model’s answer \(Turpin et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib5); Lanham et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib6); Paul et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib7); Tutek et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib8)\)\. Separately, calibration methods aim to make answer confidence reflect empirical correctness \(Guo et al\., [2017](https://arxiv.org/html/2606.14961#bib.bib10); Zhao et al\., [2021](https://arxiv.org/html/2606.14961#bib.bib12); Kadavath et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib13); Tian et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib15); Xie et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib16); Shen et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib17)\)\. Yet they typically evaluate output probabilities in aggregate and do not assess whether an individual rationale supports the answer at the matched confidence level\. This distinction matters in human\-facing systems, where confidence and explanation signals can affect user reliance and may not be interpreted uniformly across users \(Lin et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib41); Tian et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib15); Steyvers et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib28); Kim et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib29)\)\.
To address this gap, we introduce a confidence–rationale alignment framework, CoRA, for multiple\-choice reasoning\. CoRA has two components\. First, we use a structured LLM\-as\-judge rubric to assess whether a rationale is grounded, coherent, task\-matched, and properly connected to the model’s committed answer without exposing the gold answer \(Zheng et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib21); Liu et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib22); Kim et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib23)\)\. Second, we optimize the LLM with a Group Relative Policy Optimization \(GRPO\)\-based reward \(Shao et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib31); DeepSeek\-AI, [2025](https://arxiv.org/html/2606.14961#bib.bib44)\) that combines answer correctness, rationale\-support quality, and committed\-answer confidence\. Unlike correctness\-only GRPO, this objective encourages the model to ground its confidence in the generated rationale rather than treating confidence as an isolated scalar\.
We evaluate CoRA on three benchmark datasets, MedQA, MathQA, and OpenBookQA \(Jin et al\., [2021](https://arxiv.org/html/2606.14961#bib.bib32); Amini et al\., [2019](https://arxiv.org/html/2606.14961#bib.bib33); Mihaylov et al\., [2018](https://arxiv.org/html/2606.14961#bib.bib34)\), across three open\-weight models\. We measure answer accuracy, committed\-answer probability calibration using expected calibration error \(ECE\) and Brier score \(Brier, [1950](https://arxiv.org/html/2606.14961#bib.bib11)\), and confidence–rationale mismatch, which captures cases where confidence exceeds rationale\-support quality\. Empirically, CoRA most consistently reduces unsupported overconfidence, with the clearest gains on MathQA, while maintaining competitive accuracy\.
Our contributions are fourfold\. \(1\) We formulate *confidence–rationale alignment* as a reliability problem for reasoning LLMs, requiring committed\-answer confidence to be justified by the generated rationale\. \(2\) We design a structured LLM\-as\-judge rubric that evaluates rationale support for the model’s selected answer without exposing the gold answer\. \(3\) We propose a GRPO\-based training framework that combines answer correctness, rationale\-support quality, and committed\-answer confidence to reduce unsupported overconfidence\. \(4\) We evaluate CoRA on three benchmark datasets and three open\-weight models, showing reduced confidence–rationale error in most settings while maintaining competitive accuracy; we further introduce a downstream correctness\-prediction task showing that CoRA can make generated reasoning traces more diagnostically informative\.
## 2 Related Work
### 2\.1 CoT Reasoning and Rationale Faithfulness
CoT prompting improves LLM reasoning by eliciting intermediate reasoning steps before a final answer \(Wei et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib1)\)\. Follow\-up methods such as zero\-shot CoT, self\-consistency, Tree of Thoughts, and STaR further show that generated reasoning traces can improve task accuracy through prompting, sampling, search, or bootstrapping from model\-generated rationales \(Kojima et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib2); Wang et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib3); Yao et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib4); Zelikman et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib42)\)\. These methods demonstrate the practical value of intermediate reasoning traces, but they primarily optimize or evaluate final\-answer performance\.
Improved reasoning performance, however, does not guarantee that a generated rationale faithfully explains the model’s prediction\. Prior work shows that CoT rationales can omit factors that influence model outputs, rationalize biased or incorrect answers, or fail to causally determine the final answer \(Turpin et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib5); Lanham et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib6)\)\. More recent work uses causal mediation, process verification, or unlearning\-based interventions to test whether reasoning steps influence or justify final predictions \(Lightman et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib43); Paul et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib7); Tutek et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib8)\)\. Our work builds on this reliability concern, but focuses on a different question: whether the model’s confidence in its committed answer is justified by the rationale it presents\.
### 2\.2 Confidence Calibration and Confidence–Quality Alignment
Confidence calibration aims to ensure that predictive confidence reflects empirical correctness\. Classical calibration work shows that neural networks can be poorly calibrated and that post\-hoc methods such as temperature scaling can improve probability estimates \(Guo et al\., [2017](https://arxiv.org/html/2606.14961#bib.bib10)\)\. In LLMs, prior work studies prompt\-induced calibration errors, whether models can assess the correctness of their own answers, and how models express uncertainty in probabilities or words \(Zhao et al\., [2021](https://arxiv.org/html/2606.14961#bib.bib12); Kadavath et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib13); Lin et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib41); Tian et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib15)\)\. Other work proposes post\-hoc or auxiliary calibration methods and surveys broader confidence estimation techniques for LLMs \(Xie et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib16); Shen et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib17); Geng et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib18)\)\.
Recent work has also begun to connect confidence with response quality\. CONQORD aligns verbalized confidence with response quality using reinforcement learning \(Tao et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib25)\), while CoT\-UQ and CER study whether CoT or confidence signals can improve uncertainty quantification and reasoning behavior \(Zhang and Zhang, [2025](https://arxiv.org/html/2606.14961#bib.bib26); Razghandi et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib27)\)\. These studies are closely related to our motivation, but our setting differs in two ways: 1\) we use committed\-answer probability rather than verbalized confidence; and 2\) we explicitly evaluate whether confidence is supported by the rationale, yielding an instance\-level alignment signal beyond aggregate calibration\.
### 2\.3 Rationale Evaluation, LLM\-as\-Judge, and RL Training
Our rationale\-support rubric is grounded in prior work on rationales, faithfulness, and rubric\-based evaluation\. These studies investigate how models can provide textual evidence for their decisions \(Lei et al\., [2016](https://arxiv.org/html/2606.14961#bib.bib38)\) and provide a benchmark for evaluating rationalized NLP models \(DeYoung et al\., [2020](https://arxiv.org/html/2606.14961#bib.bib40)\)\. Jacovi and Goldberg \([2020](https://arxiv.org/html/2606.14961#bib.bib39)\) further emphasize that explanations should be evaluated by their relationship to model predictions, not only by their surface plausibility\. These ideas motivate our focus on answer\-support quality: whether a rationale uses relevant evidence, follows coherent inference steps, and bridges to the model’s selected answer\.
LLM\-as\-judge evaluation provides a scalable way to assess open\-ended outputs, but prior work also shows that judge behavior can be sensitive to rubric design and may exhibit biases \(Zheng et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib21)\)\. G\-Eval and Prometheus demonstrate that structured evaluation prompts and fine\-grained rubrics can improve the consistency and usefulness of model\-based evaluation \(Liu et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib22); Kim et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib23)\)\. Following this line of work, our judge is constrained by a structured rubric and is not given the gold answer, so that it evaluates whether the rationale supports the model’s own committed answer rather than whether the answer is correct\.
Our optimization method is related to reinforcement learning for LLMs\. PPO\-style optimization has been widely used in LLM alignment \(Schulman et al\., [2017](https://arxiv.org/html/2606.14961#bib.bib47); Ouyang et al\., [2022](https://arxiv.org/html/2606.14961#bib.bib48)\), and recent reasoning\-oriented RL methods such as DeepSeekMath and DeepSeek\-R1 show that reinforcement learning can improve mathematical and general reasoning behavior \(Shao et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib31); DeepSeek\-AI, [2025](https://arxiv.org/html/2606.14961#bib.bib44)\)\. In contrast to correctness\-only reinforcement learning, our reward incorporates answer correctness, rationale\-support quality, and committed\-answer confidence\. This connects reasoning supervision with calibration by training models to reduce unsupported overconfidence rather than only maximizing final\-answer correctness\.
## 3 Method
We develop a reinforcement learning framework for confidence–rationale alignment in multiple\-choice reasoning, as shown in Figure [2](https://arxiv.org/html/2606.14961#S3.F2)\.

Figure 2: An overview of the CoRA framework\. Given multiple\-choice questions, the policy model samples a group of responses, each consisting of a rationale and a committed answer\. For each response, an LLM judge estimates rationale\-support quality $Q$, while the policy model provides committed\-answer confidence $C$\. We first combine $Q$ and $C$ into an alignment score $T$, and then use $T$, correctness $A$, and confidence $C$ to compute the final reward $R$ for GRPO optimization\. The reward favors well\-supported confident answers and penalizes unsupported overconfidence\.

### 3\.1 Rubric for Assessing Rationale Quality
We define rationale quality as the extent to which a generated rationale provides coherent, grounded, and task\-appropriate support for the model’s committed answer\. We estimate this quality using an LLM\-as\-judge that takes as input the dataset name, question, answer options, optional auxiliary background knowledge, generated rationale, and committed answer\. The auxiliary background knowledge does not include the gold option label, official solution, official explanation, or gold rationale\. The judge evaluates whether the rationale supports the committed answer, not whether the committed answer is correct\. We treat this judge score as a scalable proxy for rationale\-support quality rather than as a human gold standard\. To reduce label leakage, the judge is constrained by a structured rubric and is not given the gold answer\.
The rubric draws on criteria that recur in prior work: 1\) work on rationales and faithfulness emphasizes whether explanations identify relevant evidence and support model predictions \(Lei et al\., [2016](https://arxiv.org/html/2606.14961#bib.bib38); DeYoung et al\., [2020](https://arxiv.org/html/2606.14961#bib.bib40); Jacovi and Goldberg, [2020](https://arxiv.org/html/2606.14961#bib.bib39)\); 2\) CoT faithfulness work highlights the need to assess whether reasoning steps coherently justify final predictions \(Turpin et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib5); Lanham et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib6); Paul et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib7)\); and 3\) LLM\-as\-judge work motivates explicit, fine\-grained rubrics to reduce ambiguity in open\-ended evaluation \(Zheng et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib21); Liu et al\., [2023](https://arxiv.org/html/2606.14961#bib.bib22); Kim et al\., [2024](https://arxiv.org/html/2606.14961#bib.bib23)\)\. Based on these principles, our general axes cover format validity, task understanding, evidence grounding, inference coherence, answer bridging, and structure\. The task\-specific axes add dataset\-relevant requirements, such as computation and quantity setup for MathQA, clinical application for MedQA, and science\-fact application for OpenBookQA\.
The rubric provides an operational measure of answer\-support quality\. It evaluates whether a generated reasoning trace supports the model’s committed answer through grounded, coherent, and task\-matched reasoning\. To reduce judge\-specific bias, we evaluate alignment with three independent judge models and report whether the observed trends remain consistent across judges\.
Table 1: A modular rationale\-support rubric\. General axes are shared across datasets; each dataset adds two task\-specific axes\.

Each axis is assigned a categorical label and mapped to a scalar value in $[0,1]$\. Let $Q_{\mathrm{gen}}$ denote the aggregate score over the six general axes, and $Q_{\mathrm{task}}$ denote the aggregate score over the two task\-specific axes for the corresponding dataset\. The final rationale\-support score is:

$$Q = w_{\mathrm{gen}} Q_{\mathrm{gen}} + w_{\mathrm{task}} Q_{\mathrm{task}} - P, \tag{1}$$

where $P$ denotes a penalty for severe structural failure or unsupported reasoning\. The resulting score is clipped to $[0,1]$\. The format axis $G_{1}$ acts as a gate: if the response lacks a valid parseable answer, $Q$ is set to zero\.
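The scoring rule in Equation 1 can be sketched in a few lines. This is only an illustration: the label names, the label-to-scalar mapping, the use of a mean as the aggregate, and the weight values (`w_gen=0.75`, `w_task=0.25`) are all assumptions made for the example, since this section does not specify them.

```python
# Hypothetical mapping from categorical judge labels to scalars in [0, 1].
LABEL_TO_SCORE = {"fail": 0.0, "partial": 0.5, "pass": 1.0}


def rationale_support_score(general_labels, task_labels,
                            w_gen=0.75, w_task=0.25, penalty=0.0):
    """Sketch of Q = w_gen * Q_gen + w_task * Q_task - P, clipped to [0, 1].

    general_labels: labels for the six general axes, with G1 (format) first.
    task_labels: labels for the two task-specific axes.
    Weights, mean aggregation, and label values are illustrative assumptions.
    """
    # Format gate: without a valid parseable answer, Q is set to zero.
    if general_labels[0] == "fail":
        return 0.0
    q_gen = sum(LABEL_TO_SCORE[x] for x in general_labels) / len(general_labels)
    q_task = sum(LABEL_TO_SCORE[x] for x in task_labels) / len(task_labels)
    q = w_gen * q_gen + w_task * q_task - penalty
    # Clip the final score to [0, 1].
    return min(1.0, max(0.0, q))
```

Under these assumed values, a response with all axes passing scores 1.0, and the same response with a penalty of 0.25 scores 0.75.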
### 3\.2Confidence–Rationale Alignment Reward
We design a reward function that jointly considers answer correctness, rationale\-support quality, and committed\-answer confidence\.
#### Committed\-answer confidence\.
Let $\mathcal{Y}$ denote the set of answer options and let $y \in \mathcal{Y}$ be the parsed answer generated by the model\. We compute the committed\-answer confidence $C$ by normalizing model scores over candidate answer labels:

$$C = \frac{\exp(s_{y})}{\sum_{\ell \in \mathcal{Y}} \exp(s_{\ell})}, \tag{2}$$

where $s_{\ell}$ is the conditional log\-probability score of candidate label $\ell$ under the exact generated prefix up to, but excluding, the final answer label\. Thus, $C \in [0,1]$ measures the normalized probability assigned to the model’s committed answer\. We use option\-label probabilities rather than verbalized confidence\. Details are described in Appendix [A\.1](https://arxiv.org/html/2606.14961#A1.SS1)\.
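Equation 2 is a softmax restricted to the candidate option labels. A minimal sketch follows, assuming the per-label log-probability scores have already been obtained from the policy model; the function name and the dictionary input format are hypothetical.

```python
import math


def committed_answer_confidence(label_logprobs, committed_label):
    """Normalize label scores with a softmax and return the committed label's share.

    label_logprobs: dict mapping each option label (e.g. "A") to its
        log-probability score under the generated prefix (assumed precomputed).
    committed_label: the parsed answer label chosen by the model.
    """
    # Subtract the max score for numerical stability; the ratio is unchanged.
    top = max(label_logprobs.values())
    exp_scores = {k: math.exp(v - top) for k, v in label_logprobs.items()}
    return exp_scores[committed_label] / sum(exp_scores.values())


# Example with four options whose raw probabilities are 0.6, 0.2, 0.1, 0.1.
scores = {"A": math.log(0.6), "B": math.log(0.2),
          "C": math.log(0.1), "D": math.log(0.1)}
confidence = committed_answer_confidence(scores, "A")
```

Because the normalization runs only over the option labels, probability mass the model places on non-label tokens does not lower the confidence value.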
#### Alignment score\.
A desirable response should maintain consistency between confidence and rationale support: higher confidence should be accompanied by stronger rationale support, while weakly supported rationales should receive lower confidence\. We operationalize this criterion with a confidence–rationale alignment score $T$ as follows:

$$T = \operatorname{clip}\big(\alpha Q + (1-\alpha) C - \beta_{\mathrm{over}} \max(C-Q,0) - \beta_{\mathrm{under}} \max(Q-C,0),\ 0,\ 1\big). \tag{3}$$

The first two terms combine rationale support and committed\-answer confidence\. The penalty terms capture the extent of the confidence–rationale mismatch: $\max(C-Q,0)$ penalizes confidence that exceeds rationale support, while $\max(Q-C,0)$ penalizes confidence that falls below rationale support\. We use asymmetric penalties to prioritize reducing unsupported high\-confidence answers\.
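A short sketch may help make the asymmetry in the alignment score concrete. The hyperparameter values below (`alpha=0.5`, `beta_over=1.0`, `beta_under=0.25`) are assumptions chosen for illustration, with the overconfidence penalty set larger than the underconfidence penalty; they are not values reported in this section.

```python
def alignment_score(q, c, alpha=0.5, beta_over=1.0, beta_under=0.25):
    """Sketch of the clipped confidence-rationale alignment score T.

    q: rationale-support quality in [0, 1].
    c: committed-answer confidence in [0, 1].
    alpha, beta_over, beta_under: illustrative (assumed) hyperparameters.
    """
    base = alpha * q + (1 - alpha) * c
    over = max(c - q, 0.0)   # confidence exceeds rationale support
    under = max(q - c, 0.0)  # confidence falls below rationale support
    t = base - beta_over * over - beta_under * under
    # Clip to [0, 1].
    return min(1.0, max(0.0, t))


overconfident = alignment_score(q=0.5, c=0.9)   # same gap, confidence too high
underconfident = alignment_score(q=0.9, c=0.5)  # same gap, confidence too low
```

With these assumed settings, the same absolute gap of 0.4 costs more when confidence exceeds support than when it falls below it.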
#### Overall reward\.
The final reward combines correctness and confidence–rationale alignment:

$$R = A(1+\lambda T) - \gamma(1-A)C - \beta \max(C-Q,0). \tag{4}$$

Here, $A \in \{0,1\}$ is a binary answer\-correctness signal, while $Q$, $C$, and $T$ are continuous scores in $[0,1]$\. The first term rewards correct answers, with a larger reward for stronger confidence–rationale alignment\. The second term penalizes confident wrong answers, while the third term penalizes confidence that exceeds rationale support, regardless of answer correctness\. This reward encourages the model to produce accurate answers, generate rationales that support those answers, and calibrate its confidence according to the quality of its reasoning\.
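The three terms of the overall reward can be traced with a small sketch. The coefficient values (`lam`, `gamma`, `beta`) are assumptions for illustration only, and the alignment score `t` is passed in directly rather than recomputed.

```python
def cora_reward(a, q, c, t, lam=1.0, gamma=0.5, beta=0.5):
    """Sketch of R = A(1 + lam*T) - gamma*(1 - A)*C - beta*max(C - Q, 0).

    a: binary correctness (0 or 1).
    q, c, t: rationale support, confidence, and alignment score in [0, 1].
    lam, gamma, beta: illustrative (assumed) coefficients.
    """
    correct_term = a * (1 + lam * t)           # correct answers, scaled by alignment
    wrong_conf_penalty = gamma * (1 - a) * c   # confident wrong answers
    overconf_penalty = beta * max(c - q, 0.0)  # confidence above rationale support
    return correct_term - wrong_conf_penalty - overconf_penalty


supported_correct = cora_reward(a=1, q=0.75, c=0.75, t=0.75)
unsupported_correct = cora_reward(a=1, q=0.25, c=1.0, t=0.0)
confident_wrong = cora_reward(a=0, q=0.25, c=1.0, t=0.0)
```

Under these assumed coefficients, a correct answer with a weak rationale and near-certain confidence still earns a positive reward, but a noticeably smaller one than a correct, well-supported answer, while a confident wrong answer is penalized by both the second and third terms.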
### 3\.3 Optimization
We optimize the policy model using GRPO\. For each input question, the policy samples a group of candidate completions\. Each completion is parsed into a rationale and a committed answer, scored for answer correctness, evaluated by the LLM judge for rationale\-support quality, and assigned a committed\-answer confidence score\. The resulting reward in Equation [4](https://arxiv.org/html/2606.14961#S3.E4) is used to compute group\-relative advantages within the sampled response group\. The policy is then updated to increase the likelihood of responses with higher relative rewards while regularizing against large deviations from the reference policy\. This allows the model to learn from multiple signals simultaneously\.
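The group-relative step can be sketched as follows. This assumes the common GRPO formulation in which each reward is standardized by the mean and standard deviation of its own sampled group; the text above does not spell out the exact normalization used, so treat the formula and the `eps` constant as assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group of completions.

    rewards: list of scalar rewards, one per completion for the same question.
    eps: small constant (assumed) to avoid division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: a group of four completions with hypothetical rewards.
advantages = group_relative_advantages([1.75, 0.625, -0.875, 1.75])
```

Completions that score above their group's average receive positive advantages and are made more likely, so the signal depends on how a response compares with its siblings rather than on the absolute reward scale.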
### 3\.4 Correctness Prediction from CoT Reasoning Linguistic Features
As an auxiliary diagnostic of user\-facing reasoning transparency, we test whether surface linguistic features of generated reasoning traces contain information about answer correctness\. Given a generated reasoning trace $r_{i}$, we extract a feature vector $\phi(r_{i})$ and train a logistic regression classifier to predict the binary correctness label\. We use stratified cross\-validation and report out\-of\-fold predicted probabilities\. Feature standardization is performed within each training fold to avoid information leakage\. The full feature set is listed in Appendix [B](https://arxiv.org/html/2606.14961#A2), Table [5](https://arxiv.org/html/2606.14961#A2.T5)\.
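One way to realize this protocol is to place the scaler inside a scikit-learn pipeline so that it is refit on each training fold. The sketch below uses synthetic stand-in features and an assumed five-fold split; the real feature set is the one listed in the appendix, and the fold count and random seed here are not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def out_of_fold_correctness_probs(features, correct, n_splits=5, seed=0):
    """Return out-of-fold P(correct) for each reasoning trace.

    features: (n_samples, n_features) array of linguistic features.
    correct: (n_samples,) array of binary correctness labels.
    n_splits, seed: assumed settings for illustration.
    """
    # The scaler sits inside the pipeline, so it is fit on each training fold
    # only and never sees the held-out fold.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    probs = cross_val_predict(model, features, correct, cv=cv,
                              method="predict_proba")
    return probs[:, 1]


# Synthetic stand-in data: 200 traces, 4 hypothetical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
oof = out_of_fold_correctness_probs(X, y)
```

The resulting out-of-fold probabilities can then be scored with a ranking metric such as AUROC, since each prediction comes from a model that never saw that example during fitting.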
## 4 Experiments
### 4\.1 Experimental Settings
#### Datasets\.
We evaluate CoRA on three multiple\-choice reasoning benchmarks spanning mathematical, medical, and scientific reasoning\.
- MathQA \(Amini et al\., [2019](https://arxiv.org/html/2606.14961#bib.bib33)\): A mathematical reasoning benchmark consisting of multiple\-choice word problems that require quantity identification, symbolic reasoning, and arithmetic computation\.
- MedQA \(Jin et al\., [2021](https://arxiv.org/html/2606.14961#bib.bib32)\): A medical question\-answering benchmark derived from medical licensing exams\. It evaluates clinical and biomedical reasoning over patient\-centered and factual medical scenarios\.
- OpenBookQA \(Mihaylov et al\., [2018](https://arxiv.org/html/2606.14961#bib.bib34)\): An elementary science question\-answering benchmark that requires using scientific facts and commonsense knowledge to select the correct answer\.
#### Models\.
We use three open\-weight models as policy models: Ministral 3 8B Reasoning \(Mistral AI, [2025](https://arxiv.org/html/2606.14961#bib.bib92)\), Qwen2\.5 7B \(Qwen Team, [2024](https://arxiv.org/html/2606.14961#bib.bib35)\), and Gemma 2 9B Instruct \(Gemma Team, [2024](https://arxiv.org/html/2606.14961#bib.bib36)\)\. These models cover different model families and instruction\-tuning settings\. We use the same prompting format and evaluation protocol across all policy models and datasets\.
For rubric\-based training, we use GPT\-OSS 20B as the judge model\. The judge is provided with task\-relevant auxiliary background knowledge, but not the gold option label, official solution, official explanation, or gold rationale\. This setup approximates a stronger evaluator with access to external evidence, reducing the chance that the judge rewards or penalizes a reasoning trace due to missing background knowledge\.
For evaluation, we use three judge models, GPT\-OSS 20B \(OpenAI, [2025](https://arxiv.org/html/2606.14961#bib.bib93)\), Llama 3\.1 70B Instruct \(Llama Team, AI @ Meta, [2024](https://arxiv.org/html/2606.14961#bib.bib94)\), and Qwen 3 32B \(Yang et al\., [2025](https://arxiv.org/html/2606.14961#bib.bib67)\), to measure rationale\-support quality and confidence–rationale alignment\. The evaluation judges are also provided with the same auxiliary background knowledge used by the training judge, under the same no\-gold\-information constraint\. This design allows us to assess whether the observed alignment trends remain stable across different judge models rather than depending on a single evaluator\.
#### Baselines\.
We compare CoRA with three baselines: 1\) Base: The original model without task\-specific fine\-tuning; 2\) SFT: Supervised fine\-tuning on task examples containing rationales and final answers; 3\) GRPO: GRPO training with a rule\-based correctness reward, where the reward is determined by whether the parsed answer matches the gold label\. It uses the same optimization setup as CoRA, but replaces the confidence–rationale reward with a correctness\-only reward\.
### 4\.2 Evaluation Metrics
To evaluate whether CoRA improves both task performance and confidence–rationale alignment, we employ accuracy and calibration metrics, together with rationale\-centered alignment diagnostics\.
- Accuracy: It measures whether the parsed answer matches the gold label\. We use the final parsed option as the model prediction\.
- Expected Calibration Error \(ECE\) \(Guo et al\., [2017](https://arxiv.org/html/2606.14961#bib.bib10)\): ECE measures the discrepancy between committed\-answer confidence and empirical correctness\. We compute ECE using 10 equal\-width confidence bins\.
- Brier Score \(Brier, [1950](https://arxiv.org/html/2606.14961#bib.bib11)\): It is a proper scoring rule for probabilistic calibration\. For each example, it computes the squared difference between committed\-answer confidence and binary correctness\.
- Rationale\-Support Quality: We evaluate it using a rubric\-based judge, which assigns each generated reasoning trace a score $Q \in [0,1]$\. It measures whether the reasoning trace is grounded, coherent, task\-matched, and supportive of the committed answer\.
- Confidence–Rationale Alignment Error: We evaluate confidence–rationale alignment using the average absolute gap between committed\-answer confidence and rationale\-support quality:

  $$E_{\mathrm{align}} = \frac{1}{N} \sum_{i=1}^{N} |C_{i} - Q_{i}|. \tag{5}$$

  Lower $E_{\mathrm{align}}$ indicates stronger alignment between confidence and rationale support\.
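The three scalar metrics above can be sketched in plain Python. The binning convention used here (left-closed bins, with a confidence of exactly 1.0 placed in the last bin) is an assumption; the text only states that 10 equal-width bins are used, and the sample values are made up for illustration.

```python
def expected_calibration_error(conf, correct, n_bins=10):
    """ECE over equal-width confidence bins on [0, 1]."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, a in zip(conf, correct):
        # Assumed convention: left-closed bins; c == 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, a))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(a for _, a in items) / len(items)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(items) / n * abs(acc - avg_conf)
    return ece


def brier_score(conf, correct):
    """Mean squared difference between confidence and binary correctness."""
    return sum((c - a) ** 2 for c, a in zip(conf, correct)) / len(conf)


def alignment_error(conf, quality):
    """Mean absolute gap between confidence C_i and rationale support Q_i."""
    return sum(abs(c - q) for c, q in zip(conf, quality)) / len(conf)


# Made-up example values for four responses.
conf = [0.9, 0.9, 0.6, 0.6]
correct = [1, 0, 1, 1]
quality = [0.8, 0.4, 0.6, 0.7]
ece = expected_calibration_error(conf, correct)
brier = brier_score(conf, correct)
e_align = alignment_error(conf, quality)
```

ECE and the Brier score compare confidence against correctness, whereas the alignment error compares confidence against the judge's rationale score, so a model can do well on the first two and still show a large alignment error.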
### 4\.3 Implementation Details
For each dataset, we use 200 samples for training and 500 held\-out samples for evaluation\. For SFT, the 200 training examples are selected from instances that the corresponding base model answers correctly\. For GRPO and CoRA, we randomly sample 200 examples from the training set\. The same evaluation subset is used across all methods for a controlled comparison\. During GRPO\-based training, we sample 8 completions per question with a sampling temperature of 0\.8\.
## 5 Results
Table 2: Results on accuracy and calibration scores across datasets and models\. Accuracy, ECE, and Brier score are reported as percentages\. Higher accuracy is better, while lower ECE and Brier score indicate better calibration\. Best values are bolded and second\-best values are underlined within each model–dataset block\.

### 5\.1 Accuracy and Calibration Scores
We conduct experiments to evaluate the effectiveness of CoRA on MathQA, MedQA, and OpenBookQA across three foundation models\. Table [2](https://arxiv.org/html/2606.14961#S5.T2) reports answer accuracy, Expected Calibration Error \(ECE\), and Brier score\.
We observe that CoRA generally improves calibration while preserving competitive task performance\. In particular, CoRA achieves the best Brier score in most model–dataset settings, including all three datasets for Gemma 2 9B Instruct, MathQA and OpenBookQA for Ministral 3 8B, and MathQA and OpenBookQA for Qwen2\.5 7B\. This suggests that incorporating confidence–rationale alignment into reinforcement learning improves the probabilistic reliability of the committed answers\.
CoRA also maintains relatively strong answer accuracy\. On MathQA, CoRA obtains the best or tied\-best accuracy for Ministral 3 8B and Gemma 2 9B Instruct, while remaining competitive with GRPO on Qwen2\.5 7B\. On OpenBookQA, CoRA achieves the best accuracy for Ministral 3 8B and Qwen2\.5 7B, and the second\-best accuracy for Gemma 2 9B Instruct\. These results indicate that optimizing for confidence–rationale alignment does not require sacrificing task accuracy\.
However, CoRA is not uniformly best in every setting\. On MedQA, GRPO achieves the strongest calibration for Ministral 3 8B and Qwen2\.5 7B, while CoRA performs best for Gemma 2 9B Instruct\. This suggests that the benefit of confidence–rationale alignment depends on the interaction between model family and domain\. Nevertheless, CoRA remains consistently competitive and often improves calibration\-sensitive metrics\.
Table 3: Rationale\-support quality and confidence–rationale alignment results evaluated by Llama 3\.1 70B Instruct\. $Q$ measures the quality of the generated reasoning trace, while $E_{\mathrm{align}}$ denotes the average confidence–rationale alignment error\. ↑ indicates higher is better, and ↓ indicates lower is better\. All values are reported as percentages\.
### 5\.2 Rationale\-Support Quality and Alignment
Table [3](https://arxiv.org/html/2606.14961#S5.T3) reports reasoning quality $Q$ and alignment error $E_{\mathrm{align}}$ evaluated by Llama 3\.1 70B Instruct\. CoRA shows the clearest gains on MathQA, achieving the highest $Q$ and lowest $E_{\mathrm{align}}$ across all three policy models\. This indicates that CoRA improves both the quality of mathematical reasoning traces and the consistency between rationale support and committed\-answer confidence\.
On MedQA, CoRA also performs strongly\. It obtains the lowest $E_{\mathrm{align}}$ across all three models and the highest $Q$ for Qwen2\.5 7B and Gemma 2 9B, while GRPO achieves the highest $Q$ for Ministral 3 8B\. On OpenBookQA, CoRA achieves the best results for Qwen2\.5 7B and Gemma 2 9B, while SFT performs best for Ministral 3 8B\.
Overall, the results show that CoRA generally improves confidence–rationale alignment and reasoning\-support quality, with especially consistent gains on MathQA\.
To reduce the impact of judge-specific bias in LLM-based evaluation, we further evaluate $Q$ and $E_{\mathrm{align}}$ using two additional judge models: GPT-OSS 20B and Qwen 3 32B. These judges follow the same rubric and receive the same auxiliary background knowledge as the main evaluator, under the same no-gold-information constraint. The additional results are reported in Appendix [C](https://arxiv.org/html/2606.14961#A3), in Tables [6](https://arxiv.org/html/2606.14961#A3.T6) and [7](https://arxiv.org/html/2606.14961#A3.T7). Across judges, the overall trends remain broadly consistent, suggesting that the observed improvements are not an artifact of a single evaluator.
Table 4: Prediction results from CoT reasoning linguistic features. $\uparrow$ indicates higher is better. Best values are bolded and second-best values are underlined within each model–dataset block.
### 5.3 Correctness Prediction from CoT Reasoning Linguistic Features
Table [4](https://arxiv.org/html/2606.14961#S5.T4) evaluates whether surface linguistic features in generated reasoning traces can predict answer correctness. Overall, the results show that linguistic cues provide a useful but limited signal. Many AUROC values on MedQA and OpenBookQA remain close to the weak-discrimination range of 0.50–0.60, indicating that surface linguistic features alone are insufficient for reliable correctness prediction in these settings.
The clearest signal appears on MathQA\. Across all models, MathQA obtains substantially higher AUROC than other datasets, and CoRA achieves the best AUROC for every model\. For example, on Qwen2\.5, CoRA improves linguistic\-only AUROC from 0\.592 under GRPO to 0\.741\. This suggests that mathematical reasoning traces contain stronger observable correctness signals, which CoRA further enhances\.
The results on MedQA and OpenBookQA are more mixed\. CoRA achieves the best AUROC for Qwen2\.5 and Gemma 2 on MedQA, and for Ministral 3 and Qwen2\.5 on OpenBookQA, but several scores remain near 0\.55\. This suggests that correctness in medical and science QA may depend more on latent domain knowledge than on shallow linguistic patterns alone\. The main takeaway is that CoRA can make generated reasoning traces more diagnostically informative, especially on MathQA\.
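To make the AUROC values in this section concrete, here is a minimal sketch of the metric as a pairwise ranking statistic. It assumes that some classifier over linguistic features has already produced one score per reasoning trace; the function name and inputs are hypothetical and do not come from the paper's code.

```python
def auroc(scores, labels):
    """Probability that a correct trace is scored above an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under this reading, an AUROC near 0.55 means the feature-based score ranks a correct trace above an incorrect one only slightly more often than chance, whereas 0.741 indicates a clearly usable, though still imperfect, signal.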
## 6 Discussion and Conclusion
The results suggest that confidence–rationale alignment is most useful as a reliability-oriented objective rather than as a direct substitute for correctness-oriented training. CoRA most consistently reduces unsupported overconfidence, as measured by the alignment error $E_{\mathrm{align}}$, while maintaining competitive accuracy. This distinction is important because a model can be accurate or calibrated on average and still produce individual responses whose confidence is not supported by the rationale it presents. Confidence–rationale alignment therefore provides an instance-level reliability signal that complements conventional accuracy and calibration metrics.
The main trade\-off is that CoRA improves rationale\-centered alignment more consistently than it improves all conventional metrics\. Correctness\-only GRPO rewards the final answer, whereas CoRA also penalizes cases in which confidence exceeds rationale support\. As a result, CoRA may prefer responses that are better supported or less overconfident, even when a correctness\-only objective would favor more aggressive predictions\. This behavior is desirable when the goal is to reduce unsupported high\-confidence reasoning, but it also means that confidence–rationale alignment should be viewed as complementary to answer accuracy, not as a guaranteed way to maximize it\.
The domain differences suggest that the value of rationale\-support supervision depends on how reliably rationale quality can be assessed\. MathQA provides relatively explicit computational and logical structure, which may make rationale\-support judgments more stable\. MedQA requires specialized domain knowledge and may introduce greater ambiguity for both the policy model and the judge\. This helps explain why the gains are broadest on MathQA and more variable on MedQA\.
Evaluations under Llama 3\.1 70B Instruct, GPT\-OSS 20B, and Qwen 3 32B further suggest that the observed improvements are not solely an artifact of the training judge\. The fact that CoRA often reduces the alignment error under all judges provides evidence that the method improves a broader rationale\-support signal rather than only matching the preferences of a single judge\. Meanwhile, disagreements between judges indicate that rationale\-support evaluation remains imperfect and should not be treated as a substitute for human assessment\.
Overall, our findings support treating answer correctness, confidence, and rationale support as coupled reliability signals\. A reasoning model should not merely be accurate or calibrated on average; it should assign high confidence only when its rationale adequately supports its answer\.
## Limitations
This work has several limitations, which we view as opportunities for further investigation and methodological advancement. First, our rationale-support score depends on LLM judges, whose judgments may contain biases or inconsistencies despite the use of a structured rubric and evaluation across multiple judge models. Future work should compare judge scores with expert and non-expert human assessments of rationale support. Second, our current experiments use relatively small training subsets, so stronger trends may require larger-scale training and multiple random seeds. Third, although the alignment error $E_{\mathrm{align}}$ captures unsupported overconfidence, it does not fully measure all forms of rationale faithfulness. A rationale may appear supportive according to the rubric while still not reflecting the model’s internal decision process. Finally, our current evaluation focuses on multiple-choice reasoning, and future work should extend confidence–rationale alignment to open-ended and retrieval-intensive reasoning tasks.
## References
- A\. Amini, S\. Gabriel, S\. Lin, R\. Koncel\-Kedziorski, Y\. Choi, and H\. Hajishirzi \(2019\)MathQA: towards interpretable math word problem solving with operation\-based formalisms\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Minneapolis, Minnesota,pp\. 2357–2367\.External Links:[Document](https://dx.doi.org/10.18653/v1/N19-1245),[Link](https://aclanthology.org/N19-1245/)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p5.1),[1st item](https://arxiv.org/html/2606.14961#S4.I1.i1.p1.1)\.
- G\. W\. Brier \(1950\)Verification of forecasts expressed in terms of probability\.Monthly Weather Review78\(1\),pp\. 1–3\.External Links:[Document](https://dx.doi.org/10.1175/1520-0493%281950%29078%3C0001%3AVOFEIT%3E2.0.CO%3B2),[Link](https://journals.ametsoc.org/view/journals/mwre/78/1/1520-0493_1950_078_0001_vofeit_2_0_co_2.xml)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p5.1),[3rd item](https://arxiv.org/html/2606.14961#S4.I2.i3.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-r1: incentivizing reasoning capability in LLMs via reinforcement learning\.External Links:2501\.12948,[Link](https://arxiv.org/abs/2501.12948)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p3.1)\.
- J\. DeYoung, S\. Jain, N\. F\. Rajani, E\. Lehman, C\. Xiong, R\. Socher, and B\. C\. Wallace \(2020\)ERASER: a benchmark to evaluate rationalized NLP models\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,Online,pp\. 4443–4458\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.408),[Link](https://aclanthology.org/2020.acl-main.408/)Cited by:[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p1.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- Gemma Team \(2024\)Gemma 2: improving open language models at a practical size\.External Links:2408\.00118,[Link](https://arxiv.org/abs/2408.00118)Cited by:[§4\.1](https://arxiv.org/html/2606.14961#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Geng, F\. Cai, Y\. Wang, H\. Koeppl, P\. Nakov, and I\. Gurevych \(2024\)A survey of confidence estimation and calibration in large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),Mexico City, Mexico,pp\. 6577–6595\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.366),[Link](https://aclanthology.org/2024.naacl-long.366/)Cited by:[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1)\.
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\)On calibration of modern neural networks\.InProceedings of the 34th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.70,pp\. 1321–1330\.External Links:[Link](https://proceedings.mlr.press/v70/guo17a.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1),[2nd item](https://arxiv.org/html/2606.14961#S4.I2.i2.p1.1)\.
- A\. Jacovi and Y\. Goldberg \(2020\)Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness?\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,Online,pp\. 4198–4205\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.386),[Link](https://aclanthology.org/2020.acl-main.386/)Cited by:[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p1.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\)What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.Applied Sciences11\(14\),pp\. 6421\.External Links:[Document](https://dx.doi.org/10.3390/app11146421),[Link](https://www.mdpi.com/2076-3417/11/14/6421)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p5.1),[2nd item](https://arxiv.org/html/2606.14961#S4.I1.i2.p1.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, S\. Johnston, S\. El\-Showk, A\. Jones, N\. Elhage, T\. Hume, A\. Chen, Y\. Bai, S\. Bowman, S\. Fort, D\. Ganguli, D\. Hernandez, J\. Jacobson, J\. Kernion, S\. Kravec, L\. Lovitt, K\. Ndousse, C\. Olsson, S\. Ringer, D\. Amodei, T\. Brown, J\. Clark, N\. Joseph, B\. Mann, S\. McCandlish, C\. Olah, and J\. Kaplan \(2022\)Language models \(mostly\) know what they know\.arXiv preprint arXiv:2207\.05221\.External Links:[Link](https://arxiv.org/abs/2207.05221)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1)\.
- S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo \(2024\)Prometheus: inducing fine\-grained evaluation capability in language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=8euJaTveKw)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p2.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- S\. S\. Y\. Kim, J\. W\. Vaughan, Q\. V\. Liao, T\. Lombrozo, and O\. Russakovsky \(2025\)Fostering appropriate reliance on large language models: the role of explanations, sources, and inconsistencies\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,Yokohama, Japan\.External Links:[Document](https://dx.doi.org/10.1145/3706598.3714020),[Link](https://doi.org/10.1145/3706598.3714020)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p2.1),[§1](https://arxiv.org/html/2606.14961#S1.p3.1)\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\)Large language models are zero\-shot reasoners\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 22199–22213\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p1.1)\.
- T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion, K\. Lukošiūtė, K\. Nguyen, N\. Cheng, N\. Joseph, N\. Schiefer, O\. Rausch, R\. Larson, S\. McCandlish, S\. Kundu, S\. Kadavath, S\. Yang, T\. Henighan, T\. Maxwell, T\. Telleen\-Lawton, T\. Hume, Z\. Hatfield\-Dodds, J\. Kaplan, J\. Brauner, S\. R\. Bowman, and E\. Perez \(2023\)Measuring faithfulness in chain\-of\-thought reasoning\.arXiv preprint arXiv:2307\.13702\.External Links:[Link](https://arxiv.org/abs/2307.13702)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- T\. Lei, R\. Barzilay, and T\. Jaakkola \(2016\)Rationalizing neural predictions\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,Austin, Texas,pp\. 107–117\.External Links:[Document](https://dx.doi.org/10.18653/v1/D16-1011),[Link](https://aclanthology.org/D16-1011/)Cited by:[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p1.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.External Links:[Link](https://arxiv.org/abs/2305.20050)Cited by:[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p2.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)Teaching models to express their uncertainty in words\.arXiv preprint arXiv:2205\.14334\.External Links:[Link](https://arxiv.org/abs/2205.14334)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\)G\-eval: NLG evaluation using GPT\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 2511–2522\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153),[Link](https://aclanthology.org/2023.emnlp-main.153/)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p2.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- Llama Team, AI @ Meta \(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.External Links:[Link](https://arxiv.org/abs/2407.21783)Cited by:[§4\.1](https://arxiv.org/html/2606.14961#S4.SS1.SSS0.Px2.p3.1)\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,Brussels, Belgium,pp\. 2381–2391\.External Links:[Document](https://dx.doi.org/10.18653/v1/D18-1260),[Link](https://aclanthology.org/D18-1260/)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p5.1),[3rd item](https://arxiv.org/html/2606.14961#S4.I1.i3.p1.1)\.
- Mistral AI \(2025\)Ministral 3 8b reasoning 2512\.Note:Hugging Face Model CardExternal Links:[Link](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512)Cited by:[§4\.1](https://arxiv.org/html/2606.14961#S4.SS1.SSS0.Px2.p1.1)\.
- OpenAI \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.arXiv preprint arXiv:2508\.10925\.External Links:[Link](https://arxiv.org/abs/2508.10925)Cited by:[§4\.1](https://arxiv.org/html/2606.14961#S4.SS1.SSS0.Px2.p3.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 27730–27744\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html)Cited by:[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p3.1)\.
- D\. Paul, R\. West, A\. Bosselut, and B\. Faltings \(2024\)Making reasoning matter: measuring and improving faithfulness of chain\-of\-thought reasoning\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Miami, Florida, USA,pp\. 15012–15032\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.882),[Link](https://aclanthology.org/2024.findings-emnlp.882/)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- Qwen Team \(2024\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.1](https://arxiv.org/html/2606.14961#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Razghandi, S\. M\. H\. Hosseini, and M\. S\. Baghshah \(2025\)CER: confidence enhanced reasoning in LLMs\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 7918–7938\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.390),[Link](https://aclanthology.org/2025.acl-long.390/)Cited by:[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p2.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.External Links:[Link](https://arxiv.org/abs/1707.06347)Cited by:[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p3.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.External Links:[Link](https://arxiv.org/abs/2402.03300)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p3.1)\.
- M\. Shen, S\. Das, K\. Greenewald, P\. Sattigeri, G\. Wornell, and S\. Ghosh \(2024\)Thermometer: towards universal calibration for large language models\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 44687–44711\.External Links:[Link](https://proceedings.mlr.press/v235/shen24c.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1)\.
- M\. Steyvers, H\. Tejeda, A\. Kumar, C\. Belem, S\. Karny, X\. Hu, L\. Mayer, and P\. Smyth \(2025\)What large language models know and what people think they know\.Nature Machine Intelligence7,pp\. 221–231\.External Links:[Document](https://dx.doi.org/10.1038/s42256-024-00976-7),[Link](https://www.nature.com/articles/s42256-024-00976-7)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p2.1),[§1](https://arxiv.org/html/2606.14961#S1.p3.1)\.
- S\. Tao, L\. Yao, H\. Ding, Y\. Xie, Q\. Cao, F\. Sun, J\. Gao, H\. Shen, and B\. Ding \(2024\)When to trust LLMs: aligning confidence with response quality\.InFindings of the Association for Computational Linguistics: ACL 2024,Bangkok, Thailand,pp\. 5984–5996\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.357),[Link](https://aclanthology.org/2024.findings-acl.357/)Cited by:[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p2.1)\.
- K\. Tian, E\. Mitchell, A\. Zhou, A\. Sharma, R\. Rafailov, H\. Yao, C\. Finn, and C\. Manning \(2023\)Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 5433–5442\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.330),[Link](https://aclanthology.org/2023.emnlp-main.330/)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1)\.
- M\. Turpin, J\. Michael, E\. Perez, and S\. R\. Bowman \(2023\)Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/ed3fea9033a80fea1376299fa7863f4a-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
- M\. Tutek, F\. Hashemi Chaleshtori, A\. Marasović, and Y\. Belinkov \(2025\)Measuring chain of thought faithfulness by unlearning reasoning steps\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 9935–9960\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.504),[Link](https://aclanthology.org/2025.emnlp-main.504/)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p2.1)\.
- H\. Vasconcelos, M\. Jörke, M\. Grunde\-McLaughlin, T\. Gerstenberg, M\. S\. Bernstein, and R\. Krishna \(2023\)Explanations can reduce overreliance on AI systems during decision\-making\.Proceedings of the ACM on Human\-Computer Interaction7\(CSCW1\),pp\. 1–38\.External Links:[Document](https://dx.doi.org/10.1145/3579605),[Link](https://doi.org/10.1145/3579605)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p2.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=1PL1NIMMrw)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 24824–24837\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p1.1),[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p1.1)\.
- J\. Xie, A\. S\. Chen, Y\. Lee, E\. Mitchell, and C\. Finn \(2024\)Calibrating language models with adaptive temperature scaling\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 18128–18138\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1007),[Link](https://aclanthology.org/2024.emnlp-main.1007/)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2606.14961#S4.SS1.SSS0.Px2.p3.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p1.1)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. D\. Goodman \(2022\)STaR: bootstrapping reasoning with reasoning\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 15476–15488\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/639a9a172c044fbb64175b5fad42e9a5-Abstract-Conference.html)Cited by:[§2\.1](https://arxiv.org/html/2606.14961#S2.SS1.p1.1)\.
- B\. Zhang and R\. Zhang \(2025\)CoT\-UQ: improving response\-wise uncertainty quantification in LLMs with chain\-of\-thought\.InFindings of the Association for Computational Linguistics: ACL 2025,Vienna, Austria,pp\. 26114–26133\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1339),[Link](https://aclanthology.org/2025.findings-acl.1339/)Cited by:[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p2.1)\.
- T\. Z\. Zhao, E\. Wallace, S\. Feng, D\. Klein, and S\. Singh \(2021\)Calibrate before use: improving few\-shot performance of language models\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 12697–12706\.External Links:[Link](https://proceedings.mlr.press/v139/zhao21c.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14961#S2.SS2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html)Cited by:[§1](https://arxiv.org/html/2606.14961#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.14961#S2.SS3.p2.1),[§3\.1](https://arxiv.org/html/2606.14961#S3.SS1.p2.1)\.
## Appendix A Additional Implementation Details
### A.1 Committed-Answer Confidence Extraction
For each generated completion, we first parse the committed answer $y$ from the final answer line `Final Answer: X`. We then compute confidence over the discrete answer-option labels rather than using the model’s verbalized confidence. Specifically, after the model generates a rationale and a final-answer line, we locate the generated answer label in the token sequence and take the exact generated prefix before that label as the scoring context. Under this prefix, we score each candidate answer label $\ell \in \mathcal{Y}$ as a continuation.
In our implementation, candidate labels are scored with a leading-space template `" X"`, where $X$ is one of the valid option labels. Thus, for a four-choice dataset, the candidate continuations are `" A"`, `" B"`, `" C"`, and `" D"`. For MedQA and MathQA, we additionally include `" E"`. Let $s_{\ell}$ denote the conditional log-probability of candidate continuation $\ell$ under the exact generated prefix. We convert these scores into an option-level probability distribution:

$$p(\ell \mid x, r) = \frac{\exp(s_{\ell})}{\sum_{\ell' \in \mathcal{Y}} \exp(s_{\ell'})}.$$

The committed-answer confidence is then defined as

$$C = p(y \mid x, r),$$

where $y$ is the parsed answer selected by the model. Since the candidate continuations are fixed answer labels rather than full option texts, they are short and comparable across options; we therefore use the summed continuation log-probability without additional length normalization.
No temperature scaling is applied when computing these option-level probabilities. Although generation may use sampling parameters during response generation, the confidence score is computed post hoc from the model’s raw conditional likelihoods under the generated prefix. The resulting distribution is stored as `choice_probs`, and the probability assigned to the parsed answer is stored as `p_parsed`. This `p_parsed` value is used as $C \in [0,1]$ for ECE, Brier score, the confidence–rationale alignment reward, and the alignment error metric. If a completion does not contain a parseable final answer, it is counted as incorrect and assigned zero rationale-support score. This procedure measures the model’s probability for the committed answer label under the generated reasoning context, rather than the probability of the full option text or any self-reported confidence statement.
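The extraction procedure above can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the per-label log-probabilities $s_{\ell}$ have already been obtained from the policy model under the generated prefix, and the helper names, the regular expression, and the choice to use the last matching `Final Answer:` line are all assumptions.

```python
import math
import re

# Assumed pattern for the final-answer line; labels A-E cover all three datasets.
FINAL_ANSWER_RE = re.compile(r"Final Answer:\s*([A-E])\b")


def parse_committed_answer(completion, valid_labels):
    """Return the last valid answer label in the completion, or None if unparseable."""
    for label in reversed(FINAL_ANSWER_RE.findall(completion)):
        if label in valid_labels:
            return label
    return None


def option_distribution(label_logprobs):
    """Softmax over candidate-label log-probabilities, with no temperature scaling."""
    m = max(label_logprobs.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in label_logprobs.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def committed_confidence(completion, label_logprobs):
    """Return (y, choice_probs, p_parsed); all None when no answer can be parsed."""
    y = parse_committed_answer(completion, set(label_logprobs))
    if y is None:
        return None, None, None
    choice_probs = option_distribution(label_logprobs)
    return y, choice_probs, choice_probs[y]
```

Renormalizing over the option labels only means that probability mass the model assigns to non-label tokens is ignored, so $C$ is a confidence relative to the available choices rather than over the full vocabulary.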
### A.2 Reproducibility Details
For reward computation during training, we use GPT-OSS 20B as the sole rationale-support judge. The judge input excludes gold option labels, official solutions, official explanations, and gold rationales; the auxiliary background knowledge is used only to assess factual grounding. The general support score $Q_{\mathrm{gen}}$ is computed from the rubric axes in Table [1](https://arxiv.org/html/2606.14961#S3.T1). The format-validity axis $G_{1}$ is used as a hard gate. For the remaining general axes, we use weights of 0.20, 0.30, 0.20, 0.25, and 0.05 for task understanding, evidence grounding, inference coherence, answer bridge, and structure, respectively. The task-specific score $Q_{\mathrm{task}}$ is computed by averaging the two task-specific axes. We set $w_{\mathrm{gen}} = 0.75$ and $w_{\mathrm{task}} = 0.25$, apply the penalty $P$ for severe structural failure or unsupported reasoning, and clip the final score to $[0,1]$. The training process uses 3 NVIDIA H100 GPUs.
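One way to read the score aggregation described above is sketched below. Several details are assumptions not stated in this appendix: that each axis score is already normalized to $[0,1]$, that a failed format gate yields a score of zero, and that the penalty $P$ is subtracted before clipping. The dictionary keys and function name are hypothetical.

```python
# Weights for the general rubric axes, as listed in the text.
GENERAL_WEIGHTS = {
    "task_understanding": 0.20,
    "evidence_grounding": 0.30,
    "inference_coherence": 0.20,
    "answer_bridge": 0.25,
    "structure": 0.05,
}
W_GEN, W_TASK = 0.75, 0.25


def rationale_support(format_valid, general, task_axes, penalty=0.0):
    """Combine rubric axis scores (assumed in [0, 1]) into one support score Q."""
    if not format_valid:
        return 0.0  # assumed behavior of the hard gate G1
    q_gen = sum(w * general[name] for name, w in GENERAL_WEIGHTS.items())
    q_task = sum(task_axes) / len(task_axes)  # mean of the task-specific axes
    q = W_GEN * q_gen + W_TASK * q_task - penalty  # subtractive penalty is assumed
    return min(1.0, max(0.0, q))
```

With these weights, evidence grounding and the answer bridge together account for more than half of $Q_{\mathrm{gen}}$, so a fluent but poorly grounded rationale cannot reach a high support score on structure alone.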
For the confidence–rationale alignment score in Equation [3](https://arxiv.org/html/2606.14961#S3.E3), we set $\alpha = 0.6$, $\beta_{\mathrm{over}} = 0.6$, and $\beta_{\mathrm{under}} = 0.4$, assigning a larger penalty to confidence that exceeds rationale support. For the final reward in Equation [4](https://arxiv.org/html/2606.14961#S3.E4), we use $\lambda = 0.2$, $\gamma = 0.2$, and $\beta = 0.1$. We train with LoRA using rank 16, LoRA alpha 32, and dropout 0.05. Each model is trained for one epoch with learning rate $10^{-5}$, batch size 8, group size 8 for GRPO sampling, and KL coefficient 0.04.
### A.3 Prompt Templates
#### Generation Prompt\.
The policy model is prompted with the following template:
> ```
> You are a careful reasoner.
> Answer the following {dataset_display} multiple-choice question.
> Give a short chain-of-thought in several steps.
> Then output Final Answer as a single letter from {choice_list}.
> Stop immediately after the Final Answer line.
>
> {fact_block}Question: {question}
> Options:
> {options_text}
>
> Use exactly this format:
> Chain-of-Thought:
> 1) ...
> 2) ...
> Final Answer: {choice_template}
> ```
#### Judge System Prompt\.
The rubric judge is prompted with the following system instruction:
> ```
> Reasoning: low
>
> You are an evaluator for multiple-choice question reasoning.
> You are given:
> (a) a dataset name
> (b) a question and answer options
> (c) optional reference/evidence
> (d) a CHAIN-OF-THOUGHT answer written by another model
>
> Score the CHAIN along the 8 rubric axes.
> Use reliable domain knowledge when needed:
> - OpenBookQA: science/common-sense knowledge
> - MedQA: medical knowledge
> - MathQA: mathematical reasoning/computation
>
> You are NOT given the correct answer. Do not use a gold label.
> Judge only from the question, options, optional evidence, and chain.
> For each axis, output the value only. No evidence quotes.
> Output STRICT JSON only.
> ```
#### Judge Input Prompt\.
For each generated reasoning trace, the judge receives:
> ```
> DATASET: {dataset_name}
> QUESTION:
> {question}
> OPTIONS:
> A: {option_A}
> B: {option_B}
> ...
> PROVIDED REFERENCE/EVIDENCE:
> {fact1}
> CHAIN-OF-THOUGHT:
> {chain}
> CHOSEN ANSWER: {parsed_answer}
> Evaluate this chain along the 8 rubric axes.
> Output STRICT JSON only.
> ```
The reference\-evidence block is omitted when no reference evidence is used\.
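The judge input, including the conditional omission of the reference-evidence block, could be assembled as in the sketch below. The function name, argument names, and exact line breaks are hypothetical. The function deliberately takes no gold-answer argument, mirroring the statement that the judge never sees the gold label.

```python
def build_judge_input(dataset_name, question, options, chain,
                      parsed_answer, evidence=None):
    """Assemble the judge input; `options` maps letters to option text.

    The evidence block is included only when `evidence` is non-empty.
    """
    lines = [f"DATASET: {dataset_name}", "QUESTION:", question, "OPTIONS:"]
    lines += [f"{letter}: {text}" for letter, text in sorted(options.items())]
    if evidence:  # omitted entirely when no reference evidence is used
        lines += ["PROVIDED REFERENCE/EVIDENCE:", evidence]
    lines += [
        "CHAIN-OF-THOUGHT:",
        chain,
        f"CHOSEN ANSWER: {parsed_answer}",
        "Evaluate this chain along the 8 rubric axes.",
        "Output STRICT JSON only.",
    ]
    return "\n".join(lines)
```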
## Appendix B Linguistic Features for Correctness Prediction
Table 5: Linguistic features used for correctness prediction. Features are extracted from the generated CoT reasoning trace.
## Appendix C Additional Judge Evaluation
Table 6: Rationale-support quality and confidence–rationale alignment results evaluated by GPT-OSS 20B. $Q$ measures the quality of the generated reasoning trace. $E_{\mathrm{align}}$ denotes the average confidence–rationale alignment error. $\uparrow$ indicates higher is better, and $\downarrow$ indicates lower is better. All values are reported as percentages.
Table 7: Rationale-support quality and confidence–rationale alignment results evaluated by Qwen 3 32B. $Q$ measures the quality of the generated reasoning trace, while $E_{\mathrm{align}}$ denotes the average confidence–rationale alignment error. $\uparrow$ indicates higher is better, and $\downarrow$ indicates lower is better. All values are reported as percentages.