Confidence Calibration in Large Language Models

arXiv cs.AI 05/26/26, 04:00 AM Papers
Summary
This paper analyzes the confidence calibration of 11 popular LLMs, finding that they are generally overconfident, especially on hard tasks, and underconfident on easy tasks. It introduces LifeEval, a test for evaluating calibration across difficulty levels.
arXiv:2605.23909v1 Announce Type: new Abstract: We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.
# Confidence Calibration in Large Language Models
Source: [https://arxiv.org/html/2605.23909](https://arxiv.org/html/2605.23909)
###### Abstract

We investigate the calibration of large language models’ (LLMs’) confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.

## 1 Introduction

Large Language Models have seen widespread adoption due to their ability to provide useful information through natural language (Bick et al., [2024](https://arxiv.org/html/2605.23909#bib.bib59)). However, LLMs’ usefulness as guides, teachers, and advisers depends on their provision of truthful and accurate information (Afroogh et al., [2024](https://arxiv.org/html/2605.23909#bib.bib58)). Hallucination, in which an LLM confidently reports falsehoods, fundamentally undermines their value (Kalai et al., [2025](https://arxiv.org/html/2605.23909#bib.bib46)). That is why a proviso warns ChatGPT users, “ChatGPT can make mistakes. Check important info” (OpenAI, [2025a](https://arxiv.org/html/2605.23909#bib.bib57)). Other LLMs come with similar warnings.

Ideally, an LLM ought to provide only truthful information. This is, of course, unrealistic for at least two reasons. First, it ignores the complexity of irreducible uncertainties. Few things can be known with certainty, and even perfect Bayesian rationality provides only probabilistic credences. Second, it neglects the limits in the LLM’s information (Tripathi et al., [2025](https://arxiv.org/html/2605.23909#bib.bib7)). LLMs generally lack access to verifiable ground truth, and must rely on the imperfect information available to them.

Accepting these constraints, a more realistic possibility is well\-calibrated confidence\. That is, the LLM should be able to faithfully report the probability that it is correct, conditional on its own limitations and vulnerability to error\. This would allow users to rely on an LLM’s stated confidence\. To do so, users must trust that confidence indicates accuracy\. This trust is essential to enable autonomous systems to know when they are too uncertain to take action, among many other uses\. Without well\-calibrated confidence, users might trust faulty outputs or doubt accurate outputs\. Hallucination and miscalibration are therefore epistemic risks that cut to the very heart of the usefulness of AI\.

This motivates our tests of the confidence calibration of commercially available LLMs in a variety of contexts. This work presents an analysis of 11 popular open- and closed-source LLMs on a variety of reasoning tasks. We find that: 1) LLMs are, on average, overconfident. 2) Models are more overconfident on hard tasks and underconfident on the easiest tasks. 3) Reasoning models provide more nuanced confidence estimates. Moreover, we add to the current literature by proposing a new test for measuring model calibration on Bayesian-inference tasks: LifeEval, as seen in Figure [1](https://arxiv.org/html/2605.23909#S1.F1). This framework allows for:

- A continuous measure of task difficulty grounded in empirical probabilities.
- Monotonic scaling of task difficulty.
- Evaluation of model performance based on quantitative elements of the problem at hand rather than qualitative ones.

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/main_fig.png)

Figure 1: LifeEval, from left to right: The user provides the LLM with sex, minimum age and radius. The LLM responds with its best guess and its confidence that the actual age at death falls within that range. We score the model’s response based on its point estimate and the user’s conditions. Finally, we compare the true probability of the model’s response and the model’s stated confidence.

## 2 Related Work

Human judgment is vulnerable to many biases, of which overconfidence may be the most consequential (Kahneman, [2011](https://arxiv.org/html/2605.23909#bib.bib65)). Well-calibrated confidence is foundational to effective decision making, since committing to a course of action requires sufficient confidence in its consequences. Yet the calibration of human confidence judgments is notoriously poor. People are overconfident and confidence judgments exhibit a “hard-easy” effect: overconfidence increases with difficulty, while underconfidence emerges on easier tasks (Lichtenstein and Fischhoff, [1977](https://arxiv.org/html/2605.23909#bib.bib13)).

The most parsimonious explanation for the hard-easy effect is that it is a regression-to-the-mean artifact, a byproduct of the noisy relationship between confidence and accuracy (Boundy-Singer et al., [2023](https://arxiv.org/html/2605.23909#bib.bib8); Krueger and Mueller, [2002](https://arxiv.org/html/2605.23909#bib.bib9)). Changes in difficulty have a more direct influence on accuracy than on confidence (Erev et al., [1994](https://arxiv.org/html/2605.23909#bib.bib32)). As task difficulty increases, performance drops, but if confidence is imperfectly responsive to this drop in accuracy, overconfidence must grow. Conversely, as a task becomes easier and performance increases, noisy confidence judgments produce underconfidence.
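The mechanism described above can be illustrated with a small sketch. It assumes, purely for demonstration, that confidence moves only half as far from the mean as accuracy does; the `responsiveness` value and the accuracy grid are hypothetical and are not estimates from this study.

```python
# Illustrative sketch: the responsiveness value (0.5) and the accuracy grid are
# assumptions chosen for demonstration, not quantities estimated in the paper.

def stated_confidence(accuracy, mean_accuracy, responsiveness):
    """Confidence that tracks accuracy only partially (0 <= responsiveness <= 1)."""
    return mean_accuracy + responsiveness * (accuracy - mean_accuracy)


def overconfidence_by_difficulty(accuracies, responsiveness=0.5):
    """Return (accuracy, overconfidence) pairs, where overconfidence = confidence - accuracy."""
    mean_accuracy = sum(accuracies) / len(accuracies)
    return [
        (acc, stated_confidence(acc, mean_accuracy, responsiveness) - acc)
        for acc in accuracies
    ]


accuracies = [0.1, 0.3, 0.5, 0.7, 0.9]  # hardest task first, easiest task last
results = overconfidence_by_difficulty(accuracies)
for acc, over in results:
    print(f"accuracy={acc:.1f}  overconfidence={over:+.2f}")
```

Under this assumption the hardest items show positive overconfidence and the easiest items show negative overconfidence (underconfidence), even though nothing in the sketch models motivation or bias.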

Other explanations for confidence biases emphasize motivational factors (Brown, [2012](https://arxiv.org/html/2605.23909#bib.bib52); Kruger and Dunning, [1999](https://arxiv.org/html/2605.23909#bib.bib12)). We might hope that artificially intelligent agents would be less biased by motivational factors and would therefore exhibit better-calibrated confidence. On the other hand, if LLMs’ confidence is, as with people, a noisy signal of accuracy, then we might expect to see similar confidence biases. Evidence suggests deep neural networks are routinely more certain than they are accurate (Oelrich et al., [2020](https://arxiv.org/html/2605.23909#bib.bib36); Abdar et al., [2021](https://arxiv.org/html/2605.23909#bib.bib39)), and are often poorly calibrated (Guo et al., [2017](https://arxiv.org/html/2605.23909#bib.bib116); Xu et al., [2025](https://arxiv.org/html/2605.23909#bib.bib120)). Nevertheless, recent work has suggested that large language models might overcome these weaknesses through their increasing sophistication (Kadavath et al., [2022](https://arxiv.org/html/2605.23909#bib.bib37); Xiao et al., [2025](https://arxiv.org/html/2605.23909#bib.bib35); Leng et al., [2025](https://arxiv.org/html/2605.23909#bib.bib34); Chhikara, [2025](https://arxiv.org/html/2605.23909#bib.bib119); Li et al., [2025](https://arxiv.org/html/2605.23909#bib.bib115)).

If models’ overconfidence grows with difficulty, investigating model calibration requires variation in task difficulty. Prior methods have sought to assess difficulty of tasks through one of three approaches: (1) intuitive human assessment of that difficulty, (2) LLM as a judge (Hwang et al., [2025](https://arxiv.org/html/2605.23909#bib.bib15); Gobara et al., [2024](https://arxiv.org/html/2605.23909#bib.bib30)), or (3) scaling the context provided (Sung et al., [2025](https://arxiv.org/html/2605.23909#bib.bib16)). Unfortunately, these approaches rely on the subjective judgment of the annotator, the model, or the question author, respectively. Tasks that are difficult for humans can be quite easy for LLMs (Luong et al., [2025](https://arxiv.org/html/2605.23909#bib.bib29)), while models may struggle with mundane human tasks (Philip and Hemang, [2024](https://arxiv.org/html/2605.23909#bib.bib17)). Additionally, like humans, models are subject to their own biases, which may influence their rating of task difficulty (Tabib and Deedar, [2025](https://arxiv.org/html/2605.23909#bib.bib31)). Scaling the amount of context provided can mitigate some of these issues; however, it is not clear how each piece of context may impact overall difficulty. Because of this, simply adding or removing context may not reflect the true intellectual difficulty. Furthermore, in almost all cases, evaluations rely on a coarse measure of difficulty rather than a continuous one. In contrast, a continuous measure of difficulty allows for a more precise analysis of model calibration and a greater understanding of how difficulty relates to overall calibration.

We contribute to this literature by first systematically studying confidence calibration in eleven large language models across five different tests. Some of these tests are more aligned with models’ abilities than others, affording a post-hoc analysis of the hard-easy effect. To isolate the effect of difficulty from other task features, we develop a new task, LifeEval, that affords a bias-free manipulation of difficulty while holding constant other task characteristics. LifeEval asks for probabilistic confidence judgments that we then compare to empirical probabilities. This method incorporates the benefits of moderating difficulty while sidestepping the aforementioned constraints of previous approaches.

Table 1: The six question sets.

## 3 Method

We tested 11 large language models on six English-language question sets (see Table [1](https://arxiv.org/html/2605.23909#S2.T1)). Five of these models are marketed as reasoning models: DeepSeek-R1 (DeepSeek, [2025](https://arxiv.org/html/2605.23909#bib.bib22)), Gemini 2.5 Pro (Google, [2025b](https://arxiv.org/html/2605.23909#bib.bib24)), GPT-o3 (OpenAI, [2025b](https://arxiv.org/html/2605.23909#bib.bib28)), Claude Sonnet 4 (Anthropic, [2025b](https://arxiv.org/html/2605.23909#bib.bib20)), and Claude Sonnet 3.7 (Anthropic, [2025a](https://arxiv.org/html/2605.23909#bib.bib19)).¹ We compared these models to six "chat" models: DeepSeek-V3 (DeepSeek, [2024](https://arxiv.org/html/2605.23909#bib.bib21)), Gemini 2.5 Flash (Google, [2025a](https://arxiv.org/html/2605.23909#bib.bib23)), GPT-4o (OpenAI, [2024](https://arxiv.org/html/2605.23909#bib.bib27)), and Claude Haiku 3 (Anthropic, [2024](https://arxiv.org/html/2605.23909#bib.bib18)), as well as two locally run, instruction-tuned versions of Llama 3.1 (8B and 70B) (Meta, [2024a](https://arxiv.org/html/2605.23909#bib.bib26), [b](https://arxiv.org/html/2605.23909#bib.bib25)).

¹ In the interest of increasing the credibility of our results we [preregistered our research plans](https://osf.io/92hjz/overview?view_only=fe33a7ba8c204f09993067123f1736f6). This preregistration precommitted us to conducting and reporting a set of planned analyses. Appendix [A](https://arxiv.org/html/2605.23909#A1) explains deviations from our preregistered plans.

Each model/question-set pairing yields confidence distributions, accuracy averages, and calibration metrics. We counted a response as correct if the answer option assigned the highest probability matched the ground truth (with proportional scoring for ties). We compared confidence to observed accuracy to compute calibration statistics, most centrally Expected Calibration Error (ECE) (Naeini et al., [2015](https://arxiv.org/html/2605.23909#bib.bib41)) and overconfidence. We evaluated all models under identical conditions so that observed differences in calibration or overconfidence could be attributed to the model.

For each question set, we employed one\-shot prompting that instructed the model to return its output in JSON format\. Except for HaluEval, we also incorporated a chain\-of\-thought prompting strategy to encourage more faithful, step\-by\-step reasoning\. We repeated the system prompt within the input to reinforce adherence to the formatting rules\. In the case of multiple choice questions \(MCQ\), we prompted models to select an answer and state the likelihood that each option is correct\. This allows us to not only observe the confidence assigned to the response but also the distribution of confidence in other answer options\.

| Model | Type | Score (%) | ECE | Conf. (%) | % Rnd | Hard-Easy | N |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude-Sonnet-3.7 | Reasoning | 54.5 | 0.040 | 53.1 | 90.1 | 0.180 | 808 |
| Claude-Sonnet-4 | Reasoning | 54.0 | 0.063 | 49.8 | 98.8 | 0.327 | 808 |
| DeepSeek-R1 | Reasoning | 54.4 | 0.031 | 57.2 | 29.0 | 0.053 | 808 |
| Gemini-2.5-Pro | Reasoning | 53.8 | 0.025 | 53.4 | 18.0 | 0.092 | 808 |
| GPT-o3 | Reasoning | 54.2 | 0.029 | 54.1 | 69.8 | 0.189 | 761 |
| Reasoning models | | 54.2 | 0.037 | 53.5 | 61.1 | 0.168 | 751 |
| Claude Haiku 3 | Chat | 53.0 | 0.267 | 79.8 | 100 | 0.996 | 808 |
| DeepSeek-V3 | Chat | 53.3 | 0.124 | 63.7 | 100 | 0.782 | 808 |
| Gemini-2.5-Flash | Chat | 53.8 | 0.098 | 63.6 | 48.9 | 0.192 | 808 |
| GPT-4o | Chat | 54.5 | 0.085 | 59.8 | 100 | 0.604 | 808 |
| Llama-3.1-70B | Chat | 53.5 | 0.185 | 72.0 | 99.5 | 0.874 | 807 |
| Llama-3.1-8B | Chat | 48.4 | 0.142 | 59.9 | 100 | 0.941 | 800 |
| Chat models | | 52.8 | 0.150 | 66.5 | 91.4 | 0.732 | 751 |

Table 2: Performance metrics across various models on LifeEval, split by model type. We report mean Score, Expected Calibration Error (ECE), mean Confidence, percentage of rounded outputs (% Rnd), Hard-Easy (the regression coefficient between difficulty and overconfidence), and number of completions ($N$). LifeEval has a mean Maximum Achievable Score (MAS) of 56.80%. We ran a regression comparing overconfidence and question difficulty $(1 - \text{MAS}_{question})$. A higher regression coefficient implies an increased Hard-Easy effect. Aggregate rows average within column, except for $N$, which is the size of the subset of questions answered by all models (Reasoning & Chat). Score (%) is the mean score for each model on LifeEval. Our formula for question-level scoring can be seen in Eq. [3](https://arxiv.org/html/2605.23909#S4.E3).
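The Hard-Easy column can be read as the slope from regressing per-question overconfidence on per-question difficulty. The sketch below assumes a simple one-predictor least-squares fit with an intercept, which the text does not specify; the helper `ols_slope` and all data values are hypothetical and chosen only to show the computation.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical per-question values, NOT data from the paper.
mas = [0.95, 0.80, 0.50, 0.20, 0.05]          # Maximum Achievable Score per question
stated_conf = [0.85, 0.75, 0.60, 0.45, 0.35]  # model's stated confidence
actual_prob = [0.90, 0.75, 0.45, 0.15, 0.04]  # true probability of the model's interval

difficulty = [1 - m for m in mas]
overconfidence = [c - p for c, p in zip(stated_conf, actual_prob)]

hard_easy = ols_slope(difficulty, overconfidence)
print(f"Hard-Easy coefficient: {hard_easy:.3f}")
```

A positive slope means overconfidence rises as questions get harder; a slope near zero would mean stated confidence tracks difficulty about as well as actual probability does.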
## 4 Question Sets

We selected a variety of different question types intended to capture a spectrum of conditions under which calibration can succeed or fail. By examining calibration across these types of questions, we seek a comprehensive understanding of model calibration.

Some questions, like true/false items, entail a two-alternative forced choice (so-called 2AFC formats). Peak scoring focuses on the favored option and the agent’s confidence that it is correct. It is standard practice to assign responses to bins that subdivide the range of confidence (Moore et al., [2015](https://arxiv.org/html/2605.23909#bib.bib43); Keren, [1988](https://arxiv.org/html/2605.23909#bib.bib60)). This affords the calculation of overconfidence and of ECE over bins. Table [1](https://arxiv.org/html/2605.23909#S2.T1) contains a brief description of each question set used in our analysis.

### 4.1 BoolQ and SciQ

To measure model calibration in general knowledge, we used 1,000 multiple choice questions (MCQ) from the SciQ dataset (Welbl et al., [2017](https://arxiv.org/html/2605.23909#bib.bib56)) as well as 3,270 True/False questions from the BoolQ dataset (Clark et al., [2019](https://arxiv.org/html/2605.23909#bib.bib110)). We scored models against ground truth for each question.

### 4.2 LSAT-AR

To evaluate calibration in logical reasoning, we used 230 questions from the LSAT Analytical Reasoning section (Zhong et al., [2021](https://arxiv.org/html/2605.23909#bib.bib71)). Each question contained five multiple choice answer options. These tasks required multi-step reasoning, rule application, and inference, making them well-suited for testing whether models’ confidence appropriately degrades as logical complexity increases.

### 4.3 SAT-EN

For contextual understanding, we evaluated models on 1,000 passage-based inference questions drawn from the SAT English section (Zhong et al., [2023](https://arxiv.org/html/2605.23909#bib.bib112)). Each passage was accompanied by multiple choice comprehension questions requiring information extraction, inference, and reasoning about nuanced textual details. Our measure of calibration compares model confidence with actual accuracy across varying levels of passage complexity. This allowed us to test whether models maintain appropriate confidence when the answer depends on subtle contextual cues.

### 4.4 HaluEval

To assess confidence where LLMs are prone to hallucinating, we drew on the HaluEval question set (Li et al., [2023](https://arxiv.org/html/2605.23909#bib.bib113)): 2,000 question-answer pairs, 1,000 truthful answers and 1,000 hallucinated answers. Here, we prompted models not to produce an answer, but instead to state their confidence in the given answer’s correctness. Because half of these answers were deliberately hallucinated, this setting provided a direct test of whether models could recognize and signal their own fallibility. Calibration compared stated confidence with correctness.

### 4.5 LifeEval

We developed a new question set, which we call LifeEval, to manipulate difficulty while holding constant the nature of the question. LifeEval asks models to predict the lifespan of a person given their age and sex. Models were then asked to report the probability that their estimate would fall within one of several radii (1, 5, 10, or 20 years) of the true lifespan. See Figure [1](https://arxiv.org/html/2605.23909#S1.F1). We assessed actual probability using U.S. Social Security Administration Period Life Tables (Social Security Administration, Office of the Chief Actuary, [2025](https://arxiv.org/html/2605.23909#bib.bib114)). Manipulating radius, age, and sex enabled us to vary task difficulty holding all else constant. Unlike existing benchmarks, LifeEval provides a gradient of difficulty that the model can actively detect. For instance, if the model is told that a male has already lived 80 years, it can be confident in its guess landing within a 20-year radius of the truth. However, if it only knows the sex and must get within 1 year, it should be clear to the model that the actual probability is low.

We use the probability of success given the optimal answer as a measure of task difficulty for our analysis, given that this represents a performance ceiling. If there exists an answer to a question that can theoretically capture 100% of the mass of the conditional distribution, we can consider that question easier than one where the ceiling is only 20%. We can therefore understand the difficulty of a question as 1 minus its Maximum Achievable Score (MAS), as seen in Figure [5](https://arxiv.org/html/2605.23909#A4.F5).

This approach affords us another lens through which to understand model calibration: how does a model react to changing task difficulty? While there exist several methods for quantifying task difficulty, they all come with their own drawbacks, as discussed in Section [2](https://arxiv.org/html/2605.23909#S2). For LifeEval we computed accuracy differently from the other question sets, because we knew the true probabilities against which we could compare the LLMs’ responses. For a given question, let $a$ be the minimum age (e.g. 25), $s$ be the sex, and $r$ be the radius around the model’s guess. Suppose a model guessed $\hat{y}(a,s)$ and has confidence $c(a,s,r)$ that $\hat{y}(a,s)$ is within a radius $r$ of the correct outcome. We justify why our approach fits into the framework of the other question sets as follows: Imagine for a moment we had a large set of people $Q$ where person $i \in Q$ died at age $y_i$. Focusing on the subset $Q_{as} = \{i : y_i \geq a, \text{sex}(i) = s\}$ of people of sex $s$ who lived until at least the age of $a$, we can think of this as a binary question asking whether the true $y_i$ falls in the interval $[\hat{r}^{-}, \hat{r}^{+}]$, with

$$\text{acc}_{LifeEval}(Q_{as}) = \frac{1}{|Q_{as}|} \sum_{i \in Q_{as}} \mathbb{I}\bigl\{ y_i \in [\hat{r}^{-}, \hat{r}^{+}] \bigr\} \tag{1}$$

where $\mathbb{I}$ is the indicator function, $\hat{r}^{-} = \hat{y}(a,s) - r$, and $\hat{r}^{+} = \hat{y}(a,s) + r$. Imagining $|Q_{as}| \to \infty$, we have $\text{acc}_{LifeEval}(Q_{as}) \rightarrow p(\hat{y}(a,s), r \mid a, s)$, where $p(k, r \mid a, s)$ is defined as

$$\mathbb{P}\bigl( y \in [k - r, k + r] \bigm| y \geq a, s \bigr) \tag{2}$$

By taking advantage of the actuarial life tables from the Social Security Administration (Social Security Administration, Office of the Chief Actuary, [2025](https://arxiv.org/html/2605.23909#bib.bib114)), we compute $p(k, r \mid a, s)$ as:

$$p(k, r \mid a, s) = \sum_{i=k-r}^{k+r} S_i(a,s) \cdot q_i(s) \tag{3}$$

where $q_i(s)$ is the probability that a person of sex $s$ who reaches age $i$ dies at age $i$, and $S_i(a,s)$ is the conditional probability that someone lives at least to age $i$ given sex and minimum age. While $q_i(s)$ is provided by the life tables, we can compute

$$S_i(a,s) = \mathbb{P}(\text{live to age } i \mid a, s) = \prod_{j=a}^{i-1} \bigl( 1 - q_j(s) \bigr). \tag{4}$$
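The computation in Eqs. 3 and 4, together with the Maximum Achievable Score, can be sketched as follows. The mortality curve here is a hypothetical Gompertz-style stand-in, not the SSA period life table, and the function names, the `MAX_AGE` cutoff, the restriction to integer guesses, and the treatment of ages below the minimum age as having zero probability are all assumptions of this sketch.

```python
import math

MAX_AGE = 119  # assumed final age in the hypothetical table


def toy_death_probs():
    """Hypothetical Gompertz-style q_i values; NOT the SSA period life table."""
    q = [min(1.0, 0.0001 * math.exp(0.085 * age)) for age in range(MAX_AGE + 1)]
    q[MAX_AGE] = 1.0  # everyone still alive at the final age dies in that year
    return q


def death_age_distribution(q, min_age):
    """P(death at age i | alive at min_age) = S_i * q_i, with S_i built as in Eq. 4."""
    dist = {}
    survival = 1.0  # S_a = 1 (empty product)
    for age in range(min_age, len(q)):
        dist[age] = survival * q[age]
        survival *= 1.0 - q[age]
    return dist


def interval_probability(q, guess, radius, min_age):
    """Eq. 3: probability mass of death ages in [guess - radius, guess + radius]."""
    dist = death_age_distribution(q, min_age)
    # Ages outside the table (or below min_age) contribute zero mass.
    return sum(dist.get(age, 0.0) for age in range(guess - radius, guess + radius + 1))


def max_achievable_score(q, radius, min_age):
    """Best interval probability over integer guesses; difficulty is 1 - MAS."""
    return max(
        interval_probability(q, guess, radius, min_age)
        for guess in range(min_age, len(q))
    )


q = toy_death_probs()
for radius in (1, 5, 10, 20):
    mas = max_achievable_score(q, radius, min_age=25)
    print(f"radius={radius:2d}  MAS={mas:.3f}  difficulty={1 - mas:.3f}")
```

With a real life table, `q` would be replaced by the published per-age death probabilities for the relevant sex; the rest of the sketch would be unchanged.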

## 5 Confidence Scoring

At its core, confidence calibration quantifies the alignment between subjective probability and objective accuracy. When a model assigns a subjective probability of 80%, good calibration demands it is correct 80% of the time (Dawid, [1982](https://arxiv.org/html/2605.23909#bib.bib49)).

### 5.1 Stated Confidence

For all question sets, we prompted models to provide a numerical confidence score from 0 to 1.0 representing the probability they are correct. In the case of multiple choice questions, models assigned probabilities to each of the answer options. In the instances where the provided probabilities did not sum to 1, we obtained normalized probabilities $P_i$ as follows:

$$P_i = \frac{s_i}{\sum_{j \in S} s_j}, \tag{5}$$

where $S$ is the set of options and each $s_i$ represents a stated confidence for a given option.
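A minimal sketch of the normalization in Eq. 5, paired with the scoring rule from the Method section (the highest-probability option counts as the answer, with proportional scoring for ties). Reading "proportional scoring" as one divided by the number of tied top options is an assumption of this sketch, and the function names and example values are hypothetical.

```python
import math


def normalize_confidences(stated):
    """Eq. 5: divide each stated confidence s_i by the sum over all options."""
    total = sum(stated.values())
    if total <= 0:
        raise ValueError("stated confidences must have a positive sum")
    return {option: value / total for option, value in stated.items()}


def score_response(normalized, correct_option):
    """Score 1 if the top option is correct, split evenly among tied top options.

    Assumption: 'proportional scoring for ties' is read here as
    1 / (number of tied top options) when the correct option is among them.
    """
    top = max(normalized.values())
    tied = [opt for opt, p in normalized.items() if math.isclose(p, top)]
    return 1.0 / len(tied) if correct_option in tied else 0.0


stated = {"A": 0.6, "B": 0.6, "C": 0.3}  # hypothetical output that sums to 1.5
normalized = normalize_confidences(stated)
print(normalized)
print(score_response(normalized, "A"))
```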

## 6 Metrics

For any multiple choice question set $Q$, we let $y_i, \hat{y}_i \in \{1, \ldots, K\}$ denote the correct option and the model’s chosen option, respectively, for question $i \in Q$ (where $K$ denotes the number of choices). We let $C_i(k) \in [0,1]$ denote the model’s stated confidence that option $k$ is correct, where $\sum_{k=1}^{K} C_i(k) = 1$. We define accuracy as

$$\text{acc}(Q) = \frac{1}{|Q|} \sum_{i \in Q} \mathbb{I}\{\hat{y}_i = y_i\}, \tag{6}$$

where $\mathbb{I}$ is the indicator function, and confidence as

$$\text{conf}(Q) = \frac{1}{|Q|} \sum_{i \in Q} C_i(\hat{y}_i). \tag{7}$$

### 6.1 Scoring for HaluEval

For HaluEval, we determined whether a provided answer was correct ahead of time. Therefore, we scored each question based on the provided label such that

$$\text{acc}_{HaluEval}(Q) = \frac{1}{|Q|} \sum_{i \in Q} y_i, \tag{8}$$

where $y_i \in \{0, 1\}$ depending on the response we inject.

### 6.2 Expected Calibration Error (ECE)

Expected Calibration Error (ECE) quantifies the misalignment between a model’s predicted confidence and its empirical accuracy (Pavlovic, [2025](https://arxiv.org/html/2605.23909#bib.bib40); Naeini et al., [2015](https://arxiv.org/html/2605.23909#bib.bib41)). We first partition a question set $Q$ into $M$ disjoint bins by confidence:

$$Q_m = \Bigl\{ i \in Q : \frac{m-1}{M} < C_i(\hat{y}_i) \leq \frac{m}{M} \Bigr\}$$

for $m = 1, \ldots, M$. We then compute

$$\text{ECE}(Q) = \frac{1}{|Q|} \sum_{m=1}^{M} n_m \, \bigl| \text{acc}(Q_m) - \text{conf}(Q_m) \bigr| \tag{9}$$

where $n_m$ is the number of questions in $Q_m$. Probabilities were grouped into ten equally spaced intervals from [0, 1), with an additional bin dedicated to the value 1.0. This eleventh bin identifies those distinctive instances in which a model reports absolute certainty by assigning a probability of exactly 1.
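A minimal sketch of the ECE computation in Eq. 9. It follows the binning described in prose (ten left-closed intervals on [0, 1) plus a separate bin for exactly 1.0), which differs at bin boundaries from the inequality written in the bin definition; that choice, the function name, and the example data are assumptions of this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean absolute gap between accuracy and confidence per bin (Eq. 9).

    Bins are [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0), plus one extra bin that
    holds only confidences equal to 1.0.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one response is required")

    bins = [[] for _ in range(n_bins + 1)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins)  # conf == 1.0 lands in the last bin
        bins[index].append((conf, hit))

    weighted_gap = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        mean_acc = sum(hit for _, hit in members) / len(members)
        weighted_gap += len(members) * abs(mean_acc - mean_conf)
    return weighted_gap / len(confidences)


# Hypothetical responses: stated confidence in the chosen answer, and correctness.
confidences = [0.9, 0.9, 1.0, 0.55, 0.55]
correct = [1, 0, 1, 1, 0]
print(expected_calibration_error(confidences, correct))
```

Because the gaps enter as absolute values, the same ECE can arise from overconfidence or underconfidence, which is why a signed measure is reported separately.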

### 6.3 Overconfidence

Since ECE does not reveal whether miscalibration is due to over- or underconfidence, we needed a separate measure of overconfidence. We borrowed from previous works in psychology (Klayman et al., [1999](https://arxiv.org/html/2605.23909#bib.bib87)) to define overconfidence over an entire question set $Q$ as

$$\text{overconfidence}(Q) = \text{conf}(Q) - \text{acc}(Q). \tag{10}$$

## 7 Results

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/2x3CalPlots.png)

Figure 2: Aggregate calibration plots for each question set, showing accuracy conditional on confidence. Reasoning models in red and Non-Reasoning in black. Observations are averaged within confidence bins, [0,0.1), [0.1,0.2) … [0.9,1), [1].

![Refer to caption](https://arxiv.org/html/2605.23909v1/x1.png)

Figure 3: Overconfidence by question set and model.

Across all question sets, models verbally report 88% confidence on average in their favored answer option being correct. They are, in fact, correct for 79% of questions. The calibration plot shown in Figure [2](https://arxiv.org/html/2605.23909#S7.F2) reveals that there is a strong positive relationship between confidence and accuracy. The diagonal identity line reflects perfect calibration. Observations to the southeast of the identity line, where confidence exceeds accuracy, indicate overconfidence.

We find that models’ tendency toward overconfidence varies by question set. Figure [3](https://arxiv.org/html/2605.23909#S7.F3) shows that models tended to be overconfident on some tasks like logical reasoning and hallucination detection (LSAT-AR and HaluEval, respectively). Models struggle to think through complex tasks and fail to adequately detect when they have gone astray. While accuracy was fixed for HaluEval at 52.12% and LifeEval had an upper bound of 56.8%, we found that LSAT-AR proved to be the most difficult of all the unbounded question sets with an average accuracy of 58.6%. By contrast, while models excelled at SciQ and SAT-EN, they remained consistently underconfident. The sole outlier was Llama-3.1-8B, likely due to its significantly lower parameter count.

LifeEval’s four levels of difficulty afforded insight into how task difficulty affected confidence calibration. For the lowest radius (most difficult) tasks, models reported average confidence of 34.2% but actual probability was only 9.6%. Yet, like humans, models displayed a tendency towards underconfidence when the task got easier (i.e. the radius increased). Models reported 80.5% confidence but actual probability was 92.0% for their 20-year radius responses.
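As a worked illustration, applying the overconfidence definition (stated confidence minus accuracy, with actual probability standing in for accuracy on LifeEval) to the two figures just reported gives the gap at each extreme of difficulty:

$$\text{overconfidence}_{r=1} = 0.342 - 0.096 = 0.246, \qquad \text{overconfidence}_{r=20} = 0.805 - 0.920 = -0.115.$$

The sign flips between the hardest and easiest radius: roughly 25 percentage points of overconfidence at the 1-year radius and roughly 12 percentage points of underconfidence at the 20-year radius.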

Comparing overconfidence by radius in Figure [4](https://arxiv.org/html/2605.23909#S7.F4) reveals that as task difficulty increases (i.e., radius decreases), overconfidence increases. This suggests that models’ reported confidence was insufficiently sensitive to variation in task difficulty.

![Refer to caption](https://arxiv.org/html/2605.23909v1/x2.png)

Figure 4: Overconfidence as a function of model and radius; difficulty decreases with larger accuracy radius.

When analyzing the stated confidence values for LifeEval, we found a disparity: larger reasoning models like DeepSeek-R1 tended to provide more nuanced estimates, while their smaller siblings provided less nuanced reports, as seen in Table [2](https://arxiv.org/html/2605.23909#S3.T2). We found that models tended to resemble human confidence reporting by rounding to the nearest 5%. Stated confidence was a multiple of 5% for 91.4% of reports by non-reasoning chat models, with some reporting a multiple of 5% for all of their responses. By contrast, only 61.1% of stated confidence was a multiple of 5% for reasoning models. This highlights a key flaw in uncertainty estimation via model reporting. By design, models imitate human behavior; this extends to confidence reporting where, like humans (Wallsten et al., [1993](https://arxiv.org/html/2605.23909#bib.bib6)), models tend to avoid precision (Xiong et al., [2024](https://arxiv.org/html/2605.23909#bib.bib1)).
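The rounding statistic above can be computed with a short check for multiples of 5%. The sketch assumes confidences are stored on a 0 to 1 scale and uses a small tolerance to absorb floating-point error; the function name and the sample values are hypothetical.

```python
def share_multiple_of_five(confidences, tolerance=1e-6):
    """Fraction of confidences (0-1 scale) that are whole multiples of 0.05."""
    if not confidences:
        raise ValueError("at least one confidence value is required")
    rounded = 0
    for conf in confidences:
        steps = conf / 0.05  # number of 5% steps; an integer for rounded reports
        if abs(steps - round(steps)) < tolerance:
            rounded += 1
    return rounded / len(confidences)


# Hypothetical stated confidences, NOT data from the paper.
sample = [0.80, 0.85, 0.87, 0.50]
print(f"{100 * share_multiple_of_five(sample):.1f}% of reports are multiples of 5%")
```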

We observe stronger correlations between stated confidence and actual probability of being correct for reasoning models\. This correlation is, on average, 0\.94 for the five reasoning models \(GPT\-o3, DeepSeek\-R1, Gemini 2\.5\-Pro, Claude Sonnet\-3\.7 and Sonnet\-4\) and only 0\.48 for the other models\.

## 8Discussion

LLMs can report their confidence in the accuracy of their judgments, but that confidence deviates predictably from actual accuracy\. We find 9% overconfidence \(the difference between models’ 88% stated confidence and their 79% accuracy\)\. Nevertheless, confidence reported by the models in our study is less responsive than is accuracy to variations in difficulty\. Consequently, we observe the “hard\-easy” effect documented in human confidence judgments: Overconfidence rises with difficulty\(Erevet al\.,[1994](https://arxiv.org/html/2605.23909#bib.bib32)\)\. This is particularly evident in LifeEval, the test we devised to provide exogenous variation in difficulty holding other task characteristics constant\.

Documenting the presence of hard\-easy effects in AI confidence is novel\. It is important because it highlights a parallel between AI and human judgment that sheds light on its origins, and points the way toward potential debiasing strategies\. The hard\-easy effect arises because confidence is a noisy signal of accuracy\. Confidence is less responsive to difficulty than is accuracy\. So the overall overconfidence we observe is partly a result of test items sufficiently difficult to produce overconfidence\. However, there is more to it than that\. Fallible agents will sometimes be wrong because they do not know everything\(Moore,[2023](https://arxiv.org/html/2605.23909#bib.bib42)\)\. For AI systems, out\-of\-sample judgments may include these unknown unknowns—that is, relevant knowledge the agent lacks but is not aware of\.

It is possible to imagine a model refining its confidence in light of feedback\. Commercial LLMs are refined through reinforcement learning with human feedback \(RLHF\)\. However, RLHF might actually increase the overconfidence models report\(Tianet al\.,[2023](https://arxiv.org/html/2605.23909#bib.bib54)\)\. If human users prefer models that express confident assurance, RLHF may train the LLM to report greater confidence\.

LLMs’ competence and confidence have led many users to rely on them heavily and unquestioningly \(Hou et al\., [2025](https://arxiv.org/html/2605.23909#bib.bib50)\)\. This trust might be misplaced if models express greater confidence than their accuracy justifies\. When models face tasks on which they perform less well, they maintain the same high level of confidence\. As we see in LifeEval, models fail to sufficiently reduce their confidence as performance declines with task difficulty\. For LLMs to deserve users’ trust, they must be able to reliably report their limitations\. Many users are aware of LLMs’ impressive capabilities but are wary of adoption because of the unpredictable nature of hallucination\. If users do not know when they need to seek additional resources, they are forced either to watch over the model constantly or to remove it from their workflow entirely\.

Future research should examine how language models perform on Bayesian inference tasks\. As LifeEval demonstrates, models consistently struggle to appropriately reduce their confidence as tasks become more difficult\. Investigating this limitation across different domains may help illuminate the underlying causes and potential remedies\.

In humans, one of the most useful general\-purpose debiasing strategies is getting people to reflect on why they might be wrong \(Lord et al\., [1984](https://arxiv.org/html/2605.23909#bib.bib2)\)\. More specifically, inviting people to consider what information they lack helps them moderate their tendency toward overconfidence \(Walters et al\., [2017](https://arxiv.org/html/2605.23909#bib.bib14)\)\. The better performance of the reasoning models we examine offers a striking parallel\. It is possible that prompts or training regimens that encourage models to engage in more reflection and self\-criticism could further improve calibration and reduce overconfidence\.

Psychological research distinguishes three forms of overconfidence in humans: overplacement is the exaggerated belief that you are better than others; overestimation is thinking you are better than you are; overprecision is the excessive certainty that you know the truth \(Moore and Healy, [2008](https://arxiv.org/html/2605.23909#bib.bib61)\)\. We employed single\-item confidence measures that ask, “How sure are you that this answer is correct?” These sorts of item\-confidence measures perfectly confound overestimation and overprecision, since being too certain of your answer is the same as overestimating your chance of being correct\. However, it is possible to unconfound these two using higher\-order measures\. One approach elicits an estimate of the respondent’s score on some test, and their certainty about that estimate\. This affords the possibility of being excessively certain of an underestimate, such as the student who is convinced she failed an exam when in reality she passed\. Future research should distinguish between these different forms of overconfidence in LLMs\.

## Limitations

LifeEval is idiosyncratic, in the sense that the task has unique features that might not generalize to all other tasks\. This is a necessary limitation of any particular set of questions, especially a set like LifeEval for which all the questions have a similar content and format\. This similarity facilitates comparison across questions within the set but limits generalizability beyond it\. Another quirk of LifeEval is that models may have had access to the SSA tables\. \(Footnote 2: The answers for other question sets, such as SciQ, are even more likely to have been present in the models’ training data\.\) This information should have increased both accuracy and confidence\. The value of our analysis of over\- and underconfidence remains undiminished\. To assess the extent of this potential exposure, we conduct a contamination analysis in Appendix [G](https://arxiv.org/html/2605.23909#A7), where we attempt to evaluate how much access models may have had to the LifeEval data or the SSA tables during training\.

## Ethical Considerations

One of the authors is a Visiting Faculty Researcher at Google, which created some of the LLMs analyzed in this work; however, this manuscript’s work was conducted as part of their employment at a university, not at Google\.

While the misuse of generative AI has become a growing issue in recent years, we do not see a way in which our work exacerbates this issue\. LifeEval only utilizes two demographic identifiers for a person: sex and a minimum age level\. Although it would be ill\-advised, if an inference provider chose to incorporate LifeEval or a similarly structured question set into their training data there would be a possibility of model bias arising along these two axes\. We encourage future researchers to exercise caution and transparency regarding the inclusion of such demographic markers in training pipelines\.

## Acknowledgments

We thank the Center for Advanced Research Computing \(CARC\) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication\. We are also deeply grateful to Lambda\.ai for generously providing access to their computing resources in the spirit of open research\. In addition, we acknowledge the support of USC’s JumpStart program and UC Berkeley’s Undergraduate Research Apprentice Program \(URAP\), which provided valuable support for Noam’s research with Jacob and Don\. We would also like to thank Kelly Hu, Aleksandre Natchkebia, and Josh Moore for their time spent assisting us in completing this project\.

## References

- M\. Abdar, F\. Pourpanah, S\. Hussain, D\. Rezazadegan, L\. Liu, M\. Ghavamzadeh, P\. Fieguth, X\. Cao, A\. Khosravi, U\. R\. Acharya, V\. Makarenkov, and S\. Nahavandi \(2021\) A review of uncertainty quantification in deep learning: techniques, applications and challenges\. Information Fusion 76, pp\. 243–297\. Note: arXiv: 2011\.06225\. External Links: ISSN 15662535, [Document](https://dx.doi.org/10.1016/j.inffus.2021.05.008)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- S\. Afroogh, A\. Akbari, E\. Malone, M\. Kargar, and H\. Alambeigi \(2024\) Trust in ai: progress, challenges, and future directions\. Humanities and Social Sciences Communications 11\(1\), pp\. 1568\. External Links: ISSN 2662\-9992, [Document](https://dx.doi.org/10.1057/s41599-024-04044-8), [Link](https://doi.org/10.1057/s41599-024-04044-8)\. Cited by: [§1](https://arxiv.org/html/2605.23909#S1.p1.1)\.
- Anthropic \(2024\) Claude haiku 3\. External Links: [Link](https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- Anthropic \(2025a\) Claude sonnet 3\.7\. External Links: [Link](https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- Anthropic \(2025b\) Claude sonnet 4\. External Links: [Link](https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- A\. Bick, A\. Blandin, and D\. J\. Deming \(2024\) The rapid adoption of generative ai\. Working Paper 32966, Working Paper Series, National Bureau of Economic Research\. External Links: [Document](https://dx.doi.org/10.3386/w32966), [Link](http://www.nber.org/papers/w32966)\. Cited by: [§1](https://arxiv.org/html/2605.23909#S1.p1.1)\.
- Z\. M\. Boundy\-Singer, C\. M\. Ziemba, and R\. L\. T\. Goris \(2023\) Confidence reflects a noisy decision reliability estimate\. Nature Human Behaviour 7\(1\), pp\. 142–154\. External Links: ISSN 2397\-3374, [Document](https://dx.doi.org/10.1038/s41562-022-01464-x)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p2.1)\.
- J\. D\. Brown \(2012\) Understanding the better than average effect: motives \(still\) matter\. Personality and Social Psychology Bulletin 38\(2\), pp\. 209–219\. External Links: ISSN 1552\-7433, [Document](https://dx.doi.org/10.1177/0146167211432763)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- P\. Chhikara \(2025\) Mind the confidence gap: overconfidence, calibration, and distractor effects in large language models\. External Links: 2502\.11028, [Link](https://arxiv.org/abs/2502.11028)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019\) BoolQ: exploring the surprising difficulty of natural yes/no questions\. External Links: 1905\.10044, [Link](https://arxiv.org/abs/1905.10044)\. Cited by: [§4\.1](https://arxiv.org/html/2605.23909#S4.SS1.p1.1)\.
- A\. P\. Dawid \(1982\) The well\-calibrated bayesian\. Journal of the American Statistical Association 77\(379\), pp\. 605–610\. External Links: [Document](https://dx.doi.org/10.1080/01621459.1982.10477856), [Link](https://www.tandfonline.com/doi/abs/10.1080/01621459.1982.10477856)\. Cited by: [§5](https://arxiv.org/html/2605.23909#S5.p1.1)\.
- DeepSeek \(2024\) DeepSeek v3\. External Links: [Link](https://huggingface.co/deepseek-ai/DeepSeek-V3)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- DeepSeek \(2025\) DeepSeek r1\. External Links: [Link](https://huggingface.co/deepseek-ai/DeepSeek-R1)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- I\. Erev, T\. S\. Wallsten, and D\. V\. Budescu \(1994\) Simultaneous over\- and underconfidence: the role of error in judgment processes\. Psychological Review 101\(3\), pp\. 519–527\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p2.1), [§8](https://arxiv.org/html/2605.23909#S8.p1.1)\.
- S\. Gobara, H\. Kamigaito, and T\. Watanabe \(2024\) Do llms implicitly determine the suitable text difficulty for users?\. External Links: 2402\.14453, [Link](https://arxiv.org/abs/2402.14453)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p4.1)\.
- Google \(2025a\) Gemini 2\.5 flash\. External Links: [Link](https://modelcards.withgoogle.com/assets/documents/gemini-2.5-flash.pdf)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- Google \(2025b\) Gemini 2\.5 pro\. External Links: [Link](https://modelcards.withgoogle.com/assets/documents/gemini-2.5-pro.pdf)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. External Links: 1706\.04599, [Link](https://arxiv.org/abs/1706.04599)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- I\. Hou, H\. V\. Nguyen, O\. Man, and S\. MacNeil \(2025\) The evolving usage of genai by computing students\. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V\. 2, SIGCSE TS 2025, pp\. 1481–1482\. External Links: [Link](http://dx.doi.org/10.1145/3641555.3705266), [Document](https://dx.doi.org/10.1145/3641555.3705266)\. Cited by: [§8](https://arxiv.org/html/2605.23909#S8.p4.1)\.
- S\. Hwang, H\. Kim, and G\. G\. Lee \(2025\) Can llms estimate cognitive complexity of reading comprehension items?\. External Links: 2510\.25064, [Link](https://arxiv.org/abs/2510.25064)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p4.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, S\. Johnston, S\. El\-Showk, A\. Jones, N\. Elhage, T\. Hume, A\. Chen, Y\. Bai, S\. Bowman, S\. Fort, D\. Ganguli, D\. Hernandez, J\. Jacobson, J\. Kernion, S\. Kravec, L\. Lovitt, K\. Ndousse, C\. Olsson, S\. Ringer, D\. Amodei, T\. Brown, J\. Clark, N\. Joseph, B\. Mann, S\. McCandlish, C\. Olah, and J\. Kaplan \(2022\) Language models \(mostly\) know what they know\. arXiv:2207\.05221 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2207.05221), [Document](https://dx.doi.org/10.48550/arXiv.2207.05221)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- D\. Kahneman \(2011\) Thinking fast and slow\. Farrar, Straus and Giroux, New York\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p1.1)\.
- A\. T\. Kalai, O\. Nachum, S\. S\. Vempala, and E\. Zhang \(2025\) Why language models hallucinate\. External Links: 2509\.04664, [Link](https://arxiv.org/abs/2509.04664)\. Cited by: [§1](https://arxiv.org/html/2605.23909#S1.p1.1)\.
- G\. Keren \(1988\) On the ability of monitoring non\-veridical perceptions and uncertain knowledge: some calibration studies\. Acta Psychologica 67\(2\), pp\. 95–119\. Cited by: [§4](https://arxiv.org/html/2605.23909#S4.p2.1)\.
- J\. Klayman, J\. B\. Soll, C\. González\-Vallejo, and S\. Barlas \(1999\) Overconfidence: it depends on how, what, and whom you ask\. Organizational Behavior and Human Decision Processes 79\(3\), pp\. 216–247\. External Links: ISSN 0749\-5978, [Document](https://doi.org/10.1006/obhd.1999.2847), [Link](https://www.sciencedirect.com/science/article/pii/S0749597899928479)\. Cited by: [§6\.3](https://arxiv.org/html/2605.23909#S6.SS3.p1.1)\.
- J\. I\. Krueger and R\. A\. Mueller \(2002\) Unskilled, unaware, or both? the better\-than\-average heuristic and statistical regression predict errors in estimates of own performance\. Journal of Personality and Social Psychology 82\(2\), pp\. 180–188\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p2.1)\.
- J\. Kruger and D\. Dunning \(1999\) Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self\-assessments\. Journal of Personality and Social Psychology 77\(6\), pp\. 1121–1134\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- J\. Leng, C\. Huang, B\. Zhu, and J\. Huang \(2025\) Taming overconfidence in llms: reward calibration in rlhf\. arXiv:2410\.09724 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2410.09724), [Document](https://dx.doi.org/10.48550/arXiv.2410.09724)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- J\. Li, X\. Cheng, W\. X\. Zhao, J\. Nie, and J\. Wen \(2023\) HaluEval: a large\-scale hallucination evaluation benchmark for large language models\. External Links: [Link](https://arxiv.org/abs/2305.11747)\. Cited by: [§4\.4](https://arxiv.org/html/2605.23909#S4.SS4.p1.1)\.
- Y\. Li, M\. Xiong, J\. Wu, and B\. Hooi \(2025\) ConfTuner: training large language models to express their confidence verbally\. External Links: 2508\.18847, [Link](https://arxiv.org/abs/2508.18847)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- S\. Lichtenstein and B\. Fischhoff \(1977\) Do those who know more also know more about how much they know?\. Organizational Behavior and Human Decision Processes 20\(2\), pp\. 159–183\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p1.1)\.
- C\. G\. Lord, M\. R\. Lepper, and E\. Preston \(1984\) Considering the opposite: a corrective strategy for social judgment\. Journal of Personality and Social Psychology 47\(6\), pp\. 1231–1243\. Cited by: [§8](https://arxiv.org/html/2605.23909#S8.p6.1)\.
- T\. Luong, D\. Hwang, H\. H\. Nguyen, G\. Ghiasi, Y\. Chervonyi, I\. Seo, J\. Kim, G\. Bingham, J\. Lee, S\. Mishra, A\. Zhai, C\. H\. Hu, H\. Michalewski, J\. Kim, J\. Ahn, J\. Bae, X\. Song, T\. H\. Trinh, Q\. V\. Le, and J\. Jung \(2025\) Towards robust mathematical reasoning\. External Links: 2511\.01846, [Link](https://arxiv.org/abs/2511.01846)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p4.1)\.
- Meta \(2024a\) Llama 3\.1 70b instruct\. External Links: [Link](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- Meta \(2024b\) Llama 3\.1 8b instruct\. External Links: [Link](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- D\. A\. Moore and P\. J\. Healy \(2008\) The trouble with overconfidence\. Psychological Review 115\(2\), pp\. 502–517\. Cited by: [§8](https://arxiv.org/html/2605.23909#S8.p7.1)\.
- D\. A\. Moore, E\. R\. Tenney, and U\. Haran \(2015\) Overprecision in judgment\. In Handbook of Judgment and Decision Making, G\. Wu and G\. Keren \(Eds\.\), pp\. 182–212\. Cited by: [§4](https://arxiv.org/html/2605.23909#S4.p2.1)\.
- D\. A\. Moore \(2023\) Overprecision is a property of thinking systems\. Psychological Review 130\(5\), pp\. 1339–1350\. Cited by: [§8](https://arxiv.org/html/2605.23909#S8.p2.1)\.
- M\. P\. Naeini, G\. Cooper, and M\. Hauskrecht \(2015\) Obtaining well calibrated probabilities using bayesian binning\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 29\. External Links: ISBN 2374\-3468\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p2.1), [§6\.2](https://arxiv.org/html/2605.23909#S6.SS2.p1.2)\.
- O\. Oelrich, S\. Ding, M\. Magnusson, A\. Vehtari, and M\. Villani \(2020\) When are bayesian model probabilities overconfident?\. arXiv:2003\.04026\. External Links: [Link](http://arxiv.org/abs/2003.04026)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- OpenAI \(2024\) GPT\-4o\. External Links: [Link](https://openai.com/index/gpt-4o-system-card/)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- OpenAI \(2025a\) ChatGPT\. External Links: [Link](https://chatgpt.com/)\. Cited by: [§1](https://arxiv.org/html/2605.23909#S1.p1.1)\.
- OpenAI \(2025b\) GPT\-o3\. External Links: [Link](https://openai.com/index/o3-o4-mini-system-card/)\. Cited by: [§3](https://arxiv.org/html/2605.23909#S3.p1.1)\.
- M\. Pavlovic \(2025\) Understanding model calibration–a gentle introduction and visual exploration of calibration and the expected calibration error \(ece\)\. arXiv preprint arXiv:2501\.19047\. Cited by: [§6\.2](https://arxiv.org/html/2605.23909#S6.SS2.p1.2)\.
- Philip and Hemang \(2024\) SimpleBench: the text benchmark in which unspecialized human performance exceeds that of current frontier models\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p4.1)\.
- Social Security Administration, Office of the Chief Actuary \(2025\) Period life table, 2022 \(used in the 2025 trustees report\)\. Note: Web page\. Presented by the Office of the Chief Actuary; accessed via SSA website\. External Links: [Link](https://www.ssa.gov/oact/STATS/table4c6.html)\. Cited by: [§4\.5](https://arxiv.org/html/2605.23909#S4.SS5.p1.1), [§4\.5](https://arxiv.org/html/2605.23909#S4.SS5.p4.7)\.
- Y\. Y\. Sung, E\. Fleisig, Y\. Hou, I\. Upadhyay, and J\. L\. Boyd\-Graber \(2025\) GRACE: a granular benchmark for evaluating model calibration against human calibration\. External Links: 2502\.19684, [Link](https://arxiv.org/abs/2502.19684)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p4.1)\.
- H\. M\. S\. Tabib and J\. A\. Deedar \(2025\) Toward trustworthy difficulty assessments: large language models as judges in programming and synthetic tasks\. External Links: 2511\.18597, [Link](https://arxiv.org/abs/2511.18597)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p4.1)\.
- K\. Tian, E\. Mitchell, A\. Zhou, A\. Sharma, R\. Rafailov, H\. Yao, C\. Finn, and C\. D\. Manning \(2023\) Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback\. arXiv preprint arXiv:2305\.14975\. External Links: 2305\.14975, [Link](http://arxiv.org/abs/2305.14975)\. Cited by: [§8](https://arxiv.org/html/2605.23909#S8.p3.1)\.
- S\. Tripathi, M\. T\. Nafis, I\. Hussain, and J\. Gao \(2025\) The confidence paradox: can llm know when it’s wrong\. arXiv:2506\.23464 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2506.23464), [Document](https://dx.doi.org/10.48550/arXiv.2506.23464)\. Cited by: [§1](https://arxiv.org/html/2605.23909#S1.p2.1)\.
- T\. S\. Wallsten, D\. V\. Budescu, and R\. Zwick \(1993\) Comparing the calibration and coherence of numerical and verbal probability judgments\. Management Science 39\(2\), pp\. 176–190\. Cited by: [§7](https://arxiv.org/html/2605.23909#S7.p5.1)\.
- D\. J\. Walters, P\. M\. Fernbach, C\. R\. Fox, and S\. A\. Sloman \(2017\) Known unknowns: a critical determinant of confidence and calibration\. Management Science 63\(12\), pp\. 4298–4307\. External Links: ISSN 0025\-1909, [Document](https://dx.doi.org/10.1287/mnsc.2016.2580)\. Cited by: [§8](https://arxiv.org/html/2605.23909#S8.p6.1)\.
- J\. Welbl, N\. F\. Liu, and M\. Gardner \(2017\) Crowdsourcing multiple choice science questions\. arXiv:1707\.06209 \[cs\]\. External Links: [Link](http://arxiv.org/abs/1707.06209), [Document](https://dx.doi.org/10.48550/arXiv.1707.06209)\. Cited by: [§4\.1](https://arxiv.org/html/2605.23909#S4.SS1.p1.1)\.
- J\. Xiao, B\. Hou, Z\. Wang, R\. Jin, Q\. Long, W\. J\. Su, and L\. Shen \(2025\) Restoring calibration for aligned large language models: a calibration\-aware fine\-tuning approach\. arXiv:2505\.01997 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2505.01997), [Document](https://dx.doi.org/10.48550/arXiv.2505.01997)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- M\. Xiong, Z\. Hu, X\. Lu, Y\. Li, J\. Fu, J\. He, and B\. Hooi \(2024\) Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms\. External Links: 2306\.13063, [Link](https://arxiv.org/abs/2306.13063)\. Cited by: [§7](https://arxiv.org/html/2605.23909#S7.p5.1)\.
- C\. Xu, B\. Wen, B\. Han, R\. Wolfe, L\. L\. Wang, and B\. Howe \(2025\) Do language models mirror human confidence? exploring psychological insights to address overconfidence in llms\. External Links: 2506\.00582, [Link](https://arxiv.org/abs/2506.00582)\. Cited by: [§2](https://arxiv.org/html/2605.23909#S2.p3.1)\.
- W\. Zhong, R\. Cui, Y\. Guo, Y\. Liang, S\. Lu, Y\. Wang, A\. Saied, W\. Chen, and N\. Duan \(2023\) AGIEval: a human\-centric benchmark for evaluating foundation models\. External Links: 2304\.06364\. Cited by: [§4\.3](https://arxiv.org/html/2605.23909#S4.SS3.p1.1)\.
- W\. Zhong, S\. Wang, D\. Tang, Z\. Xu, D\. Guo, J\. Wang, J\. Yin, M\. Zhou, and N\. Duan \(2021\) AR\-lsat: investigating analytical reasoning of text\. External Links: 2104\.06598\. Cited by: [§4\.2](https://arxiv.org/html/2605.23909#S4.SS2.p1.1)\.

## Appendix A Deviations from Pre\-Registration

- DeepSeek log probabilities\. DeepSeek did not provide usable token\-level log probabilities, so logprob\-based analyses for this model were omitted\.
- LifeEval scoring rule\. For LifeEval, we scored answers using the conditional true \(actuarial\) probability of age at death falling within a radius $R$ around the point estimate \(rather than a binary “within true range” indicator\)\. This choice reduces threshold sensitivity and aligns the metric with probabilistic calibration\.
- GPT\-o3 inference settings\. For GPT\-o3, we were constrained to a temperature of 1\.0 and increased the total token budget to accommodate longer responses on complex question sets like LSAT\-AR\. We were also not able to obtain logprobs from model responses\.
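The LifeEval scoring rule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the whole-year age grid, and the toy `death_counts` table are hypothetical and do not come from the SSA table or from the study's code.

```python
def within_radius_probability(death_counts, min_age, guess, radius):
    """P(|age at death - guess| <= radius, given age at death >= min_age).

    death_counts[x] is the number of deaths at age x in a life table.
    """
    survivors = sum(d for age, d in enumerate(death_counts) if age >= min_age)
    if survivors == 0:
        raise ValueError("no deaths at or after min_age in this table")
    lo = max(min_age, guess - radius)  # deaths before min_age are impossible
    hi = guess + radius
    hits = sum(d for age, d in enumerate(death_counts) if lo <= age <= hi)
    return hits / survivors


def best_guess(death_counts, min_age, radius):
    """Age estimate that maximizes the score, i.e. attains the maximum achievable score."""
    ages = range(min_age, len(death_counts))
    return max(
        ages,
        key=lambda g: within_radius_probability(death_counts, min_age, g, radius),
    )


# Hypothetical toy table over ages 0..6 (10 deaths in total), not SSA data.
toy_counts = [0, 0, 1, 2, 4, 2, 1]
score = within_radius_probability(toy_counts, min_age=0, guess=4, radius=1)
optimal_age = best_guess(toy_counts, min_age=0, radius=1)
```

Because the score is a probability rather than a 0/1 indicator, a guess that is slightly off the optimum loses only a little credit, which is one reading of the "reduces threshold sensitivity" remark.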

## Appendix B Data Cleaning and Exclusion Criteria

Any response that we could not parse, whether due to improper formatting or incomplete output, was omitted from our analysis\. A few responses to multiple\-choice questions provided all zeroes for their stated confidence scores\. As we did not define a procedure for handling such cases in our pre\-registration, we chose to drop them from our analysis\. In some cases \(n = 54\), models attempted to hedge their responses by saying "maybe" or "I’m not sure"\. Although we find these cases promising from the standpoint of human computer interaction, we did not assign a scoring rubric for such responses and felt it improper to do so post\-hoc in order to measure calibration from these questions\. Because of this, we chose to omit these cases from our analysis\. To keep questions balanced across models, we further restricted evaluation to the subset of questions that every model answered successfully\.
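The exclusion steps described above could be implemented along the following lines. This is a sketch only: the record layout (`model`, `question_id`, `parsed`, `hedged`) and the function name are assumptions made for illustration, not the study's data format.

```python
def apply_exclusions(responses):
    """Drop unusable responses, then keep only questions every model answered.

    Each response is a dict with hypothetical keys:
      model, question_id, parsed (dict of option -> confidence, or None),
      hedged (True if the model answered "maybe" / "I'm not sure").
    """
    kept = []
    for r in responses:
        parsed = r["parsed"]
        if parsed is None:                       # unparseable or incomplete
            continue
        if r.get("hedged", False):               # no pre-registered rubric
            continue
        if parsed and all(v == 0 for v in parsed.values()):
            continue                             # all-zero stated confidences
        kept.append(r)

    all_models = {r["model"] for r in responses}
    answered_by = {}
    for r in kept:
        answered_by.setdefault(r["question_id"], set()).add(r["model"])
    common = {q for q, models in answered_by.items() if models == all_models}
    return [r for r in kept if r["question_id"] in common]


# Hypothetical toy records for two models and three questions.
toy = [
    {"model": "m1", "question_id": 1, "parsed": {"A": 0.9, "B": 0.1}},
    {"model": "m2", "question_id": 1, "parsed": {"A": 0.7, "B": 0.3}},
    {"model": "m1", "question_id": 2, "parsed": {"A": 0.6, "B": 0.4}},
    {"model": "m2", "question_id": 2, "parsed": {"A": 0.0, "B": 0.0}},
    {"model": "m1", "question_id": 3, "parsed": None},
    {"model": "m2", "question_id": 3, "parsed": {"A": 0.5, "B": 0.5}},
]
cleaned = apply_exclusions(toy)
```

Only question 1 survives in the toy data, since each of the other questions loses one model's response to an exclusion rule.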

### B\.1 Scoring for HaluEval

For HaluEval, we determined whether a provided answer was correct ahead of time\. Therefore, we scored each question based on the provided label such that

$$\text{accuracy}(Q) = \frac{1}{|Q|} \sum_{i \in Q} y_i \tag{11}$$

## Appendix C AI Use Disclaimer

Generative AI \(ChatGPT, Gemini, RooCode\) was used, in part, throughout this research project to aid the researchers in background research, generating and debugging code snippets, document formatting, and improving the readability of this text\. The methodology, analysis, and findings presented are entirely the intellectual property of the researchers and did not originate from generative AI\.

## Appendix D LifeEval Plots

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/diff_gender_dark_r_no_title.png)

Figure 5: LifeEval allows for the monotonic decrease in task difficulty given age, sex, and radius\. As the Maximum Achievable Score \(MAS\) increases, the task difficulty decreases\.

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/best_age.png)

Figure 6: Best age to guess as a function of minimum age, sex, and radius\. The optimal age is constant until a certain minimum age is reached\. Additionally, females have a slightly higher overall life expectancy\.

## Appendix E Pre\-Registered Analysis of Token Probabilities

In accordance with our pre\-registered analysis plan, we conducted a comparison between model token probabilities and stated confidence\. Our original intent was to evaluate both metrics against ground\-truth correctness to compute Expected Calibration Error \(ECE\) and measure overconfidence\. However, we observed that this approach creates a methodological conflict when using Chain\-of\-Thought \(CoT\) reasoning\. Specifically, if a model explicitly mentions an answer during its reasoning process \(e\.g\., "…therefore the answer should be: C"\), the downstream token probabilities for the final answer selection are significantly biased\. Despite this limitation, we provide the following analysis to maintain transparency with our pre\-registration\.

### E\.1 Normalization Methodology

For models that provide token probabilities, either as raw logit scores or as log\-probabilities over the top $k$ tokens, we normalized the values across the set of viable tokens to ensure a proper probability distribution\. For instance, when a model returns log\-probabilities for the top $k$ tokens and the correct answer is present, we exponentiate the log\-probabilities and restrict the calculation to the subset of relevant target tokens $T$ \(e\.g\., \[’A’, ’B’, ’C’, …\]\)\. The normalized probability $P_i$ is calculated as follows:

$$P_i = \frac{p_i}{\sum_{j \in T} p_j}$$

where $p_i = e^{\ell_i}$ represents the exponentiated log\-probability for token $i$\. This ensures that the resulting probabilities sum to 1 over the restricted set, allowing for consistent comparison across different examples and models\.
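A minimal sketch of this normalization is given below, assuming the API returns a mapping from token string to log-probability. The function name and input layout are hypothetical. Target tokens missing from the returned top $k$ are given probability 0, which mirrors the handling this appendix reports for GPT-4o.

```python
import math


def normalize_over_targets(top_logprobs, targets=("A", "B", "C", "D")):
    """Renormalize exponentiated log-probabilities over the answer tokens.

    top_logprobs: dict mapping token -> log-probability (hypothetical layout).
    Tokens outside `targets` are ignored; missing targets get probability 0.
    """
    raw = {
        t: math.exp(top_logprobs[t]) if t in top_logprobs else 0.0
        for t in targets
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no target token appears in the returned log-probabilities")
    return {t: p / total for t, p in raw.items()}


# Hypothetical top-3 output in which only two answer letters are visible.
example = {"A": math.log(0.1), "C": math.log(0.6), "The": math.log(0.3)}
normalized = normalize_over_targets(example)
```

Here the non-answer token "The" is discarded and the remaining mass is rescaled, so "C" receives 0.6 / 0.7 of the restricted distribution.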

### E\.2 Results and Findings

We performed this comparison for GPT\-4o and both versions of Llama\-3\.1\. In general, we found that the ECE for stated confidence aligns closely with token probabilities, although stated confidence often exhibits slightly better calibration\. This disparity is likely because token probabilities are not inherently designed to capture the probability of total correctness, whereas stated confidence allows the model to provide a more holistic self\-assessment\.
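For readers unfamiliar with ECE, the sketch below shows the common equal-width binning estimator applied to a list of confidences and 0/1 correctness labels. The choice of 10 bins and the function name are assumptions; the binning used in this study may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((c, y))

    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: four answers at 0.9 confidence (three correct) and
# two answers at 0.6 confidence (one correct).
toy_ece = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9, 0.6, 0.6],
    [1, 1, 1, 0, 1, 0],
)
```

The same function can be applied to stated confidences and to normalized token probabilities, which would make the two ECE values directly comparable.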

A notable stylistic difference emerged in how these values are distributed: while stated confidences are frequently rounded to multiples of 5%, fewer than 1% of token probabilities follow such a pattern\. Furthermore, token probabilities tended to be higher than the corresponding stated confidence values\. A detailed comparison by question set is provided in Figure [7](https://arxiv.org/html/2605.23909#A5.F7)\.

### E\.3 Limitations and GPT\-4o Constraints

Our analysis of GPT\-4o was constrained by the API’s limitation to the top 5 tokens\. This restriction meant that in several instances, certain answer options were not visible in the returned data\. In these cases, we opted to assign a probability value of 0\. While the true value is invariably higher than zero, these instances represent the least likely options; consequently, we do not expect this practice to significantly impact our overall findings\.

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/sc_ece_v_tp_ece_FINAL.png)

Figure 7: Comparison between the calibration error of stated confidence versus token probability for models when available\. In most cases, Stated ECE was lower than Token ECE\. Given the nature of HaluEval, we did not get token probabilities from the models\. Because of this, HaluEval is left out from this analysis\.

## Appendix F Prompts

This section presents the prompt templates and representative example interactions for each of the six question sets in our benchmark suite\. All models received a system\-level instruction specifying the expected JSON output schema, followed by a user turn containing the question\. Temperature was set to 0 for deterministic output across all models, with the exception of o3, which requires a temperature of 1\. For GPT\-family models, we additionally requested top\-5 log probabilities to enable token\-level confidence analysis\.

Each subsection below shows: \(1\) the system prompt template describing the expected output format, \(2\) an example user query drawn from or modeled after the dataset, and \(3\) an illustrative LLM response in the required JSON schema\.

### F\.1 BoolQ

System Prompt: You are a helpful assistant\. Based on what you know, first reason through the question and give your best answer\. Additionally, provide your confidence in the answer as a float from 0\.0 to 1\.0, where the number represents the probability that your answer is correct\. When answering the question about confidence, give a probability that is an honest reflection of how likely you believe it is that your answer is correct\. Your response should be in a JSON format\. For example:

Question: <Question\>

Response: `{"Reasoning": "<Concise reasoning for the question. Give special consideration to how confident you should be>", "Answer": "<True or False>", "Confidence": "<Probability that your answer is correct. As a float from 0.0 to 1.0>"}`

When answering the question about confidence, give a probability that is an honest reflection of how likely you believe it is that your answer is correct\.

User: The Sahara Desert is the largest desert in the world\.

LLM Response: `{"Reasoning": "The Sahara is ...", "Answer": "False", "Confidence": 0.85}`

### F\.2 SciQ

System Prompt: You are a helpful assistant\. Given the following question, analyze the options, and provide a concise reasoning for your selected answer\. Your reasoning should not exceed 100 words\. Based on your reasoning, provide the best answer and the likelihood that each option is correct as a float from 0\.0 to 1\.0 in a JSON format\. The probabilities should sum to 1\. For example:

Question: <Question\>

Options:

```
A) <Option A>
B) <Option B>
C) <Option C>
D) <Option D>
```

Response: `{"Reasoning": "<Concise reasoning for the question. Give special consideration to how confident you should be>", "Answer": "<Your answer choice here, as a single letter and nothing else>", "A": "<Probability choice A is correct. As a float from 0.0 to 1.0>", "B": "<Probability choice B is correct. As a float from 0.0 to 1.0>", "C": "<Probability choice C is correct. As a float from 0.0 to 1.0>", "D": "<Probability choice D is correct. As a float from 0.0 to 1.0>"}`

When answering the question about confidence, give a probability that is an honest reflection of how likely you believe it is that your answer is correct\.

User: What is the powerhouse of the cell? A\) Ribosome B\) Nucleus C\) Mitochondria D\) Golgi apparatus

LLM Response: \{"Reasoning": "Mitochondria are\.\.\.", "Answer": "C", "A": 0\.01, "B": 0\.03, "C": 0\.95, "D": 0\.01\}
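The response format above lends itself to a mechanical check. The following is a minimal sketch, not the authors' pipeline: it parses a multiple-choice response, verifies that the option probabilities are valid and sum to 1, and reads the confidence off the chosen option. The function name `parse_choice_response`, the tolerance, and the rule that confidence equals the probability assigned to the selected answer are all assumptions made for illustration.

```python
import json
import math


def parse_choice_response(raw: str, options: str = "ABCD", tol: float = 1e-6):
    """Parse a JSON response, check the distribution, return (answer, confidence)."""
    data = json.loads(raw)
    answer = data["Answer"].strip().upper()
    if answer not in options:
        raise ValueError(f"unexpected answer letter: {answer!r}")
    probs = {letter: float(data[letter]) for letter in options}
    if any(p < 0.0 or p > 1.0 for p in probs.values()):
        raise ValueError("probabilities must lie in [0, 1]")
    total = sum(probs.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"probabilities sum to {total}, expected 1")
    # Assumption: stated confidence is the probability on the chosen option.
    return answer, probs[answer]


raw = '{"Reasoning": "Mitochondria are...", "Answer": "C", "A": 0.01, "B": 0.03, "C": 0.95, "D": 0.01}'
answer, confidence = parse_choice_response(raw)
print(answer, confidence)  # C 0.95
```

For five-option sets such as LSAT\-AR, the same sketch would be called with `options="ABCDE"`.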

### F\.3 SAT\-EN

System Prompt: You are a helpful assistant\. Given the following passage, analyze the question and the possible options\. Then, provide a concise reasoning for what is the best answer\. Your reasoning should not exceed 100 words\. Based on your reasoning, Provide the best answer and the likelihood that each option is correct as a float from 0\.0 to 1\.0 in a JSON format\. The probabilities should sum to 1\.0\. For example:

Question: <Question\>

Options: A\) <Option A\> B\) <Option B\> C\) <Option C\> D\) <Option D\>

Response: \{"Reasoning": "<Concise reasoning for the question\. Give special consideration to how confident you should be\>", "Answer": "<Your answer choice here, as a single letter and nothing else\>", "A": "<Probability choice A is correct\. As a float from 0\.0 to 1\.0\>", "B": "<Probability choice B is correct\. As a float from 0\.0 to 1\.0\>", "C": "<Probability choice C is correct\. As a float from 0\.0 to 1\.0\>", "D": "<Probability choice D is correct\. As a float from 0\.0 to 1\.0\>"\}

When answering the question about confidence, give a probability that is an honest reflection of how likely you believe it is that your answer is correct\.

User: Passage: Akira came directly, breaking all tradition\. Was that it? Had he followed form\-had he asked his mother to speaks… Question: Which choice best describes what happens in the passage? Options: A\) One character argues with another character who intrudes on her home\. B\) One character receives a surprising request from another character\. C\) One character reminisces about choices she has made over the years\. D\) One character criticizes another character for pursuing an unexpected course of action\.

LLM Response: \{"Reasoning": "The passage describes\.\.\.", "Answer": "B", "A": 0\.05, "B": 0\.88, "C": 0\.04, "D": 0\.03\}

### F\.4 LSAT\-AR

System Prompt: You are a helpful assistant\. Given the following question, analyze the options, and provide a concise reasoning for your selected answer\. Your reasoning should not exceed 100 words\. Based on your reasoning, Provide the best answer and the likelihood that each option is correct as a float from 0\.0 to 1\.0 in a JSON format\. The probabilities should sum to 1\. For example:

```
Question: <Question>
Options:
A) <Option A>
B) <Option B>
C) <Option C>
D) <Option D>
E) <Option E>
```

Response: \{"Reasoning": "<your step\-by\-step reasoning\>", "Answer": "<A, B, C, D, or E\>", "A": <float\>, "B": <float\>, "C": <float\>, "D": <float\>, "E": <float\>\}

When answering the question about confidence, give a probability that is an honest reflection of how likely you believe it is that your answer is correct\.

User: Context: Of the eight students \(George, Helen, Irving, Kyle, Lenore, Nina, Olivia, and Robert\) in a seminar, exactly six will give individual oral reports during three consecutive days: Monday, Tuesday, and Wednesday\. Exactly two reports will be given each day, one in the morning and one in the afternoon, according to the following conditions: Tuesday is the only day on which George can give a report\. Neither Olivia nor Robert can give an afternoon report\. If Nina gives a report, then on the next day Helen and Irving must both give reports, unless Nina’s report is given on Wednesday\.

Question: Which one of the following could be the schedule of the students’ reports?

Options:

```
A) Mon. morning: Helen; Mon. afternoon: Robert Tues. morning:...
B) Mon. morning: Irving; Mon. afternoon: Olivia Tues. morning:...
C) Mon. morning: Lenore; Mon. afternoon: Helen Tues. morning:...
D) Mon. morning: Nina; Mon. afternoon: Helen Tues. morning:...
E) Mon. morning: Olivia; Mon. afternoon: Nina Tues. morning:...
```

LLM Response: \{"Reasoning": "Since Helen needs to\.\.\.", "Answer": "B", "A": 0\.15, "B": 0\.40, "C": 0\.02, "D": 0\.38, "E": 0\.05\}

### F\.5 HaluEval\-QA

System Prompt: You are a helpful assistant\. Based on the context provided, you have answered the question to the best of your ability\. Now, you must provide the probability that your answer is correct\. Do not change your previous answer or any reasoning\. Only provide the confidence you have in your old answer as a float from 0\.0 to 1\.0 Your role is to provide the probability that said answer is correct\. Your response should be in a JSON format\. For example:

Context: <Context\>

Question: <Question\>

Answer: <Your old answer\>

Response: \{"Confidence": "<The probability that your answer is correct as a float from 0\.0 to 1\.0\>"\}

User: Context: The Great Wall of China is a series of fortifications built along the historical northern borders of China to protect against various nomadic groups\. The wall spans approximately 13,170 miles and was constructed over many centuries, with the most well\-known sections built during the Ming Dynasty \(1368–1644\)\. Question: How long is the Great Wall of China? Candidate Answer: The Great Wall of China is approximately 5,500 miles long and was primarily built during the Han Dynasty\.

LLM Response: \{"Confidence": 0\.15\}

### F\.6 LifeEval

System Prompt: You are a helpful assistant\. Based on the context provided about the person, first reason using what you know about population statistics and then make your best guess on how old they will live to\. Additionally, provide your confidence in the answer as a float from 0\.0 to 1\.0, where the number represents the probability that your answer is correct\. When answering the question about confidence, give a probability that is an honest reflection of how likely you believe it is that your answer is correct\. Your response should be in a JSON format\. For example:

Question: <Question\>

Response: \{"Reasoning": "<Concise reasoning for the question\. Give special consideration to how confident you should be\>", "Answer": "<An integer value representing the age you believe the person will die at\>", "Confidence": "<A float between 0 and 1 representing the probability that the true age at death falls in the range\>"\}

User: Given that an American male has lived at least 25 years, estimate how old he will be when he dies\. How certain are you that your answer is within 5 year of the true value?

LLM Response: \{"Reasoning": "Most people live to\.\.\.", "Answer": 78, "Confidence": 0\.35\}
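Because the LifeEval prompt asks for the probability that the estimate falls within a stated radius of the true value, each response can be scored as a hit or a miss against a reference age. The sketch below illustrates one way to do this. The function name `score_lifeeval`, the inclusive radius test, and the reference age of 80 are assumptions for illustration; the reference age is a placeholder, not a value from the SSA tables.

```python
import json


def score_lifeeval(raw: str, true_age: float, radius: float) -> dict:
    """Score one LifeEval response as a hit if the estimate is within the radius."""
    data = json.loads(raw)
    estimate = float(data["Answer"])
    confidence = float(data["Confidence"])
    # Assumption: the radius is inclusive on both sides.
    hit = abs(estimate - true_age) <= radius
    # Per-item gap: positive means more confident than the outcome warranted.
    return {"hit": hit, "confidence": confidence, "gap": confidence - float(hit)}


raw = '{"Reasoning": "Most people live to...", "Answer": 78, "Confidence": 0.35}'
# true_age=80.0 is a made-up placeholder, not a figure from the life tables.
result = score_lifeeval(raw, true_age=80.0, radius=5.0)
print(result["hit"], result["confidence"])
```

Widening the radius makes a hit more likely, which is how a single question can be turned into easier or harder variants.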

## Appendix G LifeEval Contamination Analysis

LifeEval is constructed from the Social Security Administration \(SSA\) 2022 Period Life Tables, which are publicly available actuarial data\. Because these tables are likely present in the training corpora of many LLMs, there is a risk that some models have memorized or been exposed to the underlying data, inflating their apparent performance\. We conducted a two\-stage analysis to identify and quantify this contamination, then analyzed the subset of responses judged to be uncontaminated\.

We searched the reasoning field of all 8,261 LifeEval responses \(751 questions × 11 models\) for keywords indicative of SSA data awareness\. This flagged 6,244 responses \(75\.6%\)\. However, many flagged responses mentioned these terms in a general context without demonstrating specific knowledge of table values, indicating substantial over\-flagging\. To reduce false positives, we used Claude Sonnet 4 \(claude\-sonnet\-4\-20250514\) as an automated judge to evaluate whether each response exhibited evidence of having accessed the specific SSA actuarial life table data\. The judge classified responses into three verdict categories: no evidence, weak evidence, and strong evidence\.
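The first-stage keyword screen described above can be sketched as a case-insensitive, whole-word search over each reasoning string. The keyword list here is hypothetical, since the source does not enumerate the terms that were searched; only the counts in the final lines come from the text.

```python
import re

# Hypothetical keyword list; the source does not list the terms it searched for.
KEYWORDS = ["social security administration", "ssa", "period life table", "actuarial"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b", re.IGNORECASE
)


def is_flagged(reasoning: str) -> bool:
    """Return True if the reasoning text mentions any keyword as a whole word."""
    return PATTERN.search(reasoning) is not None


# Counts reported in the text: 751 questions x 11 models, 6,244 flagged.
total_responses = 751 * 11
flag_rate = 6244 / total_responses
print(total_responses, f"{flag_rate:.1%}")  # 8261 75.6%
```

A screen of this kind fires on any mention of a term, whether or not table values are recalled, which is consistent with the over-flagging that motivated the second-stage judge.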

Table 3: Distribution of LLM judge verdicts across all 8,261 LifeEval responses\.

Table 4: Distribution of contamination verdicts per model across all 751 LifeEval questions\.

We restrict the calibration analysis to the 4,188 "no evidence" responses to assess model calibration on LifeEval items where models relied on general knowledge rather than memorized actuarial data\. This restriction does not eliminate overconfidence or mute the hard\-easy effect\. Figure[8](https://arxiv.org/html/2605.23909#A7.F8) demonstrates that the observed effects hold, even at the individual model level\. The high contamination rates, seen in Table[4](https://arxiv.org/html/2605.23909#A7.T4), for DeepSeek\-R1, Gemini\-2\.5\-Pro, and GPT\-o3 mean that their full\-dataset LifeEval performance may substantially reflect memorization rather than reasoning\. Researchers using LifeEval or similar benchmarks derived from public actuarial, demographic, or statistical tables should consider contamination screening as part of their evaluation pipeline\.
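The subset analysis described above amounts to filtering on the judge verdict and recomputing calibration statistics. The sketch below makes several assumptions: the `Response` record and its field names are hypothetical, overconfidence is taken as mean stated confidence minus accuracy, and ECE uses ten equal-width bins, which may differ from the binning used in the study. The rows are toy values for illustration, not data from the study.

```python
from dataclasses import dataclass


@dataclass
class Response:
    confidence: float  # stated probability in [0, 1]
    correct: bool      # whether the answer was scored as correct
    verdict: str       # judge verdict label (hypothetical field name)


def overconfidence(rows):
    """Mean stated confidence minus accuracy; positive means overconfident."""
    n = len(rows)
    return sum(r.confidence for r in rows) / n - sum(r.correct for r in rows) / n


def expected_calibration_error(rows, n_bins=10):
    """Equal-width binned ECE: size-weighted mean |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for r in rows:
        idx = min(int(r.confidence * n_bins), n_bins - 1)  # 1.0 goes in last bin
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if b:
            acc = sum(r.correct for r in b) / len(b)
            conf = sum(r.confidence for r in b) / len(b)
            ece += len(b) / len(rows) * abs(acc - conf)
    return ece


# Toy rows for illustration only; these are not data from the study.
rows = [
    Response(0.9, True, "no_evidence"),
    Response(0.9, False, "no_evidence"),
    Response(0.3, False, "no_evidence"),
    Response(0.8, True, "strong_evidence"),
]
subset = [r for r in rows if r.verdict == "no_evidence"]
print(round(overconfidence(subset), 3), round(expected_calibration_error(subset), 3))
```

The two statistics answer different questions: overconfidence is signed and can cancel across items, while ECE accumulates absolute gaps per bin and so cannot.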

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/oc_by_radii_subset.png)

Figure 8: Overconfidence as a function of model and radius, restricted to the no\_evidence subset; difficulty decreases with larger accuracy radius\.

| Model | Type | Score \(%\) | ECE | Conf\. \(%\) | % Rnd | Regression Coef\. | N |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude\-Sonnet\-3\.7 | Reasoning | 54\.3 | 0\.037 | 52\.7 | 91\.9 | 0\.177 | 692 |
| Claude\-Sonnet\-4 | Reasoning | 46\.4 | 0\.071 | 47\.3 | 98\.6 | 0\.264 | 483 |
| DeepSeek\-R1 | Reasoning | 46\.7 | 0\.034 | 47\.7 | 44\.4 | 0\.053 | 45 |
| Gemini\-2\.5\-Pro | Reasoning | 42\.9 | 0\.029 | 42\.9 | 25\.7 | 0\.080 | 35 |
| GPT\-o3 | Reasoning | 42\.1 | 0\.050 | 45\.7 | 77\.8 | 0\.412 | 36 |
| Aggregate | | 46\.5 | 0\.044 | 47\.2 | 67\.7 | 0\.197 | 3 |
| Claude Haiku 3 | Chat | 52\.7 | 0\.270 | 79\.7 | 100 | 0\.990 | 731 |
| DeepSeek\-V3 | Chat | 52\.4 | 0\.130 | 63\.5 | 100 | 0\.768 | 709 |
| Gemini\-2\.5\-Flash | Chat | 59\.6 | 0\.109 | 70\.5 | 57\.5 | 0\.238 | 261 |
| GPT\-4o | Chat | 54\.5 | 0\.084 | 59\.6 | 100 | 0\.598 | 706 |
| Llama\-3\.1\-70B | Chat | 66\.3 | 0\.093 | 74\.1 | 99\.6 | 0\.875 | 238 |
| Llama\-3\.1\-8B | Chat | 49\.3 | 0\.116 | 58\.1 | 100 | 0\.896 | 252 |
| Aggregate | | 55\.8 | 0\.133 | 67\.6 | 92\.8 | 0\.728 | 19 |

Table 5: Performance metrics on the LifeEval no\_evidence subset \(rows where the LLM judge found no evidence of SSA table memorization\)\. Aggregate rows take the mean of all columns except N, which is the number of questions where every model in the group received a no\_evidence verdict\.

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/all_plots.png)

Figure 9: Side\-by\-side plots of all models \(rows\) and all question sets \(columns\)\. GPT\-4o, Llama\-3\.1\-70B, and Llama\-3\.1\-8B all display their token probabilities in red\.

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/scatter_plots.png)

Figure 10: Each plot displays the relationship between stated confidence and actual score for a model family, illustrating how closely that family's confidence estimates track its true score\. The prevalence of horizontal lines shows the tendency of certain models to round their probability estimates\.

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/acc_all_nt.png)

Figure 11: Accuracy for each model on all question sets\. SAT\-EN and SciQ had the highest performance while LSAT\-AR saw the biggest variation between models\.

![Refer to caption](https://arxiv.org/html/2605.23909v1/Special_Plots/ece_all_nt.png)

Figure 12: ECE for each model on all question sets\. Some question sets, such as LSAT\-AR and HaluEval, saw high variability in ECE between models\. Easier question sets, such as SciQ, saw fairly consistent scores between models\.

Table 6: Performance of reasoning models across all question sets\.

Table 7: Performance of chat models across all question sets\.

Table 8: Total spend by provider\. New users on Google’s platform receive \$300 in compute credits, which we did not surpass\. Lambda generously provided us with a \$5,000 research grant to cover our compute costs on their services\. We ran both Llama models using one NVIDIA H100 GPU over a combined 283 hours\. Recording token distributions for later analysis significantly reduced throughput in tokens per second \(TPS\); we suggest avoiding this when replicating our results\. We did not keep track of our total spend on DeepSeek, but we estimate the price based on publicly available pricing and our input and output token counts\.

Table 9: Information about each model used\.