Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

arXiv cs.CL 05/11/26, 04:00 AM Papers
llm-evaluation metacognition benchmarking confidence-calibration mmlu frontier-models
Summary
This study presents a 33-model atlas analyzing domain-level metacognitive monitoring in frontier LLMs using MMLU benchmarks, revealing significant variations in confidence calibration across different knowledge domains that are obscured by aggregate metrics.
arXiv:2605.06673v1 Announce Type: new Abstract: Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184). These results show stable benchmark-domain variation obscured by aggregate metrics, and support benchmark-stage domain screening as a step before deployment in specific application areas.
Cached at: 05/11/26, 06:39 AM
# Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas
Source: [https://arxiv.org/html/2605.06673](https://arxiv.org/html/2605.06673)
\(April 2026\)

###### Abstract

Aggregate metacognitive quality scores mask within\-model variation across MMLU benchmark domains\. We administered 1,500 MMLU items \(250 per domain, under an a priori six\-domain grouping\) to 33 frontier LLMs from eight model families and computed Type\-2 AUROC per model\-domain cell using verbalized confidence \(0–100\)\. Total observations: 47,151\. Every model with above\-chance aggregate monitoring showed non\-trivial domain\-level variation\. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor \(mean AUROC = \.742, ranked top\-2 in 21 of 33 models\)\. Formal Reasoning and Natural Science were reliably the hardest \(\.658 and \.652 respectively; one of the two occupied a bottom\-2 rank in 27 of 33 models\)\. The three middle domains \(Factual, Social, Humanities\) were statistically indistinguishable \(means within \.007; Kendall’s $W=.164$ indicates models agree on extremes but disagree on the middle\)\. A subject\-level coherence analysis \(within\-domain similarity ratio = 0\.95\) confirms that the six\-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct\. Within\-family profile similarity was significant for Google\-Gemini \($r=.842$, $p=.035$ for the top pair\) but not Anthropic, where profile shape varied despite consistent aggregate quality \(\.708–\.806 across four generations\)\. Gemma 4 31B showed a \+\.202 AUROC improvement over Gemma 3 27B\. Three models classified Invalid on binary KEEP/WITHDRAW probes \[Cacioli, [2026a](https://arxiv.org/html/2605.06673#bib.bib4)\] produced normal profiles under verbalized confidence, confirming probe\-format specificity\. GPT\-oss\-120B showed the highest confidence variance \(SD = 21\.3\) but near\-chance monitoring \(\.530\)\.
Bootstrap 95% CIs \(1,000 resamples\) on the 198 model\-domain cells have median width \.199, adequate for detecting large profile differences but insufficient for resolving adjacent\-domain differences in high\-accuracy models \(34% of cells exceed \.25\)\. Split\-half aggregate stability across models is $r=.893$\. These results show stable benchmark\-domain variation in confidence discrimination that is obscured by aggregate metrics, and support benchmark\-stage domain screening as a step before deployment in specific application areas\.

## 1 Introduction

### 1\.1 The aggregate\-metric problem

LLM confidence signals are increasingly used for abstention, routing, and safety\-critical escalation in deployment \[Wen et al\., [2025](https://arxiv.org/html/2605.06673#bib.bib15)\]\. The standard evaluation reports a single aggregate metric, typically AUROC or ECE, across all items\. This aggregate assumes that metacognitive monitoring quality is uniform across cognitive demands\. It is not\.

Cacioli \[[2026f](https://arxiv.org/html/2605.06673#bib.bib3)\] reported that every Valid model in the Classical Minds sample \(14 of 20 frontier LLMs\) had at least one cognitive domain with AUROC below \.55, despite aggregate AUROCs ranging from \.539 to \.717 in that subset\. Sonnet 4\.6 showed \.965 on Executive Function and \.485 on Prospective Regulation\. Claude Haiku 4\.5 showed \.804 on Social Cognition and \.466 on Attention\. These domain\-level variations do not show up in aggregate reporting and carry direct deployment implications\. A confidence\-based abstention system built for legal reasoning \(where a model monitors well\) may behave differently when applied to mathematical problem\-solving \(where the same model monitors poorly\)\.

We ask three questions\. First, does the domain\-level variation observed in a custom battery replicate on a standard benchmark at adequate statistical power? Second, is the variation structured, with consistent domains that are easier or harder to monitor across models? Third, do models within a training family share a domain\-level profile shape?

### 1\.2 Prior work

Several strands of research address LLM confidence reliability\. Steyvers and Peters \[[2025](https://arxiv.org/html/2605.06673#bib.bib14)\] reviewed metacognition and uncertainty communication in LLMs\. Xiong et al\. \[[2023](https://arxiv.org/html/2605.06673#bib.bib16)\] surveyed confidence elicitation methods and found pervasive overconfidence\. Kadavath et al\. \[[2022](https://arxiv.org/html/2605.06673#bib.bib9)\] showed LLMs can sometimes discriminate questions they answer correctly from those they do not\. Cacioli \[[2026d](https://arxiv.org/html/2605.06673#bib.bib1), [c](https://arxiv.org/html/2605.06673#bib.bib2)\] applied signal detection theory to decompose metacognitive efficiency from task performance\. Wu et al\. \[[2026](https://arxiv.org/html/2605.06673#bib.bib13)\] introduced a decision\-theoretic reliability metric\. All report aggregate metrics\. None profile domain\-level variation\.

A parallel line addresses validity: Cacioli \[[2026a](https://arxiv.org/html/2605.06673#bib.bib4)\] derived six validity indices for LLM confidence data from PAI and MMPI\-3 validity scales\. Cacioli \[[2026e](https://arxiv.org/html/2605.06673#bib.bib5)\] extracted a portable screening protocol classifying models as Invalid, Indeterminate, or Valid\. Cacioli \[[2026b](https://arxiv.org/html/2605.06673#bib.bib6)\] showed the classification predicts selective prediction performance \($d=2.81$, $\eta^{2}=.470$\)\. We extend this framework from aggregate screening to domain\-level profiling: once you know the signal is valid, where is it valid?

Haznitrama et al\. \[[2026](https://arxiv.org/html/2605.06673#bib.bib17)\] evaluated LLMs on a neuropsychologically grounded battery of cognitive tasks but did not analyze metacognitive monitoring across those domains\. Closer to our interest, Cacioli \[[2026f](https://arxiv.org/html/2605.06673#bib.bib3)\] \(the Metacognitive Monitoring Battery\) administered 524 items across six cognitive tracks to 20 frontier LLMs, but with binary KEEP/WITHDRAW probes and only 60–116 items per domain\. The atlas replicates and extends that work on a standardized benchmark \(MMLU\), with continuous verbalized confidence, at 250 items per domain and 33 models\.

Recent mechanistic work is relevant\. Kumaran et al\. \[[2026](https://arxiv.org/html/2605.06673#bib.bib11)\] showed that verbal confidence in Gemma 3 27B reflects cached retrieval from answer\-adjacent positions, not just\-in\-time computation, and that confidence representations explain variance beyond token log\-probabilities\. Kim \[[2026](https://arxiv.org/html/2605.06673#bib.bib10)\] identified a metacognitive locus at 61–69% of network depth across two architecturally distinct models, where hidden\-state variance discriminates known from unknown questions prior to any output token being generated\. These findings point to some form of second\-order evaluation of answer quality, although they do not by themselves establish introspective access in the strong human\-cognitive sense\. We ask whether the quality of this second\-order signal varies across MMLU\-domain bins\.

Cacioli \[[2026g](https://arxiv.org/html/2605.06673#bib.bib7)\] showed verbal confidence from 3\-9B instruction\-tuned models saturates under minimal elicitation \(mean ceiling rate 91\.7%, all seven models classified Invalid\)\. Miao and Ungar \[[2026](https://arxiv.org/html/2605.06673#bib.bib12)\] showed calibration and verbalized confidence are encoded in orthogonal directions in the residual stream\. Verbal confidence is not automatically trustworthy\. We use frontier models where the aggregate signal is valid and ask whether validity is uniform across domains\.

### 1\.3 A procedural analogy, not an ontological one

Clinical neuropsychological assessment follows a fixed interpretation sequence\. Validity indicators are checked first \[Larrabee, [2012](https://arxiv.org/html/2605.06673#bib.bib18)\]\. If valid, aggregate scores \(e\.g\., FSIQ on the WAIS\-IV\) provide an overall level\. Then the profile is interpreted: index\-level scores reveal relative strengths and weaknesses across cognitive domains\. A patient with FSIQ of 100 may show a 20\-point discrepancy between Verbal Comprehension and Processing Speed\. The discrepancy is the clinical finding\. The aggregate conceals it\.

We adopt this three\-step procedure for LLM confidence evaluation: screen first \[Cacioli, [2026e](https://arxiv.org/html/2605.06673#bib.bib5)\], compute the aggregate, then examine the profile\. The analogy is procedural\. We do not claim equivalence between human cognitive domains \(which rest on decades of factor\-analytic validation\) and the MMLU\-subject bins used here \(which do not, §3\.8\)\. What transfers is the interpretive sequence, not the construct status of the domains\.

### 1\.4 Contributions

1. A 33\-model, 47,151\-observation atlas of Type\-2 AUROC across six MMLU\-domain bins, providing the largest standardized profile dataset for LLM metacognitive monitoring to date\.
2. A robust extremum ordering: Applied/Professional is reliably the easiest benchmark domain to monitor; Formal Reasoning and Natural Science are reliably the hardest \(Friedman $\chi^{2}(5)=27.04$, $p<.0001$; Kendall’s $W=.164$\)\. The three middle domains are statistically indistinguishable\.
3. A subject\-level coherence analysis showing that the a priori domain taxonomy groups MMLU subjects by pragmatic cognitive demand, not by empirically cohesive latent structure, a construct\-validity limitation we foreground rather than bury\.
4. Exploratory within\-family profile similarity analyses: one family \(Google\-Gemini\) shows significant within\-family correlation in profile shape, while others do not\.
5. Descriptive generational trajectories showing a \+\.202 AUROC improvement from Gemma 3 to Gemma 4 and a plateau across Anthropic 4\.5–4\.7\.
6. Probe\-format specificity: three models classified Invalid under binary KEEP/WITHDRAW probes \[Cacioli, [2026a](https://arxiv.org/html/2605.06673#bib.bib4), [e](https://arxiv.org/html/2605.06673#bib.bib5)\] produce valid profiles under verbalized confidence, confirming that validity is a property of the model\-probe\-task interaction rather than an intrinsic model property\.
7. A public leaderboard, item\-level data \(47,151 observations\), analysis code, and bootstrap CIs for all 198 model\-domain cells\.

### 1\.5 Scope and what this paper is not

This is an atlas of benchmark\-conditioned profile variation, not a validated map of latent metacognitive domains\. The paper makes three claims and no more\. First, within\-model domain variation is substantial and obscured by aggregate AUROC\. Second, Applied/Professional is reliably the easiest MMLU\-domain bin to monitor and Formal/Science reliably the hardest\. Third, the binary KEEP/WITHDRAW vs\. verbalized\-0\-100 comparison shows that validity is format\-dependent\.

We do not claim that the six domains constitute a validated cognitive taxonomy for LLMs \(§3\.8 shows they do not\)\. We do not claim a causal mechanism for the Applied\-Formal gap \(§4\.1 raises candidate hypotheses\)\. We do not claim that benchmark AUROC transports directly to deployment reliability without further domain\-specific evaluation \(§4\.6\)\. Readers should interpret the atlas as a benchmark\-stage screening tool, not a deployment certification\.

## 2 Methods

### 2\.1 Models

Thirty\-three frontier LLMs from eight families were evaluated via the Kaggle Benchmarks platform \(March–April 2026\)\. Models span four Anthropic generations, three DeepSeek versions, seven Google\-Gemini models, five Gemma models, five OpenAI models, four Qwen models, and GLM\-5\. All calls used greedy decoding \(temperature 0\) and independent conversation context per item\. Full model list with canonical IDs is in the repository’s `data/README.md`\.

### 2\.2 Substrate

The substrate is 1,500 items from MMLU \[Hendrycks et al\., [2021](https://arxiv.org/html/2605.06673#bib.bib8)\], stratified across six cognitive domains \(250 items per domain\)\. Items were drawn deterministically \(seed = 42\) from the test split via the HuggingFace datasets library\.

### 2\.3 Domain mapping

We mapped 56 of 57 MMLU subjects a priori into six cognitive domain bins \(Table [1](https://arxiv.org/html/2605.06673#S2.T1)\)\. One subject \(elementary\_mathematics, 173 items\) was excluded as ambiguous across formal reasoning and applied arithmetic\. The mapping is a pragmatic grouping by surface cognitive demand, not a validated latent taxonomy; see §3\.8 for a coherence analysis\.

Table 1: MMLU\-to\-domain mapping \(partial\)\. Full mapping in the repository notebook\.

### 2\.4 Elicitation

Each item was presented to the model with a fixed template requesting the answer letter \(A/B/C/D\) and a verbalized confidence \(0–100\)\. Prompts did not include chain\-of\-thought cues\. The model was instructed to make the confidence judgment alongside the answer\. Full prompt template is in the repository notebook\.

### 2\.5 Analysis

Type\-2 AUROC \(confidence predicting correctness\) was computed per model and per model\-domain cell using `sklearn.metrics.roc_auc_score`\. For cells with all\-correct or all\-incorrect items \(rare, only in Gemma 3 1B on some high\-accuracy items\), AUROC is undefined; these cells were flagged and excluded from aggregate statistics rather than imputed\. Bootstrap 95% CIs were computed with 1,000 resamples per cell \(seed = 42\)\.
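The per-cell computation described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes a percentile bootstrap (the text does not state which interval method was used) and replaces `sklearn.metrics.roc_auc_score` with an equivalent pairwise (Mann-Whitney) count so that it runs on the standard library alone. The function names and the toy data are hypothetical.

```python
import random

def type2_auroc(confidence, correct):
    """Probability that a correct item carries higher confidence than an
    incorrect one (ties count 0.5). Returns None when the cell is
    all-correct or all-incorrect, where AUROC is undefined."""
    pos = [c for c, y in zip(confidence, correct) if y]
    neg = [c for c, y in zip(confidence, correct) if not y]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_ci(confidence, correct, n_resamples=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI (assumed method); resamples items with
    replacement and skips resamples where AUROC is undefined."""
    rng = random.Random(seed)
    n = len(confidence)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        auc = type2_auroc([confidence[i] for i in idx],
                          [correct[i] for i in idx])
        if auc is not None:
            stats.append(auc)
    if not stats:
        return None
    stats.sort()
    m = len(stats)
    lo = stats[int((alpha / 2) * m)]
    hi = stats[min(m - 1, int((1 - alpha / 2) * m))]
    return lo, hi

# Hypothetical toy cell: verbalized confidence (0-100) and correctness flags.
toy_conf = [90, 80, 70, 60, 95, 40, 85, 50]
toy_correct = [1, 1, 0, 1, 1, 0, 1, 0]
toy_auc = type2_auroc(toy_conf, toy_correct)
toy_ci = bootstrap_ci(toy_conf, toy_correct, n_resamples=200)
```

Returning `None` for single-class cells mirrors the flag-and-exclude rule in the text; the pairwise loop is quadratic in cell size, which is acceptable for 250-item cells.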

The portable screening protocol \[Cacioli, [2026e](https://arxiv.org/html/2605.06673#bib.bib5)\] was applied to each model’s aggregate data\. All 33 models classified as Valid or above on the aggregate screen were retained for domain\-level analysis; no models were excluded by the screen at this stage\.

## 3 Results

### 3\.1 Model coverage and precision

Thirty\-three models produced 47,151 observations \(598–1,500 items per model\)\. Accuracy ranged from \.388 \(Gemma 3 1B\) to \.951 \(Opus 4\.6, Gemini 3 Flash\)\. Confidence SD ranged from 3\.3 \(Gemma 3 12B\) to 21\.3 \(GPT\-oss\-120B\)\. Aggregate AUROC ranged from \.498 \(Gemma 3 1B, chance\) to \.806 \(Opus 4\.6\)\.

Bootstrap 95% CIs were computed on all 198 model\-domain cells \(1,000 resamples, seed = 42\)\. Median CI width was \.199\. One hundred of 198 cells \(51%\) had CI width below \.20\. Sixty\-eight cells \(34%\) had CI width exceeding \.25, concentrated in high\-accuracy models with few errors per domain\. All 198 cells produced computable AUROCs\. The CI widths represent a substantial improvement over the Classical Minds battery \(median CI width \.275 on 60–116 items per domain\) due to the larger per\-domain sample \(250 items\) and continuous confidence scale\.

### 3\.2 The domain\-level profile matrix

Table [2](https://arxiv.org/html/2605.06673#S3.T2) reports Type\-2 AUROC per model\-domain cell for all 33 models, sorted by family and aggregate AUROC\. Figure [1](https://arxiv.org/html/2605.06673#S3.F1) visualises the same matrix as a heatmap with family separators\. The matrix is also archived as `data/atlas_summary_matrix.csv` in the repository\.

Table 2: Type\-2 AUROC per model\-domain cell, all 33 models\. Aggregate column is total within\-model AUROC across 1,500 items\. Column $n$ = items completed\. Models grouped by family, sorted by aggregate AUROC within family\.

![Refer to caption](https://arxiv.org/html/2605.06673v1/x1.png)

Figure 1: Type\-2 AUROC by model \(rows\) and MMLU\-domain bin \(columns\)\. Color scale is diverging around chance \(\.50\)\. Family separators are horizontal black lines\. $n=47{,}151$ observations across 33 models\.

### 3\.3 Domain difficulty hierarchy

Applied/Professional knowledge is the easiest domain to monitor\. Formal Reasoning and Natural Science are the hardest\. The ordering is supported by a Friedman test \($\chi^{2}(5)=27.04$, $p<.0001$\) and by convergent rank\-based evidence \(Table [3](https://arxiv.org/html/2605.06673#S3.T3), Figure [2](https://arxiv.org/html/2605.06673#S3.F2)\)\. Kendall’s $W=.164$ indicates that models agree on the extremes \(Applied at top, Formal and Science at bottom\) but diverge in the middle three positions\.
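The two statistics above are linked by a standard identity: without tie correction, Kendall's $W$ equals the Friedman statistic divided by $N(k-1)$, where $N$ is the number of raters (here, models) and $k$ the number of ranked conditions (here, domains). The sketch below, which is illustrative only and assumes no tied values within a model's row, computes both from a models-by-domains matrix and then checks the reported pair of values against the identity. Function names and the toy matrix are hypothetical.

```python
def friedman_and_kendall_w(matrix):
    """matrix: one row per model, one column per domain.
    Assumes no ties within a row (no tie correction applied)."""
    n, k = len(matrix), len(matrix[0])
    rank_sums = [0.0] * k
    for row in matrix:
        # Rank 1 = lowest AUROC in the row, rank k = highest.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    chi2 = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))
    return chi2, w

def kendall_w_from_friedman(chi2, n_models, n_domains):
    """W = chi2 / (N * (k - 1))."""
    return chi2 / (n_models * (n_domains - 1))

# Hypothetical toy matrix: three models that rank three domains identically.
toy_chi2, toy_w = friedman_and_kendall_w([
    [0.70, 0.60, 0.50],
    [0.80, 0.70, 0.60],
    [0.90, 0.80, 0.60],
])

# Consistency check on the reported values: 33 models, 6 domains.
w_reported = kendall_w_from_friedman(27.04, 33, 6)
```

With the reported $\chi^{2}(5)=27.04$, $N=33$ and $k=6$, the identity gives $27.04/165 \approx .164$, matching the stated $W$.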

Table 3: Domain\-level means across 33 models, sorted by mean AUROC\.

![Refer to caption](https://arxiv.org/html/2605.06673v1/x2.png)

Figure 2: Mean Type\-2 AUROC per domain across 33 models, with standard deviation bars\. Applied is the easiest benchmark domain to monitor; Formal and Science are the hardest; the middle three are statistically indistinguishable\.

Applied is ranked top\-2 within model in 21 of 33 models\. Formal or Science occupies a bottom\-2 rank within model in 27 of 33\. The three middle domains \(Factual, Social, Humanities\) are too close to discriminate reliably: the Social\-Humanities difference is \.001\. Applied exceeds Formal in 26 of 33 models\. The seven exceptions are the weakest models in the sample \(Gemma 3 1B/4B/12B, GPT\-oss\-120B, GLM\-5\), where both values sit near chance, plus DeepSeek V3\.1 and V3\.2, where Formal is the strongest track in the profile\.

Is the hierarchy an accuracy artefact? A natural concern is that domains with lower accuracy might produce lower AUROC through sampling variance alone, or that the Applied advantage tracks Applied being the easier domain\. Neither pattern holds\. Domain\-level mean accuracy across the 33 models is: Humanities \.924, Factual \.887, Science \.869, Formal \.851, Social \.835, Applied \.830\. Applied is in fact the domain with the *lowest* mean accuracy\. Humanities, with the highest accuracy, ranks only fourth on AUROC\. Spearman correlation between domain\-level mean accuracy and domain\-level mean AUROC across the six domains is $\rho=-.37$ \($p=.47$\), directionally opposite to what an accuracy confound would predict\. Within\-model, the correlation between per\-domain accuracy and per\-domain AUROC \(both mean\-centered within model\) is $r=.06$ \($p=.40$\)\. The domain hierarchy reported here reflects differential monitoring quality across task content, not sampling variation or item difficulty\.

### 3\.4 Family\-level profiles

Table [4](https://arxiv.org/html/2605.06673#S3.T4) reports family\-level aggregate metacognitive quality\.

Table 4: Family\-level aggregate metacognitive quality\.

Anthropic models show the highest mean and narrowest range \(Figure [3](https://arxiv.org/html/2605.06673#S3.F3)\)\. Haiku 4\.5 \(\.771\) nearly matches Opus 4\.6 \(\.806\), indicating aggregate metacognitive quality is relatively stable across model scale within this family\. Google\-Gemma and OpenAI show the widest ranges, driven by scale and generational effects\.

![Refer to caption](https://arxiv.org/html/2605.06673v1/x3.png)

Figure 3: Family\-level mean aggregate AUROC \(bars\) with observed minimum–maximum range \(black lines\)\. Families with only one model \(Zhipu\) show mean only\. Chance = \.50\.

Family\-level ipsative profiles\. Figure [4](https://arxiv.org/html/2605.06673#S3.F4) shows the ipsative profile of every model \(6\-domain AUROC centered on the model’s own mean\) colored by family, with each family’s mean profile overlaid in bold\. Three family\-level shapes are discernible by eye: Anthropic and Gemini both show an Applied peak with relatively flat rest\-of\-profile; Gemma shows less differentiation; OpenAI is heterogeneous\.

![Refer to caption](https://arxiv.org/html/2605.06673v1/x4.png)

Figure 4: Ipsative domain profiles for all 33 models, colored by family\. Thin lines show individual models; bold lines show the family mean profile\. Centering on each model’s own mean isolates profile shape from overall level\.

Permutation test for profile shape clustering\. To test whether domain\-level *profile shape* \(not just aggregate level\) clusters by family, we computed pairwise Pearson correlations on the 6\-domain ipsative profile vectors for all 33 models \(528 pairs\), then compared mean within\-family correlation to mean between\-family correlation\. Observed within\-family $r=.380$; observed between\-family $r=.089$; observed difference = \.291\. Under a permutation null \(10,000 shuffles of family labels\), the observed difference is larger than all 10,000 null realisations \($p<.0001$\)\. Family\-level profile clustering is therefore not a selection artefact of picking illustrative pairs\. However, the effect is carried by three of six families with sufficient $n$\. Anthropic \($n=8$, 28 pairs, mean $r=.455$\), Google\-Gemini \($n=7$, 21 pairs, mean $r=.511$\), and Qwen \($n=4$, 6 pairs, mean $r=.472$\) show strong within\-family profile similarity\. DeepSeek \($n=3$, mean $r=.125$\), Google\-Gemma \($n=5$, mean $r=.207$\), and OpenAI \($n=5$, mean $r=.086$\) do not\. The aggregate “partly family\-structured” claim is therefore accurate: some families produce consistent profile shapes across models while others produce heterogeneous profiles despite similar training pipelines\.
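The permutation procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: it assumes the test statistic is mean within-family minus mean between-family Pearson correlation, and that the p-value is the fraction of label shuffles whose statistic reaches the observed value. All names and the toy profiles are hypothetical.

```python
import random
from itertools import combinations

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # assumes neither profile is flat

def ipsative(profile):
    """Center a domain profile on the model's own mean."""
    m = sum(profile) / len(profile)
    return [v - m for v in profile]

def within_minus_between(pair_r, labels):
    within = [r for (i, j), r in pair_r.items() if labels[i] == labels[j]]
    between = [r for (i, j), r in pair_r.items() if labels[i] != labels[j]]
    return sum(within) / len(within) - sum(between) / len(between)

def family_permutation_test(profiles, families, n_perm=10000, seed=0):
    centered = [ipsative(p) for p in profiles]
    # Pairwise correlations are computed once; only the labels are shuffled.
    pair_r = {(i, j): pearson(centered[i], centered[j])
              for i, j in combinations(range(len(centered)), 2)}
    observed = within_minus_between(pair_r, families)
    rng = random.Random(seed)
    labels = list(families)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if within_minus_between(pair_r, labels) >= observed:
            hits += 1
    return observed, hits / n_perm

# Hypothetical toy data: two families of three models, six domains each.
toy_profiles = [
    [0.85, 0.70, 0.66, 0.72, 0.65, 0.71],
    [0.80, 0.68, 0.60, 0.69, 0.62, 0.70],
    [0.88, 0.75, 0.70, 0.74, 0.69, 0.73],
    [0.60, 0.62, 0.70, 0.58, 0.66, 0.80],
    [0.55, 0.60, 0.66, 0.57, 0.61, 0.78],
    [0.63, 0.61, 0.72, 0.60, 0.68, 0.83],
]
toy_families = ["A", "A", "A", "B", "B", "B"]
toy_observed, toy_p = family_permutation_test(toy_profiles, toy_families, n_perm=2000)
```

Pearson correlation is unchanged by subtracting a constant from either vector, so the ipsative centering here documents intent rather than altering the result. A common alternative estimator adds one to both the hit count and the number of shuffles, which avoids reporting a p-value of exactly zero.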

### 3\.5 Generational trajectories

Figure [5](https://arxiv.org/html/2605.06673#S3.F5) shows aggregate AUROC trajectories for the three families with multi\-generation data\.

![Refer to caption](https://arxiv.org/html/2605.06673v1/x5.png)

Figure 5: Aggregate Type\-2 AUROC across generations for three families\. Left: Anthropic \(Opus 4\.1–4\.7, Sonnet 4–4\.6, Haiku 4\.5\)\. Center: Gemma \(3 1B through 27B, then Gemma 4 31B\)\. Right: DeepSeek \(V3\.1, V3\.2, R1\)\.

Anthropic\. Eight models spanning four generations and three tiers\. Opus 4\.1 \(\.708\) to Opus 4\.6 \(\.806\): \+\.098\. Opus 4\.6 to Opus 4\.7 \(\.792\): \-\.014\. Sonnet 4 \(\.773\) to Sonnet 4\.5 \(\.795\) to Sonnet 4\.6 \(\.777\): non\-monotonic\. Haiku 4\.5 \(\.771\) is competitive with all Sonnet and most Opus models\. Metacognitive quality improved substantially from the 4\.1 to 4\.5 generation, then plateaued\. The current generation \(4\.6–4\.7\) shows no improvement over 4\.5\.

Google\-Gemma\. Gemma 3 1B \(\.498\), 3 4B \(\.528\), 3 12B \(\.504\), 3 27B \(\.569\): weak monitoring with a modest scale effect\. Gemma 3 1B’s domain CIs are the tightest in the dataset \(all widths < \.06\), confirming that the near\-chance estimates are precise, not noisy\. The model has no metacognitive signal\. Gemma 4 31B \(\.771\): a \+\.202 jump from Gemma 3 27B\. Its domain profile shows strong differentiation: Applied \.869 \[\.781, \.941\], Formal \.812 \[\.646, \.982\], Social \.806 \[\.707, \.895\]\. This is the largest single\-generation improvement in the dataset\. Whatever changed between Gemma 3 and Gemma 4 dramatically improved metacognitive monitoring\.

DeepSeek\. V3\.1 \(\.716\) to V3\.2 \(\.734\) to R1 \(\.769\): steady improvement\. The reasoning\-trained model \(R1\) has the best metacognition in the family under verbalized confidence, despite being classified Invalid on binary KEEP/WITHDRAW probes in the Classical Minds battery \[Cacioli, [2026a](https://arxiv.org/html/2605.06673#bib.bib4)\]\.

### 3\.6 Probe\-format specificity

Three models classified Invalid on the Classical Minds battery \(binary KEEP/WITHDRAW probes\) were classified Valid on MMLU \(verbalized confidence 0–100\)\. Table [5](https://arxiv.org/html/2605.06673#S3.T5) summarises their tier reassignment and domain\-level AUROC range\.

Table 5: Three models classified Invalid on binary KEEP/WITHDRAW probes that produce valid profiles under verbalized 0–100 confidence on MMLU\. L and Fp are validity indices from Cacioli \[[2026a](https://arxiv.org/html/2605.06673#bib.bib4)\]\.

DeepSeek\-R1’s domain profile is unremarkable: Applied \.843 \[\.774, \.902\], Factual \.682 \[\.535, \.822\], Formal \.745 \[\.579, \.867\], Humanities \.785 \[\.626, \.922\], Science \.694 \[\.541, \.832\], Social \.814 \[\.725, \.893\]\. No domain falls below chance\. The model that showed catastrophic inversion on binary probes \(accuracy dropping to 11\.3% at 10% coverage under selective prediction\) monitors normally under verbalized confidence\.

Gemma 3 1B was classified Indeterminate on the Classical Minds battery \(RBS CI spans zero\) and Invalid on MMLU \(\.498, chance\)\. It is the only model with uninformative confidence across both probe formats\.

These results confirm the claim in Cacioli \[[2026e](https://arxiv.org/html/2605.06673#bib.bib5)\] that validity is a property of the model\-probe\-task interaction, not an intrinsic model property\.

A note on cross\-study comparison\. Cacioli \[[2026e](https://arxiv.org/html/2605.06673#bib.bib5)\] reports MMLU AUROCs for 18 of these models computed on a 500\-item stratified subsample with median\-binarised confidence as part of a cross\-benchmark validity replication\. The atlas uses the full 1,500\-item stratified sample with continuous 0–100 confidence\. Per\-model AUROCs therefore differ slightly between the two reports \(e\.g\., R1: \.768 vs\. \.769; Qwen Think: \.718 vs\. \.745; Opus 4\.6: \.836 vs\. \.806\)\. The qualitative conclusions replicate: all three battery\-Invalid models shift to Valid under verbalized confidence; Gemma 3 1B fails under both\.

### 3\.7 Anomalous profiles

GPT\-oss\-120B\. Accuracy \.897\. Confidence SD 21\.3 \(highest in the dataset\)\. Aggregate AUROC \.530 \(second lowest among models with $>80\%$ accuracy\)\. The domain profile confirms uninformative monitoring: Applied \.549 \[\.481, \.613\], Factual \.521 \[\.432, \.595\], Formal \.616 \[\.468, \.736\], Humanities \.527 \[\.418, \.628\], Science \.436 \[\.326, \.540\], Social \.503 \[\.415, \.586\]\. The Science CI excludes \.50 on the upper side, confirming below\-chance monitoring\. This model expresses highly variable confidence that does not track correctness\. High variance without discrimination is uninformative\. This contrasts with GPT\-oss\-20B \(\.793 aggregate, SD 8\.2\), which has lower confidence variance but far better discrimination\.

Gemini 2\.5 Pro\. The widest within\-model profile: Applied \.889 \[\.802, \.960\], Science \.485 \[\.469, \.497\]\. A \.404 spread\. The Applied CI lower bound \(\.802\) exceeds the Science CI upper bound \(\.497\), confirming this is a genuine difference, not sampling noise\. This model monitors professional knowledge with strong discrimination but has near\-chance monitoring of its own scientific knowledge\.

GLM\-5\. An unusual profile shape: Humanities \.818 \[\.705, \.963\], Social \.790, but Applied \.602, Factual \.576, Science \.555 \[\.389, \.852\]\. This is the only model where Humanities substantially exceeds Applied\. The wide CIs on GLM\-5 \($n=598$\) warrant caution, though the Humanities\-Applied difference \(\.216\) is large\.

### 3\.8 Validation analyses

Split\-half stability, aggregate level\. Items were randomly partitioned into two halves \(seed = 42\) and aggregate AUROC was computed on each half independently for all 33 models\. The cross\-model Pearson correlation between half\-1 and half\-2 AUROC is $r=.893$ \($p<.001$; Figure [6](https://arxiv.org/html/2605.06673#S3.F6)\), indicating that aggregate AUROC rank ordering is reproducible at half the per\-model sample size\. Three models \(Opus 4\.7, Gemini 3 Flash, GLM\-5\) show larger half\-to\-half discrepancies than their peers; GLM\-5’s instability is expected given its smaller $N$ \(598\)\.

![Refer to caption](https://arxiv.org/html/2605.06673v1/x6.png)

Figure 6: Split\-half stability of aggregate Type\-2 AUROC across 33 models\. Each point is one model; axes are AUROC on half\-1 and half\-2 of the item pool \(random partition, seed = 42\)\. Cross\-model $r=.893$ \($p<.001$\)\.

Split\-half stability, profile level\. The aggregate reliability above does not directly address whether the *profile* \(the shape of the six\-domain vector within a single model\) is reliable\. To test this, items within each domain were randomly split 50/50 for each model, per\-domain AUROC was computed on each half, and the Pearson correlation between the two 6\-domain profile vectors was recorded\. We repeated this procedure for 100 random splits per model and took the per\-model median\. The grand median of these per\-model median correlations is \.184 \(grand mean = \.135; 61% of models show median $r>0$; 27% show median $r>.3$\)\. Profile shape is therefore noticeably less reliable than the aggregate summary: for a large minority of models, which specific domain the model looks best or worst on varies substantially across random halves of the item pool\. This parallels the construct\-validity limitation flagged by the subject\-coherence analysis below\. Individual cells of Table [2](https://arxiv.org/html/2605.06673#S3.T2) should be read as estimates with non\-trivial within\-model sampling noise, particularly in high\-accuracy models with few errors per domain and in weak models whose cells sit near chance\. The extremum ordering in §3\.3 \(Applied at the top; Formal/Science at the bottom\) remains robust because it aggregates across 33 models\. Fine\-grained rankings within an individual model’s profile should be read against the bootstrap CIs in Supplementary Table S1\.
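The profile-level split-half procedure can be sketched for a single model as below. This is a minimal sketch under stated assumptions rather than the authors' code: splits in which any half-cell has an undefined AUROC, or in which either half-profile is flat, are skipped (the text does not say how such splits were handled), and AUROC is computed with a pairwise count rather than scikit-learn. Names and toy data are hypothetical.

```python
import random
from statistics import median

def auroc(items):
    """items: list of (confidence, correct). None if only one class is present."""
    pos = [c for c, y in items if y]
    neg = [c for c, y in items if not y]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # flat profile: correlation undefined
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def profile_split_half(items_by_domain, n_splits=100, seed=42):
    """Median Pearson r between half-1 and half-2 domain profiles for one model."""
    rng = random.Random(seed)
    domains = sorted(items_by_domain)
    rs = []
    for _ in range(n_splits):
        half_a, half_b = [], []
        for d in domains:
            items = list(items_by_domain[d])
            rng.shuffle(items)  # random 50/50 split within the domain
            mid = len(items) // 2
            half_a.append(auroc(items[:mid]))
            half_b.append(auroc(items[mid:]))
        if None in half_a or None in half_b:
            continue  # assumed handling: skip splits with undefined cells
        r = pearson(half_a, half_b)
        if r is not None:
            rs.append(r)
    return median(rs) if rs else None

# Hypothetical toy model: 40 items per domain, 30 correct and 10 incorrect.
toy_items = {
    name: [(50 + (7 * i + 3 * k) % 40 + (10 if i % 4 else 0), i % 4 != 0)
           for i in range(40)]
    for k, name in enumerate(
        ["applied", "factual", "formal", "humanities", "science", "social"])
}
toy_median_r = profile_split_half(toy_items, n_splits=20)
```

Applying this per model and taking the median of the per-model medians would give a grand median comparable in kind to the \.184 reported above.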

Cross\-benchmark consistency\. Twenty of the 33 models were also evaluated on the Classical Minds battery \[Cacioli, [2026f](https://arxiv.org/html/2605.06673#bib.bib3), [b](https://arxiv.org/html/2605.06673#bib.bib6)\]\. Battery AUROC was computed on ordinal confidence derived from the binary KEEP/WITHDRAW \+ BET/NO BET probes\. Because the two benchmarks differ in substrate \(524 cross\-domain items vs\. 1,500 stratified MMLU items\), elicitation format \(binary dual probes vs\. verbalized 0–100\), and binarisation \(none vs\. none\), we expect qualitative rather than quantitative agreement\. Three qualitative patterns emerge in Figure [7](https://arxiv.org/html/2605.06673#S3.F7)\. First, battery\-Valid models \($n=14$\) occupy similar AUROC bands on both benchmarks \(\.56–\.72 battery; \.62–\.81 MMLU\), with tier assignments preserved in every case\. Linear correlation within this subset is low \(Pearson $r=.030$\), reflecting range restriction once the weak and pathological models are removed, not disagreement between benchmarks\. Second, the three battery\-Invalid models \(DeepSeek\-R1, Gemini 3\.1 Pro, Qwen Think\) sit far above the identity line: at chance or below on the binary probe but at \.74–\.77 under verbalized confidence\. Third, the three battery\-Indeterminate models show weaker signals under both probe formats\. The cross\-benchmark comparison therefore supports tier\-level preservation with probe\-format\-specific AUROC levels, rather than scalar agreement, consistent with the probe\-format specificity finding in §3\.6\.

![Refer to caption](https://arxiv.org/html/2605.06673v1/x7.png)

Figure 7: Cross-benchmark consistency between Classical Minds battery AUROC and MMLU atlas AUROC for 20 overlapping models. X-axis has a break to accommodate DeepSeek-R1 near chance on the binary probe. Marker shape indicates battery validity tier.

Subject-level coherence. Within-domain subject coherence was tested by computing the correlation of per-subject AUROC patterns across pairs of subjects within the same domain versus pairs from different domains. The within-domain similarity ratio is 0.95: MMLU subjects within a mapped domain are not statistically more similar to each other than to subjects in other domains. The domain mapping groups subjects by cognitive demand, not by empirically cohesive latent construct. This is a limitation of the domain taxonomy, not of the domain-level AUROCs reported here. The domain means are stable statistics even when the domains themselves lack internal cohesion.
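One way to read the similarity ratio is sketched here. The text does not spell out the formula, so this sketch assumes the ratio is the mean pairwise Pearson correlation among same-domain subjects divided by the mean among different-domain subjects, with each subject represented by its vector of per-model AUROCs. The function names, the subject labels, and the toy values are hypothetical and are not taken from the atlas data.

```python
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def similarity_ratio(subject_profiles, subject_domain):
    """subject_profiles: {subject: [AUROC per model]}
    subject_domain:   {subject: domain label}
    Assumed definition: mean within-domain r / mean between-domain r."""
    within, between = [], []
    for a, b in combinations(sorted(subject_profiles), 2):
        r = pearson(subject_profiles[a], subject_profiles[b])
        if subject_domain[a] == subject_domain[b]:
            within.append(r)
        else:
            between.append(r)
    return (sum(within) / len(within)) / (sum(between) / len(between))


# Toy case: four hypothetical subjects, three hypothetical models.
profiles = {
    "subj_a": [0.60, 0.70, 0.80],
    "subj_b": [0.55, 0.65, 0.75],
    "subj_c": [0.60, 0.80, 0.70],
    "subj_d": [0.50, 0.70, 0.60],
}
domains = {"subj_a": "X", "subj_b": "X", "subj_c": "Y", "subj_d": "Y"}
ratio = similarity_ratio(profiles, domains)
print(round(ratio, 2))
```

In the toy case same-domain subjects correlate perfectly and cross-domain subjects correlate at 0.5, so the ratio is about 2. Under this assumed definition, a value near 1, such as the reported 0.95, means that sharing a domain label does not make two subjects' AUROC patterns more alike.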

## 4 Discussion

### 4.1 Metacognitive monitoring is domain-structured

The central descriptive finding is that every model with above-chance aggregate monitoring shows non-trivial domain-level variation. The variation is not reducible to sampling error (split-half $r = .893$ at the aggregate level; §3.8). Applied/Professional exceeds Formal Reasoning in 26 of 33 models and occupies a top-2 rank within model in 21 of 33. The ordering appears across eight model families, four Anthropic generations, and models ranging from 1B to 480B parameters. Where exceptions occur they are concentrated in models whose aggregate AUROC sits near chance (Gemma 3 1B/4B/12B, GPT-oss-120B), where neither value is reliably above .55.

Why is Applied/Professional knowledge easiest to monitor? Professional knowledge items — law, medicine, accounting — carry precise factual answers grounded in specific training corpora, producing more distinctive internal representations when the model knows versus does not know the answer\. Formal Reasoning items — abstract algebra, formal logic, mathematical proofs — are adversarial by nature\. A model can execute every computational step correctly and still reach the wrong answer, or reach the right answer through pattern\-matching without genuine understanding\. The confidence signal tracks knowledge retrieval quality more reliably than reasoning validity\.

One candidate account, consistent with Kim [[2026](https://arxiv.org/html/2605.06673#bib.bib10)]'s Knowledge Landscape hypothesis, is that well-learned factual associations create distinctive higher-variance activation patterns in the metacognitive locus region (60–90% of network depth) that produce a read-out signal when the model is confident about a retrieval, while novel reasoning problems traverse less differentiated representational territory. Under this account, Applied items are more likely to produce distinctive "knows" and "doesn't know" states for the monitoring signal to read, while Formal items produce more similar activation patterns regardless of whether the final answer is correct. This is a hypothesis: nothing in the data here analyzes hidden states, entropy, or log-probabilities. A second candidate account is that the Applied advantage reflects training-data structure (professional knowledge is heavily represented in instruction-tuning corpora) rather than any architectural feature of metacognitive monitoring. We cannot adjudicate between these accounts. See §4.7 for further limitations on causal interpretation.

### 4.2 Family structure

The Anthropic family shows the highest mean aggregate metacognitive quality \(\.778\) and the narrowest range \(\.708–\.806\) in this sample\. This consistency across eight models, four generations, and three model tiers is a descriptive pattern that invites — but does not by itself establish — a causal inference about training pipelines\. It is not driven by scale: Haiku 4\.5 \(\.771\) nearly matches Opus 4\.6 \(\.806\)\. Families differ on many dimensions \(scale, training data, alignment regime, release date\), and the present observational comparison cannot isolate any one of them\.

The OpenAI family shows the widest range (.530–.793). GPT-oss-20B (.793) dramatically outperforms GPT-5.4 (.649) and the much larger GPT-oss-120B (.530). The open-source 20B model has better self-monitoring than either the flagship or the 6×-larger open model. Within this family, training recipe matters more than scale for metacognitive quality.

### 4.3 Generational dynamics

The Gemma 3 to Gemma 4 leap \(\+\.202 AUROC\) is the largest single\-generation improvement in the dataset\. It provides the clearest evidence that metacognitive quality can be substantially changed by training without a change in architecture class\. Whatever methodological change produced Gemma 4 also produced usable confidence signals where Gemma 3 had none\.

The Anthropic 4\.1 to 4\.5 generation showed substantial improvement \(\+\.098\) followed by a plateau\. The 4\.6 and 4\.7 generations show no further gain on aggregate AUROC\. Two readings fit the pattern: a ceiling effect on the current family of training methods, or post\-training optimization prioritizing other capabilities \(agentic tool use, long\-context reasoning\) over confidence calibration in the 4\.5\-to\-4\.7 transition\.

### 4.4 Probe-format specificity

The shift in classification for R1, Gemini 3.1 Pro, and Qwen Think from Invalid (under binary KEEP/WITHDRAW probes) to Valid (under verbalized confidence) is a methodological finding with broad implications. A model classified as having "no metacognitive signal" on one evaluation may produce usable confidence under a different measurement approach. The binary KEEP/WITHDRAW probe is a more conservative test: it forces a categorical commitment, while verbalized confidence (0–100) allows graded expression. Models that cannot express uncertainty in binary can express it on a continuous scale. We avoid the stronger word "rehabilitation" deliberately: the models have not been shown to have valid metacognition *in general*, only under the second probe format.

For deployment, the probe format must match the deployment interface: a model that fails the binary screen may still produce usable confidence when asked for a number\. For evaluation, claims about a model’s metacognitive capacity must be qualified by the measurement method\. “This model has no metacognitive signal” should always specify “under this probe, on this task\.”

### 4.5 Confidence variance without discrimination

GPT-oss-120B illustrates a failure mode not captured by variance-based metrics. Confidence SD of 21.3 is the highest in the dataset, indicating rich uncertainty expression. But AUROC of .530 means the variance does not track correctness. The model expresses different confidence levels on different items, but the differences are not informative. This parallels the saturation finding from Cacioli [[2026g](https://arxiv.org/html/2605.06673#bib.bib7)] in reverse: saturation compresses all confidence to the ceiling, leaving no variance; GPT-oss-120B produces maximum variance but no signal. Both are uninformative, for different reasons.

This has practical implications\. Confidence SD is sometimes used as a quick proxy for monitoring quality, under the assumption that meaningful self\-assessment requires expressing a range\. SD is necessary but not sufficient\. High SD with near\-chance AUROC is diagnostic of a model that expresses uncertainty but does not know when it should\.
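The dissociation between spread and discrimination can be made concrete with two hand-built confidence vectors. This is an illustrative sketch only: the six-item vectors are invented for the example and the pairwise AUROC estimator is one standard formulation, not necessarily the one used in the atlas pipeline.

```python
from statistics import pstdev


def type2_auroc(confidence, correct):
    """P(confidence on a correct item > confidence on an incorrect item); ties count 0.5."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


correct = [1, 1, 1, 0, 0, 0]

# Wide spread, but correct and incorrect items receive the same confidence values.
wide_conf = [10, 50, 90, 10, 50, 90]

# Narrow spread, but every correct item outranks every incorrect item.
narrow_conf = [72, 74, 76, 64, 66, 68]

for name, conf in [("wide", wide_conf), ("narrow", narrow_conf)]:
    print(name, "SD =", round(pstdev(conf), 1), "AUROC =", type2_auroc(conf, correct))
```

The wide vector has the larger SD yet sits exactly at chance, while the narrow vector discriminates perfectly. Some spread is required for any ranking to exist, but the size of the spread says nothing about whether the ranking tracks correctness.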

### 4.6 Deployment implications

The practical implication is that aggregate AUROC is insufficient for domain-specific deployment reasoning. A model with .65 aggregate AUROC may score .85 on the benchmark domain relevant to a given application, or .50. The atlas profile provides a benchmark-stage indication of where a given model's confidence signal is stronger and weaker; it does not replace domain-specific deployment evaluation on the target task, which may differ in format, distribution, and stakes. Treating benchmark-domain AUROC as a screening input, not a deployment certification, is the safer framing.
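A benchmark-stage screen of this kind might look like the following sketch. The cell values are the Gemini 2.5 Pro rows of Supplementary Table S1; the decision rule (compare the bootstrap CI bounds with chance) and the three labels are assumptions made for illustration, not a rule proposed in this paper.

```python
def screen_domains(cells, chance=0.5):
    """cells: {domain: (auroc, ci_lo, ci_hi)} -> {domain: label}.
    Hypothetical rule: a domain passes only if its whole CI lies above chance."""
    labels = {}
    for domain, (auroc, lo, hi) in cells.items():
        if lo > chance:
            labels[domain] = "above chance"
        elif hi <= chance:
            labels[domain] = "at or below chance"
        else:
            labels[domain] = "inconclusive"
    return labels


# Gemini 2.5 Pro cells as reported in Supplementary Table S1.
gemini_2_5_pro = {
    "Applied": (0.889, 0.802, 0.960),
    "Factual": (0.619, 0.482, 0.778),
    "Human.":  (0.623, 0.446, 0.841),
    "Social":  (0.754, 0.632, 0.884),
    "Formal":  (0.573, 0.478, 0.778),
    "Science": (0.485, 0.469, 0.497),
}

labels = screen_domains(gemini_2_5_pro)
for domain, label in labels.items():
    print(f"{domain:8s} {label}")
```

Under this rule the same model is flagged as above chance for Applied and Social, inconclusive for Factual, Human., and Formal, and at or below chance for Science. A single aggregate number would hide that split. The output is a screening input for choosing where to run target-task evaluation, not a deployment decision.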

### 4.7 Limitations

Single benchmark. MMLU only. Whether domain-level profiles replicate on other benchmarks is an open question. The domain mapping groups MMLU subjects by cognitive demand, but the mapping is a priori and has not been validated against factor-analytic structure.

Verbalized confidence. A single elicitation method. Probe-format specificity (§3.6) demonstrates that measurement method modulates metacognitive quality. Profiles under binary probes, Likert scales, or logprob-based confidence may differ.

Greedy decoding. Temperature = 0 throughout. Sampling with temperature $> 0$ may produce different confidence distributions and domain-level patterns.

Estimation precision. Median bootstrap CI width is .199 across 198 cells. This is adequate for identifying large domain-level differences (e.g., Gemini 2.5 Pro Applied-Science spread of .404) but insufficient for resolving small differences between adjacent domains. The 34% of cells with CI width exceeding .25 are concentrated in high-accuracy models where few errors produce sparse contingency tables. Domain-level comparisons within these models should be interpreted with appropriate caution. Full CIs are reported in Supplementary Table S1.

Partial runs. Twelve models have fewer than 1,500 items due to API instability during the Kaggle Benchmarks run. Minimum included: GLM-5 at 598 items. Opus 4.7 and Gemma 3 1B are missing only one item each (1,499 of 1,500). The two most incomplete models, GLM-5 (598 items) and Gemini 2.5 Pro (981 items), produce wider bootstrap CIs but both remain above-chance on aggregate monitoring. Gemma 4 26B A4B was excluded entirely (repeated API-side failure at item-level parsing).

No causal mechanism. We report domain-level variation but do not explain why Applied is easier to monitor than Formal. Kim [[2026](https://arxiv.org/html/2605.06673#bib.bib10)]'s Knowledge Landscape hypothesis offers a theoretical account via in-computation metacognition, and Kumaran et al. [[2026](https://arxiv.org/html/2605.06673#bib.bib11)]'s cached-retrieval representations of verbal confidence offer a complementary mechanism, but no direct evidence connects activation geometry to domain-level AUROC.

### 4.8 Future directions

The atlas raises several empirical questions that the present design cannot answer\.

Can domain-level monitoring weaknesses be repaired? Gemini 2.5 Pro's Applied-Science spread (.404) and Sonnet 4.6's Social dip (.694) could be architectural limits (the monitoring representation cannot capture certain domains) or training artefacts (the post-training regime under-weighted feedback on those domains). Targeted supervised fine-tuning on held-out items from the weak domain, with confidence elicitation in the training signal, would discriminate between these accounts. The Gemma 3 → Gemma 4 leap (+.202) shows training-based repair is at minimum possible at the aggregate level.

Does the profile transport across benchmarks? The subject-coherence result (§3.8) implies the MMLU-domain grouping is not a latent construct. The natural next question is whether domain profiles correlate across benchmarks with non-overlapping items. GPQA, LiveCodeBench, and an extended Classical Minds battery would each probe a different slice of this question.

Human baseline. No human data is collected on these items. Whether the Applied-easy / Formal-hard hierarchy reflects anything human-like in the structure of the items, or is a property specific to LLM monitoring, is unknowable from this study alone.

Causal mechanism. Hidden-state analyses on open-weight models (Gemma 4 31B, GPT-oss-20B, Qwen Think) could test whether Applied items produce more discriminable activations than Formal items, directly evaluating the Knowledge Landscape hypothesis at the domain level.

Prospective validation. The atlas is a one-shot measurement. Test-retest reliability (running each model on a second, non-overlapping MMLU subsample) would establish whether profile shape is stable over repeated elicitation or drifts.

## 5 Conclusion

Metacognitive monitoring quality in frontier LLMs is not a single number\. It varies by benchmark domain, by family, and by generation\. Applied/Professional knowledge is reliably the easiest MMLU\-domain bin to monitor, and Formal Reasoning / Natural Science are reliably the hardest, across 33 models from eight families\. The Anthropic family produces the most consistent aggregate metacognitive quality \(\.708–\.806 across eight models\) and, alongside Google\-Gemini and Qwen, shows significant within\-family profile\-shape similarity\. Gemma 4 represents a substantial generational gain over Gemma 3 \(\+\.202 AUROC\)\. Three models classified Invalid under binary KEEP/WITHDRAW probes produce normal profiles under verbalized confidence, confirming probe\-format specificity\.

The practical principle: screen before you interpret, profile before you deploy. The atlas is the profile. Screening [Cacioli, [2026e](https://arxiv.org/html/2605.06673#bib.bib5)] is the prerequisite. Domain-specific deployment evaluation is the next step on the target task.

## Open science

The benchmark notebook, item-level data (47,151 observations across 33 models), analysis pipeline, and figure-generation code are publicly available at [https://github.com/synthiumjp/metacognitive-profile-atlas](https://github.com/synthiumjp/metacognitive-profile-atlas) under MIT (code) and CC-BY-4.0 (data) licenses. The benchmark is deployed on the Kaggle Benchmarks platform with a public leaderboard. The MMLU domain mapping is specified in the notebook and documented in this paper. The validity screening tool used for Stage A screening [Cacioli, [2026e](https://arxiv.org/html/2605.06673#bib.bib5)] is available as a separate dependency: `pip install validity-screen` (source: [https://github.com/synthiumjp/validity-scaling-llm](https://github.com/synthiumjp/validity-scaling-llm)). Bootstrap 95% CIs for all 198 model-domain cells are reported in Supplementary Table S1 and archived as `atlas_bootstrap_cis.csv` in the repository.

## Generative AI disclosure

Claude \(Anthropic\) was used for benchmark design, analysis pipeline development, and assisting in manuscript preparation\. All scientific decisions, domain mappings, and interpretive conclusions were made by the author\.

## References

- J. Cacioli (2026a) Before you interpret the profile: Validity scaling for LLM metacognitive self-report. arXiv preprint arXiv:2604.17707.
- J. Cacioli (2026b) Concurrent criterion validation of a validity screen for LLM confidence signals via selective prediction. arXiv preprint arXiv:2604.17716.
- J. Cacioli (2026c) Domain-specific metacognitive efficiency in large language models: A Type-2 signal detection theory analysis. arXiv preprint arXiv:2603.25112.
- J. Cacioli (2026d) LLMs as signal detectors: Sensitivity, bias, and the temperature-criterion analogy. arXiv preprint arXiv:2603.14893.
- J. Cacioli (2026e) Screen before you interpret: A portable validity protocol for benchmark-based LLM confidence signals. arXiv preprint arXiv:2604.17714.
- J. Cacioli (2026f) The Metacognitive Monitoring Battery: A cross-domain benchmark for LLM self-monitoring. arXiv preprint arXiv:2604.15702.
- J. Cacioli (2026g) Verbal confidence saturation in 3-9B open-weight instruction-tuned LLMs: A pre-registered psychometric validity screen. Manuscript in preparation.
- F. G. Haznitrama, F. R. Ardi, and A. Oh (2026) A neuropsychologically grounded evaluation of LLM cognitive abilities. arXiv preprint arXiv:2603.02540.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021) Measuring massive multitask language understanding. International Conference on Learning Representations.
- S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. (2022) Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- J. Kim (2026) Knowing before speaking: In-computation metacognition precedes verbal confidence in large language models. Preprints.org, posted 3 April 2026. Not peer-reviewed. [Document](https://dx.doi.org/10.20944/preprints202604.0078.v2).
- D. Kumaran, A. Conmy, F. Barbero, S. Osindero, V. Patraucean, and P. Veličković (2026) How do LLMs compute verbal confidence? arXiv preprint arXiv:2603.17839.
- G. J. Larrabee (2012) Forensic neuropsychology: A scientific approach. 2nd edition, Oxford University Press.
- M. M. Miao and L. Ungar (2026) Closing the confidence-faithfulness gap in large language models. arXiv preprint arXiv:2603.25052.
- M. Steyvers and M. A. K. Peters (2025) Metacognition and uncertainty communication in humans and large language models. Current Directions in Psychological Science.
- B. Wen, H. Bansal, S. J. Semnani, and M. S. Lam (2025) Know your limits: A survey of abstention in large language models. Transactions of the Association for Computational Linguistics 13, pp. 529–556.
- S. Wu, F. K. Gustafsson, E. Phillips, B. Gao, A. Thakur, and D. A. Clifton (2026) BAS: A decision-theoretic approach to evaluating large language model confidence. arXiv preprint arXiv:2604.03216.
- M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi (2023) Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

## Appendix A Supplementary Tables

Bootstrap 95% CIs (1,000 resamples, seed = 42) for all 198 model-domain AUROC cells. Median CI width = .199. The table is archived as `data/atlas_bootstrap_cis.csv` in the repository.
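For readers who want to reproduce intervals of this kind, a percentile bootstrap over items can be sketched as follows. The resample count and seed match the values stated above; everything else is an assumption made for illustration, including the choice of the simple percentile method, unstratified resampling of items, the skipping of resamples that contain only one outcome class, and the toy 60-item cell.

```python
import random


def type2_auroc(items):
    """items: list of (confidence, correct). Ties count 0.5; None if one class is missing."""
    pos = [c for c, ok in items if ok]
    neg = [c for c, ok in items if not ok]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(items, n_resamples=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for Type-2 AUROC, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(items)
    stats = []
    for _ in range(n_resamples):
        sample = [items[rng.randrange(n)] for _ in range(n)]
        auc = type2_auroc(sample)
        if auc is not None:  # assumption: skip degenerate resamples
            stats.append(auc)
    if not stats:
        return None
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


def clip(x):
    return min(100.0, max(0.0, x))


# Toy cell: 48 correct and 12 incorrect items with modestly separated confidence.
rng = random.Random(1)
cell = [(clip(rng.gauss(65, 15)), True) for _ in range(48)]
cell += [(clip(rng.gauss(50, 15)), False) for _ in range(12)]

point = type2_auroc(cell)
lo, hi = bootstrap_ci(cell)
print(f"AUROC = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Because the interval width is driven largely by the number of incorrect items, a high-accuracy model with only a handful of errors in a 250-item domain can have a wider interval than a weaker model with many errors.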

Table S1: Bootstrap 95% confidence intervals for all 198 model-domain AUROC cells (1,000 resamples, seed = 42). Median CI width across all cells = .199.

| Model | Domain | n | AUROC | CI lo | CI hi |
|---|---|---|---|---|---|
| Opus 4.6 | Applied | 250 | 0.847 | 0.761 | 0.918 |
| Opus 4.6 | Factual | 250 | 0.818 | 0.680 | 0.936 |
| Opus 4.6 | Human. | 250 | 0.704 | 0.419 | 0.939 |
| Opus 4.6 | Social | 250 | 0.873 | 0.803 | 0.937 |
| Opus 4.6 | Formal | 250 | 0.743 | 0.456 | 0.976 |
| Opus 4.6 | Science | 250 | 0.758 | 0.595 | 0.900 |
| Opus 4.5 | Applied | 250 | 0.821 | 0.735 | 0.897 |
| Opus 4.5 | Factual | 250 | 0.795 | 0.669 | 0.911 |
| Opus 4.5 | Human. | 250 | 0.678 | 0.393 | 0.919 |
| Opus 4.5 | Social | 250 | 0.867 | 0.761 | 0.940 |
| Opus 4.5 | Formal | 250 | 0.787 | 0.565 | 0.954 |
| Opus 4.5 | Science | 250 | 0.764 | 0.633 | 0.894 |
| Sonnet 4.5 | Applied | 250 | 0.849 | 0.779 | 0.909 |
| Sonnet 4.5 | Factual | 250 | 0.709 | 0.545 | 0.860 |
| Sonnet 4.5 | Human. | 250 | 0.715 | 0.442 | 0.913 |
| Sonnet 4.5 | Social | 250 | 0.801 | 0.710 | 0.887 |
| Sonnet 4.5 | Formal | 250 | 0.752 | 0.455 | 0.992 |
| Sonnet 4.5 | Science | 250 | 0.758 | 0.580 | 0.905 |
| Opus 4.7 | Applied | 250 | 0.853 | 0.778 | 0.924 |
| Opus 4.7 | Factual | 249 | 0.726 | 0.581 | 0.852 |
| Opus 4.7 | Human. | 250 | 0.692 | 0.353 | 0.959 |
| Opus 4.7 | Social | 250 | 0.825 | 0.736 | 0.898 |
| Opus 4.7 | Formal | 250 | 0.780 | 0.582 | 0.941 |
| Opus 4.7 | Science | 250 | 0.817 | 0.667 | 0.935 |
| Sonnet 4.6 | Applied | 250 | 0.887 | 0.820 | 0.937 |
| Sonnet 4.6 | Factual | 250 | 0.780 | 0.635 | 0.889 |
| Sonnet 4.6 | Human. | 250 | 0.760 | 0.573 | 0.922 |
| Sonnet 4.6 | Social | 250 | 0.694 | 0.589 | 0.784 |
| Sonnet 4.6 | Formal | 250 | 0.788 | 0.528 | 0.975 |
| Sonnet 4.6 | Science | 250 | 0.782 | 0.605 | 0.914 |
| Sonnet 4 | Applied | 250 | 0.828 | 0.762 | 0.887 |
| Sonnet 4 | Factual | 250 | 0.683 | 0.545 | 0.820 |
| Sonnet 4 | Human. | 250 | 0.756 | 0.622 | 0.865 |
| Sonnet 4 | Social | 250 | 0.794 | 0.711 | 0.867 |
| Sonnet 4 | Formal | 250 | 0.735 | 0.544 | 0.893 |
| Sonnet 4 | Science | 250 | 0.786 | 0.662 | 0.897 |
| Haiku 4.5 | Applied | 250 | 0.840 | 0.764 | 0.907 |
| Haiku 4.5 | Factual | 250 | 0.720 | 0.604 | 0.824 |
| Haiku 4.5 | Human. | 250 | 0.793 | 0.654 | 0.904 |
| Haiku 4.5 | Social | 250 | 0.728 | 0.640 | 0.822 |
| Haiku 4.5 | Formal | 250 | 0.805 | 0.661 | 0.927 |
| Haiku 4.5 | Science | 250 | 0.706 | 0.555 | 0.837 |
| Opus 4.1 | Applied | 250 | 0.838 | 0.784 | 0.893 |
| Opus 4.1 | Factual | 250 | 0.615 | 0.452 | 0.780 |
| Opus 4.1 | Human. | 250 | 0.711 | 0.473 | 0.921 |
| Opus 4.1 | Social | 250 | 0.781 | 0.676 | 0.870 |
| Opus 4.1 | Formal | 250 | 0.691 | 0.520 | 0.835 |
| Opus 4.1 | Science | 250 | 0.636 | 0.530 | 0.755 |
| DeepSeek-R1 | Applied | 250 | 0.843 | 0.774 | 0.902 |
| DeepSeek-R1 | Factual | 250 | 0.682 | 0.535 | 0.822 |
| DeepSeek-R1 | Human. | 250 | 0.785 | 0.626 | 0.922 |
| DeepSeek-R1 | Social | 250 | 0.814 | 0.725 | 0.893 |
| DeepSeek-R1 | Formal | 250 | 0.745 | 0.579 | 0.867 |
| DeepSeek-R1 | Science | 250 | 0.694 | 0.541 | 0.832 |
| DeepSeek V3.2 | Applied | 228 | 0.713 | 0.620 | 0.811 |
| DeepSeek V3.2 | Factual | 221 | 0.769 | 0.685 | 0.844 |
| DeepSeek V3.2 | Human. | 229 | 0.698 | 0.567 | 0.815 |
| DeepSeek V3.2 | Social | 231 | 0.669 | 0.578 | 0.754 |
| DeepSeek V3.2 | Formal | 231 | 0.769 | 0.608 | 0.896 |
| DeepSeek V3.2 | Science | 231 | 0.697 | 0.543 | 0.824 |
| DeepSeek V3.1 | Applied | 234 | 0.756 | 0.677 | 0.817 |
| DeepSeek V3.1 | Factual | 229 | 0.690 | 0.569 | 0.794 |
| DeepSeek V3.1 | Human. | 230 | 0.717 | 0.596 | 0.832 |
| DeepSeek V3.1 | Social | 225 | 0.636 | 0.545 | 0.725 |
| DeepSeek V3.1 | Formal | 228 | 0.797 | 0.697 | 0.887 |
| DeepSeek V3.1 | Science | 234 | 0.626 | 0.514 | 0.740 |
| Gemini 3.1 Pro | Applied | 227 | 0.841 | 0.704 | 0.947 |
| Gemini 3.1 Pro | Factual | 215 | 0.773 | 0.631 | 0.907 |
| Gemini 3.1 Pro | Human. | 228 | 0.764 | 0.547 | 1.000 |
| Gemini 3.1 Pro | Social | 225 | 0.776 | 0.656 | 0.876 |
| Gemini 3.1 Pro | Formal | 230 | 0.645 | 0.469 | 0.972 |
| Gemini 3.1 Pro | Science | 221 | 0.640 | 0.465 | 0.814 |
| Gemini 3 Flash | Applied | 250 | 0.800 | 0.711 | 0.884 |
| Gemini 3 Flash | Factual | 250 | 0.673 | 0.511 | 0.826 |
| Gemini 3 Flash | Human. | 250 | 0.662 | 0.490 | 0.828 |
| Gemini 3 Flash | Social | 250 | 0.794 | 0.690 | 0.881 |
| Gemini 3 Flash | Formal | 250 | 0.681 | 0.509 | 0.863 |
| Gemini 3 Flash | Science | 250 | 0.690 | 0.529 | 0.865 |
| Gemini 2.5 Flash | Applied | 228 | 0.808 | 0.726 | 0.883 |
| Gemini 2.5 Flash | Factual | 219 | 0.765 | 0.635 | 0.885 |
| Gemini 2.5 Flash | Human. | 231 | 0.644 | 0.472 | 0.813 |
| Gemini 2.5 Flash | Social | 228 | 0.748 | 0.652 | 0.843 |
| Gemini 2.5 Flash | Formal | 230 | 0.637 | 0.539 | 0.755 |
| Gemini 2.5 Flash | Science | 225 | 0.664 | 0.530 | 0.797 |
| Gemini 2.5 Pro | Applied | 165 | 0.889 | 0.802 | 0.960 |
| Gemini 2.5 Pro | Factual | 172 | 0.619 | 0.482 | 0.778 |
| Gemini 2.5 Pro | Human. | 160 | 0.623 | 0.446 | 0.841 |
| Gemini 2.5 Pro | Social | 150 | 0.754 | 0.632 | 0.884 |
| Gemini 2.5 Pro | Formal | 163 | 0.573 | 0.478 | 0.778 |
| Gemini 2.5 Pro | Science | 171 | 0.485 | 0.469 | 0.497 |
| Gemini 3.1 FLite | Applied | 232 | 0.829 | 0.745 | 0.899 |
| Gemini 3.1 FLite | Factual | 233 | 0.752 | 0.635 | 0.858 |
| Gemini 3.1 FLite | Human. | 223 | 0.686 | 0.507 | 0.866 |
| Gemini 3.1 FLite | Social | 230 | 0.636 | 0.546 | 0.732 |
| Gemini 3.1 FLite | Formal | 227 | 0.566 | 0.504 | 0.641 |
| Gemini 3.1 FLite | Science | 223 | 0.549 | 0.469 | 0.639 |
| Gemini 2.0 FLite | Applied | 250 | 0.688 | 0.627 | 0.753 |
| Gemini 2.0 FLite | Factual | 250 | 0.692 | 0.590 | 0.791 |
| Gemini 2.0 FLite | Human. | 250 | 0.662 | 0.542 | 0.774 |
| Gemini 2.0 FLite | Social | 250 | 0.555 | 0.476 | 0.638 |
| Gemini 2.0 FLite | Formal | 250 | 0.667 | 0.597 | 0.741 |
| Gemini 2.0 FLite | Science | 250 | 0.621 | 0.533 | 0.709 |
| Gemini 2.0 Flash | Applied | 250 | 0.758 | 0.686 | 0.819 |
| Gemini 2.0 Flash | Factual | 250 | 0.732 | 0.626 | 0.830 |
| Gemini 2.0 Flash | Human. | 250 | 0.663 | 0.494 | 0.808 |
| Gemini 2.0 Flash | Social | 250 | 0.503 | 0.426 | 0.582 |
| Gemini 2.0 Flash | Formal | 250 | 0.572 | 0.500 | 0.648 |
| Gemini 2.0 Flash | Science | 250 | 0.606 | 0.516 | 0.704 |
| Gemma 4 31B | Applied | 250 | 0.869 | 0.781 | 0.941 |
| Gemma 4 31B | Factual | 250 | 0.663 | 0.526 | 0.790 |
| Gemma 4 31B | Human. | 250 | 0.737 | 0.580 | 0.885 |
| Gemma 4 31B | Social | 250 | 0.806 | 0.707 | 0.895 |
| Gemma 4 31B | Formal | 250 | 0.812 | 0.646 | 0.982 |
| Gemma 4 31B | Science | 250 | 0.710 | 0.544 | 0.872 |
| Gemma 3 27B | Applied | 250 | 0.597 | 0.535 | 0.658 |
| Gemma 3 27B | Factual | 250 | 0.674 | 0.597 | 0.748 |
| Gemma 3 27B | Human. | 250 | 0.590 | 0.506 | 0.680 |
| Gemma 3 27B | Social | 250 | 0.528 | 0.469 | 0.581 |
| Gemma 3 27B | Formal | 250 | 0.569 | 0.499 | 0.636 |
| Gemma 3 27B | Science | 250 | 0.598 | 0.546 | 0.651 |
| Gemma 3 4B | Applied | 244 | 0.536 | 0.482 | 0.587 |
| Gemma 3 4B | Factual | 241 | 0.564 | 0.489 | 0.637 |
| Gemma 3 4B | Human. | 240 | 0.506 | 0.445 | 0.566 |
| Gemma 3 4B | Social | 240 | 0.488 | 0.433 | 0.544 |
| Gemma 3 4B | Formal | 239 | 0.541 | 0.477 | 0.607 |
| Gemma 3 4B | Science | 241 | 0.566 | 0.507 | 0.628 |
| Gemma 3 12B | Applied | 242 | 0.533 | 0.467 | 0.606 |
| Gemma 3 12B | Factual | 237 | 0.614 | 0.530 | 0.699 |
| Gemma 3 12B | Human. | 243 | 0.522 | 0.410 | 0.631 |
| Gemma 3 12B | Social | 247 | 0.464 | 0.390 | 0.527 |
| Gemma 3 12B | Formal | 234 | 0.552 | 0.498 | 0.607 |
| Gemma 3 12B | Science | 244 | 0.500 | 0.426 | 0.569 |
| Gemma 3 1B | Applied | 249 | 0.495 | 0.466 | 0.525 |
| Gemma 3 1B | Factual | 250 | 0.501 | 0.471 | 0.530 |
| Gemma 3 1B | Human. | 250 | 0.500 | 0.479 | 0.522 |
| Gemma 3 1B | Social | 250 | 0.490 | 0.463 | 0.512 |
| Gemma 3 1B | Formal | 250 | 0.503 | 0.475 | 0.526 |
| Gemma 3 1B | Science | 250 | 0.502 | 0.480 | 0.522 |
| GPT-oss-20B | Applied | 250 | 0.769 | 0.708 | 0.831 |
| GPT-oss-20B | Factual | 250 | 0.766 | 0.676 | 0.849 |
| GPT-oss-20B | Human. | 250 | 0.803 | 0.697 | 0.903 |
| GPT-oss-20B | Social | 250 | 0.827 | 0.748 | 0.900 |
| GPT-oss-20B | Formal | 250 | 0.758 | 0.594 | 0.901 |
| GPT-oss-20B | Science | 250 | 0.709 | 0.601 | 0.821 |
| GPT-5.4 mini | Applied | 227 | 0.743 | 0.650 | 0.825 |
| GPT-5.4 mini | Factual | 216 | 0.797 | 0.715 | 0.866 |
| GPT-5.4 mini | Human. | 230 | 0.723 | 0.597 | 0.838 |
| GPT-5.4 mini | Social | 228 | 0.604 | 0.515 | 0.691 |
| GPT-5.4 mini | Formal | 232 | 0.572 | 0.498 | 0.654 |
| GPT-5.4 mini | Science | 223 | 0.725 | 0.635 | 0.812 |
| GPT-5.4 | Applied | 250 | 0.818 | 0.738 | 0.885 |
| GPT-5.4 | Factual | 250 | 0.777 | 0.677 | 0.883 |
| GPT-5.4 | Human. | 250 | 0.640 | 0.387 | 0.879 |
| GPT-5.4 | Social | 250 | 0.747 | 0.636 | 0.845 |
| GPT-5.4 | Formal | 250 | 0.539 | 0.457 | 0.624 |
| GPT-5.4 | Science | 250 | 0.649 | 0.537 | 0.762 |
| GPT-5.4 nano | Applied | 250 | 0.633 | 0.560 | 0.700 |
| GPT-5.4 nano | Factual | 250 | 0.749 | 0.678 | 0.816 |
| GPT-5.4 nano | Human. | 250 | 0.726 | 0.650 | 0.798 |
| GPT-5.4 nano | Social | 250 | 0.486 | 0.413 | 0.556 |
| GPT-5.4 nano | Formal | 250 | 0.548 | 0.475 | 0.617 |
| GPT-5.4 nano | Science | 250 | 0.615 | 0.542 | 0.690 |
| GPT-oss-120B | Applied | 250 | 0.549 | 0.481 | 0.613 |
| GPT-oss-120B | Factual | 250 | 0.521 | 0.432 | 0.595 |
| GPT-oss-120B | Human. | 250 | 0.527 | 0.418 | 0.628 |
| GPT-oss-120B | Social | 250 | 0.503 | 0.415 | 0.586 |
| GPT-oss-120B | Formal | 250 | 0.616 | 0.468 | 0.736 |
| GPT-oss-120B | Science | 250 | 0.436 | 0.326 | 0.540 |
| Qwen Think | Applied | 250 | 0.726 | 0.614 | 0.829 |
| Qwen Think | Factual | 250 | 0.680 | 0.522 | 0.815 |
| Qwen Think | Human. | 250 | 0.714 | 0.539 | 0.881 |
| Qwen Think | Social | 250 | 0.768 | 0.688 | 0.842 |
| Qwen Think | Formal | 250 | 0.644 | 0.448 | 0.844 |
| Qwen Think | Science | 250 | 0.757 | 0.623 | 0.874 |
| Qwen Coder | Applied | 250 | 0.650 | 0.577 | 0.715 |
| Qwen Coder | Factual | 250 | 0.716 | 0.637 | 0.787 |
| Qwen Coder | Human. | 250 | 0.723 | 0.603 | 0.817 |
| Qwen Coder | Social | 250 | 0.616 | 0.543 | 0.694 |
| Qwen Coder | Formal | 250 | 0.576 | 0.457 | 0.689 |
| Qwen Coder | Science | 250 | 0.696 | 0.610 | 0.789 |
| Qwen 80B Inst | Applied | 250 | 0.647 | 0.569 | 0.727 |
| Qwen 80B Inst | Factual | 250 | 0.678 | 0.578 | 0.778 |
| Qwen 80B Inst | Human. | 250 | 0.716 | 0.553 | 0.853 |
| Qwen 80B Inst | Social | 250 | 0.691 | 0.620 | 0.762 |
| Qwen 80B Inst | Formal | 250 | 0.585 | 0.517 | 0.663 |
| Qwen 80B Inst | Science | 250 | 0.582 | 0.483 | 0.670 |
| Qwen 235B | Applied | 250 | 0.644 | 0.579 | 0.711 |
| Qwen 235B | Factual | 250 | 0.622 | 0.521 | 0.722 |
| Qwen 235B | Human. | 250 | 0.711 | 0.599 | 0.825 |
| Qwen 235B | Social | 250 | 0.650 | 0.553 | 0.752 |
| Qwen 235B | Formal | 250 | 0.567 | 0.476 | 0.657 |
| Qwen 235B | Science | 250 | 0.637 | 0.569 | 0.691 |
| GLM-5 | Applied | 84 | 0.602 | 0.215 | 0.891 |
| GLM-5 | Factual | 111 | 0.576 | 0.355 | 0.788 |
| GLM-5 | Human. | 91 | 0.818 | 0.705 | 0.963 |
| GLM-5 | Social | 100 | 0.790 | 0.645 | 0.909 |
| GLM-5 | Formal | 102 | 0.609 | 0.386 | 1.000 |
| GLM-5 | Science | 110 | 0.554 | 0.389 | 0.852 |