The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
Summary
A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.
Cached at: 04/20/26, 08:28 AM
# A Cross-Domain Benchmark for LLM Self-Monitoring

Source: https://arxiv.org/html/2604.15702

## The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

Jon-Paul Cacioli
Independent Researcher
Melbourne, Australia
ORCID: 0009-0000-7054-2014
[email protected]

###### Abstract

We introduce a cross-domain behavioral assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1–T5 were pre-registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson–Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r=.17, 95% CI wide given n=20; exemplar-based evidence is the primary support). Scaling on metacognitive calibration is architecture-dependent: monotonically decreasing (Qwen), monotonically increasing (GPT-5.4), or flat (Gemma). Behavioral findings converge structurally with an independent Type-2 SDT approach, providing preliminary cross-method construct validity. All items, data, and code: https://github.com/synthiumjp/metacognitive-monitoring-battery.
## 1 Introduction

### 1.1 The metacognition gap in LLM evaluation

When a language model answers a factual question, two capacities determine reliability: its ability to produce a correct answer, and its ability to monitor whether that answer is correct. These are different problems requiring different interventions. A model that produces correct answers but cannot distinguish its correct from incorrect outputs is unreliable for selective prediction, human-AI collaboration, or autonomous decision-making. A model that produces fewer correct answers but accurately flags its errors is often the more useful system.

Current evaluation practice does not make this distinction, and no standardized measurement tool exists for quantifying this capacity in LLMs. Standard benchmarks (MMLU, Hendrycks et al., 2021; HumanEval, Chen et al., 2021; BIG-Bench, Srivastava et al., 2023) report accuracy, F1, or pass@k. Under these metrics, a model that reports 100% confidence on every answer is indistinguishable from one that could tell you which specific answers to trust. These benchmarks measure object-level performance (what the model knows) without measuring meta-level monitoring (whether the model knows what it knows).

This gap is beginning to receive attention. Steyvers and Peters (2025) reviewed LLM metacognition using AUROC and the meta-d′ framework. Kadavath et al. (2022) showed that language models can discriminate questions they answer correctly from those they do not. Ackerman (2025) introduced two behavioral paradigms (Delegate Game and Second Chance Game) to evaluate LLMs' strategic deployment of internal confidence signals, finding evidence of limited, context-dependent metacognition. Dai (2026) applied meta-d′ to verbalized confidence ratings and showed that scale design (0–20 vs. the standard 0–100) substantially affects metacognitive sensitivity through round-number discretization. This work collectively establishes that LLMs exhibit something that functions like metacognitive monitoring.
However, the LLM literature borrows cognitive science terminology loosely, without applying the formal frameworks that give these terms their precision. To our knowledge, no prior LLM benchmark has explicitly operationalized the Nelson and Narens (1990) monitoring-control architecture as a benchmark design principle.

### 1.2 The Nelson–Narens monitoring-control framework

Nelson and Narens (1990) proposed that cognitive systems operate at two levels: an object level performing tasks and a meta-level monitoring and controlling it. Two information flows connect them: monitoring carries accuracy information upward; control adjusts behavior downward. The critical insight is that monitoring and control can dissociate. A system can monitor without controlling (receives accuracy information but does not adjust behavior), control without monitoring (applies a blanket policy without discriminating correct from incorrect outputs), or exhibit coupled monitoring-control. This three-way distinction is the theoretical spine of the benchmark.

Koriat and Goldsmith (1996) extended the framework experimentally: in their free-report paradigm, participants answer questions and then decide whether to volunteer or withhold each answer, with the key metric being how well volunteer/withhold decisions track actual accuracy. We adapt this paradigm for LLMs.

A further Nelson–Narens distinction is between retrospective monitoring (after-the-fact accuracy evaluation) and prospective regulation (strategy adjustment before responding); Metcalfe and Kornell (2005) demonstrated these are dissociable in humans. Our battery tests both.

### 1.3 The dual-probe methodology

After each forced-choice response, T1–T5 items administer two probes: "KEEP or WITHDRAW this answer?" and "BET or NO_BET that your answer is correct?" These produce four response-commitment states per item, analyzed independently of task scoring.
The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. A positive delta indicates the commitment decision discriminates correct from incorrect; a delta near zero indicates a blanket policy. T6 uses a different probe structure: before answering, models choose ANSWER_DIRECTLY, REQUEST_HINT, or DECLINE, operationalizing prospective regulation.

Where meta-d′ (Maniscalco & Lau, 2012; Cacioli, 2026b) measures how well an internal signal discriminates correct from incorrect, the withdraw delta measures how well the model's overt commitment decision does. The two are methodologically independent measurements of the same construct.

### 1.4 Present study

Metacognitive efficiency is domain-specific in humans (Fleming et al., 2014; Rouault et al., 2018), so a single-domain test cannot characterize a system's metacognitive capacity. Our battery spans six cognitive domains, each grounded in a distinct experimental paradigm. Tasks T1–T5 each have their own OSF pre-registration and (where available) a companion arXiv paper; T6 was added as an exploratory hackathon-track extension. We evaluate 20 frontier LLMs (10,480 total evaluations).
Our contributions are:

1. **Methodological:** a standardized measurement tool grounded in formal theory (Nelson & Narens, 1990; Koriat & Goldsmith, 1996) with convergent validity against an independent Type-2 SDT approach (Cacioli, 2026b).
2. **Three-profile taxonomy** (blanket confidence, blanket withdrawal, selective sensitivity) consistent with Nelson–Narens coupling states, with profiles fragmenting across domains within individual models.
3. **Dissociation** between retrospective monitoring and prospective regulation: the two measures are correlated r=.17 (wide CI given n=20) and diverge strongly for individual models.
4. **Architecture-dependent scaling** ruling out a universal scaling law: on T2, sensitivity decreases monotonically with scale for Qwen, increases for GPT-5.4, and remains flat for Gemma.

All T1–T5 analyses were pre-registered on OSF; T6 is reported as exploratory. All items, data, and code are publicly archived.

## 2 Method

### 2.1 Benchmark design principles

The battery was designed around four principles. First, each task is grounded in a specific cognitive science paradigm with an established empirical literature. Second, each task includes diagnostic conditions that separate genuine competence from surface heuristics. Third, the five pre-registered tasks (T1–T5) are each paired with an OSF pre-registration and (where available) an arXiv companion paper; T6 was added as an exploratory extension. Fourth, every item carries metacognitive probes enabling the cross-domain analysis that is the battery's primary contribution.

**Table 1: Battery overview. Six tasks, six cognitive domains, 524 total items.**

† T6 was developed within the Kaggle AGI Hackathon 2026 as an exploratory extension. All items, scoring rules, and analytical specifications are archived on OSF.

### 2.2 Task descriptions

Representative items for each task, including the full probe sequence, are provided in Appendix A.
**T1: Learning (98 items).** Nonce-word world testing second-order generalization (Kemp et al., 2007). Eight conditions from first-order retrieval through adversarial foils. Companion paper: Cacioli (2026c).

**T2: Metacognition (90 items).** Signal Detection Theory framework (Green & Swets, 1966). Four conditions: calibration (66 items), prospective monitoring (8 items), error detection (8 items), knowledge boundaries (8 items). Companion paper: Cacioli (2026a).

**T3: Social Cognition (116 items).** Nine conditions from basic mutual exclusivity (Markman & Wachtel, 1988) through scalar implicature, false belief at three orders (Perner & Wimmer, 1985), and irony. Companion paper: Cacioli (2026c).

**T4: Attention (60 items).** Biased competition framework (Desimone & Duncan, 1995). Six conditions testing selective attention under competition.

**T5: Executive Functions (88 items).** Three conditions through magnitude processing (Diamond, 2013): format flexibility (20 items), inhibitory control (43 items), task switching (25 items). Ratio-graded following Weber's Law (Dehaene, 2003). Companion paper: Cacioli (2026d).

**T6: Prospective Regulation (72 items).** Before answering, models choose ANSWER_DIRECTLY (full credit if correct, zero if wrong), REQUEST_HINT (half credit with hint), or DECLINE (quarter credit). Operationalizes Metcalfe and Kornell's (2005) calibrated help-seeking paradigm. Retrospective probes follow. T6 should be interpreted as an assay of prospective regulation under explicit payoff contingencies rather than a comprehensive measure of prospective metacognition. T6 was developed within the Kaggle AGI Hackathon 2026 as an exploratory extension to the pre-registered battery. All items, scoring rules, and analytical specifications are archived on OSF alongside the five pre-registered tasks.

### 2.3 The probe methodology

After every forced-choice answer on T1–T5, two probes are administered.
On T6, the prospective path choice is recorded before the response; retrospective probes follow.

The primary metric for retrospective monitoring is the withdraw delta:

$$\Delta_{\text{withdraw}} = P(\text{WITHDRAW} \mid \text{incorrect}) - P(\text{WITHDRAW} \mid \text{correct})$$

The primary metric for prospective regulation is the ANSWER_DIRECTLY rate and its relationship to accuracy. ANSWER_DIRECTLY is the prospective analog of the KEEP decision: the choice to commit to an answer without external support. A model that answers directly on most items is exhibiting minimal prospective regulation; a model that varies path choice with difficulty is exhibiting the regulatory behavior Metcalfe and Kornell (2005) observed in human study-time allocation. REQUEST_HINT and DECLINE rates are reported in supplementary analyses but are not the primary metric because they can be driven by generalized risk aversion rather than by calibrated difficulty detection.

### 2.4 Models

Twenty frontier LLMs from six provider families were evaluated. Selection enabled within-family scaling comparisons (Qwen: 80B, 235B, 480B; GPT-5.4: nano, mini, 5.4; Gemma: 1B, 12B, 27B), comparisons between reasoning variants (DeepSeek R1 vs V3.2; Qwen Think vs Instruct), and architectural diversity across providers:

- Zhipu AI: GLM-5
- Google: Gemini 3 Flash, 2.5 Flash, 3.1 Pro, 2.5 Pro; Gemma 1B/12B/27B
- Anthropic: Opus 4.6, Sonnet 4.6, Haiku 4.5
- OpenAI: GPT-5.4, mini, nano
- Alibaba: Qwen 3 80B Think/Instruct, 235B, Coder 480B
- DeepSeek: V3.2, R1

### 2.5 Evaluation platform and procedure

All evaluations used the Kaggle Benchmarks platform (kbench SDK). Each item was administered independently with no cross-item context. Scoring was task-specific: accuracy for T1, T3, T4, T5; confidence-accuracy alignment for T2; path-weighted accuracy for T6. Probe responses were recorded separately and analyzed independently.
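As a minimal sketch of two quantities described above, the Python below computes the withdraw delta from §2.3 and a path-weighted T6 score using the credit values listed in §2.2. The `Trial` record, the function names, and the treatment of a wrong answer after a hint (zero credit) are assumptions made for illustration; they are not taken from the battery's released code.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One probed item (hypothetical record layout)."""
    correct: bool   # was the forced-choice answer correct?
    withdrew: bool  # True if the model chose WITHDRAW on the KEEP/WITHDRAW probe


def withdraw_delta(trials):
    """P(WITHDRAW | incorrect) - P(WITHDRAW | correct).

    Returns None when either group is empty, because the conditional
    rate is undefined in that case.
    """
    on_incorrect = [t.withdrew for t in trials if not t.correct]
    on_correct = [t.withdrew for t in trials if t.correct]
    if not on_incorrect or not on_correct:
        return None
    return sum(on_incorrect) / len(on_incorrect) - sum(on_correct) / len(on_correct)


# Credit per T6 path as listed in Section 2.2.
T6_CREDIT = {"ANSWER_DIRECTLY": 1.0, "REQUEST_HINT": 0.5, "DECLINE": 0.25}


def t6_item_score(path, correct):
    """Path-weighted credit for one T6 item.

    Assumption: a wrong answer after REQUEST_HINT earns zero, mirroring
    ANSWER_DIRECTLY. DECLINE earns its fixed credit regardless of `correct`.
    """
    if path == "DECLINE":
        return T6_CREDIT["DECLINE"]
    return T6_CREDIT[path] if correct else 0.0


def t6_score(items):
    """Mean path-weighted credit over (path, correct) pairs."""
    return sum(t6_item_score(path, correct) for path, correct in items) / len(items)


# Toy data: four correct trials (one withdrawn), four incorrect (three withdrawn).
trials = (
    [Trial(True, False)] * 3 + [Trial(True, True)]
    + [Trial(False, True)] * 3 + [Trial(False, False)]
)
delta = withdraw_delta(trials)

items = [
    ("ANSWER_DIRECTLY", True),
    ("ANSWER_DIRECTLY", False),
    ("REQUEST_HINT", True),
    ("DECLINE", False),
]
score = t6_score(items)
print(delta, score)
```

In the toy data the withdrawal rate is 0.75 on incorrect items and 0.25 on correct items, so the delta is 0.50; the four T6 items earn 1.0, 0.0, 0.5, and 0.25, for a mean of 0.4375.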
### 2.6 Profile classification

Each model was classified per track into one of three profiles:

- **Blanket confidence:** KEEP rate ≥ 95% regardless of accuracy.
- **Blanket withdrawal:** KEEP rate ≤ 10% regardless of accuracy (T1–T5) or DECLINE ≥ 90% (T6).
- **Selective sensitivity:** Withdraw delta ≥ +15%.

These thresholds are operational conventions chosen for interpretability, not claims about natural clustering or theoretically privileged cutoffs. The 95%/10%/15% values sit at a stability plateau: all 20 models receive identical classifications at these thresholds, while shifting by ±5 percentage points changes 9–10 of 20. Robustness across a wider threshold range is reported in §3.9.

### 2.7 Analysis plan

Three hypotheses were pre-registered across the five OSF filings covering T1–T5:

**H1:** No single model dominates all six domains.

**H2:** Domain-specific failure profiles are model-specific.

In addition, we report descriptively on which item types discriminate among models (previously pre-registered as H3) and on the exploratory hypothesis tested via T6:

**H4 (exploratory):** Retrospective monitoring and prospective regulation are dissociable.

## 3 Results

### 3.1 Overall performance

Table 2 presents the accuracy matrix. Overall accuracy ranged from 0.547 (Gemma 1B) to 0.952 (GLM-5). H1 was supported: no single model achieved the highest score on all six tracks.

**Table 2: Accuracy for selected models across six tasks.** Full 20-model table at https://kaggle.com/benchmarks/jonpaulcacioli/classical-minds-modern-machines.

T6 has the widest spread (0.824 range). R1's scores on T2 (0.489) and T6 (0.253) appear as accuracy failure but reflect behavioral withdrawal rather than competence failure: R1 declines 98.6% of T6 items and withdraws 91–99% of T1–T5 items.
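The thresholds in §2.6 can be sketched as a small rule-based classifier. This is an illustration under stated assumptions, not the authors' implementation: the function name, the order in which the rules are checked, and the `unclassified` fallback are hypothetical, since the text does not say how overlapping or unmatched cases are resolved.

```python
def classify_profile(keep_rate, withdraw_delta, decline_rate=None):
    """Assign one of the three Section 2.6 profiles for a single track.

    keep_rate      -- fraction of items answered KEEP (0..1)
    withdraw_delta -- P(WITHDRAW|incorrect) - P(WITHDRAW|correct), as a fraction
    decline_rate   -- fraction of T6 items answered DECLINE; None on T1-T5

    Rule order is an assumption: T6 decline rule, then selective
    sensitivity, then the two blanket KEEP-rate rules.
    """
    if decline_rate is not None and decline_rate >= 0.90:
        return "blanket_withdrawal"
    if withdraw_delta >= 0.15:
        return "selective_sensitivity"
    if keep_rate >= 0.95:
        return "blanket_confidence"
    if keep_rate <= 0.10:
        return "blanket_withdrawal"
    # Hypothetical fallback for models that meet none of the thresholds.
    return "unclassified"


# R1-like rates from Section 3.1: withdrawal on at least 91% of T1-T5 items,
# so KEEP is at most 9%. The delta of 0.02 is a placeholder, not a reported value.
r1_t1_t5 = classify_profile(keep_rate=0.09, withdraw_delta=0.02)

# R1 declines 98.6% of T6 items. keep_rate and delta here are placeholders.
r1_t6 = classify_profile(keep_rate=0.50, withdraw_delta=0.0, decline_rate=0.986)

# Purely hypothetical models illustrating the other branches.
always_keep = classify_profile(keep_rate=0.97, withdraw_delta=0.01)
selective = classify_profile(keep_rate=0.70, withdraw_delta=0.30)
middling = classify_profile(keep_rate=0.50, withdraw_delta=0.05)

print(r1_t1_t5, r1_t6, always_keep, selective, middling)
```

Only the 91% withdrawal floor and the 98.6% DECLINE rate are taken from the text; every other input is a placeholder chosen to exercise one branch each.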
### 3.2 Three metacognitive profiles

All values in this section are single-administration point estimates per model; profile assignments should be read as descriptive classifications based on these point estimates rather than as bounded by test-retest confidence intervals.

The probe data revealed three profiles mapping onto the monitoring-control coupling states predicted by Nelson and Narens (1990).

#### Profile A: blanket confidence (consistent with monitoring without control).

Five models exhibited blanket confidence as a stable profile across all six tracks: Gemini 3 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3.1 Pro, and Qwen 80B Think. These responded KEEP on ≥ 95% of T1–T5 items and ANSWER_DIRECTLY on ≥ 97% of T6 items regardless of correctness. Withdraw delta was near zero o