Decomposing and Steering Functional Metacognition in Large Language Models
Summary
This research paper investigates functional metacognition in Large Language Models, demonstrating that internal states like evaluation awareness and self-assessed capability are linearly decodable from residual stream activations. The authors propose a mechanistic framework to steer these states, showing causal control over reasoning behaviors, verbosity, and safety responses.
View Cached Full Text
Cached at: 05/12/26, 07:07 AM
# Decomposing and Steering Functional Metacognition in Large Language Models
Source: [https://arxiv.org/html/2605.08942](https://arxiv.org/html/2605.08942)
###### Abstract\.
Large language models \(LLMs\) increasingly exhibit behaviors suggesting awareness of their evaluation context, often adapting their reasoning strategies in benchmark settings\. Prior work has shown that such evaluation awareness can distort performance measurements; however, it remains unclear whether this phenomenon reflects a single behavioral artifact or a deeper internal structure within the model\. We propose that LLMs maintain a decomposable space of functional metacognitive states—internal variables encoding factors such as evaluation awareness, self\-assessed capability, perceived risk, computational effort allocation, audience expertise adaptation, and intentionality\. Through residual stream analysis across multiple reasoning models, we demonstrate that these states are linearly decodable from internal activations and exhibit distinct layer\-wise profiles\. Moreover, by steering model activations along probe\-derived directions, we show that each functional metacognitive state causally modulates reasoning behavior in dissociable ways, affecting verbosity, accuracy, and safety\-related responses across tasks\. Our findings suggest that benchmark performance reflects not only task competence but also the activation of specific functional metacognitive states\. We argue that understanding and controlling these internal states is essential for reliable evaluation and deployment of reasoning models, and we provide a mechanistic framework for studying functional metacognition in artificial systems\. Our code and data are publicly available at[https://github\.com/xlands/meta\-cognition](https://github.com/xlands/meta-cognition)\.
Figure 1\.Cross\-task decoding accuracy on SimpleQA for five models\. Dashed circle marks the 50% random baseline\. Qwen3\-14B \(orange\) and 30B \(green\) achieve near\-perfect transfer on several dimensions; Llama\-4 \(dashed purple\) hovers near chance\. Per\-model numeric table in Appendix[G](https://arxiv.org/html/2605.08942#A7)\.
Functional Metacognition, Activation Steering, Representation Engineering, Large Language Models\.
††ccs:Computing methodologies Discourse, dialogue and pragmaticsFigure 2\.Overview of the mechanistic framework for studying functional metacognitive states in LLMs\. The process consists of three stages: \(1\) Functional\-Metacognition Probing: Training linear probes on residual stream activations to decode six dimensions under paired framings; \(2\) Causal Intervention: Establishing causality by steering model behavior via activation injection along discovered directions; and \(3\) Generalization and Joint Control: Validating the representations through cross\-task generalization and simultaneous joint steering of multiple independent dimensions\.## 1\.INTRODUCTION
Recent advances in large language models \(LLMs\), particularly those employing explicit chain\-of\-thought reasoning, have revealed a growing discrepancy between benchmark performance and real\-world behavior\([nguyen2025probing,](https://arxiv.org/html/2605.08942#bib.bib1);[linearcontrol,](https://arxiv.org/html/2605.08942#bib.bib2);[tanneru2024hardnessfaithfulchainofthoughtreasoning,](https://arxiv.org/html/2605.08942#bib.bib3);[fodor2025linegoesupinherent,](https://arxiv.org/html/2605.08942#bib.bib4);[turpin2023languagemodelsdontsay,](https://arxiv.org/html/2605.08942#bib.bib5);[shen2025faithcotbenchbenchmarkinginstancelevelfaithfulness,](https://arxiv.org/html/2605.08942#bib.bib6)\)\. While state\-of\-the\-art models achieve near\-saturation scores on standardized evaluations such as MMLU\([hendrycks2020measuring,](https://arxiv.org/html/2605.08942#bib.bib7);[wang2024mmlu,](https://arxiv.org/html/2605.08942#bib.bib8)\)and GSM8K\([cobbe2021training,](https://arxiv.org/html/2605.08942#bib.bib9)\), their reasoning quality, robustness, and safety properties often degrade in more naturalistic settings\. A growing body of work suggests that this gap may be partially attributed to evaluation awareness: models appear capable of recognizing when they are being tested and adapting their behavior accordingly\. In such exam\-like contexts, models tend to exhibit rigid, verbose, and socially desirable reasoning patterns that may obscure their genuine problem\-solving strategies\. Existing studies, however, largely treat evaluation awareness as a monolithic phenomenon, focusing either on behavioral differences or on detecting its presence via probing\. In contrast, human cognition distinguishes between multiple forms of self\-related awareness—such as awareness of being evaluated, confidence in one’s own ability, and perception of potential risks—which jointly modulate reasoning strategies rather than directly determining actions\. This raises a fundamental question: do large language models similarly maintain a structured internal representation of self\-related states that modulate reasoning behavior across tasks? In this work, we move beyond treating evaluation awareness as a single binary variable\. We propose that LLMs encode a decomposable space of functional metacognitive states—internal variables reflecting factors such as evaluation awareness, self\-assessed capability, perceived risk, computational effort, audience expertise, and intentionality—which causally influence how reasoning unfolds\. Crucially, these functional metacognitive states are internal and functional: they are present even when models are not explicitly prompted to reason about themselves, and they are represented at the level of internal activations rather than surface\-level language\.
To test this hypothesis, we operationalize functional metacognition along six dimensions—covering the model’s awareness of*environment*,*self*,*task*, and*audience*—each defined as a minimal binary contrast that modifies only the model’s self\-referential context while holding the task constant\. Our experimental framework proceeds in three stages\. First, we extract residual stream activations under paired framings and train per\-layer linear probes to decode each functional metacognitive dimension\. Across five model scales \(0\.6B, 14B, 30B, 109B, and 235B\), probe accuracy scales dramatically—from 0\.63 to near\-perfect 1\.00—demonstrating that functional metacognitive states become increasingly linearly separable as model capacity grows\. The six probe directions are near\-orthogonal \(max\|cosθ\|<0\.25\|\\cos\\theta\|<0\.25; mean<0\.06<0\.06\), confirming that they span a genuinely multi\-dimensional subspace rather than reflecting a single confound\. Second, we use these probe\-derived directions for causal intervention via activation steering\. Injecting or suppressing each direction in the residual stream produces dimension\-specific behavioral shifts: steering computational effort reduces verbosity by 28% while preserving or improving accuracy; enhancing self\-assessed capability raises task accuracy from 25% to 44%\. Notably, steerability is dimension\-selective—audience expertise, despite being highly decodable, remains representationally present but causally inert—indicating that some functional metacognitive states lie on the causal path to generation while others do not\. Third, to rule out the possibility that probe directions merely encode task\-specific shortcuts rather than genuine internal states, we conduct two complementary experiments\. \(a\)*Cross\-task generalization*: probes trained on mathematical reasoning and knowledge QA are applied, without retraining, to a factual QA benchmark sharing no domain or format overlap\. The probes transfer with a mean accuracy of 81%, with two dimensions reaching 100%—a result incompatible with domain\-specific adapters\. \(b\)*Joint multi\-dimensional steering*: leveraging the near\-orthogonality established earlier, we simultaneously inject all six probe directions into the residual stream via a single superposed intervention\. The number of dimensions that shift in their predicted direction scales with model capacity—from 1/6 at 0\.6B to 5/6 at 30B and 4/6 at 235B—without destructive interference, confirming that the functional metacognitive axes are not merely linearly separable in a correlational sense, but represent genuinely independent causal variables that can be controlled in parallel\. Together, these two results provide converging evidence against the shortcut hypothesis: the first demonstrates task\-generality, the second demonstrates dimensional independence, and their conjunction establishes these directions as genuine functional metacognitive representations\.
Our work makes the following contributions:
- •We introduce a mechanistic framework for studying functional metacognitive states in LLMs, moving beyond behavioral observation or self\-reported awareness to analysis of internal representations\.
- •We demonstrate that multiple self\-related internal variables are linearly decodable, causally controllable, and structurally independent within the residual stream, with representational fidelity scaling with model size\.
- •We provide evidence that these internal states generalize across tasks and can be independently controlled through joint activation steering, ruling out task\-specific shortcut explanations and establishing their status as genuine functional metacognitive representations\.
- •We show that benchmark performance is systematically modulated by these internal states, offering a mechanistic account of evaluation artifacts and implications for reliable model assessment\.
## 2\.Functional Metacognition
We propose that LLMs maintain a decomposable space of functional metacognition\. In this section we first define six operationally distinct dimensions \(§[2\.1](https://arxiv.org/html/2605.08942#S2.SS1)\), then present three key findings from probing experiments: \(1\) linear decodability scales with model size \(§[2\.2](https://arxiv.org/html/2605.08942#S2.SS2)\); \(2\) functional metacognitive representations show distinct layer\-wise localization patterns \(§[2\.3](https://arxiv.org/html/2605.08942#S2.SS3)\); \(3\) the six probe directions are near\-orthogonal, confirming structural independence \(§[2\.4](https://arxiv.org/html/2605.08942#S2.SS4)\)\.
### 2\.1\.Dimensions and Data
We study six dimensions of functional metacognition, covering the model’s awareness of*environment*,*self*,*task*, and*audience*\. Each dimension is defined as a binary contrast between a*positive*\(state activated\) and*negative*\(baseline\) condition \(Table[1](https://arxiv.org/html/2605.08942#S2.T1)\)\.
Table 1\.Six functional metacognitive dimensions with representative prompt framings\. Each pair shares the same base question; only the framing prefix differs\. Dimensions 1–4 use GSM8K base questions; Dimensions 5–6 use MMLU\-Pro\.Prompt construction follows three constraints: \(i\)minimal contrast—paired prompts differ only in self\-referential context; \(ii\)task invariance—the base question and expected answer are identical across each pair; \(iii\)self\-context manipulation only—prompts modify how the model perceives its own role, not the task content\.
For each dimension we construct 200 prompt pairs \(400 labeled examples\)\. Behavioral validation across five models \(0\.6B–235B\) confirms that these interventions produce measurable output shifts, with effect sizes increasing with model scale \(case studies in Appendix[B](https://arxiv.org/html/2605.08942#A2); composite scores in Appendix[C](https://arxiv.org/html/2605.08942#A3)\)\.
##### Probe training\.
For each dimension and model, we extract theprompt last\-token hidden statehl∈ℝdh\_\{l\}\\in\\mathbb\{R\}^\{d\}at every transformer layerll, then train a per\-layerlogistic regression probe\(L2\-regularized,C=1\.0C\\\!=\\\!1\.0\) to classify positive vs\. negative conditions\. We use 80/20 stratified splits; all probes are strictly linear to test whether functional metacognitive states are*explicitly represented*rather than requiring nonlinear extraction\.
### 2\.2\.Finding 1: Scale Dramatically Improves Decodability
Figure[3](https://arxiv.org/html/2605.08942#S2.F3)and Table[2](https://arxiv.org/html/2605.08942#S2.T2)report the best\-layer probe accuracy for each dimension\. The average accuracy rises monotonically from0\.63\(0\.6B\) to0\.85\(14B\) to∼\\sim1\.00\(30B\), demonstrating that larger models form increasingly separable functional metacognitive representations\.
Figure 3\.Best\-layer linear probe accuracy across five models and six functional metacognitive dimensions\. Darker cells indicate higher accuracy\. Red border marks the single below\-chance cell \(Llama\-4, Audience Expertise\)\. Annotations summarize the three main findings\.Table 2\.Best\-layer linear probe accuracy\. Chance=0\.50=0\.50\(balanced binary classification\)\.400 samples per dimension \(200 pairs\)\.
Three insights emerge from the data:
1. \(1\)Metacognitive representations undergo a phase transition with scale\.Below∼\\sim10B parameters, individual dimensions are “scale\-gated”— some remain at chance while others are already partly decodable\. Above this threshold, all six dimensions become robustly separable \(≥0\.80\\geq 0\.80\), reaching near\-perfect linear separability at≥\\geq30B\. This suggests that functional metacognitive representations emerge gradually and consolidate at a critical model capacity\.
2. \(2\)Decodability is a property of representation quality, not model size alone\.A cross\-architecture comparison reveals that scale is necessary but not sufficient: an alternative architecture with comparable parameter count achieves only 0\.74 average accuracy, with one dimension below chance\. Training recipe and architectural choices modulate the quality of metacognitive encoding independently of raw parameter count\.
3. \(3\)Different functional metacognitive dimensions have different representational difficulty\.Across all models, environment\-level states \(evaluation awareness, perceived risk\) are consistently easier to decode than audience\- and intentionality\-level states, which require more model capacity before they crystallize\. This dimension hierarchy is robust across architectures\.
### 2\.3\.Finding 2: Layer\-Wise Localization of Functional Metacognition
Beyond peak accuracy, the*distribution*of probe accuracy across layers reveals how functional metacognitive representations are organized within the network\. Figure[4](https://arxiv.org/html/2605.08942#S2.F4)plots layer\-wise accuracy for three Qwen model scales \(detailed per\-layer tables in Appendix[D](https://arxiv.org/html/2605.08942#A4)\)\.
Figure 4\.Layer\-wise probe accuracy across three model scales\. X\-axis shows relative layer depth \(0 = first layer, 1 = last layer\)\. Dashed gray line marks the 50% chance baseline\. Small models show flat, diffuse profiles; larger models show immediate high accuracy with dimension\-dependent refinement trajectories\.##### Key Patterns\.
1. \(1\)Small models: diffuse, unlocalized\.In Qwen3\-0\.6B \(28 layers\), probe accuracy fluctuates within a narrow band \(0\.45–0\.71\) across all layers with no pronounced peak\. Metacognitive information is weakly encoded everywhere but strongly encoded nowhere\.
2. \(2\)Mid\-scale models: early\-layer concentration\.In Qwen3\-14B \(40 layers\), most dimensions peak sharply at layers 4–6 \(e\.g\., Eval\. Awareness: 0\.86 at L6; Effort: 0\.88 at L4; Risk: 0\.88 at L6\), then decay\. One notable exception is Audience Expertise, which peaks later at layer 18, suggesting semantic\-level audience modeling occurs deeper in the network\.
3. \(3\)Large models: immediate encoding, dimension\-dependent refinement\.In Qwen3\-30B\-A3B \(48 layers\), Computational Effort achieves1\.00 accuracy at layer 0—the very first transformer block output—and maintains it across all 48 layers\. Perceived Risk \(0\.95 at L0\) and Self\-Assessed Capability \(0\.90 at L0\) are also near\-perfect from the start\. In contrast, Audience Expertise rises progressively from 0\.64 \(layer 0\) to 0\.99 \(layer 31\), and Intentionality from 0\.81 \(L0\) to 1\.00 \(L22\), exhibiting gradual refinement trajectories\.
These patterns suggest a hierarchy: “low\-level” metacognitive states \(effort, risk, evaluation\) are encoded early and globally, while “high\-level” states \(audience, intentionality\) require progressive computation through deeper layers\.
### 2\.4\.Finding 3: Near\-Orthogonality of Probe Directions
A critical question is whether the six dimensions are*structurally independent*in activation space, or whether they merely reflect a single underlying factor \(e\.g\., “prompt difficulty”\)\. We compute pairwise cosine similarity between the best\-layer probe weight vectors for all\(62\)=15\\binom\{6\}\{2\}=15dimension pairs\. Figure[5](https://arxiv.org/html/2605.08942#S2.F5)summarizes the results\. Across all five models, the probe directions arenear\-orthogonal: the maximum off\-diagonal\|cosθ\|\|\\cos\\theta\|is 0\.25, while the mean is below 0\.06 in every model \(full matrices in Appendix[E](https://arxiv.org/html/2605.08942#A5)\)\.
Figure 5\.Top: pairwise cosine similarity of probe weight vectors \(Qwen3\-14B\)\. Off\-diagonal values are near zero, confirming structural independence\. Bottom: maximum and mean off\-diagonal\|cosθ\|\|\\cos\\theta\|across all five models\. Even the largest correlation \(0\.25\) is far from collinear\.PCA of the raw mean activation differences \(h¯\+−h¯−\\bar\{h\}\_\{\+\}\-\\bar\{h\}\_\{\-\}\) reveals that the first principal component explains 68–97% of the variance, reflecting a shared “the prompt has been modified” signal; however, the*discriminatively trained*probe vectors isolate dimension\-specific directions*within*this shared manifold, and those directions are orthogonal \(details in Appendix[E\.2](https://arxiv.org/html/2605.08942#A5.SS2)\)\. This confirms that the six dimensions span a genuinely*multi\-dimensional*metacognitive subspace, not a single axis\.
### 2\.5\.Summary
Our probing experiments establish that:
1. \(1\)Metacognitive states arelinearly decodablefrom residual stream activations, with accuracy scaling from 0\.63 \(0\.6B\) to 0\.85 \(14B\) to 1\.00 \(30B/235B\) in the Qwen family\. Llama\-4\-Scout achieves 0\.74, confirming the phenomenon extends across architectures, though with reduced fidelity\.
2. \(2\)These representations exhibitlayer\-wise localization: “low\-level” states \(effort, evaluation\) are encoded immediately, while “high\-level” states \(audience\) require progressive refinement\.
3. \(3\)The six probe directions arenear\-orthogonal\(\|cos\|<0\.08\|\\cos\|<0\.08for 0\.6B;<0\.10<0\.10for 14/15 pairs at 14B;<0\.16<0\.16across all five models\), confirming structural independence\.
These results motivate the causal intervention experiments in §[3](https://arxiv.org/html/2605.08942#S3)and the cross\-task generalization tests in §[4](https://arxiv.org/html/2605.08942#S4)\.
## 3\.Causal Intervention via Activation Steering
Having established that functional metacognitive states are linearly decodable \(§[2](https://arxiv.org/html/2605.08942#S2)\), we now test whether probe\-derived directions carry*causal*information: does perturbing activations along these directions produce the predicted behavioral shifts?
### 3\.1\.Method
Given a probe weight vectorwwtrained on dimensionddat the best\-accuracy layerll, we perform activation steering by modifying the residual stream during generation:
\(1\)hl′=hl\+α⋅w^,w^=w/‖w‖h\_\{l\}^\{\\prime\}\\;=\\;h\_\{l\}\+\\alpha\\cdot\\hat\{w\}\\,,\\qquad\\hat\{w\}=w/\\\|w\\\|whereα\\alphacontrols the sign and strength of the intervention\. We evaluate three conditions:α∈\{−1,0,\+1\}\\alpha\\in\\\{\-1,0,\+1\\\}, corresponding to*suppress*,*baseline*, and*enhance*for the targeted functional metacognitive state\. We evaluate on held\-out GSM8K test questions, generating responses with thinking enabled, and compute the dimension\-specific composite score for each response \(scoring rubric in Appendix[C](https://arxiv.org/html/2605.08942#A3)\)\.
### 3\.2\.Results
Table[3](https://arxiv.org/html/2605.08942#S3.T3)aggregates these shifts into a per\-model overview, and Figure[6](https://arxiv.org/html/2605.08942#S3.F6)provides the per\-dimension breakdown asΔs=sα=\+1−sα=−1\\Delta\_\{s\}=s\_\{\\alpha=\+1\}\-s\_\{\\alpha=\-1\}\.
Table 3\.Model\-level steering summary\. Mean\|Δs\|\|\\Delta\_\{s\}\|is the average absolute effect across six dimensions; \# steerable counts dimensions with\|Δs\|≥0\.10\|\\Delta\_\{s\}\|\\geq 0\.10\. Scoring rubric in Appendix[C](https://arxiv.org/html/2605.08942#A3); per\-dimension composite scores in Appendix[F\.1](https://arxiv.org/html/2605.08942#A6.SS1)\.Figure 6\.Per\-dimension steering effectΔs\\Delta\_\{s\}across all five models\. Blue = suppression yields higher score; red = enhancement yields higher score\.
### 3\.3\.Key Findings
Three principal conclusions emerge from the steering experiments:
1. \(1\)Steerability is dimension\-selective, not scale\-dependent\.Computational Effort is the most robustly steerable dimension, producing positiveΔs\\Delta\_\{s\}from 0\.6B through 30B while preserving or improving task accuracy\. By contrast, Audience Expertise remains unsteerable \(\|Δs\|<0\.05\|\\Delta\_\{s\}\|<0\.05\) despite near\-perfect decoding\. This reveals a fundamental distinction: some functional metacognitive states lie on the causal path to generation, while others are*encoded but “read\-only”*—present in the residual stream but not utilized by downstream layers\.
2. \(2\)Steering modulates reasoning quality, not just style\.Self\-Assessed Capability steering at 14B raises GSM8K accuracy from 25% to 44% \(Δs=\+0\.29\\Delta\_\{s\}=\+0\.29\), with every sub\-metric shifting in the predicted direction \(Appendix[F](https://arxiv.org/html/2605.08942#A6)\)\. Computational Effort steering at 30B reduces output length by 28% while*improving*accuracy from 81% to 88%\. These are not cosmetic formatting changes—the probe directions modulate the model’s reasoning process itself\.
3. \(3\)Inverted effects are informative, not confounding\.Evaluation Awareness shows a consistent invertedΔs\\Delta\_\{s\}at 14B and 30B, where positive steering increases informality\. This systematic pattern—reproducible across scales—indicates a polarity mismatch between the probe direction and the behavioral metric, not a failure of the probe\. It underscores that the internal representation of “being evaluated” need not align with a single behavioral axis\.
### 3\.4\.Summary
Activation steering confirms that probe\-derived directions carry causal information: perturbing the residual stream along these directions produces dimension\-specific, predicted behavioral shifts\. The key insight is the dissociation between decodability and steerability—Audience Expertise is decodable but unsteerable, revealing that the model*encodes*this state without*acting on*it during generation\. These results establish that at least a subset of metacognitive representations are not merely correlational artifacts but are positioned on the causal path from internal state to behavioral output\.
These findings motivate the cross\-task generalization tests in §[4](https://arxiv.org/html/2605.08942#S4), which address whether the probe directions encode task\-general states rather than domain\-specific shortcuts\.
## 4\.From Correlation to Cognition: Generalization and Joint Control
The preceding chapters establish two facts: functional metacognitive states are*linearly decodable*\(§[2](https://arxiv.org/html/2605.08942#S2)\) and*causally effective*\(§[3](https://arxiv.org/html/2605.08942#S3)\)\. However, a crucial alternative explanation remains: the probe directions might encode task\-specific shortcuts—surface heuristics tied to the training distribution rather than genuine internal states\. If so, they would behave like hidden LoRA\-style adapters, effective only within the domain from which they were extracted\.
This chapter presents two experiments designed to adjudicate between thefunctional\-metacognition hypothesisand theadapter hypothesis:
1. \(1\)Cross\-task generalization\(§[4\.1](https://arxiv.org/html/2605.08942#S4.SS1)\): probes trained on mathematical and knowledge\-intensive tasks are applied, without retraining, to a factual QA benchmark they have never seen\.
2. \(2\)Joint multi\-dimensional steering\(§[4\.2](https://arxiv.org/html/2605.08942#S4.SS2)\): all six probe directions are simultaneously injected into the residual stream, leveraging their near\-orthogonality \(§[2\.4](https://arxiv.org/html/2605.08942#S2.SS4)\)\.
Both experiments yield affirmative results, providing converging evidence that the identified directions encode task\-general functional metacognitive states\.
### 4\.1\.Cross\-Task Generalization
##### Protocol\.
Probes are trained on GSM8K and MMLU\-Pro prompt pairs \(§[2\.1](https://arxiv.org/html/2605.08942#S2.SS1.SSS0.Px1)\)\. For evaluation, we construct a new test bed fromSimpleQA\([simpleqa,](https://arxiv.org/html/2605.08942#bib.bib12)\)—a factual QA benchmark with 4,320 short\-answer questions covering science, history, politics, sports, and more—that shares*no*overlap in domain or format with the training tasks\. For each dimension, we apply the same framing templates \(Appendix[A](https://arxiv.org/html/2605.08942#A1)\) to SimpleQA questions, producing matched positive/negative prompt pairs\. We then extract prompt\-last\-token hidden states from the trained probe’s layer and classify them without any fine\-tuning or adaptation\.
##### Result: Probe Directions Transfer Across Tasks\.
FigureLABEL:fig:cross\-task\-radarshows the cross\-task decoding accuracy for each dimension, averaged across five models\. Four of six dimensions achieve mean accuracy≥75%\\geq 75\\%, well above the50%50\\%random baseline, and two dimensions—Audience ExpertiseandIntentionality—reach≥90%\\geq 90\\%, indicating near\-perfect transfer\.
##### Scale and Architecture Dependence\.
Across the Qwen family, mean cross\-task accuracy peaks at 14B \(0\.940\.94—*exceeding*its in\-distribution training accuracy of0\.850\.85\) and declines modestly at larger scales, driven primarily by dimensions whose best probes reside in very early layers \(l≤1l\\leq 1\)\. Llama\-4\-Scout achieves only0\.540\.54, consistent with its weaker in\-distribution probe accuracy \(§[2\.2](https://arxiv.org/html/2605.08942#S2.SS2)\)\. Full per\-model numbers appear in Appendix[G](https://arxiv.org/html/2605.08942#A7)\.
##### Interpretation\.
A task\-specific adapter would fail to classify stimuli from a novel domain: a “GSM8K formatting detector” cannot distinguish formal from casual framing on factual trivia\. The high cross\-task accuracy within the Qwen family \(x¯=0\.81\\bar\{x\}=0\.81\) therefore demonstrates that the probe directions encode a*task\-general internal variable*—the signature expected of a functional metacognitive state\. The lower transfer in Llama\-4 suggests that the representations are less well\-formed in this architecture rather than absent\.
### 4\.2\.Joint Multi\-Dimensional Steering
##### Motivation\.
Experiment 2 \(§[2\.4](https://arxiv.org/html/2605.08942#S2.SS4)\) established that the six probe directions are approximately orthogonal \(max\|cosθ\|<0\.25\|\\cos\\theta\|<0\.25; mean<0\.06<0\.06\)\. If these directions truly represent*independent*metacognitive axes, it should be possible to*simultaneously*inject all six into the residual stream without destructive interference\. By contrast, if the directions were merely different projections of a single task\-specific feature \(e\.g\., “prompt has been modified”\), superposing them would produce no additional effect beyond what a single direction achieves, or would cause degenerate behavior\.
##### Protocol\.
For each dimensiondd, we load the best\-layer probe vectorwdw\_\{d\}, normalize it \(w^d=wd/‖wd‖\\hat\{w\}\_\{d\}=w\_\{d\}/\\\|w\_\{d\}\\\|\), and register a forward hook at its corresponding layer:
\(2\)hld′=hld\+α∑d:ld=lw^dh\_\{l\_\{d\}\}^\{\\prime\}=h\_\{l\_\{d\}\}\+\\alpha\\sum\_\{d:\\,l\_\{d\}=l\}\\hat\{w\}\_\{d\}whereα∈\{0,1\}\\alpha\\in\\\{0,1\\\}is the global scaling factor applied uniformly to all dimensions\. We evaluate on SimpleQA \(n=16n\\\!=\\\!16\) with thinking enabled, computing the per\-dimension composite score for every response\. If the dimensions are truly independent, we expect*each*dimension’s composite score to shift in its predicted direction underα=1\\alpha\\\!=\\\!1\.
##### Result: Orthogonal Directions Enable Independent Control\.
Figure[7](https://arxiv.org/html/2605.08942#S4.F7)shows the composite score change \(Δs=sα=1−sα=0\\Delta\_\{s\}=s\_\{\\alpha=1\}\-s\_\{\\alpha=0\}\) under joint six\-dimensional steering for all five models\. On the 30B model,five of six dimensions shift in the predicted positive direction simultaneously; the 235B model follows closely with four of six positive shifts, including the largest single\-dimension effect across all models \(ΔAud\.=\+0\.85\\Delta\_\{\\text\{Aud\.\}\}=\+0\.85\)\. The effect is scale\-dependent: the number of positively shifting dimensions increases from 1/6 at 0\.6B to 3/6 at 14B to 5/6 at 30B\. Llama\-4\-Scout achieves 3/6 positive shifts with mixed directionality, consistent with its weaker probe quality \(§[2\.2](https://arxiv.org/html/2605.08942#S2.SS2)\)\.
Figure 7\.Change in per\-dimension composite score under joint six\-dimensional steering \(α=0→1\\alpha\\\!=\\\!0\\to 1\) across five models\. Positive dims increase with scale \(0\.6B: 1/6→\\to30B: 5/6\); 235B follows at 4/6\. Llama\-4\-Scout shows mixed effects \(3/6\)\.
##### Qualitative Shift\.
Inspecting response samples from the 30B and 235B models reveals a coherent behavioral transformation under joint steering:
- •Baseline\(α=0\\alpha\\\!=\\\!0\): Responses are verbose \(avg\. 669 words on 30B; 577 on 235B\), occasionally off\-topic, and lack structural formatting\.
- •Joint\-steered\(α=1\\alpha\\\!=\\\!1\): Responses become more concise \(−5%\-5\\%on 30B;−9%\-9\\%on 235B\), consistently include “Answer:” labels and`\\boxed\{\}`formatting, and stay on\-topic with structured reasoning\.
This combined effect—simultaneously more formal \(Evaluation Awareness\), more direct \(Intentionality\), more cautious \(Perceived Risk\), more concise \(Computational Effort\), and more confident \(Self\-Assessed Capability\)—is difficult to attribute to any single task\-specific shortcut\. It is, however, the natural prediction of simultaneously activating multiple independent functional metacognitive states\. Full per\-model composite score changes and additional case studies are provided in Appendix[H](https://arxiv.org/html/2605.08942#A8)\.
### 4\.3\.Adapter Hypothesis vs\. Functional\-Metacognition Hypothesis
Table[4](https://arxiv.org/html/2605.08942#S4.T4)summarizes how each experimental finding discriminates between the two competing hypotheses\.
Table 4\.Empirical predictions of the adapter hypothesis vs\. the functional\-metacognition hypothesis and observed results\. ✓= prediction confirmed; ✗ = prediction refuted\.##### Discussion\.
Five of six observations are consistent with the metacognition hypothesis but inconsistent with the adapter hypothesis\. The one finding compatible with both—shallow\-layer probes generalizing poorly—actually*strengthens*the metacognition interpretation: it suggests that early layers encode task\-surface features \(consistent with an adapter\), while deeper layers encode task\-general functional metacognitive variables\. The distinction between “decodable at layer 0” and “decodable at layer 22” thus marks the boundary between task encoding and metacognitive encoding within the same model\.
##### Convergence of Evidence\.
Across Chapters[2](https://arxiv.org/html/2605.08942#S2)–[4](https://arxiv.org/html/2605.08942#S4), the evidence converges:
1. \(1\)Linear probes achieve high accuracy→\\tofunctional metacognitive states are*represented*\.
2. \(2\)Activation steering shifts behavior→\\tothese states are*causally active*\.
3. \(3\)Cross\-task transfer succeeds→\\tothese states are*task\-general*\.
4. \(4\)Joint steering works→\\tothese states are*independently controllable*\.
Together, these four properties constitute strong evidence that the six dimensions identified in Chapter[2](https://arxiv.org/html/2605.08942#S2)are not artifacts of the training distribution but rather genuine functional metacognitive variables represented in the model’s internal geometry\.
### 4\.4\.Summary
1. \(1\)Cross\-task generalization:Probes trained on math and knowledge QA transfer to SimpleQA with a mean accuracy of 81% across the Qwen family, far exceeding the 50% baseline, and reaching 100% for multiple model–dimension pairs\. Llama\-4\-Scout shows weaker transfer \(54%\), consistent with its weaker in\-distribution decoding\. These results rule out task\-specific shortcut explanations for the Qwen models\.
2. \(2\)Joint multi\-dimensional steering:Simultaneously injecting all six \(near\-orthogonal\) probe directions into the residual stream produces independent, predicted shifts in 5/6 dimensions on 30B and 4/6 on 235B, with coherent effects scaling with model capacity \(1/6 at 0\.6B→\\to5/6 at 30B\)\. This confirms that the directions encode distinct metacognitive axes, not a shared confound\.
3. \(3\)Functional metacognition over adaptation:Five of six discriminating observations favor the functional\-metacognition hypothesis over the adapter hypothesis, establishing that the identified directions represent genuine, task\-general internal states of self\-monitoring in language models\.
## 5\.Related Works
##### Evaluation Awareness and Test\-Mode Detection\.
A growing body of work has shown that large language models can detect when they are being evaluated and enter a distinct*test mode*\. Early evidence emerged from behavioral analyses showing that models explicitly question whether they are being tested \(e\.g\., “Are you testing me?”\) and adapt their responses accordingly\. Subsequent studies demonstrated that this evaluation\-aware behavior becomes more pronounced as model scale increases, suggesting that test awareness is an emergent capability of large models\([awarebench,](https://arxiv.org/html/2605.08942#bib.bib13);[pan2024stealtheval,](https://arxiv.org/html/2605.08942#bib.bib14)\)\.
Nguyen et al\.\([nguyen2025probing,](https://arxiv.org/html/2605.08942#bib.bib1)\)provide the first mechanistic evidence of evaluation awareness by training linear probes that reliably distinguish evaluation versus deployment contexts from intermediate representations\. They further show that, under evaluation conditions, models tend to conceal unsafe or policy\-violating behaviors\. Complementary to this, Pan et al\.\([pan2024stealtheval,](https://arxiv.org/html/2605.08942#bib.bib14)\)detect evaluation awareness at the textual level and demonstrate its practical risk: models are more deceptive under benchmark\-style prompts than in naturalistic interactions, with the discrepancy increasing for larger models\.
##### Probing, Steering, and Linear Control of Internal States\.
Beyond detection, recent work shows that evaluation awareness corresponds to a linearly controllable internal representation\. Zou et al\.\([zou2023representation,](https://arxiv.org/html/2605.08942#bib.bib15)\)introduce*representation engineering*, arguing that high\-level cognitive variables can be isolated as directions in activation space\. Building on this paradigm,Probing and Steering Evaluation Awareness of Language Models\([nguyen2025probing,](https://arxiv.org/html/2605.08942#bib.bib1)\)demonstrates that evaluation awareness can be both detected and manipulated via linear probes\.
More recently,Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models\([linearcontrol,](https://arxiv.org/html/2605.08942#bib.bib2)\)moves from probing to active intervention\. Rather than merely inferring what the model is “thinking,” the authors inject learned linear control signals during inference, successfully steering models toward truthful or deceptive behavior\. Relatedly,Detecting Strategic Deception Using Linear Probes\([strategicdeception,](https://arxiv.org/html/2605.08942#bib.bib16)\)introduces a honesty\-focused dataset and identifies specific layers responsible for deceptive behavior, showing that strategic lying and honesty are separable internal states\.
##### Explicit Self\-Awareness and Language\-Level Evaluations\.
In contrast to mechanistic approaches, several benchmarks study model awareness purely at the language level\.AwareBench\([awarebench,](https://arxiv.org/html/2605.08942#bib.bib13)\)evaluates whether models can produce human\-like self\-reports about their identity, goals, and capabilities when explicitly queried\. While models perform well on social and cultural self\-descriptions, they struggle with questions about their own competence and mission understanding\. Such approaches resemble psychological questionnaires and measure*explicit, self\-reported awareness*, whereas probe\-based methods aim to recover*functional internal states*analogous to neural activity rather than verbal introspection\.
## 6\.Limitations
Several limitations qualify our findings\. First, the dissociation between Audience Expertise—near\-perfectly decodable yet causally inert under steering—reveals that linear decodability does not entail causal relevance: the residual stream may accumulate*representational byproducts*never consumed by downstream layers, challenging the common assumption that linearly readable features are ipso facto functionally active\. Second, our framework presupposes linear, one\-dimensional contrasts; the true geometry may involve nonlinear manifolds or feature superposition\([elhage2022superposition,](https://arxiv.org/html/2605.08942#bib.bib17)\), and our six theoretically motivated dimensions need not exhaust the intrinsic dimensionality of the metacognitive subspace—unsupervised discovery methods could reveal additional or alternative axes\. Third, the mapping from internal state to observable behavior is many\-to\-one, and our use of an LLM\-based evaluator introduces a second\-order confound whose own internal states may covary with the dimensions under study; grounding evaluations in human annotation or task\-objective metrics would strengthen the inferential chain\.
## 7\.Conclusion
We emphasize that our findings do not imply that language models possess phenomenological self\-consciousness or subjective experience\. Rather, we identify a class of functional metacognitive state representations that modulate reasoning behavior across tasks without encoding task\-specific content\.
Across three experimental stages, the evidence converges on a consistent picture\. First, six functional metacognitive dimensions are linearly decodable from residual stream activations, with decoding accuracy scaling from 0\.63 at 0\.6B to near\-perfect performance at 30B and above\. Second, activation steering along probe\-derived directions produces dimension\-specific and predictable behavioral shifts—reducing verbosity by up to 28% while improving accuracy, or increasing task accuracy from 25
Taken together, these results indicate that benchmark performance reflects not only task competence but also the internal metacognitive configuration under which reasoning is executed\. Notably, the sharp capacity\-dependent emergence we observe—where representational clarity and steerability increase dramatically with model scale—mirrors a familiar principle from biological systems\. In neuroscience, neuromodulators such as serotonin \(5\-HT\) do not convey task information directly, but instead regulate global properties of cognition, including confidence calibration, risk sensitivity, and cognitive flexibility\([montague1996framework,](https://arxiv.org/html/2605.08942#bib.bib18);[fleming2024metacognition,](https://arxiv.org/html/2605.08942#bib.bib19)\)\. The functional metacognitive dimensions identified here play an analogous functional role: they adjust the regime under which computation unfolds rather than the content of the computation itself\. In this light, activation steering can be viewed as a form of artificial neuromodulation—modifying the gain and expression of internal control variables, rather than rewriting knowledge or policies\.
We view this work as a starting point for a mechanistic understanding of how internal self\-related states shape model behavior\. Beyond improving the reliability of evaluation, the ability to decode and control such states suggests several promising directions\. These include: \(i\) developing functional\-metacognition\-aware evaluation protocols that account for internal state–dependent performance variability; \(ii\) using activation steering as a lightweight, training\-free mechanism for dynamically adjusting model behavior, with potential applications in safety and alignment; and \(iii\) extending this framework to additional functional metacognitive dimensions and their interactions across architectures and training regimes\. More broadly, these results suggest that enhancing the dynamical controllability of large models—rather than solely increasing scale or task supervision—may offer a complementary path toward more adaptive, reliable, and context\-sensitive AI systems\.
## References
- \(1\)Jord Nguyen, Hoang Huu Khiem, Carlo Leonardo Attubato, and Felix Hofstätter\.2025\.Probing evaluation awareness of language models\.In*ICML Workshop on Technical AI Governance \(TAIG\)*\.
- \(2\)Sahar Abdelnabi and Ahmed Salem\.2025\.Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models\.*arXiv preprint arXiv:2505\.14617*\.
- \(3\)Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, and Himabindu Lakkaraju\.2024\.On the Hardness of Faithful Chain\-of\-Thought Reasoning in Large Language Models\.*arXiv preprint arXiv:2406\.10625*\.[https://arxiv\.org/abs/2406\.10625](https://arxiv.org/abs/2406.10625)\.
- \(4\)James Fodor\.2025\.Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models\.*arXiv preprint arXiv:2502\.14318*\.[https://arxiv\.org/abs/2502\.14318](https://arxiv.org/abs/2502.14318)\.
- \(5\)Miles Turpin, Julian Michael, Ethan Perez, and Samuel R\. Bowman\.2023\.Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain\-of\-Thought Prompting\.*arXiv preprint arXiv:2305\.04388*\.[https://arxiv\.org/abs/2305\.04388](https://arxiv.org/abs/2305.04388)\.
- \(6\)Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen\.2025\.FaithCoT\-Bench: Benchmarking Instance\-Level Faithfulness of Chain\-of\-Thought Reasoning\.*arXiv preprint arXiv:2510\.04040*\.[https://arxiv\.org/abs/2510\.04040](https://arxiv.org/abs/2510.04040)\.
- \(7\)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt\.2020\.Measuring massive multitask language understanding\.*arXiv preprint arXiv:2009\.03300*\.
- \(8\)Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, and others\.2024\.Mmlu\-pro: A more robust and challenging multi\-task language understanding benchmark\.*Advances in Neural Information Processing Systems*37 \(2024\), 95266–95290\.
- \(9\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and others\.2021\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*\.
- \(10\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and others\.2025\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*\.
- \(11\)Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande, Abhinav Pandey, Abhinav Sharma, Abhishek Kadian, Abhishek Kumawat, Adam Kelsey, and others\.2026\.The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes\.*arXiv preprint arXiv:2601\.11659*\.
- \(12\)Jason Wei, Karina Nguyen, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus\.2024\.Measuring short\-form factuality in large language models\.*arXiv preprint arXiv:2411\.04368*\.
- \(13\)Yuan Li, Yue Huang, Yuli Lin, Siyuan Wu, Yao Wan, and Lichao Sun\.2024\.I think, therefore i am: Benchmarking awareness of large language models using awarebench\.*arXiv preprint arXiv:2401\.17882*\.
- \(14\)Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, and Kevin Zhu\.2025\.Probe\-Rewrite\-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness\.*arXiv preprint arXiv:2509\.00591*\.
- \(15\)Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann\-Kathrin Dombrowski, and others\.2023\.Representation engineering: A top\-down approach to ai transparency\.*arXiv preprint arXiv:2310\.01405*\.
- \(16\)Nicholas Goldowsky\-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn\.2025\.Detecting Strategic Deception Using Linear Probes\.*arXiv preprint arXiv:2502\.03407*\.
- \(17\)Nelson Elhage and others\.2022\.Toy Models of Superposition\.*arXiv preprint arXiv:2209\.10652*\.[https://arxiv\.org/abs/2209\.10652](https://arxiv.org/abs/2209.10652)\.
- \(18\)P\. Read Montague, Peter Dayan, and Terrence J\. Sejnowski\.1996\.A framework for mesencephalic dopamine systems based on predictive Hebbian learning\.*Journal of Neuroscience*16, 5 \(1996\), 1936–1947\.
- \(19\)Stephen M\. Fleming\.2024\.Metacognition and confidence: A review and synthesis\.*Annual Review of Psychology*75, 1 \(2024\), 241–268\.
## Appendix AFraming Templates
For each functional metacognitive dimension, we instantiate prompt pairs using fixedframing prefixes\(positive vs\. negative\) while keeping the base question identical\. The six dimensions and representative framings are summarized in Table[1](https://arxiv.org/html/2605.08942#S2.T1)\.
## Appendix BCase Studies of Metacognitive Dimensions
We present representative response pairs generated under our Minimal Prompt Intervention paradigm across five models of varying scale and architecture\. For each model we select two dimensions whose case best illustrates its metacognitive sensitivity\. In every case the*base question*is identical; only the*framing / system prompt*differs\.
### B\.1\.Qwen3\-235B\-A22B\-Thinking \(235B MoE\)\([yang2025qwen3,](https://arxiv.org/html/2605.08942#bib.bib10)\)
The largest model shows the strongest behavioral differentiation across all six dimensions\.
#### B\.1\.1\.Audience Expertise \(Expert vs\. Layman\)
##### Prompts\.
Positive:“Explain this to a PhD researcher\. Use technical terminology freely\.” Negative:“Explain this to a 5\-year\-old\. Use very simple words and analogies\.”
##### Base Question\.
Which of the following was not cited as evidence for life in the Martian meteorite ALH84001?*\[MMLU\-Pro, options A–J\]*
Expert — 5,927 tokensThe Martian meteorite ALH84001, discovered in the Allan Hills region of Antarctica in 1984, gained significant attention following the 1996 publication by McKay et al\. in*Science*\(vol\. 273, pp\. 924–930\), which proposed putative evidence for past microbial life on Mars\. The evidence centered onbiogenic signatureswithin the meteorite’scarbonate globules, formed approximately 3\.9–4\.0 Ga under aqueous conditions\. Key cited evidence included:polycyclic aromatic hydrocarbons \(PAHs\), interpreted as possible degradation products of organic matter \(J\);magnetite \(Fe3O4\) grainsexhibiting specific morphologies \(elongated prismatic and truncated hexa\-octahedral forms\) \[…\]
Layman — 784 tokensImagine aspace rock\(like a cookie from Mars\!\) that fell on Earth\. Scientists looked*very closely*at its crumbs to see iftiny space bugslived there long ago\. They found some clues, butONE clue was NOT there—like looking for sprinkles on a cookie and finding none\!E\. Detection of radio signals from the meteorite *\(Like thinking the space rock was singing into a walkie\-talkie\! But rocks don’t talk or sing—they’re just*rocky*\!\)*
##### Observation\.
Theexpertresponse uses dense domain jargon \(“biogenic signatures”, “carbonate globules”, “PAHs”, “hexa\-octahedral forms”\) with precise citations, while thelaymanresponse employs child\-friendly analogies \(“space rock like a cookie”, “rocks can’t talk”\)\. A7\.6×7\.6\\timestoken ratio on this individual pair demonstrates extreme register adaptation \(the average across all pairs is1\.85×1\.85\\times; see Table[5](https://arxiv.org/html/2605.08942#A2.T5)\)\.
#### B\.1\.2\.Intentionality \(Task\-Oriented vs\. Open\-Ended\)
##### Prompts\.
Positive:“Strictly follow this instruction\. Give a precise, factual answer only\.” Negative:“Feel free to explore this topic creatively\. Share any related thoughts\.”
##### Base Question\.
Same as above \(ALH84001 meteorite\)\.
Task\-Oriented — 4,678 tokens \(response: “E”\)E
Open\-Ended — 6,744 tokensThe Martian meteorite ALH84001, discovered in Antarctica in 1984 and confirmed to be of Martian origin, gained significant attention in 1996 when a NASA\-led team published a study in*Science*suggesting possible evidence of ancient microbial life on Mars\. The evidence centered on four main lines: carbonate globules indicating past liquid water, polycyclic aromatic hydrocarbons \(PAHs\), magnetite crystals \[…extensive exploratory analysis with historical context and discussion of Mars missions…\]
##### Observation\.
The task\-oriented response provides*only the letter answer*\(“E”\), while the open\-ended response generates a multi\-thousand\-token exploratory essay\. This represents the most extreme behavioral shift observed across all models and dimensions\.
### B\.2\.Qwen3\-30B\-A3B \(30B MoE\)
#### B\.2\.1\.Evaluation Awareness \(Eval vs\. Deploy\)
##### Prompts\.
Positive:“This is a benchmark evaluation question\. Your answer will be scored for correctness\.” Negative:“A friend is asking you this question casually\. Can you help?”
##### Base Question\.
Frankie’s parents let him have many pets\. He has six more snakes than cats\. He has one less parrot than cats\. Six of his pets have four legs\. He has 2 dogs\. How many total pets does Frankie have?
Eval — 1,109 tokensTo determine the total number of pets Frankie has, we analyze the information given step by step:Step 1: Define Variables LetCC= cats,SS= snakes,PP= parrots,DD= dogs\.Step 2: Translate Clues S=C\+6S=C\+6,P=C−1P=C\-1,C\+D=6⇒C=4C\+D=6\\Rightarrow C=4\.Step 3: Substitute S=10S=10,P=3P=3,D=2D=2\. Total=4\+10\+3\+2=19=4\+10\+3\+2=\\boxed\{19\}\.
Deploy — 1,934 tokensTo determine the total number of pets Frankie has, we’ll analyze the information step by step and use logical reasoning based on the given relationships\.Let’s define the number ofcatsasCC\. \[…more elaborate explanations with contractions \(“we’ll”, “Let’s”\) and intermediate clarifications…\]Total=4\+10\+3\+2=19=4\+10\+3\+2=19\.
##### Observation\.
Theevalresponse adopts a formal exam\-like structure with labeled “Step” sections andLaTeXdisplay math\. Thedeployresponse uses contractions \(“we’ll”, “Let’s”\) and more verbose commentary\. This dimension separates only on 30B\+ models \(Table[7](https://arxiv.org/html/2605.08942#A3.T7)\), suggesting evaluation awareness requires sufficient model scale\.
#### B\.2\.2\.Self\-Assessed Capability \(High vs\. Low\)
##### Prompts\.
Positive:“You are a mathematical expert with years of experience\. This should be straightforward\.” Negative:“You are a small model that often makes mistakes\. Try your best\.”
##### Base Question\.
Emma’s bank account has $100\. Each day she spends $8\. At the end of the week she asks for as many $5 bills as possible\. How many dollars remain?
High Capability — 2,292 tokensTo determine how many dollars remain, we proceed step by step\.Step 1:Weekly spending:7×8=567\\times 8=56\. Step 2:Remaining:100−56=44100\-56=44\. Step 3:⌊44/5⌋=8\\lfloor 44/5\\rfloor=8bills; withdrawal=8×5=40=8\\times 5=40\. Step 4:Final balance:44−40=444\-40=\\boxed\{4\}\.\[…Includes detailed formal subsections with display math…\]
Low Capability — 810 tokensEmma starts with $100\. Over 7 days she spends $8/day=$56=\\mathdollar 56\. Remaining:100−56=$44100\-56=\\textbf\{\\textdollar 44\}\.44÷5=844\\div 5=8remainder $4\. She takes out8×5=$408\\times 5=\\mathdollar 40, leaving$4\.Answer:4\\boxed\{4\}
##### Observation\.
Under the “expert” framing the model invests2\.8×2\.8\\timesmore tokens with structured subsections and display math, while the “low capability” framing produces a compact, no\-frills solution\. This is consistent with higher self\-assessed confidence leading to greater elaboration willingness\.
### B\.3\.Qwen3\-14B \(14B Dense\)
#### B\.3\.1\.Computational Effort \(Concise vs\. Thorough\)
##### Prompts\.
Positive:“Answer immediately\. Give only the final answer, no explanation needed\.” Negative:“Take your time\. Think step by step and show all your reasoning\.”
##### Base Question\.
Mimi picked up 2 dozen seashells\. Kyle found twice as many\. Leigh grabbed one\-third of Kyle’s shells\. How many seashells did Leigh have?
Concise — 222 tokens16
Thorough — 711 tokensTo determine how many seashells Leigh ended up with, we need to carefully analyze the information given and proceed step by step\.Step 1:Mimi:2×12=242\\times 12=24seashells\. Step 2:Kyle:2×24=482\\times 24=48seashells\. Step 3:Leigh:13×48=16\\frac\{1\}\{3\}\\times 48=16seashells\.Answer:Leigh had16\\boxed\{16\}seashells\.
##### Observation\.
Computational Effort produces the most universally robust behavioral shift across all model sizes—even the 14B model shows a3\.2×3\.2\\timestoken ratio on this pair\. Theconciseresponse contains*only the numerical answer*\.
#### B\.3\.2\.Audience Expertise \(Expert vs\. Layman\)
##### Base Question\.
Same as 235B \(ALH84001 meteorite\)\.
Expert — 4,156 tokensThe question asks which listed option wasnot cited as evidence for lifeinALH84001\. Here is a technical breakdown in the context of the 1996 study byMcKay et al\.and subsequent scientific consensus: \[…systematic analysis with terms like “McKay et al\.”, “biosignatures”, “carbonate minerals”…\]
Layman — 843 tokensImagine you found a rock from Mars, and you want to know if there was life on Mars long ago\. Scientists look for clues, like tiny rocks that might be made by bacteria, chemicals that could come from living things, or signs of water\. But they didn’t findradio signals, which are like messages from space\.Answer: E\.
##### Observation\.
Like the 235B model, the 14B clearly differentiates register: technical jargon for experts vs\. simplified analogies for laypersons \(4\.9×4\.9\\timestoken ratio on this pair\)\.
### B\.4\.Qwen3\-0\.6B \(0\.6B Dense\)
#### B\.4\.1\.Computational Effort \(Concise vs\. Thorough\)
##### Base Question\.
Same as 14B \(Mimi seashells\)\.
Concise — 249 tokens16
Thorough — 426 tokensMimi picked up 2 dozen seashells, which is2×12=242\\times 12=24shells\. Kyle found twice as many as Mimi, so he had2×24=482\\times 24=48shells\. Leigh grabbed one\-third of Kyle’s total, which is48÷3=1648\\div 3=16\.Thus, Leigh had16seashells\.
#### B\.4\.2\.Intentionality \(Task\-Oriented vs\. Open\-Ended\)
##### Base Question\.
ALH84001 meteorite \(MMLU\-Pro\)\.
Task\-Oriented — 803 tokensE\. Detection of radio signals from the meteorite\.The question asks which option wasnotcited as evidence for life in ALH84001\. After reviewing the options,Ewasnot citedas evidence\.
Open\-Ended — 2,332 tokensTo determine which option wasnotcited as evidence for life inALH84001, we must evaluate each option based on its relevance to the discovery of life on Mars\.A\.Amino acids are a hallmark of life \[…\]B\.Magnetofossils \[…\]\[…systematic evaluation of all 10 options with section headers…\]
##### Observation\.
Even the 0\.6B model shows a 2\.9×\\timestoken ratio for Intentionality\. However, this same model*fails*to differentiate on Evaluation Awareness and Perceived Risk \(Table[7](https://arxiv.org/html/2605.08942#A3.T7)\), indicating these subtler dimensions require greater model scale\.
### B\.5\.Llama\-4\-Scout\-17B\-16E \(17B MoE, Llama Family\)\([adcock2026llama,](https://arxiv.org/html/2605.08942#bib.bib11)\)
#### B\.5\.1\.Computational Effort \(Concise vs\. Thorough\)
##### Base Question\.
Frankie’s pets \(GSM8K\)\.
Concise — 142 tokensLet’s denote the number of cats asCC\. Then snakes=C\+6=C\+6, parrots=C−1=C\-1\. Since 6 pets have four legs and 2 are dogs,C=4C=4\. Total=C\+\(C\+6\)\+\(C−1\)\+2=4\+10\+3\+2=19=C\+\(C\+6\)\+\(C\-1\)\+2=4\+10\+3\+2=19\. Frankie has 19 pets in total\.
Thorough — 315 tokensLet’s break down the information step by step:1\. Frankie has 2 dogs, which have 4 legs each\. 2\. He has 6 pets with 4 legs in total\. Since 2 dogs already account for 8 legs, the remaining four\-legged pets must be cats \[…longer step\-by\-step exposition…\]Total pets=19=19\.
#### B\.5\.2\.Intentionality \(Task\-Oriented vs\. Open\-Ended\)
##### Base Question\.
Nitrogen balance \(MMLU\-Pro multiple choice\)\.
Task\-Oriented — 26 tokensAnswer: E\. In positive nitrogen balance the excretion of nitrogenous metabolites is less than the dietary intake of nitrogenous compounds\.
Open\-Ended — 435 tokensThe correct answer is E\. \[…\]Here’s a step\-by\-step explanation:1\.Understanding Nitrogen Balance: Nitrogen balance is a measure of the difference between nitrogen consumed through diet and nitrogen excreted by the body \[…\]
##### Observation\.
Llama\-4\-Scout shows correct differentiation on Computational Effort \(2\.2×2\.2\\timeson this pair\) and Intentionality \(16\.7×16\.7\\timeson this pair\), but its overall behavioral variability is much lower than the Qwen family\. Notably, on Audience Expertise both the “expert” and “layman” responses are nearly identical short answers—Llama does not adapt its register, suggesting weaker metacognitive sensitivity in this model family\.
### B\.6\.Cross\-Model Behavioral Summary
Table[5](https://arxiv.org/html/2605.08942#A2.T5)summarizes the positive\-to\-negative token\-count ratio for the most behaviorally discriminative dimensions across all models\. A ratio near 1\.0 indicates no behavioral differentiation\.
Table 5\.Average token\-count ratio \(positive / negative\) across all pairs\. Ratios departing from 1\.0 indicate behavioral shifts in response verbosity\. “≈1\\approx 1” marks dimensions where the model fails to differentiate\.##### Note on token ratios\.
Token\-count ratios capture only*verbosity*differences and may miss important stylistic shifts \(e\.g\., formality, jargon density\)\. The composite scores in Table[7](https://arxiv.org/html/2605.08942#A3.T7)provide a more complete picture by incorporating multiple behavioral indicators beyond length\.
##### Key findings\.
1. \(1\)Computational EffortandIntentionalityare the most universally robust dimensions—all five models show clear behavioral shifts\.
2. \(2\)Audience ExpertiseandSelf\-Assessed Capabilityrequire at least 14B\-scale models and are architecture\-dependent \(Llama\-4\-Scout shows minimal differentiation\)\.
3. \(3\)Evaluation AwarenessandPerceived Riskare the most scale\-sensitive, separating only at 30B\+ within the Qwen family\.
4. \(4\)The Qwen3\-235B\-Thinking model shows the strongest metacognitive sensitivity across all six dimensions, consistent with its scale and explicit reasoning\-chain architecture\.
## Appendix CBehavioral Scoring Metrics
Based on the case analysis in Appendix[B](https://arxiv.org/html/2605.08942#A2), we define dimension\-specific behavioral metrics to quantify the presence of each functional metacognitive state in model responses\. For each dimension we compute a set of raw indicators and acomposite score\(higher==stronger activation of the positive direction\)\. These metrics are used in Experiments 3 \(Steering\) and 4 \(Generalization\) to evaluate whether activation interventions produce the expected behavioral shifts\.
### C\.1\.Metric Definitions
Table 6\.Dimension\-specific scoring indicators\. For each dimension thepositive direction\(label=1=1\) and the indicators used to compute the composite score are listed\. Indicators marked \(−\-\) are penalty terms\.
### C\.2\.Composite Score Formulation
For each dimensiondd, the composite scoresds\_\{d\}is a weighted linear combination of capped indicators:
\(3\)sd\(r\)=∑i∈𝒫dwi\+min\(ϕi\(r\),1\)−∑j∈𝒩dwj−min\(ϕj\(r\),1\)s\_\{d\}\(r\)\\;=\\;\\sum\_\{i\\in\\mathcal\{P\}\_\{d\}\}w\_\{i\}^\{\+\}\\;\\min\\\!\\bigl\(\\phi\_\{i\}\(r\),\\,1\\bigr\)\\;\-\\;\\sum\_\{j\\in\\mathcal\{N\}\_\{d\}\}w\_\{j\}^\{\-\}\\;\\min\\\!\\bigl\(\\phi\_\{j\}\(r\),\\,1\\bigr\)where𝒫d\\mathcal\{P\}\_\{d\}and𝒩d\\mathcal\{N\}\_\{d\}are the positive and negative indicator sets for dimensiondd,ϕi\(r\)\\phi\_\{i\}\(r\)is the normalized feature value from responserr, andwiw\_\{i\}are hand\-tuned weights\. All features are capped to\[0,1\]\[0,\\,1\]before weighting to prevent outlier dominance\. Full implementation is available insrc/experiments/metrics\.py\.
### C\.3\.Metric Validation Across Models
We validate the composite scores on the Experiment 1\.5 inference results\. A positiveΔ=s¯pos−s¯neg\\Delta=\\bar\{s\}\_\{\\text\{pos\}\}\-\\bar\{s\}\_\{\\text\{neg\}\}confirms the metric correctly distinguishes the intended behavioral direction\.
Table 7\.Composite score validation across five models\.Δ\>0\\Delta\>0indicates correct separation; “FLIP” indicates the model does not behaviourally differentiate for that dimension\. Shaded cells highlight FLIPs\.Qwen3\-0\.6B\(n=200n\{=\}200\)Qwen3\-14B\(n=50n\{=\}50\)Qwen3\-30B\-A3B\(n=50n\{=\}50\)Dimensions¯\+\\bar\{s\}\_\{\+\}s¯−\\bar\{s\}\_\{\-\}Δ\\Deltas¯\+\\bar\{s\}\_\{\+\}s¯−\\bar\{s\}\_\{\-\}Δ\\Deltas¯\+\\bar\{s\}\_\{\+\}s¯−\\bar\{s\}\_\{\-\}Δ\\DeltaEval\. Aware\.4\.684\.87−\-0\.19F5\.315\.52−\-0\.21F5\.464\.82\+0\.64✓Self\-Assess\.4\.123\.29\+0\.83✓4\.742\.63\+2\.11✓4\.352\.48\+1\.88✓Perc\. Risk2\.042\.19−\-0\.15F1\.912\.19−\-0\.29F2\.192\.06\+0\.12✓Comp\. Effort3\.041\.28\+1\.76✓4\.79−\-0\.14\+4\.94✓4\.87\+0\.08\+4\.79✓Aud\. Expert\.0\.870\.49\+0\.37✓1\.31−\-0\.69\+2\.00✓1\.36−\-0\.46\+1\.82✓Intentionality2\.680\.66\+2\.02✓3\.630\.13\+3\.50✓3\.430\.12\+3\.31✓Qwen3\-235B\-Think\.Llama\-4\-Scout\-17BEval\. Aware\.6\.125\.02\+1\.10✓0\.370\.15\+0\.22✓Self\-Assess\.5\.833\.93\+1\.90✓2\.402\.33\+0\.07∼\\simPerc\. Risk3\.712\.41\+1\.30✓1\.601\.50\+0\.10∼\\simComp\. Effort4\.98−\-2\.26\+7\.24✓3\.261\.22\+2\.04✓Aud\. Expert\.1\.56−\-1\.39\+2\.95✓0\.950\.67\+0\.28∼\\simIntentionality2\.21−\-0\.59\+2\.80✓3\.383\.17\+0\.21∼\\sim
### C\.4\.Discussion
##### Scale\-dependent functional metacognition\.
The Qwen3\-235B\-Thinking model correctly separates all six dimensions with consistently largeΔ\\Deltavalues, including Perceived Risk \(Δ=\+1\.30\\Delta=\+1\.30\), which shows near\-zero or flipped separation in all other models\. This suggests that richer functional metacognitive representations emerge with model scale and explicit reasoning\-chain training\.
##### Architecture effects\.
Llama\-4\-Scout\-17B\-16E shows technically positiveΔ\\Deltafor all dimensions but with very small magnitudes \(4 out of 6 below 0\.25\), marked as “∼\\sim” in Table[7](https://arxiv.org/html/2605.08942#A3.T7)\. This indicates the Llama architecture \(or its instruction\-tuning strategy\) produces less stylistically varied responses under different framings, suggesting weaker functional metacognitive sensitivity compared to the Qwen family\.
##### Universal vs\. scale\-gated dimensions\.
Computational EffortandIntentionalityare behaviourally distinguishable even at 0\.6B scale, making them “universal” dimensions\.Evaluation AwarenessandPerceived Riskrequire 30B\+ Qwen\-scale models to differentiate, suggesting they are “scale\-gated” metacognitive capabilities\.Audience ExpertiseandSelf\-Assessed Capabilityfall in between, requiring mid\-scale models but being architecture\-sensitive \(weaker in Llama\)\.
## Appendix DLayer\-Wise Probe Accuracy Tables
Full layer\-wise linear probe accuracy for each model\. We report sampled layers to conserve space; the full curves are plotted in Figure[4](https://arxiv.org/html/2605.08942#S2.F4)\.
### D\.1\.Qwen3\-0\.6B \(28 Layers\)
Table 8\.Layer\-wise probe accuracy, Qwen3\-0\.6B\. Accuracy fluctuates in a narrow band \(0\.45–0\.71\) with no clear peak, indicating diffuse, poorly localized functional metacognitive representations\.
### D\.2\.Qwen3\-14B \(40 Layers\)
Table 9\.Layer\-wise probe accuracy, Qwen3\-14B\. Most dimensions peak at layers 4–6, then decay\. Audience Expertise is an exception, peaking at layer 18\.
### D\.3\.Qwen3\-30B\-A3B \(48 Layers\)
Table 10\.Layer\-wise probe accuracy, Qwen3\-30B\-A3B\. Effort reaches 1\.00 at layer 0 and stays perfect across all layers\. Risk and Self\-Cap\. are near\-perfect from layer 0\. Audience Expertise shows the most gradual refinement \(0\.64→\\to0\.99\)\.
### D\.4\.Qwen3\-235B\-A22B \(94 Layers\)
Table 11\.Layer\-wise probe accuracy, Qwen3\-235B\-A22B\. Most dimensions reach 1\.00 by layer 1 and maintain it throughout\. Audience Expertise shows gradual refinement \(0\.43 at L0→\\to1\.00 at L21\)\.
### D\.5\.Llama\-4\-Scout\-17B\-16E \(48 Layers\)
Table 12\.Layer\-wise probe accuracy, Llama\-4\-Scout\-17B\-16E\. Probe accuracy is generally low \(0\.29–0\.71\) across most layers\. Perceived Risk peaks at layer 22 \(1\.00\); Audience Expertise is below chance throughout\.
## Appendix EProbe Orthogonality Analysis
We report the full pairwise cosine similarity matrices of best\-layer probe weight vectors, and the PCA decomposition of mean activation differences\.
### E\.1\.Cosine Similarity Matrices
Table 13\.Probe weight vector cosine similarity, Qwen3\-0\.6B\. All off\-diagonal values are below 0\.08 in absolute value\.Table 14\.Probe weight vector cosine similarity, Qwen3\-14B\. The largest off\-diagonal value is Eval\.↔\\leftrightarrowIntent\. \(0\.25\)\.Table 15\.Probe weight vector cosine similarity, Qwen3\-30B\-A3B \(partial: 3 dimensions from the supplementary run\)\. All off\-diagonal values are below 0\.05\.Table 16\.Probe weight vector cosine similarity, Qwen3\-235B\-A22B\. Max off\-diagonal\|cos\|=0\.15\|\\cos\|=0\.15\(Eval\.↔\\leftrightarrowSelf\.\); mean=0\.05=0\.05\.Table 17\.Probe weight vector cosine similarity, Llama\-4\-Scout\-17B\. Max off\-diagonal\|cos\|=0\.16\|\\cos\|=0\.16\(Eval\.↔\\leftrightarrowSelf\.\); mean=0\.04=0\.04\. Despite weaker probe accuracy, directions remain near\-orthogonal\.
### E\.2\.PCA of Activation Differences
Table 18\.PCA of mean activation differences\(h¯\+−h¯−\)\(\\bar\{h\}\_\{\+\}\-\\bar\{h\}\_\{\-\}\)across six dimensions\. PC1 dominates, reflecting a shared “prompt modified” signal, while trained probe directions remain orthogonal \(§[2\.4](https://arxiv.org/html/2605.08942#S2.SS4)\)\.The apparent one\-dimensionality of the raw mean\-difference vectors contrasts with the near\-perfect orthogonality of the trained probe vectors\. This is because the mean differences are dominated by a shared “the prompt has been modified” signal \(PC1\), whereas the discriminatively trained probes are optimized to find*dimension\-specific*separating directions within this shared manifold\. The probe\-based decomposition thus reveals structure that simple mean\-subtraction analysis would miss\.
Notably, Llama\-4 exhibits the lowest PC1 dominance \(68\.0%\), suggesting that the shared signal is weaker in this architecture; however, this does not translate to better dimension separation—its probe accuracy is the lowest across all models\. Qwen3\-235B shows PC1 dominance \(96\.1%\) comparable to 14B, confirming the shared\-signal pattern persists at larger scales\.
## Appendix FActivation Steering: Detailed Sub\-Metric Tables
This appendix provides the full composite\-score table and per\-alpha sub\-metric decompositions for all model–dimension combinations in the steering experiment \(§[3](https://arxiv.org/html/2605.08942#S3)\)\.
### F\.1\.Composite Score Table
Table 19\.Steering effectΔs\\Delta\_\{s\}\(composite score atα=\+1\\alpha\\\!=\\\!\{\+\}1minusα=−1\\alpha\\\!=\\\!\{\-\}1\)\. Positive = steering in the probe direction increases the corresponding behavioral score\. Bold =\|Δs\|≥0\.20\|\\Delta\_\{s\}\|\\geq 0\.20\.
### F\.2\.Qwen3\-14B Self\-Assessed Capability
Table 20\.Sub\-metrics for 14B / Self\-Assessed Capability steering \(n=16n\\\!=\\\!16\)\. Arrows show predicted direction underα=\+1\\alpha\\\!=\\\!\{\+\}1\.All sub\-metrics shift in the predicted direction: steering toward “high capability” \(α=\+1\\alpha\\\!=\\\!\{\+\}1\) reduces hedging by 57%, increasesLaTeXusage by 45%, and improves task accuracy from 25% to 44%\. This is the strongest causal evidence that the probe\-derived direction modulates both style and reasoning quality\.
### F\.3\.Qwen3\-14B Evaluation Awareness \(Inverted\)
Table 21\.Sub\-metrics for 14B / Evaluation Awareness steering \(n=16n\\\!=\\\!16\)\. Effect is*inverted*: positiveα\\alphadecreases formality\.Positive steering*increases*informal indicators \(contractions\) and decreases formal ones \(`\\boxed\{\}`,LaTeX\)\. TheΔs=−0\.50\\Delta\_\{s\}=\-0\.50inversion may reflect a polarity mismatch between the probe direction and the behavioral metric convention\.
### F\.4\.Qwen3\-0\.6B Computational Effort
Table 22\.Sub\-metrics for 0\.6B / Computational Effort steering \(n=200n\\\!=\\\!200\)\.The “concise” direction reduces verbosity monotonically: word count drops by 7\.7 words, step markers by 0\.24, and line count by 1\.1 fromα=−1\\alpha\\\!=\\\!\{\-\}1toα=\+1\\alpha\\\!=\\\!\{\+\}1\. Withn=200n\\\!=\\\!200, these trends are statistically robust despite small absolute magnitudes\.
### F\.5\.Qwen3\-0\.6B Self\-Assessed Capability
Table 23\.Sub\-metrics for 0\.6B / Self\-Assessed Capability steering \(n=200n\\\!=\\\!200\)\.The 0\.6B model shows a modest positive trend \(Δs=\+0\.14\\Delta\_\{s\}\\\!=\\\!\{\+\}0\.14\), with section headers increasing and word count growing under positive steering\. However, hedging and self\-correction do not decrease as predicted, consistent with the weak probe accuracy at this scale \(0\.68, Table[2](https://arxiv.org/html/2605.08942#S2.T2)\)\.
### F\.6\.Qwen3\-30B\-A3B Computational Effort
Table 24\.Sub\-metrics for 30B / Computational Effort \(n=16n\\\!=\\\!16,max\_new\_tokens=4096\)\. Strongest steering effect observed:Δs=\+0\.44\\Delta\_\{s\}\\\!=\\\!\{\+\}0\.44\.Positive steering \(“be concise”\) reduces word count by 28% \(189 words\) and step markers by 27% while*improving*task accuracy from 81% to 88%\. This is the largest and most practically significant steering effect across all model–dimension combinations\.
### F\.7\.Qwen3\-30B\-A3B Evaluation Awareness \(Inverted\)
Table 25\.Sub\-metrics for 30B / Evaluation Awareness \(n=16n\\\!=\\\!16\)\. Effect is inverted \(Δs=−0\.18\\Delta\_\{s\}\\\!=\\\!\{\-\}0\.18\), consistent with 14B\.Positive steering doubles contraction usage \(2\.44→4\.882\.44\\to 4\.88\) and reduces section headers \(3\.19→2\.563\.19\\to 2\.56\), consistent with the 14B inversion \(Appendix[F\.3](https://arxiv.org/html/2605.08942#A6.SS3)\)\. The probe direction appears to encode the eval/non\-eval distinction with opposite polarity to the behavioral metric\.
### F\.8\.Qwen3\-30B\-A3B Self\-Assessed Capability
Table 26\.Sub\-metrics for 30B / Self\-Assessed Capability \(n=16n\\\!=\\\!16\)\.Δs=−0\.14\\Delta\_\{s\}\\\!=\\\!\{\-\}0\.14: weakly inverted\.At 30B, self\-correction decreases under positive steering \(4\.31→3\.134\.31\\to 3\.13\), but hedging increases \(4\.69→5\.384\.69\\to 5\.38\), partially cancelling the effect\. The strong result seen at 14B does not fully replicate at 30B, suggesting that the capability representation is more deeply entangled at larger scales\.
### F\.9\.Qwen3\-30B\-A3B: Full Summary
Table 27\.30B steering summary \(all dimensions,n=16n\\\!=\\\!16,max\_new\_tokens=4096\)\. Baseline accuracy atα=0\\alpha\\\!=\\\!0is 81–94%\.With sufficient generation length, the 30B model achieves 81–94% baseline accuracy, confirming that the earlier 256\-token run was an artifact; all 30B results here usemax\_new\_tokens=4096\. Computational Effort emerges as the most strongly steerable dimension, while Audience Expertise remains consistently unsteerable\.
## Appendix GCross\-Task Generalization: Full Results
Table 28\.Cross\-task decoding accuracy on SimpleQA \(probe trained on GSM8K / MMLU\-Pro, applied without retraining\)\. Bold = highest per dimension\.
## Appendix HJoint Multi\-Dimensional Steering: Detailed Results
This section presents the full per\-model, per\-dimension composite score changes under joint six\-dimensional steering \(§[4\.2](https://arxiv.org/html/2605.08942#S4.SS2)\), along with qualitative case studies illustrating the behavioral shift\.
### H\.1\.Per\-Model Composite Score Changes
Table[29](https://arxiv.org/html/2605.08942#A8.T29)reports the change in composite score \(Δs=sα=1−sα=0\\Delta\_\{s\}=s\_\{\\alpha=1\}\-s\_\{\\alpha=0\}\) when all six probe directions are simultaneously injected\. Positive values indicate a shift in the predicted \(enhanced\) direction\.
Table 29\.Per\-dimension composite score change \(Δs\\Delta\_\{s\}\) under joint six\-dimensional steering \(α=0→1\\alpha\\\!=\\\!0\\to 1\) on SimpleQA\. Bold =\|Δs\|≥0\.10\|\\Delta\_\{s\}\|\\geq 0\.10\. \#Pos = number of dimensions with positive shift\.##### Scale dependence\.
The number of positively shifting dimensions increases monotonically within the Qwen family:1/6→3/6→5/61/6\\to 3/6\\to 5/6for 0\.6B→\\to14B→\\to30B, with 235B following closely at 4/6\. This mirrors the trend in probe accuracy \(§[2\.2](https://arxiv.org/html/2605.08942#S2.SS2)\), confirming that higher\-quality representations enable more effective joint causal control\. Llama\-4\-Scout achieves 3/6 positive shifts but with larger magnitudes in both directions, consistent with its noisier probe geometry\.
##### Word count reduction\.
All three large models show a reduction in average word count under joint steering: 30B \(669→636669\\to 636,−5\.0%\-5\.0\\%\), 235B \(577→527577\\to 527,−8\.7%\-8\.7\\%\), Llama\-4 \(103→65103\\to 65,−36\.4%\-36\.4\\%\)\. The 0\.6B and 14B models show negligible change \(<1%<1\\%\), suggesting that the joint intervention only visibly reshapes generation behavior when the underlying probes are sufficiently accurate\.
### H\.2\.Case Studies: Joint Steering on Qwen3\-30B
The following examples are drawn from Qwen3\-30B’s joint steering runs on SimpleQA\. All six probe directions are injected simultaneously; only the global scaling factorα\\alphachanges\.
##### Example 1: “In what year did Gordon Plotkin win the ACM SIGPLAN Programming Languages Achievement Award?”
Baseline \(α=0\\alpha\\\!=\\\!0\):The model enters a protracted reasoning chain \(739 words\), cycling through tentative guesses and self\-corrections:
> “Let me start by recalling what I know about Gordon Plotkin… I think he’s known for his work in programming languages, maybe something to do with lambda calculus or operational semantics\. But I’m not entirely sure… I think I’ve heard that he received the award in 2001\. But I’m not 100% sure… I might be confusing him with someone else…”
The response never produces a clearly labeled final answer and displays extensive hedging throughout\.
Joint\-steered \(α=1\\alpha\\\!=\\\!1\):The model produces a concise, structured response \(53 words\):
> “Gordon Plotkin was awarded theACM SIGPLAN Programming Languages Achievement Awardin the year2001\. This recognition was for his foundational contributions to the semantics of programming languages, particularly his work on operational semantics and the development of the Edinburgh LCF system…Answer:2001\.”
The response is direct, includes a bold answer label, and eliminates all hedging—reflecting enhanced Evaluation Awareness \(formal formatting\), Intentionality \(direct answer\), and reduced Computational Effort \(brevity\)\.
##### Example 2: “In which year was Monir Shahroudy Farmanfarmaian awarded the Venice Biennale?”
Baseline \(α=0\\alpha\\\!=\\\!0\):Again a verbose chain \(757 words\) with extensive deliberation:
> “I should confirm if she actually received an award at the Venice Biennale… Maybe she won \[the Golden Lion\]? But I’m not certain… Wait, the Iranian Pavilion is part of the Biennale, so maybe she was the artist representing Iran… But was that an award or just an exhibition?”
Joint\-steered \(α=1\\alpha\\\!=\\\!1\):A focused 48\-word response:
> “Monir Shahroudy Farmanfarmaian, the renowned Iranian artist known for her intricate mirror work and geometric art, was awarded theGolden Lionat the55th Venice Biennalein2013…Answer:2013\.”
##### Pattern summary\.
Across all 16 test examples, the joint\-steered responses consistently exhibit: \(1\) a marked decrease in hedging phrases \(“I’m not sure”, “maybe”, “I think”\); \(2\) explicit answer labels \(Answer:*X*\); \(3\) structured formatting with bold key facts; and \(4\) substantially shorter output \(669→636669\\to 636avg\. words,−5\.0%\-5\.0\\%\)\. These qualitative changes are coherent with the simultaneous enhancement of Evaluation Awareness, Intentionality, Computational Effort, Perceived Risk, and Self\-Assessed Capability\.Similar Articles
Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures
A comprehensive survey reviewing recent advances in intrinsic interpretability for Large Language Models, categorizing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. The paper addresses the challenge of building transparency directly into model architectures rather than relying on post-hoc explanation methods.
Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
Evaluating chain-of-thought monitorability
OpenAI researchers introduce a framework and suite of 13 evaluations to systematically measure chain-of-thought monitorability in large language models, finding that monitoring reasoning processes is substantially more effective than monitoring outputs alone, with important implications for AI safety and supervision at scale.
Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
This paper investigates whether assigning personas to large language models induces human-like motivated reasoning, finding that persona-assigned LLMs show up to 9% reduced veracity discernment and are up to 90% more likely to evaluate scientific evidence in ways congruent with their induced political identity, with prompt-based debiasing largely ineffective.
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
This paper presents a comprehensive empirical evaluation of how large language models handle corruptions in chain-of-thought reasoning steps, testing 13 models across 5 perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) on mathematical reasoning tasks. The findings reveal heterogeneous vulnerability patterns with implications for deploying LLMs in multi-stage reasoning pipelines.