The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
Summary
This paper identifies a 'positional copying' shortcut where small language models answer arithmetic questions by copying the last number before the answer delimiter, bypassing actual reasoning. This effect explains why shuffling CoT steps retains performance; it accounts for 89-92% of teacher-forcing accuracy in 1-3B models on GSM8K.
View Cached Full Text
Cached at: 05/25/26, 08:55 AM
# The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
Source: [https://arxiv.org/html/2605.22870](https://arxiv.org/html/2605.22870)
###### Abstract
Chain\-of\-thought \(CoT\) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance\. What does CoT contribute if not logical sequencing? In three 1–3B instruction\-tuned LMs on GSM8K, we isolate the answer\-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning\. Gold\-answer presence accounts for 54–92 pp of accuracy \(89–92% of each model’s teacher\-forcing ceiling\); even on incorrect items, the final answer matches the last CoT number 95–96% of the time\. The copy channel takes precedence over retained\-context completion: replacing the trailing number with a wrong value collapses accuracy to near\-zero despite correct intermediates, yet removing it recovers 5–32 pp above that floor—even single\-step arithmetic the model can otherwise perform is suppressed when a copyable number is present\. Qwen and Llama copy novel distractors 87–95% of the time; Gemma gates selectively\. Head\-level ablation implicates architecture\-specific head sets; the effect replicates on GSM\-Symbolic\. On non\-arithmetic BBH tasks, shuffle retention drops sharply; at 7–8B, content\-selective gating emerges\. Step\-level faithfulness evaluations risk conflating positional answer transport with genuine computation—a failure mode for CoT\-based oversight\.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
Ming LiuAmazonmlliuz@amazon\.com
\(a\) Clean CoTs1s\_\{1\}s2s\_\{2\}⋯\\cdots72\#\#\#\#72positional copy\(b\) Wrong\-number distractors1s\_\{1\}s2s\_\{2\}⋯\\cdots72999\#\#\#\#999copies wrong \#Figure 1:The*answer\-context\-gated positional readout*: the model reads whichever number appears in answer\-relevant context at the trailing position before the\#\#\#\#delimiter\. \(a\) When the correct answer is last, the readout yields the right output\. \(b\) In Qwen/Llama \(1–3B\), injecting a wrong number in answer context displaces gold and is copied 87–95% of the time; Gemma instead shows stronger content gating \(P\(distractor\)=\.12P\(\\text\{distractor\}\)\{=\}\.12–\.19\.19; §[4\.1](https://arxiv.org/html/2605.22870#S4.SS1)\)\.## 1Introduction
Chain\-of\-thought \(CoT\) prompting\(Wei et al\.,[2022](https://arxiv.org/html/2605.22870#bib.bib36); Kojima et al\.,[2022](https://arxiv.org/html/2605.22870#bib.bib13)\)is necessary: removing it crashes accuracy\(Lanham et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib15)\)\. Yet its step order barely matters: shuffling steps retains most performance\(Madaan and Yazdanbakhsh,[2022](https://arxiv.org/html/2605.22870#bib.bib18); Wang et al\.,[2023a](https://arxiv.org/html/2605.22870#bib.bib34)\)\. This tension is well\-documented but unexplained—*what*in CoT drives the answer if not logical sequencing? Prior work established that CoT can be unfaithful—Turpin et al\. \([2023](https://arxiv.org/html/2605.22870#bib.bib33)\)showed biased demonstrations alter answers despite correct reasoning steps;Lanham et al\. \([2023](https://arxiv.org/html/2605.22870#bib.bib15)\)quantified unfaithfulness via early answering and paraphrase metrics;Arcuschin et al\. \([2025](https://arxiv.org/html/2605.22870#bib.bib2)\)observed similar patterns in naturalistic settings—and that filler tokens partially substitute for reasoning\(Pfau et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib26)\); recent mechanistic studies have begun tracing information flow through CoT\(Dutta et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib7)\)and probing which reasoning steps matter\(Bogdan et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib3)\)\. These works document*that*readout can bypass reasoning but not*what specific signal*the readout uses,*where*in the network it operates, or*whether*it actively suppresses available computation\. We provide all three\.
We resolve this for 1–3B instruction\-tuned models on arithmetic by identifying an*answer\-context\-gated positional readout*\(Geirhos et al\.,[2020](https://arxiv.org/html/2605.22870#bib.bib9); McCoy et al\.,[2019](https://arxiv.org/html/2605.22870#bib.bib19)\): the model reads whichever number appears in answer\-relevant context at the trailing position before the answer delimiter, largely independent of intermediate computation\. The readout requires answer\-relevant framing \(Appendix[U](https://arxiv.org/html/2605.22870#A21)\) and is correctness\-independent in Qwen/Llama but content\-gated in Gemma \(Figure[1](https://arxiv.org/html/2605.22870#S0.F1)\)\. On three architectures we show*what*the model reads,*why*shuffling preserves it \(the answer token’s positional accessibility survives when logical order does not\), and*where*it breaks \(non\-arithmetic tasks; content\-selective retrieval at 7–8B\)\. Gold\-answer dependence is robust; the*content gate*—operationally, the degree to which the model rejects a novel distractor number at the trailing position, measured as1−P\(distractor\)1\-P\(\\text\{distractor\}\)—varies from absent \(Qwen: gate≈0\{\\approx\}0\) to strong \(Gemma: gate≈\.85\{\\approx\}\.85; 7–8B models: gate≥\.70\{\\geq\}\.70; §[7](https://arxiv.org/html/2605.22870#S7)\)\. Concurrent behavioral work\(Chen et al\.,[2026](https://arxiv.org/html/2605.22870#bib.bib5)\)corroborates frontier\-scale shuffle\-tolerance; we contribute the mechanism, which requires open\-weight access\.
This characterization emerges from four converging lines of evidence on three architectures \(Qwen2\.5\-1\.5B\-Instruct, Llama\-3\.2\-1B\-Instruct, Gemma\-2\-2B\-it\) on GSM8K\(Cobbe et al\.,[2021](https://arxiv.org/html/2605.22870#bib.bib6)\); Gemma participates fully in the corruption decomposition, shuffle hierarchy, and head\-level ablation, but its position/distractor results use a teacher\-forcing\-passing subset \(Appendix[T](https://arxiv.org/html/2605.22870#A20)\): \(i\) a corruption decomposition isolates gold\-answer*presence*as the dominant factor \(54–92 pp raw; 89–92% ceiling\-corrected\), and a seven\-condition causal ladder shows the copy channel takes precedence over available retained\-context completion \(§[3](https://arxiv.org/html/2605.22870#S3)\); \(ii\) the readout selects the trailing answer\-context number with a sharp final\-position jump \(\+20\+20–3131pp\) and novelty\-permissive copying in Qwen/Llama \(\.87–\.95 for novel numbers\) vs\. selective gating in Gemma \(\.12–\.19; §[4](https://arxiv.org/html/2605.22870#S4)\); \(iii\) head\-level ablations reveal architecture\-specific profiles from localized \(Llama\) to distributed \(Qwen; §[5](https://arxiv.org/html/2605.22870#S5)\); \(iv\) the shortcut is present in base weights, replicates on GSM\-Symbolic, shifts toward selectivity at 7–8B, and collapses on non\-arithmetic BBH \(§[6](https://arxiv.org/html/2605.22870#S6)\)\.
#### Scope and faithfulness\.
CoT use involves*rationale generation*\(which may compute\) and*answer readout*\(which maps a completed prefix to the final token\); our prefix\-completion interventions isolate the second stage\. FollowingLanham et al\. \([2023](https://arxiv.org/html/2605.22870#bib.bib15)\), we treat a readout as*faithful*when the answer depends on intermediate computation \(Δoff\-copy\\Delta\_\{\\text\{off\-copy\}\}\) rather than surface features \(Δcopy\\Delta\_\{\\text\{copy\}\}\)\. We make no claim about generation\-time computation—only that the readout does not faithfully use it when a trailing answer\-context number is available\. This undermines the premise of step\-level CoT monitors\(Lightman et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib16); Chen et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib4); Korbak et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib14)\): the readout takes precedence over available retained\-context completion\.
## 2Experimental Setup
#### Models\.
Qwen2\.5\-1\.5B\-Instruct\(Yang et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib39)\)\(28L, 1536d, 12/2 GQA heads\), Llama\-3\.2\-1B\-Instruct\(Grattafiori et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib10)\)\(16L, 2048d, 32/8 GQA heads\), and Gemma\-2\-2B\-it\(Riviere et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib27)\)\(26L, 2304d, 8/4 GQA heads\)\. All use grouped\-query attention\(GQA; Ainslie et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib1)\)with per\-head𝐨proj\\mathbf\{o\}\_\{\\text\{proj\}\}\(attention output projection\) columns, enabling per\-head ablation\. The three models span distinct tokenizer families \(BPE variants for Qwen/Llama; SentencePiece unigram for Gemma\), so convergent findings are not attributable to shared sub\-word segmentation of numbers\. Gemma appears in core experiments \(corruption decomposition, shuffle hierarchy, head\-level ablation, base\-model probe\); its position/distractor results \(§[4](https://arxiv.org/html/2605.22870#S4)\) show reduced teacher\-forcing fidelity under prefix\-structure perturbation \(∼60%\{\\sim\}60\\%vs\.∼99%\{\\sim\}99\\%on the corruption pipeline; Appendix[T](https://arxiv.org/html/2605.22870#A20)\)\.
#### Task and protocol\.
GSM8K test set\(Cobbe et al\.,[2021](https://arxiv.org/html/2605.22870#bib.bib6)\), first 500 problems\. Baseline greedy CoT accuracy: Qwen 67\.0%, Llama 45\.6%, Gemma 66\.2%\. All experiments use teacher\-forced prefix injection: a \(possibly modified\) CoT is injected as the beginning of the assistant turn using native chat templates, “\#\#\#\#” is appended, and the model generates the answer greedily\. This is the standard protocol for causal CoT analysis, isolating the readout from generation\-time confounds\(Lanham et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib15)\)\.111Gemma\-2\-2B\-it lacks native system\-prompt support; we prepend the system message to the user turn\.
#### Statistical methodology\.
Wilson 95% CIs\(Wilson,[1927](https://arxiv.org/html/2605.22870#bib.bib37)\)for proportions; McNemar’s exact test\(McNemar,[1947](https://arxiv.org/html/2605.22870#bib.bib21)\)for paired contrasts; Holm–Bonferroni correction\(Holm,[1979](https://arxiv.org/html/2605.22870#bib.bib12)\)within pre\-declared confirmatory sub\-families \(Appendix[S](https://arxiv.org/html/2605.22870#A19)\)\. Auxiliary analyses \(induction overlap, mean\-ablation, activation patching\) are exploratory with uncorrectedpp\-values\.
#### Notation\.
Δcopy=PB−PA\\Delta\_\{\\text\{copy\}\}\{=\}P\_\{B\}\{\-\}P\_\{A\},Δoff\-copy=PC−PB\\Delta\_\{\\text\{off\-copy\}\}\{=\}P\_\{C\}\{\-\}P\_\{B\}, andP\(residual\)=PAP\(\\text\{residual\}\)\{=\}P\_\{A\}are*accuracy contributions*\(differences between conditional accuracies\) that decomposePCP\_\{C\}additively; we useΔ\\Deltafor these counterfactual contrasts and reserveP\(⋅\)P\(\\cdot\)for event proportions \(P\(distractor\)P\(\\text\{distractor\}\),P\(gold\)P\(\\text\{gold\}\)\) measured on a single output distribution\. Causal\-ladder conditions Drep/Dtrunc/Dblankare defined in §[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\.
## 3What Drives CoT Readout?
Among correctly\-solved items, gold\-answer presence accounts for 54–92 pp \(89–92% of each model’s TF ceiling; §[3\.1](https://arxiv.org/html/2605.22870#S3.SS1)\); a causal ladder \(§[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\) reveals that correct intermediates carry a latent signal \(4–29 pp above no\-CoT\) that is masked when a trailing number is available\. We condition on baseline\-correct items; the free\-generation diagnostic \(§[3\.2](https://arxiv.org/html/2605.22870#S3.SS2)\) shows even incorrect answers match the last CoT number at \.905–\.957 across all three architectures\.
### 3\.1Isolating the Gold\-Presence Effect
Starting from correctly\-solved problems \(Qwen:n=335n\{=\}335; Llama:n=228n\{=\}228; Gemma:n=331n\{=\}331\), we construct three conditions by selectively corrupting numbers in the CoT prefix:A \(corrupt\-all\)replaces all numbers including gold occurrences;B \(preserve\-gold\)corrupts intermediates but leaves gold occurrences intact;C \(clean\)presents the original CoT\. Corruption uses deterministic per\-example seeding\.222Items where corruption preserves the gold value \(mostly single\-digit golds in Gemma\) are excluded from Condition A \(Qwen/Llama: 5 each; Gemma: 51\); including them yieldsPB−PA≥\.789P\_\{B\}\-P\_\{A\}\\geq\.789in all cases\. Conditions B and C use the full sample\.
Table 1:Corruption decomposition \(Qwenn=335n\{=\}335, Llaman=228n\{=\}228, Gemman=331n\{=\}331\)\. Wilson 95% CIs for condition accuracies \(rows A–C\); paired bootstrap 95% CIs over items for derived differences \(Δcopy\\Delta\_\{\\text\{copy\}\},Δoff\-copy\\Delta\_\{\\text\{off\-copy\}\}\);P\(residual\)=PAP\(\\text\{residual\}\)\{=\}P\_\{A\}uses the Condition A interval directly\. Gold\-presence dominates \(54–92 pp raw; 88–92% ceiling\-normalized\)\. Gemma’s lower rawΔcopy\\Delta\_\{\\text\{copy\}\}reflects reduced teacher\-forcing fidelity \(PC=\.604P\_\{C\}\{=\}\.604\) under the standard prefix format, not a weaker copy mechanism: ceiling\-normalized copy strength converges across architectures \(bottom row\)\.Δ\\Deltadenotes accuracy*attributable to*each component \(i\.e\., a difference between conditional accuracies\), not an event probability\.Gold\-presence \(Δcopy=PB−PA\\Delta\_\{\\text\{copy\}\}=P\_\{B\}\-P\_\{A\}\) accounts for 54–92 pp raw across three architectures \(Table[1](https://arxiv.org/html/2605.22870#S3.T1)\)\. Gemma’s lower raw value reflects its reduced TF\-fidelity \(PC=\.604P\_\{C\}\{=\}\.604; the standard prefix format interacts with Gemma’s chat template\)\. Ceiling\-normalized copy strength \(Δcopy/PC\\Delta\_\{\\text\{copy\}\}/P\_\{C\}\) converges at 88–92%, indicating a quantitatively comparable mechanism operating near each model’s achievable ceiling\. Intermediate\-step computation \(Δoff\-copy=PC−PB\\Delta\_\{\\text\{off\-copy\}\}=P\_\{C\}\-P\_\{B\}\) is at noise floor for Qwen/Llama and modest for Gemma\. Conditions A and B differ only in gold\-answer presence, cleanly isolating this effect\. Whether the copy is positional vs\. content\-selective is tested in §[4\.1](https://arxiv.org/html/2605.22870#S4.SS1)\.
### 3\.2Off\-Ceiling Stress Test
To create headroom for measuringΔoff\-copy\\Delta\_\{\\text\{off\-copy\}\}, we swap in a weaker configuration—Qwen\-base, 0\-shot \(PB=0\.735P\_\{B\}\{=\}0\.735,n=200n\{=\}200\)—with 26 pp of headroom, yetΔoff\-copy=0\.020\\Delta\_\{\\text\{off\-copy\}\}\{=\}0\.020\(bootstrap 95% CI\[−\.015,\.060\]\[\-\.015,\.060\]\)\. This is an order of magnitude below the\.730\.730copy contribution\. Correct intermediate steps add essentially no information to the readout\.
If trailing\-number alignment were a prefix\-injection artifact, it should disappear under unconstrained decoding\. It does not \(Table[2](https://arxiv.org/html/2605.22870#S3.T2)\): under free generation \(n=500n\{=\}500\), all three models’ final answers match the last CoT number 96–97% of the time\. Critically, on items the model gets*wrong*, the match rate is\.905\.905–\.957\.957—under a “compute\-then\-write” alternative, errors would distribute across non\-trailing numbers, but they are pinned to the trailing position\. Conditioning on whether gold occupies the last CoT position reveals a near\-deterministic gate: accuracy is\.991\.991–\.997\.997when gold is last vs\.\.004\.004–\.030\.030when it is not\. This full\-distribution analysis confirms the shortcut operates identically on items the model gets wrong, independent of any teacher\-forcing intervention\.
Table 2:Free\-generation analysis \(n=500n\{=\}500, unconstrained greedy decoding, no prefix injection\)\. Accuracy is near\-perfectly predicted by whether gold occupies the last CoT position\. Even on incorrect items, the answer reflexively matches the trailing number—the same positional\-copy signature confirmed on all three architectures\.
### 3\.3Copy Masks Retained\-Context Computation
The decomposition above leaves a residual question: doesΔoff\-copy≈0\\Delta\_\{\\text\{off\-copy\}\}\\approx 0reflect genuine inability to complete from retained context, or does the copy channel*override*an available completion pathway? We disambiguate with four additional conditions that vary the trailing slot independently of intermediate\-step correctness\.
#### D\-replace: correct intermediates, wrong trailing number\.
We hold intermediates correct \(as in Condition B\) but replace gold\-answer occurrences with a wrong number, yielding ConditionDrep\. Accuracy collapses to\.076\.076\(Qwen,n=328n\{=\}328\),\.084\.084\(Llama,n=214n\{=\}214\),\.010\.010\(Gemma,n=312n\{=\}312\)—within≤1\.1\{\\leq\}1\.1pp of Condition A \(\.073\.073,\.079\.079,\.000\.000\), where intermediates are also corrupted \(Table[3](https://arxiv.org/html/2605.22870#S3.T3); pairwise McNemarp\>\.01p\>\.01; all nine §[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)confirmatory contrasts yield rawp<10−8p<10^\{\-8\}and remain significant under Holm–Bonferroni correction at any family size up to the full 23\-contrast set\)\. The wrong number is copied withP\(distractor\)=\.924P\(\\text\{distractor\}\)=\.924\(Qwen\),\.911\.911\(Llama\),\.574\.574\(Gemma\)\. Clean intermediates provide no benefit when a trailing distractor is present\.
#### D\-truncate: intermediates only, no trailing number\.
To probe whether this completion ability exists but is masked,Dtrunctruncates the CoT before the first gold\-answer occurrence \(at the preceding sentence boundary, retaining∼80%\{\\sim\}80\\%of tokens\), leaving only correct intermediate steps with no trailing number\. Accuracy rises to\.399\.399\(Qwen,n=308n\{=\}308\),\.183\.183\(Llama,n=208n\{=\}208\),\.058\.058\(Gemma,n=292n\{=\}292\)—a\+32\+32/\+10\+10/\+5\+5pp gain over Drep: the same intermediates yield non\-trivial accuracy once the trailing distractor is removed\.333Only 1–2% of Dtruncitems have their new trailing number equal to gold \(Qwen 5/308, Llama 4/208, Gemma 3/292\); filtered rates \(\.389/\.172/\.055\) are virtually identical to overall rates\.A depth partition \(Table[10](https://arxiv.org/html/2605.22870#A1.T10)\) shows 74–77% of recovered items are 1\-op reachable from the last visible intermediate; the shortcut thus suppresses even single\-step arithmetic the model can otherwise perform\.
#### Controls: No\-CoT and D\-blank\.
Two controls close remaining attack surfaces\.No\-CoT\(direct answer, no rationale prefix\) yields\.106\.106\(Qwen\),\.019\.019\(Llama\),\.014\.014\(Gemma\)\. The2929pp gap Dtrunc−\{\-\}No\-CoT on Qwen \(McNemarp<10−12p<10^\{\-12\}\) shows the truncated intermediates carry genuine signal—the model is not re\-solving from the question alone\.Dblankappends a content\-free closing sentence to the truncated prefix \(restoring CoT format without a trailing number\):\.461\.461,\.197\.197,\.213\.213\. Dblank≈\{\\approx\}Dtruncon Qwen and Llama confirms that format completeness is not the confound; removing the trailing number, not the closing template, releases the available completion pathway\.
FloorRecomputeCeilingModelNo\-CoTADrepDtruncDblankBCQwen\-1\.5B\.106\.073\.076\.399\.4611\.0001\.000Llama\-1B\.019\.079\.084\.183\.197\.991\.991Gemma\-2B\.014\.000\.010\.058\.213\.571\.590*Copy\-override gap*\(Dtrunc−\{\-\}Drep\): Qwen\+32pp, Llama\+10pp, Gemma\+5pp*Retained\-context contribution*\(Dtrunc−\{\-\}No\-CoT\): Qwen\+29pp, Llama\+16pp, Gemma\+4ppP\(dist\)P\(\\text\{dist\}\)in Drep: Qwen \.924, Llama \.911, Gemma \.574
Table 3:Causal ladder \(P\(gold\)P\(\\text\{gold\}\), baseline\-correct, per\-experiment common\-index subsets\)\. Three regimes:floor\(No\-CoT, A, Drep\)—trailing number absent or wrong, accuracy collapses;recompute\(Dtrunc, Dblank\)—removing distractor reveals available completion pathway;ceiling\(B, C\)—gold presence\. Gemma B/C lower than Table[1](https://arxiv.org/html/2605.22870#S3.T1)because truncation\-length and leak filters disproportionately exclude Gemma items; per\-conditionnnin Appendix[S](https://arxiv.org/html/2605.22870#A19)\.
#### Copy\-over\-recompute\.
Together: \(a\) retained\-context completion exists \(Dtrunc≫\\ggNo\-CoT, Qwen\+29\+29pp,p<10−12p\{<\}10^\{\-12\}\); \(b\) the copy channel takes precedence \(Drep≈\\approxA despite identical intermediates\); \(c\) the precedence is answer\-context\-gated \(Dblank≈\\approxDtrunc; cf\. §[4\.2](https://arxiv.org/html/2605.22870#S4.SS2), Appendices[U](https://arxiv.org/html/2605.22870#A21),[E](https://arxiv.org/html/2605.22870#A5)\)\. “Override” is operational: the output matches the trailing number rather than the completion pathway’s prediction; we remain agnostic between active suppression and passive monopolization\. The magnitude tracks copy strength: Qwen \(P\(distractor\)=\.924P\(\\text\{distractor\}\)\{=\}\.924\) shows the largest Dtruncgain \(\+32\+32pp\); Gemma \(\.574\.574\) the smallest, with a\+15\+15pp Dblank−\{\-\}Dtruncgap suggesting additional format sensitivity\. For CoT\-based oversight, this is the load\-bearing finding: evaluations that vary intermediates while preserving the trailing answer slot systematically underestimate available computation\.
## 4How Is the Answer Selected?
Three converging tests distinguish end\-anchored copy from content\-selective retrieval or generic recency\(cf\. Liu et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib17)\): both models copy wrong numbers placed at the end \(§[4\.1](https://arxiv.org/html/2605.22870#S4.SS1)\); fixing position recovers the shuffle drop with a sharp final\-slot jump \(§[4\.2](https://arxiv.org/html/2605.22870#S4.SS2)\); and a seven\-level hierarchy shows performance tracks token accessibility \(§[4\.3](https://arxiv.org/html/2605.22870#S4.SS3)\)\.
### 4\.1The Model Copies Wrong Numbers
If the readout is content\-selective for gold, it should resist distractors\(cf\. Shi et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib28)\)\. Starting from the keep\-end condition \(accuracy≈1\.0\{\\approx\}1\.0\), we append a wrong number between the answer step and the delimiter\. We pre\-register a decision rule:P\(distractor\)\>0\.70P\(\\text\{distractor\}\)\>0\.70rejects gold\-specific retrieval as the dominant readout \(a content\-selective readout predictsP\(distractor\)P\(\\text\{distractor\}\)near 0;\.70\.70ensures rejection requires substantial copy\-dominance; observed values exceed this by 17–25 pp, so conclusions are insensitive to the exact threshold within\[\.50,\.85\]\[\.50,\.85\]\)\.444Gemma’s Experiments 3–4 are excluded from headline comparisons; teacher\-forcing fidelity falls below the\.80\.80threshold \(Appendix[T](https://arxiv.org/html/2605.22870#A20)\)\. On the fidelity\-passing subset \(n=196n\{=\}196\),P\(distractor\)=\.12P\(\\text\{distractor\}\)\{=\}\.12–\.19\.19\.
Five conditions test adjacent \(gold±\\pm1\), random \(same digit\-count\), and control distractors:
Table 4:Distractor injection results\. C0: baseline \(no distractor\); C0b: non\-numeric filler; C1: adjacent wrong number \(gold±1\{\\pm\}1\); C2: random wrong number; C3: gold duplicated \(positive control\)\. Both models copy a novel wrong number 87–95% of the time\.Both novel\-distractor conditions exceed the\.70\.70threshold by wide margins \(Table[4](https://arxiv.org/html/2605.22870#S4.T4); one\-sided exact binomial: C1p<10−15p<10^\{\-15\}; C2p<10−8p<10^\{\-8\}\)\. With novel numbers, the mechanism is predominantly not content\-selective for gold in the 1–3B regime\. A∼7\{\\sim\}7pp C1–C2 gap suggests minor content sensitivity, well below\.70\.70\. The asymmetric C0b sensitivity \(Qwen−22\-22pp vs\. Llama−0\.4\-0\.4pp from non\-numeric filler\) mirrors the topology contrast in §[5](https://arxiv.org/html/2605.22870#S5): Qwen’s distributed copy circuit is destabilized by any trailing material, whereas Llama’s concentrated circuit implements a tighter numeric\-target selector\.
Novel\-delimiter controls \(P\(distractor\)≥\.90P\(\\text\{distractor\}\)\\geq\.90; Appendix[E](https://arxiv.org/html/2605.22870#A5)\) rule out delimiter surface form\. The “answer\-context\-gated” qualifier is established by a framing\-dissociation experiment \(Table[5](https://arxiv.org/html/2605.22870#S4.T5)\): bare numbers at the trailing position yield low copying \(\.45/\.26\), while answer\-relevant inline framing \(“actually it should be X”\) recovers high copying \(\.77/\.85\)—same number, same position, butP\(distractor\)P\(\\text\{distractor\}\)swings∼2×\{\\sim\}2\{\\times\}based purely on whether the surrounding text is answer\-relevant\. The readout is gated by answer\-context framing, not raw position or template surface form\.
Table 5:Framing\-dissociation control \(n=335n\{=\}335Qwen;n=227n\{=\}227Llama\)\. F\-codes denote framing conditions \(distinct from distractor C\-codes in Table[4](https://arxiv.org/html/2605.22870#S4.T4)\)\. The readout is*answer\-context\-gated*: bare trailing numbers \(F2\) and non\-answer framing \(F3\) fall well below the\.70\.70threshold; answer\-relevant inline text \(F4\) triggers copying comparably to standard templates \(F1\)\. This dissociates the mechanism from both raw positional recency and template\-specific parsing\.#### Intermediate\-result distractor\.
When the trailing number is an*intermediate computation result*\(e\.g\., a subtotal from an earlier step\)—already bound to a semantic role in the CoT—the models diverge \(n=200n\{=\}200\)\. Qwen copies it 86% \[\.81,\.90\], consistent with novelty\-permissive copying\. Llama copies it only 25% \[\.20,\.31\], recovering the gold answer 74\.5% of the time—evidence of a weak repetition\-suppression filter layered on the dominant positional mechanism \(25% remains far above the∼1%\{\\sim\}1\\%no\-distractor error floor, confirming positional copying still dominates\)\. Architectures thus differ not in*whether*the shortcut is answer\-context\-gated but in*how sharply*the gate discriminates non\-novel numerals: Qwen<<Llama<<Gemma\.
### 4\.2Fixing Position Recovers the Shuffle Drop
If the readout is end\-anchored, most of the shuffle drop should be recoverable by controlling where the answer step appears\. We test four conditions:ordered,full\_shuffle,keep\_end\(non\-answer steps shuffled, answer fixed at end\), andmove\_front\(answer at front\)\.
00\.250\.250\.50\.50\.750\.75110\.50\.50\.60\.60\.70\.70\.80\.80\.90\.911Answer step fractional positionAccuracyQwen \(r=1\.0r\{=\}1\.0\)Llama \(r=\.90r\{=\}\.90\)Gemma \(r=\.00r\{=\}\.00\)Figure 2:Answer\-position curve \(5\-position sweep,n=123n\{=\}123–315315\)\. Qwen/Llama show a shallow\-gradient\-then\-jump profile: accuracy rises non\-monotonically over 0–75% then jumps\+31\+31/\+20\+20pp at position 1\.0 \(end\)\. Gemma is flat \(teacher\-forcing fidelity issue; Appendix[T](https://arxiv.org/html/2605.22870#A20)\)\. Discrete conditions: keep\_end \(pos==1\.0\) recovers ordered accuracy within≤1\{\\leq\}1pp \(McNemarp≥\.25p\\geq\.25\)\.Keeping the answer step at the end recovers 99% \(Qwen\) / 93% \(Llama\) of the shuffle drop \(Figure[2](https://arxiv.org/html/2605.22870#S4.F2)\); ordered and keep\_end differ by≤1\{\\leq\}1pp \(McNemarp≥\.25p\\geq\.25\)\. The shallow\-gradient\-then\-jump profile—\+7\+7/\+10\+10pp over positions 0–75%, then\+31\+31/\+20\+20pp at position 1\.0—is inconsistent with smooth recency weighting\.
A positional\-encoding control \(Appendix[C](https://arxiv.org/html/2605.22870#A3)\) rules out RoPE\(Su et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib31)\)artifacts: monotonic position\-ID stretching \(2\.5×2\.5\{\\times\}\) causes no measurable accuracy loss \(n=66n\{=\}66\), showing that position\-encoding perturbation does not account for the shuffle drop\.
### 4\.3Performance Tracks Token Accessibility
The binary keep\_end/move\_front contrast is coarse\. A seven\-level shuffle hierarchy decomposes the perturbation along structural granularity: step\-level \(step\_shuffle, reverse\_order\), word\-level \(within\_step, word\_shuffle\), and token\-level \(token\_shuffle\), plus ordered and no\_cot anchors\.
OrderedWithin\-stepStep\-shuffleWord\-shuffleReverseToken\-shuffleNo\-CoT02525505075751001000Retention \(%\)QwenLlamaGemmaFigure 3:Shuffle hierarchy with bootstrap 95% CIs \(n=228n\{=\}228–335335items\)\. Retention \(no\-CoT\-anchored\)=\(Pcond−Pno\_cot\)/\(Pord−Pno\_cot\)=\(P\_\{\\text\{cond\}\}\-P\_\{\\text\{no\\\_cot\}\}\)/\(P\_\{\\text\{ord\}\}\-P\_\{\\text\{no\\\_cot\}\}\)anchors 0% at no\-CoT and 100% at ordered\. Step\-level conditions preserve the answer token and retain 68–87%; token\-level shuffling destroys token identity and collapses to no\-CoT\. Negative values indicate active interference\.The hierarchy is monotone in answer\-token accessibility \(Figure[3](https://arxiv.org/html/2605.22870#S4.F3)\): within each granularity level, conditions preserving the answer at the end \(within\_step\) outperform those displacing it \(reverse\_order\), so accessibility—not logical coherence—is the controlling variable\. Self\-generated CoT shuffle replicates the pattern onN=2N\{=\}2\(Appendix[I](https://arxiv.org/html/2605.22870#A9)\)\.
## 5Architectural Variation in Head\-Level Copy Sensitivity
Three interventions—zero\-ablation, mean\-ablation, and activation patching—converge on architecture\-specific copy\-sensitive head sets \(Table[6](https://arxiv.org/html/2605.22870#S5.T6)\)\. All three architectures participate in zero\-ablation and mean\-ablation; activation patching is reported for Qwen \(the most distributed profile, where convergent evidence is most diagnostic\)\. Gemma’s zero\-ablation results \(K=505\{\}\_\{50\}\{=\}5, L14H4 single\-head−19\.4\-19\.4pp\) and induction\-head overlap \(p<0\.01p\{<\}0\.01\) are fully included\.
### 5\.1Zero\-Ablation Reveals Three Distinct Sensitivity Profiles
FollowingWu et al\. \([2024](https://arxiv.org/html/2605.22870#bib.bib38)\), we rank heads by mean attention mass from the last prefix token to gold\-answer tokens \(n=100n\{=\}100held\-out\), then zero their𝐨proj\\mathbf\{o\}\_\{\\text\{proj\}\}output columns on a disjoint evaluation set \(n=100n\{=\}100\), ablating at query\-head granularity under GQA\. Specificity is tested with random\-5 head sets \(Llama: 20 layer\-stratified, extended ton=1,000n\{=\}1\{,\}000permutation; Table[6](https://arxiv.org/html/2605.22870#S5.T6)\)\.
Table 6:Zero\-ablation sensitivity \(n=100n\{=\}100Qwen/Llama;n=165n\{=\}165Gemma\)\. Gemma’s lower baseline \(\.697\) reflects its TF\-fidelity\-passing evaluation subset\. Llama: 20pp drop from 5 heads, zero from random controls \(permutationp<\.001p\{<\}\.001,n=1,000n\{=\}1\{,\}000\)\. Gemma: L14H4 single\-head−19\.4\-19\.4pp\. Qwen: 0pp from top\-5; cumulative ablation collapses atK50=17K\_\{50\}\{=\}17, and mean\-ablation shows real signal \(−33\.5\-33\.5pp atK=20K\{=\}20; §[5\.2](https://arxiv.org/html/2605.22870#S5.SS2)\)\.Three profiles emerge \(Table[6](https://arxiv.org/html/2605.22870#S5.T6)\): Llama is*localized*\(L10–L15; permutationp<\.001p\{<\}\.001,n=1,000n\{=\}1\{,\}000\); Gemma is*concentrated but compensable*\(L14H4 dominant under zero\-ablation, collectively load\-bearing under mean\-ablation; §[5\.2](https://arxiv.org/html/2605.22870#S5.SS2)\); Qwen is*distributed*\(K50=17K\_\{50\}\{=\}17, sharp phase transition atK≈14K\{\\approx\}14–1515\)\. After ablating Llama’s top\-5 \(n=100n\{=\}100, acc\.=\.79\.\{=\}\.79; 21 failures\),0/210/21are verbatim copies;14/2114/21produce2×2\{\\times\}/3×3\{\\times\}gold—a shift from copy\-errors to recomputation\-errors, indicating selectivity for the copy pathway rather than a general arithmetic deficit\.
### 5\.2Converging Mechanistic Evidence
To address OOD concerns with zero\-ablation\(Zhang and Nanda,[2024](https://arxiv.org/html/2605.22870#bib.bib40)\), mean\-ablation \(replacing each head’s output with its mean over a held\-out reference set\) reproduces copy\-specific top\-KKdrops on all three architectures: Gemma−52\-52pp atK=20K\{=\}20, Qwen−33\.5\-33\.5pp \(K50=11K\_\{50\}\{=\}11, retaining43%43\\%of zero\-ablation drop\), Llama−6\.5\-6\.5pp peak\. Single\-head effects diverge between protocols \(Gemma L14H4:−19\.4\-19\.4pp zero vs\.−0\.5\-0\.5pp mean\), but aggregate results converge\. Activation patching on Qwen recovers 61% of the shuffle gap \(Appendix[Q](https://arxiv.org/html/2605.22870#A17)\)\. Content\-tracking and induction\-head analyses \(Appendix[P](https://arxiv.org/html/2605.22870#A16)\) corroborate a convergent\-to\-diffuse gradient: Gemma/Llama show significant induction/copy overlap \(p<0\.01p\{<\}0\.01\); Qwen’s is near\-chance\. We applyWu et al\. \([2024](https://arxiv.org/html/2605.22870#bib.bib38)\)’s attention\-mass ranking outside its original long\-context setting; relative to prior arithmetic\-circuit work\(Nikankin et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib23); Dziri et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib8); Stolfo et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib30); McDougall et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib20)\), we document a regime where the readout operates largely independently of internal arithmetic\.
## 6Generalization and Boundaries
We test memorization, instruction\-tuning artifact, scale, and task specificity\.
### 6\.1Novel\-Instantiation Replication \(GSM\-Symbolic\)
GSM\-Symbolic\(Mirzadeh et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib22)\)\(main config, test split\) regenerates GSM8K\-style problems from symbolic templates with numeric values drawn outside the GSM8K distribution, eliminating verbatim overlap with pre\-training data\. Step\-shuffle retains 73% \(Qwen,n=74n\{=\}74\) / 81% \(Llama,n=86n\{=\}86\)555Qwen ceiling\-corrected; Llama uncorrected \(ordered acc\. = 1\.0\)\.—consistent with the GSM8K values \(68% / 83%\)\. On SVAMP\(Patel et al\.,[2021](https://arxiv.org/html/2605.22870#bib.bib25)\), a structurally distinct arithmetic word\-problem benchmark \(n=300n\{=\}300\), the copy decomposition replicates:Δcopy=1\.00\\Delta\_\{\\text\{copy\}\}=1\.00\(Qwen\),\.990\.990\(Llama\),\.603\.603\(Gemma; reduced by TF\-fidelity; Appendix[M](https://arxiv.org/html/2605.22870#A13)\)\. The copy phenomenon is not an artifact of memorization or GSM8K\-specific distributional features\.
### 6\.2Base\-Model Origin
The copy mechanism is present in base\-model weights \(Qwen\-baseΔcopy=\.730\\Delta\_\{\\text\{copy\}\}\{=\}\.730; Gemma\-base\.335\.335; Llama\-base\.975\.975with 4\-shot; Appendix[O](https://arxiv.org/html/2605.22870#A15)\)\. Instruction\-tuning amplifies copy strength \(Gemma1\.6×1\.6\{\\times\}raw,\.34→\.54\.34\{\\to\}\.54; ceiling\-normalized\.77→\.90\.77\{\\to\}\.90\) but does not create the mechanism\. Together with the novel\-delimiter control \(Appendix[E](https://arxiv.org/html/2605.22870#A5)\) and GSM\-Symbolic replication \(§[6\.1](https://arxiv.org/html/2605.22870#S6.SS1)\), these results rule out IT\-introduced format prior and delimiter\-specific surface\-form learning\.
### 6\.3At 7–8B, Gold\-Presence Persists but Copying Becomes Selective
Scaling up by5×5\{\\times\}–8×8\{\\times\}, we test Qwen2\.5\-7B\-Instruct and Llama\-3\.1\-8B\-Instruct\.666Gemma\-2\-9B\-it excluded: pipeline debugging incomplete within submission timelines \(Appendix[T](https://arxiv.org/html/2605.22870#A20)\); not evidence about Gemma scaling\.
Table 7:Scale extension \(n=200n\{=\}200evaluated; Llama\-8B copy metrics computed onn=163n\{=\}163correct\-baseline items\)\. Both 7–8B models retainΔcopy\>\.70\\Delta\_\{\\text\{copy\}\}\>\.70whileP\(distractor\)P\(\\text\{distractor\}\)drops well below the\.70\.70threshold\. 1–3B column: Qwen\-1\.5B / Llama\-1B\. Qwen\-7BP\(distractor\)P\(\\text\{distractor\}\)uses the original single\-condition protocol; Llama\-8B range spans adjacent \(C1\) and random \(C2\) conditions\.Both 7–8B models retain gold\-presence dependence \(Δcopy≥\.798\\Delta\_\{\\text\{copy\}\}\\geq\.798\), butP\(distractor\)P\(\\text\{distractor\}\)drops below\.70\.70\(Table[7](https://arxiv.org/html/2605.22870#S6.T7)\): the novelty\-permissive characterization does not hold at 7–8B\. Llama\-8B develops a substantial off\-copy pathway \(Δoff\-copy=\.196\\Delta\_\{\\text\{off\-copy\}\}\{=\}\.196vs\. near\-zero at 1B\)\. WithN=2N\{=\}2at this scale, these results are preliminary observations rather than a scaling claim\.
An RL\-distilled reasoning model \(DeepSeek\-R1\-Distill\-Qwen\-1\.5B;Guo et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib11)\) retains96%96\\%ceiling\-normalized copy\-driven accuracy despite generating longer rationales \(§[7](https://arxiv.org/html/2605.22870#S7); Appendix[K](https://arxiv.org/html/2605.22870#A11)\)\.
### 6\.4A Predicted Boundary: Shuffle Tolerance Disappears on Non\-Arithmetic BBH
The copy account predicts shuffle tolerance should vanish when no copyable numeric answer occupies the trailing position\. On BIG\-Bench\-Hard logical deduction\(Suzgun et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib32); Srivastava et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib29)\)\(3\-way MCQ, letter answers\), retention drops to 44%/21% \(Qwen/Llama\)777Gemma’s BBH baseline is below chance \(1/3\), so retention is undefined\.—far below 68–87% on GSM8K\. A second task \(tracking shuffled objects\) replicates: chance\-corrected retention888\(pshuffled−1/3\)/\(pordered−1/3\)\(p\_\{\\text\{shuffled\}\}\-1/3\)/\(p\_\{\\text\{ordered\}\}\-1/3\); Appendix[N](https://arxiv.org/html/2605.22870#A14)\.of 6\.8% \(Llama\) / 45\.5% \(Qwen\)\. The predicted collapse occurs on both tasks\. These tasks differ from GSM8K along multiple axes \(reasoning type, answer format, chance floor\), so we characterize this as a boundary observation consistent with the shortcut’s preconditions rather than causal isolation of a single factor\.
## 7Discussion
The readout factors into two independently\-varying components: robust gold\-answer dependence \(all five model×\{\\times\}scale cells\) and a content gate ranging from absent \(Qwen\) through partial \(Llama\-1B\) and strong \(Gemma\-2B\) to dominant at 7–8B\. The causal ladder \(§[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\) shows the copy channel takes precedence over available retained\-context completion—predominantly single\-step \(74–77% of recovered items; Table[10](https://arxiv.org/html/2605.22870#A1.T10)\)—making the shortcut’s dominance over even minimal arithmetic all the more diagnostic\. The override magnitude \(5–32 pp\) tracks copy\-channel strength\. Because gold\-presence dependence holds universally while the gate varies, order\-insensitivity is a property of the readout’s invariant tier\. Table[8](https://arxiv.org/html/2605.22870#S7.T8)maps ten alternative explanations to the dedicated control that addresses each\.
Table 8:Ten alternative explanations for the readout shortcut and the controls that address each\. Every plausible confound is tested by at least one dedicated experiment or control condition\.#### Implications for CoT\-based oversight\.
Process\-reward models\(Lightman et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib16)\)and step\-level monitors\(Chen et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib4); Korbak et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib14)\)presuppose step quality is informative about the answer\-producing computation\. At 1–3B, the readout is largely independent of intermediate steps, so step\-level rewards operate on a signal only weakly coupled to the answer\-producing computation \(Dtrunc−\{\-\}No\-CoT=4=4–2929pp; §[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\)\. Note that process\-reward models studied byLightman et al\. \([2023](https://arxiv.org/html/2605.22870#bib.bib16)\)target much larger generators \(GPT\-4 scale\); our concern applies specifically to 1–3B models, where content\-selectivity is weakest\. The override result \(§[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\) sharpens this concern: a wrong answer\-context number in the trailing slot is*worse*than no trailing number \(Drep<<Dtrunc\), meaning a partially\-corrupted CoT can be more misleading for oversight than a truncated one\. As content\-selectivity emerges at 7–8B \(§[6\.3](https://arxiv.org/html/2605.22870#S6.SS3)\), monitorability should partially recover; output\-side validators offer a practical mitigation for small\-model deployments\.
#### Reasoning\-trained models\.
DeepSeek\-R1\-Distill\-Qwen\-1\.5B\(Guo et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib11)\)retains the shortcut despite longer rationales: ceiling\-normalizedΔcopy=\.928\\Delta\_\{\\text\{copy\}\}\{=\}\.928\(96%96\\%; Appendix[K](https://arxiv.org/html/2605.22870#A11)\)\. ItsP\(distractor\)=\.706P\(\\text\{distractor\}\)\{=\}\.706\[\.636,\.768\]\[\.636,\.768\]\(n=180n\{=\}180\) sits below Qwen/Llama \(\.87–\.95\), indicating that reasoning training introduces partial content selectivity without removing the positional copy mechanism\. The persistence claim rests onΔcopy\\Delta\_\{\\text\{copy\}\}; the lower distractor acceptance suggests the content gate is partially engaged even at 1\.5B when reasoning\-trained\.
#### Practical recommendations\.
For 1–3B deployments: output\-side answer verification should be preferred over step\-level process rewards, since the readout’s near\-independence from intermediate steps \(§[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\) leaves step\-level signals only weakly coupled to the answer\-producing computation; stripping answer\-template framing may partially disrupt the shortcut \(Table[5](https://arxiv.org/html/2605.22870#S4.T5)\); and monitoring whether the answer matches the last CoT number provides a simple copy\-dominance diagnostic\.
## 8Conclusion
In three 1–3B instruction\-tuned LMs, the CoT readout is dominated by a positional shortcut: the model reads the trailing number in answer\-relevant context\. A causal ladder reveals that intermediate steps carry exploitable retained\-context signal \(4–29 pp above no\-CoT\) that is masked when a trailing number is available—the copy channel takes precedence over retained\-context completion\. Gold\-answer dependence is robust; content\-gating varies by architecture and scale\.
## Limitations
#### Scope\.
Primary evidence covers 1–3B instruction\-tuned models on arithmetic \(GSM8K, GSM\-Symbolic, SVAMP \[Appendix[M](https://arxiv.org/html/2605.22870#A13)\]\)\. The copy\-detection paradigm requires a numeric gold answer at an identifiable trailing position, which structurally restricts the primary scope to arithmetic; we test the predicted boundary on two non\-numeric BBH tasks \(§[6\.4](https://arxiv.org/html/2605.22870#S6.SS4)\)\. Scale extension to 7–8B isN=2N\{=\}2\(Qwen, Llama\); Gemma\-2\-9B\-it was deferred for pipeline reasons \(see §[6\.3](https://arxiv.org/html/2605.22870#S6.SS3)footnote\), not because preliminary data was unfavorable\. Non\-arithmetic coverage isN=2N\{=\}2BBH tasks \(logical deduction, tracking shuffled objects\), where shuffle retention collapses to 7–46% chance\-corrected \(§[6\.4](https://arxiv.org/html/2605.22870#S6.SS4)\); extending the diagnostic to non\-numeric reasoning would require a different operationalization of answer\-position copying and remains open\. Reasoning\-trained model coverage isN=1N\{=\}1\(DeepSeek\-R1\-Distill\-Qwen\-1\.5B\); whether the shortcut persists across reasoning\-distillation methods or base architectures remains open\.
#### Architecture\-specific caveats\.
Content\-blindness varies across architectures \(§[4\.1](https://arxiv.org/html/2605.22870#S4.SS1)\)\. Gemma’s Exp 3–4 are excluded due to teacher\-forcing fidelity drops \(Appendix[T](https://arxiv.org/html/2605.22870#A20)\)\.
#### Methodology\.
Causal contrasts use prefix completion\(Lanham et al\.,[2023](https://arxiv.org/html/2605.22870#bib.bib15)\), the field\-standard protocol for isolating readout from generation\-time confounds\. Three ecological checks corroborate the readout\-shortcut pattern under unconstrained decoding: \(i\) free\-generation answers match the last CoT number∼97%\{\\sim\}97\\%overall and\.905\.905–\.957\.957on incorrect items \(Appendix[G](https://arxiv.org/html/2605.22870#A7),N=3N\{=\}3\); \(ii\) self\-generated CoT shuffle replicates the step\>\{\>\}word\>\{\>\}token hierarchy \(Appendix[I](https://arxiv.org/html/2605.22870#A9),N=2N\{=\}2step\-only,N=1N\{=\}1full\); \(iii\) gold\-position gating in free generation \(\.991\.991–\.997\.997vs\.\.004\.004–\.030\.030\) mirrors the prefix\-completion decomposition\. Distractor and position effects under unconstrained generation remain future work\. Dtruncretains∼80%\{\\sim\}80\\%of the CoT and provides a*lower bound*on retained\-context contribution; the retained prefix may end with a near\-gold number \(e\.g\., a penultimate\-step result\), so the Dtrunc−\{\-\}No\-CoT gap reflects any exploitation of retained context, including one\-step completion as well as multi\-step retained\-context derivation \(a depth partition in Table[10](https://arxiv.org/html/2605.22870#A1.T10)quantifies this: 74–77% of items are 1\-op reachable\)\. Head\-level ablation identifies sensitivity profiles, not complete circuits; behavioral evidence \(§[3](https://arxiv.org/html/2605.22870#S3)–[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\) is primary for the copy\-dominance claim, with mechanistic localization \(§[5](https://arxiv.org/html/2605.22870#S5)\) as supporting evidence\.
## Ethics Statement
We restrict experiments to controlled prefix interventions on open\-weight models \(Qwen2\.5 \[Apache\-2\.0\], Llama\-3 \[Llama Community License\], Gemma\-2 \[Gemma Terms of Use\]\) and open datasets \(GSM8K \[MIT\], GSM\-Symbolic \[CC\-BY\-NC\-ND\-4\.0\], BBH \[MIT\]\)\. Compute for reported experiments:∼40\{\\sim\}40GPU\-hours; including pilot runs and debugging iterations, total project compute is∼60\{\\sim\}60–8080GPU\-hours \(Appendix[B](https://arxiv.org/html/2605.22870#A2)\)\. Practitioners deploying small\-model CoT monitors should not treat rationales as load\-bearing evidence of the answer\. Our findings expose a vulnerability that could be exploited: an adversary aware of the trailing\-number copy channel could construct CoTs whose intermediate steps appear to reason correctly while routing the final answer through the positional shortcut, evading step\-level monitors\. This risk is most acute for 1–3B models in cost\-constrained deployments\. We mitigate disclosure risk by restricting demonstrations to open\-weight models in the 1–3B regime, reporting concrete defenses \(§[7](https://arxiv.org/html/2605.22870#S7)\), and showing that content\-selectivity recovers at 7–8B\. We believe public disclosure is net\-positive for safety: it equips evaluators with a diagnostic and discourages premature reliance on small\-model CoT as evidence of computation\.
## References
- Ainslie et al\. \(2023\)Joshua Ainslie, James Lee\-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai\. 2023\.GQA: Training generalized multi\-query transformer models from multi\-head checkpoints\.In*Proceedings of EMNLP*\.
- Arcuschin et al\. \(2025\)Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy\. 2025\.Chain\-of\-thought reasoning in the wild is not always faithful\.*arXiv preprint arXiv:2503\.08679*\.
- Bogdan et al\. \(2025\)Paul C\. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy\. 2025\.Thought anchors: Which LLM reasoning steps matter?*arXiv preprint arXiv:2506\.19143*\.
- Chen et al\. \(2025\)Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R\. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez\. 2025\.Reasoning models don’t always say what they think\.*arXiv preprint arXiv:2505\.05410*\.
- Chen et al\. \(2026\)Yi\-Chang Chen, Feng\-Ting Liao, Da\-shan Shiu, and Hung\-yi Lee\. 2026\.Rethinking dense sequential chains: Reasoning language models can extract answers from sparse, order\-shuffling chain\-of\-thoughts\.*arXiv preprint arXiv:2605\.07307*\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others\. 2021\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*\.
- Dutta et al\. \(2024\)Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, and Tanmoy Chakraborty\. 2024\.How to think step\-by\-step: A mechanistic understanding of chain\-of\-thought reasoning\.*arXiv preprint arXiv:2402\.18312*\.
- Dziri et al\. \(2023\)Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D\. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi\. 2023\.Faith and fate: Limits of transformers on compositionality\.In*NeurIPS*\.
- Geirhos et al\. \(2020\)Robert Geirhos, Jörn\-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A\. Wichmann\. 2020\.Shortcut learning in deep neural networks\.*Nature Machine Intelligence*, 2:665–673\.
- Grattafiori et al\. \(2024\)Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aieleen Letman, Akhil Mathur, Alan Schelten, and 1 others\. 2024\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, and 1 others\. 2025\.DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\.*Nature*, 645\(8081\):633–638\.
- Holm \(1979\)Sture Holm\. 1979\.A simple sequentially rejective multiple test procedure\.*Scandinavian Journal of Statistics*, 6\(2\):65–70\.
- Kojima et al\. \(2022\)Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa\. 2022\.Large language models are zero\-shot reasoners\.In*NeurIPS*\.
- Korbak et al\. \(2025\)Tomek Korbak, Mikita Balesni, Elizabeth Barnes, and 1 others\. 2025\.Chain of thought monitorability: A new and fragile opportunity for AI safety\.*arXiv preprint arXiv:2507\.11473*\.
- Lanham et al\. \(2023\)Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, and 1 others\. 2023\.Measuring faithfulness in chain\-of\-thought reasoning\.*arXiv preprint arXiv:2307\.13702*\.
- Lightman et al\. \(2023\)Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. 2023\.Let’s verify step by step\.*arXiv preprint arXiv:2305\.20050*\.
- Liu et al\. \(2024\)Nelson F\. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. 2024\.Lost in the middle: How language models use long contexts\.*Transactions of the ACL*, 12:157–173\.
- Madaan and Yazdanbakhsh \(2022\)Aman Madaan and Amir Yazdanbakhsh\. 2022\.Text and patterns: For effective chain of thought, it takes two to tango\.*arXiv preprint arXiv:2209\.07686*\.
- McCoy et al\. \(2019\)Tom McCoy, Ellie Pavlick, and Tal Linzen\. 2019\.Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference\.In*Proceedings of ACL*, pages 3428–3448\.
- McDougall et al\. \(2023\)Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda\. 2023\.Copy suppression: Comprehensively understanding an attention head\.*arXiv preprint arXiv:2310\.04625*\.
- McNemar \(1947\)Quinn McNemar\. 1947\.Note on the sampling error of the difference between correlated proportions or percentages\.*Psychometrika*, 12\(2\):153–157\.
- Mirzadeh et al\. \(2025\)Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar\. 2025\.GSM\-Symbolic: Understanding the limitations of mathematical reasoning in large language models\.In*ICLR*\.
- Nikankin et al\. \(2025\)Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov\. 2025\.Arithmetic without algorithms: Language models solve math with a bag of heuristics\.In*ICLR*\.
- Olsson et al\. \(2022\)Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, and 1 others\. 2022\.In\-context learning and induction heads\.*Transformer Circuits Thread*\.
- Patel et al\. \(2021\)Arkil Patel, Satwik Bhattamishra, and Navin Goyal\. 2021\.Are NLP models really able to solve simple math word problems?In*Proceedings of NAACL\-HLT*, pages 2080–2094\.
- Pfau et al\. \(2024\)Jacob Pfau, William Merrill, and Samuel R\. Bowman\. 2024\.Let’s think dot by dot: Hidden computation in transformer language models\.In*Conference on Language Modeling \(COLM\)*\.
- Riviere et al\. \(2024\)Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, and 1 others\. 2024\.Gemma 2: Improving open language models at a practical size\.*arXiv preprint arXiv:2408\.00118*\.
- Shi et al\. \(2023\)Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou\. 2023\.Large language models can be easily distracted by irrelevant context\.In*ICML*\.
- Srivastava et al\. \(2023\)Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R\. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga\-Alonso, and 1 others\. 2023\.Beyond the imitation game: Quantifying and extrapolating the capabilities of language models\.*Transactions on Machine Learning Research*\.
- Stolfo et al\. \(2023\)Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan\. 2023\.A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis\.In*Proceedings of EMNLP*\.
- Su et al\. \(2024\)Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu\. 2024\.RoFormer: Enhanced transformer with rotary position embedding\.*Neurocomputing*, 568:127063\.
- Suzgun et al\. \(2023\)Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei\. 2023\.Challenging BIG\-Bench tasks and whether chain\-of\-thought can solve them\.In*Findings of ACL*\.
- Turpin et al\. \(2023\)Miles Turpin, Julian Michael, Ethan Perez, and Samuel R\. Bowman\. 2023\.Language models don’t always say what they think: Unfaithful explanations in chain\-of\-thought prompting\.In*Advances in Neural Information Processing Systems 36 \(NeurIPS\)*\.
- Wang et al\. \(2023a\)Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun\. 2023a\.Towards understanding chain\-of\-thought prompting: An empirical study of what matters\.In*Proceedings of ACL*\.
- Wang et al\. \(2023b\)Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt\. 2023b\.Interpretability in the wild: a circuit for indirect object identification in GPT\-2 small\.In*ICLR*\.
- Wei et al\. \(2022\)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V\. Le, and Denny Zhou\. 2022\.Chain\-of\-thought prompting elicits reasoning in large language models\.In*Advances in Neural Information Processing Systems 35 \(NeurIPS\)*\.
- Wilson \(1927\)Edwin B\. Wilson\. 1927\.Probable inference, the law of succession, and statistical inference\.*Journal of the American Statistical Association*, 22\(158\):209–212\.
- Wu et al\. \(2024\)Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu\. 2024\.Retrieval head mechanistically explains long\-context factuality\.*arXiv preprint arXiv:2404\.15574*\.
- Yang et al\. \(2024\)An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Chang Wang, and 1 others\. 2024\.Qwen2\.5 technical report\.*arXiv preprint arXiv:2412\.15115*\.
- Zhang and Nanda \(2024\)Fred Zhang and Neel Nanda\. 2024\.Towards best practices of activation patching in language models: Metrics and methods\.In*ICLR*\.
## Appendix AExperimental Details
Table[9](https://arxiv.org/html/2605.22870#A1.T9)summarizes the design of all experiments reported in the paper\.
Table 9:Summary of all experiments\.*Baseline\-correct*: items conditioned on the model answering correctly under Condition C \(clean prefix\)\.*TF\-passing*: teacher\-forcing fidelity check applied \(metric in parentheses\)\.*Paired*: same items across conditions \(enables McNemar\)\.*Seeds*: number of stochastic seeds for randomized conditions\.#### Number corruption \(§[3\.1](https://arxiv.org/html/2605.22870#S3.SS1)\)\.
Intermediate numbers corrupted with deterministic per\-example seeding: outer seed viahashlib\.sha256of problem index and condition string, inner number perturbation viahashlib\.md5of the number string and seed offset\. Corruption magnitude:max\(1,⌊0\.3×\|v\|⌋\)\\max\(1,\\lfloor 0\.3\\times\|v\|\\rfloor\), direction chosen pseudorandomly\. Gold answer occurrences detected via regex with digit boundaries\. Items where the gold value persists in the corrupted text \(due to accidental collision or small\-number preservation\) are excluded from Condition A \(see footnote in §[3\.1](https://arxiv.org/html/2605.22870#S3.SS1)\)\.
#### Causal ladder and controls \(§[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)\)\.
Drep: starting from Condition B \(correct intermediates, gold present\), all gold\-answer occurrences are replaced with a single deterministic wrong number per item \(same corruption seed for all gold positions\)\. Dtrunc: the CoT is truncated at the sentence boundary preceding the first gold\-answer occurrence; sentence boundaries are detected by the regex\(?<=\[\.\!?\\n\]\)\\s\+, with fallback to the last newline and then to a character\-level cut when no boundary precedes gold\. Items retaining<30%<30\\%of text or leaking gold into the truncated prefix are excluded\. Truncated prefix length: mean 78\.8%/79\.7%/78\.9% of original CoT tokens \(Qwen/Llama/Gemma\)\. Dblank: the Dtruncprefix is appended with “The answer is determined by the steps above\.\\n\\n\#\#\#\# ”\. No\-CoT: prefix is “\#\#\#\# ” only\. Common sample sizes: Drep—Qwenn=328n\{=\}328, Llaman=214n\{=\}214, Gemman=312n\{=\}312; Dtrunc/Dblank/No\-CoT—Qwenn=308n\{=\}308–310310, Llaman=208n\{=\}208–213213, Gemman=292n\{=\}292–296296\(slight variation due to truncation\-length filter\)\. Dtrunctrailing\-number partition: items where the new trailing number==gold are rare \(Qwen 5/308, Llama 4/208, Gemma 3/292\); excluding them yields\.389\.389,\.172\.172,\.055\.055—essentially identical to overall Dtruncaccuracy\.
#### Dtruncdepth partition\.
To assess whether Dtruncaccuracy reflects primarily one\-step completion \(the gold answer is reachable from the last retained intermediate in a single arithmetic operation\) or multi\-step retained\-context derivation, we classify each Dtrunccorrect item by depth\. An item is*1\-op reachable*if the gold answer can be obtained from any pair of numbers in the last retained sentence via a single arithmetic operation \(\+\+,−\-,×\\times,÷\\div\); otherwise it is*multi\-step*\.
Table 10:Dtruncdepth partition\. 1\-op reachable items \(74–77% of total\) account for most Dtruncaccuracy; multi\-step items \(23–26%\) still exceed the No\-CoT floor on all models, indicating some genuine multi\-step derivation, though at reduced magnitude\.
#### Distractor control \(§[4\.1](https://arxiv.org/html/2605.22870#S4.SS1)\)\.
Distractors have same digit\-count as gold\. Adjacent: gold±1\{\\pm\}1\. Random: uniform from same\-magnitude range, rejection\-sampled to avoid gold\. Templates: “Therefore, the answer is \{X\}\.” etc\. Qwen:n=335n\{=\}335; Llama:n=227n\{=\}227\.
#### Answer\-position \(§[4\.2](https://arxiv.org/html/2605.22870#S4.SS2)\)\.
Answer step identified as last step containing gold answer with digit\-boundary regex\. Steps split by paragraph breaks; sentence\-level fallback\. Qwen:n=335n\{=\}335; Llama:n=227n\{=\}227\(1 excluded due to<3<3steps\)\.
#### Shuffle hierarchy \(§[4\.3](https://arxiv.org/html/2605.22870#S4.SS3)\)\.
Seven conditions: ordered, within\_step\_shuffle, step\_shuffle, word\_shuffle, reverse\_order, token\_shuffle, no\_cot\. Stochastic conditions: 5 seeds; deterministic: seed 0 only\. For stochastic conditions, each item’s accuracy is the mean over its 5 seeds; reported accuracies and bootstrap CIs treat per\-item means as the sample units\. Sequence length filter: 1024 tokens\. Problems with<2<2steps are excluded\.
#### Head\-level ablation \(§[5\.1](https://arxiv.org/html/2605.22870#S5.SS1)\)\.
Requiresattn\_implementation="eager"\. Phase 1/Phase 2 use disjoint held\-out splits \(50/50\)\. Attention mass: mean over answer\-token positions\. Ablation zeroes𝐨proj\\mathbf\{o\}\_\{\\text\{proj\}\}columns\[h⋅dh:\(h\+1\)⋅dh\]\[h\\cdot d\_\{h\}:\(h\{\+\}1\)\\cdot d\_\{h\}\]\. Random\-5 control: 20 layer\-stratified sets for Llama; single non\-copy sets for Qwen and Gemma\. Sequence length filter: 1536 tokens\.
#### Top attention\-mass heads\.
Llama: L11H14 \(\.383\), L12H13 \(\.348\), L10H13 \(\.293\), L11H20 \(\.268\), L15H12 \(\.252\)\. Gemma: L18H6 \(\.146\), L16H6 \(\.120\), L17H4 \(\.114\), L17H3 \(\.107\), L16H4 \(\.101\)\.
#### Generation\.
Greedy decoding, max 32 new tokens for prefix\-continuation experiments; max 512 for free\-generation \(§[G](https://arxiv.org/html/2605.22870#A7)\)\. Hardware: NVIDIA A10G \(24GB\), bfloat16\. EOS:<\|eot\_id\|\>\(Llama\),<\|im\_end\|\>\(Qwen\),<end\_of\_turn\>\(Gemma\)\.
## Appendix BReproducibility
All experiments: greedy decoding \(temperature==0\), deterministic per\-example seeding \(outer seed viahashlib\.sha256of problem index and condition string; inner number perturbation viahashlib\.md5\)\. All stochastic operations \(corruption magnitude, distractor generation, step reordering\) derive from these per\-item seeds, making results bitwise\-identical regardless of global seed choice\. Per\-example outcome records are released in the result JSONs for reproducibility\. Hardware: bfloat16 on NVIDIA A10G GPUs\. Software: PyTorch 2\.5\.1–2\.6\.0, HuggingFace Transformers 5\.7\.0–5\.8\.1, Python 3\.11–3\.12\. Compute for reported experiments:∼40\{\\sim\}40GPU\-hours; including pilot runs and debugging iterations, total project compute is∼60\{\\sim\}60–8080GPU\-hours\. Code is available at[https://github\.com/mlzoo/cot\-readout\-shortcut](https://github.com/mlzoo/cot-readout-shortcut)\.
## Appendix CPositional Encoding Control
Monotonic position\-ID stretching \(2\.5×2\.5\{\\times\}\) on ordered CoT causes no measurable accuracy loss \(0/660/66errors\), showing that position\-encoding perturbation alone does not reproduce the shuffle effect\. A RoPE\(Su et al\.,[2024](https://arxiv.org/html/2605.22870#bib.bib31)\)2×22\{\\times\}2factorial \(Table[11](https://arxiv.org/html/2605.22870#A3.T11)\) crossing content order with position\-ID assignment shows: the content\-shuffle main effect \(−31\.3\-31\.3pp\) dominates the position\-encoding channel\.
Table 11:RoPE2×22\{\\times\}2factorial \(n=300n\{=\}300, Qwen\)\. “OOD pos\.”==monotonically increasing position IDs with random gaps \(Uniform\{1,…,5\}\\mathrm\{Uniform\}\\\{1\{,\}\\ldots\{,\}5\\\}\)\. Content\-shuffle dominates; OOD positions cause large drops but are confounded by the gap distribution being OOD for the model’s RoPE\.
## Appendix DAnswer Position Curve
For problems with≥5\{\\geq\}5steps, we place the answer step at fractional positions\{0,0\.25,0\.5,0\.75,1\.0\}\\\{0,0\.25,0\.5,0\.75,1\.0\\\}; non\-answer steps shuffled \(5 seeds per position\)\.
Table 12:Answer\-position curve\. Spearmanpp\-values are exact two\-sided permutation tests \(n=5n\{=\}5positions; with only 5 data points, the per\-position CIs and effect sizes carry the primary statistical weight\)\. Qwen/Llama: shallow\-gradient\-then\-jump recency\. Gemma: no monotonic effect \(consistent with the teacher\-forcing fidelity issue noted in §[4](https://arxiv.org/html/2605.22870#S4)\)\.
## Appendix ENovel\-Delimiter Distractor Control
Three delimiters designed to be absent from pretraining \(\>\>\>RESULT:,\#\#FINAL\#\#,\[ANSWER\]\) tested with distractor injection \(n=200n\{=\}200per delimiter\)\.
Table 13:Novel\-delimiter control \(n=200n\{=\}200per condition\)\. All conditions exceed \.70 threshold by large margins\. The format\-prior account predicts a drop with unseen delimiters; none observed\. Standard\-delimiter row uses the samen=200n\{=\}200subsample for comparability\.
## Appendix FDelimiter Robustness
Three familiar delimiters tested on Qwen \(n=200n\{=\}200each\):\#\#\#\#,Answer:,The answer is\. All yieldPB\>\.95P\_\{B\}\>\.95andP\(distractor\)=\.96P\(\\text\{distractor\}\)=\.96–\.965 with overlapping CIs \(pairwise McNemarp\>\.7p\>\.7\)\.
## Appendix GFree\-Generation Consistency
Under unconstrained decoding \(n=500n\{=\}500per model\), all three architectures’ answers match the last number in their own CoT 96–97% of the time \(Table[14](https://arxiv.org/html/2605.22870#A7.T14)\)\. Critically,P\(answer=last num∣incorrect\)=\.945P\(\\text\{answer\}\{=\}\\text\{last num\}\\mid\\text\{incorrect\}\)=\.945\(Qwen\) / \.957 \(Llama\) / \.905 \(Gemma\)—even wrong answers match the last CoT number\. Conditioning on whether the gold answer occupies the last CoT position reveals a near\-deterministic gate: accuracy is\.991\.991–\.997\.997when gold is last vs\.\.004\.004–\.030\.030when it is not\. This full\-distribution analysis extends the baseline\-correct decomposition \(§[3\.1](https://arxiv.org/html/2605.22870#S3.SS1)\) to the entire sample without baseline\-correct conditioning\. The\.905\.905–\.957\.957match rate on*incorrect*items is particularly diagnostic: under a “compute\-then\-write” alternative where the trailing number reflects the model’s computation, errors would distribute across non\-trailing numbers; instead, even wrong answers reflexively match the trailing token—the same positional\-copy signature observed under teacher\-forced corruption, now confirmed to operate during the model’s own unconstrained generation\.
Table 14:Full\-distribution free\-generation analysis \(n=500n\{=\}500per model, greedy decoding\)\. The readout shortcut is not an artifact of teacher\-forcing or baseline\-correct conditioning: across all items including incorrect ones, accuracy is near\-perfectly predicted by whether the gold answer is the last CoT number\. The positional\-copy mechanism operates identically during the model’s own unconstrained generation\.
## Appendix HBase Model Few\-Shot Control
Llama\-base 4\-shot ICL resolves format collapse \(PC:\.065→\.985P\_\{C\}:\.065\\rightarrow\.985\) and yieldsΔcopy=\.975\\Delta\_\{\\text\{copy\}\}\{=\}\.975,P\(distractor\)=\.890P\(\\text\{distractor\}\)\{=\}\.890\. The 0\-shot null is consistent with format collapse rather than mechanism absence; however, 4\-shot ICL simultaneously provides format scaffolding and demonstrations of the copy pattern, so we cannot fully separate latent\-mechanism elicitation from ICL\-induced pattern matching\.
## Appendix ISelf\-Generated CoT Shuffle
To test whether the shuffle hierarchy persists under the model’s own CoTs \(rather than teacher\-forced human\-written CoTs\), we collect Llama\-3\.2\-1B\-Instruct’s free\-generation CoTs on GSM8K \(n=500n\{=\}500, 5 seeds\), filter to the correctly\-solved subset, and apply step\-shuffle, word shuffle, and token\-shuffle to the model’s own outputs\. The modified self\-generated CoT is then injected as a teacher\-forced prefix and the model generates the answer\.
Llama replicates the hierarchy: step\-shuffle retains 82\.1%, word shuffle 77\.0%, and token\-shuffle 14\.6%—matching the teacher\-forced ordering \(step\>\>word\>\>token\)\. Qwen’s step\-shuffle retention is 67\.1%—closely matching its teacher\-forced value \(68\.4%\), confirming step\-shuffle ecological validity on both primary models \(N=2N\{=\}2\); the full hierarchy \(step\>\>word\>\>token\) is shown for Llama \(N=1N\{=\}1\)\. However, Qwen’s word\-level conditions are confounded by a format artifact: Qwen’s self\-generated CoTs use a markdown\-numbered\-list format, and 79\.6% of step\-shuffle errors are list\-continuation outputs \(the model generates “1\. …” instead of an answer\)\. This format dominance is itself consistent with the paper’s “format\-driven copy” thesis—the model perseverates on list continuation rather than answer extraction when the list order is disrupted—but it prevents clean measurement of word\-level conditions for Qwen\. We therefore report the full hierarchy for Llama only, with Qwen’s step\-shuffle confirmation noted above\.
## Appendix JScale Extension: Full Results
See main text Table[7](https://arxiv.org/html/2605.22870#S6.T7)\. Additional detail: Qwen\-7BPB=\.950\[\.910,\.973\]P\_\{B\}\{=\}\.950\\;\[\.910,\.973\],PA=\.010\[\.003,\.036\]P\_\{A\}\{=\}\.010\\;\[\.003,\.036\],PC=1\.00\[\.981,1\.00\]P\_\{C\}\{=\}1\.00\\;\[\.981,1\.00\]\(n=200n\{=\}200\)\. Llama\-8BΔcopy=\.798\\Delta\_\{\\text\{copy\}\}\{=\}\.798,Δoff\-copy=\.196\\Delta\_\{\\text\{off\-copy\}\}\{=\}\.196,P\(residual\)=\.006P\(\\text\{residual\}\)\{=\}\.006\(n=163n\{=\}163correct of 200 evaluated\)\. The substantial off\-copy pathway in Llama\-8B \(19\.619\.6pp vs\. near\-zero at 1B\) suggests the larger model partially recovers correct answers from intermediate reasoning steps\.
## Appendix KReasoning Model
DeepSeek\-R1\-Distill\-Qwen\-1\.5B\(Guo et al\.,[2025](https://arxiv.org/html/2605.22870#bib.bib11)\):PC=\.967P\_\{C\}\{=\}\.967,Δcopy=\.928\\Delta\_\{\\text\{copy\}\}\{=\}\.928,Δoff\-copy=\.033\\Delta\_\{\\text\{off\-copy\}\}\{=\}\.033,P\(residual\)=\.006P\(\\text\{residual\}\)\{=\}\.006,P\(distractor\)=\.706P\(\\text\{distractor\}\)\{=\}\.706\(n=180n\{=\}180\)\. Ceiling\-normalized \(Δcopy/PC\\Delta\_\{\\text\{copy\}\}/P\_\{C\}\):96\.0%96\.0\\%copy\-driven vs\.92\.6%92\.6\\%for standard Qwen \(PC=1\.00P\_\{C\}\{=\}1\.00\)\. RL reasoning training does not develop substantial off\-copy pathways\.
## Appendix LCumulative\-K Ablation Sweep
Table 15:Cumulative\-K zero\-ablation on Qwen \(n=200n\{=\}200\)\. Phase transition atK=14K\{=\}14–15;K50=17K\_\{50\}\{=\}17\.
## Appendix MCross\-Task: SVAMP
Δcopy=1\.00\\Delta\_\{\\text\{copy\}\}=1\.00\(Qwen\) / \.990 \(Llama\) / \.603 \(Gemma\) on SVAMP\(Patel et al\.,[2021](https://arxiv.org/html/2605.22870#bib.bib25)\)\(n=300n\{=\}300\)\. Gemma’s lower SVAMP value reflects the same teacher\-forcing fidelity issue as in GSM\-Symbolic:PC=\.665P\_\{C\}\{=\}\.665on SVAMP \(vs\.∼1\.0\{\\sim\}1\.0for Qwen/Llama\), so the Gemma SVAMP decomposition is unreliable\. A similar issue affects Gemma’s GSM\-Symbolic ceiling\-corrected step\-shuffle retention, which yields an uninterpretable 117% due to the low teacher\-forcing fidelity\. Position sensitivity is attenuated on Qwen/Llama \(keep\_end−\-full\_shuffle=\+9=\+9pp vs\.\+18\+18–3030pp on GSM8K\), plausibly due to SVAMP’s redundant lexical answer cues; Gemma’s keep\_end−\-full\_shuffle is negative \(−8\.5\-8\.5pp\), consistent with the teacher\-forcing fidelity issue described above\.
## Appendix NBBH Full Results
Table 16:BBH logical deduction\. Retention==shuffled/ordered \(simple ratio\); Figure[3](https://arxiv.org/html/2605.22870#S4.F3)uses the no\-CoT\-anchored formula; §[6\.4](https://arxiv.org/html/2605.22870#S6.SS4)tracking task uses chance\-corrected retention to account for the MCQ floor\. Under either metric, BBH retention \(44%/21%\) is far below GSM8K \(68–87%\)\. Llama drops below chance—shuffled CoT actively interferes\.Additional: Qwen free\-gen baseline \.344; Llama free\-gen baseline \.260\. Llama’s below\-chance shuffled accuracy \(CI\[\.043,\.187\]\[\.043,\.187\], excluding \.333\) indicates active interference rather than mere signal absence\.
Table 17:BBH tracking shuffled objects \(3\-way MCQ\)\. Chance\-corrected retention=\(pshuffled−1/3\)/\(pordered−1/3\)=\(p\_\{\\text\{shuffled\}\}\-1/3\)/\(p\_\{\\text\{ordered\}\}\-1/3\)\. Both models show highly significant shuffle effects; Llama’s chance\-corrected retention \(6\.8%\) approaches the MCQ floor\.
## Appendix OBase Model Full Results
Table 18:Base model decomposition \(n=200n\{=\}200each\)\. Llama’s near\-zero results reflect format collapse, not absence of mechanism \(see Appendix[H](https://arxiv.org/html/2605.22870#A8)\)\.
## Appendix PSupplementary Head\-Level Analyses
#### Scope\.
The analyses below characterize head\-level sensitivity, not a verified computational subgraph; full circuit identification is left to follow\-up work\. An activation\-patching screen \(§[Q](https://arxiv.org/html/2605.22870#A17)\) provides in\-distribution causal evidence complementing the zero\-ablation sensitivity analysis\.
#### Induction\-score protocol\.
FollowingOlsson et al\. \([2022](https://arxiv.org/html/2605.22870#bib.bib24)\): per head, the*prefix\-matching score*is the average attention from positionK\+iK\{\+\}iback to positioni\+1i\{\+\}1onN=200N\{=\}200random\-token repeated sequences\[r1…rK\]\[r1…rK\]\[r\_\{1\}\{\\ldots\}r\_\{K\}\]\[r\_\{1\}\{\\ldots\}r\_\{K\}\]\(K=50K\{=\}50, vocabulary restricted to digits, operators, and common math tokens\); the*copying score*is the OV\-circuit direct\-logit attribution of each head’s contribution to the gold \(first\-half\) token at second\-half positions\. The “induction score” cited in\-text is the product of prefix\-matching and copying scores; per\-head score pairs and the previous\-token\-head check on the top\-5 causal heads are released with the code\.
Content\-tracking: Under word\-shuffle, 95\.8% of Qwen’s 336 heads show content Jaccard\>\>position Jaccard \(Monte Carlo null: 0/336 heads; mean null Jaccards≈0\.8%\{\\approx\}0\.8\\%\)\. Ordering\-sensitivity: Gini=0\.738=0\.738\(p<0\.001p<0\.001\), but layer\-matched random control yields permutationp=0\.38p\{=\}0\.38—shuffled CoT is generically fragile under any ablation\. Induction overlap \(hypergeometric test on top\-20 sets\): Gemma L18H6 = top induction head \(score0\.440\.44\), 7/20 overlap \(p=9\.2×10−4p\{=\}9\.2\{\\times\}10^\{\-4\}\); Llamaρ=0\.248\\rho\{=\}0\.248, 5/20 overlap \(p=5\.7×10−4p\{=\}5\.7\{\\times\}10^\{\-4\}\); Qwenρ=0\.046\\rho\{=\}0\.046\(induction vs\. patching\-screen ranking\), 2/20 overlap \(p=0\.34p\{=\}0\.34, near\-chance\), full\-rank induction\-vs\.\-ordered\-attention\-massρ=0\.254\\rho\{=\}0\.254\(p=2\.3×10−6p\{=\}2\.3\{\\times\}10^\{\-6\}\)—a gradient from convergent \(Gemma\) through partial \(Llama\) to diffuse \(Qwen\)\. RoPE2×22\{\\times\}2factorial: see Appendix[C](https://arxiv.org/html/2605.22870#A3)\.
## Appendix QActivation Patching Screen
For each example, we cache per\-head𝐨proj\\mathbf\{o\}\_\{\\text\{proj\}\}inputs from an ordered run and a shuffled run, then replace the shuffled run’s head slice with the ordered slice and measure logit recovery on the gold\-answer token\(Wang et al\.,[2023b](https://arxiv.org/html/2605.22870#bib.bib35)\)\. This is an in\-distribution activation replacement—both source and target are real model activations—and is therefore immune to the OOD critique that applies to zero\-ablation\.
On Qwen \(n=150n\{=\}150validation, disjoint from then=34n\{=\}34screening set used for head selection with\|ΔLD\|\>0\.3\|\\Delta\\mathrm\{LD\}\|\>0\.3nat filter\): patching the top\-20 heads \(by recovery ratio\) restores accuracy from\.693\.693\(shuffled baseline\) to\.880\.880\(group\-patch\), recovering 61% of the shuffle gap\. The top individual head by validation accuracy \(L0H8\) achieves\.813\.813alone \(vs\.\.693\.693shuffled\); the top screening\-rank head \(L14H8\) achieves\.787\.787\. Split\-half Jaccard for the top\-20 set is\.496±\.074\.496\\pm\.074\(50 random splits\), indicating stable head selection\. The Gini coefficient across all 336 heads is\.738\.738\(0/1,0000/1\{,\}000permutations exceed observed;p<\.001p\{<\}\.001\), indicating that the recovery effect is concentrated in a small head subset rather than broadly distributed—a random\-20 set drawn from this distribution would recover substantially less than the top\-20’s 61%\. Of the top\-20 patching heads, 6 overlap with the top\-20 attention\-mass heads \(30%\), confirming that attention mass and causal recovery are correlated but non\-identical rankings\.
## Appendix RParaphrased CoT Control
Replacing numbers with English words causes−2\.4%\-2\.4\\%\(p=\.013p\{=\}\.013,n=335n\{=\}335\); operator paraphrasing:0%0\\%effect\. Errors concentrate on items where the final answer is spelled out—only the answer\-slot token form matters\.
## Appendix SStatistical Methodology Details
We use the simplest test appropriate to each contrast’s design \(paired/unpaired, one/two\-sided\) and correct within each pre\-declared confirmatory sub\-family\. The sub\-families contain 23 contrasts in total:Exp 1A vs\. B \(×2\{\\times\}2models, 2 McNemar\);Exp 3ordered vs\. full\_shuffle, ordered vs\. keep\_end, keep\_end vs\. full\_shuffle, keep\_end vs\. move\_front \(×2\{\\times\}2models, 8 McNemar\);Exp 4C1 and C2 vs\.P=0\.70P\{=\}0\.70\(×2\{\\times\}2models, 4 one\-sided binomial\);§[3\.3](https://arxiv.org/html/2605.22870#S3.SS3)causal ladderDrepvs\. A, Dtruncvs\. No\-CoT, Dblankvs\. Dtrunc\(×3\{\\times\}3models, 9 McNemar\)\. Sensitivity: doubling one\-sidedpp\-values before correction leaves all significant\. Wilson CIs are used for single\-condition proportions; derived differences \(Δcopy=PB−PA\\Delta\_\{\\text\{copy\}\}\{=\}P\_\{B\}\{\-\}P\_\{A\},Δoff\-copy=PC−PB\\Delta\_\{\\text\{off\-copy\}\}\{=\}P\_\{C\}\{\-\}P\_\{B\}\) use paired bootstrap over items \(10,000 resamples\), which correctly accounts for within\-item dependence across conditions\. Bootstrap and Wilson intervals agree within±1\{\\pm\}1pp\. Mean\-ablation and activation\-patching results are exploratory and not part of the confirmatory contrast family\. The intermediate\-result distractor experiment \(QwenP=\.860P\{=\}\.860; LlamaP=\.250P\{=\}\.250\) is reported in §[4\.1](https://arxiv.org/html/2605.22870#S4.SS1)as evidence for architecture\-dependent content gating\.
## Appendix TTeacher\-Forcing Fidelity Diagnostic
Teacher\-forcing fidelity measures whether a model produces the expected continuation when given its own correct CoT as a prefix\. We report fidelity for each experiment×\\timesmodel combination to make exclusion decisions transparent\. The fidelity metric is experiment\-specific: for Exp 1,PCP\_\{C\}\(accuracy under Condition C\); for Exp 2, ordered\-condition accuracy; for Exp 3, accuracy at position=1\.0\{\}=1\.0\(unperturbed answer placement\); for Exp 4, baseline distractor\-free accuracy\.
Table 19:Teacher\-forcing fidelity per experiment×\\timesmodel\. Gemma’s fidelity drops below the\.80\.80threshold on Exp 3 \(answer position\) and Exp 4 \(distractor injection\), where structural perturbation causes the model to deviate from the teacher\-forced prefix\. These cells are excluded from the main analysis \(§[4](https://arxiv.org/html/2605.22870#S4)\); a TF\-passing subset analysis \(below\) shows the core patterns replicate when conditioning on fidelity\-passing items\. All other cells exceed\.93\.93\.The\.80\.80threshold is conservative: the gap between passing cells \(≥\.931\{\\geq\}\.931\) and failing cells \(≤\.60\{\\leq\}\.60\) is large, so any threshold in\[\.60,\.93\]\[\.60,\\,\.93\]yields identical inclusion decisions—the specific value is immaterial\. The Exp 1 column \(PCP\_\{C\}\) serves as the TF\-fidelity diagnostic for the entire corruption pipeline \(Conditions A, B, C, Drep, Dtrunc, Dblank, No\-CoT\), all of which share the prefix\-completion infrastructure\. Drepmodifies only the trailing number \(a textually mild perturbation\); the model’s acceptance of the corruption is captured directly byP\(distractor\)P\(\\text\{distractor\}\)\. Dtruncremoves the trailing tail by design, so TF fidelity in the “expected continuation” sense is not defined for that condition\.
#### TF\-passing subset analysis\.
As a robustness check, we condition Gemma’s Exp 3 and Exp 4 on the subset of items where teacher\-forcing fidelity passes \(i\.e\., the model produces a parseable answer under the unperturbed condition\)\. On this subset \(n=196n\{=\}196\), Exp 3 patterns replicate: ordered\.990\.990, full\-shuffle\.847\.847, keep\_end\.949\.949, move\_front\.735\.735\(recovery=71%=71\\%\)\. The 5\-position answer\-placement curve yields Spearmanr=1\.0r\{=\}1\.0\(n=187n\{=\}187TF\-strict\), restoring the monotonic recency gradient absent in the full sample\. For Exp 4, TF\-passingP\(distractor\)P\(\\text\{distractor\}\)is\.194\[\.145,\.255\]\.194\\;\[\.145,\.255\]\(C1\) and\.117\[\.079,\.170\]\.117\\;\[\.079,\.170\]\(C2\)—well below the\.70\.70threshold, confirming that Gemma’s Exp 3 exclusion reflects a teacher\-forcing artifact rather than a genuine mechanistic divergence from Qwen and Llama\.
## Appendix UBare\-Number Distractor Control
All main\-experiment distractors are wrapped in answer\-template sentences \(“Therefore, the answer is \{X\}\.” etc\.\)\. A reviewer might attribute the highP\(distractor\)P\(\\text\{distractor\}\)to template parsing rather than positional copying\. We test four conditions that vary answer\-relevant framing while keeping the distractor number identical \(n=335n\{=\}335Qwen;n=227n\{=\}227Llama\):
Table 20:Bare\-number distractor control \(F\-codes as in Table[5](https://arxiv.org/html/2605.22870#S4.T5)\)\. F1: standard template \(“Therefore, the answer is X”\); F2: bare number only; F3: non\-answer framing \(“Note: X”\); F4: natural inline answer\-revision \(“But wait, actually it should be X”\)\. The readout requires answer\-relevant framing \(F2/F3≪\.70\\ll\.70\) but is not template\-specific \(F4≥\.70\\geq\.70\)\.Bare numbers \(F2\) and non\-answer\-context numbers \(F3\) fall well below the\.70\.70threshold on both models, showing the readout is not purely positional—a trailing number without answer\-relevant framing is insufficient\. However, natural inline answer\-revision text \(F4\) triggers copying at rates comparable to the standard template \(\.770/\.846 vs\. \.893/\.877\), showing the mechanism is not template\-specific\. The readout is*answer\-context\-gated*: it requires some form of answer\-relevant framing at the trailing position, but is indifferent to the specific surface form of that framing and—within answer context—remains indifferent to whether the number is correct\.
## Appendix VAlternative Explanations Summary
Table 21:Alternative explanations for the readout shortcut and the controls that address each\. Each alternative is tested by at least one dedicated experiment or control condition\.
## Appendix WQualitative Examples
We illustrate the copy\-dominance phenomenon with representative cases from Qwen2\.5\-1\.5B\-Instruct on GSM8K\.
#### Example 1: Distractor copied \(Condition C1\)\.
The gold answer is 72\. A wrong\-number distractor sentence \(“Therefore, the answer is 45\.”\) is appended at the trailing position after the correct CoT\. The model outputs45\(the distractor\), ignoring the correct computation in the prefix\. This occurs in 87% of Qwen trials\.
#### Example 2: Dtruncsucceeds\.
Same problem; CoT is truncated before the final step \(removing the trailing gold\-answer number\)\. The model now outputs72\(correct\), recovering the answer from the retained penultimate\-step context\. This demonstrates the retained\-context computation \(29 pp above No\-CoT for Qwen\) that is*masked*when a trailing number is present\.
#### Example 3: Distractor resisted \(Gemma content gate\)\.
Same Condition C1 setup on Gemma\-2\-2B\-it\. The model outputs72\(gold answer\), rejecting the novel distractor\. Gemma’s content gate \(1−P\(distractor\)≈\.851\-P\(\\text\{distractor\}\)\\approx\.85\) prevents the copy channel from accepting semantically implausible trailing numbers\.Similar Articles
Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
This paper evaluates three approaches (pure chain-of-thought reasoning, single-shot code execution, and iterative code execution) on 1,000 GSM-Symbolic problems using Claude Haiku 4.5, finding that chain-of-thought is the most robust to perturbation, while code execution does not improve reasoning robustness on grade-school math problems.
Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
Reasoning Models Don't Just Think Longer, They Move Differently
This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories by analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, they find that reasoning-trained models exhibit distinct trajectory geometry—most clearly in code—indicating reasoning training changes how computation unfolds, not just how much is used.
Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds
This paper introduces Code-Guided Reasoning (CGR), an evaluation protocol for measuring how executable reasoning scaffolds improve small language model performance on multiple-choice question answering tasks, showing a significant accuracy improvement over direct answering.
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
This paper investigates many-shot chain-of-thought in-context learning for reasoning tasks, revealing that standard scaling rules do not transfer and proposing Curvilinear Demonstration Selection (CDS) for improved ordering, achieving up to 5.42 percentage-point gain.