Can Editing 1 Neuron Fix Repetition Loops in LLMs?

arXiv cs.LG Papers

Summary

This paper investigates whether repetition loops in long factual enumeration tasks by Gemma 4 models can be fixed by editing a single neuron. It finds that targeted weight edits on a small set of MLP neurons can significantly reduce loop failures, though not completely eliminate doom looping in larger models.

arXiv:2606.13705v1 Announce Type: new

Cached at: 06/15/26, 09:06 AM

# Can Editing 1 Neuron Fix Repetition Loops in LLMs?
Source: [https://arxiv.org/html/2606.13705](https://arxiv.org/html/2606.13705)
Aristotelis Lazaridis, Aman Sharma, Dylan Bates, Brian King, Vincent Lu, Jack FitzGerald \(Edgerunner AI, research@edgerunnerai\.com\)

###### Abstract

Yes\. Can it cure doom loops? Probably not\. The Gemma 4 instruction\-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokémon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer\. These loops occur at rates as high as 95% and survive prompt rewording, inference\-engine changes, and most sampling adjustments\. In this paper, we explore whether this behavior is localized enough to remove by weight edits\. To localize the cause, we use per\-layer ablation and per\-neuron attribution, then confirm the strongest candidates with full\-generation sweeps\. The loops trace to a small set of MLP neurons \(or, in the 26B\-A4B Mixture\-of\-Experts model, a few routed experts\) which we suppress with static weight edits\. These “surgeries” can be as small as a single sign\-inverted neuron \(in the E2B model\)\. The size of the effective edits grows with model scale, but in all cases, such loop patterns can be addressed at normal generation budgets while preserving general\-purpose benchmark scores\. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i\.e\., a non\-convergent regime in which the model self\-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer\. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge\-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact\. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops\.

## 1 Introduction

The Gemma 4 model family exhibits a reproducible repetition failure on long factual enumeration tasks\. Prompts such as listing episodes of the series *The Wire* or *Firefly*, enumerating Generation I Pokémon, or listing constellations require the model to maintain a moderately long factual sequence while generating\. Under the default sampling configuration \(temperature=0\.7, top\_p=0\.95, top\_k=64, and no repetition penalty\), baseline failure rates range from roughly 10% for the lower\-failure\-rate models to 95% for gemma\-4\-31B\-it on a prompt that asks the model to list all episodes of *The Wire* across five seasons\.

These failures are not limited to exact string repetition\. Sometimes the model collapses into a tight loop, where it commits to a short phrase and repeats it until the generation budget is exhausted; in other cases it keeps the surface structure of an enumerated list, but many entries converge to the same \(repeated\) answer\. Both forms survive prompt rewording, inference\-engine changes, and most standard sampling adjustments\.

We investigate whether these repetition failures can be traced to compact, identifiable components of the network and whether targeted static weight edits to those components \(i\.e\. weight surgery\) can substantially reduce the observed failures\. Working from per\-layer attribution analysis and empirical sweeps across a range of surgical operations \(MLP neuron zeroing, sign inversion, amplification, and expert\-slot masking\), we find that the looping behavior does concentrate in small, localizable parts of the network across all four models studied\. In gemma\-4\-E2B\-it \(hereafter E2B\), a single MLP \(Multi\-Layer Perceptron\) neuron modification is sufficient to eliminate the observed loops at a slight benchmark cost that a two\-neuron variant roughly halves\. In gemma\-4\-E4B\-it \(E4B\) and gemma\-4\-31B\-it \(31B\), small sets of MLP neurons are stripped \(three of 430,080 neurons in E4B; 1100 neurons in 31B\)\. In gemma\-4\-26B\-A4B\-it \(26B\), a Mixture\-of\-Experts \(MoE\) model, three routed expert positions are masked out of 3840 layer\-expert slots\. The selected edited model variants are available on Hugging Face: [https://huggingface\.co/edgerunner\-ai/gemma\-4\-E2B\-it\-noloop](https://huggingface.co/edgerunner-ai/gemma-4-E2B-it-noloop), [https://huggingface\.co/edgerunner\-ai/gemma\-4\-E4B\-it\-noloop](https://huggingface.co/edgerunner-ai/gemma-4-E4B-it-noloop), [https://huggingface\.co/edgerunner\-ai/gemma\-4\-26B\-A4B\-it\-noloop](https://huggingface.co/edgerunner-ai/gemma-4-26B-A4B-it-noloop), and [https://huggingface\.co/edgerunner\-ai/gemma\-4\-31B\-it\-noloop](https://huggingface.co/edgerunner-ai/gemma-4-31B-it-noloop)\.

The size of the required edit grows with model scale: one MLP neuron in E2B, three in E4B, 1100 in 31B, and three routed expert positions in 26B\. This variation reflects genuine differences in how the looping mechanism is internally organized across the family\. Across all four models, the edits substantially reduce or eliminate the observed loops while preserving benchmark performance within small percentage\-point deltas\. These results are intended as a demonstration of feasibility, since the reported interventions are one set of configurations that work, and a more systematic exploration could yield smaller, more generalizable, or more effective modifications\.

This investigation is motivated by the question of whether a high\-level language model generation failure can be mapped to a small enough internal mechanism to be edited directly\. We draw inspiration from recent mechanistic\-interpretability work by Kazemi et al\. \([2026](https://arxiv.org/html/2606.13705#bib.bib1)\), which shows that a single MLP neuron can mediate the refusal axis in safety\-tuned LLMs\. In the same spirit, we examine whether Gemma 4’s repetition failures have an editable loop circuit\. For the models and failure modes studied here, the answer is yes: the mechanism proves small enough to localize, edit, and validate through standard benchmark evaluation, although the specific operation required differs by model\.

The main contributions of this work are:

- Diagnosis of the looping failure mode in the Gemma 4 model family\.
- Characterization of two loop phenotypes \(tight loops and soft loops\), and a prompt\-agnostic deterministic loop detector for both\.
- A per\-layer attribution methodology for identifying the MLP neurons and MoE expert slots most strongly associated with the looping behavior in each model\.
- Empirical demonstrations of weight surgery feasibility across four Gemma 4 models using a variety of surgical operations, showing that a single MLP neuron modification can be sufficient to eliminate the failure in E2B, and that similarly small edits substantially reduce failures in E4B, 31B, and 26B\.
- An analysis of doom looping at extended generation budgets: a non\-convergent self\-correction regime that can persist in the two larger models even after the primary looping failure is eliminated, and that we argue reflects factual\-knowledge limitations rather than a surgically removable loop circuit\.

## 2 Related Work

Holtzman et al\. \([2020](https://arxiv.org/html/2606.13705#bib.bib3)\) characterize repetitive text degeneration in likelihood\-trained models and propose nucleus sampling\. We address the same phenomenon mechanistically rather than through decoding changes\. A complementary line of work mitigates repetition through training objectives such as unlikelihood training \(Welleck et al\., [2020](https://arxiv.org/html/2606.13705#bib.bib25)\), and analyzes its origins through the self\-reinforcing dynamics of repeated tokens \(Xu et al\., [2022](https://arxiv.org/html/2606.13705#bib.bib23)\) and the high\-inflow problem in the token distribution \(Fu et al\., [2021](https://arxiv.org/html/2606.13705#bib.bib24)\); these operate at the data, objective, or decoding level, whereas we localize and edit the responsible weights directly\. In mechanistic interpretability, Olsson et al\. \([2022](https://arxiv.org/html/2606.13705#bib.bib8)\) identify induction heads as a core copy mechanism whose pathological over\-firing may underlie repetition attractors\. Wang et al\. \([2022](https://arxiv.org/html/2606.13705#bib.bib9)\) and Conmy et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib10)\) develop causal\-intervention methods for circuit discovery that inform our per\-layer ablation approach, and McDougall et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib11)\) show that a single attention head in GPT\-2 Small mediates copy suppression, paralleling our finding that single MLP neurons can mediate anti\-loop behavior\. Elhage et al\. \([2022](https://arxiv.org/html/2606.13705#bib.bib16)\) show that features are stored in superposition, complicating neuron\-level analysis, yet Gurnee et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib14)\) demonstrate that individual neurons can still encode interpretable high\-level features, and Marks et al\. \([2025](https://arxiv.org/html/2606.13705#bib.bib15)\) extend circuit discovery to fine\-grained sparse feature editing\. Kovaleva et al\. \([2021](https://arxiv.org/html/2606.13705#bib.bib17)\) and Puccetti et al\. \([2022](https://arxiv.org/html/2606.13705#bib.bib18)\) show that pre\-trained transformers are fragile to removal of a tiny number of outlier dimensions \(<0\.0001% of weights\), establishing a precedent for extreme parameter\-level sensitivity\.

For model editing, Meng et al\. \([2022a](https://arxiv.org/html/2606.13705#bib.bib5)\) introduce causal tracing and ROME for targeted MLP weight edits\. Meng et al\. \([2022b](https://arxiv.org/html/2606.13705#bib.bib6)\) scale this to thousands of edits, and Hase et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib7)\) find that localization and optimal edit layer can diverge, a result we corroborate in E2B where the layers identified by ablation do not coincide with the best intervention layers\. Turner et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib12)\) and Zou et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib13)\) develop activation\-space steering methods applied at inference time\. Our sign\-inversion edit is analogous but encoded permanently into static weights\. Most directly, Kazemi et al\. \([2026](https://arxiv.org/html/2606.13705#bib.bib1)\) demonstrate that a single MLP neuron mediates the refusal axis in safety\-tuned LLMs, inspiring our investigation of whether repetition failures are similarly localized, and Wei et al\. \([2024](https://arxiv.org/html/2606.13705#bib.bib4)\) show that safety\-critical regions occupy only ∼3% of parameters, a structural parallel to our finding that loop\-driving behavior concentrates in 0\.0007–0\.085% of FFN neurons\.

## 3 Failure Phenotypes and Sampling Controls

We classify the observed repetition behavior into two phenotypes\. The first is a *tight loop*: the model commits to a short phrase and re\-emits it verbatim, repeating the same token sequence until the generation budget is exhausted\. The second is a *soft loop*, i\.e\. a list\-collapse repetition: the model continues the surface structure of an enumerated list, but the contents of multiple entries collapse to the same answer\.

- Tight loop\. This phenotype is most visible in the 31B and 26B models\. After several hundred tokens in the thinking block, the model may commit to a single word or phrase \(such as “S1E6: The Detail \.\.\. no\.”\) and re\-emit the same short token sequence until the maximum response length is reached\.
- Soft loop\. This phenotype is most visible in the E2B, E4B, and 26B models\. The model does not lock into a repeating token sequence; instead, the numbered list scaffold remains syntactically intact while many entries converge to the same item\. For example, on the constellations probe \(E2B, thinking disabled\), the list opens correctly for the first ∼47 entries and then collapses:

  > … 45\. Vela 46\. Volans 47\. Virgo 48\. Volans 49\. Volans … 87\. Volans 88\. Volans

  Forty\-one of the 88 list items are the literal word Volans\. The numbering keeps incrementing, preserving the visual structure of a list, yet the semantic content has collapsed entirely onto a single answer\.

We refer to both phenotypes collectively as *fast\-commit loops*: failures in which the model locks onto a repeated output within the first generation pass and sustains it until the budget cap\. This distinguishes them from *doom looping*, a non\-convergent self\-correction regime that emerges at extended generation budgets, in which the model spends thousands of tokens revisiting and rephrasing the same uncertain fact without resolving it\. We discuss doom looping in more detail in Section [7\.1](https://arxiv.org/html/2606.13705#S7.SS1)\.

In all models, the failures persist across inference engines and framework implementations: we reproduced them with both HF Transformers and vLLM to rule out an implementation\-specific cause\. They also persist under prompt rewording and with the Gemma 4 Multi\-Token Prediction \(MTP\) drafter either enabled or disabled\. Likewise, most sampling changes do not remove the behavior\.

A repetition penalty of at least 1\.15 can reduce repetition in several cases, but this is not a reliable solution because it can degrade unrelated generation behavior\. For example, on the E4B model, increasing the repetition penalty to 1\.30 breaks code generation: 22 of 30 Rust code\-generation prompts reach the maximum output length with progressively degraded syntax\.

An example of such degraded Rust code generation is presented below:

writeln\!\(stdout\(\),"\\nINFERENCESTATUS:"\)?\)?;

writeln\!\(stdout\(\),"\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-"\)?;

writable\!\(cout,Cursor::Cyan\)\(concat\!\("ActiveRequests:",inf\.in\_flight\_requests\)\)?;

writable\!\(cout,Cursor::Magenta\)\(concat\!\("AvgThroughput:",nf\!\(inf\.avg\_tokens\_per\_second,"\.1"\)\)\)?;

writable\!\(cout,Cursor::White\)\(concat\!\("LatestIDFound:",&\(nf\!\(inf

In this example, the model has drifted into invented syntax: writable\! is not a Rust macro; it is a hallucinated alternative that arose after writeln\! was penalized for having appeared repeatedly\. The same effect is visible in refusal language: at repetition\_penalty=1\.30, “I cannot help with that” becomes “I cannot help with thus” or “I cannot assist with such”, because the standard phrasing is penalized after repeated use\. At repetition\_penalty=1\.15 and at the baseline value of 1\.00, the code generation is unaffected in our tests\.
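The displacement effect described above can be reproduced with a toy example\. The sketch below assumes the common CTRL\-style penalty rule \(positive logits are divided by the penalty, negative logits are multiplied by it\), which is the rule implemented by `RepetitionPenaltyLogitsProcessor` in Hugging Face Transformers\. The token names and logit values are invented for illustration and are not measurements from the paper\.

```python
def apply_repetition_penalty(logits, seen_tokens, penalty):
    """CTRL-style penalty: shrink positive logits, push negative logits lower."""
    out = dict(logits)
    for tok in seen_tokens:
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def argmax_token(logits):
    """Greedy pick, used here only to make the effect easy to see."""
    return max(logits, key=logits.get)


# Hypothetical next-token logits after `writeln` has already appeared many times.
logits = {"writeln": 5.0, "writable": 4.0, "println": 3.0}
seen = {"writeln"}

for penalty in (1.00, 1.15, 1.30):
    penalized = apply_repetition_penalty(logits, seen, penalty)
    print(penalty, argmax_token(penalized), round(penalized["writeln"], 3))
```

With these made\-up margins the idiomatic token still wins at 1\.15 but loses to the never\-seen alternative at 1\.30, which mirrors the qualitative pattern reported above\. Real margins depend on the model and the prompt\.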

This side effect was not observed on the 31B model in our Rust tests, though we expect the same mechanism to manifest on other prompts or models where idiomatic token margins are smaller\. More broadly, the side effects of a repetition penalty are workload\- and model\-dependent, which makes it an unreliable general solution\. The selected weight edits target the internal mechanism that drives the loop and operate at repetition\_penalty=1\.0, avoiding this tradeoff entirely\. The full quantitative results are reported in Appendix [D](https://arxiv.org/html/2606.13705#A4)\.

## 4 Methodology \- Experimental Setup

This section describes our prompt suite for triggering loops, the loop detection mechanism, the activation\-based localization procedure, and the benchmark evaluation protocol used throughout the paper\.

We use a suite of eight *enumeration probes* \(or *triggers*\), i\.e\. prompts that ask the model to produce a factual list long enough to expose the failure modes described above under default sampling \(e\.g\. naming the episodes of a television series, or enumerating a fixed scientific category\)\. Full prompt text, enumeration targets, and failure\-mode roles for each probe are given in Appendix [A](https://arxiv.org/html/2606.13705#A1)\. The eight probe identifiers are:

- firefly\_list
- wire\_episodes
- pokemon\_gen1
- mcu\_films
- noble\_gases
- constellations
- us\_presidents
- eu\_member\_states

For each model variant, the loop sweep consists of these 8 probes × 8 seeds × 2 thinking modes \(thinking on or off\), for 128 generations per model variant\. Unless otherwise specified, loop rates are reported per thinking mode over 64 generations, since the loops are primarily triggered during thinking\. When both thinking modes are combined, we report totals over 128 generations to also capture failure rates in thinking\-disabled mode\. We use the same default sampling configuration throughout the loop sweeps unless stated otherwise\. All seeds are fixed, so the reported loop counts are exact, reproducible outcomes over this grid rather than estimates with sampling error\.
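The sweep grid above can be written down directly\. This is a minimal sketch: the probe identifiers come from the list above, while the seed values \(`0` to `7`\) and the helper name `loop_rate` are assumptions, since the paper states only that eight fixed seeds are used\.

```python
from itertools import product

PROBES = [
    "firefly_list", "wire_episodes", "pokemon_gen1", "mcu_films",
    "noble_gases", "constellations", "us_presidents", "eu_member_states",
]
SEEDS = range(8)                # assumption: the actual seed values are not given
THINKING_MODES = (True, False)  # thinking on / thinking off

# One cell per (probe, seed, thinking mode).
grid = list(product(PROBES, SEEDS, THINKING_MODES))
thinking_on = [cell for cell in grid if cell[2]]


def loop_rate(num_loops, cells):
    """Loop rate as a fraction of the cells in one slice of the grid."""
    return num_loops / len(cells)


print(len(grid), len(thinking_on))  # 128 generations in total, 64 per thinking mode
```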

To classify outputs, we created a deterministic, prompt\-agnostic detector with two checks: tight loops \(via exhaustive token\-periodicity search\) and soft loops \(via prefix\-stripped line deduplication\)\. The detector also includes a phrase\-repetition check \(sliding word\-windows over decoded text\), which fires when a verbatim text phrase recurs at the tail of an output without strict token\-level periodicity\. This class does not appear in any reported canonical or long\-budget evaluation cell; see Appendix [B](https://arxiv.org/html/2606.13705#A2) for details and a representative example\.
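The two main checks can be sketched as follows\. This is an illustrative approximation and not the authors' detector: the function names, the maximum period, the repeat count, the list\-prefix pattern, and the duplicate\-fraction threshold are all assumptions chosen for the example\.

```python
import re
from collections import Counter

# Assumed list prefixes: "12." / "12)" numbering or a bullet character.
_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s*")


def find_tight_loop(tokens, max_period=64, min_repeats=4):
    """Return (period, start) if the tail repeats with a fixed period, else None."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        i = n - 1
        # Walk backwards while each token equals the one `period` steps earlier.
        while i - period >= 0 and tokens[i] == tokens[i - period]:
            i -= 1
        run = n - 1 - i
        if run >= period * (min_repeats - 1):
            return period, i + 1 - period
    return None


def find_soft_loop(text, min_lines=10, max_dup_fraction=0.3):
    """Return the dominant entry if list lines collapse onto it, else None."""
    entries = [_PREFIX.sub("", line).strip().lower() for line in text.splitlines()]
    entries = [e for e in entries if e]
    if len(entries) < min_lines:
        return None
    entry, count = Counter(entries).most_common(1)[0]
    return entry if count / len(entries) > max_dup_fraction else None


def classify(tokens, text):
    """Label one generation as 'tight', 'soft', or 'none'."""
    if find_tight_loop(tokens) is not None:
        return "tight"
    if find_soft_loop(text) is not None:
        return "soft"
    return "none"
```

Because both checks are pure functions of the output, the same generation always receives the same label, which is what makes the fixed\-seed loop counts exactly reproducible\.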

For regression checks, we use the following benchmark suite: IFEval, TruthfulQA, ARC\-Easy, ARC\-Challenge, MMLU, and GSM8K\. Each benchmark is run in both thinking modes with the same sampling configuration used everywhere: temperature=0\.0, top\_p=0\.95, max\_output\_tokens=8192, and no repetition penalty\.

For each model, we first capture one rollout that reliably enters a loop with thinking enabled\. We used the firefly\_list probe for 31B and E4B, the constellations probe for E2B, and the wire\_episodes probe for the 26B model\. We allow generation up to 4096 tokens, identify the exact loop window \(period, start\), and split the generated tokens into two sets:

- Pre\-loop tokens = generated token positions \[200, start − 50\]\. These are the tokens emitted during the model’s thinking and self\-correction phase before the loop crystallizes\. We skip the first 200 tokens \(still in the user\-prompt context\) and the 50 tokens immediately before the loop \(during the lock\-in transition\)\.
- Loop tokens = generated token positions \[start \+ 50, N\]\. The model is firmly inside the loop attractor\.

Subsequent steps contrast the distributions of internal activations over these two token sets at the same residual position\.
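The window split can be expressed as two index ranges\. In this sketch the function name is hypothetical, and the intervals are treated as half\-open Python ranges, which is an assumption: the paper does not say whether its endpoints are inclusive\.

```python
def split_windows(num_generated, start, skip_head=200, margin=50):
    """Return (pre_loop, loop) index ranges for one captured rollout.

    `start` is the first position of the detected loop window. The ranges are
    half-open, so `pre_loop` stops just before `start - margin`.
    """
    pre_loop = range(skip_head, max(skip_head, start - margin))
    loop = range(start + margin, num_generated)
    return pre_loop, loop


# Example: a 4096-token rollout whose loop starts at position 1000.
pre_loop, loop = split_windows(4096, 1000)
print(pre_loop, loop)
```

If the loop starts too early for a pre\-loop window to exist, `pre_loop` is simply empty, and a different rollout would have to be captured\.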

The static edits below target the fast\-commit loop mechanism \(defined in Section [3](https://arxiv.org/html/2606.13705#S3)\), which is distinct from doom looping \(discussed later in Section [7\.1](https://arxiv.org/html/2606.13705#S7.SS1)\)\.

The first localization step is a *per\-layer ablation sweep*, i\.e\. testing every layer in turn\. For each layer $\ell \in \{0, \ldots, L-1\}$, we set the output of either attention or MLP to zero on the captured rollout and measure how the probability of the loop token under the next\-token distribution changes\. This identifies candidate anti\-loop layers: depths where removing a component strongly changes the probability of the next loop token\. These candidate layers guide attribution but, as we observed, they do not by themselves determine the final intervention layer\.
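The shape of the ablation sweep can be illustrated on a toy residual network\. Everything in this sketch is an assumption made for illustration: the random weights, the `tanh` stand\-ins for attention and MLP blocks, the sizes, and the loop\-token index bear no relation to Gemma 4\. Only the procedure \(zero one component's output, re\-read the loop\-token probability\) follows the description above\.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, V = 6, 16, 10   # toy layer count, width, and vocabulary size
attn = [rng.normal(scale=0.3, size=(D, D)) for _ in range(L)]
mlp = [rng.normal(scale=0.3, size=(D, D)) for _ in range(L)]
unembed = rng.normal(scale=0.3, size=(V, D))
x0 = rng.normal(size=D)
LOOP_TOKEN = 3        # hypothetical index of the loop token


def loop_token_prob(ablate=None):
    """p(loop token) with one (layer, component) output zeroed, or none."""
    x = x0.copy()
    for layer in range(L):
        if ablate != (layer, "attn"):
            x = x + np.tanh(attn[layer] @ x)
        if ablate != (layer, "mlp"):
            x = x + np.tanh(mlp[layer] @ x)
    logits = unembed @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[LOOP_TOKEN]


base = loop_token_prob()
# Negative delta: the component supports the loop token. Positive: it suppresses it.
delta_p = {
    (layer, comp): loop_token_prob((layer, comp)) - base
    for layer in range(L)
    for comp in ("attn", "mlp")
}
strongest = max(delta_p, key=lambda key: abs(delta_p[key]))
print(strongest, delta_p[strongest])
```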

For E2B, this procedure reveals a dissociation: the per\-layer ablation identifies L13–L15 as the site of the strongest loop\-token signal \(where the current loop token is most directly written into the residual stream\), but interventions at those layers did not produce the best behavioral outcome\. The selected edit acts earlier, at L10/L12, and was selected after running full\-generation loop\-rate sweeps over candidate layers\. Single\-token ablation can reveal where the loop token is committed, while full\-generation sweeps are needed to find where a static edit changes whether the trajectory enters the loop at all\. This is a behavior\-level instance of a disconnect, which is in line with the findings on factual\-knowledge editing by Hase et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib7)\): the location where a behavior is localized need not be the location where editing it works best\. We give the detailed per\-layer evidence for E2B in Appendix [E](https://arxiv.org/html/2606.13705#A5)\.

Figure [1](https://arxiv.org/html/2606.13705#S4.F1) shows the full per\-layer ablation sweep for 26B; Table [1](https://arxiv.org/html/2606.13705#S4.T1) summarizes the candidate and final intervention layers across all four models\. Analogous ablation sweeps identify L36 for 31B and L18 for E4B\. For E2B, the strongest single\-token ablation signal appears earlier, around L13–L15\. As discussed below, the final E2B intervention is selected empirically from loop\-rate intervention sweeps rather than from the ablation magnitude alone\.

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/image3.png)

Figure 1: Per\-layer component ablation for the 26B model\. Each bar pair shows the change in $p(\text{loop-token} \mid \text{context})$ when attention \(red, left\) or MLP \(green, right\) is zero\-ablated at that layer; negative values mean the component supports the loop token, positive values mean it suppresses it\. The strongest pro\-loop attention signal is at L16 \($\Delta p = -0.237$\); the strongest anti\-loop MLP signal is at L26 \($\Delta p = +0.143$, dashed line\), with additional large pro\-loop MLP signal at L28–29 that reflects the final write\-out layers rather than loop\-specific structure\. The expert\-attribution analysis \(Figure [3](https://arxiv.org/html/2606.13705#S5.F3)\) then drills into the L21–22 region to identify the specific expert positions that drive the loop\. E2B, E4B, and 31B are analyzed with the same ablation procedure; E2B illustrates that the strongest single\-token write signal need not coincide with the most effective intervention layer\.

Table 1: Candidate layers identified by per\-layer ablation and final intervention layers for each model\. For 31B and E4B the ablation peak coincides with the intervention layer\. For E2B, ablation identifies L13–L15 as the site of strongest loop\-token signal, while the selected intervention acts at L10/L12 \(see Section [4](https://arxiv.org/html/2606.13705#S4)\)\. For 26B, the ablation signal peaks across L21–L26, and the selected expert masks are applied at L21–L22\.

| Model | Layers | Ablation / intervention layer | Component | Depth |
| --- | --- | --- | --- | --- |
| gemma\-4\-31B\-it | 60 \(dense\) | L36 | MLP down\_proj | 60% |
| gemma\-4\-E4B\-it | 42 \(dense\) | L18 | MLP down\_proj | 43% |
| gemma\-4\-E2B\-it | 35 \(dense\) | L13–L15 / L10, L12 | MLP down\_proj \(ablation: L13–L15; intervention: L10, L12\) | 37–43% / 29–34% |
| gemma\-4\-26B\-A4B\-it | 30 \(MoE\) | L21–L22 | routed experts | 70–73% |

For the E2B, E4B, and 31B models, we score every MLP neuron in the candidate layer $\ell$\. Gemma 4 text MLPs use a GeGLU \(Shazeer, [2020](https://arxiv.org/html/2606.13705#bib.bib19); Dauphin et al\., [2017](https://arxiv.org/html/2606.13705#bib.bib21)\) gating structure:

$$\mathbf{h} \;=\; W_{\downarrow}\bigl(\phi(W_{g}\,\mathbf{x}) \odot W_{u}\,\mathbf{x}\bigr),$$

where $\phi$ is GELU \(Hendrycks and Gimpel, [2016](https://arxiv.org/html/2606.13705#bib.bib20)\) \(tanh approximation\) and $W_{\downarrow}$, $W_{g}$, $W_{u}$ denote down\_proj, gate\_proj, and up\_proj respectively\. We define *neuron $n$* at layer $\ell$ as index position $n$ in the intermediate dimension: on the output side it corresponds to column $n$ of $W_{\downarrow}^{(\ell)}$ \(the direction it writes into the residual stream\); on the input side it is driven by row $n$ of $W_{g}^{(\ell)}$ \(through the GELU gate\) and row $n$ of $W_{u}^{(\ell)}$ \(the value branch\)\. All interventions in this work target column $n$ of $W_{\downarrow}^{(\ell)}$ directly \(by zeroing, scaling, or sign\-inverting it\), thereby modifying the neuron’s write direction into the residual stream while leaving its read directions and scalar activation unchanged\. On a captured looping rollout, we run a forward pass over the commit prefix and record, at every position $p$ and every neuron $n$, the post\-gate activation

$$a_{n}^{(\ell)}(p) \;=\; \phi\!\left(W_{g}^{(\ell)}[n]\cdot\mathbf{x}_{p}\right)\cdot\left(W_{u}^{(\ell)}[n]\cdot\mathbf{x}_{p}\right),$$

together with the gradient of the target loop\-token log\-probability with respect to that activation:

$$g_{n}^{(\ell)}(p) \;=\; \frac{\partial \log p_{\theta}(t \mid x_{<p})}{\partial a_{n}^{(\ell)}(p)}.$$

The *loop\-attribution score* of neuron $n$ is

$$\Delta_{n}^{(\ell)} \;=\; -\sum_{p}\; a_{n}^{(\ell)}(p)\,\cdot\,g_{n}^{(\ell)}(p).$$

Intuitively, $\Delta_{n}^{(\ell)}$ is the first\-order prediction of how much the loop\-token log\-probability would change if neuron $n$’s contribution to the residual stream were set to zero at every measured position\. Equivalently,

$$\Delta_{n}^{(\ell)} \;\approx\; \log p_{\theta^{(\ell,n,0)}}(t) \;-\; \log p_{\theta}(t),$$

where $\theta^{(\ell,n,0)}$ denotes the model with neuron $n$’s down\_proj column zeroed\. We use this gradient\-times\-activation approximation because it scores all neurons in a single forward and backward pass; the exact counterfactual difference would require one extra forward pass per neuron\.

Sign convention\. A positive $\Delta_{n}^{(\ell)}$ means neuron $n$ is currently *suppressing* the loop token: zero\-ablating it would raise $p(\text{loop})$, so $n$ is an *anti\-loop* neuron and a candidate for amplification \(scaling its output up to strengthen its suppressive effect\)\. A negative $\Delta_{n}^{(\ell)}$ means $n$ is currently pushing the model toward the loop token, so $n$ is a *pro\-loop* neuron and a candidate for zeroing, downscaling, or sign inversion of its down\_proj column\. A small number of neurons concentrate most of the contrast\. For example, for the E4B model, only 3 of 10,240 neurons carry enough loop\-specific $\Delta_{n}^{(\ell)}$ that zeroing them eliminates the loop entirely\. The top\-$K$ candidates by $\lvert\Delta_{n}^{(\ell)}\rvert$ \(where $K$ is the number of neurons to select, swept empirically in the intervention experiments below\) are then confirmed by exact zero\-ablation: the selected columns are zeroed in the model weights and full\-generation loop rates are measured on the $8 \times 8$ evaluation grid \(8 prompts × 8 seeds, per thinking mode\) to verify that the attribution ranking translates to an actual reduction in loop rate\.
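The attribution score and its exact counterpart can be compared on a toy single\-layer GeGLU MLP\. This sketch rests on several simplifying assumptions that are not from the paper: random weights, tiny dimensions, a direct unembedding after the MLP, and positions that do not interact \(a real transformer couples positions through attention\)\. The gradient is written analytically for this toy, whereas a real model would use automatic differentiation\.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V, P = 8, 12, 5, 4                      # toy sizes, not Gemma 4 dimensions
w_gate = rng.normal(scale=0.3, size=(H, D))   # gate_proj
w_up = rng.normal(scale=0.3, size=(H, D))     # up_proj
w_down = rng.normal(scale=0.3, size=(D, H))   # down_proj, column n = neuron n
unembed = rng.normal(scale=0.3, size=(V, D))
xs = rng.normal(size=(P, D))                  # residual inputs at P positions
TARGET = 2                                    # hypothetical loop-token index t


def gelu_tanh(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))


def activations(x):
    """Post-gate activations a_n(p) for all neurons at one position."""
    return gelu_tanh(w_gate @ x) * (w_up @ x)


def log_probs(x, a):
    logits = unembed @ (x + w_down @ a)
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())


def attribution_scores():
    """Delta_n = -sum_p a_n(p) * g_n(p), for every neuron at once."""
    delta = np.zeros(H)
    for x in xs:
        a = activations(x)
        probs = np.exp(log_probs(x, a))
        one_hot = np.eye(V)[TARGET]
        g = w_down.T @ (unembed.T @ (one_hot - probs))  # d log p(t) / d a_n(p)
        delta -= a * g
    return delta


def ablation_effect(n, scale=0.0):
    """Exact change in summed log p(t) when neuron n's write is multiplied by `scale`."""
    total = 0.0
    for x in xs:
        a = activations(x)
        edited = a.copy()
        edited[n] *= scale
        total += log_probs(x, edited)[TARGET] - log_probs(x, a)[TARGET]
    return total


delta = attribution_scores()
top = int(np.argmax(np.abs(delta)))
print(top, delta[top], ablation_effect(top))  # first-order score vs exact zero-ablation
```

The two printed numbers need not match exactly, because the score is only a first\-order estimate\. That gap is why the top candidates are confirmed with exact zero\-ablation and full\-generation sweeps\.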

The E2B model requires an additional methodological distinction\. The strongest single\-token ablation signal appears around L13–L15, the layers where the current loop token is most directly committed to the residual stream\. However, interventions at those layers did not produce the best behavioral result\. The selected E2B intervention acts earlier, at L10/L12, and was identified by empirical loop\-rate sweeps over candidate MLP layers and intervention directions\. Thus, the ablation analysis localizes where the current loop token is written, while full\-generation sweeps identify where a static edit changes the multi\-step trajectory\.

For the Gemma 4 26B MoE model, the analogous object is not an MLP neuron but a routed expert slot\. We apply the same pre\-loop vs loop contrast to router\-selection probabilities for each \(layer, expert\) pair\. This produces a small set of expert positions that are selected far more often once the model enters the loop\.
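One way to compute such a contrast is sketched below\. It assumes the routing decisions have already been logged as, for each token position, a list over layers of the selected expert indices\. It also uses selection frequency as a stand\-in for router\-selection probability\. The data layout, the function names, and the difference\-based ranking are illustrative assumptions, not the authors' implementation\.

```python
from collections import Counter


def selection_rates(routing, positions):
    """Fraction of positions at which each (layer, expert) slot is selected."""
    counts = Counter()
    for p in positions:
        for layer, experts in enumerate(routing[p]):
            for expert in experts:
                counts[(layer, expert)] += 1
    n = max(len(positions), 1)
    return {slot: c / n for slot, c in counts.items()}


def loop_enriched_slots(routing, pre_loop, loop, top_k=3):
    """Slots whose selection rate rises most from the pre-loop to the loop window."""
    before = selection_rates(routing, pre_loop)
    inside = selection_rates(routing, loop)
    contrast = {slot: rate - before.get(slot, 0.0) for slot, rate in inside.items()}
    return sorted(contrast, key=contrast.get, reverse=True)[:top_k]


# Toy log: 2 layers, 2 experts per token; routing shifts once the loop starts.
routing = [[(0, 1), (0, 1)]] * 4 + [[(0, 5), (2, 1)]] * 4
print(loop_enriched_slots(routing, range(0, 4), range(4, 8), top_k=2))
```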

The following section evaluates static weight\-space interventions derived from these attribution and sweep results\.

## 5 Intervention Search and Model\-Specific Results

We evaluate four classes of static weight interventions on the top\-$K$ selected units:

- Stripping: zero the selected down\_proj columns, removing those neurons’ write contribution to the residual stream\. Analogously, in the MoE case \(26B\), zero the post\-router scale of the selected expert slots\.
- Amplification: scale the selected units by a factor $\alpha > 1$\.
- Suppression: scale the selected units by a factor $0 < \beta < 1$\.
- Sign inversion: scale a selected MLP column by $\alpha < 0$, converting a pro\-loop direction into an active anti\-loop direction\.
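For the dense models, all four operations reduce to multiplying selected down\_proj columns by a scalar\. The sketch below shows this on a small NumPy matrix\. It assumes the weight is stored as \(hidden size, intermediate size\), so that column `n` is neuron `n`'s write direction\. The helper name `edit_columns` and the toy values are hypothetical\.

```python
import numpy as np


def edit_columns(w_down, neurons, scale):
    """Return a copy of down_proj with the given neuron columns multiplied by `scale`."""
    edited = w_down.copy()
    edited[:, neurons] *= scale
    return edited


# Toy (hidden=3, intermediate=4) matrix standing in for one layer's down_proj.
w_down = np.arange(12, dtype=float).reshape(3, 4)

stripped = edit_columns(w_down, [1], 0.0)     # stripping: remove the write entirely
amplified = edit_columns(w_down, [2], 3.0)    # amplification: alpha > 1
suppressed = edit_columns(w_down, [2], 0.5)   # suppression: 0 < beta < 1
inverted = edit_columns(w_down, [1], -0.8)    # sign inversion: alpha < 0
```

Because the edit is baked into the weights, it needs no inference\-time hooks and applies identically in any inference engine that loads the checkpoint\.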

For the 31B and E4B models, stripping the selected MLP neurons produced the lowest loop rates without measurable benchmark degradation\. E2B required a different intervention direction: stripping the most pro\-loop L10 neurons reduced failures but left residual soft\-loop failures, whereas sign inversion of the dominant L10 pro\-loop neuron eliminated the failures\. For the 26B model, shared\-MLP interventions were not appropriate because the shared MLP accounts for only about 27% of FFN compute\. Routed experts carry the remaining FFN computation, so expert masking was selected as the intervention method instead\.

### 5.1 Selected interventions per model

For the E2B, E4B, and 31B models, the intervention modifies selected pro\-loop or anti\-loop MLP neurons in the relevant layers\. In 31B and E4B, the selected intervention is stripping\. In E2B, the selected intervention combines sign inversion of one L10 MLP neuron with amplification of one L12 MLP neuron\.

- 31B: strip 1100 MLP neurons in L36, selected from two enumeration probes: 1000 from the `firefly_list` episode prompt and 100 from the prompt for `wire_episodes`. This eliminates loops with thinking enabled (0/64) and reduces them to 1/64 with thinking disabled; a single residual tight loop on the `firefly_list` prompt (seed 3) remains. The edit touches 1100 of 1,290,240 total FFN neurons: 0.085%.
- E4B: strip the top 3 MLP neurons in L18. This eliminates all failures with thinking enabled (0/64) and reduces them to 2/64 with thinking disabled, both soft loops (one on the `constellation` prompt, one on the `firefly_list` prompt). The edit touches 3 of 430,080 total FFN neurons: 0.0007%.
- E2B: apply sign inversion to the L10 pro-loop neuron 3513 by multiplying its `down_proj` column by $\alpha = -0.8$, then amplify one L12 anti-loop neuron, 3838, by $\alpha = 3.0$. This intervention reduces E2B failures from 6/64 to 0/64 with thinking enabled, and from 7/64 to 0/64 with thinking disabled. The edit touches 2 of 215,040 total FFN neurons: 0.0009%.
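The footprint percentages quoted above follow directly from the edited and total neuron counts. A short check, using only the counts stated in the list (the dictionary layout and the function name `edited_percent` are illustrative):

```python
# (edited neurons, total FFN neurons) as stated for each dense model.
edits = {
    "31B": (1100, 1_290_240),
    "E4B": (3, 430_080),
    "E2B": (2, 215_040),
}

def edited_percent(n_edited, n_total):
    """Share of FFN neurons touched by the edit, in percent."""
    return 100.0 * n_edited / n_total

for model, (n_edited, n_total) in edits.items():
    print(f"{model}: {n_edited}/{n_total} = {edited_percent(n_edited, n_total):.4f}%")
```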

Across the external general-purpose benchmark suite, the selected interventions stay within roughly $\pm 1$ pp for 31B, $(-1, +2)$ pp for E4B, and $-1.3$/$+1.7$ pp for E2B. The detailed benchmark results are presented in Section [6](https://arxiv.org/html/2606.13705#S6).

### 5.2 E2B: sign inversion of a pro-loop neuron

E2B differs from E4B and 31B in two ways. First, its baseline failures are concentrated almost entirely in the `constellation` probe: 6/64 failures with thinking enabled and 7/64 with thinking disabled. Second, all baseline failures are soft loops rather than tight loops. Zeroing the three strongest pro-loop L10 neurons reduced failures from 13/128 to 4/128. The four residual seeds still locked onto constellation names, with different names in different seeds, meaning that other L10 neurons can sustain the same failure when the top three are removed.

Sign inversion resolves this zeroing-only limitation. A one-neuron variant that multiplies the L10 neuron 3513 by $\alpha = -1$ reaches 0/128 failures, but has a worse benchmark tail of about $-2.6$ pp. The selected two-neuron variant tunes this sign inversion to $\alpha = -0.8$ and adds an L12 anti-loop amplification at $\alpha = 3.0$, giving 0/128 failures with worst regression $-1.3$ pp and average absolute benchmark delta 0.57 pp. The full Pareto sweep across neuron counts is shown in Figure [2](https://arxiv.org/html/2606.13705#S5.F2) and Table [2](https://arxiv.org/html/2606.13705#S5.T2).

The mechanistic lesson is that zeroing left the model neutral enough to reroute the constellation list\-collapse direction through other L10 neurons\. Sign inversion actively emits an anti\-loop steering vector where the pro\-loop feature used to fire\. The L12 amplification then restores enough balance to reduce benchmark regression\.

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/03_E2B_per_neuron_pareto.png)

Figure 2: E2B per-neuron-count Pareto frontier. Each bar is a weight-edit variant with $n$ modified MLP neurons, plotted at its worst-case benchmark regression across six benchmarks (IFEval, TruthfulQA, ARC-E/C, MMLU, GSM8K) under both thinking modes. Annotations report loop counts over 128 generations; green bars eliminate all 13 baseline loops, red retain residual loops. The selected two-neuron edit ($n = 2$, bold border) achieves 0 loops at $-1.3$ pp worst-case regression.

Table 2: E2B interventions grouped by number of modified MLP neurons. Loop counts are reported over the full 8-prompt $\times$ 8-seed $\times$ 2-mode grid. In the worst $\Delta$ column, “disabled”/“enabled” refer to thinking modes. † E2B scores near chance on MMLU (thinking disabled; baseline 25.10%, chance = 25.00%), so that delta carries no meaningful signal; next worst is $-0.74$ pp on IFEval (thinking disabled).

### 5.3 Probe-combination lesson for 31B

The results for the 31B model depended on how we combined probes. A natural question was whether to select neurons from one probe (e.g. the `firefly_list` probe only), from the intersection of multiple triggers, or from a combined ranking. We compared five strategies on the model’s L36 layer (all stripping methods), all evaluated on the same $8 \times 8$ grid (Table [3](https://arxiv.org/html/2606.13705#S5.T3)).

Table 3: Probe-combination strategies for the 31B L36 stripping edit. Each row is a neuron selection strategy evaluated on the $8 \times 8$ grid (64 generations, thinking enabled); *best result* is the lowest loop count achieved across the tested values of $K$ (number of neurons stripped; the best-performing $K$ for each strategy is encoded in the variant name); $\star$ marks the selected variant (which retains 1/64 residual in thinking-disabled mode).

The intersection numbers indicate that neurons that are loop-driving on *every* trigger are not necessarily the right target. Different prompts can enter different attractor basins in the same layer. For example, the `firefly_list` probe often collapses toward “The Message”, while the probe for `wire_episodes` often collapses toward “The Detail”. The neurons supporting those two attractors overlap only partially. In this case, the right intervention is to strip both sets, with $K$ tuned per probe: 1000 neurons from the `firefly_list` probe and 100 from the probe for `wire_episodes`. `sum-K1000` is the second-best variant (1/64; same recipe, single hyperparameter $K$).

The key insight from these experiments is that “generally pro-loop” neurons are not necessarily the right target. The intersection strategy keeps the neurons that are pro-loop on every probe, but it throws away the prompt-specific drivers that each attractor actually uses. The selected edit keeps the top drivers from one probe and adds the sharpest drivers of the second probe separately. In this case, covering the two attractors with separate neuron sets addresses the observed loop behavior better than a single generic loop ranking.
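The difference between the intersection strategy and the partitioned per-probe selection can be illustrated on toy attribution scores. In the sketch below the scores, the four-neuron layer, and the function names are invented for illustration; only the two selection ideas (intersection of per-probe top-$K$ sets versus a union with $K$ tuned per probe) come from the text.

```python
def top_k(scores, k):
    """Ids of the k highest-scoring neurons (ties broken by neuron id)."""
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return set(ranked[:k])

def intersection_selection(probe_scores, k):
    """Keep only neurons that are in the top-k of every probe."""
    per_probe = [top_k(scores, k) for scores in probe_scores.values()]
    return set.intersection(*per_probe)

def partitioned_selection(probe_scores, k_per_probe):
    """Union of per-probe top-k sets, with k chosen separately per probe."""
    selected = set()
    for probe, k in k_per_probe.items():
        selected |= top_k(probe_scores[probe], k)
    return selected

# Hypothetical pro-loop scores for four neurons under two probes.
probe_scores = {
    "firefly_list":  {0: 0.90, 1: 0.80, 2: 0.10, 3: 0.05},
    "wire_episodes": {0: 0.60, 1: 0.00, 2: 0.02, 3: 0.70},
}

shared_only = intersection_selection(probe_scores, k=2)
partitioned = partitioned_selection(
    probe_scores, {"firefly_list": 2, "wire_episodes": 1}
)
print(shared_only, partitioned)
```

In this toy case the intersection keeps only neuron 0 and drops neurons 1 and 3, each of which drives one probe's attractor, while the partitioned selection retains all three.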

### 5.4 MoE model: masking routed experts

The 26B model is a Mixture-of-Experts model with 30 layers; each layer has a shared MLP (2112-dim) and 128 routed experts (each 704-dim), with the router selecting top-8 experts per token. The shared MLP is only $\sim$27% of FFN compute; experts carry $\sim$73%. Stripping the shared MLP, as in the dense 31B model, caused severe regressions: for example, GSM8K dropped by 34 pp at $K = 100$ with thinking enabled. Smaller $K$ values such as 30 or 40 reduced the damage but still lost 5–11 pp on QA and GSM-style tasks with thinking enabled.
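The 27% / 73% split can be reproduced from the stated widths if one assumes that per-token FFN compute is proportional to the active hidden width (the shared MLP plus the eight selected experts). That proportionality is our reading, not something the text states explicitly.

```python
# Widths and top-k as stated above; the compute-proportional-to-width
# assumption is ours.
shared_dim = 2112
expert_dim = 704
experts_per_token = 8

active_expert_dim = expert_dim * experts_per_token           # 5632
shared_share = shared_dim / (shared_dim + active_expert_dim)
expert_share = 1.0 - shared_share

print(f"shared MLP: {shared_share:.1%}, routed experts: {expert_share:.1%}")
```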

Consequently, we treated the 26B model differently: we score expert routing instead of MLP activations. For each layer $L$ and routed expert $E$, we compare how often the router selects that expert on loop tokens versus pre-loop tokens. This is the loop-specificity score analogous to the per-neuron $\Delta_n$ above, applied to router-selection events instead of MLP-neuron ablations:

$$
\Delta_{E}^{(L)} \;=\; \frac{1}{|T_{\mathrm{loop}}|} \sum_{p \in T_{\mathrm{loop}}} \mathbf{1}\!\left[E \in \mathrm{TopK}\!\left(g_{L}(x_{<p})\right)\right] \;-\; \frac{1}{|T_{\mathrm{pre}}|} \sum_{p \in T_{\mathrm{pre}}} \mathbf{1}\!\left[E \in \mathrm{TopK}\!\left(g_{L}(x_{<p})\right)\right]
$$

Here, $g_{L}(x_{<p})$ denotes the router logits at layer $L$ for position $p$, and $\mathrm{TopK}$ is the router’s selected expert set. A large positive $\Delta_{E}^{(L)}$ means the expert is selected much more often inside the loop than before lock-in.
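The score is a difference of two selection frequencies, which the following sketch computes for one layer from recorded router logits. The toy logits (four experts, top-2 routing), the position split, and the function names are assumptions made for illustration; in the paper's setting there are 128 experts with top-8 routing.

```python
def top_k_experts(router_logits, k):
    """Indices of the k experts with the largest router logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: (-router_logits[e], e))
    return set(ranked[:k])

def selection_rate(logits_per_position, positions, expert, k):
    """Fraction of the given positions at which `expert` is in the top-k set."""
    hits = sum(expert in top_k_experts(logits_per_position[p], k)
               for p in positions)
    return hits / len(positions)

def loop_specificity(logits_per_position, pre_positions, loop_positions, expert, k):
    """Selection rate on loop tokens minus selection rate on pre-loop tokens."""
    return (selection_rate(logits_per_position, loop_positions, expert, k)
            - selection_rate(logits_per_position, pre_positions, expert, k))

# Hypothetical router logits for one layer: 6 positions x 4 experts.
logits = [
    [3, 2, 1, 0],   # pre-loop
    [3, 2, 1, 0],   # pre-loop
    [0, 3, 2, 1],   # pre-loop
    [3, 0, 1, 2],   # pre-loop
    [0, 1, 3, 2],   # loop
    [0, 1, 2, 3],   # loop
]
pre, loop = [0, 1, 2, 3], [4, 5]

delta_expert3 = loop_specificity(logits, pre, loop, expert=3, k=2)
delta_expert0 = loop_specificity(logits, pre, loop, expert=0, k=2)
print(delta_expert3, delta_expert0)
```

Here expert 3 is selected at every loop position but at only one of four pre-loop positions, so it receives a large positive score, while expert 0 receives a negative one.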

Through this analysis, we observed that a few expert positions are extremely loop\-specific, as shown in Figure[3](https://arxiv.org/html/2606.13705#S5.F3)\.

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/26B_loop_experts.png)

Figure 3: Top-15 (layer, expert) pairs ranked by $\Delta = p(\text{expert} \mid \text{loop}) - p(\text{expert} \mid \text{pre-loop})$, computed on a single 26B rollout of `wire_episodes` seed 1 (period-11 tight loop; 1,291 pre-loop and 209 loop tokens). The three *[masked]* experts (L22:E47, L21:E98, L21:E47) jump from under 6% selection pre-loop to over 60% inside the loop; masking these three yields the `v2_top3` intervention (Figure [4](https://arxiv.org/html/2606.13705#S5.F4)).

We then mask these experts in different combinations, test the masks at inference, and encode into the model’s weights those that minimize loop rates without benchmark regression. The best variant masks only L21:E47, L21:E98, and L22:E47, that is, three of $30 \times 128 = 3840$ expert slots (0.078%). It reduces tight loops from 6/64 to 2/64 overall and from 5/8 to 0/8 on the highest-failure `wire_episodes` probe. Results on the $8 \times 8$ grid are given in Figure [4](https://arxiv.org/html/2606.13705#S5.F4).

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/26B_mask_sweep.png)

Figure 4: Tight-loop counts ($8 \text{ prompts} \times 8 \text{ seeds} = 64$ generations, `max_new_tokens=1536`, thinking enabled) under five expert-mask interventions. `v2_top3` (L21:E47, L21:E98, L22:E47) is the selected variant, reducing tight loops from 6/64 to 2/64. Masking more experts (`v3_top5`, `v4_top10`) does not improve further as the model re-routes to alternate attractors. `v5_doom` targets experts active during long-budget rumination but proves to suppress general-purpose reasoning, introducing new loops on `pokemon_gen1` (5/8 seeds).

Expert masking, as applied here, is *unconditional*: the mask is active for every token on every prompt. In principle one could apply the mask conditionally, only when a run-time detector signals that the model is approaching a loop; however, this would require custom inference code, an inference-time loop detector, and the associated latency overhead. Our goal is instead to provide a static weight modification that constitutes a drop-in replacement for the Gemma 4 model weights. Once the mask is encoded into the static weights, every token of every prompt sees those expert outputs as zero. This constraint is precisely what makes the sweep delicate. The objective is not to identify every expert involved in looping, but to find the smallest set that is selectively active on loop tokens, such that permanently zeroing them does not degrade non-loop generations.
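To make the unconditional nature of the static mask concrete, the sketch below models the routed part of one MoE layer as a gate-weighted sum of expert outputs, each multiplied by a per-expert post-router scale. Scalar expert outputs, the gate values, and the names `moe_output` and `mask_experts` are simplifying assumptions; the actual Gemma 4 parameter layout is not described in this section.

```python
def moe_output(gates, expert_outputs, post_router_scale):
    """Gate-weighted sum over the experts the router selected for one token."""
    return sum(gate * post_router_scale[e] * expert_outputs[e]
               for e, gate in gates.items())

def mask_experts(post_router_scale, masked):
    """Static mask: set the post-router scale of the masked experts to zero."""
    return {e: (0.0 if e in masked else s)
            for e, s in post_router_scale.items()}

# Hypothetical token routed to experts 47, 98 and 3 (toy scalar outputs).
gates = {47: 0.5, 98: 0.25, 3: 0.25}
expert_outputs = {47: 2.0, 98: 4.0, 3: 8.0}
scale = {47: 1.0, 98: 1.0, 3: 1.0}

baseline = moe_output(gates, expert_outputs, scale)
masked_scale = mask_experts(scale, masked={47, 98})
edited = moe_output(gates, expert_outputs, masked_scale)
print(baseline, edited)
```

The router still selects the masked experts; their contribution is simply zero for every token, whether or not that token belongs to a loop, which is why the masked set has to be small and loop-specific.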

## 6 Results

The selected variants are the smallest interventions we found that remove the main loop phenotype without broad benchmark regression. Table [4](https://arxiv.org/html/2606.13705#S6.T4) maps the internal experiment identifiers to the names used in the paper; Table [5](https://arxiv.org/html/2606.13705#S6.T5) summarizes the before/after loop counts and long-budget behavior for each selected variant.

The weight surgery eliminates or substantially reduces the observed loop behaviors across all four models (Figure [5](https://arxiv.org/html/2606.13705#S6.F5)). In the thinking-enabled mode, E2B, E4B, and 31B reach zero loop failures; in thinking-disabled mode, small residuals remain for E4B (2 soft) and 31B (1 tight). The remaining failure mode (doom looping) is discussed in Section [7](https://arxiv.org/html/2606.13705#S7).

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/00_loop_rates_all_models.png)

Figure 5: Loop rates before and after the selected interventions at the canonical 1.5k-token budget, broken down by thinking mode and loop kind (tight / soft). E2B reaches zero loops in both modes. E4B and 31B reach zero loops in thinking-enabled mode; E4B retains 2 soft residuals and 31B retains 1 tight residual in thinking-disabled mode. The MoE (26B) edit substantially reduces loops across both modes. See Figure [7](https://arxiv.org/html/2606.13705#S7.F7) for long-budget robustness and Appendix [B](https://arxiv.org/html/2606.13705#A2) for the full per-prompt, per-seed breakdown.

Table 4: Selected intervention identifiers. The `model_id` column is a readable label used throughout the paper; the internal identifier records the exact experiment name.

All variants were also evaluated on a general-purpose benchmark suite using the Inspect-AI evaluation framework (UK AI Security Institute, [2024](https://arxiv.org/html/2606.13705#bib.bib2)), covering 6 benchmarks $\times$ 2 thinking modes with the same sampling configuration used everywhere (`temperature=0.0`, `top_p=0.95`, `max_output_tokens=8192`, no repetition penalty) (Figure [6](https://arxiv.org/html/2606.13705#S6.F6)).

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/01_benchmark_deltas.png)

Figure 6: Benchmark performance deltas relative to the unedited baseline, across six general-purpose benchmarks and two thinking modes. Green = improvement over baseline; red = regression. The four selected variants (one per model) are highlighted. Five alternative model variants are included along with the selected variants for comparison. In MMLU, baseline and patched models score very low (both thinking modes) because the default Inspect AI configuration for this task caps each question at 16 output tokens and expects a single letter, but Gemma 4 emits explanatory prose first; deltas in that column compare two equally-truncated cells and are reported only for completeness.

The benchmark deltas are small: within $\pm 1$ pp for 31B, within $(-1, +2)$ pp for E4B, worst $-1.3$ pp for the two-neuron E2B Pareto variant, and within $(-1, +1.6)$ pp for 26B on every benchmark in every mode. MMLU is uninformative here: under the Inspect-AI task configuration, its 16-token answer cap makes Gemma 4’s prose-first outputs score very low in both modes, so its deltas compare two equally-truncated cells rather than real capability change (Figure [6](https://arxiv.org/html/2606.13705#S6.F6)). E2B improves several cells while keeping the tail regression modest. The important point is not that every number improves, but that the loop fix does not pay for its robustness with broad capability loss.

Table 5: Selected variants and before/after behavior. Internal experiment identifiers are listed in Table [4](https://arxiv.org/html/2606.13705#S6.T4). T = tight loop; L = soft loop (list-collapse); nat-EOS = natural end-of-sequence completion. † Worst: $-1.30$ pp on MMLU with thinking disabled (near chance). Excluding that cell, the largest drop is $-0.74$ pp on IFEval with thinking disabled.

## 7 Discussion

### 7.1 Doom Looping

The preceding results show that our proposed weight edits substantially reduce or eliminate fast-commit loops at normal generation budgets. For the two smaller models, the interventions hold at long budgets as well: sweeps at 4k and 8k tokens confirm zero tight loops for E2B and E4B, with no new emergence of doom looping. For the two larger models (26B and 31B), doom looping is observed at substantial rates at longer thinking budgets (`max_new_tokens` $\geq$ 4096, `enable_thinking=true`) on factually uncertain prompts (e.g. “List every episode of The Wire”).

This regime is not unique to long generations\. Even at the canonical 1\.5k\-token limit, the larger models rarely produce a finished answer on these doom\-prone prompts \(about 3/24 for 31B, with or without the edit\); the self\-correction is cut off before it locks into a verbatim loop, so it registers as an unfinished generation rather than a detected loop\. We study longer generation lengths because they let the regime fully manifest\.

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/01_long_budget_doom.png)

Figure 7: Selected-variant loop rates at 1.5k, 4k, and 8k token budgets (`think=yes`, `rep_pen=1.0`), pooled over 3 doom-prone prompts $\times$ 8 seeds per cell ($n = 24$, all models). E2B and E4B remain clean at long budgets ($\leq 1/24$ residual, all soft). 26B and 31B exhibit substantial doom looping that does not worsen between 4k and 8k but does not resolve either.

In these runs, the model can spend 1500–3000 tokens self-correcting without resolving the missing fact: “S1E6: The Detail … no. … I am looping. Let me just list the titles correctly …”. We call this regime *doom looping* (other terms are also used for it, e.g. *circular reasoning* (Duan et al., [2026](https://arxiv.org/html/2606.13705#bib.bib22))): a non-convergent self-evaluation cycle in which the model treats its own outputs as new evidence and reproduces its uncertainty without making progress. Doom looping presents in two surface forms in our long-budget sweeps. The first is a tight token-period lock on the self-correction template itself (e.g. `*S1 E6 is "The Detail" is wrong.*` repeated to budget exhaustion), classified by the detector as a tight loop. The second is indefinite rephrasing (“I must be hallucinating”, “Let me start over”) until the budget runs out, producing neither a loop verdict nor a natural-EOS (End-of-Sequence) completion. Both forms share the same underlying mechanism, sustained self-correction under factual uncertainty, and differ only in whether the rephrasing eventually settles onto a verbatim repetition. Adding a repetition penalty makes the verbatim lock-in less likely but does not eliminate it; its principal effect on this regime is to shift more runs from doom looping of either form to clean natural-EOS completion (Figure [8](https://arxiv.org/html/2606.13705#S7.F8)).
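The first surface form, a tight token-period lock, is the kind of pattern a simple periodicity check on the generated tail can flag. The sketch below is a hypothetical illustration of such a check and is not the detector used in this work; the function name and the `max_period` and `min_repeats` thresholds are assumptions.

```python
def tight_loop_period(tokens, max_period=64, min_repeats=3):
    """Smallest period p such that the last p * min_repeats tokens repeat
    with period p; returns None if no such period exists."""
    n = len(tokens)
    for p in range(1, max_period + 1):
        span = p * min_repeats
        if span > n:
            break
        tail = tokens[n - span:]
        if all(tail[i] == tail[i - p] for i in range(p, span)):
            return p
    return None

# Toy token streams (strings stand in for token ids).
locked = ["List", ":"] + ["S1E6", "is", "wrong", "."] * 4
rephrasing = [f"tok{i}" for i in range(40)]

print(tight_loop_period(locked), tight_loop_period(rephrasing))
```

A check of this kind fires only on verbatim repetition, which is consistent with the second form (indefinite rephrasing) producing no loop verdict even though the run never converges.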

Table [8](https://arxiv.org/html/2606.13705#A3.T8) (Appendix [C](https://arxiv.org/html/2606.13705#A3)) reports outcome shares for the two doom-prone models pooled across the three doom-prone prompts, 8 seeds, and both long budgets ($n = 48$ per row, `think=yes`). This breakdown reveals an observation that the detector-positive loop count alone hides: at `rep_pen=1.0` the weight edit largely *reshapes* rather than reduces doom looping. More specifically, for 31B the edit cuts tight loops from 24/48 to 10/48 but converts the freed share into endless self-correction (6/48 to 26/48), so total doom looping rises slightly (32/48 to 38/48). For 26B the pattern is similar (tight 17/48 to 6/48; endless 17/48 to 29/48; total 34/48 to 39/48). The clear reduction in total doom looping for both selected variants comes from adding `rep_pen=1.15` (31B: 38 to 26; 26B: 39 to 16), which also raises natural-EOS completions (31B: 10 to 20; 26B: 9 to 32). The weight edit and the penalty thus act on different aspects of the same regime: the edit removes much of the tight lock-in surface form (the same tight/soft phenotype as the fast-commit loop, now arising late within the doom-looping regime), while the penalty converts the remaining sustained self-correction into clean terminations.
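The tight and endless counts quoted above do not by themselves sum to the stated totals. The short tally below computes the remainder for each row at `rep_pen=1.0`. Attributing that remainder to soft lock-ins is our inference (it matches the per-budget soft counts given later in this section for the edited models), not a figure stated in this paragraph.

```python
# Counts out of 48 at rep_pen=1.0, as quoted in the paragraph above.
rows = {
    ("31B", "baseline"): {"tight": 24, "endless": 6,  "total": 32},
    ("31B", "edited"):   {"tight": 10, "endless": 26, "total": 38},
    ("26B", "baseline"): {"tight": 17, "endless": 17, "total": 34},
    ("26B", "edited"):   {"tight": 6,  "endless": 29, "total": 39},
}

def implied_remainder(row):
    """Doom-looping runs that are neither tight loops nor endless self-correction."""
    return row["total"] - row["tight"] - row["endless"]

remainders = {key: implied_remainder(row) for key, row in rows.items()}
for (model, variant), value in remainders.items():
    print(f"{model} {variant}: remainder {value}/48")
```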

Conceptually, the selected edit reveals an underlying knowledge gap. Without the edit, the fast-commit loop can hide uncertainty by locking onto *something* and repeating it early. With the edit, the two larger Gemma 4 models, 26B and 31B, can self-correct for thousands more tokens. If the required fact is unavailable, however, the doom looping regime then takes over: the model revisits the same uncertain fact(s) without converging, exhausting the generation budget in either form (Figure [8](https://arxiv.org/html/2606.13705#S7.F8)).

*Why* a knowledge gap prevents the model from committing during reasoning remains unresolved. One plausible explanation is loss of factual precision during pruning or distillation from a larger model, but this remains a hypothesis rather than a conclusion established by these experiments.

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/02_wire_doom_matrix_all_models.png)

Figure 8: Natural-EOS (End-of-Sequence) rate (cell colour) on the `wire_episodes` prompt across all four models, both variants (baseline and selected intervention), and four long-budget cells (2 token budgets $\times$ 2 repetition penalties), `think=yes`, 8 seeds per cell. Cell annotations show `nat: N/8` (natural-EOS count), `T` (tight loops), and `L` (soft loops). E2B and E4B remain fully clean across all cells (all green). For the 31B selected variant, the `max_new=8192` / `rep_pen=1.15` configuration is the only one that rescues completions: 0/8 tight loops, 4/8 soft loops, and 4/8 natural-EOS endings (vs. baseline’s 1/8 natural-EOS at the same cell). For the 26B selected variant, `rep_pen=1.15` increases tight loops on this prompt (2/8 to 7/8) but substantially raises natural-EOS completions; the two settings optimize different objectives.

Is the doom-loop mechanism a separable circuit? Our evidence suggests no. We attempted to target experts active during self-correction interjections such as `*Actually:*`, `*Let’s try again*`, and `(No)`, then masked those experts on top of `v2_top3` for the 26B model.

Two things happened simultaneously:

- It *did* kill 26B’s late-budget tight loops (the doom-loop pattern we were targeting).
- It *also* introduced previously absent loops on the `pokemon_gen1` probe (5/8 seeds entered tight loops on *Zapdos*) and on the `firefly_list` probe.

That second observation is diagnostic: the experts that activate during the self\-correction interjections are likely to be general\-purpose verification and reasoning components for prompts that require fact\-checking, and not a separable doom\-loop circuit\. Consequently, suppressing them reduces the doom\-looping rate at the cost of disrupting more broadly used reasoning capabilities\. Doom looping is therefore structurally distinct from fast\-commit loops: the fast\-commit loop mechanism is localized enough to be edited in isolation, whereas the mechanism sustaining doom looping is entangled with general reasoning capability under factual uncertainty\.

A concrete illustration of this entanglement is visible on the `constellations` prompt at `rep_pen=1.0`. The 31B baseline produces a complete 88-item list at 4k tokens on 5/8 seeds, relying on a thinking pattern that lists constellations in alphabetical order, and in which the model writes “W: (none), X: (none), Y: (none), Z: (none)” at the end of its thinking block to verify full alphabetical coverage before committing. The selected variant retains this behavior on only 1/8 seeds, a regression of 4 successful baseline seeds. On the failing seeds, the model no longer produces this W-Z thinking closure step and enters doom looping instead. This collateral effect is consistent with the exploratory nature of these edits: a 1100-neuron strip on `down_proj` is intentionally broad, and small but real side-effects on specific reasoning steps and factual recall are not unexpected. Adding `rep_pen=1.15` largely closes the gap at 8k tokens: both variants then produce a correct list in 4/8 seeds. Notably, the penalty also disrupts the baseline’s alphabetical-verification step (its success rate drops from 5/8 at `rep_pen=1.0` to 4/8 at `rep_pen=1.15`), which is one reason `rep_pen=1.15` is recommended for 31B in Table [5](https://arxiv.org/html/2606.13705#S6.T5).

At long budgets the weight edit substantially reduces the *lock-in* surface form of doom looping (tight and soft locks) in the larger models, though, as shown above, it does not reduce total doom looping at `rep_pen=1.0`. For 31B (per budget, $n = 24$; counts are essentially identical at 4k and 8k), the edit reduces genuine lock-ins (excluding the false-positive `(none)` bookkeeping described above) from 13/24 to 6/24 (5 tight, 1 soft). At `rep_pen=1.0` this drop in verbatim lock-in is offset by a rise in endless self-correction (Table [8](https://arxiv.org/html/2606.13705#A3.T8)), so the lock-in count falls but total doom looping does not. Adding `repetition_penalty=1.15` converts much of the residual into clean completions: tight loops fall to 1/24 at both budgets and natural-EOS (End-of-Sequence) completions rise from 5/24 (at `rep_pen=1.0`) to 15/24 at 8k. The same penalty on the unedited baseline is less effective (7/24 lock-ins and 12/24 natural-EOS at 8k), so the edit and the penalty are complementary: the edit removes verbatim lock-in, and the penalty converts the freed self-correction into natural termination.

For the 26B model the pattern is the same in shape (per budget, $n = 24$). The edit reduces detector-flagged lock-ins from 8–9/24 to 5/24 at `rep_pen=1.0`, of which only 3/24 are tight (versus 8–9/24 tight in the baseline), but as with 31B, this trades verbatim lock-in for endless self-correction rather than reducing total doom looping (Table [8](https://arxiv.org/html/2606.13705#A3.T8)). Adding `rep_pen=1.15` modestly increases the lock-in count (to 7/24) yet substantially raises natural-EOS completions, from 3/24 to 15/24 at 4k and from 6/24 to 17/24 at 8k, so the model commits its reasoning and terminates naturally far more often. The two settings optimize different objectives: `rep_pen=1.0` minimizes lock-in count, while `rep_pen=1.15` maximizes natural completion rate at the cost of two additional lock-ins per 24 generations.

The residual doom looping that persists even after intervention is, in our assessment, fundamentally a *knowledge* problem: the model lacks the factual precision to resolve the enumeration (for example, it cannot reliably recall the correct title for Season 1 Episode 6 of *The Wire*), and no naïve weight edit can supply missing knowledge. The overall finding is narrow but useful. The proposed weight edits eliminate or substantially reduce fast-commit loops at normal budgets across all four models. For E2B and E4B this holds at long budgets too, where no doom looping emerges. For 31B and 26B, the edit removes most verbatim lock-in at long budgets, and the combined edit + `rep_pen=1.15` also substantially reduces total doom looping (Table [8](https://arxiv.org/html/2606.13705#A3.T8)), but neither of the selected interventions resolves the failure where the root cause is factual uncertainty.

### 7.2 Limitations and Future Work

The interventions reported in this work remove or substantially reduce fast-commit loops at normal generation budgets across all four Gemma 4 models studied. For the E2B and E4B models, this generalizes to long budgets as well, where the selected variants show essentially zero residual loops. For 26B and 31B, however, the same factually uncertain prompts can still collapse into tight loops at extended budgets, after the model has spent thousands of tokens self-correcting over facts it does not reliably know. The static weight edits therefore reduce but do not cure the underlying failure in the larger models: the residual stems from a knowledge-precision gap, and these weight edits cannot supply the missing facts. Addressing it would require a different class of intervention, such as targeted post-training that teaches the model to terminate gracefully under uncertainty, or a runtime-conditional edit gated by an online loop detector.

Our proposed approach constitutes an exploratory investigation rather than a systematic search for optimal interventions\. The reported configurations represent one demonstrably effective set of weight edits, not the result of an exhaustive sweep over intervention types, target layers, or neuron selection criteria\. A more thorough exploration of that parameter space could yield smaller, more generalizable, or more effective modifications than those reported here\.

Several additional caveats constrain the scope of the findings. The code-generation side-effect analysis in Appendix [D](https://arxiv.org/html/2606.13705#A4) is based on a single Rust prompt. The apparent safety of `rep_pen=1.15` for E4B should be treated with caution, as prompts whose idiomatic tokens carry smaller logit margins could surface the same corruption pathway at lower penalty values. The doom-loop attractor in the two larger models proved difficult to surgically isolate: a candidate intervention targeting experts active during long-budget self-correction (`v5_doom`) reduced late-budget tight loops but simultaneously introduced failures on previously clean prompts. This suggests that the components sustaining the doom attractor are entangled with general-purpose reasoning, rather than forming a separable circuit that can be suppressed without collateral cost. Finally, all four models studied belong to a single model family; whether the same per-layer, per-neuron localization of loop behavior extends to other instruction-tuned LLMs remains an open question outside the scope of this work. The procedure itself (per-layer attribution, per-unit loop-specificity scoring, and a static edit validated on a held-out prompt $\times$ seed grid) is architecture-agnostic; only the specific units it selects are model-dependent, so the method should transfer to other language models even if the exact neurons reported here do not.

## 8 Conclusion

In this work, we investigated the enumeration\-loop failure of the Gemma 4 model family, a reproducible phenomenon in which models commit to a repeated phrase or collapsing list on long factual enumeration tasks\. Through per\-layer attribution analysis and systematic sweeps over a range of surgical operations, including neuron zeroing, sign inversion, weight amplification, and expert\-slot masking, we characterized the failure, identified the internal components most strongly associated with it, and demonstrated that targeted modifications to those components substantially reduce or eliminate the observed loops while preserving general\-purpose benchmark performance within small percentage\-point deltas\. The required intervention varied substantially by model\. Most strikingly, a single MLP neuron modification is sufficient to eliminate the loops in E2B, with a two\-neuron variant giving the best benchmark tradeoff; this result illustrates how localized the failure mechanism can be\. In E4B, stripping three neurons in one layer suffices, whereas in 31B, probe\-combined neuron selection across a larger set was necessary, as a single generic attribution intersection proved insufficient\. In the MoE 26B, expert\-slot masking was used in place of MLP neuron editing, reflecting the model’s different internal architecture\. These differences reflect genuine model\-to\-model variation in how the looping mechanism is distributed across the network\. At extended generation budgets, we found that doom looping, i\.e\. a non\-convergent self\-correction regime, persists in the two larger models even after the primary fast\-commit loop pathway is removed, a failure mode that we attribute to factual\-knowledge gaps rather than a surgically removable circuit\.

## References

- Towards automated circuit discovery for mechanistic interpretability\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- Y\. N\. Dauphin, A\. Fan, M\. Auli, and D\. Grangier \(2017\) Language modeling with gated convolutional networks\. In International conference on machine learning, pp\. 933–941\. Cited by: [§4](https://arxiv.org/html/2606.13705#S4.p14.1)\.
- Z\. Duan, L\. Pang, Z\. Wei, W\. Duan, Y\. Tian, S\. Xu, J\. Deng, Z\. Yin, and X\. Cheng \(2026\) Circular reasoning: understanding self\-reinforcing loops in large reasoning models\. arXiv preprint arXiv:2601\.05693\. Cited by: [§7\.1](https://arxiv.org/html/2606.13705#S7.SS1.p3.1)\.
- N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen, et al\. \(2022\) Toy models of superposition\. arXiv preprint arXiv:2209\.10652\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- Z\. Fu, W\. Lam, A\. M\. So, and B\. Shi \(2021\) A theoretical analysis of the repetition problem in text generation\. In Proceedings of the AAAI Conference on Artificial Intelligence \(AAAI\), Vol\. 35, pp\. 12848–12856\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v35i14.17520)\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- W\. Gurnee, N\. Nanda, M\. Pauly, K\. Harvey, D\. Troitskii, and D\. Bertsimas \(2023\) Finding neurons in a haystack: case studies with sparse probing\. arXiv preprint arXiv:2305\.01610\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- P\. Hase, M\. Bansal, B\. Kim, and A\. Ghandeharioun \(2023\) Does localization inform editing? surprising differences in causality\-based localization vs\. knowledge editing in language models\. In Advances in Neural Information Processing Systems\. Cited by: [Appendix E](https://arxiv.org/html/2606.13705#A5.p4.1), [§2](https://arxiv.org/html/2606.13705#S2.p2.1), [§4](https://arxiv.org/html/2606.13705#S4.p12.1)\.
- D\. Hendrycks and K\. Gimpel \(2016\) Gaussian Error Linear Units \(GELUs\)\. arXiv preprint arXiv:1606\.08415\. External Links: [Link](https://arxiv.org/abs/1606.08415)\. Cited by: [§4](https://arxiv.org/html/2606.13705#S4.p14.18)\.
- A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi \(2020\) The curious case of neural text degeneration\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- H\. Kazemi, A\. Chegini, and M\. Safi \(2026\) A single neuron is sufficient to bypass safety alignment in large language models\. arXiv preprint arXiv:2605\.08513\. Cited by: [§1](https://arxiv.org/html/2606.13705#S1.p5.1), [§2](https://arxiv.org/html/2606.13705#S2.p2.1)\.
- O\. Kovaleva, S\. Kulshreshtha, A\. Rogers, and A\. Rumshisky \(2021\) BERT busters: outlier dimensions that disrupt transformers\. In Findings of the Association for Computational Linguistics: ACL 2021\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- S\. Marks, C\. Rager, E\. J\. Michaud, Y\. Belinkov, D\. Bau, and A\. Mueller \(2025\) Sparse feature circuits: discovering and editing interpretable causal graphs in language models\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- C\. McDougall, A\. Conmy, C\. Rushing, T\. McGrath, and N\. Nanda \(2023\) Copy suppression: comprehensively understanding an attention head\. arXiv preprint arXiv:2310\.04625\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022a\) Locating and editing factual associations in GPT\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p2.1)\.
- K\. Meng, A\. S\. Sharma, A\. Andonian, Y\. Belinkov, and D\. Bau \(2022b\) Mass\-editing memory in a transformer\. arXiv preprint arXiv:2210\.07229\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p2.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, et al\. \(2022\) In\-context learning and induction heads\. arXiv preprint arXiv:2209\.11895\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- G\. Puccetti, A\. Rogers, A\. Drozd, and F\. Dell’Orletta \(2022\) Outlier dimensions that disrupt transformers are driven by frequency\. In Findings of the Association for Computational Linguistics: EMNLP 2022\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- N\. Shazeer \(2020\) GLU Variants Improve Transformer\. arXiv preprint arXiv:2002\.05202\. External Links: [Link](https://arxiv.org/abs/2002.05202)\. Cited by: [§4](https://arxiv.org/html/2606.13705#S4.p14.1)\.
- A\. M\. Turner, L\. Thiergart, G\. Leech, D\. Udell, J\. J\. Vazquez, U\. Mini, and M\. MacDiarmid \(2023\) Steering language models with activation engineering\. arXiv preprint arXiv:2308\.10248\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p2.1)\.
- UK AI Security Institute \(2024\) Inspect AI: Framework for Large Language Model Evaluations\. External Links: [Link](https://github.com/UKGovernmentBEIS/inspect_ai)\. Cited by: [§6](https://arxiv.org/html/2606.13705#S6.p3.1)\.
- K\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt \(2022\) Interpretability in the wild: a circuit for indirect object identification in GPT\-2 small\. arXiv preprint arXiv:2211\.00593\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- B\. Wei, K\. Huang, Y\. Huang, T\. Xie, X\. Qi, M\. Xia, P\. Mittal, M\. Wang, and P\. Henderson \(2024\) Assessing the brittleness of safety alignment via pruning and low\-rank modifications\. arXiv preprint arXiv:2402\.05162\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p2.1)\.
- S\. Welleck, I\. Kulikov, S\. Roller, E\. Dinan, K\. Cho, and J\. Weston \(2020\) Neural text generation with unlikelihood training\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- J\. Xu, X\. Liu, J\. Yan, D\. Cai, H\. Li, and J\. Li \(2022\) Learning to break the loop: analyzing and mitigating repetitions for neural text generation\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p1.1)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, et al\. \(2023\) Representation engineering: a top\-down approach to AI transparency\. arXiv preprint arXiv:2310\.01405\. Cited by: [§2](https://arxiv.org/html/2606.13705#S2.p2.1)\.

## Appendix A Evaluation Probes

All experiments use the following 8 enumeration probes\. Each probe is run with 8 random seeds, giving 64 generations per \(model, variant, thinking mode\) cell in the canonical 1\.5k\-token sweep\. Table [6](https://arxiv.org/html/2606.13705#A1.T6) lists the probes along with their enumeration target and role in the evaluation\. The full enumeration table \(Appendix [B](https://arxiv.org/html/2606.13705#A2)\) uses abbreviated probe IDs to fit the seed columns: firefly = firefly\_list, noble\_gas = noble\_gases, us\_pres = us\_presidents, eu\_states = eu\_member\_states, mcu = mcu\_films; the remaining three IDs \(constellations, pokemon\_gen1, wire\_episodes\) are unchanged\.
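
The grid described above can be sketched as a small configuration\. This is only an illustration: the alias mapping is copied from the paragraph, while the seed values 0 to 7 and the names `PROBE_ALIASES` and `evaluation_grid` are assumptions introduced here, not identifiers from the paper's code\.

```python
# Abbreviated probe ID -> full probe ID, as listed in the paragraph above.
PROBE_ALIASES = {
    "firefly": "firefly_list",
    "noble_gas": "noble_gases",
    "us_pres": "us_presidents",
    "eu_states": "eu_member_states",
    "mcu": "mcu_films",
    "constellations": "constellations",
    "pokemon_gen1": "pokemon_gen1",
    "wire_episodes": "wire_episodes",
}
SEEDS = range(8)                 # assumption: seeds are simply numbered 0..7
THINKING_MODES = ("no", "yes")


def evaluation_grid():
    """Enumerate every (probe, seed, thinking mode) cell for one model variant."""
    return [
        (probe, seed, mode)
        for mode in THINKING_MODES
        for probe in PROBE_ALIASES.values()
        for seed in SEEDS
    ]


cells = evaluation_grid()
per_mode = sum(1 for _, _, mode in cells if mode == "no")
print(per_mode, len(cells))  # 64 generations per thinking mode, 128 per variant
```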

Table 6: Enumeration probes used throughout all sweeps\. Bold entries mark the primary failure prompts for each model\. “Control” probes produce near\-zero baseline loop rates and are used to detect regressions introduced by a candidate intervention\.

## Appendix B Full Per\-Cell Enumeration Results

Table 7 shows the loop verdict for every \(model, variant, thinking mode, prompt, seed\) cell in the $8\text{-prompt} \times 8\text{-seed} \times 2\text{-mode}$ evaluation grid, as determined by the canonical loop detector applied uniformly across all 128 generations per model variant\. Glyph meanings: T = tight loop; L = soft loop \(list\-collapse\); ∙ = no loop\.

Table 7: Loop verdict for every \(model, variant, thinking mode, prompt, seed\) cell in the 8\-prompt × 8\-seed × 2\-mode evaluation grid, for both the unpatched baseline and the selected variant of each model\. Each cell reports the loop classification for a single generation: T = tight token\-period loop, L = numbered\-list collapse, ∙ = no loop\. The T column counts tight loops; soft counts list\-collapse loops; n/N is the total loop count over all 8 seeds for that prompt\. Selected variants: E4B uses strip\-L18\-K3; 31B uses strip\-ff\-K1000\-wonly\-K100; 26B uses bake\-mask\-v2\_top3 \(L21:E47, L21:E98, L22:E47\); E2B uses flipL10K1\-a\-0\.8 \+ ampL12K1\-a3\.0\.

| Setting | Prompt | s0 | s1 | s2 | s3 | s4 | s5 | s6 | s7 | T | soft | n/N |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **gemma\-4\-E2B\-it** | | | | | | | | | | | | |
| Baseline, think=no | constellations | L | L | L | L | L | L | L | ∙ | 0 | 7 | 7/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 7 | 7/64 |
| Baseline, think=yes | constellations | L | L | L | ∙ | L | ∙ | L | ∙ | 0 | 5 | 5/8 |
| | firefly | ∙ | L | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 1 | 1/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 6 | 6/64 |
| Selected, think=no | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 0 | 0/64 |
| Selected, think=yes | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 0 | 0/64 |
| **gemma\-4\-E4B\-it** | | | | | | | | | | | | |
| Baseline, think=no | constellations | ∙ | ∙ | L | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 1 | 1/8 |
| | firefly | L | L | L | L | ∙ | L | L | L | 0 | 7 | 7/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 8 | 8/64 |
| Baseline, think=yes | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | T | T | 2 | 0 | 2/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | L | L | L | L | 0 | 4 | 4/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 2 | 4 | 6/64 |
| Selected, think=no | constellations | ∙ | ∙ | L | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 1 | 1/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | L | ∙ | 0 | 1 | 1/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 2 | 2/64 |
| Selected, think=yes | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 0 | 0/64 |
| **gemma\-4\-31B\-it** | | | | | | | | | | | | |
| Baseline, think=no | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | T | ∙ | ∙ | ∙ | 1 | 0 | 1/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 1 | 0 | 1/64 |
| Baseline, think=yes | constellations | ∙ | L | L | L | ∙ | L | ∙ | L | 0 | 5 | 5/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | T | ∙ | T | T | 3 | 0 | 3/8 |
| | wire | ∙ | T | ∙ | T | L | T | ∙ | T | 4 | 1 | 5/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 7 | 6 | 13/64 |
| Selected, think=no | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | T | ∙ | ∙ | ∙ | ∙ | 1 | 0 | 1/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 1 | 0 | 1/64 |
| Selected, think=yes | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 0 | 0 | 0/64 |
| **gemma\-4\-26B\-A4B\-it** | | | | | | | | | | | | |
| Baseline, think=no | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | T | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 1 | 0 | 1/8 |
| | wire | ∙ | ∙ | ∙ | T | T | ∙ | T | ∙ | 3 | 0 | 3/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 4 | 0 | 4/64 |
| Baseline, think=yes | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | T | ∙ | ∙ | ∙ | ∙ | 1 | 0 | 1/8 |
| | wire | ∙ | T | T | T | T | T | ∙ | ∙ | 5 | 0 | 5/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 6 | 0 | 6/64 |
| Selected, think=no | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | T | ∙ | ∙ | 1 | 0 | 1/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 1 | 0 | 1/64 |
| Selected, think=yes | constellations | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | firefly | ∙ | L | ∙ | T | ∙ | ∙ | ∙ | ∙ | 1 | 1 | 2/8 |
| | wire | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | pokemon | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | us\_pres | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | mcu | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | eu\_states | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | noble\_gas | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | ∙ | 0 | 0 | 0/8 |
| | Total | | | | | | | | | 1 | 1 | 2/64 |

#### Note on the phrase\-repetition class \(P\)\.

Our loop detector includes a third check, namely phrase repetition, which fires when a verbatim text phrase recurs at the tail of the output without strict token\-level periodicity \(i\.e\. the rendered text repeats, but the surrounding token IDs do not form an exact period\)\. No cell in the reported canonical or long\-budget evaluations is classified as phrase repetition \(P does not appear in the table above\), but the class was observed in a small number of residual outputs from non\-selected variants\. A representative example is a residual output from one of our E2B variants on the constellations probe:

> … 69\. Volans Australis Septentrionalis Meridionalis Minor Minor Minor … Minor 70\. Volans Australis Septentrionalis Meridionalis Minor Minor Minor … Minor Major 71\. Volans Australis Septentrionalis Meridionalis Minor Minor Minor … Minor 72\. Volans Australis Septentrionalis Meridionalis Minor Minor Minor … Minor Major

The inner phrase repeats across lines, but the leading line number and the trailing Minor/Major token vary each cycle, so the tight detector \(which requires exact token\-ID periodicity\) does not fire; the phrase detector catches the repeated substring instead\.
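
The distinction between the tight and phrase checks can be illustrated with a minimal sketch\. This is not the paper's detector: the function names, the thresholds \(`max_period`, `min_repeats`, `phrase_words`, `tail_chars`\) and the toy token IDs are all assumptions chosen for illustration\.

```python
def find_tight_loop(token_ids, max_period=64, min_repeats=4):
    """Return the period if the tail repeats with exact token-ID periodicity, else None."""
    n = len(token_ids)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break
        tail = token_ids[n - span:]
        if all(tail[i] == tail[i % period] for i in range(span)):
            return period
    return None


def find_phrase_repeat(text, phrase_words=4, min_repeats=3, tail_chars=600):
    """Return a word n-gram that recurs in the rendered tail, else None."""
    words = text[-tail_chars:].split()
    counts = {}
    for i in range(len(words) - phrase_words + 1):
        gram = " ".join(words[i:i + phrase_words])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] >= min_repeats:
            return gram
    return None


# Toy analogue of the quoted output: the line number changes every cycle and a
# trailing token appears only on every second line.
lines = [f"{69 + k}. Volans Australis Septentrionalis Meridionalis Minor Minor"
         + (" Major" if k % 2 else "") for k in range(4)]
text = " ".join(lines)
# Hypothetical token IDs with the same structure (100 + k stands for the line number).
token_ids = [t for k in range(4)
             for t in [100 + k, 7, 8, 9, 10, 5, 5] + ([6] if k % 2 else [])]

print(find_tight_loop([1, 2, 3] * 5))   # 3: an exact period-3 loop
print(find_tight_loop(token_ids))       # None: the varying tokens break periodicity
print(find_phrase_repeat(text))         # Volans Australis Septentrionalis Meridionalis
```

The tight check fails on the second input because each line number occurs only once, so no period can align the tail; the phrase check works on rendered text and is indifferent to that variation\.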

## Appendix C Long\-Budget Doom\-Looping Outcomes

Table [8](https://arxiv.org/html/2606.13705#A3.T8) reports outcome shares for the two doom\-prone models \(26B and 31B\) on the three doom\-prone prompts \(wire\_episodes, firefly\_list, constellations\), pooled over both 4k and 8k budgets with think=yes\. Each row covers $8\,\text{seeds} \times 3\,\text{prompts} \times 2\,\text{budgets} = 48$ generations\. Columns: tight = tight loop running to budget; soft = soft loop running to budget; endless = budget exhausted without a verbatim lock and without a natural end\-of\-sequence token \(the model was still self\-correcting when the budget ran out\); doom = tight \+ soft \+ endless; loop\+nat = detector\-flagged as a loop but terminated naturally \(manual verification\); nat\-EOS = natural end\-of\-sequence completion\. E2B and E4B are excluded; both reach $\geq 46/48$ natural\-EOS in every long\-budget cell\.

Table 8: Doom\-looping outcome shares for 26B and 31B on the three doom\-prone prompts, pooled over 4k and 8k budgets \(n = 48 per row, think=yes\)\. See Section [7\.1](https://arxiv.org/html/2606.13705#S7.SS1) for discussion\. Note: 8 entries in the 31B baseline rep\_pen=1\.0 row are detector\-flagged soft loops on constellations that are in fact clean completions; the model writes an alphabetical\-coverage verification step in its scratchpad \(“W:\(none\), X:\(none\), Y:\(none\), Z:\(none\)”\) that the soft\-loop detector flags, but it still emits a complete 88\-line list and terminates naturally; one further such case appears in the 31B selected rep\_pen=1\.15 row\. All are counted in nat\-EOS rather than in the loop columns\.

## Appendix D Rust Code\-Generation Side\-Effect Check

The simplest way to reduce the enumeration loops described in Section [3](https://arxiv.org/html/2606.13705#S3) is at sampling time, by raising the repetition penalty \(rep\_pen\)\. This is effective on the surface, but repetition penalty is a blunt instrument: it penalizes *any* previously emitted token, including idiomatic Rust macros and types \(println\!, writeln\!, Result\) that a correct long program is expected to reuse\. We quantify whether the penalty values that suppress enumeration loops also corrupt unrelated code\-generation tasks, and whether this side effect depends on model size\. The selected E4B weight edit \(strip\-L18\-K3\) operates at rep\_pen=1\.0 and is therefore expected to bypass this tradeoff; the patched\-E4B rows in Table [9](https://arxiv.org/html/2606.13705#A4.T9) are the positive control confirming that it does not introduce a new corruption pathway\.

We use a single, deliberately long prompt, rust\_clone\_state: “Write a complete Rust TUI application that queries Ollama’s HTTP API and shows a real\-time dashboard with VRAM usage, in\-flight inferences, and KV\-cache stats\. Use crossterm for the TUI\. Use Arc<Mutex<AppState\>\> and clone the state across multiple threads \(one polling thread per metric\)\. Include the full main\.rs with all imports and the AppState definition\.” The expected output is 3,000–4,000 tokens of Rust, so the task exercises sustained idiom reuse across a long generation\.

For every configuration we run 30 seeds with temperature=0\.7, top\_p=0\.95, enable\_thinking=True, and max\_tokens=4096\. Each completion is classified by two independent signals:

- Stop reason: vLLM’s finish\_reason\. A value of stop means the model emitted the EOS token; length means the model exhausted the 4096\-token budget without stopping\. Used alone, finish\_reason=length is ambiguous: it may indicate either a coherent answer that exceeded the budget, or a model stuck in non\-terminating output\.
- Manual code review: for every length completion we inspect the tail of the generation\. A completion is marked *corrupted* if it contains fabricated macro names \(e\.g\. writable\!, nf\!, defer\!, stop\_workers\!\), fabricated identifiers \(SysOutMock, MockTerminalInfo, safe\_initialization\_trigger\), incorrectly cased standard items \(Duration::from\_Millis\), or other clearly non\-compiling Rust\. Completions whose tails exhibit coherent, idiomatic Rust that simply did not reach EOS within the budget are marked *clean\-truncated*\.

A completion is counted as a pass if finish\_reason=stop *or* the tail review marks it clean\-truncated; it is counted as a failure only if it is corrupted\.
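
The pass rule above can be written down as a small function\. This is a minimal sketch: the `Completion` record and the `tail_review` label strings are hypothetical names introduced here, and the review label is assumed to be entered by a human reviewer rather than computed\.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completion:
    finish_reason: str                 # "stop" or "length", as reported by the engine
    tail_review: Optional[str] = None  # "clean-truncated" or "corrupted" (manual label)


def is_pass(c: Completion) -> bool:
    """Pass if the model stopped naturally, or if a truncated tail was reviewed as clean."""
    if c.finish_reason == "stop":
        return True
    if c.finish_reason == "length":
        if c.tail_review is None:
            raise ValueError("length completions need a manual tail review")
        return c.tail_review == "clean-truncated"
    raise ValueError(f"unexpected finish_reason: {c.finish_reason!r}")


runs = [
    Completion("stop"),
    Completion("length", "clean-truncated"),
    Completion("length", "corrupted"),
]
print(sum(is_pass(c) for c in runs), "of", len(runs), "pass")  # 2 of 3 pass
```

Raising on an unreviewed `length` completion is a design choice of this sketch; it keeps the ambiguous case from being silently counted either way\.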

#### Results\.

We evaluate the E4B baseline \(gemma\-4\-E4B\-it\) and the 31B baseline \(gemma\-4\-31B\-it\) at three repetition penalty values each\. Because 31B exhibits no corruption at any penalty value, no patched\-31B positive control is warranted; the positive control rows test only the selected E4B weight edit \(strip\-L18\-K3\) at the two inference\-time values we recommend\. All counts are per configuration, $N=30$ \(Table [9](https://arxiv.org/html/2606.13705#A4.T9)\)\.

Table 9: Rust code\-generation outcomes at varying repetition penalties\. *Stop* = model emitted EOS naturally; *Length* = model exhausted the 4096\-token budget\. *Corrupted* counts length\-truncated completions confirmed by manual review to contain fabricated macros, fabricated identifiers, or non\-compiling syntax \($\geq 20$ at E4B rep\_pen=1\.30, with 2 additional borderline cases excluded\)\. *Pass* is the strict count: stop or clean\-truncated\. The 31B rows show that the corruption pathway is model\-specific: at all three penalties 31B produces clean, complete Rust\. The bottom two rows confirm that the selected weight edit does not introduce the corruption pathway on E4B\.

The 31B baseline rows all terminate cleanly at every penalty value \(30/30 stop, zero corrupted\)\. The picture for E4B is more nuanced\. The single length cases at E4B rep\_pen=1\.15 \(baseline row 2\) and at rep\_pen=1\.00 \(selected variant row 1\) are both clean\-truncated: the model was writing idiomatic Rust and exhausted the token budget before emitting EOS, so both count as passes\.

At rep\_pen=1\.30 the picture is qualitatively different for the baseline E4B model\. Of the 22 length\-truncated completions:

- 1 was caught by the strict periodic\-loop heuristic during data collection \(seed 15\)\.
- A further 15 contain fabricated macro names or identifiers detectable by a tail regex: writable\!, nf\!, defer\!, stop\_workers\!, SysOutMock, MockTerminalInfo, safe\_initialization\_trigger, Duration::from\_Millis, application\_printer, crashed\!, uxtheme\!\.
- Manual review of the remaining 6 yields 4 additional cases with non\-obvious fabrications \(invented APIs such as Block::fixed\_area, SubArea::with\_vertical\_chunks, simulate\_background\_polling, libc::\_IO\_FILE\_, event::take\(\)\) and 2 cases that appear clean\-truncated in the visible tail\.

This gives at least 20 corrupted and 2 ambiguous cases\. The pass count of 8/30 is the strict count, treating every length\-truncated completion as a failure; counting the 2 borderline cases as passes yields 10/30\. The qualitative conclusion is the same under either convention\.
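
A tail regex of the kind mentioned above, together with the tally, might look as follows\. This is a sketch only: the pattern is assembled from the names listed in the text, while the function name `tail_is_flagged` and the 600\-character tail window are assumptions \(the window used by the authors' regex is not stated\)\.

```python
import re

# Fabricated names quoted in the list above. Real idioms such as writeln! or
# Duration::from_millis must not match, so the pattern is case-sensitive and
# anchored on a leading word boundary.
FABRICATED = re.compile(
    r"\b(?:writable!|nf!|defer!|stop_workers!|crashed!|uxtheme!"
    r"|SysOutMock\b|MockTerminalInfo\b|safe_initialization_trigger\b"
    r"|Duration::from_Millis\b|application_printer\b)"
)


def tail_is_flagged(completion: str, tail_chars: int = 600) -> bool:
    """True if the tail of a completion contains one of the known fabricated names."""
    return FABRICATED.search(completion[-tail_chars:]) is not None


print(tail_is_flagged("writeln!(out, \"{}\", Duration::from_millis(5).as_secs())"))  # False
print(tail_is_flagged("fn query_termios(ts: &mut TcpStream) -> Result<MockTerminalInfo>;"))  # True

# Tally for E4B baseline at rep_pen=1.30, using the counts reported in the text.
N = 30
length_truncated = 22
loop_heuristic, regex_hits, manual_fabrications, borderline = 1, 15, 4, 2
assert loop_heuristic + regex_hits + manual_fabrications + borderline == length_truncated

corrupted = loop_heuristic + regex_hits + manual_fabrications  # at least 20
strict_pass = N - length_truncated                             # 8 of 30
lenient_pass = strict_pass + borderline                        # 10 of 30
print(corrupted, strict_pass, lenient_pass)                    # 20 8 10
```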

#### Sample completions\.

We illustrate the qualitative difference with the final 600 characters of two representative completions on the same rust\_clone\_state prompt\.

Clean completion \(E4B baseline, rep\_pen=1\.00, seed 27, finish\_reason=stop, total 2,499 tokens\):

> …eld, and then releases the lock\. It checks the is\_running flag to exit gracefully\. 4\.run\_tui: The main thread handles the user interface\. It runs in a loop, waiting for input \(crossterm::event::poll\)\. Crucially,*before*every redraw, it locks the Mutex to get the latest metrics, draws the UI, and then releases the lock\. 5\.Graceful Shutdown: When the user presses ‘q’, the main thread sets state\.is\_running = false\. The polling thread detects this flag on its next loop iteration and exits cleanly\.

All APIs referenced \(crossterm::event::poll, Mutex::lock\) are real, the architecture is internally consistent, and the model emitted EOS naturally\.

Corrupted completion \(E4B baseline, rep\_pen=1\.30, seed 12, finish\_reason=length, total 4,096 tokens\):

> …tern crate num\_cpus; external\_crate::protocol::\*; enum protocol \{\} trait ProtocolTrait \{\} impl ProtocolTrait for protocol \{\} module tiledb \{ pub trait QueryTermiosMut \{ fn ?\(self\)&\(mut TcpStream\) \-\> Result<\(\), AnyHow\>\(\); \}; fn query\_termios\(ts: &mut TcpStream\) \-\> Result<MockTerminalInfo\>; \} pub struct MockTerminalInfo

This is not valid Rust: external\_crate is not a Rust keyword, module should be mod, fn ?\(self\)&\(mut TcpStream\) is not valid function syntax, and MockTerminalInfo and tiledb are fabricated identifiers\. At this point the model is in a degenerate loop in which each successive attempt to correct the code is itself penalized, causing a drift onto a new fabricated alternative\.

#### Why the corruption appears at rep\_pen=1\.30 on E4B but not at 1\.15 and not on 31B\.

We argue that the key quantity is the logit margin by which a common idiom exceeds its nearest alternative at each emission point; how large this margin is depends on the model’s code prior\. On E4B, the idioms writeln\!, println\!, and Result sit only 2–3 nats above alternatives\. On 31B, a larger parameter count and a stronger code distribution place the same idioms 5–10 nats above alternatives\. Repetition penalty applies a multiplicative $1/\texttt{rep\_pen}$ to a token’s logit each time it has appeared in the context\. After five emissions of writeln\!, the cumulative penalty is $5\ln(1.30) \approx 1.3$ nats: on 31B this remains well below the 5–10\-nat margin and the idiomatic token stays at the top of the distribution; on E4B it is sufficient to push the token out of the top\_p=0\.95 sampling set\. The model then falls onto a fabricated alternative \(writable\!\), which suffers the same penalty after a few uses, producing the chain of fabrications visible in the tails\. At rep\_pen=1\.15, the equivalent cumulative penalty is $5\ln(1.15) \approx 0.7$ nats, comfortably below even E4B’s 2–3\-nat margin, and the idiomatic tokens remain in the sampling set on both models\.
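
The two cumulative\-penalty figures can be reproduced with a few lines of arithmetic\. The sketch follows the paragraph's own accounting, in which each earlier emission of a token costs $\ln(\texttt{rep\_pen})$ nats; the function name is hypothetical, and how a particular inference engine actually applies the penalty is not modelled here\.

```python
import math


def cumulative_penalty_nats(rep_pen: float, occurrences: int) -> float:
    """Total penalty in nats after `occurrences` earlier emissions of a token,
    under the per-occurrence accounting used in the paragraph above."""
    return occurrences * math.log(rep_pen)


for rep_pen in (1.15, 1.30):
    penalty = cumulative_penalty_nats(rep_pen, 5)
    print(f"rep_pen={rep_pen:.2f}: {penalty:.2f} nats after 5 emissions")
# rep_pen=1.15: 0.70 nats after 5 emissions
# rep_pen=1.30: 1.31 nats after 5 emissions
```

These values are the quantities the text compares against the quoted margins of 2–3 nats \(E4B\) and 5–10 nats \(31B\)\.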

The primary finding of this experiment is that repetition penalty carries a real, model\-dependent side effect on code generation: at rep\_pen=1\.30, E4B silently corrupts long\-form Rust output even while suppressing enumeration loops, whereas 31B, which has a stronger code prior, is unaffected at all three penalty values\. The apparent safety of rep\_pen=1\.15 for E4B should be treated with caution: this result comes from a single prompt, and we did not test settings where idiomatic tokens carry smaller logit margins\. A lower corruption threshold may exist for some prompts or models, and we cannot rule it out on the basis of this experiment alone\.

The selected weight edit strip\-L18\-K3 sidesteps this tradeoff entirely: it achieves loop elimination on E4B at rep\_pen=1\.0, and the bottom rows of Table [9](https://arxiv.org/html/2606.13705#A4.T9) confirm it does not introduce a new corruption pathway at either rep\_pen=1\.00 or 1\.15\. Among the values tested here, rep\_pen=1\.15 is the safest broadly applicable choice in deployments that do not use the weight edit; rep\_pen=1\.30 is safe only in 31B\-only settings\.

## Appendix E E2B Intervention

This appendix gives the detailed evidence behind the E2B intervention\-layer choice noted in Section [4](https://arxiv.org/html/2606.13705#S4)\.

On the constellations prompt, the E2B loop repeats one word: at the loop position the model predicts “ Scor” \(the start of Scorpius\) with probability 0\.99997\. To find where this word is produced, we zero each layer’s attention or MLP output in turn and measure how much the probability of that word drops \(Figure [9](https://arxiv.org/html/2606.13705#A5.F9)\)\. The word is produced by three consecutive layers: zeroing the L13 MLP drops the probability from 0\.99997 to 0\.017, the L14 attention to almost zero, and the L15 MLP to 0\.002; no other layer matters\. For E4B and 31B, this same test points directly at the layer we edit \(Table [1](https://arxiv.org/html/2606.13705#S4.T1)\)\.
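
The per\-layer test described above can be sketched on a toy residual network\. Everything in this block is an assumption made for illustration: the random weights, the sizes, the `tanh` stand\-ins for attention and MLP blocks, and the function names\. It shows only the mechanics of zeroing one component's output and re\-measuring a token probability, not the Gemma architecture\.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS, VOCAB = 16, 6, 10

# Hypothetical toy weights: one "attn" and one "mlp" map per layer.
W = {(kind, layer): rng.normal(scale=0.5, size=(D_MODEL, D_MODEL))
     for layer in range(N_LAYERS) for kind in ("attn", "mlp")}
W_unembed = rng.normal(size=(VOCAB, D_MODEL))
x0 = rng.normal(size=D_MODEL)


def token_prob(target, ablate=None):
    """Probability of `target` at one position, optionally zeroing one component."""
    x = x0.copy()
    for layer in range(N_LAYERS):
        for kind in ("attn", "mlp"):
            out = np.tanh(W[(kind, layer)] @ x)
            if ablate == (kind, layer):
                out = np.zeros_like(out)   # zero-ablation of this component's output
            x = x + out                    # residual stream update
    logits = W_unembed @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[target])


def ablation_sweep(target):
    """Baseline probability and the change caused by zeroing each component in turn."""
    base = token_prob(target)
    return base, {key: token_prob(target, ablate=key) - base for key in W}


target = max(range(VOCAB), key=token_prob)   # toy stand-in for the loop word
base, deltas = ablation_sweep(target)
for (kind, layer), delta in sorted(deltas.items(), key=lambda kv: kv[1])[:3]:
    print(f"L{layer} {kind}: {delta:+.4f} (baseline {base:.4f})")
```

Because later components read the modified residual stream, each ablation reruns the full forward pass rather than subtracting a cached contribution\.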

E2B is the exception\. The neurons we edit are at L10/L12, which come a few layers before the three layers that produce the word \(L13–L15\), and there the same test shows almost no effect\. We did not choose L10 from this test; we chose it by trying edits at every layer from L8 to L20 and measuring the loop rate over full generations\. Stripping pro\-loop neurons at one of the later layers \(for example L15\) is actually worse than doing nothing: it just makes the model commit to a different constellation\. L10 is the only layer, in our experiments, where editing a few neurons reliably stops the loop without hurting benchmark scores\.

The reason is that a single L10 neuron has only a tiny effect at any one position: zeroing neuron 3513 at the loop position changes the probability of the loop word by less than $10^{-4}$, which is why the single\-position test above does not flag it\. Its influence is cumulative instead: the neuron’s small push toward the loop word, repeated over the hundreds of tokens the model generates, is what gradually steers the output into the repeated\-constellation answer, and this only becomes visible when we generate the full response and check whether the loop forms\. This is also why the selected edit reverses the neuron’s sign rather than simply zeroing it \(Section [5\.1](https://arxiv.org/html/2606.13705#S5.SS1)\): zeroing removes the push but lets the model reroute to the same answer through other neurons, whereas reversing the sign actively steers away from it\. In short, the layers where the loop word appears \(L13–L15\) are not the layer where it is best prevented \(L10\), similar to the finding of Hase et al\. \([2023](https://arxiv.org/html/2606.13705#bib.bib7)\) that the place where information is stored in a model need not be the best place to edit it\. We see an analogous gap in 26B, where the loop signal peaks at different layers than the experts we mask \(Figure [1](https://arxiv.org/html/2606.13705#S4.F1)\)\.
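The mechanical difference between zeroing a neuron and reversing its sign can be shown on a toy MLP output projection. This is a sketch under assumptions: applying the edit to the neuron's output weights is an illustrative choice, and the three-neuron layer, its weights, and its activations are hypothetical. It shows only what each edit does to the neuron's contribution, not the rerouting behavior, which is an empirical observation about the full model.

```python
def mlp_output(activations, w_out):
    """Sum of per-neuron contributions: out[d] = sum_i activations[i] * w_out[i][d]."""
    dim = len(w_out[0])
    return [sum(a * row[d] for a, row in zip(activations, w_out)) for d in range(dim)]


def zero_neuron(w_out, idx):
    """Copy of `w_out` in which neuron `idx` no longer writes anything."""
    edited = [list(row) for row in w_out]
    edited[idx] = [0.0] * len(edited[idx])
    return edited


def invert_neuron(w_out, idx):
    """Copy of `w_out` in which neuron `idx` writes the negative of its original output."""
    edited = [list(row) for row in w_out]
    edited[idx] = [-v for v in edited[idx]]
    return edited


# Hypothetical 3-neuron layer writing into a 2-d residual stream.
# Residual dimension 0 stands in for the "toward the loop word" direction;
# neuron 0 is the pro-loop neuron.
w_out = [[0.5, 0.0], [0.25, 1.0], [-0.25, 0.5]]
acts = [2.0, 1.0, 1.0]

print("baseline:", mlp_output(acts, w_out))
print("zeroed:  ", mlp_output(acts, zero_neuron(w_out, 0)))
print("inverted:", mlp_output(acts, invert_neuron(w_out, 0)))
```

Zeroing removes the neuron's push along dimension 0 and leaves the layer neutral in that direction. Inverting turns the same push into an equal one in the opposite direction, so the layer output now points away from the loop word.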

![Refer to caption](https://arxiv.org/html/2606.13705v1/images/main/04_E2B_layer_ablation.png)

Figure 9: Per\-layer component zero\-ablation for E2B on the constellations loop \(loop word “ Scor”, baseline probability 0\.99997\)\. Each bar shows how much the probability of the loop word changes when a layer’s attention \(red\) or MLP \(green\) output is zeroed\. The word is produced by three consecutive layers \(L13–L15\); the layers we actually edit, L10/L12, show almost no effect here\.

## Appendix F Hugging Face Model Revisions

For reproducibility, Table [10](https://arxiv.org/html/2606.13705#A6.T10) lists the exact Hugging Face commits of the four base instruction\-tuned Gemma\-4 models used throughout this paper\. All four are loaded as AutoModelForCausalLM\.from\_pretrained\(model\_id, revision=<sha\>\)\.

Table 10: Pinned Hugging Face revisions of the four base Gemma\-4 instruction\-tuned models\. These are the commits listed in the refs/main pointer of each repository, and they were used for the long\-budget re\-sweeps reported in Section [7](https://arxiv.org/html/2606.13705#S7)\. The earlier loop\-attribution and intervention sweeps \(Sections 4–5, mid\-late May\) ran against a prior revision of each model in which the safetensors weight blobs are byte\-identical; only chat\_template\.jinja differed\. The earlier revisions are 3555bddc93a623db8887dd2e52123facc45ade77 \(E4B\), 462a98a12e28e2cbcfccaf78fe41e3e50235e6ae \(26B\-A4B\), and ba74f5b6c647c0911554e50278d6f6f4477f9010 \(31B\); E2B never received a chat\-template\-only update during our experiment window\. The upstream diff between the old and new templates is a purely additive 9\-line branch that emits multimodal placeholders inside tool\-response blocks, which is never entered by our text\-only enumeration probes\. All sweeps load weights and tokenizer files from a local on\-disk Hugging Face cache; the vLLM serving containers bind\-mount the same cache, so no run reads from the live Hub\. Loading either revision per model therefore yields byte\-identical parameter tensors and behaviorally indistinguishable prompts for the inputs used in this paper\.
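A minimal sketch of pinned, cache-only loading with the `transformers` library follows. The helper names `is_full_commit_sha` and `load_pinned` are hypothetical, and passing `local_files_only=True` is an assumption about how one might reproduce the "no run reads from the live Hub" setup, not a description of the actual scripts. The model ID and SHA are left as arguments to be filled in from Table 10.

```python
import re

_SHA_RE = re.compile(r"[0-9a-f]{40}")


def is_full_commit_sha(revision):
    """True if `revision` is a full 40-hex-digit commit hash rather than a branch or tag."""
    return bool(_SHA_RE.fullmatch(revision))


def load_pinned(model_id, revision, local_only=True):
    """Load a causal LM and its tokenizer at an exact Hub commit.

    With `local_only=True` the loaders read from the on-disk cache only and
    fail instead of contacting the Hub if the files are missing.
    """
    if not is_full_commit_sha(revision):
        raise ValueError(f"expected a 40-character commit SHA, got {revision!r}")
    # Imported lazily so the validation helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_id, revision=revision, local_files_only=local_only
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, revision=revision, local_files_only=local_only
    )
    return model, tokenizer
```

Rejecting branch names such as `main` up front guards against silently picking up a later commit, such as the chat-template-only updates described above.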
