Low Perplexity is Repetition: A One-Dimensional Self-Conditioning Attractor in Continuous Diffusion LMs

arXiv cs.CL Papers

Summary

This paper reveals that the low generative perplexity (Gen-PPL) reported by continuous diffusion language models like ELF is misleading, as it rewards repetition; the authors identify a one-dimensional attractor in the self-conditioning loop as the cause and propose ACE, a simple fix that subtracts this direction to reduce repetition without sacrificing quality.

arXiv:2607.00588v1 Announce Type: new Abstract: Continuous diffusion language models such as ELF report record-low generative perplexity (Gen-PPL). We find a catch: these models repeat far more than human text, and Gen-PPL rewards rather than penalizes that repetition, so its low scores overstate quality. Strip the repetition and ELF-B's Gen-PPL rises from $19.5$ to $27.7$; the smallest model even posts the best Gen-PPL because it repeats most. We trace the repetition to its source: a contractive attractor along a \emph{single direction} in the self-conditioning feedback loop, the loop that feeds each step's clean estimate into the next. Because the failure is one-dimensional, a one-dimensional fix suffices, and we propose one. \textbf{ACE} (Attractor-Contrast-Escape) subtracts that single, label-free direction from the feedback at each step. Estimated once on the $105$M model, the direction cuts repetition to near the human level while keeping quality competitive, and transfers near-unchanged to the $342$M and $652$M models and across samplers; the same recipe recovers useful directions on other architectures. Since Gen-PPL itself rewards repetition, we instead measure the compute each fix needs to produce human-clean text, where ACE is $1.5$--$5\times$ cheaper.

Cached at: 07/02/26, 05:37 AM

# Low Perplexity is Repetition: A One-Dimensional Self-Conditioning Attractor in Continuous Diffusion LMs
Source: [https://arxiv.org/html/2607.00588](https://arxiv.org/html/2607.00588)
Shuai Zhang$^{1,2}$, Zijie Chen$^{2}$, Hongliang He$^{2}$, Lun Du$^{3,\dagger}$, Zhenzhong Lan$^{2,\dagger}$

$^{1}$Zhejiang University, $^{2}$Westlake University, $^{3}$Ant Group

zhangshuai@westlake.edu.cn, lanzhenzhong@westlake.edu.cn

$^{\dagger}$Corresponding authors

###### Abstract

Continuous diffusion language models such as ELF report record-low generative perplexity (Gen-PPL). We find a catch: these models repeat far more than human text, and Gen-PPL rewards rather than penalizes that repetition, so its low scores overstate quality. Strip the repetition and ELF-B’s Gen-PPL rises from $19.5$ to $27.7$; the smallest model even posts the best Gen-PPL because it repeats most. We trace the repetition to its source: a contractive attractor along a *single direction* in the self-conditioning feedback loop, the loop that feeds each step’s clean estimate into the next. Because the failure is one-dimensional, a one-dimensional fix suffices, and we propose one. **ACE** (Attractor-Contrast-Escape) subtracts that single, label-free direction from the feedback at each step. Estimated once on the $105$M model, the direction cuts repetition to near the human level while keeping quality competitive, and transfers near-unchanged to the $342$M and $652$M models and across samplers; the same recipe recovers useful directions on other architectures. Since Gen-PPL itself rewards repetition, we instead measure the compute each fix needs to produce human-clean text, where ACE is $1.5$–$5\times$ cheaper.

## 1 Introduction

Continuous diffusion language models (DLMs) are a promising non-autoregressive route to text generation: they denoise a whole sequence in parallel within a differentiable embedding space, steerable by gradients and guidance. Self-conditioning (Chen et al., [2022](https://arxiv.org/html/2607.00588#bib.bib23)) improves their sample quality by feeding the model’s own clean estimate back into each step to refine the next. Recent models such as ELF (Hu et al., [2026](https://arxiv.org/html/2607.00588#bib.bib9)) report low generative perplexity (Gen-PPL), the number the field reads as generation quality. We find that this headline hides a defect: ELF’s samples repeat far more than human text, and Gen-PPL *rewards* the repetition instead of penalizing it. Strip the repetition and ELF-B’s Gen-PPL rises from $19.5$ to $27.7$, enough for the larger ELF-M to overtake it; the smallest model posts the best Gen-PPL only because it repeats the most (Table [4](https://arxiv.org/html/2607.00588#footnote4)).

The defect is heavy and systematic. A large share of ELF samples lock onto a few repeated $4$-grams and loop them for hundreds of words (Table [17](https://arxiv.org/html/2607.00588#A7.T17)), which human text essentially never does. The link holds not only across models but within one: at a fixed setting, sample-level repetition correlates with the GPT-2 (Radford et al., [2019](https://arxiv.org/html/2607.00588#bib.bib43)) PPL the samples are scored by (Table [16](https://arxiv.org/html/2607.00588#A6.T16)). The defect stays hidden because the certifying metric is blind to it: repeated text is highly probable under the scorer, so it earns a flatteringly low Gen-PPL, analogous to likelihood-based degeneration in autoregressive generation (Holtzman et al., [2020](https://arxiv.org/html/2607.00588#bib.bib14); Welleck et al., [2020](https://arxiv.org/html/2607.00588#bib.bib13)).

![Refer to caption](https://arxiv.org/html/2607.00588v1/x1.png)

Figure 1: **Repetition is a basin; ACE escapes it.** Even as ELF denoises toward a clean sample, self-conditioning drags its representation $\bm{u}$ along one direction $\bm{d}$ (the high- minus low-repetition gap) into a repetition state $\bm{u}^{\star}$; the baseline slides into this basin (red), while ACE subtracts $\bm{d}$ to hold $\bm{u}$ in the human-clean zone (blue). Background: measured repetition rate.

We trace the defect to its mechanism rather than stop at the symptom. Like audio feedback, this self-conditioning loop settles on whatever is most self-predictable, which is repeated content. Two probes pin this down. First, turning the feedback strength up, with nothing else changed, drives repetition up and Gen-PPL down together (§[3](https://arxiv.org/html/2607.00588#S3)): the loop *creates* the repetition the metric then rewards. Second, when the loop is linearized, its Jacobian has a single slowest-contracting mode, so the repeated state is a one-dimensional *contractive attractor* along one direction $\bm{d}$ (Fig. [1](https://arxiv.org/html/2607.00588#S1.F1); §[4](https://arxiv.org/html/2607.00588#S4)), a basin that sharper sampling only deepens. This is specific to self-conditioned continuous DLMs (ELF, Plaid (Gulrajani and Hashimoto, [2023](https://arxiv.org/html/2607.00588#bib.bib36))), and its one-dimensional geometry is exactly what makes a one-dimensional fix possible.

Because the attractor is one-dimensional, one direction is enough to escape it. ACE (Attractor-Contrast-Escape; code: [https://github.com/ZhangShuai1230/ACE-DLM](https://github.com/ZhangShuai1230/ACE-DLM)) subtracts that single direction $\bm{d}$ from the fed-back estimate at every step. The direction is recovered label-free, as the difference of means of the feedback between trajectories trapped in the basin (top-repetition tertile) and those that stay free (bottom tertile): no per-token labels, no auxiliary model, no retraining. Crucially, ACE acts where the defect is born, on the self-conditioning feedback rather than at token selection: repetition is set in the continuous latent, upstream of that selection, so decode-time fixes are poorly placed to reach it. A single frozen direction, estimated once on the smallest model and applied within a closed-form usable window (§[4](https://arxiv.org/html/2607.00588#S4)), cuts repetition to near the human level at competitive quality and transfers near-unchanged across inference knobs and model sizes (cosine $0.82$–$0.96$ to the per-config re-estimate); the same recipe recovers useful directions on other architectures (Plaid, LangFlow (Chen et al., [2026](https://arxiv.org/html/2607.00588#bib.bib35))).

Evaluating the fix needs care, since Gen-PPL is fooled by the very repetition ACE removes. We therefore *accept* text under a human repetition bar, read quality on the accepted set with standard reference-free metrics (§[3](https://arxiv.org/html/2607.00588#S3)), and measure the compute needed to reach genuinely non-repetitive text (§[5](https://arxiv.org/html/2607.00588#S5)); under this evaluation ACE makes human-clean text $1.5$–$5\times$ cheaper at competitive quality.

#### Contributions\.

1. **Gen-PPL rewards repetition.** Continuous DLMs repeat far more than human text; we show that Gen-PPL, the field’s headline metric, *rewards* rather than penalizes this and even reverses the model ranking, and we propose a defect-controlled evaluation that accepts text under a human-repetition bar and scores compute-to-clean instead of Gen-PPL (§[3](https://arxiv.org/html/2607.00588#S3), §[5](https://arxiv.org/html/2607.00588#S5)).
2. **Its mechanism: an effectively one-dimensional attractor.** By direct ablation and a linear-stability analysis we trace the repetition to an effectively one-dimensional contractive attractor of the self-conditioning loop along one direction $\bm{d}$ (§[4](https://arxiv.org/html/2607.00588#S4)).
3. **Its fix: one frozen direction (ACE).** A single cheap, label-free, frozen direction, applied within a closed-form usable steering window, removes most repetition and transfers across knobs and scales, with the same recipe recovering useful directions on other architectures; under our evaluation it reaches human-clean text at comparable quality and $1.5$–$5\times$ lower cost (§[5](https://arxiv.org/html/2607.00588#S5), §[6](https://arxiv.org/html/2607.00588#S6)).

#### Relation to prior work\.

Prior work exposes metric pathologies in (diffusion-)LM evaluation (Zheng et al., [2025](https://arxiv.org/html/2607.00588#bib.bib21); Wang et al., [2022](https://arxiv.org/html/2607.00588#bib.bib22); Franca and Tong, [2026](https://arxiv.org/html/2607.00588#bib.bib48)) or studies discretization and decoding bottlenecks (Li et al., [2022](https://arxiv.org/html/2607.00588#bib.bib24); Dieleman et al., [2022](https://arxiv.org/html/2607.00588#bib.bib25)); autoregressive-degeneration work studies repetition along token-time (Holtzman et al., [2020](https://arxiv.org/html/2607.00588#bib.bib14); Welleck et al., [2020](https://arxiv.org/html/2607.00588#bib.bib13)); and steering work shows low-dimensional interventions can control diffusion-LM behavior (Shnaidman et al., [2025](https://arxiv.org/html/2607.00588#bib.bib39)). We connect these lines: we surface a text-visible repetition defect in self-conditioned continuous DLMs, trace it to the self-conditioning feedback loop, and remove it with a single label-free feedback-direction intervention (full discussion in App. [H](https://arxiv.org/html/2607.00588#A8)).

## 2 Background and metrics

### 2.1 ELF, self-conditioning, and the two samplers

ELF (Hu et al., [2026](https://arxiv.org/html/2607.00588#bib.bib9)) is a continuous-embedding flow-matching language model. Generation starts from Gaussian noise $\bm{z}_0$ and follows a trajectory $t\in[0,1]$ from noise ($t{=}0$) to the clean text embedding ($t{=}1$), which is read out to token ids by an independent per-position $\arg\max$. Two samplers are used: ODE (deterministic Euler integration) and SDE (Euler steps interleaved with partial noise re-injection, governed by the sampler’s rate parameter $\gamma$, which we call the *churn* after the analogous stochastic-sampler knob of Karras et al. ([2022](https://arxiv.org/html/2607.00588#bib.bib42))).

Self-conditioning (Chen et al., [2022](https://arxiv.org/html/2607.00588#bib.bib23)) is a refinement trick for diffusion sampling: instead of predicting the clean data $\hat{\bm{x}}$ from the noised input alone, each step feeds back its previous estimate as an extra input and refines it. It adds no extra forward pass, improves sample quality, and is widely used in continuous DLMs (ELF, Plaid (Gulrajani and Hashimoto, [2023](https://arxiv.org/html/2607.00588#bib.bib36)), LangFlow (Chen et al., [2026](https://arxiv.org/html/2607.00588#bib.bib35))). Formally it turns the denoiser into a recurrence driven toward a fixed point, the view our analysis builds on (§[4](https://arxiv.org/html/2607.00588#S4)).

ELF reuses this same channel to distill classifier-free guidance: instead of two forward passes per step, it feeds the previous estimate $\hat{\bm{x}}_{\text{prev}}$ back together with a scalar SC-CFG scale $w$, and a single pass produces the guided velocity. At each step $i$,

$$\bm{v}_i = f_\theta\big(\bm{z}_i,\, t_i,\, w,\, \hat{\bm{x}}_{\text{prev}}\big), \qquad \hat{\bm{x}}_{\text{prev}} \leftarrow \hat{\bm{x}}_i. \tag{1}$$

This loop propagates the model’s commitment across later steps and is central to the repetition defect.

### 2.2 The reported metrics

*Generative perplexity* (Gen-PPL) under GPT-2 Large is the standard metric for unconditional DLM evaluation. For a generated text $\bm{x}{=}(x_1,\dots,x_N)$,

$$\textsc{PPL}(\bm{x}) = \exp\!\Big(\!-\tfrac{1}{N}\textstyle\sum_{i=1}^{N}\log p_{\text{GPT-2}}(x_i \mid x_{<i})\Big), \tag{2}$$

aggregated at corpus level. Unigram entropy is reported as a check against trivial collapse. Crucially, no $n$-gram repetition metric is reported in the original ELF paper or in most DLM benchmarks: the gap this paper fills.
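To make Eq. (2) concrete, the sketch below computes perplexity from per-token log-probabilities. It is a minimal illustration only: `perplexity` is a hypothetical helper name, and the probabilities are made-up numbers rather than GPT-2 outputs. It shows why a scorer that finds looped text easy to predict returns a lower value.

```python
import math


def perplexity(token_logprobs):
    """Eq. (2): exp of the negative mean per-token natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up per-token probabilities for illustration only (not GPT-2 outputs).
varied_text = [math.log(0.05)] * 8
# Same opening, then a loop that the scorer finds easy to predict.
looping_text = [math.log(0.05)] * 2 + [math.log(0.9)] * 6

ppl_varied = perplexity(varied_text)
ppl_looping = perplexity(looping_text)
print(f"varied:  {ppl_varied:.2f}")
print(f"looping: {ppl_looping:.2f}")
```

In this toy setting the looping sequence scores lower purely because its later tokens are predictable, which is the effect the section describes.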

### 2.3 The repetition metric

We use the standard $4$-gram self-repetition rate, the fraction of $4$-gram occurrences in a text that duplicate an earlier one:

$$\textsc{rep}_4(\bm{x}) = \frac{\sum_{g\in\mathcal{G}} \max(c_{\bm{x}}(g)-1,\,0)}{|\mathcal{G}|}, \tag{3}$$

with $\mathcal{G}$ the multiset of $4$-grams and $c_{\bm{x}}(g)$ its counts. This is identical to seq-rep-4 (Welleck et al., [2020](https://arxiv.org/html/2607.00588#bib.bib13)) (the same duplicate $n$-gram family, complementary to diversity metrics (Su et al., [2022](https://arxiv.org/html/2607.00588#bib.bib31); Li et al., [2016](https://arxiv.org/html/2607.00588#bib.bib33))); we tokenize by whitespace and report the median over samples, with the human-clean acceptance threshold calibrated to human text rather than an arbitrary constant (§[3](https://arxiv.org/html/2607.00588#S3)).
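A minimal sketch of Eq. (3), assuming the numerator sums over distinct $4$-grams so that the result is the fraction of occurrences duplicating an earlier one, as the prose states. The function names `seq_rep_n` and `median_rep` are hypothetical, and treating texts shorter than $n$ tokens as non-repetitive is an assumption of this sketch.

```python
from collections import Counter
from statistics import median


def seq_rep_n(text, n=4):
    """Fraction of n-gram occurrences that duplicate an earlier one (Eq. 3)."""
    tokens = text.split()  # whitespace tokenization, as stated in the text
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # assumption: too-short texts count as non-repetitive
    counts = Counter(ngrams)
    # Every count is >= 1, so c - 1 equals max(c - 1, 0).
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)


def median_rep(texts, n=4):
    """Median seq-rep-n over a collection of samples."""
    return median(seq_rep_n(t, n) for t in texts)


print(seq_rep_n("a b c d e f g h"))
print(seq_rep_n("a b c d a b c d"))
```

For the second string there are five $4$-gram occurrences and one of them repeats an earlier one, so the rate is one fifth.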

## 3 The repetition defect: Gen-PPL ranks ELF backwards

ELF’s record-low generative perplexity comes largely from repetition that the metric rewards rather than penalizes. We study the ELF series (Hu et al., [2026](https://arxiv.org/html/2607.00588#bib.bib9)) of unconditional OpenWebText (Gokaslan and Cohen, [2019](https://arxiv.org/html/2607.00588#bib.bib44)) models at the $64$-step, $\gamma{=}1.0$ operating point (SC-CFG $w{=}3$), measuring the defect metric of §[2](https://arxiv.org/html/2607.00588#S2) over $n{=}1000$ samples per model against $n{=}1000$ human (BBC/XSum (Narayan et al., [2018](https://arxiv.org/html/2607.00588#bib.bib16))) articles (Table [4](https://arxiv.org/html/2607.00588#footnote4)). No authoritative per-sample cutoff for “degenerate” exists, so we calibrate to the human articles: their seq-rep-4 has median $0.00\%$ and $95$th percentile $1.92\%$, which we take as the human-clean bar. (Footnote 2: A generous bar; the few percent of human text above it is genuinely repetitive: lists, names, refrains.) A sample is *human-clean* when its seq-rep-4 is under this bar, and we read two quantities off it: the *accept* rate (the share of samples below it) and *clean-PPL* (the Gen-PPL of the accepted samples alone, so repetition can no longer lower it). Clean-PPL is a guardrail against passing the bar with diverse nonsense, not the quality verdict, which rests on the reference-free signals in Table [4](https://arxiv.org/html/2607.00588#footnote4) (grammatical acceptability and within-text diversity).
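The two quantities defined above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the function name is hypothetical, each sample is assumed to arrive as a pair of its seq-rep-4 value and its per-token log-probabilities, and "aggregated at corpus level" is read here as pooling tokens across accepted samples.

```python
import math

# 95th-percentile seq-rep-4 of the human articles, as quoted in the text (1.92%).
HUMAN_BAR = 0.0192


def accept_rate_and_clean_ppl(samples, bar=HUMAN_BAR):
    """samples: list of (rep4, token_logprobs) pairs, one per generated text."""
    if not samples:
        raise ValueError("need at least one sample")
    # A sample is human-clean when its repetition is strictly under the bar.
    accepted = [logprobs for rep, logprobs in samples if rep < bar]
    accept_rate = len(accepted) / len(samples)
    # Assumption: corpus-level aggregation pools tokens over accepted samples.
    pooled = [lp for logprobs in accepted for lp in logprobs]
    if not pooled:
        return accept_rate, float("nan")
    return accept_rate, math.exp(-sum(pooled) / len(pooled))


# Toy pool with made-up numbers: two clean samples, two heavy repeaters.
toy = [
    (0.00, [math.log(0.1)] * 4),
    (0.30, [math.log(0.8)] * 4),
    (0.01, [math.log(0.1)] * 4),
    (0.50, [math.log(0.9)] * 4),
]
rate, clean_ppl = accept_rate_and_clean_ppl(toy)
print(rate, clean_ppl)
```

Because the repeaters are excluded before scoring, their easy-to-predict tokens can no longer pull the perplexity down.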

Table 1: **Gen-PPL ranks ELF backwards by rewarding repetition; one shared direction removes it at competitive quality.** Per-size baseline vs steered ($64$ steps, $\lambda{=}2$): defect metrics on the full $1000$-sample pool (left) and quality on the clean accepted set, length-matched (right). G-PPL/c-PPL: Gen-/clean-PPL; s-BLEU: self-BLEU.

Footnote 4: We fix $64$ steps across sizes; ours then matches ELF’s reported $64$-step ELF-M/L (Gen-PPL $22.1/24.0$ vs. $21.7/23.3$), while ELF reports ELF-B at $32$ steps ($24.1$; [https://github.com/lillian039/ELF](https://github.com/lillian039/ELF)).

#### The defect, and the backwards ranking.

ELF’s output carries a defect absent from human text yet invisible to the metrics certifying it: heavy $4$-gram self-repetition, common in ELF and near-absent in human prose (Table [4](https://arxiv.org/html/2607.00588#footnote4); Welleck et al., [2020](https://arxiv.org/html/2607.00588#bib.bib13)). The ranking it produces runs *backwards*: Gen-PPL places the smaller ELF-B above the larger ELF-M, yet our defect-controlled clean-PPL reverses them. Strip the repetition and ELF-M is the better model; ELF-B led only because it repeats most: repeated text is trivially predictable, so the GPT-2 scorer assigns it high probability and a flatteringly low Gen-PPL. This is not a measurement artifact (controls in App. [F](https://arxiv.org/html/2607.00588#A6)); a side-by-side example of the defect and the fix is in App. [G](https://arxiv.org/html/2607.00588#A7) (Table [17](https://arxiv.org/html/2607.00588#A7.T17)).

#### Self\-conditioning creates the repetition that Gen\-PPL rewards\.

Feeding back an *attenuated* estimate $\alpha\hat{\bm{x}}$ and turning $\alpha$ from $0$ (feedback off) up to $1$ (full), with nothing else changed, drives repetition up and Gen-PPL down together: the metric calls the generator more than four times “better” just as it grows most repetitive (Table [2](https://arxiv.org/html/2607.00588#S3.T2)). (Footnote 5: No retraining is needed: classifier-free guidance already trains the model both with the self-conditioning signal ($\alpha{=}1$) and without it ($\alpha{=}0$), so the sweep only interpolates between regimes it already runs.) Acceptance under the human bar collapses as $\alpha$ rises: the Gen-PPL gain is part real fluency and part rewarded repetition, which the metric cannot tell apart.

Table 2: **Raising the self-conditioning strength $\alpha$ increases repetition and lowers Gen-PPL together.** Feeding back $\alpha\hat{\bm{x}}$, no retraining (ELF-B, $64$ steps, $\gamma{=}1.0$); accept: share under the $1.92\%$ human bar. Gen-PPL: all samples; clean-PPL: accepted subset only.

## 4 Mechanism: repetition is a one-dimensional attractor of the self-conditioning loop

The ablation of §[3](https://arxiv.org/html/2607.00588#S3) identifies the self-conditioning feedback as a causal driver of repetition; we next ask whether that effect is diffuse or concentrated: does the loop’s drift lie along one identifiable direction that predicts a sample’s repetition and, when subtracted, suppresses it? We find that it does. Repetition is a contracting attractor of the self-conditioning loop: the loop amplifies the most self-predictable signal, repeated content, and the state is drawn toward a repetition fixed point $\bm{u}^{\star}$ (Fig. [1](https://arxiv.org/html/2607.00588#S1.F1)). An idealized linear model of this attractor predicts the failure is *one-dimensional*, repetition living on one slow mode $\bm{v}_1$ separable from the benign denoising that writes the text; we then test each prediction against measurement (the formal statements, and the idealizing assumptions they rest on, are in App. [A](https://arxiv.org/html/2607.00588#A1)). The cheap direction $\bm{d}$ that exploits it is the subject of §[5](https://arxiv.org/html/2607.00588#S5); its transfer across knobs and sizes is covered in §[6](https://arxiv.org/html/2607.00588#S6).

#### The loop splits into a self\-conditioning response and a denoising drive\.

ELF denoises by *self-conditioning*: each step $k$ feeds the model its own previous clean-latent estimate $\hat{\bm{x}}_k\in\mathbb{R}^{L\times e}$ and reads off the next; iterating this feedback is the loop (Chen et al., [2022](https://arxiv.org/html/2607.00588#bib.bib23)). We track its pool over positions ($e$ the embedding dimension, $L$ the length),

$$\bm{u}_k \;=\; \tfrac{1}{L}\textstyle\sum_{l=1}^{L}\hat{\bm{x}}_k[l] \;\in\; \mathbb{R}^{e}, \qquad \bm{u}_{k+1} \;=\; g(\bm{u}_k) \;=\; \underbrace{\bm{s}(\bm{u}_k)}_{\text{self-conditioning}} + \underbrace{\bm{f}_k}_{\text{denoising drive}}. \tag{4}$$

One step $g$ splits in two: the *self-conditioning response* $\bm{s}(\bm{u}_k)$, how the next estimate depends on the fed-back one (the channel by which repeated content reinforces itself), and the *denoising drive* $\bm{f}_k$, the ordinary denoising the step would do with the feedback off (set by the noise level $t_k$ and latent $\bm{z}_k$), which carries the bulk of the motion and writes the text. Repetition is governed mainly by $\bm{s}$, with only the small on-axis component of $\bm{f}_k$ setting the driven offset; $\bm{f}_k$ is otherwise the benign carrier it rides on. App. [A](https://arxiv.org/html/2607.00588#A1) derives the split from the model’s clean-latent prediction $\bm{X}_\theta$ (the fed-back $\hat{\bm{x}}$ of §[2](https://arxiv.org/html/2607.00588#S2)) by a first-order expansion, pooling the feedback over positions into a tractable $e$-dimensional state whose link to text-level repetition is empirical (Fig. [2](https://arxiv.org/html/2607.00588#S4.F2)a).

###### Assumption 1 (Repetition attractor).

The self-conditioning map $\bm{s}$ has a fixed point $\bm{u}^{\star}$ (repeated content) and is $C^{1}$ near it with a contracting Jacobian $\bm{J}=\mathrm{D}\bm{s}(\bm{u}^{\star})$. For the analysis we idealize $\bm{J}$ as symmetric, with orthonormal eigenvectors $\{\bm{v}_i\}$ and real eigenvalues $1>\mu_1\geq\mu_2\geq\cdots\geq 0$. The leading eigenvector $\bm{v}_1$ is the *repetition axis*, and $\rho:=1-\mu_1\in(0,1)$ its contraction rate.

The symmetry idealization is used only to decouple the $\bm{v}_1$ coordinate into the scalar recursion below; the measured finite-difference $\bm{J}$ is only approximately symmetric (App. [B](https://arxiv.org/html/2607.00588#A2)), so the theory reads as a local scalar approximation. The axis $\bm{v}_1$ is slowest only relative to the faster off-axis modes, not near-marginal (measured $\mu_1{\approx}0.15$).

#### Repetition is one\-dimensional\.

Linearizing $\bm{s}$ about $\bm{u}^{\star}$, its response contracts slowest along the leading eigenvector $\bm{v}_1$ and faster off it, so the per-step change decomposes as

$$\Delta\bm{u}_k \;\approx\; \underbrace{\underset{\text{leading}}{\beta_k\,\bm{v}_1} \;+\; \underset{\text{decays}}{\bm{r}_k}}_{\text{repetition mode}} \;+\; \underbrace{\bm{f}_k}_{\text{denoising drive}} \tag{5}$$

(Lemmas [1](https://arxiv.org/html/2607.00588#Thmlemma1)–[2](https://arxiv.org/html/2607.00588#Thmlemma2), App. [A](https://arxiv.org/html/2607.00588#A1)): the repetition mode $\beta_k\bm{v}_1$ along the axis (coefficient $\beta_k=-\rho\,a^{(1)}_k$, with $a^{(1)}_k=\langle\bm{u}_k-\bm{u}^{\star},\bm{v}_1\rangle$ the *repetition level*, how far the feedback sits along $\bm{v}_1$), the subordinate off-axis modes $\bm{r}_k\perp\bm{v}_1$ (the same self-conditioning response on the faster-contracting directions), and the near-orthogonal drive $\bm{f}_k$. Because $\bm{v}_1$ contracts the slowest (spectral gap $\mu_1/\mu_2$), the off-axis transients in $\bm{r}_k$ die out faster than the $\bm{v}_1$ component, so the structured residual concentrates on $\bm{v}_1$: repetition is *effectively one-dimensional* along this axis. Freezing the drive ($\bm{f}_k\equiv\bm{f}$), the state settles at

$$\bm{u}_\infty \;=\; \bm{u}^{\star}+(\bm{I}-\bm{J})^{-1}\bm{f} \tag{6}$$

(Lemma [3](https://arxiv.org/html/2607.00588#Thmlemma3), App. [A](https://arxiv.org/html/2607.00588#A1)): the distance $|a^{(1)}_\infty|=|f^{(1)}|/\rho$ along the axis is the driven offset, the small drive component $f^{(1)}=\langle\bm{f},\bm{v}_1\rangle$ divided by the contraction rate $\rho$. For a given drive component the offset is largest along $\bm{v}_1$, since $\bm{v}_1$ has the smallest $1-\mu_i$; the faster off-axis modes settle at smaller offsets.
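Under the symmetric idealization, Eq. (6) decouples per eigenmode into the scalar recursion $a_{k+1}=\mu\,a_k+f$, whose limit is $f/(1-\mu)$. The sketch below checks this numerically. The helper name `settle`, the value of `mu2`, and the drive components are illustrative assumptions; only $\mu_1\approx 0.15$ comes from the text.

```python
def settle(mu, f, a0=0.0, steps=200):
    """Iterate a_{k+1} = mu * a_k + f, the per-mode form of the linearized loop
    u_{k+1} = u* + J (u_k - u*) + f with the drive f frozen."""
    a = a0
    for _ in range(steps):
        a = mu * a + f
    return a


mu1 = 0.15      # measured leading eigenvalue quoted in the text
mu2 = 0.05      # made-up faster-contracting off-axis mode
f1 = f2 = 0.01  # equal, made-up drive components on both modes

a1 = settle(mu1, f1)
a2 = settle(mu2, f2)

print(a1, f1 / (1 - mu1))  # iterated offset vs closed form f / rho
print(a2, f2 / (1 - mu2))
```

With equal drive on both modes, the slower-contracting mode ends at the larger offset, which matches the claim that the driven offset is largest along $\bm{v}_1$.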

#### The repetition axis is real and dominant\.

The measured loop bears this out. As the basin forms the spectral gap $\mu_1/\mu_2$ rises, $\bm{v}_1$ becoming the clearly dominant mode (Fig. [2](https://arxiv.org/html/2607.00588#S4.F2)b, Tab. [10](https://arxiv.org/html/2607.00588#A2.T10)); a sample’s repetition level $a^{(1)}$ predicts its final repetition (Fig. [2](https://arxiv.org/html/2607.00588#S4.F2)a); and the cheap difference-of-means $\bm{d}$ (§[5](https://arxiv.org/html/2607.00588#S5)) aligns with $\bm{v}_1$, the overlap $\lvert\cos(\bm{v}_1,\bm{d})\rvert$ climbing to $0.55$ as the basin forms (Fig. [2](https://arxiv.org/html/2607.00588#S4.F2)b). Here $\bm{d}$ need not equal the single-point eigenvector $\bm{v}_1$: it is a trajectory-averaged steering direction that partially aligns with $\bm{v}_1$ yet steers better (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)) by integrating the basin-entry drift over the whole trajectory. Repetition concentrates on this one mode: the defect is effectively one-dimensional. We obtain $\bm{v}_1$ and the spectral gap from the feedback Jacobian (Alg. [2](https://arxiv.org/html/2607.00588#alg2)).
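The overlap figures quoted above are absolute cosines between two direction vectors. A minimal sketch of that measure follows; `abs_cosine` is a hypothetical helper name and the toy vectors are invented for illustration, not measured directions.

```python
import math


def abs_cosine(a, b):
    """Absolute cosine similarity |cos(a, b)| between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return abs(dot) / (norm_a * norm_b)


v1 = [1.0, 0.0]  # toy stand-in for the leading eigenvector
d = [1.0, 1.0]   # toy stand-in for the difference-of-means direction
print(abs_cosine(v1, d))
```

Taking the absolute value makes the measure insensitive to the arbitrary sign of an eigenvector.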

![Refer to caption](https://arxiv.org/html/2607.00588v1/x2.png)

Figure 2: **Repetition concentrates on one dominant mode $\bm{v}_1$, with which the cheap $\bm{d}$ partially aligns.** (a) Mean feedback projection onto the repetition axis $\bm{v}_1$ (the leading Jacobian eigenvector, read once the basin has formed, at trajectory fraction ${\sim}0.85$) vs final repetition. (b) As the basin forms a dominant mode emerges ($\mu_1/\mu_2$); the cheap $\bm{d}$ partially aligns with this local Jacobian mode while integrating basin-entry drift over the trajectory (Tab. [10](https://arxiv.org/html/2607.00588#A2.T10)).

## 5 ACE: one subtracted direction escapes the attractor

The mechanism is prescriptive (§[4](https://arxiv.org/html/2607.00588#S4)): if repetition is a single direction in the fed-back estimate, subtract only that direction and keep the rest. We call this ACE (*Attractor-Contrast-Escape*): *Contrast* the average self-conditioning feedback of repetitive vs. non-repetitive trajectories to get the attractor direction $\bm{d}$ (a difference of means, in the spirit of activation steering), then *Escape* by steering against it. This is principled, not heuristic: from just two class means, $\bm{d}$ recovers the mechanism’s mode $\bm{v}_1$ under an idealized separability model (Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1), App. [A](https://arxiv.org/html/2607.00588#A1)) and empirically aligns with the measured dominant mode. Algorithm [1](https://arxiv.org/html/2607.00588#alg1) instantiates it.

Algorithm 1: Difference-of-means self-conditioning steering (ACE)

1: **Require:** full-SC sampler $G$, count $N$ (estimation); model $f_\theta$, direction $\bm{d}$, strength $\lambda$, steps $T$ (generation)

2: **Ensure:** attractor direction $\bm{d}$; one steered sample

3: **procedure** EstimateDirection($G,\,N$)

4: **for** $n=1,\dots,N$ **do**

5: $\bm{s}_n\leftarrow\tfrac{1}{T}\sum_{k=1}^{T}\tfrac{1}{L}\sum_{l=1}^{L}\hat{\bm{x}}^{(n)}_k[l]$ $\triangleright$ mean self-conditioning feedback

6: $r_n\leftarrow\textsc{rep}_4(\textsc{dec}(\bm{z}_n))$ $\triangleright$ final repetition

7: **end for**

8: $\mathcal{T}\leftarrow\{\,n:r_n\geq q_{2/3}(r)\,\}$ $\triangleright$ trapped: top-rep tertile

9: $\mathcal{F}\leftarrow\{\,n:r_n\leq q_{1/3}(r)\,\}$ $\triangleright$ free: bottom-rep tertile

10: $\bm{d}\leftarrow\tfrac{1}{|\mathcal{T}|}\sum_{n\in\mathcal{T}}\bm{s}_n-\tfrac{1}{|\mathcal{F}|}\sum_{n\in\mathcal{F}}\bm{s}_n$

11: **return** $\bm{d}/\lVert\bm{d}\rVert$

12: **end procedure**

13:

14: **procedure** SteeredGenerate($f_\theta,\,\bm{d},\,\lambda,\,T$)

15: $\bm{z}_0\sim\mathcal{N}(0,\sigma^{2}I)$

16: $\hat{\bm{x}}_0\leftarrow\bm{0}$

17: **for** $k=1,\dots,T$ **do**

18: $(\bm{z}_k,\,\hat{\bm{x}}_k)\leftarrow\textsc{Step}(f_\theta,\bm{z}_{k-1},t_k,\,\tilde{\bm{x}}_{k-1})$

19: $\tilde{\bm{x}}_k\leftarrow\hat{\bm{x}}_k-\lambda\,\bm{d}\,\bm{1}_L^{\top}$ $\triangleright$ $\star$ subtract along $\bm{d}$, broadcast over $L$ positions

20: **end for**

21: **return** $\arg\max_{v}\,\textsc{dec}(\bm{z}_T)[\cdot,v]$

22: **end procedure**
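The two procedures of Algorithm 1 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the released ACE code: vectors are Python lists, `step` is a hypothetical callable standing in for the sampler's Step, the tertile cut points come from `statistics.quantiles` (whose interpolation rule may differ from the authors' choice), and the unsteered initial feedback is passed in by the caller.

```python
import statistics


def mean_vec(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def estimate_direction(feedback_means, reps):
    """feedback_means[n]: feedback of sample n averaged over steps and positions.
    reps[n]: final seq-rep-4 of sample n. Returns the unit direction d."""
    q_low, q_high = statistics.quantiles(reps, n=3)  # tertile cut points
    trapped = [s for s, r in zip(feedback_means, reps) if r >= q_high]
    free = [s for s, r in zip(feedback_means, reps) if r <= q_low]
    diff = [t - f for t, f in zip(mean_vec(trapped), mean_vec(free))]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("trapped and free means coincide; no direction found")
    return [x / norm for x in diff]


def steer(x_hat, d, lam):
    """Subtract lam * d from every position of x_hat (L rows of length e)."""
    return [[x - lam * di for x, di in zip(row, d)] for row in x_hat]


def steered_generate(step, z0, x0, d, lam, num_steps):
    """step(z, k, feedback) -> (z_next, x_hat) is a hypothetical sampler step."""
    z, feedback = z0, x0
    for k in range(1, num_steps + 1):
        z, x_hat = step(z, k, feedback)
        feedback = steer(x_hat, d, lam)  # steered estimate fed to the next step
    return z


# Toy data (made up): the two most repetitive samples sit further along axis 0.
toy_feedback = [[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 5.0], [3.0, 0.0], [3.0, 2.0]]
toy_reps = [0.0, 0.0, 0.1, 0.1, 0.5, 0.6]
d_toy = estimate_direction(toy_feedback, toy_reps)
print(d_toy)
print(steer([[1.0, 1.0], [2.0, 2.0]], d_toy, lam=2.0))
```

Only the component of the feedback along `d` is changed; every other coordinate passes through untouched, which is the property the section relies on.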

### 5.1 Main results: repetition and quality

#### One frozen direction cuts repetition at competitive quality\.

At the operating point $\lambda{=}2$ (the cross-size result, Tab. [4](https://arxiv.org/html/2607.00588#footnote4); full dose sweep Tab. [14](https://arxiv.org/html/2607.00588#A3.T14)), steering cuts median repetition to near the human level at competitive clean-PPL, with reference-free quality preserved (grammaticality and within-text diversity): the targeted subtraction keeps the rest of the feedback intact. That a *single* difference-of-means direction recovers most of the gap is direct evidence the repetition signal is one-dimensional and separable from coherence. Steering is one causal intervention on $\bm{d}$ (at inference, Fig. [3](https://arxiv.org/html/2607.00588#S5.F3)); a training-time intervention confirms it: an anti-attractor regularizer on the $\bm{d}$-component of the feedback lowers plain repetition from $6.8\%$ to $3.3\%$ at a small clean-PPL cost ($27.9{\to}29.1$ vs. the matched continue-train control), while the same fine-tune without the penalty barely moves it (Table [3](https://arxiv.org/html/2607.00588#S5.T3); recipe in App. [E](https://arxiv.org/html/2607.00588#A5)).

Table 3: **Penalizing $\bm{d}$ in training cuts repetition, at a small clean-PPL cost.** Continued fine-tuning of ELF-B ($128$ optimizer steps); anti-attractor vs. the matched continue-train control; clean-PPL on the reject-to-$1000$ accepted set.

![Refer to caption](https://arxiv.org/html/2607.00588v1/x3.png)

Figure 3: **ACE cancels the repetition drift.** Self-conditioning feedback drifts along a single direction $\bm{d}$ into a repetition basin (a); ACE subtracts $\bm{d}$ and cancels that drift, holding repetition down (b). Baseline (red) vs steered (blue), ELF-B.

#### The steer is bounded: a usable dose window.

Subtracting $\lambda\bm{d}$ escapes the repetition basin only within a closed-form window $[\lambda^{\star}, \lambda_{\max}]$ (Prop. [2](https://arxiv.org/html/2607.00588#Thmproposition2), App. [A](https://arxiv.org/html/2607.00588#A1)); $\lambda{=}2$ is the operating point inside it (Tab. [14](https://arxiv.org/html/2607.00588#A3.T14), Fig. [5](https://arxiv.org/html/2607.00588#A3.F5)). Below the window the dose is too weak and repetition is not cut; above it repetition stays low but two costs appear: the perturbed latent leaves the real-token manifold and the text decodes to non-words, an observable proxy for leaving the manifold (the rate spiking by $\lambda{\approx}8$, Fig. [5](https://arxiv.org/html/2607.00588#A3.F5)), and the accepted text turns generic (self-BLEU climbs with $\lambda$, Tab. [14](https://arxiv.org/html/2607.00588#A3.T14)).

#### Why steering keeps quality competitive\.

The drive $\bm{f}_k$ is large but nearly orthogonal to $\bm{d} \approx \bm{v}_1$ (Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1); App. [A](https://arxiv.org/html/2607.00588#A1)). ACE subtracts only $\lambda\bm{d}$, so it cancels repetition while leaving the text-writing drive intact, dropping repetition to near the human bar at competitive reference-free quality (Tab. [4](https://arxiv.org/html/2607.00588#footnote4)), with self-BLEU the only mild cost.

### 5\.2Compute\-to\-clean comparison

#### ACE beats every feedback\-side alternative we test, and makes clean text cheap\.

We benchmark the four routes to less repetition on a common, hard-to-game footing: a reject-to-$N$ loop scored by compute (NFE) and clean-PPL (Tab. [4](https://arxiv.org/html/2607.00588#S5.T4); each route defined in App. [D](https://arxiv.org/html/2607.00588#A4)). ACE is the only one both cheap and clean; each alternative fails one axis. Rejecting full-SC post hoc is clean but expensive, its accepts scarce because the low Gen-PPL is the very repetition the bar removes. Disabling self-conditioning (SC-reset) is cheap but decodes to the diverse nonsense the clean-PPL guard exists to catch. Soft-SC either barely cuts repetition or sacrifices coherence to remove it. ACE instead subtracts one direction and leaves the rest of the feedback intact, reaching human-clean text at competitive clean-PPL and $1.5$–$5\times$ cheaper than full-SC rejection across sizes (Fig. [6](https://arxiv.org/html/2607.00588#A4.F6)).
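
A reject-to-$N$ loop of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: `seq_rep_n` follows the usual definition of seq-rep-n from Welleck et al. (2020), one minus the fraction of unique n-grams; the `sample` callable, the helper names, and the cost accounting (total forward passes divided by accepted samples) are hypothetical. The $1.92\%$ threshold is the human bar quoted in the Table 4 caption.

```python
from itertools import cycle
from typing import Callable, List, Sequence, Tuple


def seq_rep_n(tokens: Sequence[int], n: int = 4) -> float:
    """1 - (unique n-grams / total n-grams); 0.0 if the sequence is too short."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def reject_to_n(
    sample: Callable[[], Sequence[int]],   # hypothetical generator: one draw per call
    n_accept: int,
    nfe_per_sample: int,                   # forward passes spent on each draw
    max_rep: float = 0.0192,               # human bar: seq-rep-4 <= 1.92%
    max_draws: int = 10_000,
) -> Tuple[List[Sequence[int]], float]:
    """Draw until n_accept samples pass the repetition bar; report NFE per accept."""
    accepted: List[Sequence[int]] = []
    draws = 0
    while len(accepted) < n_accept and draws < max_draws:
        tokens = sample()
        draws += 1
        if seq_rep_n(tokens) <= max_rep:
            accepted.append(tokens)
    nfe_per_clean = draws * nfe_per_sample / len(accepted) if accepted else float("inf")
    return accepted, nfe_per_clean


# Toy usage: a sampler that alternates a fully repetitive and a clean sequence.
_stream = cycle([[1] * 8, list(range(20))])
accepted, nfe_per_clean = reject_to_n(lambda: next(_stream), n_accept=2, nfe_per_sample=64)
```

In this toy run half the draws are rejected, so each accepted sample costs twice the per-draw budget; a generator that repeats less needs fewer draws per accept.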

Table 4: Only ACE reaches the human bar cheaply and at competitive clean-PPL. Compute-to-clean at $\gamma{=}1.0$, $64$ steps: direct-generation repetition, expected NFE ($10^{3}$ forward passes) to one human-clean sample (seq-rep-4 $\leq 1.92\%$), and clean-PPL on the reject-to-$1000$ accepted set. Full grid over all soft-SC variants and doses in Tab. [15](https://arxiv.org/html/2607.00588#A4.T15).

Table 5: No alternative direction beats ACE's cheap difference-of-means $\bm{d}$. Direction estimators at the $\gamma{=}1.0$ operating point, dose fixed at $\lambda{=}2$ (rep is the median); clean-PPL is a guardrail. Recoverability of $\bm{d}$: Tab. [8](https://arxiv.org/html/2607.00588#A2.T8); dose-form ablation: Tab. [9](https://arxiv.org/html/2607.00588#A2.T9).

*Two ablations isolate the remaining choices, the direction and the dose.*

### 5\.3Ablations: direction and dose

#### The direction: $\bm{d}$ matches or beats every estimator.

In the idealized separability model the difference of means coincides with two theory-optimal directions at once, the dynamical optimum $\bm{v}_1$ (Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1)) and the Fisher discriminant (Fisher, [1936](https://arxiv.org/html/2607.00588#bib.bib40)), and it stays *cheap and robust*: where those would need the loop's Jacobian eigenvector or a noisy high-dimensional covariance inverse, $\bm{d}$ reads off only two class means. It matches or beats every alternative at the operating point (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)): a regularized LDA, the optimal linear discriminant, only matches it, and $\bm{d}$ is in fact its simpler isotropic special case; the unsupervised top-PC and Jacobian eigenvector steer worse. The Jacobian eigenvector's lower clean-PPL is not a quality win: its direct-generation repetition stays high ($5.03\%$), so its accepted set is a small filtered subset rather than a fix. The direction itself is robustly recoverable: $\cos$ to $\bm{d}$ stays in $[0.97, 1.00]$ across tertile fractions and step windows and collapses to near zero for random or label-permuted controls (Tab. [8](https://arxiv.org/html/2607.00588#A2.T8)). A black-box search lowers repetition only slightly further ($1.85$ vs. $2.11$) at ${\sim}8\times$ the cost and along essentially the same axis ($\cos{\approx}0.92$) (Tab. [11](https://arxiv.org/html/2607.00588#A2.T11)); this cheap, label-free difference-of-means direction is thus near-optimal.
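
The "isotropic special case" remark can be made explicit with the standard form of the Fisher/LDA direction. The shared within-class covariance $\bm{\Sigma}$ and the scale $\sigma^2$ below are notation introduced for this sketch; $\bm{m}_{\mathcal{T}}$ and $\bm{m}_{\mathcal{F}}$ are the trapped and free class means used to define $\bm{d}$.

```latex
% Fisher / LDA direction for two classes with shared within-class covariance \Sigma:
\bm{w}_{\mathrm{LDA}} \;\propto\; \bm{\Sigma}^{-1}\,(\bm{m}_{\mathcal{T}} - \bm{m}_{\mathcal{F}}).

% Isotropic case \Sigma = \sigma^2 \bm{I}: the inverse covariance is a scalar multiple
% of the identity, so the direction reduces to the normalized difference of means:
\bm{\Sigma} = \sigma^{2}\bm{I}
\;\;\Longrightarrow\;\;
\bm{w}_{\mathrm{LDA}} \;\propto\; \bm{m}_{\mathcal{T}} - \bm{m}_{\mathcal{F}}
\;\propto\; \bm{d}.
```

Under this reading, the difference of means avoids estimating and inverting a high-dimensional covariance, which is the robustness advantage the paragraph describes.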

#### The dose: a fixed $\lambda$ beats exact projection.

Per-step projection (subtract the instantaneous $\bm{d}$-component, $\hat{\bm{x}}_k - \langle \hat{\bm{x}}_k, \bm{d}\rangle \bm{d}$) is the apparently optimal dose, removing exactly the offending component so it can never over-steer; yet it *under*-doses, because the loop re-amplifies that component before the next step, and a fixed $\lambda{\approx}2$ that pre-compensates the amplification does better (Tab. [9](https://arxiv.org/html/2607.00588#A2.T9)). The dose matches the loop's Jacobian gain to an order of magnitude; the full sweep and usable window are in App. [C.2](https://arxiv.org/html/2607.00588#A3.SS2).
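
The two dose forms differ only in how much they remove along $\bm{d}$, which a short NumPy sketch makes concrete. The function names and the `(D, L)` layout are assumptions for illustration; the sketch shows the two operators only and does not reproduce the re-amplification effect reported in the ablation.

```python
import numpy as np


def fixed_dose(x_hat: np.ndarray, d: np.ndarray, lam: float = 2.0) -> np.ndarray:
    """Subtract a constant lam * d at every position (fixed-lambda dose)."""
    d = d / np.linalg.norm(d)
    return x_hat - lam * d[:, None]


def project_out(x_hat: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove each position's instantaneous d-component: x - <x, d> d."""
    d = d / np.linalg.norm(d)
    coeffs = d @ x_hat                  # (L,) per-position components along d
    return x_hat - np.outer(d, coeffs)


rng = np.random.default_rng(0)
x_hat = rng.normal(size=(4, 3))        # toy feedback: D = 4, L = 3
d = np.array([1.0, 0.0, 0.0, 0.0])

after_fixed = d @ fixed_dose(x_hat, d, lam=2.0)   # equals (d @ x_hat) - 2
after_proj = d @ project_out(x_hat, d)            # equals 0 at every position
```

Projection leaves exactly zero along `d`, whereas the fixed dose shifts the component by `lam` regardless of its current size and can therefore push it past zero.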

## 6The defect and the fix generalize across knobs, sizes, and models

We test whether the steering fix of §[5](https://arxiv.org/html/2607.00588#S5) is an artefact of one configuration, and whether the underlying repetition defect is specific to ELF. Neither holds.

#### Steering generalizes across every inference knob\.

Steering at $\lambda{=}2$ drops repetition under every inference knob: across denoising *steps* and *guidance* scale, the two basin-deepening knobs in Figure [4](https://arxiv.org/html/2607.00588#S6.F4), and across the remaining knobs (noise scale, SDE/ODE sampler, SDE churn $\gamma$, and an $8$-seed robustness band) at competitive clean-PPL (Table [7](https://arxiv.org/html/2607.00588#A2.T7)).

![Refer to caption](https://arxiv.org/html/2607.00588v1/x4.png)

Figure 4: Baseline repetition grows steeply with steps and guidance; steering suppresses it throughout. Baseline (red) vs. single $\bm{d}$ ($\lambda{=}2$, blue) over (a) denoising steps and (b) guidance scale.

#### Cross\-model: the direction transfers to other self\-conditioned LMs\.

The same difference-of-means procedure recovers a steerable attractor direction on other continuous-latent diffusion LMs, so the mechanism is not ELF-specific (Table [6](https://arxiv.org/html/2607.00588#S6.T6)). On LangFlow and the same-recipe Plaid (Gulrajani and Hashimoto, [2023](https://arxiv.org/html/2607.00588#bib.bib36)), both of which use soft self-conditioning hints like ELF's, the procedure transfers: it cuts LangFlow's repetition outright and Plaid's heavy-repetition tail.

Table 6: The repetition defect and the steering direction generalize across continuous-latent diffusion language models. The same difference-of-means recipe recovers useful steering directions on other soft self-conditioned LMs; the rep $>5\%$ tail shows the defect is worse than the median suggests.

#### Highlight: one direction, estimated once, transfers across every knob and size, tuning\-free\.

A *single* ELF-B direction $\bm{d}_B$, estimated once from a few hundred unlabeled samples on the smallest model, steers near-natively across all four inference knobs (steps, guidance, churn, noise) and across model sizes ELF-B/M/L (Table [7](https://arxiv.org/html/2607.00588#A2.T7)): cosine $0.82$–$0.96$ to the natively re-estimated axis across knobs, $0.85$–$0.88$ across sizes. So $\bm{d}$ is a property of the model family, not of the size or operating point, and one $\bm{d}$ suffices. *That a single direction steers every size hints that the attractor is not a per-checkpoint quirk but a systematic bias of the self-conditioned training recipe, set by the shared paradigm rather than learned afresh at each scale.*

## 7Conclusion

We showed that the headline metric of self-conditioned (continuous) diffusion language models rewards repetition: ELF's low Gen-PPL comes mostly from the repetition that the metric rewards, so on Gen-PPL the $105$M model outranks the $342$M one, which beats it once repetition is controlled. We traced this to its source, a contractive fixed point of the self-conditioning loop that forms a one-dimensional basin, and turned that mechanism into a fix. We propose ACE (Attractor-Contrast-Escape): it recovers the basin's direction $\bm{d}$ once by a difference of means and subtracts it from the feedback; this removes most repetition, transfers across samplers and model sizes, and is partly internalizable by training. Since Gen-PPL rewards the very repetition ACE removes, we score the fix on a compute-to-clean evaluation against a human repetition floor, where it reaches genuinely non-repetitive text at competitive clean-PPL and $1.5$–$5\times$ lower cost. That one direction serves every model size points to a defect of the self-conditioned paradigm itself rather than any single checkpoint, and the same construction recovers useful directions on other architectures.

## References

- Simple self\-conditioning adaptation for masked diffusion models\.arXiv preprint arXiv:2604\.26985\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px4.p1.1)\.
- T\. Chen, R\. Zhang, and G\. Hinton \(2022\)Analog bits: generating discrete data using diffusion models with self\-conditioning\.arXiv preprint arXiv:2208\.04202\.Cited by:[§1](https://arxiv.org/html/2607.00588#S1.p1.2),[§2\.1](https://arxiv.org/html/2607.00588#S2.SS1.p2.1),[§4](https://arxiv.org/html/2607.00588#S4.SS0.SSS0.Px1.p1.4)\.
- Y\. Chen, C\. Liang, H\. Sui, R\. Guo, C\. Cheng, J\. You, and G\. Liu \(2026\)LangFlow: continuous diffusion rivals discrete in language modeling\.arXiv preprint arXiv:2604\.11748\.Cited by:[§1](https://arxiv.org/html/2607.00588#S1.p4.3),[§2\.1](https://arxiv.org/html/2607.00588#S2.SS1.p2.1)\.
- S\. Dieleman, L\. Sartran, A\. Roshannai, N\. Savinov, Y\. Ganin, P\. H\. Richemond, A\. Doucet, R\. Strudel, C\. Dyer, C\. Durkan,et al\.\(2022\)Continuous diffusion for categorical data\.arXiv preprint arXiv:2211\.15089\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1)\.
- R\. A\. Fisher \(1936\)The use of multiple measurements in taxonomic problems\.Annals of eugenics7\(2\),pp\. 179–188\.Cited by:[§5\.3](https://arxiv.org/html/2607.00588#S5.SS3.SSS0.Px1.p1.11)\.
- A\. Franca and A\. Tong \(2026\)Hacking generative perplexity: why unconditional text evaluation needs distributional metrics\.arXiv preprint arXiv:2606\.08417\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1)\.
- A\. Gokaslan and V\. Cohen \(2019\)OpenWebText corpus\.Note:[http://Skylion007\.github\.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus)Cited by:[§3](https://arxiv.org/html/2607.00588#S3.p1.8)\.
- G\. H\. Golub and C\. F\. Van Loan \(2013\)Matrix computations\.JHU press\.Cited by:[9](https://arxiv.org/html/2607.00588#alg2.l9.1)\.
- I\. Gulrajani and T\. B\. Hashimoto \(2023\)Likelihood\-based diffusion language models\.Advances in Neural Information Processing Systems36,pp\. 16693–16715\.Cited by:[§1](https://arxiv.org/html/2607.00588#S1.p3.1.3),[§2\.1](https://arxiv.org/html/2607.00588#S2.SS1.p2.1),[§6](https://arxiv.org/html/2607.00588#S6.SS0.SSS0.Px2.p1.1)\.
- T\. He, J\. Zhang, T\. Wang, S\. Kumar, K\. Cho, J\. Glass, and Y\. Tsvetkov \(2023\)On the blind spots of model\-based evaluation metrics for text generation\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 12067–12097\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1)\.
- A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi \(2020\)The curious case of neural text degeneration\.InInternational Conference on Learning Representations,Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.p2.1)\.
- K\. Hu, L\. Qiu, Y\. Lu, H\. Zhao, T\. Li, Y\. Kim, J\. Andreas, and K\. He \(2026\)ELF: embedded language flows\.arXiv preprint arXiv:2605\.10938\.Cited by:[§1](https://arxiv.org/html/2607.00588#S1.p1.2),[§2\.1](https://arxiv.org/html/2607.00588#S2.SS1.p1.6),[§3](https://arxiv.org/html/2607.00588#S3.p1.8)\.
- T\. Karras, M\. Aittala, T\. Aila, and S\. Laine \(2022\)Elucidating the design space of diffusion\-based generative models\.Advances in neural information processing systems35,pp\. 26565–26577\.Cited by:[§2\.1](https://arxiv.org/html/2607.00588#S2.SS1.p1.6)\.
- K\. Kukich \(1992\)Techniques for automatically correcting words in text\.ACM computing surveys \(CSUR\)24\(4\),pp\. 377–439\.Cited by:[§C\.1](https://arxiv.org/html/2607.00588#A3.SS1.p1.3)\.
- I\. Li, Z\. Shao, B\. Wang, R\. Yu, G\. V\. d\. Broeck, and A\. Liu \(2026\)Breaking the factorization barrier in diffusion language models\.arXiv preprint arXiv:2603\.00045\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1)\.
- J\. Li, M\. Galley, C\. Brockett, J\. Gao, and W\. B\. Dolan \(2016\)A diversity\-promoting objective function for neural conversation models\.InProceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies,pp\. 110–119\.Cited by:[§2\.3](https://arxiv.org/html/2607.00588#S2.SS3.p1.6)\.
- X\. Li, J\. Thickstun, I\. Gulrajani, P\. S\. Liang, and T\. B\. Hashimoto \(2022\)Diffusion\-lm improves controllable text generation\.Advances in neural information processing systems35,pp\. 4328–4343\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1)\.
- S\. Narayan, S\. B\. Cohen, and M\. Lapata \(2018\)Don’t give me the details, just the summary\! topic\-aware convolutional neural networks for extreme summarization\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 1797–1807\.Cited by:[§3](https://arxiv.org/html/2607.00588#S3.p1.8)\.
- L\. Németh \(2003\)Hunspell spell checker\.Note:[http://hunspell\.github\.io](http://hunspell.github.io/)Cited by:[§C\.1](https://arxiv.org/html/2607.00588#A3.SS1.p1.3)\.
- P\. Pynadath, J\. Shi, and R\. Zhang \(2026\)Generative frontiers: why evaluation matters for diffusion language models\.arXiv preprint arXiv:2604\.02718\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§1](https://arxiv.org/html/2607.00588#S1.p2.1)\.
- J\. Shen, J\. Zhao, Z\. He, and Z\. Lin \(2026\)Codar: continuous diffusion language models are more powerful than you think\.arXiv preprint arXiv:2603\.02547\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1)\.
- A\. Shnaidman, E\. Feiglin, O\. Yaari, E\. Mentel, A\. Levi, and R\. Lapid \(2025\)Activation steering for masked diffusion language models\.arXiv preprint arXiv:2512\.24143\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1)\.
- R\. Speer, J\. Chin, A\. Lin, S\. Jewett, and L\. Nathan \(2018\)LuminosoInsight/wordfreq: v2\. 2\.Zenodo\.Cited by:[§C\.1](https://arxiv.org/html/2607.00588#A3.SS1.p1.3)\.
- R\. Strudel, C\. Tallec, F\. Altché, Y\. Du, Y\. Ganin, A\. Mensch, W\. Grathwohl, N\. Savinov, S\. Dieleman, L\. Sifre,et al\.\(2022\)Self\-conditioned embedding diffusion for text generation\.arXiv preprint arXiv:2211\.04236\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px2.p1.1)\.
- Y\. Su, T\. Lan, Y\. Wang, D\. Yogatama, L\. Kong, and N\. Collier \(2022\)A contrastive framework for neural text generation\.Advances in Neural Information Processing Systems35,pp\. 21548–21561\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px3.p1.1),[§2\.3](https://arxiv.org/html/2607.00588#S2.SS3.p1.6)\.
- Y\. Wang, J\. Deng, A\. Sun, and X\. Meng \(2022\)Perplexity from plm is unreliable for evaluating text quality\.arXiv preprint arXiv:2210\.05892\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1)\.
- S\. Welleck, I\. Kulikov, S\. Roller, E\. Dinan, K\. Cho, and J\. Weston \(2020\)Neural text generation with unlikelihood training\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.p2.1),[§2\.3](https://arxiv.org/html/2607.00588#S2.SS3.p1.6),[§3](https://arxiv.org/html/2607.00588#S3.SS0.SSS0.Px1.p1.1)\.
- J\. Xu, X\. Liu, J\. Yan, D\. Cai, H\. Li, and J\. Li \(2022\)Learning to break the loop: analyzing and mitigating repetitions for neural text generation\.Advances in Neural Information Processing Systems35,pp\. 3082–3095\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px2.p1.1)\.
- K\. Zheng, Y\. Chen, H\. Mao, M\. Liu, J\. Zhu, and Q\. Zhang \(2025\)Masked diffusion models are secretly time\-agnostic masked models and exploit inaccurate categorical sampling\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 63186–63227\.Cited by:[Appendix H](https://arxiv.org/html/2607.00588#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2607.00588#S1.SS0.SSS0.Px2.p1.1)\.

## Appendix AFormal analysis and empirical validation

This appendix develops the idealized linear model of the self-conditioning loop in derivation order, then validates the structure on the trained network. We linearize the loop (Lemma [1](https://arxiv.org/html/2607.00588#Thmlemma1)), read off the per-step decomposition splitting each update into drive and repetition mode (Lemma [2](https://arxiv.org/html/2607.00588#Thmlemma2)), and locate where the driven loop settles (Lemma [3](https://arxiv.org/html/2607.00588#Thmlemma3)). From this we derive the two consequences used in §[4](https://arxiv.org/html/2607.00588#S4): a label-free difference of means recovers the repetition axis (Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1)), and subtracting it escapes the basin within a closed-form dose window (Prop. [2](https://arxiv.org/html/2607.00588#Thmproposition2)). Every displayed result is exact for this idealized model under its stated assumptions (a symmetrized Jacobian, an exogenous drive, and the scalar approximations flagged at each step); the final empirical section then estimates the Jacobian on the real network and checks each prediction against measurement.

### A\.1Structure of the linearized loop

#### Setup: where $\bm{s}$, $\bm{f}_k$, and $\bm{J}$ come from.

We reuse the pooled-feedback loop $\bm{u}_{k+1} = g(\bm{u}_k)$ of equation [4](https://arxiv.org/html/2607.00588#S4.E4) and Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1), writing $\overline{(\cdot)} = \tfrac{1}{L}\sum_{l=1}^{L}(\cdot)[l]$ for the position pool. One step is the trained model $\bm{X}_\theta$ evaluated on its two inputs, the current noisy latent $\bm{z}_k$ and the fed-back estimate (taken uniform across positions, the pooling assumption of §[4](https://arxiv.org/html/2607.00588#S4)), at noise level $t_k$:

$$g(\bm{u}_k) \;=\; \overline{\bm{X}_\theta\!\big([\bm{z}_k,\,\bm{u}_k\bm{1}_L],\,t_k\big)}. \tag{7}$$

The model sees $\bm{u}_k$ only through the fed-back channel, so we expand in that channel about the repetition fixed point $\bm{u}^{\star}$ (Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1)):

$$\bm{X}_\theta\!\big([\bm{z}_k,\bm{u}_k],t_k\big) \;=\; \underbrace{\bm{X}_\theta\!\big([\bm{z}_k,\bm{u}^{\star}],t_k\big)}_{\text{zeroth order}} \;+\; \underbrace{\bm{J}\,(\bm{u}_k-\bm{u}^{\star})}_{\text{first order}} \;+\; o(\lVert\bm{a}_k\rVert). \tag{8}$$

The three objects used throughout are the terms of equation [8](https://arxiv.org/html/2607.00588#A1.E8): the *denoising drive* $\bm{f}_k := \overline{\bm{X}_\theta([\bm{z}_k,\bm{u}^{\star}],t_k)} - \bm{u}^{\star}$ (zeroth order: a function of $\bm{z}_k, t_k$, constant in $\bm{u}_k$), the self-conditioning Jacobian $\bm{J} := \partial\,\overline{\bm{X}_\theta}/\partial\bm{u}\,\big|_{\bm{u}^{\star}}$, and the self-conditioning map $\bm{s}(\bm{u}) := \bm{u}^{\star} + \bm{J}(\bm{u}-\bm{u}^{\star}) + o(\lVert\bm{u}-\bm{u}^{\star}\rVert)$, which fixes $\bm{u}^{\star}$ by construction, $\bm{s}(\bm{u}^{\star}) = \bm{u}^{\star}$. Collecting them gives the split

$$g(\bm{u}_k) \;=\; \bm{s}(\bm{u}_k) + \bm{f}_k. \tag{9}$$

The one substantive reduction is treating the latent trajectory $\{\bm{z}_k\}$ as exogenous, so $\bm{f}_k$ acts as a $\bm{u}_k$-independent forcing (its $\bm{u}_k$-dependence beyond linear order is the $o(\lVert\bm{a}_k\rVert)$ remainder). Downstream measurements support this: the drive is nearly orthogonal to the repetition axis ($\lvert\cos(\bm{f}_k,\bm{v}_1)\rvert{=}0.15$; App. [A.4](https://arxiv.org/html/2607.00588#A1.SS4)) and $\bm{J}$ is contracting with a dominant mode aligned to $\bm{d}$ (Tab. [10](https://arxiv.org/html/2607.00588#A2.T10)).

Let $\bm{a}_k = \bm{u}_k - \bm{u}^{\star}$ be the residual, with coordinates $a^{(i)}_k = \langle\bm{a}_k,\bm{v}_i\rangle$ in the eigenbasis of Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1); by the symmetric-$\bm{J}$ idealization $\bm{v}_1$ is both a left and right eigenvector, so the $\bm{v}_1$-coordinate decouples below.

###### Lemma 1\(Linearized loop\)\.

Near the fixed point the residual obeys the linear recursion

$$\bm{a}_{k+1} \;=\; \bm{J}\,\bm{a}_k + \bm{f}_k, \tag{10}$$

the residual form of the loop equation [4](https://arxiv.org/html/2607.00588#S4.E4) near $\bm{u}^{\star}$.

###### Proof.

Subtract $\bm{u}^{\star}$ from the split equation [9](https://arxiv.org/html/2607.00588#A1.E9) and use $\bm{s}(\bm{u}_k) = \bm{u}^{\star} + \bm{J}\bm{a}_k + o(\lVert\bm{a}_k\rVert)$ from equation [8](https://arxiv.org/html/2607.00588#A1.E8): $\bm{a}_{k+1} = \bm{J}\bm{a}_k + \bm{f}_k + o(\lVert\bm{a}_k\rVert)$, which is equation [10](https://arxiv.org/html/2607.00588#A1.E10) once the higher-order remainder is dropped (exact for the idealized linear model). ∎

###### Lemma 2\(Per\-step decomposition\)\.

The one-step change splits exactly as

$$\Delta\bm{u}_k \;=\; \beta_k\,\bm{v}_1 \;+\; \bm{r}_k \;+\; \bm{f}_k, \tag{5}$$

the decomposition of equation [5](https://arxiv.org/html/2607.00588#S4.E5), with repetition coefficient $\beta_k = (\mu_1 - 1)\,a^{(1)}_k$ and remainder $\bm{r}_k = \sum_{i\geq 2}(\mu_i - 1)\,a^{(i)}_k\bm{v}_i$. The drive $\bm{f}_k$ is orthogonal to $\bm{v}_1$ (off the repetition axis, App. [A.4](https://arxiv.org/html/2607.00588#A1.SS4)), and the remainder collects the off-axis modes, which contract at least as fast ($\mu_i \leq \mu_1$ for $i \geq 2$).

###### Proof.

From Lemma [1](https://arxiv.org/html/2607.00588#Thmlemma1), $\Delta\bm{u}_k = \bm{u}_{k+1} - \bm{u}_k = \bm{a}_{k+1} - \bm{a}_k = (\bm{J}-\bm{I})\bm{a}_k + \bm{f}_k$. Expanding the residual in the orthonormal eigenbasis (Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1)), $\bm{a}_k = \sum_i a^{(i)}_k\bm{v}_i$, gives $(\bm{J}-\bm{I})\bm{a}_k = \sum_i (\mu_i - 1)\,a^{(i)}_k\bm{v}_i$; separating the $i{=}1$ term as $\beta_k\bm{v}_1$ from the rest as $\bm{r}_k$ yields the stated identity. Projecting equation [10](https://arxiv.org/html/2607.00588#A1.E10) on $\bm{v}_i$ gives the scalar recursion $a^{(i)}_{k+1} = \mu_i a^{(i)}_k + f^{(i)}$, whose homogeneous transient decays as $\mu_i^{k}$, fastest for the smallest $\mu_i$; so the off-axis transients ($\mu_i \leq \mu_1$, $i \geq 2$) die out at least as fast as the $\bm{v}_1$ transient. They do not vanish: each settles at the driven offset $f^{(i)}/(1-\mu_i)$ (the off-axis content $\bm{w}$ of Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1)), largest along $\bm{v}_1$. So $\bm{v}_1$ carries both the slowest transient and the largest steady offset, the persistent structured part of the residual (Lemma [3](https://arxiv.org/html/2607.00588#Thmlemma3)). ∎

###### Lemma 3\(Where the driven loop settles\)\.

With the drive frozen at $\bm{f}_k \equiv \bm{f}$, the recursion equation [10](https://arxiv.org/html/2607.00588#A1.E10) converges geometrically (at rate $\mu_1$) to the unique fixed point $\bm{a}_\infty = (\bm{I}-\bm{J})^{-1}\bm{f}$, i.e. (equation [6](https://arxiv.org/html/2607.00588#S4.E6))

$$\bm{u}_\infty = \bm{u}^{\star} + (\bm{I}-\bm{J})^{-1}\bm{f}. \tag{6}$$

Its distance from $\bm{u}^{\star}$ along the repetition axis is $\lvert a^{(1)}_\infty\rvert = \lvert f^{(1)}\rvert/(1-\mu_1) = \lvert f^{(1)}\rvert/\rho$ with $f^{(1)} = \langle\bm{f},\bm{v}_1\rangle$. Since $\bm{f}$ is near-orthogonal to $\bm{v}_1$ ($f^{(1)}$ small), $\bm{u}_\infty$ sits a small but nonzero distance from $\bm{u}^{\star}$: a bounded repetition offset, not full collapse. This $\lvert f^{(1)}\rvert/\rho$ is a residual offset, not an amplification: stronger contraction (larger $\rho$, smaller $\mu_1$) leaves the drive less room to push the endpoint off $\bm{u}^{\star}$, shrinking the distance and deepening repetition, whereas a near-marginal mode ($\rho \to 0$) would let even a small $f^{(1)}$ hold $\bm{u}_\infty$ far from $\bm{u}^{\star}$.

###### Proof\.

$\bm{J}$ is contracting, so its spectral radius is $\mu_1 < 1$ and $\bm{I}-\bm{J}$ is invertible (eigenvalues $1-\mu_i > 0$); the Neumann series $\sum_{j\geq 0}\bm{J}^{j} = (\bm{I}-\bm{J})^{-1}$ converges. Unrolling equation [10](https://arxiv.org/html/2607.00588#A1.E10) with $\bm{f}_k \equiv \bm{f}$ from $\bm{a}_0$,

$$\bm{a}_k = \bm{J}^{k}\bm{a}_0 + \sum_{j=0}^{k-1}\bm{J}^{j}\bm{f} \;\xrightarrow[k\to\infty]{}\; (\bm{I}-\bm{J})^{-1}\bm{f}, \tag{11}$$

the transient $\bm{J}^{k}\bm{a}_0$ vanishing at rate $\mu_1^{k}$. Equivalently $\bm{a}_\infty$ solves $(\bm{I}-\bm{J})\bm{a}_\infty = \bm{f}$; projecting on the unit eigenvector $\bm{v}_1$, $(1-\mu_1)\,a^{(1)}_\infty = f^{(1)}$, so $\lvert a^{(1)}_\infty\rvert = \lvert f^{(1)}\rvert/(1-\mu_1)$. The map $\rho \mapsto \lvert f^{(1)}\rvert/\rho$ is decreasing on $(0,1)$, so a larger $\rho$ shrinks the residual. ∎
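
Lemma 3 can be checked numerically on a toy instance of the idealized model. The sketch below assumes a random symmetric contracting $\bm{J}$ with hand-picked eigenvalues and a random frozen drive; none of these values come from the paper, and the trained network's Jacobian is not involved.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6

# Symmetric contracting Jacobian J = Q diag(mu) Q^T with top eigenvalue mu_1 = 0.6.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
mu = np.array([0.6, 0.4, 0.3, 0.2, 0.1, 0.05])
J = Q @ np.diag(mu) @ Q.T
v1 = Q[:, 0]                        # unit eigenvector of the dominant mode

f = rng.normal(size=D)              # frozen drive f_k = f
a = rng.normal(size=D)              # arbitrary initial residual a_0

# Iterate the linear recursion a_{k+1} = J a_k + f (equation 10).
for _ in range(200):
    a = J @ a + f

# Closed-form fixed point a_inf = (I - J)^{-1} f (equation 6, in residual form).
a_inf = np.linalg.solve(np.eye(D) - J, f)

# On-axis offset predicted by the lemma: f^(1) / (1 - mu_1).
on_axis_pred = (f @ v1) / (1.0 - mu[0])
```

After 200 steps the transient $\bm{J}^{k}\bm{a}_0$ has decayed by a factor of roughly $0.6^{200}$, so the iterate `a` agrees with `a_inf` to numerical precision, and its $\bm{v}_1$-coordinate matches `on_axis_pred`.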

#### Nonlinear local existence and uniqueness\.

Lemma [3](https://arxiv.org/html/2607.00588#Thmlemma3) is exact for the linearized loop; the same conclusion survives for the true nonlinear map by the implicit function theorem. Write the driven fixed-point condition as $\bm{G}(\bm{u},\bm{f}) = \bm{s}(\bm{u}) - \bm{u} + \bm{f} = \bm{0}$; it is $C^{1}$ near $(\bm{u}^{\star},\bm{0})$ (Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1)) with $\bm{G}(\bm{u}^{\star},\bm{0}) = \bm{0}$, and its $\bm{u}$-Jacobian there is $\partial_{\bm{u}}\bm{G} = \bm{J}-\bm{I}$, invertible whenever $\mu_1 \neq 1$. The contraction $\mu_1 < 1$ supplies this (measured $\mu_1{\approx}0.15$, App. [A.4](https://arxiv.org/html/2607.00588#A1.SS4)), so the theorem gives a neighborhood of $\bm{f}{=}\bm{0}$ on which the driven fixed point $\bm{u}_\infty(\bm{f})$ exists, is locally unique, and is $C^{1}$ in $\bm{f}$, with first-order expansion $\bm{u}_\infty(\bm{f}) = \bm{u}^{\star} + (\bm{I}-\bm{J})^{-1}\bm{f} + o(\lVert\bm{f}\rVert)$: the linear formula equation [6](https://arxiv.org/html/2607.00588#S4.E6) is its leading term. The theorem upgrades the linear solution to a locally unique branch of the nonlinear loop; it does not by itself produce $\bm{u}^{\star}$, whose existence is Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1) and which appears empirically as a near-fixed point (residual ${\approx}0.07$; App. [A.4](https://arxiv.org/html/2607.00588#A1.SS4)).

#### Frozen coefficients\.

Lemma [3](https://arxiv.org/html/2607.00588#Thmlemma3) freezes $\bm{J}$ and $\bm{f}$ (the standard quasi-static linearization); both vary slowly over a finite budget, so the real trajectory *tracks* this instantaneous attractor rather than reaching it.

### A\.2Consequence 1: a label\-free direction recovers the axis

###### Definition 1\(Difference\-of\-means direction and steering\)\.

Let $a^{(1)} = \langle\bm{a},\bm{v}_1\rangle$ be the signed repetition-axis coordinate, oriented so $+\bm{v}_1$ points toward the repetition fixed point $\bm{u}^{\star}$ and the rep score is *monotone increasing* in $a^{(1)}$ (deeper into the basin along $+\bm{v}_1$ is more repetitive). The trapped and free groups $\mathcal{T}, \mathcal{F}$ are the top and bottom repetition tertiles, hence separated along signed $a^{(1)}$ with $\bar{a}^{(1)}_{\mathcal{T}} > \bar{a}^{(1)}_{\mathcal{F}}$, so the difference-of-means gap $\bar{a}^{(1)}_{\mathcal{T}} - \bar{a}^{(1)}_{\mathcal{F}} \neq 0$. The estimator

$$\bm{d} \;=\; \frac{\bm{m}_{\mathcal{T}} - \bm{m}_{\mathcal{F}}}{\lVert\bm{m}_{\mathcal{T}} - \bm{m}_{\mathcal{F}}\rVert}, \qquad \bm{m}_{\mathcal{T}} = \mathbb{E}[\bm{u}\mid\mathcal{T}], \quad \bm{m}_{\mathcal{F}} = \mathbb{E}[\bm{u}\mid\mathcal{F}], \tag{12}$$

defines the attractor direction, and steered self-conditioning feeds back $\tilde{\bm{x}}_k = \hat{\bm{x}}_k - \lambda\,\bm{d}$ with strength $\lambda \geq 0$.

###### Proposition 1 (Difference of means recovers the attractor direction).

Split the feedback into the repetition axis and the rest, $\bm{u}=\bm{u}^{\star}+a^{(1)}\bm{v}_{1}+\bm{w}$, where the off-axis content $\bm{w}\perp\bm{v}_{1}$ collects the drive $\bm{f}_{k}$ and remainder $\bm{r}_{k}$ of Lemma [2](https://arxiv.org/html/2607.00588#Thmlemma2) (both $\perp\bm{v}_{1}$). Assume *separability*: the trapped/free split acts only through the on-axis coordinate $a^{(1)}$, so the off-axis $\bm{w}$ has a group-independent mean, $\mathbb{E}[\bm{w}\mid\mathcal{T}]=\mathbb{E}[\bm{w}\mid\mathcal{F}]$. Then the difference of means is parallel to the repetition axis, $\bm{d}\parallel\bm{v}_{1}$ (empirically partial; see the remark).

###### Proof.

Take the group-conditional mean under the model: for $\mathcal{G}\in\{\mathcal{T},\mathcal{F}\}$,

$$\bm{m}_{\mathcal{G}}=\mathbb{E}[\bm{u}\mid\mathcal{G}]=\bm{u}^{\star}+\bar{a}^{(1)}_{\mathcal{G}}\,\bm{v}_{1}+\mathbb{E}[\bm{w}\mid\mathcal{G}],\qquad\bar{a}^{(1)}_{\mathcal{G}}:=\mathbb{E}[a^{(1)}\mid\mathcal{G}],$$

where $\bar{a}^{(1)}_{\mathcal{G}}$ is the group-mean on-axis coordinate. Subtracting the two groups,

$$\bm{m}_{\mathcal{T}}-\bm{m}_{\mathcal{F}}=\underbrace{(\bm{u}^{\star}-\bm{u}^{\star})}_{=\,\bm{0}}+\bigl(\bar{a}^{(1)}_{\mathcal{T}}-\bar{a}^{(1)}_{\mathcal{F}}\bigr)\bm{v}_{1}+\underbrace{\bigl(\mathbb{E}[\bm{w}\mid\mathcal{T}]-\mathbb{E}[\bm{w}\mid\mathcal{F}]\bigr)}_{=\,\bm{0}}:\tag{13}$$

the fixed point $\bm{u}^{\star}$ is shared by both groups and cancels, and the off-axis term $\mathbb{E}[\bm{w}\mid\mathcal{T}]-\mathbb{E}[\bm{w}\mid\mathcal{F}]$ vanishes by the separability assumption. Only the on-axis term survives, so $\bm{m}_{\mathcal{T}}-\bm{m}_{\mathcal{F}}=(\bar{a}^{(1)}_{\mathcal{T}}-\bar{a}^{(1)}_{\mathcal{F}})\bm{v}_{1}\parallel\bm{v}_{1}$, with scalar $\bar{a}^{(1)}_{\mathcal{T}}-\bar{a}^{(1)}_{\mathcal{F}}\neq 0$ since the trapped group sits deeper along $\bm{v}_{1}$ (Definition [1](https://arxiv.org/html/2607.00588#Thmdefinition1)). Normalizing, $\bm{d}\parallel\bm{v}_{1}$ (up to sign). ∎
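A small synthetic check makes the proposition concrete. The sketch below is an illustration under stated assumptions, not the paper's pipeline: the axis `v1`, the fixed point `u_star`, the Gaussian on-axis coordinate, and the noisy rep score are all invented for the example. It builds feedback vectors of the form $\bm{u}=\bm{u}^{\star}+a^{(1)}\bm{v}_{1}+\bm{w}$ with $\bm{w}\perp\bm{v}_{1}$, lets the rep score depend on $a^{(1)}$ only (separability), and measures how well the tertile difference of means recovers $\bm{v}_{1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 32, 4000

v1 = rng.normal(size=dim)
v1 /= np.linalg.norm(v1)                 # hypothetical repetition axis
u_star = rng.normal(size=dim)            # hypothetical fixed point

a1 = rng.normal(size=n)                  # signed on-axis coordinate a^(1)
w = rng.normal(size=(n, dim))
w -= np.outer(w @ v1, v1)                # off-axis content, orthogonal to v1
u = u_star + np.outer(a1, v1) + w        # u = u* + a^(1) v1 + w

# Separability: the rep score depends on a^(1) only (plus independent noise).
rep = a1 + 0.1 * rng.normal(size=n)


def difference_of_means(u, rep):
    """Unit vector along m_T - m_F over the top/bottom repetition tertiles."""
    lo, hi = np.quantile(rep, [1 / 3, 2 / 3])
    m_trapped = u[rep >= hi].mean(axis=0)
    m_free = u[rep <= lo].mean(axis=0)
    diff = m_trapped - m_free
    return diff / np.linalg.norm(diff)


def steer(x_hat, d, lam=2.0):
    """Steered feedback x_tilde = x_hat - lam * d (Definition 1)."""
    return x_hat - lam * d


d = difference_of_means(u, rep)
cos_d_v1 = float(d @ v1)
print(f"cos(d, v1) = {cos_d_v1:.3f}")
```

With exact separability the only misalignment left is finite-sample noise in the off-axis means, so the cosine sits close to one in this toy; the partial alignment reported for the trained loop indicates that the assumption holds only approximately there.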

#### $\bm{d}$ is the empirical counterpart of $\bm{v}_{1}$.

The measured alignment is $\lvert\cos(\bm{v}_{1},\bm{d})\rvert\leq 0.55$, and $\bm{d}$ steers better than the single-point $\bm{v}_{1}$ itself (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)) because it averages the basin-entry drift over the trajectory and many samples. We take Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1) as the idealized motivation and $\bm{d}$ as the operational direction.

### A.3 Consequence 2: a closed-form steering window

###### Proposition 2 (Steering threshold and usable window).

Apply Definition [1](https://arxiv.org/html/2607.00588#Thmdefinition1) with $\bm{d}=\bm{v}_{1}$ (Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1)). Track the signed axis coordinate $c_{k}=\langle\bm{u}_{k}-\bm{u}^{\star},\bm{v}_{1}\rangle$ (the signed $a^{(1)}$ of Definition [1](https://arxiv.org/html/2607.00588#Thmdefinition1)) and the distance from the repetition fixed point $D_{k}=-c_{k}\geq 0$ ($+\bm{v}_{1}$ points toward $\bm{u}^{\star}$, so repetition is proximity to $\bm{u}^{\star}$ and the rep score decreases in $D_{k}$). Projecting the steered recursion on $\bm{v}_{1}$ and *retaining* the drive component $f^{(1)}=\langle\bm{f},\bm{v}_{1}\rangle$ (proof) gives

$$c_{k+1}=\mu_{1}\bigl(c_{k}-\lambda\bigr)+f^{(1)},\qquad\text{equivalently}\qquad D_{k+1}=\mu_{1}\bigl(D_{k}+\lambda\bigr)+\lvert f^{(1)}\rvert,\tag{14}$$

with unique stable steady state

$$D_{\infty}\;=\;\frac{1-\rho}{\rho}\,\lambda\;+\;\frac{\lvert f^{(1)}\rvert}{\rho}.\tag{15}$$

The first term is the steering-induced outward drift; the second is the baseline offset of Lemma [3](https://arxiv.org/html/2607.00588#Thmlemma3) ($D_{\infty}=\lvert f^{(1)}\rvert/\rho$ at $\lambda=0$), so steering adds to a nonzero starting distance, not to zero. Let $a_{\mathrm{crit}}$ be the distance above which the orbit is non-repetitive and $R$ the local manifold radius. In this steady-state scalar approximation repetition is escaped once $\lambda\geq\lambda^{\star}$ and the decode stays on-manifold while $\lambda\leq\lambda_{\max}$, with

$$\lambda^{\star}\;=\;\frac{\rho\,a_{\mathrm{crit}}-\lvert f^{(1)}\rvert}{1-\rho},\qquad\lambda_{\max}\;=\;\frac{\rho\,R-\lvert f^{(1)}\rvert}{1-\rho}.\tag{16}$$

Both thresholds lie below their $f^{(1)}=0$ values by $\lvert f^{(1)}\rvert/(1-\rho)$ (the baseline head start lowers the escape dose), while the usable window $[\lambda^{\star},\lambda_{\max}]$ keeps width $(R-a_{\mathrm{crit}})\,\rho/(1-\rho)$. We do not instantiate $a_{\mathrm{crit}}$, $R$, or $f^{(1)}$ on ELF, so the window is qualitative; its numeric range is the empirical $\lambda$-sweep of App. [C.2](https://arxiv.org/html/2607.00588#A3.SS2).

###### Proof.

*Steered coordinate.* Steering feeds back $\hat{\bm{x}}_{k}-\lambda\bm{v}_{1}$, so $\bm{a}_{k+1}=\bm{J}(\bm{a}_{k}-\lambda\bm{v}_{1})+\bm{f}_{k}$. With $\bm{J}$ symmetric (Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1), so $\langle\bm{J}\bm{x},\bm{v}_{1}\rangle=\mu_{1}\langle\bm{x},\bm{v}_{1}\rangle$), projecting on $\bm{v}_{1}$ and keeping the drive component $f^{(1)}=\langle\bm{f}_{k},\bm{v}_{1}\rangle$ closes the scalar recursion $c_{k+1}=\mu_{1}(c_{k}-\lambda)+f^{(1)}$ of equation [14](https://arxiv.org/html/2607.00588#A1.E14), decoupled from the off-axis content. (Lemma [2](https://arxiv.org/html/2607.00588#Thmlemma2) dropped $f^{(1)}$ as small; we retain it here so that $\lambda=0$ recovers the baseline offset $c_{\infty}=f^{(1)}/\rho$ of Lemma [3](https://arxiv.org/html/2607.00588#Thmlemma3) rather than a collapse to $\bm{u}^{\star}$.) Since $+\bm{v}_{1}$ points toward $\bm{u}^{\star}$ while $\bm{a}=\bm{u}-\bm{u}^{\star}$ is the outward displacement, $c_{k}\leq 0$, and the drive points outward ($f^{(1)}\leq 0$); negating gives the distance form $D_{k+1}=\mu_{1}(D_{k}+\lambda)+\lvert f^{(1)}\rvert$, an outward drift forced by $\bm{d}$, not assumed.

*Steady state.* Setting $D_{k+1}=D_{k}=D_{\infty}$ in equation [14](https://arxiv.org/html/2607.00588#A1.E14) gives $(1-\mu_{1})\,D_{\infty}=\mu_{1}\lambda+\lvert f^{(1)}\rvert$, so, substituting $\rho=1-\mu_{1}$, $D_{\infty}=(1-\rho)\lambda/\rho+\lvert f^{(1)}\rvert/\rho$ as in equation [15](https://arxiv.org/html/2607.00588#A1.E15). Subtracting this relation from the recursion leaves $D_{k+1}-D_{\infty}=\mu_{1}(D_{k}-D_{\infty})$, which decays as $\mu_{1}^{\,k}$ ($\mu_{1}<1$): the steady state is unique and stable on this axis.

*Window.* $D_{\infty}$ grows linearly in $\lambda$ from the baseline $\lvert f^{(1)}\rvert/\rho$, clearing $a_{\mathrm{crit}}$ once $\lambda\geq\lambda^{\star}$ and staying below $R$ while $\lambda\leq\lambda_{\max}$:

$$D_{\infty}\geq a_{\mathrm{crit}}\iff\lambda\geq\lambda^{\star},\qquad D_{\infty}\leq R\iff\lambda\leq\lambda_{\max},$$

with $\lambda^{\star},\lambda_{\max}$ as in equation [16](https://arxiv.org/html/2607.00588#A1.E16) (each inequality solved for $\lambda$). ∎
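The scalar recursion and its window are easy to simulate. The sketch below is a minimal illustration, reading $\rho$ as $1-\mu_{1}$ (which is what the steady-state step of the proof implies). The gain $\mu_{1}=0.15$ is the measured value quoted in App. A.4, while `f_abs`, `a_crit`, and `R` are hypothetical numbers chosen only to exercise the formulas; the paper explicitly does not instantiate them on ELF.

```python
def steady_state(lam, mu1, f_abs):
    """Closed-form D_inf of Eq. (15), with rho = 1 - mu1."""
    rho = 1.0 - mu1
    return (1.0 - rho) / rho * lam + f_abs / rho


def window(mu1, f_abs, a_crit, radius):
    """Thresholds (lambda_star, lambda_max) of Eq. (16)."""
    rho = 1.0 - mu1
    lam_star = (rho * a_crit - f_abs) / (1.0 - rho)
    lam_max = (rho * radius - f_abs) / (1.0 - rho)
    return lam_star, lam_max


def iterate(lam, mu1, f_abs, d0=0.0, steps=60):
    """Run the distance recursion D_{k+1} = mu1 * (D_k + lam) + |f| of Eq. (14)."""
    dist = d0
    for _ in range(steps):
        dist = mu1 * (dist + lam) + f_abs
    return dist


mu1 = 0.15                          # leading gain reported in App. A.4
f_abs, a_crit, radius = 0.05, 0.3, 1.2   # hypothetical values, not measured

lam_star, lam_max = window(mu1, f_abs, a_crit, radius)
print(f"lambda_star = {lam_star:.4f}, lambda_max = {lam_max:.4f}")
print(f"D_inf at lambda_star (iterated)    = {iterate(lam_star, mu1, f_abs):.6f}")
print(f"D_inf at lambda_max  (closed form) = {steady_state(lam_max, mu1, f_abs):.6f}")
```

Under these made-up inputs the iterated distance at $\lambda^{\star}$ lands on $a_{\mathrm{crit}}$, the closed form at $\lambda_{\max}$ lands on $R$, and the window width equals $(R-a_{\mathrm{crit}})\,\rho/(1-\rho)$ regardless of the drive term.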

Equation [16](https://arxiv.org/html/2607.00588#A1.E16) is the formal content of our empirical picture. Steering removes the pull into the basin without touching the off-$\bm{v}_{1}$ content $\bm{w}$ that blanket soft-SC attenuates indiscriminately (§[5](https://arxiv.org/html/2607.00588#S5)). A finite window $[\lambda^{\star},\lambda_{\max}]$ exists (Fig. [5](https://arxiv.org/html/2607.00588#A3.F5)), with $\lambda\approx 2$ the conservative knee: repetition is monotone-decreasing in $\lambda$ through $\lambda\approx 4$, so $\lambda\approx 2$ is the smallest comfortably-delivering dose, not the rep-minimizer. Over-steering past the upper edge eventually spikes non-words (by $\lambda\approx 8$), as predicted by $\lambda_{\max}$.

### A.4 Empirical validation on the trained loop

#### Existence, contraction, and local uniqueness of the fixed point.

We test the premises of Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1) and the implicit-function note of App. [A.1](https://arxiv.org/html/2607.00588#A1.SS1) directly on the trained loop. Freezing $(\bm{z}_{k},t_{k})$ at the formed basin (fraction $0.85$) and iterating the full step $\bm{u}\leftarrow\bm{g}(\bm{u})$ from the captured base point, the relative residual $\lVert\bm{g}(\bm{u})-\bm{u}\rVert/\lVert\bm{u}\rVert$ starts at $0.07$ and settles near $0.06$, the floor set by the step's sampling noise: $\bm{g}$ has a near-fixed point, and the captured base point already sits on it (pooled $\bm{u}^{\star}$ captured vs. iterated, $\cos=1.00$). Power-iterating the Jacobian there gives a leading gain $\mu_{1}\approx 0.15<1$ (captured and strictly-iterated base points agree, $0.148$ vs. $0.149$): the loop is contracting, so $\bm{I}-\bm{J}$ is invertible, the implicit-function premise ($\mu_{1}\neq 1$) holds, and the driven fixed point is locally unique. The recovered axis is itself stable across the two base points ($\lvert\cos(\bm{v}_{1}^{\mathrm{cap}},\bm{v}_{1}^{\mathrm{strict}})\rvert=0.999$, and $0.999$ to the saved $\bm{v}_{1}$ used in Fig. [2](https://arxiv.org/html/2607.00588#S4.F2)), the empirical face of local uniqueness.

#### A dominant contracting mode emerges as the basin forms.

When we estimate $\bm{J}$ by the matrix-free power iteration of Alg. [2](https://arxiv.org/html/2607.00588#alg2) ($n=100$ trajectories, $30$ iterations) at base points of increasing convergence (trajectory fraction $0.5\to 0.85$), the predicted structure appears: the spectral gap $\mu_{1}/\mu_{2}$ rises from $\approx 1$ to $2.7$ as one mode separates, and the label-free $\bm{d}$ aligns with it ($\lvert\cos(\bm{v}_{1},\bm{d})\rvert$ up to $0.55$, well above the random floor $1/\sqrt{e}=0.044$), with a perturbation along $\bm{d}$ amplified $2.2\times$ over a random one (Tab. [10](https://arxiv.org/html/2607.00588#A2.T10)). So $\bm{d}$ aligns with the measured dominant mode of the loop, the empirical counterpart of the idealized $\bm{d}\parallel\bm{v}_{1}$ (Prop. [1](https://arxiv.org/html/2607.00588#Thmproposition1)), with the alignment only partial; the full estimator anatomy and data efficiency are in App. [B](https://arxiv.org/html/2607.00588#A2).

#### The drive is off the repetition axis (Lemma [2](https://arxiv.org/html/2607.00588#Thmlemma2)).

The decomposition $\Delta\bm{u}_{k}=\beta_{k}\bm{v}_{1}+\bm{r}_{k}+\bm{f}_{k}$ resolves a seeming contradiction. The observed drift is dominated by the denoising drive $\bm{f}_{k}$, so its leading direction (top principal component of $\Delta\bm{u}$, noise-averaged and pooled over the late third of steps, $n=200$ trajectories) is near-orthogonal to both the repetition eigenvector ($\lvert\cos\rvert=0.15$, as is the per-step drive itself, $\lvert\cos(\bm{f}_{k},\bm{v}_{1})\rvert=0.15$) and the steering direction ($\lvert\cos\rvert=0.09$); but the Jacobian, a derivative, cancels the perturbation-independent $\bm{f}_{k}$ and exposes the self-amplified mode, whose leading eigenvector is $\bm{d}$. Drift and Jacobian thus agree rather than conflict, one reading the forcing $\bm{f}_{k}$ and the other the self-amplified mode $\beta_{k}\bm{v}_{1}$; ACE acts along $\bm{d}\perp\bm{f}_{k}$, leaving the drive intact (§[5](https://arxiv.org/html/2607.00588#S5)).

## Appendix B Anatomy of $\bm{d}$ (extended)

This appendix collects the evidence behind §[5](https://arxiv.org/html/2607.00588#S5).

#### One direction across samplers.

A single base-config $\bm{d}$ steers across steps, guidance, $\gamma$, ODE/SDE, noise scales, seeds and model sizes at near-native repetition reduction. Re-estimating $\bm{d}$ under each setting recovers nearly the same axis: cosine $0.82$–$0.96$ to $\bm{d}_{B}$ over the knob and size re-estimates of Table [7](https://arxiv.org/html/2607.00588#A2.T7) ($0.86$–$0.94$ across the step sweep alone; §[6](https://arxiv.org/html/2607.00588#S6)). Pooling the feedback over six samplers ($1500$ trajectories) and re-estimating barely moves it: $\cos(\bm{d}_{\mathrm{pooled}},\bm{d}_{\mathrm{base}})=0.89$. The direction is thus a property of the model, not the operating point; the optimal strength $\lambda$ (not the direction) absorbs the configuration dependence.

Table 7: One direction: $\bm{d}$ re-estimated per size or per knob recovers nearly the same axis, and the transferred ELF-B $\bm{d}_{B}$ steers near-natively. $\lambda=2$; $64$ steps and canonical knobs unless noted; $n=500$ (operating-point row $n=1000$); seeds row: per-seed medians; – = too few accepted ($<20$) to score.

#### Estimator robustness.

Difference-of-means is insensitive to its design choices. Holding the feedback batch fixed and varying only the split, the recovered direction is near-identical: cosine to the base $\bm{d}$ lies in $[0.97,1.00]$ across the top-repetition tertile fraction ($10$–$50\%$) and across which trajectory half is pooled, and is near zero for the random or label-permuted controls (Tab. [8](https://arxiv.org/html/2607.00588#A2.T8)). The sterner test re-estimates $\bm{d}$ from a *fresh* feedback batch at each setting (at the $\gamma=1.0$ operating point) and compares it with the frozen, deployed base $\bm{d}$, which folds in sampling noise. Even then the cosine stays $0.87$–$0.93$ across tertile fractions and $0.88$–$0.91$ across step windows, with the recipe itself re-estimated this way landing at $0.92$, the finite-sample noise floor every variant sits at. So $\bm{d}$ is not an artifact of a tuned split.

Table 8: The difference-of-means direction $\bm{d}$ is robustly recoverable. $\cos$ to the reference $\bm{d}$ on a fixed feedback batch, varying only the split (top-repetition tertile fraction, trajectory step window); near zero for the two degenerate controls. Main-text direction comparison: Tab. [5](https://arxiv.org/html/2607.00588#S5.T5).

#### Dose form: fixed $\lambda$ vs. per-step projection.

At the fixed direction $\bm{d}$, exact per-step projection (subtract the instantaneous $\bm{d}$-component) *under*-doses, because the loop re-amplifies the removed component before the next step; a fixed $\lambda=2$ that pre-compensates that gain reduces repetition more (Tab. [9](https://arxiv.org/html/2607.00588#A2.T9)). The gain is the loop's Jacobian amplification (Tab. [10](https://arxiv.org/html/2607.00588#A2.T10)), and the full $\lambda$ sweep and usable window are in App. [C.2](https://arxiv.org/html/2607.00588#A3.SS2).

Table 9: A fixed dose $\lambda=2$ beats exact per-step projection. Dose forms at the fixed difference-of-means $\bm{d}$, $\gamma=1.0$ (rep is the median); clean-PPL is a guardrail. Main-text direction comparison: Tab. [5](https://arxiv.org/html/2607.00588#S5.T5).

#### How the estimators are computed (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)).

All steer at $\lambda=2$, and the supervised ones share one trapped/free label set (the top/bottom repetition tertiles that difference-of-means uses). *Difference-of-means* ($\bm{d}$) subtracts the two class means, $\bm{\mu}_{T}-\bm{\mu}_{F}$. *Top-PC* is the leading principal component of the mean-centered pooled feedback (unsupervised). *Logistic* is the weight vector of a logistic classifier fit to those labels. *LDA* is the regularized Fisher discriminant $\widehat{\bm{\Sigma}}_{w}^{-1}(\bm{\mu}_{T}-\bm{\mu}_{F})$, with a trace-shrinkage of the pooled within-class covariance $\widehat{\bm{\Sigma}}_{w}$. The Jacobian $\bm{v}_{1}$ is the power-iteration mode (Alg. [2](https://arxiv.org/html/2607.00588#alg2)); *random* and *permuted* are the sign-scrambled and label-shuffled controls.
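Three of these estimators fit in a few lines of NumPy. The sketch below is one plausible reading of the descriptions above rather than the authors' code: the shrinkage weight `alpha`, the shrinkage target (a scaled identity with the same trace), and the synthetic data with a large nuisance direction are all assumptions made for illustration. The logistic and Jacobian variants are omitted.

```python
import numpy as np


def unit(v):
    return v / np.linalg.norm(v)


def difference_of_means(X, y):
    """mu_T - mu_F, normalized; y is 1 for trapped, 0 for free."""
    return unit(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))


def top_pc(X):
    """Leading principal component of the mean-centered data (sign-ambiguous)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]


def shrinkage_lda(X, y, alpha=0.1):
    """Fisher discriminant with the within-class covariance shrunk toward
    (trace / dim) * I. The weight alpha is a hypothetical choice."""
    X1, X0 = X[y == 1], X[y == 0]
    resid = np.vstack([X1 - X1.mean(axis=0), X0 - X0.mean(axis=0)])
    S = resid.T @ resid / (len(resid) - 2)
    dim = S.shape[0]
    S_shrunk = (1.0 - alpha) * S + alpha * (np.trace(S) / dim) * np.eye(dim)
    return unit(np.linalg.solve(S_shrunk, X1.mean(axis=0) - X0.mean(axis=0)))


# Synthetic feedback: class shift along `axis`, large label-independent
# variance along `nuisance` (stand-in for sample-varying denoising content).
rng = np.random.default_rng(0)
dim, n = 16, 2000
axis = np.zeros(dim)
axis[0] = 1.0
nuisance = np.zeros(dim)
nuisance[1] = 1.0
y = rng.integers(0, 2, size=n)
X = (rng.normal(size=(n, dim))
     + np.outer(2 * y - 1, axis)
     + 5.0 * np.outer(rng.normal(size=n), nuisance))

cosines = {
    "difference_of_means": float(difference_of_means(X, y) @ axis),
    "top_pc": float(abs(top_pc(X) @ axis)),
    "shrinkage_lda": float(shrinkage_lda(X, y) @ axis),
}
for name, value in cosines.items():
    print(f"{name:>20s}: cos to axis = {value:.3f}")
```

In this toy the unsupervised top PC locks onto the high-variance nuisance direction while both supervised estimators recover the labeled axis, which mirrors the qualitative ordering the text reports without reproducing any of its numbers.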

#### Supervision matters.

$\bm{d}$ correlates with the feedback's leading principal component ($\lvert\cos(\bm{d},\text{PC}_{1})\rvert=0.73$), but the difference-of-means is what makes it reliable. The unsupervised $\text{PC}_{1}$ (sign-ambiguous) and the logistic discriminant recover the axis less faithfully and steer worse, and the *theory-optimal* Jacobian eigenvector $\bm{v}_{1}$ (the direction Proposition [1](https://arxiv.org/html/2607.00588#Thmproposition1) says $\bm{d}$ should equal) steers worst of all; only the regularized LDA matches difference-of-means (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)). Difference-of-means wins not by being more aggressive but because it averages the basin-entry drift over the whole trajectory and many samples, where a single-point linearization or a logistic classifier overfits.

#### Placebo controls.

A random direction at the same dose is the sharpest control: it raises repetition rather than reducing it, since subtracting an uninformed vector is a pure off-manifold perturbation. So steering is not "any feedback perturbation helps": only directions aligned with the attractor axis reduce repetition, and the reduction grows with alignment. On-manifold placebos agree: the same pipeline on shuffled labels, or on random halves within one tertile class, also fails ($\cos\leq 0.19$). The one instructive exception, a within-trapped split that happens to capture residual depth spread ($\cos 0.28$), recovers part of the effect: even among placebos the effect tracks alignment.

#### The dominant mode of the feedback loop.

A matrix-free power iteration on the self-conditioning Jacobian $\bm{J}=\mathrm{D}\bm{s}(\bm{u}^{\star})$ recovers its leading mode $\bm{v}_{1}$ (Alg. [2](https://arxiv.org/html/2607.00588#alg2)): each $\bm{J}\bm{v}$ is a central finite difference (perturb the fed-back estimate by $\pm\varepsilon\bm{v}$ uniformly over positions, take one denoising step, pool the change; the drive cancels, leaving the self-conditioning response), iterated from base points captured along the trajectory. The Jacobian is averaged over $n=100$ trajectories and power-iterated $30$ steps, with $\bm{v}_{1}$ oriented so the feedback projection rises with repetition. Since the finite-difference $\bm{J}$ is generally non-symmetric, the $\mu_{i}$ are operator gains (singular-value-like norms), not necessarily eigenvalues unless $\bm{J}$ is symmetrized. The predicted structure *emerges* as the trajectory converges (Table [10](https://arxiv.org/html/2607.00588#A2.T10), traced in Fig. [2](https://arxiv.org/html/2607.00588#S4.F2)b): the spectral gap $\mu_{1}/\mu_{2}$ rises from $1.0$ to $2.7$ as subdominant modes collapse, $\bm{d}$ aligns with the leading eigenvector up to $\lvert\cos\rvert=0.55$ ($12\times$ the random baseline $1/\sqrt{e}=0.044$), and a perturbation along $\bm{d}$ is amplified $2.2\times$ over a random one. $\bm{d}$ therefore tracks the loop's measured dominant mode, not merely a statistical correlate of repetition, though recovering the single-point eigenvector directly steers worse than difference-of-means (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)). This is the same gain the loop re-applies to any component along $\bm{d}$ each pass, i.e. the amplification factor the fixed dose pre-compensates in the $\lambda^{\star}$ check (§[5](https://arxiv.org/html/2607.00588#S5)). The alignment is strong but not unit, consistent with $\bm{d}$ tracking the dominant mode of an *evolving* Jacobian rather than a single static operator.

Algorithm 2 Attractor modes $\bm{v}_{1},\bm{v}_{2}$: matrix-free deflated power iteration on the feedback Jacobian

1. Require: one-step map $\bm{g}$ at trajectory fraction $f$ (the formed basin, $f\approx 0.85$); iterations $N$; probe scale $\varepsilon$
2. Ensure: modes $\bm{v}_{1},\bm{v}_{2}$ and gains $\mu_{1},\mu_{2}$ (spectral gap $\mu_{1}/\mu_{2}$)
3. iterate the loop $\bm{g}$ forward (baseline run) and capture its feedback $\hat{\bm{x}}_{k}$ at step $k=\lfloor fT\rfloor$ (near-converged)
4. $\bm{u}^{\star}\leftarrow\tfrac{1}{L}\sum_{l=1}^{L}\hat{\bm{x}}_{k}[l]$ $\triangleright$ captured from the run, not solved
5. $\varepsilon\leftarrow 0.01\,\operatorname{std}(\bm{u}^{\star})$
6. $\bm{J}\bm{v}:=\big[\bm{g}(\bm{u}^{\star}+\varepsilon\bm{v})-\bm{g}(\bm{u}^{\star}-\varepsilon\bm{v})\big]/(2\varepsilon)$ $\triangleright$ drive $\bm{f}_{k}$ cancels
7. $\bm{v}_{1}\sim\mathcal{N}(\bm{0},\bm{I})$, normalized
8. for $n=1,\dots,N$ do
9. $\bm{w}\leftarrow\bm{J}\bm{v}_{1}$; $\mu_{1}\leftarrow\lVert\bm{w}\rVert$; $\bm{v}_{1}\leftarrow\bm{w}/\mu_{1}$ $\triangleright$ (Golub and Van Loan, [2013](https://arxiv.org/html/2607.00588#bib.bib10))
10. end for
11. $\bm{v}_{2}\sim\mathcal{N}(\bm{0},\bm{I})$, orthogonalized against $\bm{v}_{1}$ and normalized
12. for $n=1,\dots,N$ do
13. $\bm{w}\leftarrow\bm{J}\bm{v}_{2}-\langle\bm{J}\bm{v}_{2},\bm{v}_{1}\rangle\bm{v}_{1}$; $\mu_{2}\leftarrow\lVert\bm{w}\rVert$; $\bm{v}_{2}\leftarrow\bm{w}/\mu_{2}$ $\triangleright$ deflated
14. end for
15. return $\bm{v}_{1},\bm{v}_{2},\mu_{1},\mu_{2}$
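The steps of Algorithm 2 translate almost line for line into NumPy. The sketch below is a minimal rendering under stated assumptions: the one-step map `g` is a toy affine map with a known symmetric Jacobian (so the answer can be checked), the gains $0.15$ and $0.055$ are chosen to give a gap near $2.7$, and the per-position pooling and trajectory averaging of the real procedure are left out.

```python
import numpy as np


def attractor_modes(g, u_star, n_iters=30, eps_scale=0.01, seed=0):
    """Deflated, matrix-free power iteration on the Jacobian of g at u_star."""
    rng = np.random.default_rng(seed)
    eps = eps_scale * np.std(u_star)

    def jvp(v):
        # Central finite difference; any perturbation-independent drive cancels.
        return (g(u_star + eps * v) - g(u_star - eps * v)) / (2.0 * eps)

    v1 = rng.normal(size=u_star.shape)
    v1 /= np.linalg.norm(v1)
    for _ in range(n_iters):
        w = jvp(v1)
        mu1 = np.linalg.norm(w)
        v1 = w / mu1

    v2 = rng.normal(size=u_star.shape)
    v2 -= (v2 @ v1) * v1              # orthogonalize against v1
    v2 /= np.linalg.norm(v2)
    for _ in range(n_iters):
        w = jvp(v2)
        w -= (w @ v1) * v1            # deflate the leading mode
        mu2 = np.linalg.norm(w)
        v2 = w / mu2
    return v1, v2, mu1, mu2


# Toy one-step map (hypothetical): affine, with a constant drive term.
rng = np.random.default_rng(1)
dim = 8
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
gains = np.array([0.15, 0.055, 0.02, 0.01, 0.01, 0.005, 0.005, 0.001])
J_true = Q @ np.diag(gains) @ Q.T
u_base = rng.normal(size=dim)         # captured base point, not a solved fixed point
drive = rng.normal(size=dim)


def g(u):
    return u_base + J_true @ (u - u_base) + drive


v1, v2, mu1, mu2 = attractor_modes(g, u_base)
print(f"mu1 = {mu1:.4f}, mu2 = {mu2:.4f}, gap = {mu1 / mu2:.2f}")
```

Because the toy Jacobian is symmetric, the recovered gains coincide with its two largest eigenvalues; for a non-symmetric Jacobian the same loop returns operator gains, as the surrounding text cautions.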

Table 10: As the basin forms, a dominant mode emerges and $\bm{d}$ aligns with it. Jacobian power iteration ($\gamma=1.5$ runs); the gap near $1$ at fraction $0.50$ reflects no clearly dominant mode yet (power iteration near-degenerate there, $\lvert\cos\rvert=0.37$).

### B.1 Is difference-of-means optimal? A directly-optimized direction

Algorithm 3 Direct optimization of the steering direction (black-box, deployment objective)

1. Require: model $f_{\theta}$, base direction $\bm{d}$, strength $\lambda=2$
2. collect $N=300$ baseline trajectories' mean feedback $\{\bm{s}_{n}\}$
3. $\text{PC}_{1..10}\leftarrow$ top right-singular vectors of the centered $\{\bm{s}_{n}\}$
4. $B\leftarrow\textsc{Orthonormalize}([\bm{d},\text{PC}_{1},\dots,\text{PC}_{10}])$ $\triangleright$ row $0=\bm{d}$; $10$ orthogonal escape directions
5. function Eval($\bm{v},n,\text{seed}$)
6. generate $n$ samples steered by $\lambda\bm{v}$
7. return (rep median, Gen-PPL)
8. end function
9. $\bm{\theta}^{\ast}\leftarrow\bm{e}_{0}$; $(r_{0},p_{0})\leftarrow\textsc{Eval}(\bm{d},120)$ $\triangleright$ start exactly at $\bm{d}$
10. for $t=0,\dots,25$ do
11. $\bm{\theta}\leftarrow\bm{\theta}^{\ast}+\mathcal{N}(0,\sigma_{t}^{2}I)$, $\sigma_{t}=0.35\cdot 0.93^{\,t}$ $\triangleright$ annealed proposal
12. $(r,p)\leftarrow\textsc{Eval}(\textsc{Norm}(\bm{\theta}B),120,\text{seed}_{t})$
13. $s\leftarrow r+5\cdot\max(0,\,p-1.1\,p_{0})$ $\triangleright$ constrained; unconstrained: $s\leftarrow r$
14. if $s<s^{\ast}$ then $\bm{\theta}^{\ast}\leftarrow\bm{\theta}$
15. end for
16. return $\textsc{Eval}(\textsc{Norm}(\bm{\theta}^{\ast}B),600,\text{fresh seed})$ vs $\textsc{Eval}(\bm{d},600,\text{fresh seed})$ $\triangleright$ held-out comparison
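The search loop of Algorithm 3 can be sketched as a short annealed random search. Several pieces below are assumptions rather than the paper's implementation: `evaluate` is a synthetic stand-in for generating and scoring samples (it takes only the direction, with no sample count or seed), the best score is initialized to the score of $\bm{d}$ and updated on acceptance (the printed algorithm does not spell out how $s^{\ast}$ is maintained), and the principal components are random vectors.

```python
import numpy as np


def orthonormalize(rows):
    """QR-based Gram-Schmidt; row 0 of the result keeps the direction of rows[0]."""
    Q, R = np.linalg.qr(np.asarray(rows).T)
    Q = Q * np.sign(np.diag(R))       # undo QR sign flips column by column
    return Q.T


def search_direction(evaluate, B, n_rounds=26, sigma0=0.35, decay=0.93,
                     ppl_slack=1.1, penalty=5.0, seed=0):
    """Annealed random search over coefficients theta in the basis B."""
    rng = np.random.default_rng(seed)
    theta_best = np.zeros(B.shape[0])
    theta_best[0] = 1.0               # start exactly at d
    r0, p0 = evaluate(B[0])
    s_best = r0                       # assumption: s* starts at the score of d
    for t in range(n_rounds):
        theta = theta_best + rng.normal(scale=sigma0 * decay**t,
                                        size=theta_best.shape)
        v = theta @ B
        v /= np.linalg.norm(v)
        r, p = evaluate(v)
        s = r + penalty * max(0.0, p - ppl_slack * p0)   # hinge quality guard
        if s < s_best:
            theta_best, s_best = theta, s
    v_best = theta_best @ B
    return v_best / np.linalg.norm(v_best), s_best


# Synthetic stand-in for Eval: repetition falls as v aligns with a hidden
# target close to d; Gen-PPL is held constant. Purely illustrative.
rng = np.random.default_rng(0)
dim = 16
d = rng.normal(size=dim)
d /= np.linalg.norm(d)
pcs = rng.normal(size=(10, dim))      # hypothetical principal components
B = orthonormalize(np.vstack([d, pcs]))
target = d + 0.3 * B[1]
target /= np.linalg.norm(target)


def evaluate(v):
    return 2.0 + 3.0 * (1.0 - float(v @ target)), 27.5


r0, _ = evaluate(B[0])
v_best, s_best = search_direction(evaluate, B)
print(f"score at d = {r0:.3f}, best score = {s_best:.3f}, "
      f"cos(v_best, d) = {float(v_best @ d):.3f}")
```

With 26 rounds of 120 generations each, the real search spends $26\times 120=3120$ generations, which matches the count quoted in the text; the toy replaces each of those evaluations with a single function call.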

#### Setup.

The alternatives above are cheap heuristics (full sweep in Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)); we also ask whether *any* direction does better by optimizing the deployment objective itself (Algorithm [3](https://arxiv.org/html/2607.00588#alg3)). Steering leverage lives where the feedback varies, so the search space is the span of $\bm{d}$ and the feedback's top principal components, starting exactly at $\bm{d}$. The *constrained* objective's hinge penalty is a quality guard that fires only when a candidate degrades Gen-PPL more than $10\%$ past the $\bm{d}$ reference; PPL is never minimized, since under our thesis minimizing Gen-PPL would walk *into* the attractor.

#### $\bm{d}$ is near-optimal.

An extensive search (up to $30$ seeds across the two objectives) shows that direct optimization lowers repetition only slightly further than $\bm{d}$, along essentially the same axis. On held-out seeds disjoint from the search ($n=600$), the *constrained* objective (minimize repetition subject to an on-manifold/Gen-PPL constraint) edges $\bm{d}$ on repetition (median held-out $1.85\%$ vs $2.11\%$, winning $24/30$ seeds) at matched Gen-PPL ($27.7$ vs $27.4$). The *unconstrained* objective (minimize repetition alone, $15$ seeds) lowers it a touch more (median $1.84\%$, winning $13/15$ seeds) at slightly worse Gen-PPL ($28.1$), while staying aligned with $\bm{d}$ (median $\cos=0.91$): it trades a little quality for a small gain along essentially the same axis. The basin is thus approximately one-dimensional: a single cheap, label-free difference-of-means estimate captures it nearly as well as a direct optimization costing $\sim 8\times$ the generations ($3120$ vs $400$).

Table 11: Direct optimization improves on difference-of-means only slightly, along the same axis, at $\sim 8\times$ the generation cost. Black-box search seeded at $\bm{d}$, held-out seeds (ELF-B, $\lambda=2$); – = reference row itself.

Table 12: One direction: $\bm{d}$ is a single low-dimensional axis of the self-conditioning feedback. Operating point (ELF-B, $64$ steps, $\gamma=1.0$), $n=1000$; its alignment with the Jacobian mode is Tab. [10](https://arxiv.org/html/2607.00588#A2.T10), its irreplaceability Tab. [5](https://arxiv.org/html/2607.00588#S5.T5).

#### Seed-robustness.

The estimator comparison (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)) is reported at the $\gamma=1.0$ operating point (§[5](https://arxiv.org/html/2607.00588#S5)), where difference-of-means recovers the most steerable direction: it beats the unsupervised top-PC, the Jacobian eigenvector (Tab. [10](https://arxiv.org/html/2607.00588#A2.T10)), and the logistic discriminant outright, and matches LDA. This is seed-robust: across three estimation seeds difference-of-means stays at or near the top (best in two, within $0.6$ points of LDA in the third), while the top-PC and the Jacobian eigenvector remain far worse ($\geq 5\%$ steered rep) in every seed.

#### Data efficiency.

$\bm{d}$ is cheap to estimate: the cosine between $\bm{d}$ from $n$ trajectories and the full-data direction reaches $0.69$ at $n=50$, $0.94$ at $n=200$, and $1.00$ at $n=800$. A few hundred unlabeled trajectories suffice, stable well before the repetition-tertile split matters.

#### Low-rank structure, one dominant self-amplified direction.

A PCA of the per-sample mean feedback (mean-centered over the $n=1000$ samples; $\text{PC}_{1}$ is its leading singular vector) concentrates the variance in a few components ($\text{PC}_{1}$ explains $18\%$, the top five $46\%$). The contraction makes $\bm{v}_{1}$ a strong variance mode (propagating an isotropic initial spread through $\bm{J}^{k}$ gives covariance $\mathrm{Cov}(\bm{a}_{k})=\sigma^{2}\bm{J}^{2k}=\sum_{i}\sigma^{2}\mu_{i}^{2k}\,\bm{v}_{i}\bm{v}_{i}^{\top}$, peaked on $\bm{v}_{1}$), so $\bm{d}$ concentrates in the top components ($53\%$ of $\lVert\bm{d}\rVert^{2}$ in $\text{PC}_{1}$, $82\%$ in the top five, $95\%$ in the top ten). But it is *subdominant*: the larger, sample-varying denoising variation pushes the raw top PC off $\bm{v}_{1}$, so $\lvert\cos(\bm{d},\text{PC}_{1})\rvert=0.73$ rather than $1$ and a plain PCA cannot cleanly isolate the axis. $\bm{d}$ is instead the dominant *self-amplification* mode, the feedback Jacobian's leading eigenvector (Tab. [10](https://arxiv.org/html/2607.00588#A2.T10)), which is why the unsupervised $\text{PC}_{1}$ steers worse (Tab. [5](https://arxiv.org/html/2607.00588#S5.T5)).
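The two kinds of percentage quoted here (variance explained by the top PCs, and the share of $\lVert\bm{d}\rVert^{2}$ lying in them) can be computed from one SVD. The sketch below uses random placeholder data and a random `d`, so its numbers carry no meaning beyond exercising the computation. As a consistency check on the reported figures, the share of $\lVert\bm{d}\rVert^{2}$ in $\text{PC}_{1}$ is just $\cos^{2}(\bm{d},\text{PC}_{1})$, and $0.73^{2}\approx 0.53$ agrees with the quoted $53\%$.

```python
import numpy as np


def energy_in_top_pcs(d, X, ks=(1, 5, 10)):
    """For each k, return (variance explained by the top-k PCs of X,
    fraction of ||d||^2 lying in their span)."""
    Xc = X - X.mean(axis=0)
    _, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = svals**2 / np.sum(svals**2)
    d_energy = (Vt @ d) ** 2 / (d @ d)
    return {k: (float(explained[:k].sum()), float(d_energy[:k].sum()))
            for k in ks}


# Placeholder data: 200 hypothetical per-sample mean feedback vectors in 12-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12)) * np.linspace(3.0, 0.5, 12)
d = rng.normal(size=12)

fractions = energy_in_top_pcs(d, X, ks=(1, 5, 12))
for k, (var_frac, d_frac) in fractions.items():
    print(f"top {k:2d} PCs: variance {var_frac:.2f}, share of |d|^2 {d_frac:.2f}")
```

Both fractions are cumulative and reach one when every component is included, which is a quick sanity check before reading off the top-one, top-five, and top-ten values.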

#### When repetition emerges.

Repetition *forms during* the trajectory: how far a sample's feedback drifts along $\bm{d}$ predicts its final repetition (Fig. [2](https://arxiv.org/html/2607.00588#S4.F2)), the orbit being drawn into the attractor (Assumption [1](https://arxiv.org/html/2607.00588#Thmassumption1)). A cheap mid-trajectory check on the feedback can therefore flag doomed samples and reject or reset them early, complementing the steering fix that removes the pull at its source.

#### What𝒅\\bm\{d\}encodes\.

A linear logit\-lens read of𝒅\\bm\{d\}\(projecting it through the output unembedding\) ranks fragmentary subword pieces \(tri,ction,iding,lect\) among its top\-scoring tokens and whole content words \(emotion,scholarship,stretched\) among its lowest: consistent with𝒅\\bm\{d\}encoding the repetitive, sub\-lexical loop content that steering removes\.
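A logit-lens read of this kind amounts to one matrix-vector product followed by a sort. The sketch below assumes the unembedding is a `vocab x e` matrix whose rows are dotted with the direction; the vocabulary and numbers are hypothetical placeholders, not the model's.

```python
def logit_lens(direction, unembedding, vocab):
    """Score each token by the dot product of its unembedding row with `direction`.

    Returns (token, score) pairs sorted from highest to lowest score.
    """
    scores = [sum(w * x for w, x in zip(row, direction)) for row in unembedding]
    return sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical 4-token vocabulary with 2-dimensional unembedding rows.
    vocab = ["piece_a", "piece_b", "word_c", "word_d"]
    unembedding = [[2.0, 0.1], [1.5, -0.2], [-1.0, 0.3], [-2.0, 0.0]]
    direction = [1.0, 0.0]
    for token, score in logit_lens(direction, unembedding, vocab):
        print(token, score)
```

Reading both ends of the sorted list, as the paragraph does, shows which tokens the direction promotes and which it suppresses.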

## Appendix C Non-words and the steering window

### C.1 Non-words: a second defect, independent of repetition

The *non-word* rate counts out-of-dictionary tokens (Kukich, [1992](https://arxiv.org/html/2607.00588#bib.bib15)): alphabetic tokens of length $\geq 4$ with zipf_frequency $=0$ in wordfreq (Speer et al., [2018](https://arxiv.org/html/2607.00588#bib.bib46)) (a hunspell (Németh, [2003](https://arxiv.org/html/2607.00588#bib.bib47)) cross-check preserves the ELF $>$ human ordering). ELF generates them far above the human floor, and Gen-PPL never sees them (Table [13](https://arxiv.org/html/2607.00588#A3.T13)).

Table 13: Non-words are a second, decode-axis defect, invisible to Gen-PPL. Out-of-dictionary tokens; top block: main-text generations; bottom: conditional ELF-B checkpoints; † includes untranslated German; – = not applicable.

This is a *decode-axis* defect, statistically independent of the trajectory-axis repetition: a sample can be repetitive, non-word-laden, both, or neither. The decoder reads each latent position by an independent $\arg\max$ at $t{=}1$ with no autoregressive tie between neighbours, so individually legal word-pieces assemble into an illegal whole (*glued* non-words, e.g. shock + d $\to$ shockd). Because the two axes are independent, the self-conditioning fix targets repetition and not non-words; conversely, over-steering pushes the latent off the embedding manifold and surfaces as a non-word spike (the $\lambda_{\max}$ bound, Prop. [2](https://arxiv.org/html/2607.00588#Thmproposition2)), the signal that bounds the steering window from above.
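The non-word rate defined in this subsection can be sketched as follows. Several details are assumptions: whitespace tokenization with punctuation stripped, lower-casing before lookup, and all tokens as the denominator. The lookup is injected so the sketch runs without external packages; `TOY_ZIPF` and its values are made up. With the wordfreq package installed, one could pass `lambda w: zipf_frequency(w, "en")` instead.

```python
import string


def non_word_rate(text, zipf_lookup, min_len=4):
    """Share of tokens that are alphabetic, at least `min_len` long, and have Zipf frequency 0."""
    tokens = [t.strip(string.punctuation) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    non_words = [
        t for t in tokens
        if len(t) >= min_len and t.isalpha() and zipf_lookup(t.lower()) == 0
    ]
    return len(non_words) / len(tokens)


# Hypothetical stand-in for a real frequency table; values are illustrative only.
TOY_ZIPF = {"the": 7.7, "vote": 5.0, "quickly": 4.7}


def toy_lookup(word):
    """Return a made-up Zipf frequency, 0.0 for anything outside the toy table."""
    return TOY_ZIPF.get(word, 0.0)


if __name__ == "__main__":
    print(non_word_rate("the shockd vote slamed quickly.", toy_lookup))
```

Short tokens are skipped by the length filter, so an unknown three-letter token never counts as a non-word under this definition.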

### C.2 The steering $\lambda$-window

Two surface readouts bound the usable dose from opposite sides (Fig. [5](https://arxiv.org/html/2607.00588#A3.F5)). *Lower edge*: sweeping $\lambda$ up from $0$, repetition falls steeply out of the basin from the closed-form threshold $\lambda^{\star}{\approx}1.5$ (ELF-B crosses the human bar by $\lambda{\approx}2.5$, the larger models earlier). *Upper edge*: past the window the latent leaves the embedding manifold and the non-word rate spikes ($0.83\%$ at $\lambda{=}6$ to $2.42\%$ at $\lambda{=}8$), and repetition itself rebounds (back to $5.03\%$ at $\lambda{=}8$ from the $1.11\%$ floor at $\lambda{=}4$): over-steering fails on both counts. The usable window is $\lambda\in[1.5,5]$ and we operate at $\lambda{=}2$.

![Refer to caption](https://arxiv.org/html/2607.00588v1/x5.png)

Figure 5: Steering has a finite usable window, as the theory predicts. $\lambda$-sweep on ELF-B: repetition (blue), non-words (red, marking the latent leaving the token-embedding manifold); shaded: repetition below the human bar, non-words at baseline.

A single ELF-B direction $\bm{d}_B$ at this one dose carries across sizes (Table [14](https://arxiv.org/html/2607.00588#A3.T14)): it tracks each size's natively re-estimated direction throughout. $\lambda{=}2$ is the knee of every size's curve and the conservative operating point: it brings ELF-M and ELF-L well under the human bar and the hardest model ELF-B to just above it ($2.11\%$ vs the $1.92\%$ bar; $\lambda{=}3$ takes ELF-B fully under, at $1.37\%$), while $\lambda{=}4$ over-steers the larger two. One direction and one mild dose thus suffice across sizes.

Table 14: Dose response across sizes: $\lambda{=}2$ (bold) is the conservative operating point. Shared ELF-B $\bm{d}_B$ at dose $\lambda$ across sizes (transfer); quality $512$-word matched. Upper edge ($\lambda{\gtrsim}5$): Fig. [5](https://arxiv.org/html/2607.00588#A3.F5).

## Appendix D Steering Pareto-dominates every soft self-conditioning alternative

#### The benchmarked alternatives (Tab. [15](https://arxiv.org/html/2607.00588#A4.T15)).

Each soft self-conditioning variant *softens*, rather than removes, the fed-back estimate $\hat{\bm{x}}$ before it re-enters the loop, with no retraining. *mag* rescales it, $\hat{\bm{x}}\leftarrow\alpha\hat{\bm{x}}$, so $\alpha{=}1$ is the unmodified full-SC sampler and $\alpha{=}0$ is SC-reset (feedback disabled; the $\alpha$ sweep is Tab. [2](https://arxiv.org/html/2607.00588#S3.T2)). *noise* adds Gaussian noise, $\hat{\bm{x}}\leftarrow\hat{\bm{x}}+\sigma\bm{\epsilon}$, to break the fixed point while keeping the signal. *dist* periodically decodes $\hat{\bm{x}}$ to token logits and feeds back the temperature-weighted embedding $\mathrm{softmax}(\text{logits}/T)\,\bm{E}$ (near-committed at small $T$, mean-like at large $T$). *cutoff* turns self-conditioning off entirely past a mid-trajectory step (feeding back $\bm{0}$ thereafter); *decay* linearly anneals the feedback magnitude from full to a floor $\alpha$ over the trajectory; *early-restart* decodes $\hat{\bm{x}}$ once mid-trajectory, flags the samples already repeating (token seq-rep-4 above a threshold), and resets their feedback to $\bm{0}$ for the remaining steps. Full-SC (post-hoc reject) runs the unmodified sampler followed by the same reject-to-$1000$ loop. ACE instead subtracts one direction, $\hat{\bm{x}}\leftarrow\hat{\bm{x}}-\lambda\bm{d}$ (§[5](https://arxiv.org/html/2607.00588#S5)), leaving the rest of the feedback intact. At the shared reject-to-$1000$ caliber, steering Pareto-dominates this frontier (§[5](https://arxiv.org/html/2607.00588#S5); the full grid over every variant and both doses is Tab. [15](https://arxiv.org/html/2607.00588#A4.T15)); Fig. [6](https://arxiv.org/html/2607.00588#A4.F6) gives the per-size cost of the win.
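Three of the feedback edits above (*mag*, *noise*, and the ACE subtraction) are simple enough to sketch directly. The sketch treats $\hat{\bm{x}}$ as an $L\times e$ list of rows and assumes the same $\lambda\bm{d}$ is subtracted at every position and that $\bm{d}$ has unit norm; the function names are hypothetical, and this is not the authors' implementation.

```python
import random


def mag(x_hat, alpha):
    """Rescale the feedback: alpha=1 leaves it unchanged, alpha=0 disables it."""
    return [[alpha * v for v in row] for row in x_hat]


def noise(x_hat, sigma, rng):
    """Add independent Gaussian noise of scale sigma to every entry."""
    return [[v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in x_hat]


def ace_steer(x_hat, d, lam):
    """Subtract lam * d from every position (broadcast over positions is an assumption)."""
    return [[v - lam * dj for v, dj in zip(row, d)] for row in x_hat]


def pooled_projection(x_hat, d):
    """Dot product of the position-pooled feedback with d."""
    pooled = [sum(col) / len(x_hat) for col in zip(*x_hat)]
    return sum(p * dj for p, dj in zip(pooled, d))


if __name__ == "__main__":
    x_hat = [[1.0, 2.0], [3.0, 0.0]]  # toy feedback, L=2 positions, e=2 dims
    d = [1.0, 0.0]                    # toy unit direction
    print(pooled_projection(x_hat, d))
    print(pooled_projection(ace_steer(x_hat, d, 2.0), d))
    print(mag(x_hat, 0.0))
    print(noise(x_hat, 0.1, random.Random(0)))
```

For a unit $\bm{d}$, steering lowers the pooled projection by exactly $\lambda$ and leaves every component orthogonal to $\bm{d}$ untouched, whereas *mag* and *noise* act on all directions at once.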

Table 15: Full compute-to-clean grid: every soft method at its lower- and higher-rep dose. Reject-to-$1000$ at $\gamma{=}1.0$, $64$ steps (pass seq-rep-4 $\leq 1.92\%$); rep % is the direct-generation median, NFE in $10^{3}$ forward passes, clean-PPL on the accepted set. The compact main-text view is Tab. [4](https://arxiv.org/html/2607.00588#S5.T4).

![Refer to caption](https://arxiv.org/html/2607.00588v1/x6.png)

Figure 6: Steering makes human-clean text $1.5$–$5\times$ cheaper. Generations to $1000$ samples under the human bar, baseline vs steered, by model size (log scale); the per-method head-to-head at one size is Tab. [4](https://arxiv.org/html/2607.00588#S5.T4).
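The reject-to-$1000$ protocol can be sketched as a loop that draws samples until enough pass the repetition bar. The sketch assumes the common definition of seq-rep-$n$ as one minus the ratio of unique to total $n$-grams, which this section does not spell out; `generate` is a hypothetical callable standing in for the sampler, and `max_tries` is an added safety cap.

```python
import itertools


def seq_rep_n(tokens, n=4):
    """Fraction of repeated n-grams: 1 - unique n-grams / total n-grams (assumed definition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def reject_to_n(generate, target, threshold=0.0192, max_tries=100_000):
    """Draw samples until `target` of them pass the bar; return (accepted, tries)."""
    accepted, tries = [], 0
    while len(accepted) < target and tries < max_tries:
        sample = generate()
        tries += 1
        if seq_rep_n(sample) <= threshold:
            accepted.append(sample)
    return accepted, tries


if __name__ == "__main__":
    looped = "a b c d a b c d".split()
    clean = "a b c d e f g h".split()
    source = itertools.cycle([looped, clean])  # toy sampler: every other draw loops
    accepted, tries = reject_to_n(lambda: next(source), target=3)
    print(len(accepted), tries)
```

The number of tries is the quantity that drives compute-to-clean cost: a sampler that repeats less needs fewer generations to fill the accepted set.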

## Appendix E Building the fix into training (extended)

#### Setup.

We continue-train ELF-B from the released checkpoint on OpenWebText (fresh AdamW, weight decay $0$, $50$-step warmup, effective batch $512$, learning rate $10^{-4}$, $128$ optimizer steps), then evaluate the EMA weights by plain unsteered sampling on $n{=}1000$ samples (rep: medians; clean-PPL: reject-to-$1000$), on the same footing as the released baseline. EMA decay is lowered from the release's $0.9999$ to $0.999$ so the average tracks this short run. The continue-train control ($w{=}0$) and the anti-attractor regularizer ($w{=}7$, best) of Table [3](https://arxiv.org/html/2607.00588#S5.T3) share this recipe.

#### Anti-attractor regularization.

Rather than clean the feedback, teach the model not to *produce* attractor-direction feedback. Let $\hat{\bm{x}}_{\theta}(\bm{z}_k,t_k)\in\mathbb{R}^{L\times e}$ be the model's clean-latent prediction at noise level $t_k$ and $\bm{u}_{\theta}=\tfrac{1}{L}\sum_{l=1}^{L}\hat{\bm{x}}_{\theta}[l]\in\mathbb{R}^{e}$ its position-pooled estimate (the feedback $\bm{u}$ of §[4](https://arxiv.org/html/2607.00588#S4), equation [4](https://arxiv.org/html/2607.00588#S4.E4)). With the frozen difference-of-means direction $\bm{d}$ (equation [12](https://arxiv.org/html/2607.00588#A1.E12)), the *anti-attractor* regularizer is

$$\mathcal{L}_{\mathrm{attr}}(\theta)\;=\;\mathbb{E}_{\bm{z}_0,\,k}\!\left[\,\operatorname{ReLU}\!\big(\langle\bm{u}_{\theta},\,\bm{d}\rangle\big)^{2}\,\right], \tag{17}$$

and we continue-train against

$$\mathcal{L}(\theta)\;=\;\mathcal{L}_{\mathrm{base}}(\theta)\;+\;w\,\mathcal{L}_{\mathrm{attr}}(\theta),\qquad w\geq 0, \tag{18}$$

with $\mathcal{L}_{\mathrm{base}}$ the standard flow-matching loss. The one-sided $\operatorname{ReLU}$ is deliberate: only the $+\bm{d}$ half-line is the repetitive basin (Prop. [2](https://arxiv.org/html/2607.00588#Thmproposition2)), so we penalize the model for pushing its *own* clean estimate toward the attractor while leaving the $-\bm{d}$ side and all off-$\bm{d}$ content $\bm{w}\perp\bm{d}$ untouched. Because the penalty acts only on $\hat{\bm{x}}_{\theta}$, the trained model needs no test-time intervention.
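To make the one-sidedness of the penalty concrete, the sketch below evaluates the regularizer on toy predictions. It only computes loss values in plain Python, with the batch mean standing in for the expectation; an actual training run would express the same quantity in an autodiff framework. The function names and toy numbers are hypothetical.

```python
def pooled_feedback(x_hat):
    """Mean over the L positions of an L x e clean-latent prediction."""
    return [sum(col) / len(x_hat) for col in zip(*x_hat)]


def attractor_penalty(x_hat, d):
    """ReLU(<u, d>)^2 for one sample, with u the position-pooled prediction."""
    u = pooled_feedback(x_hat)
    proj = sum(ui * di for ui, di in zip(u, d))
    return max(proj, 0.0) ** 2


def total_loss(base_loss, batch, d, w):
    """base_loss + w * batch mean of the one-sided penalty."""
    attr = sum(attractor_penalty(x_hat, d) for x_hat in batch) / len(batch)
    return base_loss + w * attr


if __name__ == "__main__":
    d = [1.0, 0.0]                       # toy frozen direction
    toward = [[2.0, 5.0], [0.0, -1.0]]   # pooled estimate has projection +1 on d
    away = [[-2.0, 5.0], [0.0, -1.0]]    # pooled estimate has projection -1 on d
    print(attractor_penalty(toward, d), attractor_penalty(away, d))
    print(total_loss(0.5, [toward, away], d, w=7.0))
```

Only the sample whose pooled estimate points along $+\bm{d}$ is penalized; the mirrored sample and the large off-$\bm{d}$ components contribute nothing.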

#### Result.

A short fine-tune (best at $w{=}7$) lowers plain repetition with coherent text at a small clean-PPL cost (Table [3](https://arxiv.org/html/2607.00588#S5.T3)); the same budget without the penalty ($w{=}0$) barely moves it, so the reduction is the regularizer's, not the extra training. This confirms $\bm{d}$ is a trainable, causal property, not just an inference-time handle.

## Appendix F Controls and example outputs

This appendix supports the causal test of §[3](https://arxiv.org/html/2607.00588#S3) with the repetition–PPL association (§[F.1](https://arxiv.org/html/2607.00588#A6.SS1)), and collects example outputs: non-words (§[F.2](https://arxiv.org/html/2607.00588#A6.SS2)) and baseline-vs-steered repetition samples (§[G](https://arxiv.org/html/2607.00588#A7)).

### F.1 Conditional PPL collapse ($64$-step operating point)

#### Association.

Table [16](https://arxiv.org/html/2607.00588#A6.T16) stratifies the $1{,}000$ samples of the $64$-step ($\gamma{=}1.0$) run by repetition; median GPT-2 PPL falls monotonically as repetition rises (Pearson $r(\textsc{rep},\textsc{ppl}){=}{-}0.63$, Spearman $\rho{=}{-}0.68$).
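The two association statistics can be computed per sample from paired repetition and perplexity values. A minimal stdlib sketch follows, with Spearman taken as Pearson on average ranks; the four data points in the demo are invented to show a monotone decreasing relation and are not the paper's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    rep = [0.00, 0.10, 0.20, 0.40]  # invented repetition rates
    ppl = [40.0, 30.0, 25.0, 10.0]  # invented perplexities
    print(round(pearson(rep, ppl), 3), round(spearman(rep, ppl), 3))
```

A negative value on both statistics is the pattern reported above: samples that repeat more receive lower perplexity.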

Table 16: Perplexity falls monotonically as repetition rises (association). $64$-step run.

### F.2 Example non-word outputs

All examples are from the ELF-B $64$-step ($\gamma{=}1.0$) run.

#### Non-words (with context).

The defect spans coinages, glued-together word pairs, and misspellings:

- psychloginism: “*The goal of psychloginism is to learn the…*”
- intersignal: “*…short of a bipartisan intersignal vote that would happen…*”
- evenfigured: “*…may occur and we’ve evenfigured out a way to…*” (even + figured)
- slamed: “*…his wife when he slamed a woman in the…*” (slammed)
- encryptted: “*…contains the first encryptted element…*” (encrypted)

## Appendix G Qualitative samples: baseline vs. steered

#### Repetition emerging along the trajectory.

Decoding *one* sample's current estimate at successive trajectory fractions shows the basin forming, the text counterpart of Figure [3](https://arxiv.org/html/2607.00588#S5.F3). Table [17](https://arxiv.org/html/2607.00588#A7.T17) shows one draw under baseline and steering at three trajectory stages: the baseline stays locked in the basin at $38$–$45\%$ repeated $4$-grams, while the steered run stays near the human bar ($1.0$–$3.1\%$).

Table 17: The defect and the fix in one sample, along the trajectory. Same seed as Figure [3](https://arxiv.org/html/2607.00588#S5.F3), baseline vs steered ($\lambda{=}2$); highlighted = the cell's dominant looped $4$-gram; […] = elided.

## Appendix H Related work

#### Diffusion-LM failure modes: metric and decoder.

Prior work on diffusion-LM failure modes mainly diagnoses evaluation artifacts or decoding bottlenecks, not text-visible repetition directly. The *metric* line shows generative perplexity is distorted by low-entropy or reduced-temperature sampling (Zheng et al., [2025](https://arxiv.org/html/2607.00588#bib.bib21); Pynadath et al., [2026](https://arxiv.org/html/2607.00588#bib.bib17)), echoing broader evidence that model-based perplexity is unreliable for generation (Wang et al., [2022](https://arxiv.org/html/2607.00588#bib.bib22); He et al., [2023](https://arxiv.org/html/2607.00588#bib.bib29)) and gameable end-to-end by naive samplers (Franca and Tong, [2026](https://arxiv.org/html/2607.00588#bib.bib48)); this line is mostly on discrete or masked models. The *decoder* line targets per-position rounding and independence with stronger decoders (Li et al., [2022](https://arxiv.org/html/2607.00588#bib.bib24); Dieleman et al., [2022](https://arxiv.org/html/2607.00588#bib.bib25); Li et al., [2026](https://arxiv.org/html/2607.00588#bib.bib19); Shen et al., [2026](https://arxiv.org/html/2607.00588#bib.bib18)): it addresses the discretization bottleneck, not the self-conditioning feedback direction behind text-visible repetition. An entropy view is in any case blind to whether the tokens are real words.

#### Reading the text, not the metric or the decoder.

We instead measure what the models generate, counting the defects a reader sees against human references. The closest prior model, the self-conditioned embedding diffusion of Strudel et al. ([2022](https://arxiv.org/html/2607.00588#bib.bib26)), reports likelihood/entropy trade-offs under guidance; relative to it and to the AR-repetition line (Xu et al., [2022](https://arxiv.org/html/2607.00588#bib.bib30)), we study the generated text itself rather than a clean-embedding probe or the metric.

#### Relation to autoregressive degeneration.

Repetition is the *denoising-axis* analogue of classical AR-decoding degeneration (Holtzman et al., [2020](https://arxiv.org/html/2607.00588#bib.bib14)): the self-conditioning feedback plays the role of AR's repeated-context prior and the low-entropy committed regime that of greedy maximization, but the errors compound along the denoising-step axis rather than the sequence-order axis. The correspondence is prescriptive: because repetition is set in the continuous latent, upstream of token selection, decode-time fixes such as contrastive search (Su et al., [2022](https://arxiv.org/html/2607.00588#bib.bib31)) are poorly placed to reach it, whereas the orthogonal non-word defect is amenable to AR-style contextual decoding (App. [C](https://arxiv.org/html/2607.00588#A3)).

#### Relation to masked diffusion.

Two concurrent threads on masked diffusion bound our claims. Cardei et al. ([2026](https://arxiv.org/html/2607.00588#bib.bib20)) add self-conditioning to masked diffusion and report a large drop in generative perplexity without an explicit repetition analysis; whether masked, non-committed feedback enters the attractor we describe is left open. Shnaidman et al. ([2025](https://arxiv.org/html/2607.00588#bib.bib39)) steer a behavioral knob of masked DLMs (safety refusal) along an approximately one-dimensional subspace. A single-direction account of a DLM is thus not unique to us; what is specific here is a contractive attractor of the *continuous* self-conditioning loop, recovered label-free and targeting the generative-perplexity defect, captured by one frozen direction that generalizes across inference settings and model scales.
