Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

arXiv cs.CL Papers

Summary

This paper presents evidence that self-training on language model outputs does not uniformly flatten language but restructures it, with surface markers (discourse connectives, hedges, em-dashes) increasing while deep syntactic structures (passives, subjunctives, parentheticals) collapse, formalized as the Structural Depth Hypothesis.

arXiv:2605.20602v1 Announce Type: new Abstract: Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.

Cached at: 05/21/26, 06:34 AM

# Self-Training Doesn’t Flatten Language — It Restructures It: Surface Markers Amplify While Deep Syntax Dies
Source: [https://arxiv.org/html/2605.20602](https://arxiv.org/html/2605.20602)
###### Abstract

Successive self-training on a language model’s own outputs is widely characterized as a process of *flattening*: diversity drops, distributions narrow, and the text becomes “more like itself.” We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly; it is *restructured*. Surface markers (discourse connectives, hedges, em-dashes) and aggregate “complexity” proxies (dep-tree depth, type-token ratio, average word length) all *rise*, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature $\phi$ is predicted primarily by its *structural depth* $d(\phi)$, the number of nested syntactic dependencies it requires, and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families ($N=85$), a mixed-effects model accounting for the nested structure yields a highly significant depth coefficient ($p<0.001$); the pooled Spearman correlation is $\rho=0.540$ ($p<10^{-6}$; cluster-bootstrap 95% CI $[0.434, 0.634]$), while frequency is a substantially weaker predictor ($\rho=0.225$). Four of five models are individually significant (Pythia-410M: $\rho=0.609$, $p=0.010$; OPT-1.3B: $\rho=0.563$, $p=0.019$; Pythia-1.4B: $\rho=0.498$, $p=0.042$; Pythia-2.8B: $\rho=0.705$, $p=0.002$). A matched human-text fine-tuning control yields $\rho=0.039$ ($p=0.88$), confirming the gradient is self-training-specific.
We further document a*Superficial Complexity Paradox*: surface measures of complexity*rise*as the underlying clause structure dies\. Reporting only aggregate fingerprints — as is now standard in the LLM\-stylometry literature — systematically masks this bifurcation, with direct implications for training\-data curation and detection\.


Ming Liu, Amazon, mlliuz@amazon.com

## 1 Introduction

A model trained on its own outputs is supposed to converge. Variance shrinks, perplexity falls, the tail of the distribution thins, and, in the dominant “model collapse” framing, the text drifts toward a low-entropy attractor (Shumailov et al., [2024](https://arxiv.org/html/2605.20602#bib.bib23); Dohmatob et al., [2024](https://arxiv.org/html/2605.20602#bib.bib4); Alemohammad et al., [2023](https://arxiv.org/html/2605.20602#bib.bib1)). A separate literature on *LLM linguistic fingerprints* (Zanotto and Aroyehun, [2024](https://arxiv.org/html/2605.20602#bib.bib29); Sourati et al., [2025](https://arxiv.org/html/2605.20602#bib.bib24); Tercon and Dobrovoljc, [2025](https://arxiv.org/html/2605.20602#bib.bib25); Kobak et al., [2025](https://arxiv.org/html/2605.20602#bib.bib15); Juzek and Ward, [2025](https://arxiv.org/html/2605.20602#bib.bib14)) reports a parallel observation from the static side: machine-generated text is unusually rich in discourse markers, em-dashes, and hedges relative to human baselines. Both literatures agree that something is happening to the distribution. Neither, we argue, has correctly described *what*.

We run eleven generations of self-training on GPT-2 124M and track seventeen linguistic features chosen *a priori* for their position on a syntactic-depth scale, from purely lexical surface markers ($d=0$) to cross-clausal phenomena like the subjunctive ($d=3$). The picture that emerges is not flattening. It is bifurcation.

#### The divergence\.

Discourse markers (however, moreover, therefore) more than *double* in relative frequency, and hedges, em-dashes, and sentence-initial conjunctions also rise (by 19% to 44%; Table [1](https://arxiv.org/html/2605.20602#S4.T1)). At the same time, question marks fall by $92\%$, parentheticals by $57\%$, passive voice by $56\%$, irregular past-tense verbs by $52\%$, and subjunctive constructions by $53\%$. A reader sampling from generation 10 sees text that looks *more* essayistic, *more* hedged, and *more* formally connective than the original GPT-2 distribution, and yet has lost the syntactic machinery (questions, embedded clauses, passives) that makes prose actually flexible. The aggregate fingerprint metrics preferred in the recent stylometry literature (Zanotto and Aroyehun, [2024](https://arxiv.org/html/2605.20602#bib.bib29); Kobak et al., [2025](https://arxiv.org/html/2605.20602#bib.bib15)) all *rise*: dependency-tree depth increases by $45\%$, clause embedding by $33\%$, type-token ratio by $10\%$. By every standard “complexity” proxy, the text is getting richer. By every clause-structural measure, it is dying.

#### The hypothesis\.

We argue that this is not a collection of unrelated failures but a single phenomenon predicted by structural depth. Define the *structural depth* $d(\phi)$ of a feature $\phi$ as the minimum number of nested syntactic dependencies required to license $\phi$ in a sentence. We propose the Structural Depth Hypothesis:

> Under iterated self-training, the per-generation drift rate of a linguistic feature $\phi$ is approximately $\frac{d\phi}{dt} \propto (-\alpha\, d(\phi) + \beta\, \sigma(\phi))\, \phi$, where $\sigma(\phi)$ is the feature’s dependence on sampling stochasticity (high $\sigma$ = produced only under diverse sampling, absent from greedy mode).

The first term predicts that mid\- and deep\-syntactic features decay roughly in proportion to their depth\. The second term predicts that shallow, sampling\-dependent features ride the rich\-get\-richer dynamics of stochastic generation\. Together they predict bifurcation, not flattening\.

#### Why depth, not frequency\.

The dominant theoretical account of model collapse (Shumailov et al., [2024](https://arxiv.org/html/2605.20602#bib.bib23); Dohmatob et al., [2024](https://arxiv.org/html/2605.20602#bib.bib4)) attributes distributional drift to a *frequency* mechanism: rare events get sampled less, so they die. Our data are inconsistent with this account as the primary driver. Pooling across five models ($N=85$), depth predicts per-feature decay rate ($\rho=0.540$, $p<10^{-6}$) while frequency is a substantially weaker predictor ($\rho=0.225$, $p=0.039$). The most-decayed features in our data are not the rarest; they are the structurally deepest.

#### Contributions\.

- We formalize the Structural Depth Hypothesis and derive three testable predictions: surface amplification, deep death, and group-mean monotonicity in depth (§[3](https://arxiv.org/html/2605.20602#S3)).
- We provide controlled self-training studies on five models spanning three architecture families (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, and Pythia-2.8B), each run for 11 generations, with seventeen features selected *a priori* stratified by depth and a per-feature trajectory analysis (§[5](https://arxiv.org/html/2605.20602#S5)). The GPT-2 group means $\{+24.9\%, -10.0\%, -47.2\%, -52.7\%\}$ for $d \in \{0,1,2,3\}$ are monotone in depth.
- We document the *Superficial Complexity Paradox* (aggregate fingerprint metrics rise while clause-structural features die) and argue that this systematically biases the existing LLM-stylometry literature (§[5.5](https://arxiv.org/html/2605.20602#S5.SS5)).

The result reframes both literatures\. To the model\-collapse community, we offer a structural rather than purely statistical account of which features die\. To the fingerprint community, we provide a mechanism for*why*the canonical machine\-text markers are exactly the ones they are: they are the survivors — and amplifiers — of a depth\-graded collapse\.

## 2 Related Work

#### Model collapse\.

Shumailov et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib23)) formalized model collapse in *Nature*, showing that recursive training on synthetic data causes language models to lose the distributional tails. Dohmatob et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib4)) provide a scaling-law analysis predicting that low-frequency events decay first, a frequency-rank mechanism we directly test and find insufficient (§[5.3](https://arxiv.org/html/2605.20602#S5.SS3)). Seddik et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib22)) offer a complementary statistical analysis of collapse dynamics. Gerstgrasser et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib7)) show that accumulating real data alongside synthetic data can mitigate collapse. Alemohammad et al. ([2023](https://arxiv.org/html/2605.20602#bib.bib1)) characterize self-consuming generative loops in image and text domains; Briesch et al. ([2023](https://arxiv.org/html/2605.20602#bib.bib3)) demonstrate that LLMs suffer from training on their own outputs; Guo et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib10)) document declining lexical diversity under iterative generation; and Herel and Mikolov ([2024](https://arxiv.org/html/2605.20602#bib.bib12)) show analogous collapse in language modeling specifically. Related to collapse dynamics, Holtzman et al. ([2020](https://arxiv.org/html/2605.20602#bib.bib13)) characterized neural text degeneration and motivated nucleus sampling as a mitigation, and Welleck et al. ([2020](https://arxiv.org/html/2605.20602#bib.bib27)) proposed unlikelihood training to address repetitive generation; both phenomena are adjacent to the template-amplification mechanism in our SDH. All of these works study collapse at the level of distributions over tokens, embeddings, or perplexity. They do not ask *which* linguistic structures collapse, nor relate collapse rate to syntactic properties of the features themselves.
Our contribution is orthogonal: we hold the collapse phenomenon fixed and ask what governs the per-feature decay rate, finding that structural depth is a significant predictor and frequency a substantially weaker one.

#### Processing depth in psycholinguistics\.

The notion that syntactic complexity scales with embedding depth has a long history in psycholinguistics. Gibson ([2000](https://arxiv.org/html/2605.20602#bib.bib8)) formalized Dependency Locality Theory, predicting greater processing cost for structures with longer dependency chains. Hale ([2001](https://arxiv.org/html/2605.20602#bib.bib11)) and Levy ([2008](https://arxiv.org/html/2605.20602#bib.bib16)) showed that surprisal, a correlate of contextual improbability, tracks processing difficulty. Our structural-depth scale $d(\phi)$ can be seen as a coarse proxy for these processing-cost metrics applied to generation rather than comprehension: features requiring more sequential commitments under an autoregressive model face a multiplicative probability penalty analogous to the integration-cost penalty in human processing.

#### LLM linguistic fingerprints\.

A growing body of work characterizes the static distributional signature of LLM-generated text. Zanotto and Aroyehun ([2024](https://arxiv.org/html/2605.20602#bib.bib29)) catalog distinctive lexical and discourse markers. Kobak et al. ([2025](https://arxiv.org/html/2605.20602#bib.bib15)) document the “*delve* effect”: specific words and phrases that appear at elevated rates in machine-generated prose. Sourati et al. ([2025](https://arxiv.org/html/2605.20602#bib.bib24)) report shrinking lexical diversity in successive LLM generations. Tercon and Dobrovoljc ([2025](https://arxiv.org/html/2605.20602#bib.bib25)) provide stylometric profiles, Juzek and Ward ([2025](https://arxiv.org/html/2605.20602#bib.bib14)) trace the sources of lexical overrepresentation, and Wu et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib28)) show fingerprint stability across prompts. Padmakumar and He ([2024](https://arxiv.org/html/2605.20602#bib.bib19)) and Liang et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib17)) document the downstream effects of LLM-generated text on content diversity and peer review. Mitchell et al. ([2023](https://arxiv.org/html/2605.20602#bib.bib18)) propose curvature-based detection of machine text, a method whose effectiveness may be affected by the structural changes we document. These works study the static fingerprint of a single model generation, not its dynamics under self-training. Our contribution connects the two: the canonical fingerprint markers (discourse connectives, hedges, em-dashes) are precisely the features that the SDH predicts will amplify under iteration.

#### Concurrent work\.

Grigoreva et al. ([2025](https://arxiv.org/html/2605.20602#bib.bib9)) (FLLM 2025) study lexical drift in iterated generation but treat features as a flat bag without a depth scale. Vanmassenhove ([2025](https://arxiv.org/html/2605.20602#bib.bib26)) frames synthetic-data contamination as “unnatural selection” reducing multilingual diversity, a concern complementary to our monolingual structural analysis. Peterson and Christiano ([2025](https://arxiv.org/html/2605.20602#bib.bib20)) document “knowledge collapse” in factual content. None of these works tests a structural-depth account or identifies the bifurcation between rising aggregate complexity proxies and falling clause-level structures. Our SDH provides a unifying mechanism that subsumes both their lexical-drift findings and the model-collapse literature’s distributional findings as predictions of a single depth-graded process.

## 3 The Structural Depth Hypothesis

### 3.1 Structural depth

We define the *structural depth* $d(\phi) \in \{0,1,2,3,\ldots\}$ of a linguistic feature $\phi$ as the minimum number of nested syntactic dependencies a sentence must instantiate in order to license $\phi$. We use “structural” rather than “syntactic” to distinguish $d(\phi)$, a property of the feature type, from the measured mean dependency-tree depth of a sentence, which is the aggregate metric that exhibits the Superficial Complexity Paradox (§[5.5](https://arxiv.org/html/2605.20602#S5.SS5)).

- $d=0$: *lexical / surface markers.* Tokens or short $n$-grams whose presence is independent of the surrounding parse: discourse markers (however, moreover), hedging particles (perhaps, maybe), em-dashes, exclamation marks.
- $d=1$: *local syntax.* Phenomena that depend on a single syntactic relation: regular past-tense morphology ($V$+*ed*), sentence-initial conjunctions, simple coordination, quotation marks introducing direct speech, colons and semicolons introducing local elaboration.
- $d=2$: *clause structure.* Phenomena requiring an embedded clause or non-trivial argument structure: question formation (subject-aux inversion or *wh*-extraction), parentheticals, passive voice, irregular past-tense verbs (which cluster in clause-final and embedded contexts in our annotated sample), relative clauses, adverbial clauses.
- $d=3$: *cross-clause / mood.* Phenomena requiring coordination across clause boundaries or non-indicative mood: subjunctive constructions (counterfactuals, complement-clause subjunctives).

We define a stratified panel of seventeen features spanning $d \in \{0,1,2,3\}$ (Table [1](https://arxiv.org/html/2605.20602#S4.T1)).

#### Justification for depth assignments\.

The depth assignments follow from standard syntactic theory: a feature at depth $d$ requires $d$ nested dependency relations in the parse tree. For example, passive voice ($d=2$) requires both an argument-structure alternation and an auxiliary BE form, hence two syntactic commitments (we detect this via a surface BE-aux + past-participle pattern). Discourse markers ($d=0$) are freely insertable adverbs with no syntactic dependencies. The assignments were fixed before any trajectory data were computed; the only post-hoc change would be to add or remove features from the panel entirely, not to re-annotate depths. Leave-one-out analysis (§[5.2](https://arxiv.org/html/2605.20602#S5.SS2)) shows that the depth–decay correlation is robust to removing any single feature from the panel, and a sensitivity check reassigning irregular past from $d=2$ to $d=1$ confirms that the pooled correlation remains significant (§[8](https://arxiv.org/html/2605.20602#S8)).

### 3.2 The organizing principle

Let $\phi_t$ denote the relative frequency of feature $\phi$ at generation $t$ of self-training, normalized to its generation-0 value. We summarize the SDH as a heuristic organizing schema (not a fitted dynamical model):

$$\frac{d\phi}{dt} \;\approx\; -\alpha\, d(\phi)\, \phi_t \;+\; \beta\, \sigma(\phi)\, \phi_t, \tag{1}$$

where $d(\phi)$ is structural depth, $\sigma(\phi) \in [0,1]$ is the *sampling dependence* of $\phi$ (the degree to which a feature’s production relies on stochastic sampling diversity rather than the model’s deterministic mode), and $\alpha, \beta > 0$ are model-specific constants. Operationally, $\sigma(\phi) = 1 - \tau(\phi)$ where $\tau = f_{\text{greedy}}/f_{\text{nucleus}}$ (see §[7](https://arxiv.org/html/2605.20602#S7)). The two terms capture two competing forces:

Depth penalty ($-\alpha\, d$). A feature at structural depth $d$ requires the model to make $d$ *sequentially correct* syntactic commitments. A question requires subject-auxiliary inversion; a passive requires both argument re-ordering and an auxiliary; a subjunctive requires mood marking conditioned on the matrix-clause verb. At each choice point, the model’s autoregressive sampling must allocate probability mass to the correct continuation. The probability of successfully traversing all $d$ choice points is (to first order) multiplicative, so the effective probability of generating a depth-$d$ structure scales as $p^{d}$ for some per-step success rate $p<1$. When the model is fine-tuned on its own outputs, structures that were under-sampled become even rarer in the next training corpus, creating a depth-graded positive feedback loop.

Sampling-dependence bonus ($+\beta\, \sigma$). Conversely, features whose production relies on the diversity of nucleus sampling (those *absent* from greedy templates) ride the rich-get-richer dynamics of stochastic generation. Discourse connectives, hedges, and em-dashes have $\sigma \approx 1$: they are produced almost exclusively under stochastic sampling, not greedy decoding (e.g., “*However, X. Moreover, Y.*” appears in sampled but not greedy text). Once present in the training corpus, they bias the model’s distribution upward, and each generation of self-training amplifies this over-representation.
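As a rough illustration of the two competing terms, the schema in Eq. (1) can be integrated in closed form when $\alpha$, $\beta$, $d$, and $\sigma$ are held constant: $\phi_t = \phi_0 \exp((\beta\sigma - \alpha d)\,t)$. The sketch below assumes exactly that. The function names and the values of `ALPHA` and `BETA` are hypothetical, chosen only for illustration and not fitted to any data in the paper; clipping $\sigma$ to $[0,1]$ is likewise an assumption.

```python
import math

def sampling_dependence(f_greedy, f_nucleus):
    """sigma = 1 - tau with tau = f_greedy / f_nucleus; clipping to [0, 1] is an assumption."""
    if f_nucleus <= 0:
        raise ValueError("f_nucleus must be positive")
    return min(1.0, max(0.0, 1.0 - f_greedy / f_nucleus))

def sdh_trajectory(depth, sigma, alpha, beta, generations=10, phi0=1.0):
    """Closed-form solution of d(phi)/dt = (-alpha * depth + beta * sigma) * phi."""
    rate = -alpha * depth + beta * sigma
    return [phi0 * math.exp(rate * t) for t in range(generations + 1)]

# Hypothetical constants, for illustration only (not fitted to the paper's data).
ALPHA, BETA = 0.04, 0.03

surface = sdh_trajectory(depth=0, sigma=1.0, alpha=ALPHA, beta=BETA)  # net rate > 0
deep = sdh_trajectory(depth=3, sigma=0.2, alpha=ALPHA, beta=BETA)     # net rate < 0
```

With these illustrative constants the shallow, sampling-dependent feature grows every generation while the deep feature shrinks, which is the bifurcation the schema is meant to organize.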

### 3.3 Predictions

The hypothesis makes three falsifiable predictions:

1. Surface amplification. For shallow, sampling-dependent features ($d=0$, $\sigma$ large), $\dot{\phi} > 0$: their relative frequency *grows* across generations.
2. Deep death. For deep features ($d \geq 2$), regardless of $\sigma$, $\dot{\phi} \ll 0$: their relative frequency collapses.
3. Monotone group means. The mean per-generation change $\overline{\Delta\phi} \mid d$ is monotone decreasing in $d$.

A fourth prediction, the *exception that proves the rule*, follows from a boundary condition: a $d=0$ feature can have high $\sigma$ (sampling-dependent) yet still fail to amplify if its baseline rate is too low for the rich-get-richer loop to engage. Such a feature should *not* amplify and may even die. We will see (§[6.1](https://arxiv.org/html/2605.20602#S6.SS1)) that exclamation marks are exactly such a feature in GPT-2.

## 4 Experimental Setup

#### Models\.

Our primary experiment uses GPT-2 124M (Radford et al., [2019](https://arxiv.org/html/2605.20602#bib.bib21)). Cross-model replication uses Pythia-410M, Pythia-1.4B, and Pythia-2.8B (Biderman et al., [2023](https://arxiv.org/html/2605.20602#bib.bib2)), plus OPT-1.3B (Zhang et al., [2022](https://arxiv.org/html/2605.20602#bib.bib30)), spanning three architecture families and a $7\times$ parameter range (410M–2.8B).

#### Self\-training protocol\.

Starting from generation 0 (the released checkpoint), we iterate eleven generations ($t=0,\ldots,10$). At each generation we (i) sample $N=3{,}000$ texts of length $256$ tokens ($>3\times$ the corpus used by Shumailov et al. [2024](https://arxiv.org/html/2605.20602#bib.bib23); stable under subsampling to 1500, see §[8](https://arxiv.org/html/2605.20602#S8)) with nucleus sampling (top-$p=0.95$, $T=0.9$, top-$k=50$, repetition penalty $1.1$); (ii) fine-tune the previous generation’s checkpoint on its own samples for one epoch with the default Hugging Face training configuration (AdamW, lr=5e-5, weight\_decay=0.01, linear schedule, max\_grad\_norm=1.0, bf16); (iii) freeze the resulting checkpoint as generation $t+1$. We hold the prompt distribution and the decoding hyperparameters fixed across generations to isolate the effect of iterated training. We verify that the depth gradient is robust to alternative decoding strategies (greedy, ancestral, tight nucleus) in §[8](https://arxiv.org/html/2605.20602#S8). We do not mix in human data: this is the pure self-training regime studied by Shumailov et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib23)) and Dohmatob et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib4)).
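The control flow of this protocol can be sketched as a short loop. This is a minimal sketch, not the authors' code: `DecodingConfig`, `run_self_training`, and the `sample` and `finetune` callables are hypothetical names, and the actual sampling and fine-tuning are left to whatever the caller plugs in. It also assumes that "eleven generations" means ten rounds of sample-then-fine-tune on top of the released generation-0 checkpoint.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

Model = TypeVar("Model")

@dataclass(frozen=True)
class DecodingConfig:
    # Decoding settings as reported in the protocol; held fixed across generations.
    num_texts: int = 3000
    max_tokens: int = 256
    top_p: float = 0.95
    temperature: float = 0.9
    top_k: int = 50
    repetition_penalty: float = 1.1

def run_self_training(
    gen0: Model,
    sample: Callable[[Model, DecodingConfig], List[str]],
    finetune: Callable[[Model, List[str]], Model],
    config: DecodingConfig = DecodingConfig(),
    num_rounds: int = 10,
) -> List[Model]:
    """Return checkpoints for generations 0..num_rounds (11 checkpoints by default)."""
    checkpoints = [gen0]
    for _ in range(num_rounds):
        corpus = sample(checkpoints[-1], config)                 # (i) sample from the latest checkpoint
        checkpoints.append(finetune(checkpoints[-1], corpus))    # (ii)-(iii) fine-tune, then freeze
    return checkpoints
```

Passing the sampler and trainer in as callables keeps the loop testable with stubs and makes it explicit that no human data enters at any step.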

#### Features and annotation\.

The seventeen features were selected before any trajectory was computed. Each feature is annotated with structural depth $d(\phi) \in \{0,1,2,3\}$ following the operational definitions in §[3](https://arxiv.org/html/2605.20602#S3) and Table [1](https://arxiv.org/html/2605.20602#S4.T1). Two additional features with depth annotations, long words ($\geq 10$ characters, $d=0$) and ellipsis markers ($d=2$), are computed by the extraction code but excluded from the primary panel because they are frequency aggregates rather than discrete syntactic events; including them does not change the pooled correlation appreciably ($\rho=0.52$ vs. $0.54$). Features are extracted from each generation’s $3{,}000$-text corpus using spaCy en\_core\_web\_sm for parsing and custom regular-expression and lexical matchers for the feature categories. All counts are normalized to per-token rates and reported relative to generation 0.
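To make the normalization concrete, here is a minimal sketch of per-token rates for a few surface features, followed by the ratio to generation 0. It is not the paper's extraction code: the word lists are abbreviated to the examples shown in Table 1, the regex tokenizer stands in for spaCy tokenization, and the function names are hypothetical.

```python
import re

# Abbreviated, illustrative lists taken from the examples in Table 1;
# the full 20-connective list is not reproduced here.
DISCOURSE_MARKERS = {"however", "moreover", "furthermore", "nevertheless"}
HEDGES = {"perhaps", "maybe", "possibly", "somewhat"}

def per_token_rates(texts):
    """Count a few surface features and normalize by the total token count."""
    counts = {"discourse_markers": 0, "hedging": 0, "question_marks": 0, "em_dashes": 0}
    n_tokens = 0
    for text in texts:
        tokens = re.findall(r"[A-Za-z']+", text.lower())  # crude tokenizer (assumption)
        n_tokens += len(tokens)
        counts["discourse_markers"] += sum(t in DISCOURSE_MARKERS for t in tokens)
        counts["hedging"] += sum(t in HEDGES for t in tokens)
        counts["question_marks"] += text.count("?")
        counts["em_dashes"] += text.count("\u2014") + text.count("--")
    return {name: c / n_tokens for name, c in counts.items()} if n_tokens else {}

def relative_to_gen0(rates_t, rates_0):
    """Express each rate relative to its generation-0 value (features absent at gen 0 are skipped)."""
    return {k: rates_t[k] / rates_0[k] for k in rates_t if rates_0.get(k)}
```

A relative value above 1 means the feature has amplified since generation 0, and a value below 1 means it has decayed.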

#### Aggregate metrics\.

For comparison with the static fingerprint literature we additionally compute six aggregate metrics: mean dependency-tree depth, mean clause embedding (number of ccomp/xcomp/advcl/relcl per sentence), average word length in characters, type-token ratio over 100-word windows (TTR-100), hapax-legomenon ratio, and mean dependency-link length.
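Two of these aggregates are simple enough to sketch without a parser. The version below is one plausible reading, not the paper's implementation: it assumes non-overlapping windows with the trailing partial window dropped for TTR-100, and defines the hapax ratio over types rather than tokens. Both choices and the function names are assumptions.

```python
from collections import Counter

def ttr_windowed(tokens, window=100):
    """Mean type-token ratio over consecutive, non-overlapping windows."""
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(0, len(tokens) - window + 1, window)
    ]
    return sum(ratios) / len(ratios) if ratios else float("nan")

def hapax_ratio(tokens):
    """Share of token types that occur exactly once."""
    freq = Counter(tokens)
    return sum(1 for c in freq.values() if c == 1) / len(freq) if freq else float("nan")
```

Windowing keeps the ratio comparable across corpora of different sizes, since an unwindowed TTR falls mechanically as text length grows.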

#### Statistical tests\.

For the SDH test we compute the per-feature decay rate $\lambda(\phi)$ as the slope of the OLS regression of $\log \phi_t$ on $t$, then evaluate Spearman rank correlations between $\lambda(\phi)$ and (a) structural depth $d(\phi)$ and (b) baseline corpus frequency $f(\phi)$ measured at generation 0.

| Feature | $d(\phi)$ | Operationalization | $\Delta$ (%) |
| --- | --- | --- | --- |
| discourse\_markers | 0 | lexical: *however, moreover, furthermore, nevertheless, …* (20 connectives; cf. Fraser [1999](https://arxiv.org/html/2605.20602#bib.bib5)) | $+126.2$ |
| hedging | 0 | lexical: *perhaps, maybe, possibly, somewhat, …* | $+44.2$ |
| em\_dashes | 0 | punctuation: — (Unicode/ASCII variants) | $+28.6$ |
| exclamation | 0 | punctuation: ! | $-99.3$ |
| regular\_past\_ed | 1 | morphology: $V$+*ed* regulars | $+79.7$ |
| sent\_initial\_conj | 1 | sentence-initial *And, But, So, Yet, …* | $+19.0$ |
| coordination | 1 | lexical: *and, but, or, nor, yet* | $-14.4$ |
| quotes | 1 | punctuation: paired "/’ | $-14.9$ |
| colons | 1 | punctuation: : | $-64.8$ |
| semicolons | 1 | punctuation: ; | $-64.4$ |
| question\_marks | 2 | punctuation: ? | $-91.7$ |
| parentheses | 2 | punctuation: (…) | $-56.8$ |
| passive\_voice | 2 | regex: BE-aux + $V$+*ed* (regular passives) | $-55.5$ |
| irregular\_past | 2 | morphology: irregular past-tense verbs | $-52.3$ |
| relative\_clauses | 2 | dep: relcl | $-28.2$ |
| adverbial\_clauses | 2 | dep: advcl | $+1.6$ |
| subjunctive | 3 | lexico-syntactic: counterfactual / complement subjunctive | $-52.7$ |

Table 1: The seventeen features selected *a priori*, their structural depth $d(\phi)$, operationalization, and total relative change between generations 0 and 10 of self-training on GPT-2 124M.

## 5 Results

### 5.1 Bifurcation under self-training

![Refer to caption](https://arxiv.org/html/2605.20602v1/x1.png)

Figure 1: Core result for GPT-2 124M across 11 self-training generations. (a) Feature trajectories normalized to generation 0: $d=0$ features (red) amplify while $d \geq 2$ features (green, blue) collapse. (b) Group-mean percent change is monotone in structural depth. Error bars: standard error. (c) Head-to-head: depth outperforms frequency as a decay predictor.

Figure [1](https://arxiv.org/html/2605.20602#S5.F1) plots the trajectories of all seventeen features across the eleven generations, normalized to generation 0. Two regimes are immediately visible. Surface markers ($d=0$, except exclamation) climb monotonically: discourse markers more than double ($+126.2\%$), hedging grows by $+44.2\%$, em-dashes by $+28.6\%$. In contrast, $d=2$ features collapse: question marks lose $91.7\%$ of their generation-0 mass, parentheticals $56.8\%$, passive voice $55.5\%$, irregular past tense $52.3\%$. The $d=3$ subjunctive collapses by $52.7\%$. The $d=1$ band lies in between, with mixed signs.

#### Group means are monotone in depth\.

Averaging within depth strata:

$$\overline{\Delta} \mid d{=}0 = +24.9\%, \quad \overline{\Delta} \mid d{=}1 = -10.0\%, \quad \overline{\Delta} \mid d{=}2 = -47.2\%, \quad \overline{\Delta} \mid d{=}3 = -52.7\%.$$

The group-mean total change between generations 0 and 10 is monotone decreasing in $d$, exactly as the SDH predicts.
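These group means can be rechecked directly from the $\Delta$ column of Table 1. The sketch below copies the seventeen (feature, depth, change) triples from the table and averages within each depth stratum; the variable and function names are hypothetical.

```python
from collections import defaultdict

# (feature, depth, total change in % between generations 0 and 10), copied from Table 1.
TABLE_1 = [
    ("discourse_markers", 0, 126.2), ("hedging", 0, 44.2), ("em_dashes", 0, 28.6),
    ("exclamation", 0, -99.3), ("regular_past_ed", 1, 79.7), ("sent_initial_conj", 1, 19.0),
    ("coordination", 1, -14.4), ("quotes", 1, -14.9), ("colons", 1, -64.8),
    ("semicolons", 1, -64.4), ("question_marks", 2, -91.7), ("parentheses", 2, -56.8),
    ("passive_voice", 2, -55.5), ("irregular_past", 2, -52.3), ("relative_clauses", 2, -28.2),
    ("adverbial_clauses", 2, 1.6), ("subjunctive", 3, -52.7),
]

def group_means(rows):
    """Average the total change within each depth stratum."""
    by_depth = defaultdict(list)
    for _, depth, delta in rows:
        by_depth[depth].append(delta)
    return {d: sum(v) / len(v) for d, v in sorted(by_depth.items())}

means = group_means(TABLE_1)
```

The strata come out at roughly +24.9, −10.0, −47.15, and −52.7 percent, matching the reported values up to rounding. Note that the $d=0$ mean is pulled down sharply by the exclamation-mark outlier, and the $d=3$ stratum contains a single feature.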

#### Perplexity drops after an initial rise.

Across the eleven generations, the model’s perplexity on its own generated corpus moves $41.2 \to 44.5 \to 40.1 \to 32.2 \to 25.2 \to 20.9 \to 18.8 \to 16.6 \to 14.8 \to 13.7 \to 13.1$: after a brief rise at generation 1 it falls at every step, for an overall reduction of $68\%$. This self-perplexity collapse reflects the model becoming increasingly self-consistent as it trains on its own outputs. Standard collapse diagnostics would call this convergence; our feature decomposition shows it is the signature of structural impoverishment.

### 5.2 The SDH test: depth predicts decay rate

We compute the per-feature decay rate $\lambda(\phi)$ as the OLS slope of $\log \phi_t$ on $t$ across $t=0,\ldots,10$, and rank-correlate $\lambda(\phi)$ against structural depth (Figure [2](https://arxiv.org/html/2605.20602#S5.F2)).

![Refer to caption](https://arxiv.org/html/2605.20602v1/x2.png)

Figure 2: Per-feature decay rate $\lambda(\phi)$ vs. structural depth $d(\phi)$ for GPT-2. Group means (horizontal bars) decrease monotonically with depth. Positive $\lambda$ indicates amplification; negative indicates collapse.

The Spearman correlation is $\rho=0.373$ ($p=0.140$, $N=17$). Because each individual model contributes only $N=17$ features, the per-model test has only ${\sim}50\%$ power to detect $\rho=0.43$ at $\alpha=0.05$; we therefore treat the pooled mixed-effects test (§[5.6](https://arxiv.org/html/2605.20602#S5.SS6)) as the primary inferential benchmark and report per-model results for transparency. The per-model SDH test is statistically conservative. A permutation test (100,000 shuffles of depth labels) gives $p_{\text{perm}}=0.070$ for obtaining a Spearman $\rho$ as large as the observed value, and $p_{\text{mono}}=0.081$ for obtaining group means as monotone as observed. The effect size between the $d=0$ and $d=2$ groups is large (Cohen’s $d=1.16$, computed from group means of per-feature decay rates and pooled within-group standard deviations).
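A label-shuffling test of this kind can be sketched generically. The version below is an illustration, not the authors' procedure: it takes any correlation statistic as the `stat` argument (a Spearman implementation would be passed in for the test described above), treats the test as one-sided, and uses an add-one correction so the estimate is never exactly zero. The one-sidedness, the correction, and the function name are all assumptions.

```python
import random

def permutation_p_value(depths, rates, stat, n_shuffles=100_000, seed=0):
    """One-sided p-value: share of depth-label shuffles whose statistic is >= the observed one."""
    rng = random.Random(seed)
    observed = stat(depths, rates)
    shuffled = list(depths)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)                  # break the pairing between depth and feature
        if stat(shuffled, rates) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)       # add-one correction (assumption)
```

Because only the depth labels are shuffled, the test makes no distributional assumption about the decay rates and handles the heavy ties in depth without special treatment.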

#### Leave\-one\-out robustness\.

To verify that no single feature drives the correlation, we recompute $\rho$ dropping each feature in turn. The correlation remains in the same direction for all 17 leave-one-out subsets ($|\rho| \in [0.273, 0.597]$). Removing exclamation, the predicted outlier (§[6.1](https://arxiv.org/html/2605.20602#S6.SS1)), *strengthens* the correlation to $\rho=0.597$ ($p=0.015$), confirming that the signal is carried by the depth gradient, not by any single feature. The cross-model replication (§[5.6](https://arxiv.org/html/2605.20602#S5.SS6)) provides the pooled test.

### 5.3 Head-to-head: depth vs. frequency

The dominant theoretical account of model collapse (Shumailov et al., [2024](https://arxiv.org/html/2605.20602#bib.bib23); Dohmatob et al., [2024](https://arxiv.org/html/2605.20602#bib.bib4)) attributes per-feature decay to baseline frequency: rare features under-sample and die. We test this by computing the Spearman correlation between $\lambda(\phi)$ and the generation-0 corpus frequency $f(\phi)$. We find $\rho_{\text{freq}} = -0.194$ ($p=0.456$). The sign is *opposite* to the depth correlation, and the magnitude is strictly smaller. In a head-to-head comparison on this model, depth outperforms frequency as a decay predictor, though neither reaches significance individually at $N=17$; the pooled test (§[5.6](https://arxiv.org/html/2605.20602#S5.SS6)) provides the statistical power.

To formally test whether depth retains predictive power after controlling for frequency, we compute the partial Spearman correlation $\rho(\text{depth}, \text{decay} \mid \log \text{freq})$ across the pooled $N = 85$ panel. The partial correlation is $\rho_{\text{partial}} = 0.490$ ($p < 10^{-6}$), barely attenuated from the raw $\rho = 0.509$. In a mixed-effects model including standardized $\log$-frequency as a covariate alongside depth, the depth coefficient remains highly significant ($\beta_{\text{depth}} = 0.039$, $p = 0.002$) while frequency is not significant ($\beta_{\text{freq}} = 0.016$, $p = 0.24$). Adding frequency does not improve model fit ($\Delta\text{AIC} = -0.9$, favoring the simpler depth-only model). A Steiger $Z$-test for dependent overlapping correlations gives $Z = 1.89$ ($p = 0.059$), indicating depth is a marginally stronger predictor than frequency when both share the same outcome variable.
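One standard way to obtain a first-order partial rank correlation is to apply the usual partial-correlation formula to the three pairwise Spearman coefficients. The sketch below assumes that approach; the text does not state how the partial correlation was computed, and the input values in the example are hypothetical, since the depth-frequency correlation is not reported here.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of x and y controlling for z, from pairwise correlations.

    Here x = depth, y = decay, z = log frequency (all as Spearman rhos).
    """
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denom

# Hypothetical inputs: a raw depth-decay rho plus two assumed nuisance correlations.
example = partial_corr(r_xy=0.50, r_xz=0.10, r_yz=0.20)
print(round(example, 3))
```

When the control variable is only weakly related to both depth and decay, the partial coefficient stays close to the raw one, which is the pattern the paragraph above describes.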

The frequency mechanism is real (low-frequency events do under-sample), but frequency alone cannot explain the depth gradient we observe. Depth is a complementary predictor that captures structural variance that frequency does not account for.

![Refer to caption](https://arxiv.org/html/2605.20602v1/x3.png)

Figure 3: Depth vs. frequency as decay predictors (GPT-2). Each point is one of the 17 features. Depth (Spearman $\rho = 0.373$) outperforms frequency ($\rho = -0.194$), which has the wrong sign.

### 5.4 Critical pairs

The clearest evidence for SDH comes from *critical pairs*: pairs of features that are matched on lexical surface form or frequency but differ in depth.

*Past tense, regular vs. irregular.* Both are equally frequent in the gen-0 corpus and use a closed grammatical category. Regular past tense (V+*ed*, $d = 1$) *grows* by $+79.7\%$ across self-training. Irregular past tense ($d = 2$, because irregular verbs cluster in clause-final and embedded positions in our annotated sample) *collapses* by $-52.3\%$. Frequency cannot explain this divergence; depth can.

*Local vs. deep punctuation.* Em-dashes ($d = 0$, surface elaboration) grow $+28.6\%$. Parentheticals ($d = 2$, requiring displaced material) collapse $-56.8\%$. Both are punctuation; both mark elaboration; only one survives.

*Sentence-initial vs. subordinate conjunction.* Sentence-initial *And/But/So* ($d = 1$, paratactic) grows $+19.0\%$. Within-sentence coordination ($d = 1$ but more constrained, cc/conj dependencies) shrinks $-14.4\%$. The paratactic surface form survives; the syntactically integrated form attenuates.

These pairs were selected to maximize depth contrast within frequency-matched sets, and are therefore *illustrative* rather than confirmatory; the statistical tests in §[5.6](https://arxiv.org/html/2605.20602#S5.SS6)–[7](https://arxiv.org/html/2605.20602#S7) use all 17 features without selection.

![Refer to caption](https://arxiv.org/html/2605.20602v1/x4.png)

Figure 4: Critical pairs: features matched on frequency but differing in depth show opposite fates under self-training.

### 5.5 The Superficial Complexity Paradox

If we set aside the seventeen depth\-stratified features and look only at the aggregate fingerprint metrics standard in the LLM\-stylometry literature, we see the opposite story:

By every standard “complexity” proxy, the text appears to be getting *richer*. Type-token ratio rises. Words are longer. Dependency trees are deeper. A fingerprint analysis at generation 10 would report a more lexically diverse, more syntactically elaborate model than at generation 0.

This is the Superficial Complexity Paradox. Aggregate metrics that index “how complex does the text look” are *decoupled* from the per-feature structural inventory of the language. The model generates longer, more lexically varied, more nominally embedded sentences, but their embedding is parataxis (chained discourse connectives, sentence-initial conjunctions, coordinated noun phrases) rather than subordination (relative clauses, passives, embedded questions). Dependency-tree depth grows because trees are wider and chained, not because clauses are nested.

#### Qualitative illustration\.

To make the paradox concrete, we provide actual model outputs\. At generation 0, GPT\-2 produces text with varied syntactic forms: “*one who has not benefited from this is the American Heritage Foundation, whose chairman \[…\] recently told members of Congress that they ‘need to change our attitude\.’*” At generation 10, the same prompt yields: “*However, some argue that while some experts are taking steps towards addressing the problems posed by climate change rapidly over the next decade, others remain hesitant about how quickly humans can adapt rapidly enough so it can become more dangerous\.*” The latter sentence is longer, uses more formal vocabulary, and contains a discourse marker \(“however”\) and a hedging expression \(“some argue”\) — but its clause structure is shallower: a single long paratactic chain with no embedded questions, no passives, and no subordination beyond a single “while” clause\.

#### Why aggregate metrics miss the bifurcation\.

Dependency depth counts edges, not clause types. A sentence such as “*However, the model, perhaps, generates — and continues to generate — and, moreover, refines — texts.*” has high dep-tree depth, high TTR, and long words. It also has zero embedded clauses, zero passives, zero questions, and zero subjunctives. The fingerprint metrics literature (Zanotto and Aroyehun, [2024](https://arxiv.org/html/2605.20602#bib.bib29); Kobak et al., [2025](https://arxiv.org/html/2605.20602#bib.bib15); Tercon and Dobrovoljc, [2025](https://arxiv.org/html/2605.20602#bib.bib25)) implicitly assumes that aggregate complexity tracks structural complexity. Under self-training, this assumption fails.

To quantify the divergence formally, we compare the percent-change distributions of surface features ($d = 0$: discourse markers, hedges, em-dashes, sentence-initial conjunctions) versus clause-structural features ($d \geq 2$). A Mann-Whitney $U$ test confirms that the two groups differ significantly (surface mean $+162\%$ vs. clause-level mean $-10\%$; $p < 10^{-5}$, per-feature averaged; $p < 10^{-6}$ pooled). The paradox is not merely descriptive: it is a statistically robust divergence.

### 5.6 Cross-model replication

We replicate the self-training protocol on four additional models: Pythia-410M, Pythia-1.4B, and Pythia-2.8B (Biderman et al., [2023](https://arxiv.org/html/2605.20602#bib.bib2)), plus OPT-1.3B (Zhang et al., [2022](https://arxiv.org/html/2605.20602#bib.bib30)), a different architecture family trained on different data. All replication models run the full 11-generation protocol.

#### Pythia\-1\.4B\.

The depth gradient is clearly visible (Figure [6](https://arxiv.org/html/2605.20602#S5.F6)): $d = 0$ features grow by $+200\%$ on average, $d = 1$ by $+50\%$, and $d = 3$ (subjunctive) collapses by $-44\%$. The single-model SDH correlation is $\rho = 0.498$ ($p = 0.042$), significant on its own. Passive voice ($d = 2$) collapses by $-50\%$, and relative clauses ($d = 2$) decline by $-11\%$, while discourse markers ($d = 0$, $+109\%$) and hedging ($d = 0$, $+293\%$) amplify strongly. The $d = 2$ group is heterogeneous in Pythia-1.4B: its mean is $+131\%$ including all six features, but this is driven entirely by a single outlier (parentheses, $+723\%$). Excluding parentheses, the $d = 2$ mean is $+12\%$, substantially below the $d = 0$ amplification. We attribute the parentheses spike to a degenerate generation mode (mode-collapse into forum-style text with heavy bracket use). Excluding parentheses *strengthens* the depth correlation ($\rho = 0.680$, $p = 0.004$).

#### Pythia\-2\.8B\.

The largest model shows the *strongest* depth gradient: $\rho = 0.705$ ($p = 0.002$). Group means are $\{+345\%, +51\%, -3\%, +4\%\}$ for $d \in \{0, 1, 2, 3\}$. The relative ordering is preserved: $d = 0$ features amplify two orders of magnitude more than $d = 2$ features, and $d = 2$ features now show net decay ($-3\%$) with the full 11-generation data.

#### OPT\-1\.3B and Pythia\-410M\.

OPT-1.3B, trained on a different corpus than the Pythia family and using a different tokenizer, confirms the depth gradient: $\rho = 0.563$ ($p = 0.019$). Pythia-410M, the smallest replication model, yields $\rho = 0.609$ ($p = 0.010$) with perfectly monotone group means $\{+252\%, +79\%, +25\%, -71\%\}$. The consistency across architecture families and training corpora rules out dataset-specific artifacts.

![Refer to caption](https://arxiv.org/html/2605.20602v1/x5.png)

Figure 5: Cross-model comparison. (a) Group means by depth across five models. The depth gradient is consistent despite different absolute magnitudes. (b) Pooled decay rates ($N = 85$) show a highly significant depth correlation ($\rho = 0.540$, $p < 10^{-6}$).

#### Pooled cross-model test.

Combining all five models’ 17-feature panels ($N = 85$; we address pseudoreplication in §[8](https://arxiv.org/html/2605.20602#S8)) yields:

$$\rho_{\text{depth}} = 0.540 \; (p < 10^{-6}).$$

A permutation test ($10^{5}$ shuffles) confirms: $p_{\text{perm}} < 10^{-5}$. A cluster-aware bootstrap ($10^{4}$ resamples, resampling at the model level to respect the nested structure) yields a $95\%$ CI of $[0.434, 0.634]$, excluding zero. Per-feature bootstrap CIs (resampling over models, $10^{4}$ iterations) show that 12 of 17 features have CIs excluding zero; the five with wider intervals (exclamation, colons, semicolons, question_marks, parentheses) are high-variance punctuation features that do not drive the correlation. (We note that with $G = 5$ clusters the bootstrap may be anti-conservative; the mixed-effects model provides the principled inferential benchmark.)

Structural depth is a highly significant predictor of per-feature decay rate, and baseline corpus frequency is a substantially weaker one ($\rho_{\text{depth}} = 0.540$, $p < 10^{-6}$ vs. $\rho_{\text{freq}} = 0.225$, $p = 0.039$). While frequency reaches marginal significance in the pooled test (likely inflated by the repeated-features structure), depth explains more than twice as much rank-variance; the per-feature averaged analysis ($N = 17$) yields $\rho_{\text{depth}} = 0.661$ ($p = 0.004$) vs. $\rho_{\text{freq}} = 0.083$ (n.s., $p = 0.75$), confirming that depth is the primary predictor. The correlation tends to *strengthen* with model scale, though not monotonically (GPT-2: $\rho = 0.373$; Pythia-410M: $\rho = 0.609$; Pythia-1.4B: $\rho = 0.498$; OPT-1.3B: $\rho = 0.563$; Pythia-2.8B: $\rho = 0.705$), suggesting that greater model capacity sharpens rather than weakens the depth gradient.
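A cluster-aware bootstrap of the kind described above resamples whole models rather than individual feature rows, so that the 17 features of one model always travel together. The following is a minimal sketch under stated assumptions: the function names, the percentile-interval rule, and the toy `panels` are hypothetical and are not taken from the authors' code.

```python
import random
from statistics import mean

def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def cluster_bootstrap_ci(panels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the pooled rho, drawing models with replacement.

    panels maps a model name to a list of (depth, decay) pairs.
    """
    rng = random.Random(seed)
    names = list(panels)
    stats = []
    for _ in range(n_boot):
        drawn = [rng.choice(names) for _ in names]  # resample at the model level
        pooled = [pair for name in drawn for pair in panels[name]]
        stats.append(spearman([d for d, _ in pooled], [y for _, y in pooled]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical toy panels (three models, five features each).
panels = {
    "model_a": [(0, 0.1), (0, 0.4), (1, 0.3), (2, 0.7), (3, 0.9)],
    "model_b": [(0, 0.2), (0, 0.1), (1, 0.6), (2, 0.5), (3, 0.8)],
    "model_c": [(0, 0.3), (0, 0.2), (1, 0.2), (2, 0.9), (3, 0.6)],
}
lo, hi = cluster_bootstrap_ci(panels)
print(lo, hi)
```

With only a handful of clusters there are few distinct resamples, which is one reason the interval can be too narrow, as the parenthetical caveat above notes.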

![Refer to caption](https://arxiv.org/html/2605.20602v1/x6.png)

(a) GPT-2 124M

![Refer to caption](https://arxiv.org/html/2605.20602v1/x7.png)

(b) Pythia-1.4B

Figure 6: Feature-level heatmaps across generations. Each row is a feature (sorted by depth), each column a generation. Color indicates relative change from generation 0. The depth gradient is visible as a warm-to-cool transition from top ($d = 0$) to bottom ($d = 3$).

## 6 Discussion

### 6.1 Why exclamation is the exception that proves the rule

Exclamation marks have $d(\phi) = 0$: they are pure surface punctuation. SDH’s first prediction (*surface amplification*) naively applies. Yet exclamation marks die: $-99.3\%$, the single sharpest collapse in our panel. Why?

The full law is $\dot{\phi} \propto -\alpha\, d + \beta\, \sigma$. The sampling-dependence term $\sigma(\phi)$ is decisive. Discourse markers, hedges, and em-dashes have high $\sigma$ ($\approx 1$): they are produced almost exclusively under stochastic nucleus sampling and are largely absent from greedy decoding. Once present in the training corpus, their over-representation compounds. Exclamation marks also have high $\sigma$ ($\approx 1$; $\tau \approx 0$), but with a crucial difference: their nucleus-sampled rate ($f_{\text{nuc}} = 1.03$ per 1000 tokens) is marginal, barely above zero, because GPT-2’s essayistic generation mode rarely produces exclamatory text even stochastically. So while $\sigma$ is high, the feature’s baseline rate is too low to trigger the rich-get-richer amplification loop. Instead, each generation slightly under-samples exclamation, and the under-sampling compounds.

This is not a failure of SDH; it is a confirmation. A pure-frequency account would predict that the amplification of discourse markers and the death of exclamation should look similar (both $d = 0$, both shallow). They do not. A pure-depth account would predict no amplification at all. Only the joint depth + template law captures both signs.

### 6.2 Why aggregate metrics mislead

The Superficial Complexity Paradox (§[5.5](https://arxiv.org/html/2605.20602#S5.SS5)) has direct methodological implications. The current standard of practice in LLM fingerprint detection (Zanotto and Aroyehun, [2024](https://arxiv.org/html/2605.20602#bib.bib29); Kobak et al., [2025](https://arxiv.org/html/2605.20602#bib.bib15); Tercon and Dobrovoljc, [2025](https://arxiv.org/html/2605.20602#bib.bib25); Wu et al., [2024](https://arxiv.org/html/2605.20602#bib.bib28)) is to report aggregate distributional summaries: lexical diversity, dependency-tree depth, word-length distributions. Our results show that under self-training these summaries *rise* while the underlying clause inventory collapses. A detector calibrated on aggregate complexity will *lose accuracy* as the generation depth of the upstream model increases, which is exactly the regime that detectors most need to cover. The remedy is to report depth-stratified feature panels, of which the panel in Table [1](https://arxiv.org/html/2605.20602#S4.T1) is one concrete instance.

### 6.3 Implications for training-data curation

The depth-graded collapse predicts which kinds of human-written data are most valuable for re-anchoring a self-trained model. $d = 0$ features (discourse markers, hedges) recover quickly because they amplify in the synthetic loop. $d \geq 2$ features (passives, embedded questions, subjunctives) do not recover: once a generation under-produces them, the next generation is trained on a corpus that under-represents them, and the deficit compounds. Curators should therefore weight clause-structurally rich text (literary fiction, legal prose, scientific writing with embedded clauses) at far above its frequency in the wild, and should *not* weight discourse-marker-rich text (essays, op-eds, blog posts) above its natural frequency, because the model is already biased toward producing too much of it.

### 6.4 Within-group heterogeneity at $d = 2$

The SDH predicts that group means decrease with depth, and this holds for both models at the group level. However, the $d = 2$ group in Pythia-1.4B is internally heterogeneous: four of six features *grow* (question marks $+44\%$, irregular past $+18\%$, adverbial clauses $+59\%$, parentheses $+723\%$) while two *decline* (passive voice $-50\%$, relative clauses $-11\%$). In GPT-2, by contrast, all six $d = 2$ features decline except the near-flat adverbial clauses ($+1.6\%$). This suggests that $d = 2$ features are not homogeneous: their fate depends on model-specific template affinity in addition to depth. The monotone depth gradient emerges as a *statistical tendency* over the panel, not as a deterministic per-feature law. Finer depth distinctions within the $d = 2$ tier (e.g., separating parenthetical from clausal features) may reduce within-group variance in future work.

Per-feature 95% bootstrap CIs (resampling over five models) confirm this picture: 10 of 17 features have CIs that exclude zero, while 7 features (`exclamation`, `regular_past_ed`, `colons`, `semicolons`, `question_marks`, `parentheses`, and `irregular_past`) have CIs crossing zero, reflecting genuine cross-model disagreement on their direction. The group-level depth gradient is carried primarily by the majority of features with consistent sign, not by every individual feature behaving identically.

### 6.5 Effect size and unexplained variance

The pooled $\rho = 0.540$ corresponds to $\rho^{2} = 0.29$: structural depth explains approximately 29% of the rank variance in feature decay rates (95% CI: 19%–40%). The per-feature averaged correlation ($\rho = 0.661$, $\rho^{2} = 0.44$) suggests that once cross-model noise is removed, depth accounts for nearly half the variance. The remaining variance is attributable to within-depth heterogeneity (template affinity differences, model-specific dynamics), marginal frequency effects ($\rho_{\text{freq}} = 0.225$ at the pooled level), and measurement noise from imperfect feature detectors. We view 29%–44% as a substantial effect for a single structural predictor operating over a diverse 17-feature panel.
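The variance figures above follow from squaring the reported correlations and the endpoints of the bootstrap interval. The short check below uses only numbers stated in the text; the helper name `rank_variance_explained` is introduced here for illustration. Squaring the interval endpoints is valid only because both are positive.

```python
def rank_variance_explained(rho):
    """Share of rank variance associated with a Spearman correlation."""
    return rho ** 2

rho_pooled = 0.540
ci_pooled = (0.434, 0.634)   # cluster-bootstrap 95% CI for the pooled rho
rho_averaged = 0.661         # per-feature averaged correlation

pooled_r2 = rank_variance_explained(rho_pooled)
ci_r2 = tuple(rank_variance_explained(r) for r in ci_pooled)
averaged_r2 = rank_variance_explained(rho_averaged)

print(f"{pooled_r2:.2f} [{ci_r2[0]:.2f}, {ci_r2[1]:.2f}] {averaged_r2:.2f}")
# prints: 0.29 [0.19, 0.40] 0.44
```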

### 6.6 Relation to model collapse and to fingerprints

SDH bridges the model-collapse and fingerprint literatures. From the collapse side, it answers the question those papers do not ask: of all the structures in the language, which collapse, and why? Shumailov et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib23)) and Dohmatob et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib4)) provide a distributional answer (low-frequency events). We provide a structural answer (deep features), and show on this model that the structural account out-predicts the distributional one. From the fingerprint side, SDH explains *why the canonical machine-text markers are exactly the ones they are*. They are not arbitrary stylistic tics. They are the survivors and amplifiers of a depth-graded collapse that operates whether or not the upstream model has undergone explicit self-training: all five models we tested (124M–2.8B autoregressive) exhibit the same depth-graded divergence when trained on a sufficiently large share of synthetic data, and the amplifying surface markers are the very fingerprint signals the detection literature already uses.

## 7 Quantifying Template Affinity

The SDH law includes a sampling-dependence term $\sigma(\phi)$ that explains why some $d = 0$ features amplify while others (exclamation) die. We operationalize $\sigma(\phi)$ empirically by comparing feature rates under greedy decoding (which exposes only the model’s deterministic mode) to nucleus sampling (which explores the stochastic distribution). We first define the greedy-to-nucleus ratio:

$$\tau(\phi) = \frac{f_{\text{greedy}}(\phi)}{f_{\text{nucleus}}(\phi)},$$

and then $\sigma(\phi) = 1 - \min(\tau(\phi), 1)$, clipped to $[0, 1]$. A feature with $\tau \ll 1$ (hence $\sigma \approx 1$) is absent from greedy templates and depends on sampling stochasticity; one with $\tau \geq 1$ (hence $\sigma = 0$) appears readily under deterministic decoding.

We compute $\tau$ with $T = 1.0$, top-$p = 0.95$, no top-$k$ truncation, and no repetition penalty, intentionally distinct from the self-training decoder ($T = 0.9$, top-$k = 50$, repetition_penalty $= 1.1$). This canonical nucleus baseline characterizes the model’s *intrinsic* greedy-vs-stochastic gap without confounding it with the decoding interventions used during training. Because $\tau$ enters our analysis only ordinally (rank correlations and the binary $\tau \lessgtr 1$ split), the Kendall $W = 0.83$ rank stability reported below bounds the sensitivity to decoding configuration.

Table [2](https://arxiv.org/html/2605.20602#S7.T2) and Figure [7](https://arxiv.org/html/2605.20602#S7.F7) show the empirical $\tau$ values for GPT-2. Discourse markers ($\tau = 0.05$), hedging ($\tau = 0.31$), and em-dashes ($\tau = 0.00$) have low $\tau$: they are largely *absent* from greedy templates and rely on the stochastic diversity of nucleus sampling for their production.
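The definitions of $\tau$ and $\sigma$ translate directly into code. The sketch below applies them to the GPT-2 $\tau$ values quoted in this section; the function names and the choice to raise an error for a zero nucleus rate are assumptions made for illustration, and the labels follow the $\tau < 1$ vs. $\tau \geq 1$ split used in the text.

```python
def tau(f_greedy, f_nucleus):
    """Greedy-to-nucleus rate ratio (rates in features per 1000 tokens)."""
    if f_nucleus <= 0:
        # Assumption: treat a zero nucleus rate as undefined rather than infinite.
        raise ValueError("tau is undefined when the nucleus rate is zero")
    return f_greedy / f_nucleus

def sigma(tau_value):
    """Sampling dependence: 1 - min(tau, 1), which always lies in [0, 1]."""
    return 1.0 - min(tau_value, 1.0)

# GPT-2 tau values as quoted in the surrounding text.
reported_tau = {
    "discourse markers": 0.05,
    "hedging": 0.31,
    "em-dashes": 0.00,
    "exclamation": 0.00,
    "quotes": 2.48,
    "passive voice": 2.38,
    "relative clauses": 1.40,
    "parentheses": 1.12,
    "subjunctive": 1.10,
}

for name, t in reported_tau.items():
    label = "sampling-dependent" if t < 1 else "template-bound"
    print(f"{name:18s} tau={t:.2f} sigma={sigma(t):.2f} {label}")
```

Because $\sigma$ saturates at 0 for every feature with $\tau \geq 1$, it cannot distinguish among the template-bound features; only the ordering among the low-$\tau$ features carries information.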

This reveals the mechanism behind the SDH equation’s ([1](https://arxiv.org/html/2605.20602#S3.E1)) sampling-dependence term $+\beta\,\sigma$. Features with low $\tau$ (hence high $\sigma = 1 - \tau$) exist primarily in the stochastic tail of nucleus sampling; their survival requires continued sampling diversity. Under iterated self-training, nucleus sampling operates on an increasingly peaked distribution, and the set of high-probability continuations *expands* to include these features (because the training corpus over-represents them). Features with high $\sigma$ and sufficient baseline rate thus enter a rich-get-richer loop: each generation’s nucleus-sampled corpus over-represents them, and the next generation’s fine-tuning entrenches the bias further.

Exclamation marks ($\tau = 0.00$) also have near-zero template presence, yet they *die*. The distinguishing factor is their generation-0 rate: discourse markers ($f_{\text{nuc}} = 1.60$ per 1000 tokens) are common enough in nucleus-sampled output to be over-represented in each training corpus, creating a rich-get-richer loop. Exclamation ($f_{\text{nuc}} = 1.03$) starts at a marginal rate that, combined with GPT-2’s essay-genre bias, leaves it below the amplification threshold.

The features with highest $\tau$ (quotes $2.48$, passive voice $2.38$, relative clauses $1.40$, parentheses $1.12$, subjunctive $1.10$) are deep-syntactic structures that appear *more* frequently under greedy decoding than under nucleus sampling. Yet they still decay because the depth penalty ($-\alpha\, d$) overwhelms any template presence for $d \geq 2$: the multiplicative probability cost of traversing $d$ syntactic choice points compounds across generations regardless of template affinity. A partial correlation of depth against decay rate, controlling for $\tau$, yields $\rho_{\text{partial}} = 0.485$ ($p = 0.057$), confirming that depth retains predictive power independent of template affinity.

#### Joint regression\.

To formally test both terms of the SDH law simultaneously, we regress the feature decay rate on depth and $\sigma$ (z-scored; operationalized as $-\log(1 + \tau)$ for numerical stability, preserving the sign convention that higher $\sigma$ = more sampling-dependent) across the full $N = 85$ panel (17 features $\times$ 5 models), with cluster-robust standard errors on feature. We use the GPT-2 $\tau$ values as a shared covariate across all five models; cross-model $\tau$ verification (Kendall $W = 0.71$, §[7](https://arxiv.org/html/2605.20602#S7)) confirms the ranking is stable. Both predictors are significant: $\beta_{\text{depth}} = +0.032$ ($\text{SE} = 0.010$, $p = 0.001$) and $\beta_{\sigma} = -0.021$ ($\text{SE} = 0.008$, $p = 0.009$); jointly they explain 16.5% of variance ($R^{2}_{\text{adj}} = 0.144$). The negative $\beta_{\sigma}$ confirms that sampling-dependent features decay *more slowly*, consistent with the SDH equation’s $+\beta\,\sigma$ protective term. Comparing the full model to a depth-only baseline by AIC yields $\Delta\text{AIC} = +1.0$ in favor of the two-predictor model; a likelihood-ratio test gives $\chi^{2}(1) = 3.0$, $p = 0.083$. The contribution of $\sigma$ is thus statistically significant under OLS with clustered SEs but only marginal under the mixed-effects LR test, consistent with a real but secondary role for sampling dependence, exactly as the SDH equation predicts.

#### Cross-model $\tau$ generalization.

To verify that $\tau$ rankings are not GPT-2-specific, we independently compute greedy (190 continuations) and nucleus (1000 continuations) rates for all five models and compare the resulting $\tau$ vectors. Kendall’s $W = 0.71$ across all five models indicates substantial concordance; mean pairwise Spearman $\rho = 0.64$ ($\rho \in [0.53, 0.81]$). The binary classification ($\tau < 1$ vs. $\tau \geq 1$) agrees with the GPT-2 partition in 80.9% of feature–model cells (range: 70.6%–88.2%). Thus, while model-specific variation exists, the broad $\tau$ ranking (surface features exhibit low template affinity, deep-syntactic features exhibit high template affinity) is a stable property of the features themselves, not an idiosyncrasy of GPT-2.

Table 2: Greedy-to-nucleus ratio $\tau(\phi)$ and derived sampling dependence $\sigma = 1 - \min(\tau, 1)$ for selected features. $f_{\text{nuc}}$: rate under nucleus sampling; $f_{\text{gre}}$: rate under greedy decoding. Features per 1000 tokens. Low $\tau$ (high $\sigma$) indicates the feature depends on sampling stochasticity. Rank stability across 10 half-splits: Kendall $W = 0.83$, mean $\rho = 0.89$.

![Refer to caption](https://arxiv.org/html/2605.20602v1/x8.png)

Figure 7: Greedy-to-nucleus ratio $\tau(\phi)$ by feature. $d = 0$ features (red) cluster near zero (high sampling dependence $\sigma \approx 1$), meaning they are absent from greedy templates. Deep features (green, blue) have $\tau > 1$ ($\sigma = 0$) but still decay due to the dominant depth penalty.

## 8 Robustness

#### Sensitivity to depth assignments.

Moving irregular past tense from $d = 2$ to $d = 1$ (its morphological complexity argues for either placement) leaves the pooled correlation highly significant: the five-model result remains at $p < 0.001$ under either assignment. All headline numbers (abstract, cross-model test) use the a priori assignment $d = 2$ throughout; the $d = 1$ variant is tested solely as a robustness check.

#### Mixed\-effects model\.

To address potential pseudoreplication in the pooled $N = 85$ test (the same 17 features measured across 5 models), we fit a linear mixed-effects model with feature as a random intercept. (Adding model as a crossed random effect yields a singular fit due to insufficient group count ($G = 5$); the per-feature averaging below provides the model-level correction.) The depth coefficient is $\beta_{\text{depth}} = 0.047$ ($p < 0.001$; fitted with ML; REML yields $\beta = 0.047$, $p < 0.001$), confirming that the depth–decay association is robust to the nested structure. As a further check, averaging decay rates per feature across all five models yields $\rho = 0.661$ ($p = 0.004$, $N = 17$), consistent with the pooled result.

#### Fisher combination\.

Fisher’s method for combining the five per-model $p$-values yields $p_{\text{Fisher}} < 0.0001$ ($\chi^{2} = 40.1$, $\text{df} = 10$). Per-model values: GPT-2 $p = 0.140$, Pythia-1.4B $p = 0.042$, Pythia-2.8B $p = 0.002$, OPT-1.3B $p = 0.019$, Pythia-410M $p = 0.009$. Two of five models remain individually significant after Holm correction (Pythia-2.8B: $p_{\text{adj}} = 0.010$; Pythia-410M: $p_{\text{adj}} = 0.036$). We note that Fisher’s method assumes independence of the per-model tests; since the three Pythia models share pretraining data (The Pile), this assumption is violated and the Fisher $p$-value may be anti-conservative. GPT-2 (WebText) and OPT-1.3B (different corpus) provide genuinely independent replications. The mixed-effects model above provides a principled alternative that accounts for the nested structure without requiring independence. Four of five models are individually significant before correction, and the combined evidence is strong.
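The Fisher statistic and the Holm-adjusted values can be recomputed from the five per-model $p$-values listed above. The sketch below is a stdlib-only illustration, not the authors' code: the function names are introduced here, and the chi-square tail probability uses the closed form that holds for even degrees of freedom, which is always the case for Fisher's method. Because the inputs are rounded to three decimals, the statistic agrees with the reported $\chi^{2} = 40.1$ only up to rounding.

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: returns (chi-square statistic, df, combined p-value)."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)                      # df = 2k, always even
    half = stat / 2.0
    # Chi-square survival function for even df: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    p_combined = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return stat, 2 * k, p_combined

def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvalues[i])  # enforce monotonicity
        adjusted[i] = min(1.0, running)
    return adjusted

# Per-model p-values as reported in the text.
models = ["GPT-2", "Pythia-1.4B", "Pythia-2.8B", "OPT-1.3B", "Pythia-410M"]
pvals = [0.140, 0.042, 0.002, 0.019, 0.009]

stat, df, p_fisher = fisher_combine(pvals)
adjusted = holm(pvals)
print(f"chi2={stat:.2f} df={df} p={p_fisher:.1e}")
for name, p_adj in zip(models, adjusted):
    print(f"{name:12s} p_adj={p_adj:.3f}")
```

The Holm step multiplies the smallest $p$-value by 5, the next by 4, and so on, which reproduces the adjusted values of 0.010 and 0.036 quoted for Pythia-2.8B and Pythia-410M.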

#### Sampling ablation\.

We replicate the self-training loop under three alternative decoding configurations (5 generations each): (i) *ancestral sampling* ($T = 1.0$, no truncation), (ii) *greedy decoding*, and (iii) *tight nucleus* ($T = 0.7$, $p = 0.9$, top-$k = 50$). Under ancestral sampling, the $d = 2$ group decays $-26.5\%$ while $d = 0$ and $d = 1$ each decay only ${\sim}{-}11\%$, preserving the depth gradient. Under tight nucleus, nearly all features decay, including $d = 0$, because the restricted sampling suppresses the stochastic-amplification mechanism that normally boosts surface markers. This confirms the SDH prediction that bifurcation requires both the depth penalty *and* the sampling-dependence bonus: remove the stochastic diversity (by restricting sampling toward the mode), and only uniform decay remains.

#### Multi\-seed variance\.

We replicate the 6-generation GPT-2 experiment with three random seeds (42, 123, 456). The SDH correlation is positive in all three runs ($\rho \in [0.09, 0.18]$, $\overline{\rho} = 0.141$, $\text{SD} = 0.046$). The effect is weaker than in the 11-generation primary experiment ($\rho = 0.373$). For comparison, the primary GPT-2 run evaluated at generation 6 alone yields $\rho = 0.420$ (higher than the multi-seed mean), indicating that seed variability at short horizons, rather than generation count per se, is the dominant source of magnitude difference. Cumulative $\rho$ remains below significance ($p > 0.05$) through generation 10 in GPT-2 alone ($N = 17$ features provides limited power for a single model). The signal emerges reliably only via cross-model pooling ($N = 85$, significant at any generation $\geq 3$; mean pooled $\rho$ rises from $0.22$ at generation 1 to ${\sim}0.54$ by generation 5 and plateaus thereafter) or with ${\geq}10$ generations per model. For practitioners, this means SDH should *not* be used as an early-warning diagnostic on a single short run; rather, it describes the long-term structural trajectory that differentiates shallow from deep feature fates. The key finding is *consistency of sign*: no seed produces a negative correlation.

#### Prompt\-sensitivity check\.

Our 32 prompts are uniformly declarative, which suppresses `question_marks` and `exclamation` in the generation-0 baseline. To verify the depth–decay correlation is not driven by these two prompt-sensitive features, we recompute the pooled $\rho$ excluding both: $\rho = 0.567$ ($p < 10^{-6}$, $N = 75$). The correlation *strengthens*, indicating the depth gradient does not depend on punctuation features whose baseline is suppressed by declarative prompts.

#### Template affinity stability\.

The sampling-dependence term $\sigma(\phi)$ is operationalized via the greedy-to-nucleus ratio $\tau(\phi)$, computed from 190 prompts. To test sensitivity to prompt selection, we simulate 10 random half-splits (95 prompts each) and measure the stability of the $\tau$ rank ordering. Kendall's $W$ is $0.83$, and the mean Spearman $\rho$ between half-split and full-set rankings is $0.89$ (range $[0.77, 0.97]$), indicating that $\tau$'s rank ordering is highly stable under prompt subsampling. The binary classification (sampling-dependent: $\tau < 1$ vs. template-bound: $\tau > 1$) is perfectly preserved across all splits.
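The half-split check can be sketched as follows. This is a minimal illustration, not the authors' script: the per-prompt counts are synthetic, the feature-to-ratio mapping is made up, and the function names (`tau`, `kendalls_w`, `half_split_w`) are hypothetical. Kendall's $W$ is computed with the standard no-ties formula $W = 12S / (m^2(n^3 - n))$ for $m$ rankings of $n$ items.

```python
import random


def tau(greedy, nucleus, prompt_ids):
    """Greedy-to-nucleus frequency ratio over a subset of prompts."""
    return sum(greedy[i] for i in prompt_ids) / sum(nucleus[i] for i in prompt_ids)


def ranks(values):
    """1-based ranks; assumes no exact ties among the values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def kendalls_w(rank_rows):
    """Kendall's coefficient of concordance for m rankings of n items (no tie correction)."""
    m, n = len(rank_rows), len(rank_rows[0])
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def half_split_w(greedy, nucleus, n_prompts, n_splits=10, seed=0):
    """Rank features by tau on random half-splits and return Kendall's W across splits."""
    rng = random.Random(seed)
    features = sorted(greedy)
    rows = []
    for _ in range(n_splits):
        half = rng.sample(range(n_prompts), n_prompts // 2)
        rows.append(ranks([tau(greedy[f], nucleus[f], half) for f in features]))
    return kendalls_w(rows)


# Synthetic data: 190 prompts, four features with made-up underlying ratios.
N_PROMPTS = 190
data_rng = random.Random(1)
true_ratio = {"hedging": 0.4, "discourse_markers": 0.7,
              "passive_voice": 1.3, "relative_clauses": 1.8}
nucleus = {f: [data_rng.randint(5, 15) for _ in range(N_PROMPTS)] for f in true_ratio}
greedy = {f: [nucleus[f][i] * r * data_rng.uniform(0.9, 1.1) for i in range(N_PROMPTS)]
          for f, r in true_ratio.items()}

w = half_split_w(greedy, nucleus, N_PROMPTS)
print(f"Kendall's W across 10 half-splits: {w:.2f}")
```

Because the toy ratios are cleanly separated, every half-split yields the same ordering; real counts would give a value below 1, as in the reported $W$.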

#### Held\-out feature validation\.

To guard against post-hoc feature selection, we implement two validation approaches. *Split-half cross-validation*: we randomly partition the 17 features into train/test sets (60/40) and compute the Spearman correlation on the held-out test set (1,000 splits per model). The median test $\rho$ across five models is $0.549$, with $94\%$ of splits yielding a positive correlation. *Novel features*: we prospectively define five features not in the primary panel (gerund phrases, $d = 1$; infinitival *to*, $d = 1$; appositives, $d = 2$; complement clauses, $d = 2$; cleft constructions, $d = 3$; see `run_heldout_validation.py`). Depth labels were committed before measuring decay. All three models with available text data show positive depth–decay correlations on these held-out features ($\rho \in [0.16, 0.53]$), though with $N = 5$ features, individual significance is not achievable. The commitment is self-asserted rather than externally registered on a platform such as OSF or AsPredicted.
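A minimal sketch of the split-half procedure for a single model is given below. It is not the contents of `run_heldout_validation.py`: the depth labels and decay rates are synthetic, the 40% test share is rounded to 7 of 17 features (an assumption), and Spearman's $\rho$ is implemented by hand with average ranks so the example needs only the standard library.

```python
import math
import random
from statistics import median


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # undefined when one variable is constant; treated as no association
    return cov / math.sqrt(vx * vy)


def split_half_rhos(depth, decay, n_splits=1000, test_frac=0.4, seed=0):
    """Spearman rho on a random held-out subset of features, repeated n_splits times."""
    rng = random.Random(seed)
    n_test = round(test_frac * len(depth))
    rhos = []
    for _ in range(n_splits):
        test = rng.sample(range(len(depth)), n_test)
        rhos.append(spearman([depth[i] for i in test], [decay[i] for i in test]))
    return rhos


# Synthetic 17-feature panel: decay rises with depth plus bounded noise.
data_rng = random.Random(7)
depth = [0] * 5 + [1] * 5 + [2] * 6 + [3] * 1
decay = [0.2 * d + data_rng.uniform(-0.1, 0.1) for d in depth]

rhos = split_half_rhos(depth, decay)
frac_positive = sum(r > 0 for r in rhos) / len(rhos)
print(f"median test rho = {median(rhos):.3f}, positive in {frac_positive:.0%} of splits")
```

Since no parameters are fitted on the training portion, the split acts as repeated subsampling of features; it tests whether the correlation survives on arbitrary subsets rather than relying on a few influential features.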

## 9 Conclusion

Self-training does not flatten language. It restructures it along a depth gradient. Across eleven generations of GPT-2 self-training, surface markers amplify (discourse connectives more than double, hedges grow by $+44\%$, em-dashes by $+29\%$) while the deep syntactic inventory collapses: question marks lose $92\%$ of their mass, parentheticals $57\%$, passive voice $56\%$, irregular past tense $52\%$, subjunctives $53\%$. Group means by structural depth are monotone: $\{+24.9\%, -10.0\%, -47.2\%, -52.7\%\}$ for $d \in \{0, 1, 2, 3\}$. In a head-to-head test, structural depth is a significant predictor of decay rate while baseline frequency is not. Aggregate fingerprint metrics preferred in the stylometry literature (dep-tree depth, TTR, average word length) all *rise* during this collapse, masking the bifurcation that the per-feature analysis exposes; we call this the Superficial Complexity Paradox. The Structural Depth Hypothesis formalizes the observed dynamics as a competition between a depth-graded decay term and a sampling-dependence amplification term. It predicts that exclamation marks, a $d = 0$ feature with marginal stochastic production rate, should die rather than amplify, and this prediction is confirmed: exclamations collapse by $99\%$.
Cross-model replication on four additional models, Pythia-410M ($\rho = 0.609$), OPT-1.3B ($\rho = 0.563$), Pythia-1.4B ($\rho = 0.498$), and Pythia-2.8B ($\rho = 0.705$), confirms the depth gradient across architecture families; a mixed-effects model on the pooled $N = 85$ dataset confirms the association ($\beta_{\text{depth}} = 0.047$, $p < 0.001$); a joint regression confirms the secondary role of template affinity ($\beta_{\sigma} = -0.021$, $p = 0.009$), while frequency is a weaker predictor ($\rho = 0.225$). The implication for both literatures is that distributional and stylometric summaries are insufficient. Iterated self-training is a structural process, and what it touches first is not the rare but the deep.

## Limitations

#### Model scale\.

Our experiments span GPT-2 124M through Pythia-2.8B, a $22\times$ scale range across three architecture families (GPT-2, Pythia, OPT). We observe that the depth correlation is positive in all five models, with per-model $\rho$ ranging from $0.373$ (GPT-2) to $0.705$ (Pythia-2.8B), but all models are small by contemporary standards. The specific rates of amplification and collapse may differ for models at the 7B+ scale, where greater capacity could sustain deeper structures for more generations. Our results establish the *direction* and *ordering* of the depth gradient across scales and architecture families, not the precise thresholds.

#### English only\.

All experiments use English text. Structural depth is defined over English dependency grammar; the operationalization of $d(\phi)$ would need to be re-derived for typologically different languages (e.g., verb-final or polysynthetic languages where clause embedding is expressed morphologically rather than syntactically).

#### Depth annotation granularity\.

Our four-level depth scale ($d \in \{0, 1, 2, 3\}$) is coarse, and the $d = 3$ stratum contains only one feature (subjunctive); claims about $d = 3$ should be read as illustrative rather than statistically grounded at the group level. Excluding subjunctive entirely, the pooled correlation remains highly significant ($\rho = 0.489$, $p < 10^{-5}$, $N = 80$) with monotone group means ($d = 0 > d = 1 > d = 2$), confirming that the depth gradient does not depend on the single $d = 3$ feature. A finer-grained scale (e.g., fractional depths derived from average dependency-tree positions) might yield stronger correlations but would also introduce annotation noise. We chose discrete levels for interpretability and a priori clarity.

#### Sampling dependence\.

We operationalize $\sigma(\phi)$ via the greedy-to-nucleus ratio $\tau = f_{\text{greedy}} / f_{\text{nucleus}}$ (§[7](https://arxiv.org/html/2605.20602#S7)), which captures sampling dependence at the output level. Cross-model verification (Kendall $W = 0.71$, mean pairwise $\rho = 0.64$) confirms the ranking is not model-specific, though model-level variation exists. A deeper operationalization from model internals (e.g., attention entropy or logit-landscape analysis) remains for future work.

#### Prompt distribution\.

All 32 generation prompts are declarative sentence starters. This design choice ensures controlled conditions across generations but suppresses interrogative and exclamatory features at baseline. The absolute decay magnitudes for `question_marks` and `exclamation` should be interpreted as upper bounds; however, removing these features strengthens the depth–decay correlation (§[8](https://arxiv.org/html/2605.20602#S8)), so the core finding is not confounded by prompt choice.

#### Genre convergence\.

Under repeated self-training, text converges toward the dominant genre of the prompt distribution (here, essay-like prose). Features associated with other genres (questions with dialogue, exclamation with informal writing) may decay partly for genre reasons independent of structural depth. Three independent controls bound this confound: (a) critical pairs (§[5.4](https://arxiv.org/html/2605.20602#S5.SS4)) compare features *within the same essay register* (em-dash vs. parenthetical; regular vs. irregular past), isolating depth from genre; (b) excluding the two most genre-sensitive features (question marks, exclamations) *strengthens* the pooled correlation to $\rho = 0.567$ (§[8](https://arxiv.org/html/2605.20602#S8)); (c) cross-model replication includes OPT-1.3B and Pythia models trained on non-essay-dominated mixtures (CC-News, Reddit, BookCorpus), where the depth gradient persists ($\rho \in [0.498, 0.705]$). Genre convergence may contribute to absolute magnitudes but cannot account for the depth-stratified pattern.

#### Pseudoreplication and feature correlation\.

The pooled $N = 85$ test counts the same 17 features five times (once per model). While we address this with a mixed-effects model ($p < 0.001$) and per-feature averaging ($\rho = 0.661$, $p = 0.004$; §[8](https://arxiv.org/html/2605.20602#S8)), both of which remain highly significant, the effective degrees of freedom are lower than $N = 85$ suggests. Note that the three Pythia models share training data (The Pile; Gao et al., [2020](https://arxiv.org/html/2605.20602#bib.bib6)); their depth correlations are therefore not fully independent. GPT-2 (WebText) and OPT-1.3B (a distinct mixture including BookCorpus, CC-News, and Reddit) provide genuinely independent replications across different training distributions.

Additionally, features within the same depth tier may co-vary across models. The mean within-depth correlation of decay rates is $r = 0.24$ (36 pairs), versus $r = 0.16$ across depths (100 pairs). While the difference is modest, it implies the effective number of independent feature-level observations is lower than 17. The mixed-effects model with feature as random intercept provides the principled correction for this structure.

#### Statistical power\.

With $N = 17$ features per model, individual model tests have approximately 50% power to detect $\rho = 0.43$ at $\alpha = 0.05$. This explains why GPT-2's per-model test is non-significant ($p = 0.140$) despite a positive effect: the test is underpowered. This motivates our pooled and mixed-effects approaches, which aggregate across five models to achieve adequate power ($N = 85$, power $> 99\%$ for $\rho = 0.54$).
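These power figures can be sanity-checked with a Fisher $z$ approximation. The sketch below is an assumption about method, since the text does not say how power was computed: the approximation is derived for Pearson correlations and is only a rough guide for Spearman's $\rho$, and the result for small $N$ depends noticeably on whether the test is one- or two-sided. The function name `approx_power` is illustrative.

```python
import math
from statistics import NormalDist


def approx_power(rho, n, alpha=0.05, two_sided=True):
    """Approximate power to detect a correlation rho with n observations.

    Uses the Fisher z transform: atanh(r) is treated as normal with
    standard error 1 / sqrt(n - 3). This is an approximation, not an exact test.
    """
    norm = NormalDist()
    z_effect = math.atanh(rho) * math.sqrt(n - 3)
    if two_sided:
        z_crit = norm.inv_cdf(1 - alpha / 2)
        return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)
    z_crit = norm.inv_cdf(1 - alpha)
    return norm.cdf(z_effect - z_crit)


for n, rho in [(17, 0.43), (85, 0.54)]:
    print(f"N={n}, rho={rho}: "
          f"two-sided {approx_power(rho, n):.2f}, "
          f"one-sided {approx_power(rho, n, two_sided=False):.2f}")
```

Under this approximation the single-model case sits near the middle of the power range while the pooled case is close to 1, which matches the qualitative contrast drawn above.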

#### Feature detection precision\.

Several feature detectors use surface heuristics rather than full parsing. The `regular_past_ed` detector (suffix matching) admits non-verbal false positives; the `passive_voice` detector (regex) captures only regular-form passives, missing irregular participles (estimated recall ${\sim}31\%$). To validate detector accuracy, four annotators (three with linguistics training, one with NLP expertise) independently annotated a stratified sample of 30 texts (3 models $\times$ 3 generations $\times$ ${\sim}3$ texts per cell) on six ambiguity-prone features: `passive_voice`, `relative_clauses`, `adverbial_clauses`, `hedging`, `discourse_markers`, and `regular_past_ed`. Annotators received a codebook with definitions and boundary examples; trivially verifiable punctuation counts (semicolons, exclamation marks, etc.) were excluded as they require no human judgment. Inter-annotator reliability was high: mean ICC(2,1) $= 0.96$ across the six features (range $0.93$–$1.00$), indicating near-perfect agreement. After confirming agreement, the annotation was extended to 178 texts (stratified across 3 models $\times$ 3 generations: Gen 0, 5, 10); consensus counts (median) served as the gold standard against which automated detectors were evaluated (full results in supplementary `annotation_gold.json`). Mean precision is $0.78$, mean recall $0.70$, mean $F_1 = 0.69$ across the full 17-feature panel. Crucially, detector accuracy is *stable across generations*: mean $F_1$ varies by only $0.04$ between Gen 0 ($F_1 = 0.72$) and Gen 10 ($F_1 = 0.68$), confirming that observed decay rates reflect true feature changes rather than detector sensitivity drift. These limitations add measurement noise but are unlikely to *create* a spurious depth–decay correlation: noise attenuates correlations rather than inflating them.
Excluding the two lowest-$F_1$ features (`subjunctive`, `passive_voice`) from the pooled test leaves the correlation significant ($\rho = 0.520$, $p < 10^{-5}$, $N = 75$).
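One way to score a count-based detector against median consensus counts is sketched below. The matching rule is an assumption: the text does not specify how detector counts were aligned with annotator counts, so this example credits the overlap `min(predicted, gold)` per text as true positives. All numbers are invented, and the function names are hypothetical.

```python
from statistics import median


def consensus_counts(counts_by_annotator):
    """Median count per text across annotators (one inner list per annotator)."""
    return [median(per_text) for per_text in zip(*counts_by_annotator)]


def count_prf(predicted, gold):
    """Precision, recall, and F1 where per-text overlap min(pred, gold) counts as true positives."""
    tp = sum(min(p, g) for p, g in zip(predicted, gold))
    pred_total, gold_total = sum(predicted), sum(gold)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: four annotators counting passives in three texts.
annotations = [
    [3, 0, 2],
    [3, 1, 2],
    [4, 0, 2],
    [3, 0, 1],
]
gold = consensus_counts(annotations)
detector = [1, 0, 2]  # a regex detector that misses some passives in the first text

precision, recall, f1 = count_prf(detector, gold)
print(f"gold={gold} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Computing the same scores separately for each generation would expose any drift in detector sensitivity, which is the stability property the paragraph above relies on.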

#### Human\-text fine\-tuning control\.

To confirm that the depth gradient is specific to self-training rather than a generic effect of continued fine-tuning, we run a matched control: GPT-2 124M undergoes the same 11-generation protocol, but at each generation is fine-tuned on fresh human text from OpenWebText (3,000 texts per generation, identical hyperparameters). The result is unambiguous: $\rho_{\text{depth}} = 0.039$ ($p = 0.88$, $N = 17$), effectively zero. All per-feature decay rates are near zero (range $-0.022$ to $+0.046$), with no systematic depth stratification. The depth gradient is absent when the training signal comes from human text, confirming that the recursive amplification loop of self-training, not generic fine-tuning dynamics, drives the SDH pattern. (This control uses GPT-2 only; extending to larger models is future work, though the contrast between $\rho = 0.039$ and the self-training $\rho = 0.373$ on the same architecture is clear.)

#### Pure self\-training regime\.

We study the pure self\-training loop without mixing human data\. Real\-world contamination scenarios involve partial synthetic data\. The depth gradient should still hold directionally, but the absolute magnitudes will depend on the mixing ratio\.

## Ethics Statement

This work studies a failure mode of language models — structural collapse under self\-training — with the aim of informing the research community about risks to linguistic diversity in model\-generated text\. We use only publicly available pretrained models \(GPT\-2, Pythia\-410M, Pythia\-1\.4B, Pythia\-2\.8B, OPT\-1\.3B\) and generate synthetic text for analysis purposes only\. No human subjects are involved\. Our results suggest that unchecked recursive training on synthetic data degrades the structural richness of the language, which has implications for downstream applications in education, accessibility, and cultural preservation\. We note that knowledge of which features amplify vs\. decay could theoretically inform detection\-evasion strategies; however, the same knowledge is more directly useful for improving detectors and designing robust training pipelines\. We advocate for depth\-aware data curation as a mitigation strategy\.

## References

- Alemohammad et al. (2023) Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G Baraniuk. 2023. Self-consuming generative models go mad. *arXiv preprint arXiv:2307.01850*.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, and 1 others. 2023. Pythia: A suite for analyzing large language models across training and scaling.
- Briesch et al. (2023) Martin Briesch, Dominik Sobania, and Franz Rothlauf. 2023. Large language models suffer from their own output: An analysis of the self-consuming training loop. *arXiv preprint arXiv:2311.16822*.
- Dohmatob et al. (2024) Elvis Dohmatob, Yunzhen Feng, Pin-Yu Yang, François Charton, and Julia Kempe. 2024. A tale of tails: Model collapse as a change of scaling laws. In *International Conference on Machine Learning (ICML)*.
- Fraser (1999) Bruce Fraser. 1999. What are discourse markers? *Journal of Pragmatics*, 31(7):931–952.
- Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, and 1 others. 2020. The pile: An 800gb dataset of diverse text for language modeling. *arXiv preprint arXiv:2101.00027*.
- Gerstgrasser et al. (2024) Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Nie, Mankun Tan, and 1 others. 2024. Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data. In *International Conference on Machine Learning (ICML)*.
- Gibson (2000) Edward Gibson. 2000. The dependency locality theory: A distance-based theory of linguistic complexity. *Image, Language, Brain*, pages 95–126.
- Grigoreva et al. (2025) Alla Grigoreva, Catherine Stinson, and Derek Muise. 2025. Analysis of linguistic effects of self-consuming training. In *IEEE International Conference on Foundation and Large Language Models (FLLM)*.
- Guo et al. (2024) Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. 2024. The curious decline of linguistic diversity: Training language models on generated text. In *Findings of the Association for Computational Linguistics: NAACL*.
- Hale (2001) John Hale. 2001. A probabilistic earley parser as a psycholinguistic model. In *Proceedings of NAACL*, pages 159–166.
- Herel and Mikolov (2024) David Herel and Tomáš Mikolov. 2024. Collapse of self-trained language models. *arXiv preprint arXiv:2404.02305*.
- Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In *International Conference on Learning Representations*.
- Juzek and Ward (2025) Tom Juzek and Adrian Ward. 2025. Why does chatgpt “delve” so much? exploring the sources of lexical overrepresentation in large language models. In *COLING*.
- Kobak et al. (2025) Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, and Jan Lause. 2025. Delving into llm-assisted writing in biomedical publications through excess vocabulary. *Science Advances*.
- Levy (2008) Roger Levy. 2008. Expectation-based syntactic comprehension. *Cognition*, 106(3):1126–1177.
- Liang et al. (2024) Weixin Liang and 1 others. 2024. Monitoring ai-modified content at scale: A case study on the impact of chatgpt on ai conference peer reviews. In *International Conference on Machine Learning (ICML)*.
- Mitchell et al. (2023) Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. 2023. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In *International Conference on Machine Learning (ICML)*.
- Padmakumar and He (2024) Vishakh Padmakumar and He He. 2024. Does writing with language models reduce content diversity? In *International Conference on Learning Representations (ICLR)*.
- Peterson and Christiano (2025) Jared C. Peterson and Paul Christiano. 2025. Knowledge collapse in language models. *arXiv preprint arXiv:2509.04796*.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- Seddik et al. (2024) Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. 2024. How bad is training on synthetic data? a statistical analysis of language model collapse. *arXiv preprint arXiv:2404.05090*.
- Shumailov et al. (2024) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. Ai models collapse when trained on recursively generated data. *Nature*, 631:755–759.
- Sourati et al. (2025) Zhijing Sourati, Zhijing Jin, Bernhard Schölkopf, and Mrinmaya Sachan. 2025. The shrinking landscape of linguistic diversity. *arXiv preprint arXiv:2502.11266*.
- Tercon and Dobrovoljc (2025) Andraz Tercon and Kaja Dobrovoljc. 2025. Linguistic characteristics of ai-generated text: A survey. *arXiv preprint arXiv:2510.05136*.
- Vanmassenhove (2025) Eva Vanmassenhove. 2025. Losing our tail, again: (un)natural selection & multilingual llms. *arXiv preprint arXiv:2507.03933*.
- Welleck et al. (2020) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In *International Conference on Learning Representations*.
- Wu et al. (2024) Yiwei Wu and 1 others. 2024. A corpus-based multidimensional analysis of linguistic features between human and chatgpt text. *Applied Linguistics*.
- Zanotto and Aroyehun (2024) Francesco Zanotto and Segun Taofeek Aroyehun. 2024. Human variability vs. machine consistency: A comprehensive multi-domain analysis of linguistic fingerprints. *arXiv preprint arXiv:2412.03025*.
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, and 1 others. 2022. Opt: Open pre-trained transformer language models. *arXiv preprint arXiv:2205.01068*.

## Appendix A Reproducibility Details

#### Compute\.

All experiments were run on NVIDIA A10G 24 GB GPUs via Amazon SageMaker\. GPT\-2 124M completes 11 generations in approximately 8 hours; Pythia\-410M and Pythia\-1\.4B each require approximately 15 hours; OPT\-1\.3B approximately 15 hours; Pythia\-2\.8B requires approximately 18 hours with gradient checkpointing and 8\-bit AdamW enabled\. Total compute cost for all five models is under $400 at standard cloud rates\.

#### Hyperparameters\.

Fine-tuning uses AdamW with learning rate $5 \times 10^{-5}$, batch size 8 (GPT-2) or 2 with gradient accumulation 8 (Pythia-1.4B), for 1 epoch per generation. Primary random seed: 42 (multi-seed ablation uses 42, 123, 456). Text generation uses nucleus sampling with $p = 0.95$, $T = 0.9$, top-$k = 50$, and repetition penalty $1.1$. Corpus size is 3,000 texts of 256 tokens per generation. The 256-token length follows Shumailov et al. ([2024](https://arxiv.org/html/2605.20602#bib.bib23)) for comparability; longer texts would introduce coherence degradation as a confound.

#### Feature extraction\.

Linguistic features are extracted using spaCy `en_core_web_sm` v3.7.1 (CNN-based pipeline). We verified on a 500-sentence subsample that switching to `en_core_web_trf` does not change the sign or rank ordering of any feature's trajectory. Counts are normalized per token and reported relative to generation 0 values. Feature analysis uses a random sample of 2,000 texts per generation for computational tractability.
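The normalization step, together with the log-slope decay rate mentioned under Statistical testing, can be sketched as follows. The sign convention (decay rate as the *negative* least-squares slope of the log relative rate, so that positive values mean decay) is an assumption, as are the function names and the toy counts; the sketch also assumes every count is positive, since the log of zero is undefined.

```python
import math


def relative_trajectory(counts, tokens):
    """Per-token feature rate at each generation, divided by the generation-0 rate."""
    rates = [c / t for c, t in zip(counts, tokens)]
    return [r / rates[0] for r in rates]


def decay_rate(trajectory):
    """Negative least-squares slope of log(relative rate) against generation index."""
    logs = [math.log(v) for v in trajectory]
    gens = list(range(len(logs)))
    g_mean = sum(gens) / len(gens)
    l_mean = sum(logs) / len(logs)
    slope = (sum((g - g_mean) * (l - l_mean) for g, l in zip(gens, logs))
             / sum((g - g_mean) ** 2 for g in gens))
    return -slope


# Toy feature whose raw count halves every generation at a fixed corpus size.
counts = [800, 400, 200, 100]
tokens = [100_000] * len(counts)

trajectory = relative_trajectory(counts, tokens)
print("relative trajectory:", trajectory)
print("decay rate:", round(decay_rate(trajectory), 3))
```

A feature that halves each generation has a decay rate of $\ln 2$ per generation under this convention, and an amplifying feature gets a negative value.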

#### Statistical testing\.

Spearman correlations are computed using `scipy.stats.spearmanr`. Permutation tests shuffle depth labels $10^{5}$ times. Cohen's $d$ is computed as the difference in group means of per-feature decay rates (log-slopes) divided by the pooled standard deviation within each group (GPT-2 single-model, depth groups $d = 0$ vs. $d = 2$).
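The permutation test and effect size can be illustrated with a standard-library sketch. It is hedged in several ways: Spearman's $\rho$ is re-implemented by hand here as a stand-in for `scipy.stats.spearmanr`, the example uses 2,000 shuffles rather than $10^{5}$ to keep it fast, the two-sided $p$-value with a $+1$ correction is one common convention rather than a documented choice, and the depth labels and decay rates are invented.

```python
import math
import random
from statistics import mean, variance


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


def permutation_p(depth, decay, n_perm=2000, seed=42):
    """Two-sided p-value from shuffling depth labels against fixed decay rates."""
    rng = random.Random(seed)
    observed = spearman(depth, decay)
    shuffled = list(depth)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(shuffled, decay)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Invented per-feature data: depth labels and decay rates (log-slopes).
depth = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]
decay = [-0.05, -0.02, 0.00, 0.01, 0.02, 0.03, 0.05, 0.04, 0.08, 0.10, 0.09, 0.12, 0.11]

rho_obs, p_value = permutation_p(depth, decay)
d_effect = cohens_d([v for k, v in zip(depth, decay) if k == 2],
                    [v for k, v in zip(depth, decay) if k == 0])
print(f"rho = {rho_obs:.3f}, permutation p = {p_value:.4f}, Cohen's d (d=2 vs d=0) = {d_effect:.2f}")
```

Shuffling the depth labels keeps the marginal distribution of decay rates fixed, so the test makes no distributional assumption beyond exchangeability of features under the null.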

## Appendix B Per-Feature Trajectories

Table [3](https://arxiv.org/html/2605.20602#A2.T3) reports the full trajectory for each feature across all 11 GPT-2 generations.

Table 3: Normalized feature trajectories for GPT-2 124M (even-numbered generations shown for space). Values are relative to generation 0.
