When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
Summary
This paper distinguishes three probabilistic objects often conflated in language modeling—the full conditional language process, the marginal text-only law, and the model-induced distribution—and analyzes the conditions under which next-token prediction is useful, with RAG and tools interpreted as conditional sufficiency devices.
View Cached Full Text
Cached at: 05/25/26, 09:00 AM
# When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
Source: [https://arxiv.org/html/2605.23278](https://arxiv.org/html/2605.23278)
###### Abstract
Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens\. This description is only conditionally correct\. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations\. Moreover, real language generation is conditioned not only on previous words but also on non\-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task\-specific constraints\. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text\-only process obtained by integrating those circumstances out, and the model\-induced distribution learned from finite observed corpora\.
The paper argues that interpreting model training as estimating the marginal text\-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora\[[4](https://arxiv.org/html/2605.23278#bib.bib4),[8](https://arxiv.org/html/2605.23278#bib.bib8),[22](https://arxiv.org/html/2605.23278#bib.bib22)\]\. Even if these assumptions hold, the marginal text\-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation\. In information\-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small\.
The paper then extends this argument to heterogeneous training corpora\. Language models are trained on an indistinct mixture of regimes: programming, mathematics, fiction, journalism, law, social media, dialogue, manuals, and many others\. The model can learn a correct text\-only mixture conditional only if the heterogeneous process is stable, representative, ergodic, and sufficiently identifiable from textual prefixes\. This introduces a distinction between learnability and usefulness: a model may learn the correct mixture conditional without that conditional being epistemically useful in a particular situation\.
Finally, the paper interprets Retrieval Augmented Generation \(RAG\) and tool use as conditional sufficiency devices, following the retrieval\-augmented language\-modeling and tool\-augmented language\-modeling literatures\[[20](https://arxiv.org/html/2605.23278#bib.bib20),[6](https://arxiv.org/html/2605.23278#bib.bib6),[28](https://arxiv.org/html/2605.23278#bib.bib28),[36](https://arxiv.org/html/2605.23278#bib.bib36),[18](https://arxiv.org/html/2605.23278#bib.bib18),[10](https://arxiv.org/html/2605.23278#bib.bib10)\]\. RAG and tools are not automatically proxies for missing context\. They are useful only insofar as conditioning on retrieved material or tool outputs makes the remaining latent circumstances irrelevant, or at least substantially less informative for the continuation\. Programming is analyzed as a favorable case because specifications, previous code, tests, error messages, and documentation often textualize much of the relevant latent state\.
###### Contents
1. [1Introduction](https://arxiv.org/html/2605.23278#S1)
2. [2Three Different Objects Often Confused](https://arxiv.org/html/2605.23278#S2)1. [2\.1The full conditional language process](https://arxiv.org/html/2605.23278#S2.SS1) 2. [2\.2The marginal text\-only conditional law](https://arxiv.org/html/2605.23278#S2.SS2) 3. [2\.3The model\-induced predictive distribution](https://arxiv.org/html/2605.23278#S2.SS3)
3. [3What Training Actually Observes](https://arxiv.org/html/2605.23278#S3)
4. [4The Ergodicity Requirement](https://arxiv.org/html/2605.23278#S4)
5. [5Marginalization Is Not Enough](https://arxiv.org/html/2605.23278#S5)
6. [6A Useful Information\-Theoretic Criterion](https://arxiv.org/html/2605.23278#S6)
7. [7Mixed Training Regimes and Local Islands of Sufficiency](https://arxiv.org/html/2605.23278#S7)
8. [8Mixture Identifiability and Local Conditional Laws](https://arxiv.org/html/2605.23278#S8)1. [8\.1Prompt sensitivity](https://arxiv.org/html/2605.23278#S8.SS1) 2. [8\.2Style\-content confusion](https://arxiv.org/html/2605.23278#S8.SS2) 3. [8\.3False authority](https://arxiv.org/html/2605.23278#S8.SS3) 4. [8\.4Local competence](https://arxiv.org/html/2605.23278#S8.SS4) 5. [8\.5Cross\-regime contamination](https://arxiv.org/html/2605.23278#S8.SS5)
9. [9Why Ordinary Factual Language Often Fails the Criterion](https://arxiv.org/html/2605.23278#S9)
10. [10Programming as a Favorable Case](https://arxiv.org/html/2605.23278#S10)
11. [11RAG as Textualization of Latent Circumstances](https://arxiv.org/html/2605.23278#S11)
12. [12Tool Use as External Access to Non\-Textual State](https://arxiv.org/html/2605.23278#S12)
13. [13Consequences for Temperature](https://arxiv.org/html/2605.23278#S13)
14. [14Synthetic Contamination and Recursive Training](https://arxiv.org/html/2605.23278#S14)
15. [15Educational Implications](https://arxiv.org/html/2605.23278#S15)
16. [16Prompts, Non\-Observed Conditionals, and the Limits of Context Injection](https://arxiv.org/html/2605.23278#S16)
17. [17Conclusion](https://arxiv.org/html/2605.23278#S17)
18. [References](https://arxiv.org/html/2605.23278#bib)
## 1Introduction
A common statement about language models is that they learn the probability distribution of the next word or token given previous words or tokens, a view descending from the statistical language\-modeling tradition initiated by Shannon and developed in modern NLP\[[29](https://arxiv.org/html/2605.23278#bib.bib29),[30](https://arxiv.org/html/2605.23278#bib.bib30),[22](https://arxiv.org/html/2605.23278#bib.bib22),[27](https://arxiv.org/html/2605.23278#bib.bib27)\]\. This statement captures part of the statistical structure of language modeling, but it conceals several distinct assumptions\.
First, language models trained on sequences do not directly observe conditional probability distributions\. In maximum\-likelihood language modeling, from classical neural language models to contemporary transformers, the training signal is a realized next token and the loss is cross\-entropy against that token\[[2](https://arxiv.org/html/2605.23278#bib.bib2),[23](https://arxiv.org/html/2605.23278#bib.bib23),[34](https://arxiv.org/html/2605.23278#bib.bib34),[26](https://arxiv.org/html/2605.23278#bib.bib26),[7](https://arxiv.org/html/2605.23278#bib.bib7)\]\. They observe realized trajectories\. For each context, they are rewarded for increasing the probability assigned to the observed next token\. The conditional law is therefore not an object present in the data; it is an inferred statistical object\.
Second, real language generation is not conditioned only on previous words\. This is closely related to the distinction between linguistic form and meaning or communicative grounding emphasized by Bender and Koller\[[3](https://arxiv.org/html/2605.23278#bib.bib3)\]\. Human utterances depend on latent and external circumstances: facts, events, communicative intentions, institutional settings, social relations, speaker beliefs, physical reality, and task constraints\. A text\-only language model has access to these variables only insofar as they are represented, implied, or recoverable from the observed prefix\.
Third, even if one defines a marginal text\-only conditional distribution by integrating over the missing circumstances, a finite corpus informs that marginal distribution only under strong assumptions\. The corpus must be a sufficiently representative realization of a stable process\. In stochastic\-process language, this requires assumptions analogous to stationarity and ergodicity\.
Fourth, even if the model correctly estimates the marginal text\-only distribution, this does not guarantee usefulness\. The marginal law is useful only when the text prefix is an approximately sufficient statistic for the omitted circumstances relevant to the continuation\.
Fifth, real training corpora are heterogeneous mixtures\. They contain local regimes in which textual sufficiency holds approximately, and many regimes in which it does not\. A model trained on such a mixture may therefore be locally reliable in some domains and unreliable in others, while producing equally fluent outputs everywhere\.
Sixth, heterogeneous training introduces an additional learnability problem\. For a model to learn a correct mixture conditional, the textual prefix must provide enough information to infer, at least probabilistically, which component regime generated the prefix\. If the regime is identifiable, local conditional laws may be learned\. If it is not, the model learns only a blended distribution over heterogeneous continuations\.
This paper formalizes these distinctions and derives a criterion for when next\-token prediction is useful\. It then discusses programming as a favorable case and clarifies the role of Retrieval Augmented Generation \(RAG\) and tool use as attempts to improve the sufficiency of the conditioning context\[[20](https://arxiv.org/html/2605.23278#bib.bib20),[6](https://arxiv.org/html/2605.23278#bib.bib6),[28](https://arxiv.org/html/2605.23278#bib.bib28),[36](https://arxiv.org/html/2605.23278#bib.bib36),[18](https://arxiv.org/html/2605.23278#bib.bib18),[10](https://arxiv.org/html/2605.23278#bib.bib10)\]\. The point is deliberately architecture\-neutral: the argument applies to any language model trained on observed sequences, from classical statistical language models to contemporary neural models and LLMs\.
The paper deliberately does not address a second, even more important, problem of language models\. Even if the text\-only language distribution were estimated successfully, nothing in that fact guarantees that the generated sequences correspond to true statements\. This is a distinct issue\. A model may learn a distribution of linguistic continuations and still produce false factual claims, invalid arguments, or wrong arithmetic operations\. Standard language models do not contain an intrinsic truth checker or validity checker; their objective is to assign probability to sequences, not to verify correspondence with facts, proofs, executions, or calculations\. Consequently, truthfulness depends on two properties that must not be conflated: first, whether the relevant language distribution has been learned correctly; second, whether high\-probability continuations in that distribution are true or valid in the target domain\. Hallucination and factuality studies make this distinction central, and even in domains such as arithmetic or symbolic reasoning, external execution or verification is often needed to separate plausible generated text from correct results\[[16](https://arxiv.org/html/2605.23278#bib.bib16),[15](https://arxiv.org/html/2605.23278#bib.bib15),[17](https://arxiv.org/html/2605.23278#bib.bib17),[10](https://arxiv.org/html/2605.23278#bib.bib10),[28](https://arxiv.org/html/2605.23278#bib.bib28)\]\.
## 2Three Different Objects Often Confused
LetXtX\_\{t\}denote the token at positiontt, and let
X≤t=\(X1,…,Xt\)X\_\{\\leq t\}=\(X\_\{1\},\\ldots,X\_\{t\}\)be the observed textual history\. LetZtZ\_\{t\}denote the non\-textual circumstances relevant to language production at timett\. These may include
Zt=\(world state,facts,events,speaker beliefs,goals,intentions,audience,task constraints,…\)\.Z\_\{t\}=\(\\text\{world state\},\\text\{facts\},\\text\{events\},\\text\{speaker beliefs\},\\text\{goals\},\\text\{intentions\},\\text\{audience\},\\text\{task constraints\},\\ldots\)\.It is useful to distinguish three objects\.
### 2\.1The full conditional language process
The full conditional process is
pfull\(xt\+1∣x≤t,zt\)\.p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\}\)\.This object describes the probability of a continuation given both the textual prefix and the relevant non\-textual circumstances\. It is the object closest to real language production\. A person does not speak only because previous words statistically suggest a continuation\. A person speaks in relation to intentions, facts, perceptions, social pressures, tasks, and goals\.
### 2\.2The marginal text\-only conditional law
IfZtZ\_\{t\}is unobserved, one can define the marginal text\-only conditional distribution:
pmarg\(xt\+1∣x≤t\)=∫pfull\(xt\+1∣x≤t,z\)p\(z∣x≤t\)𝑑z\.p\_\{\\mathrm\{marg\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)=\\int p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\)p\(z\\mid x\_\{\\leq t\}\)dz\.This is a theoretical object\. It is the conditional distribution of future tokens given past tokens after integrating over latent circumstances\. Importantly, this is not the full human language process\. It is the full process after relevant circumstances have been averaged out\.
### 2\.3The model\-induced predictive distribution
A language model with parametersθ\\thetadefines
pθ\(xt\+1∣x≤t\)\.p\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.At decoding temperatureTT, logitsℓi\\ell\_\{i\}are transformed into
pθ,T\(i∣x≤t\)=exp\(ℓi/T\)∑jexp\(ℓj/T\)\.p\_\{\\theta,T\}\(i\\mid x\_\{\\leq t\}\)=\\frac\{\\exp\(\\ell\_\{i\}/T\)\}\{\\sum\_\{j\}\\exp\(\\ell\_\{j\}/T\)\}\.The model samples frompθ,Tp\_\{\\theta,T\}, not frompfullp\_\{\\mathrm\{full\}\}\. Nor canpθ,Tp\_\{\\theta,T\}automatically be identified withpmargp\_\{\\mathrm\{marg\}\}\. The relation between these objects depends on assumptions\.
Full processpfull\(xt\+1∣x≤t,zt\)p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\}\)Marginal text\-only lawpmarg\(xt\+1∣x≤t\)p\_\{\\mathrm\{marg\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)Model\-induced lawpθ\(xt\+1∣x≤t\)p\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)Decoding distributionpθ,T\(xt\+1∣x≤t\)p\_\{\\theta,T\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)marginalizelatent contextergodicity \+representative corpus \+optimizationtemperature transformFigure 1:Four distinct distributions\. The model samples from the last object\. Identifying it with the preceding objects requires assumptions\.
## 3What Training Actually Observes
During standard next\-token training, the model sees realized token sequences
\(x1\(m\),…,xTm\(m\)\),m=1,…,M\.\(x^\{\(m\)\}\_\{1\},\\ldots,x^\{\(m\)\}\_\{T\_\{m\}\}\),\\qquad m=1,\\ldots,M\.For each positiontt, the training signal is the realized pair
\(x≤t\(m\),xt\+1\(m\)\)\.\(x^\{\(m\)\}\_\{\\leq t\},x^\{\(m\)\}\_\{t\+1\}\)\.The standard maximum\-likelihood objective is the cross\-entropy loss\[[4](https://arxiv.org/html/2605.23278#bib.bib4),[8](https://arxiv.org/html/2605.23278#bib.bib8)\]:
ℒ\(θ\)=−∑m=1M∑t=1Tmlogpθ\(xt\+1\(m\)∣x≤t\(m\)\)\.\\mathcal\{L\}\(\\theta\)=\-\\sum\_\{m=1\}^\{M\}\\sum\_\{t=1\}^\{T\_\{m\}\}\\log p\_\{\\theta\}\(x^\{\(m\)\}\_\{t\+1\}\\mid x^\{\(m\)\}\_\{\\leq t\}\)\.For a single training instance, the loss is minimized by assigning probability one to the observed next token:
pθ\(xt\+1\(m\)∣x≤t\(m\)\)=1\.p\_\{\\theta\}\(x^\{\(m\)\}\_\{t\+1\}\\mid x^\{\(m\)\}\_\{\\leq t\}\)=1\.Of course, with shared parameters and many related or conflicting contexts, the global optimum cannot generally assign probability one to every observed continuation\. The model must compress, interpolate, and generalize\.
Still, the immediate training signal is not a full conditional distribution\. This point is also related to the distinction between teacher\-forced training on observed prefixes and free\-running generation from model outputs, discussed in the literature on exposure bias\[[13](https://arxiv.org/html/2605.23278#bib.bib13)\]\. It is a realized continuation\. Thus, the claim that the model learns the next\-token conditional distribution must be understood as an asymptotic statistical interpretation of maximum likelihood, not as a literal description of the observed training data\.
## 4The Ergodicity Requirement
For training on realized trajectories to inform the marginal conditional distributionpmargp\_\{\\mathrm\{marg\}\}, the observed corpus must behave as a sufficiently representative sample from a stable marginalized language process\. A simplified requirement is
p^N\(xt\+1∣x≤t\)⟶pmarg\(xt\+1∣x≤t\)\\widehat\{p\}\_\{N\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\\longrightarrow p\_\{\\mathrm\{marg\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)as corpus sizeN→∞N\\to\\infty\. Such convergence requires assumptions analogous to those used when interpreting empirical frequencies or maximum\-likelihood estimates as estimates of an underlying stochastic process\[[8](https://arxiv.org/html/2605.23278#bib.bib8),[4](https://arxiv.org/html/2605.23278#bib.bib4)\]:
- •stationarity,
- •ergodicity,
- •representativeness,
- •adequate support coverage,
- •and stability of the data\-generating process\.
Natural language violates these assumptions in many ways\. Language changes across time, domain, genre, author, institution, platform, ideology, and communicative purpose\. Moreover, for long contexts, exact repetition ofx≤tx\_\{\\leq t\}is rare or absent\. The empirical conditional law for a long exact context is usually not observable\.
Therefore, what the model learns in practice is not a table of conditional probabilities\. It learns parameterized regularities that compress observed trajectories across related contexts\. This means thatpθp\_\{\\theta\}can be interpreted as an approximation topmargp\_\{\\mathrm\{marg\}\}only if the corpus is sufficiently ergodic and representative with respect to the relevant conditional structure\. Without this assumption, the model fits a historical archive of textual trajectories; it does not estimate a stable law of language\.
## 5Marginalization Is Not Enough
Suppose, for the sake of argument, that the ergodicity problem is solved and that
pθ\(xt\+1∣x≤t\)≈pmarg\(xt\+1∣x≤t\)\.p\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\\approx p\_\{\\mathrm\{marg\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.Is this enough to make next\-token prediction useful? No\. The marginal distribution may be a poor guide if omitted circumstancesZtZ\_\{t\}remain important after conditioning on the text\.
The full conditional law is
pfull\(xt\+1∣x≤t,zt\)\.p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\}\)\.The marginal law is useful only if
pfull\(xt\+1∣x≤t,zt\)≈pmarg\(xt\+1∣x≤t\)p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\}\)\\approx p\_\{\\mathrm\{marg\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)for the relevant range ofztz\_\{t\}\. Equivalently, one needs approximate conditional independence:
Xt\+1⟂Zt∣X≤t\.X\_\{t\+1\}\\perp Z\_\{t\}\\mid X\_\{\\leq t\}\.This is a sufficiency condition\. The observed textual prefixX≤tX\_\{\\leq t\}must be an approximately sufficient statistic for the latent circumstancesZtZ\_\{t\}with respect to predictingXt\+1X\_\{t\+1\}\.
If this condition fails, then even a perfect estimate ofpmargp\_\{\\mathrm\{marg\}\}may be of limited use\. The model will generate plausible continuations averaged over unobserved situations, while the correct continuation in the actual situation depends on variables not contained in the text\.
## 6A Useful Information\-Theoretic Criterion
The preceding condition can be expressed using conditional mutual information\. The textual prefix is sufficient for continuation if
I\(Xt\+1;Zt∣X≤t\)≈0\.I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\}\)\\approx 0\.When this quantity is small, the hidden circumstances add little predictive information once the text is known\. When this quantity is large, the omitted circumstances remain important\.
This gives a principled criterion: next\-token prediction is useful when the past text captures most of the task\-relevant latent state\. It is unreliable when the next correct continuation depends strongly on unobserved facts, intentions, or world states\. This criterion separates domains where language models can perform well from domains where fluent continuation is epistemically weak\.
## 7Mixed Training Regimes and Local Islands of Sufficiency
The preceding sections treated the language process as if it were homogeneous\. Real training corpora are not homogeneous\. A language model is trained on an indistinct mass of language containing many different regimes, genres, tasks, and communicative practices; this heterogeneity is already visible in large\-scale web\-trained models such as GPT\-style systems\[[26](https://arxiv.org/html/2605.23278#bib.bib26),[7](https://arxiv.org/html/2605.23278#bib.bib7)\]\.
A useful abstraction is to write the training distribution as a mixture:
𝒟=∑k=1Kπk𝒟k,\\mathcal\{D\}=\\sum\_\{k=1\}^\{K\}\\pi\_\{k\}\\mathcal\{D\}\_\{k\},where each component𝒟k\\mathcal\{D\}\_\{k\}corresponds to a different linguistic or task regime, for example
𝒟k∈\{\\displaystyle\\mathcal\{D\}\_\{k\}\\in\\\{programming,mathematics,fiction,journalism,law,\\displaystyle\\text\{programming\},\\text\{mathematics\},\\text\{fiction\},\\text\{journalism\},\\text\{law\},dialogue,social media,textbooks,…\}\.\\displaystyle\\text\{dialogue\},\\text\{social media\},\\text\{textbooks\},\\ldots\\\}\.Each regime has its own latent circumstancesZt\(k\)Z\_\{t\}^\{\(k\)\}and its own full conditional process
pk\(xt\+1∣x≤t,zt\(k\)\)\.p\_\{k\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\}^\{\(k\)\}\)\.The sufficiency condition is therefore not global\. It is local to a regime:
Ik\(Xt\+1;Zt\(k\)∣X≤t\)≈0\.I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\}\)\\approx 0\.This condition may hold approximately in local islands of language, such as programming, formal mathematics, standardized bureaucratic forms, boilerplate legal text, API documentation, textbook exercises, or highly conventional technical genres\.
It will fail in many other regions, such as open\-world factual claims, historical interpretation, medical diagnosis, political judgment, personal advice, strategic decision\-making, moral or legal evaluation, and scientific discovery outside well\-textualized evidence\.
The model is therefore trained on a mixed corpus containing both sufficient\-context and insufficient\-context regimes\. This has several implications\. First, the training objective does not explicitly tell the model whether a given prompt belongs to a regime where the textual prefix is sufficient\. The model learns continuation statistics over the mixture\. Second, success in sufficiency islands can create a misleading impression of general competence\. Third, mixed training encourages transfer of stylistic confidence across regimes\. The user experiences a homogeneous interface, while the epistemic status of the output changes radically across domains\.
## 8Mixture Identifiability and Local Conditional Laws
The preceding section shows that heterogeneous training creates a mixture of linguistic regimes\. We now consider a more basic question: how can a model trained on such heterogeneous data learn even the correct conditional distribution of future text given past text, whether or not that distribution is useful?
The answer is that this is possible only under a mixture\-identification condition\. The past sequence must contain enough information to identify, or at least probabilistically infer, which regime generated it\.
Suppose the corpus is generated by a mixture
𝒟=∑k=1Kπk𝒟k,\\mathcal\{D\}=\\sum\_\{k=1\}^\{K\}\\pi\_\{k\}\\mathcal\{D\}\_\{k\},where each regimekkhas a text\-only conditional law
pk\(xt\+1∣x≤t\)\.p\_\{k\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.The global marginal text\-only conditional is then not a single homogeneous law but a mixture conditional:
pmix\(xt\+1∣x≤t\)=∑k=1Kp\(k∣x≤t\)pk\(xt\+1∣x≤t\)\.p\_\{\\mathrm\{mix\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)=\\sum\_\{k=1\}^\{K\}p\(k\\mid x\_\{\\leq t\}\)p\_\{k\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.Thus, in principle, a model can learn the correct mixture conditional only if it learns bothp\(k∣x≤t\)p\(k\\mid x\_\{\\leq t\}\), the posterior probability that the prefix belongs to regimekk, andpk\(xt\+1∣x≤t\)p\_\{k\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\), the local continuation law inside that regime\.
This is possible when the prefix carries enough information to classify the regime\. For example, prefixes such as
or
or
strongly identify programming, mathematical proof, and SQL\-like regimes, respectively\. In such cases,
p\(k∗∣x≤t\)≈1,p\(k^\{\\ast\}\\mid x\_\{\\leq t\}\)\\approx 1,and therefore
pmix\(xt\+1∣x≤t\)≈pk∗\(xt\+1∣x≤t\)\.p\_\{\\mathrm\{mix\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\\approx p\_\{k^\{\\ast\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.This is the favorable case\. The model behaves locally as if it had selected a specialized conditional law\.
By contrast, prefixes such as “The answer is”, “The cause was”, or “The correct interpretation is” may not identify a sufficiently precise regime or latent situation\. Thenp\(k∣x≤t\)p\(k\\mid x\_\{\\leq t\}\)remains diffuse, and the resulting conditional distribution is a blended average over heterogeneous possibilities\. Such a mixture average may be statistically correct for the training archive but epistemically useless for the actual situation\.
This distinction explains several familiar behaviors of language models\.
### 8\.1Prompt sensitivity
Small changes in the prompt can move probability mass across mixture components:
p\(k∣x≤t\)→p\(k∣x≤t′\)\.p\(k\\mid x\_\{\\leq t\}\)\\to p\(k\\mid x^\{\\prime\}\_\{\\leq t\}\)\.A small stylistic or contextual modification may cause the model to shift from one local regime to another\.
### 8\.2Style\-content confusion
The model may identify the genre correctly while failing to identify the epistemic situation\. For example, it may produce the style of a legal opinion without access to the evidence required for a legal opinion\.
### 8\.3False authority
The model may enter a high\-confidence explanatory regime even when the latent factual context is missing\. The output then has the surface form of explanation without the informational basis required for explanation\.
### 8\.4Local competence
The model may appear genuinely competent in domains where prefixes strongly identify the regime and contain most of the relevant state\.
### 8\.5Cross\-regime contamination
The model may import patterns from nearby textual regimes that are statistically similar but epistemically inappropriate\. This is especially likely whenp\(k∣x≤t\)p\(k\\mid x\_\{\\leq t\}\)is diffuse\.
Thus, the learnability of a correct text\-only conditional in heterogeneous training requires at least three assumptions:
1. 1\.the mixture weights and component distributions are sufficiently stable;
2. 2\.the corpus is representative and ergodic with respect to the mixture process;
3. 3\.the prefix permits reliable regime inference\.
Under these assumptions, a model may approximate
pθ\(xt\+1∣x≤t\)≈pmix\(xt\+1∣x≤t\)\.p\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\\approx p\_\{\\mathrm\{mix\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.But this is still only the text\-only mixture conditional\. It is not the full conditional process of language production\.
Moreover, usefulness requires an additional local condition\. For the inferred regimekk, the omitted circumstances must be conditionally irrelevant:
Ik\(Xt\+1;Zt\(k\)∣X≤t\)≈0\.I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\}\)\\approx 0\.Thus, we obtain a two\-stage criterion\.
The two questions must not be confused\. A model may learn the correct text\-only mixture conditional and still generate continuations that are useless or misleading for the actual world situation\.
## 9Why Ordinary Factual Language Often Fails the Criterion
Consider prompts such as “The correct diagnosis is”, “The cause of the accident was”, or “The best policy response is”\. In each case, the correct continuation may depend on facts not present in the textual prefix\. These facts may include medical measurements, physical events, institutional constraints, legal evidence, scientific data, or strategic objectives\.
If these variables are not included inX≤tX\_\{\\leq t\}, thenI\(Xt\+1;Zt∣X≤t\)I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\}\)is large\. The model may still generate plausible text, but plausible does not mean correct\. It samples from a distribution over continuations that are statistically compatible with the prefix, not from the actual situation\-specific conditional law\. This is one structural source of hallucination\.
## 10Programming as a Favorable Case
Programming is a domain where both mixture identifiability and local sufficiency often hold approximately\. More generally, modern language models display strong few\-shot behavior in structured textual regimes, although such behavior should not be confused with guaranteed symbolic correctness\[[7](https://arxiv.org/html/2605.23278#bib.bib7),[37](https://arxiv.org/html/2605.23278#bib.bib37)\]\. First, programming prefixes frequently identify the regime with high probability\. Code syntax, file structure, comments, imports, error messages, and programming\-language\-specific conventions makep\(k∣x≤t\)p\(k\\mid x\_\{\\leq t\}\)sharply concentrated\.
Second, much of the relevant latent state can be textualized\. Suppose the context contains
X≤t=\(\\displaystyle X\_\{\\leq t\}=\(natural language specification,design notes,imports,\\displaystyle\\text\{natural language specification\},\\text\{design notes\},\\text\{imports\},previous code,tests,error messages,documentation snippets\)\.\\displaystyle\\text\{previous code\},\\text\{tests\},\\text\{error messages\},\\text\{documentation snippets\}\)\.In many programming tasks, the relevant latent state includes
Zt=\(\\displaystyle Z\_\{t\}=\(intended function,API constraints,syntax rules,\\displaystyle\\text\{intended function\},\\text\{API constraints\},\\text\{syntax rules\},type constraints,test behavior,runtime errors\)\.\\displaystyle\\text\{type constraints\},\\text\{test behavior\},\\text\{runtime errors\}\)\.Much of this latent state can be inserted into the prefix\. Thus, approximately,
Zt≈g\(X≤t\),Z\_\{t\}\\approx g\(X\_\{\\leq t\}\),and therefore
pfull\(xt\+1∣x≤t,zt\)≈pfull\(xt\+1∣x≤t\)\.p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\}\)\\approx p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.Programming is favorable for several reasons: syntax is explicit; many constraints are local and textual; prior code strongly constrains future code; specifications can often be written in text; error messages and tests externalize hidden state; correctness can often be checked by execution; and training corpora contain many repeated structural patterns\.
Thus, programming often satisfies
I\(Xt\+1;Zt∣X≤t\)≪I\(Xt\+1;Zt\)I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\}\)\\ll I\(X\_\{t\+1\};Z\_\{t\}\)more strongly than ordinary factual discourse\. This does not mean that language models understand programs in a human sense\. It means that programming is a domain where the textual prefix often identifies the regime and contains much of the relevant state needed for continuation\.
SpecificationExisting codeTests / errorsTextual prefixX≤tX\_\{\\leq t\}Relevant latent state mostly textualizedNext code token / lineFigure 2:Programming as a favorable regime: specifications, previous code, tests, and errors often textualize the latent task state\.
## 11RAG as Textualization of Latent Circumstances
Retrieval Augmented Generation can be understood in this framework as an attempt to improve sufficiency\. This interpretation is compatible with retrieval\-augmented language modeling approaches such as RAG and RETRO\[[20](https://arxiv.org/html/2605.23278#bib.bib20),[6](https://arxiv.org/html/2605.23278#bib.bib6)\]\. LetRtR\_\{t\}denote retrieved material:
Rt=\(r1,…,rk\)\.R\_\{t\}=\(r\_\{1\},\\ldots,r\_\{k\}\)\.The model now conditions on
pθ\(xt\+1∣x≤t,Rt\)\.p\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},R\_\{t\}\)\.RAG is useful only if retrieved material reduces the remaining dependence on hidden circumstances:
I\(Xt\+1;Zt∣X≤t,Rt\)<I\(Xt\+1;Zt∣X≤t\)\.I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\},R\_\{t\}\)<I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\}\)\.The strongest useful case is
I\(Xt\+1;Zt∣X≤t,Rt\)≈0\.I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\},R\_\{t\}\)\\approx 0\.Equivalently,
pfull\(xt\+1∣x≤t,zt,Rt\)≈pfull\(xt\+1∣x≤t,Rt\)\.p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\},R\_\{t\}\)\\approx p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},R\_\{t\}\)\.Thus, RAG does not add knowledge symbolically to the model\. It attempts to turn missing circumstances into additional text\.
If the retrieved material does not capture the relevant omitted state, RAG merely adds more text\. It may improve topicality, citation appearance, or stylistic confidence without solving the underlying epistemic problem\. In a heterogeneous training setting, RAG must satisfy a regime\-specific condition:
Ik\(Xt\+1;Zt\(k\)∣X≤t,Rt\)≈0\.I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\},R\_\{t\}\)\\approx 0\.This is especially important in educational RAG systems\. A course RAG is useful only when the retrieved course material is sufficient for the student’s question\. If the missing context is the student’s misconception, the intended level of the course, the exact theorem assumptions, or the teacher’s notation, and the RAG material does not capture these variables, the model may still generate a plausible but misleading answer\.
## 12Tool Use as External Access to Non\-Textual State
Tool use is stronger than RAG, but it is not automatically sufficient either\. A calculator, theorem prover, compiler, database, search engine, or code interpreter can access or generate information not contained in the language model’s predictive distribution\. Tool use, skill\-like modules, and instruction\-following interfaces should therefore be treated as system\-level orchestration rather than pure language modeling\[[25](https://arxiv.org/html/2605.23278#bib.bib25),[18](https://arxiv.org/html/2605.23278#bib.bib18),[28](https://arxiv.org/html/2605.23278#bib.bib28),[36](https://arxiv.org/html/2605.23278#bib.bib36),[10](https://arxiv.org/html/2605.23278#bib.bib10)\]\.
LetAtA\_\{t\}be the output of an external tool:
At=τ\(X≤t\)\.A\_\{t\}=\\tau\(X\_\{\\leq t\}\)\.The model conditions on
pθ\(xt\+1∣x≤t,Rt,At\)\.p\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\.Tool use is useful only if the tool output reduces the remaining dependence on hidden circumstances:
I\(Xt\+1;Zt∣X≤t,Rt,At\)<I\(Xt\+1;Zt∣X≤t,Rt\)\.I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\},R\_\{t\},A\_\{t\}\)<I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\},R\_\{t\}\)\.The strongest case is
I\(Xt\+1;Zt∣X≤t,Rt,At\)≈0\.I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\\approx 0\.Equivalently,
pfull\(xt\+1∣x≤t,zt,Rt,At\)≈pfull\(xt\+1∣x≤t,Rt,At\)\.p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\},R\_\{t\},A\_\{t\}\)\\approx p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\.Thus, RAG and tools are not automatically proxies for missing context\. They are proxies only insofar as conditioning on them makes the remaining latent circumstances irrelevant, or at least substantially less informative for the continuation\.
This explains why tool\-augmented systems can be more reliable than plain language models\. They do not make the language model internally more truthful\. They change the conditioning information available to generation and delegate parts of the task to systems with different objectives and guarantees\.
## 13Consequences for Temperature
Temperature modifies the model\-induced distribution\. Decoding choices are known to strongly affect generated text quality, diversity, and degeneration\[[14](https://arxiv.org/html/2605.23278#bib.bib14)\]:
pθ,T\(i∣x≤t\)=exp\(ℓi/T\)∑jexp\(ℓj/T\)\.p\_\{\\theta,T\}\(i\\mid x\_\{\\leq t\}\)=\\frac\{\\exp\(\\ell\_\{i\}/T\)\}\{\\sum\_\{j\}\\exp\(\\ell\_\{j\}/T\)\}\.But temperature does not solve the marginalization problem\. It does not restore omitted circumstances\. It does not makepθ,Tp\_\{\\theta,T\}a sample frompfullp\_\{\\mathrm\{full\}\}\. Nor does it ensure thatpθ,Tp\_\{\\theta,T\}approximatespmargp\_\{\\mathrm\{marg\}\}, unless the training process already produced such an approximation\.
Temperature only changes how broadly the model samples from its own learned distribution\. When the prefix is insufficient, higher temperature often increases diversity at the cost of epistemic reliability\. When the prefix is highly sufficient, moderate stochasticity may be useful for exploring multiple valid continuations\. This explains why sampling can be useful for creative writing or alternative code completions but dangerous for factual claims\.
## 14Synthetic Contamination and Recursive Training
The preceding analysis also clarifies a further problem that becomes central once language\-model outputs are injected into future training data\. The problem is not merely that synthetic data may be low quality\. It is not just a ”regression to the mean” kind of problem we would have each time we use model fitted data added in a dataset\. The deeper issue is that model\-generated sequences are not samples from the full human language process, and need not even be samples from the marginalized text\-only law discussed above\. They are samples from a model\-induced distribution, possibly after decoding transformations such as temperature, nucleus sampling, filtering, alignment, or tool\-mediated post\-processing\.
LetPPdenote the target marginalized human text process and letQθ,TQ\_\{\\theta,T\}denote the distribution of sequences produced by a model with parametersθ\\thetaand decoding ruleTT\. If generated data are mixed into the next training corpus, the effective training distribution becomes schematically
Pα=\(1−α\)P\+αQθ,T,0≤α≤1\.P\_\{\\alpha\}=\(1\-\\alpha\)P\+\\alpha Q\_\{\\theta,T\},\\qquad 0\\leq\\alpha\\leq 1\.The next model trained on this corpus is then pushed towardPαP\_\{\\alpha\}, not towardPP\. IfQθ,T=PQ\_\{\\theta,T\}=P, no distortion is introduced\. But this equality is precisely what the previous sections show cannot be assumed\. It would require the original model to have learned the correct marginalized distribution under strong ergodicity, representativeness, mixture\-identifiability, and local sufficiency conditions\. If these requirements fail, generated data introduce a biased component into the empirical training process\.
The point can be made recursively\. Suppose models are trained through generations according to
Pn\+1=\(1−α\)P\+αQθn,Tn,P\_\{n\+1\}=\(1\-\\alpha\)P\+\\alpha Q\_\{\\theta\_\{n\},T\_\{n\}\},whereQθn,TnQ\_\{\\theta\_\{n\},T\_\{n\}\}is the distribution generated by the previous model\. UnlessQθn,TnQ\_\{\\theta\_\{n\},T\_\{n\}\}is already a faithful sample from the target process, approximation errors, missing tails, regime confusions, and hallucinated structures can be reintroduced as training evidence\. The training procedure then begins to fit not only human language trajectories but also distortions created by previous models\.
This connects the present argument with the literature on model collapse and self\-consuming generative models\. Shumailov et al\. describe model collapse as arising when models recursively train on generated data and write that “model collapse is universal among generative models that recursively train on data generated by previous generations”\[[32](https://arxiv.org/html/2605.23278#bib.bib32)\]\. Their earlier formulation emphasizes that models can “forget the true underlying data distribution” and that “tails of the original content distribution disappear”\[[31](https://arxiv.org/html/2605.23278#bib.bib31)\]\. Alemohammad et al\. analyze related autophagous loops and conclude that, without enough fresh real data, future models see quality or diversity progressively decrease\[[1](https://arxiv.org/html/2605.23278#bib.bib1)\]\.
The framework developed in this paper gives a complementary interpretation of these results\. If a model\-generated sequence comes from a regime in which the prefix was not sufficient for the omitted context, then the sequence may encode a plausible continuation rather than a sample from the true marginalized conditional process\. If the model failed to identify the correct local regime, the generated sample may mix incompatible regimes\. If RAG or tool outputs were used but did not satisfy the conditional sufficiency requirement, then the generated sequence may contain the surface form of grounded language without the relevant grounding state\. If no truth\-checking mechanism is applied, false statements may be injected as ordinary language evidence\.
Consequently, AI\-generated data are not neutral additions to the corpus\. They are observations from a derived distribution whose relation to the target language process depends on precisely the assumptions analyzed above\. If the generating models do not satisfy those assumptions, adding their outputs to future training data makes the possibility of estimating the human marginalized conditional distribution even weaker\. In mixture notation, synthetic contamination adds components
𝒟AI,j\\mathcal\{D\}\_\{\\mathrm\{AI\},j\}whose conditional laws are not human regime conditionalspkp\_\{k\}, but model\-induced approximationsqjq\_\{j\}\. The effective mixture becomes
𝒟′=\(1−α\)∑kπk𝒟k\+α∑jρj𝒟AI,j\.\\mathcal\{D\}^\{\\prime\}=\(1\-\\alpha\)\\sum\_\{k\}\\pi\_\{k\}\\mathcal\{D\}\_\{k\}\+\\alpha\\sum\_\{j\}\\rho\_\{j\}\\mathcal\{D\}\_\{\\mathrm\{AI\},j\}\.A model trained on𝒟′\\mathcal\{D\}^\{\\prime\}learns the conditional law of this contaminated mixture, not the original human\-language mixture\. If the synthetic components underrepresent rare regimes, omit distributional tails, overproduce high\-probability stylistic patterns, or contain hallucinated facts, these distortions become part of the next training target\.
This also interacts with the sufficiency criterion\. For a generated sequence to be a safe training sample, it is not enough that it be fluent\. One would need, at minimum, that the model which generated it operated in a regime satisfying
Ik\(Xt\+1;Zt\(k\)∣X≤t,Rt,At\)≈0,I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\\approx 0,and that the resulting generated statement be valid or truth\-checked when the domain requires truth rather than mere continuation\. Otherwise, synthetic data may amplify exactly the regimes in which next\-token prediction is least epistemically reliable\.
Thus, the present paper implies a stricter view of recursive training than the usual warning that models should not be trained on “bad” synthetic data\. The issue is structural\. Model\-generated text is trustworthy as training evidence only when the generating process itself satisfied the learnability, identifiability, sufficiency, and verification requirements relevant to the domain\. Without these conditions, synthetic contamination makes the original ergodic\-estimation problem harder rather than easier\.
## 15Educational Implications
The educational relevance of the previous analysis is not that language models are, by themselves, teachers\. It is rather that recent work has suggested LLM\-based and retrieval\-augmented systems as scalable components of personal or adaptive tutors, continuing a long tradition of research on one\-to\-one tutoring and intelligent tutoring systems\[[5](https://arxiv.org/html/2605.23278#bib.bib5),[11](https://arxiv.org/html/2605.23278#bib.bib11),[33](https://arxiv.org/html/2605.23278#bib.bib33),[21](https://arxiv.org/html/2605.23278#bib.bib21)\]\. More recent educational discussions of LLMs emphasize their possible use for feedback, explanation, exercise generation, and tutoring, while also stressing risks of hallucination, bias, privacy loss, and over\-reliance\[[19](https://arxiv.org/html/2605.23278#bib.bib19)\]\. In this context, RAG has been proposed as a way to build course\-specific tutors by grounding responses in curated instructional material rather than relying only on the parametric distribution of a general language model\[[20](https://arxiv.org/html/2605.23278#bib.bib20),[6](https://arxiv.org/html/2605.23278#bib.bib6),[9](https://arxiv.org/html/2605.23278#bib.bib9)\]\.
The analysis developed above can therefore be read as a set of conditions under which such educational uses may actually be useful\. A RAG\-based tutor is not reliable merely because it retrieves course notes, nor because it answers in a pedagogically fluent style\. It is useful when the educational task belongs to a regime that the prompt and retrieved material identify with sufficient precision, and when the augmented context is close to sufficient for the latent pedagogical state relevant to the student’s question\.
In the notation of the paper, the educational usefulness condition is local to the task regime:
Ik\(Xt\+1;Zt\(k\)∣X≤t,Rt,At\)≈0\.I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\\approx 0\.HereZt\(k\)Z\_\{t\}^\{\(k\)\}includes not only the objective subject matter, but also educationally relevant latent variables such as the student’s misconception, the intended level of the course, the notation adopted by the teacher, the exact assumptions of the theorem or exercise, and the learning goal of the interaction\. The retrieved materialRtR\_\{t\}and any tool outputAtA\_\{t\}are useful only insofar as they reduce the residual dependence on these variables\.
Thus, educational language\-model systems are most justified when:
- •the relevant task state can be textualized;
- •the course materials retrieved by the RAG system are actually sufficient for the student’s question;
- •the local regime is well identified, as in a specific course, exercise type, programming assignment, or mathematical procedure;
- •correctness can be externally checked, for example through tests, calculations, symbolic systems, or teacher review;
- •the student is asked to inspect, criticize, and compare the output rather than merely accept it\.
They are least reliable when the correct answer depends on hidden facts, the prompt under\-specifies the situation, the retrieval system fails to capture the relevant course context, the student’s misconception is not represented in the prompt, or no external verification is available\.
Programming tutors, mathematical exercise generators with symbolic checking, and RAG systems based on carefully curated course materials therefore have a stronger theoretical justification than open\-ended factual tutors without retrieval or verification\. This does not mean that their outputs are guaranteed to be true or pedagogically appropriate\. The present section only states conditions under which the language\-model distribution may become educationally useful\. It does not solve the separate truth\-checking and validity\-checking problem discussed in the introduction: even when the conditioning context is sufficient for plausible continuation, correctness still requires either domain conditions under which likely continuations are usually true or external mechanisms of verification\.
## 16Prompts, Non\-Observed Conditionals, and the Limits of Context Injection
The preceding sections considered the conditions under which a language model trained on observed token trajectories may approximate a marginal text\-only conditional distribution\. We now consider a further limitation: even when the prompt contains full information about the relevant context, the model cannot reliably follow a conditional distribution that does not correspond to trajectories observed, directly or indirectly, in training\.
Let the full language process be
pfull\(xt\+1∣x≤t,zt\),p\_\{\\mathrm\{full\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},z\_\{t\}\),
whereztz\_\{t\}denotes the relevant non\-textual circumstances\. Suppose that, at inference time, the prompt is enriched with a textual representationctc\_\{t\}of those circumstances\. The model is then queried on an extended prefix
x~≤t=\(x≤t,ct\)\.\\tilde\{x\}\_\{\\leq t\}=\(x\_\{\\leq t\},c\_\{t\}\)\.
A common intuition is that, ifctc\_\{t\}contains all relevant information, then the model should be able to condition correctly on it\. This intuition is too strong\. The model can exploitctc\_\{t\}only if training has exposed it to sufficiently many trajectories in which similar contextual material played the relevant conditioning role\. In other words, the model must have learned a conditional law of the form
p\(xt\+1∣x≤t,ct\)\.p\(x\_\{t\+1\}\\mid x\_\{\\leq t\},c\_\{t\}\)\.
If the relevant conditional pattern was not observed in the training corpus, then the prompt does not create it\. The prompt changes the input trajectory, but it does not supply missing conditional data to the training process\.
This can be interpreted as a support or ergodicity failure\. For the model to learn the conditional distribution associated with the extended context, the training process must contain enough representative occurrences of the relevant kind of conditioning event\. Formally, one would need convergence of the form
p^N\(xt\+1∣x≤t,ct\)⟶pmarg\(xt\+1∣x≤t,ct\)\.\\widehat\{p\}\_\{N\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},c\_\{t\}\)\\longrightarrow p\_\{\\mathrm\{marg\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},c\_\{t\}\)\.
If the event class\(x≤t,ct\)\(x\_\{\\leq t\},c\_\{t\}\)is absent, extremely rare, or not part of an ergodic component of the training distribution, this convergence cannot occur\. The model then has no learned conditional distribution corresponding to the requested situation\. It may still generate a fluent continuation by analogy with nearby trajectories, but such a continuation is not an estimate of the desired conditional law\.
This makes precise the sense in which a prompt is not “new information for the model\.” It is new text placed in the conditioning sequence\. It may redirect generation toward regions of the learned distribution, but it cannot substitute for missing training experience\. A prompt can select among learned conditional behaviors; it cannot by itself install a conditional behavior that was not learned\.
The same point applies to RAG and tool outputs\. Retrieved materialRtR\_\{t\}and tool outputAtA\_\{t\}enlarge the conditioning sequence:
pθ\(xt\+1∣x≤t,Rt,At\)\.p\_\{\\theta\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\.
However, the model can exploit this enlarged context only if its training has included sufficiently similar situations in which such added material correctly constrained continuation\. Otherwise, RAG or tool output may be present in the prompt but not function as an effective conditioning variable\. It may improve topical relevance or surface coherence without inducing the desired conditional behavior\.
Thus, two conditions must be distinguished\. The first is informational sufficiency:
I\(Xt\+1;Zt∣X≤t,Rt,At\)≈0\.I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\\approx 0\.
This says that the added material contains enough information to make the hidden context irrelevant\. The second is learned conditional availability: the model must have been trained on enough trajectories for which similar augmented prefixes were associated with the appropriate continuations\. Without the second condition, even a fully informative prompt may fail\.
This is especially clear in formal domains\. Suppose a prompt fully specifies an arithmetic operation or a new formal rule\. If the model has not learned the relevant conditional structure from training, it is not guaranteed to follow the rule\. The correctness of the output depends not on the presence of the rule in the prompt alone, but on whether the model has learned how such rules constrain continuations\. This is why the success of language modeling and the truth or validity of generated statements remain distinct properties\. A model may fit language trajectories well and still lack a mechanism for checking whether a generated continuation is true, valid, or rule\-compliant\.
This observation also clarifies the role of in\-context learning\. In\-context examples can guide a model when they activate conditional patterns already learned during training\[[7](https://arxiv.org/html/2605.23278#bib.bib7),[24](https://arxiv.org/html/2605.23278#bib.bib24),[35](https://arxiv.org/html/2605.23278#bib.bib35)\]\. They do not guarantee arbitrary rule acquisition from the prompt\. The prompt may induce local adaptation in generation, but this adaptation remains constrained by the model\-induced distribution\. In this sense, prompting is better understood as conditional selection within learned trajectories than as the direct insertion of new knowledge or new rules\.
The general implication is the following\. Even if the prompt contains full information about the relevant context, the model can only use that information reliably when the corresponding conditional relation belongs to the learned support of the training distribution\. Otherwise the problem is, in the terminology of the present paper, another failure of ergodicity or support coverage\. The relevant conditional distribution was not observed, and therefore cannot be recovered merely by writing the conditioning variables into the prompt\.
## 17Conclusion
The usefulness of next\-token prediction depends on two separate conditions, plus a mixture\-identification condition in heterogeneous corpora\.
First, the model\-induced distribution can be interpreted as an approximation to a marginal text\-only conditional law only under strong assumptions of representativeness, stationarity, and ergodicity of the observed corpus\.
Second, in a heterogeneous corpus, the prefix must allow the model to infer the relevant local regime:
p\(k∗∣X≤t\)≈1p\(k^\{\\ast\}\\mid X\_\{\\leq t\}\)\\approx 1or at least to estimate the appropriate mixture conditional:
pmix\(xt\+1∣x≤t\)=∑kp\(k∣x≤t\)pk\(xt\+1∣x≤t\)\.p\_\{\\mathrm\{mix\}\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)=\\sum\_\{k\}p\(k\\mid x\_\{\\leq t\}\)p\_\{k\}\(x\_\{t\+1\}\\mid x\_\{\\leq t\}\)\.Third, even if this approximation succeeds, the resulting text\-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation\.
The central criterion is
I\(Xt\+1;Zt∣X≤t\)≈0\.I\(X\_\{t\+1\};Z\_\{t\}\\mid X\_\{\\leq t\}\)\\approx 0\.In heterogeneous corpora, this criterion becomes regime\-specific:
Ik\(Xt\+1;Zt\(k\)∣X≤t\)≈0\.I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\}\)\\approx 0\.Some parts of language form local islands where the condition approximately holds\. Programming is a favorable example because specifications, code, tests, error messages, and documentation often textualize much of the relevant latent state\.
Many other parts of language do not satisfy the condition\. In such regions, the model produces plausible continuation rather than situation\-specific truth\.
RAG and tool use can be understood as attempts to improve sufficiency by adding retrieved evidence or externally computed information\. But they are useful as proxies for missing context only when
Ik\(Xt\+1;Zt\(k\)∣X≤t,Rt,At\)≈0\.I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\\approx 0\.
Finally, even a prompt containing the full relevant context cannot make the model follow a conditional law that was not represented in the training trajectories\. Prompting, RAG, and tool outputs can only condition generation through structures already made available by the learned distribution; otherwise the failure is again one of support, representativeness, and ergodicity\.
Thus, the theory does not imply that language models are useless\. It specifies the conditions under which they are useful\. Their success depends not only on model size, architecture, decoding strategy, or corpus scale, but on whether the relevant world has been adequately turned into text and whether the local regime has been adequately identified\. Even then, distributional success is not the same as truth\. The additional question \- whether generated statements are factual, valid, executable, or mathematically correct \- requires either domain conditions under which likely continuations are usually true or external checking mechanisms such as retrieval, execution, formal verification, or other tools\[[16](https://arxiv.org/html/2605.23278#bib.bib16),[15](https://arxiv.org/html/2605.23278#bib.bib15),[10](https://arxiv.org/html/2605.23278#bib.bib10),[28](https://arxiv.org/html/2605.23278#bib.bib28)\]\.
## References
- \[1\]Alemohammad, S\., Casco\-Rodriguez, J\., Luzi, L\., Humayun, A\. I\., Babaei, H\., LeJeune, D\., Siahkoohi, A\., & Baraniuk, R\. G\. \(2024\)\. Self\-consuming generative models go MAD\.*International Conference on Learning Representations*\.
- \[2\]Bengio, Y\., Ducharme, R\., Vincent, P\., & Jauvin, C\. \(2003\)\. A neural probabilistic language model\.*Journal of Machine Learning Research*, 3, 1137–1155\.
- \[3\]Bender, E\. M\., & Koller, A\. \(2020\)\. Climbing towards NLU: On meaning, form, and understanding in the age of data\.*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, 5185–5198\.
- \[4\]Bishop, C\. M\. \(2006\)\.*Pattern Recognition and Machine Learning*\. Springer\.
- \[5\]Bloom, B\. S\. \(1984\)\. The 2 sigma problem: The search for methods of group instruction as effective as one\-to\-one tutoring\.*Educational Researcher*, 13\(6\), 4–16\.
- \[6\]Borgeaud, S\., Mensch, A\., Hoffmann, J\., Cai, T\., Rutherford, E\., Millican, K\., van den Driessche, G\. B\., Lespiau, J\.\-B\., Damoc, B\., Clark, A\., et al\. \(2022\)\. Improving language models by retrieving from trillions of tokens\.*Proceedings of the 39th International Conference on Machine Learning*\.
- \[7\]Brown, T\. B\., Mann, B\., Ryder, N\., Subbiah, M\., Kaplan, J\., Dhariwal, P\., Neelakantan, A\., Shyam, P\., Sastry, G\., Askell, A\., et al\. \(2020\)\. Language models are few\-shot learners\.*Advances in Neural Information Processing Systems*, 33, 1877–1901\.
- \[8\]Cover, T\. M\., & Thomas, J\. A\. \(2006\)\.*Elements of Information Theory*\(2nd ed\.\)\. Wiley\.
- \[9\]Dong, C\., Yuan, Y\., Chen, K\., Cheng, S\., & Wen, C\. \(2023\)\. How to build an adaptive AI tutor for any course using knowledge graph\-enhanced retrieval\-augmented generation \(KG\-RAG\)\.*arXiv:2311\.17696*\.
- \[10\]Gao, L\., Madaan, A\., Zhou, S\., Alon, U\., Liu, P\., Yang, Y\., Callan, J\., & Neubig, G\. \(2023\)\. Program\-aided language models\. In*Proceedings of the 40th International Conference on Machine Learning*\.
- \[11\]Graesser, A\. C\., Chipman, P\., Haynes, B\. C\., & Olney, A\. \(2005\)\. AutoTutor: An intelligent tutoring system with mixed\-initiative dialogue\.*IEEE Transactions on Education*, 48\(4\), 612–618\.
- \[12\]Graves, A\. \(2012\)\. Sequence transduction with recurrent neural networks\.*ICML Workshop on Representation Learning*\.
- \[13\]He, T\., Zhang, J\., Zhou, Z\., & Glass, J\. \(2021\)\. Quantifying exposure bias for neural language generation\.*Transactions of the Association for Computational Linguistics*, 9, 971–986\.
- \[14\]Holtzman, A\., Buys, J\., Du, L\., Forbes, M\., & Choi, Y\. \(2020\)\. The curious case of neural text degeneration\.*International Conference on Learning Representations*\.
- \[15\]Huang, L\., Yu, W\., Ma, W\., Zhong, W\., Feng, Z\., Wang, H\., Chen, Q\., Peng, W\., Feng, X\., Qin, B\., & Liu, T\. \(2023\)\. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions\.*arXiv:2311\.05232*\.
- \[16\]Ji, Z\., Lee, N\., Frieske, R\., Yu, T\., Su, D\., Xu, Y\., Ishii, E\., Bang, Y\. J\., Madotto, A\., & Fung, P\. \(2023\)\. Survey of hallucination in natural language generation\.*ACM Computing Surveys*, 55\(12\), Article 248\.
- \[17\]Kadavath, S\., Conerly, T\., Askell, A\., Henighan, T\., Drain, D\., Perez, E\., Schiefer, N\., Hatfield\-Dodds, Z\., DasSarma, N\., Tran\-Johnson, E\., et al\. \(2022\)\. Language models \(mostly\) know what they know\.*arXiv:2207\.05221*\.
- \[18\]Karpas, E\., Abend, O\., Belinkov, Y\., Lenz, B\., Lieber, O\., Ratner, N\., Shoham, Y\., Bata, H\., Levine, Y\., Leyton\-Brown, K\., Muhlgay, D\., Rozen, N\., Schwartz, E\., Shachaf, G\., Shalev\-Shwartz, S\., Shashua, A\., & Tenenholtz, M\. \(2022\)\. MRKL systems: A modular, neuro\-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning\.*arXiv:2205\.00445*\.
- \[19\]Kasneci, E\., Sessler, K\., Küchemann, S\., Bannert, M\., Dementieva, D\., Fischer, F\., Gasser, U\., Groh, G\., Günnemann, S\., H”ullermeier, E\., et al\. \(2023\)\. ChatGPT for good? On opportunities and challenges of large language models for education\.*Learning and Individual Differences*, 103, 102274\.
- \[20\]Lewis, P\., Perez, E\., Piktus, A\., Petroni, F\., Karpukhin, V\., Goyal, N\., Kuttler, H\., Lewis, M\., Yih, W\.\-t\., Rocktäschel, T\., et al\. \(2020\)\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.*Advances in Neural Information Processing Systems*, 33, 9459–9474\.
- \[21\]Ma, W\., Adesope, O\. O\., Nesbit, J\. C\., & Liu, Q\. \(2014\)\. Intelligent tutoring systems and learning outcomes: A meta\-analysis\.*Journal of Educational Psychology*, 106\(4\), 901–918\.
- \[22\]Manning, C\. D\., & Schütze, H\. \(1999\)\.*Foundations of Statistical Natural Language Processing*\. MIT Press\.
- \[23\]Mikolov, T\., Karafiát, M\., Burget, L\., Černocký, J\., & Khudanpur, S\. \(2010\)\. Recurrent neural network based language model\.*INTERSPEECH*\.
- \[24\]Min, S\., Lyu, X\., Holtzman, A\., Artetxe, M\., Lewis, M\., Hajishirzi, H\., and Zettlemoyer, L\. \(2022\)\. Rethinking the role of demonstrations: What makes in\-context learning work?Proceedings of EMNLP\.
- \[25\]Ouyang, L\., Wu, J\., Jiang, X\., Almeida, D\., Wainwright, C\., Mishkin, P\., Zhang, C\., Agarwal, S\., Slama, K\., Ray, A\., et al\. \(2022\)\. Training language models to follow instructions with human feedback\.*Advances in Neural Information Processing Systems*, 35, 27730–27744\.
- \[26\]Radford, A\., Wu, J\., Child, R\., Luan, D\., Amodei, D\., & Sutskever, I\. \(2019\)\. Language models are unsupervised multitask learners\. OpenAI technical report\.
- \[27\]Rosenfeld, R\. \(2000\)\. Two decades of statistical language modeling: Where do we go from here?*Proceedings of the IEEE*, 88\(8\), 1270–1278\.
- \[28\]Schick, T\., Dwivedi\-Yu, J\., Dessi, R\., Raileanu, R\., Lomeli, M\., Zettlemoyer, L\., Cancedda, N\., & Scialom, T\. \(2023\)\. Toolformer: Language models can teach themselves to use tools\. In*Advances in Neural Information Processing Systems*\.
- \[29\]Shannon, C\. E\. \(1948\)\. A mathematical theory of communication\.*Bell System Technical Journal*, 27\(3\), 379–423\.
- \[30\]Shannon, C\. E\. \(1951\)\. Prediction and entropy of printed English\.*Bell System Technical Journal*, 30\(1\), 50–64\.
- \[31\]Shumailov, I\., Shumaylov, Z\., Zhao, Y\., Gal, Y\., Papernot, N\., & Anderson, R\. \(2023\)\. The curse of recursion: Training on generated data makes models forget\.*arXiv:2305\.17493*\.
- \[32\]Shumailov, I\., Shumaylov, Z\., Zhao, Y\., Papernot, N\., Anderson, R\., & Gal, Y\. \(2024\)\. AI models collapse when trained on recursively generated data\.*Nature*, 631, 755–759\.
- \[33\]VanLehn, K\. \(2011\)\. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems\.*Educational Psychologist*, 46\(4\), 197–221\.
- \[34\]Vaswani, A\., Shazeer, N\., Parmar, N\., Uszkoreit, J\., Jones, L\., Gomez, A\. N\., Kaiser, Ł\., & Polosukhin, I\. \(2017\)\. Attention is all you need\.*Advances in Neural Information Processing Systems*, 30\.
- \[35\]Xie, S\. M\., Raghunathan, A\., Liang, P\., and Ma, T\. \(2022\)\. An explanation of in\-context learning as implicit Bayesian inference\.International Conference on Learning Representations\.
- \[36\]Yao, S\., Zhao, J\., Yu, D\., Du, N\., Shafran, I\., Narasimhan, K\., & Cao, Y\. \(2023\)\. ReAct: Synergizing reasoning and acting in language models\. In*International Conference on Learning Representations*\.
- \[37\]Wei, J\., Wang, X\., Schuurmans, D\., Bosma, M\., Xia, F\., Chi, E\., Le, Q\. V\., & Zhou, D\. \(2022\)\. Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in Neural Information Processing Systems*, 35, 24824–24837\.Similar Articles
Probabilistic Attribution For Large Language Models
This paper proposes a model-agnostic probabilistic token attribution measure for LLMs using Bayes' rule to invert next-token log probabilities, capturing the model's internal representation of token sequences and improving interpretability through entropy analysis.
The Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models
This paper formalizes the sufficiency gap in next-token prediction, demonstrating that even ideal sequence models can become overconfident when textual prefixes are not sufficient statistics for latent circumstances. It proposes an external observer mechanism to reduce but not eliminate this gap.
Rant: Stop saying LLMs are just “next token predictors.”
A critique of the oversimplified claim that LLMs are 'just next token predictors,' arguing that prediction at scale induces useful representations and capabilities, and that such dismissals confuse objective with learned system.
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context
Amazon and Stanford researchers propose Quantile Token Regression, inserting dedicated quantile tokens into LLM inputs to predict full probability distributions, achieving ~4 point MAPE reduction and 2× narrower intervals on Airbnb and Stack Overflow benchmarks.
TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG
TPA proposes a novel method for detecting hallucinations in RAG systems by attributing next-token probabilities to seven distinct sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregating by Part-of-Speech tags. The approach achieves state-of-the-art performance across five LLMs including Llama2, Llama3, Mistral, and Qwen.