The Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models
Summary
This paper formalizes the sufficiency gap in next-token prediction, demonstrating that even ideal sequence models can become overconfident when textual prefixes are not sufficient statistics for latent circumstances. It proposes an external observer mechanism to reduce but not eliminate this gap.
# The Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models
Source: [https://arxiv.org/html/2605.26711](https://arxiv.org/html/2605.26711)
###### Abstract
Corielli ([2026](https://arxiv.org/html/2605.26711#bib.bib10)) argues that next-token prediction is epistemically useful only when the observed textual prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. This paper develops a model-agnostic toy formalization of that claim. We construct a binary mixed-regime process with one deterministic textual regime and one random regime governed by an unobserved latent state. Even an ideal infinite-capacity sequence predictor that exactly recovers the text-only marginal law can become overconfident when the observed prefix is compatible with the wrong latent regime. The resulting entropy difference is not an ordinary optimization error; it is a sufficiency gap caused by marginalization over an unobserved state. We then formalize retrieval, tool use, and external grounding through an auxiliary binary signal with fidelity $\gamma\in[1/2,1]$. The resulting Bayesian update yields a contextual dominance threshold: a corrective signal reverses the posterior odds induced by the textual history exactly when its fidelity exceeds the text-only posterior weight assigned to the misleading regime. This threshold reduces, but does not generally eliminate, the sufficiency gap; complete closure requires perfect revelation of the relevant latent state or an equivalent verification mechanism. The analysis clarifies why temperature scaling cannot restore missing context, why grounding mechanisms must be both informative and learnably usable by the model, and why autonomous sequence models require structurally decoupled observers or verifiers in high-stakes domains.
## 1 Introduction
A common description of language modeling is that models estimate or approximate a conditional distribution over the next token given the observed textual prefix, a view rooted in statistical language modeling and information-theoretic accounts of language prediction (Shannon, [1948](https://arxiv.org/html/2605.26711#bib.bib24), [1951](https://arxiv.org/html/2605.26711#bib.bib25); Manning and Schuetze, [1999](https://arxiv.org/html/2605.26711#bib.bib18); Rosenfeld, [2000](https://arxiv.org/html/2605.26711#bib.bib22)). This description is statistically meaningful only under additional assumptions. In neural language modeling, including early neural probabilistic models and later transformer-based systems, a training corpus consists of realized token trajectories, not directly observed conditional laws (Bengio et al., [2003](https://arxiv.org/html/2605.26711#bib.bib3); Vaswani et al., [2017](https://arxiv.org/html/2605.26711#bib.bib28); Radford et al., [2019](https://arxiv.org/html/2605.26711#bib.bib21); Brown et al., [2020](https://arxiv.org/html/2605.26711#bib.bib8)). Moreover, human language production is conditioned not only on previous words, but also on facts, intentions, goals, institutions, beliefs, physical circumstances, and task-specific constraints; this point is closely related to the distinction between form and grounded meaning emphasized by Bender and Koller ([2020](https://arxiv.org/html/2605.26711#bib.bib2)). Corielli ([2026](https://arxiv.org/html/2605.26711#bib.bib10)) distinguishes three objects that are often conflated: the full conditional language process, the marginal text-only law obtained by integrating out latent circumstances, and the model-induced distribution learned from finite corpora.
This paper translates that distinction into a compact analytical example. We construct a binary data-generating process composed of two regimes. In one regime, the observable text sequence is deterministic and strictly alternating. In the other, it is an independent fair coin. The latent regime is not directly observed. An ideal text-only predictor therefore infers a posterior probability over regimes from the observed prefix and predicts from the resulting mixture conditional. This model can be statistically optimal relative to the text-only marginal distribution and still be epistemically misleading in the actual latent state.
The central object of analysis is a *sufficiency gap*: the gap between the entropy of the true continuation conditional on the relevant latent state and the entropy of the text-only marginal prediction. In the broader language of Corielli ([2026](https://arxiv.org/html/2605.26711#bib.bib10)), such gaps become operationally dangerous in heterogeneous, historically sampled, and only partially ergodic archives. Nevertheless, the mathematical point in the present paper is sharper: the failure can arise even after perfect recovery of the marginal text-only law. The problem is not merely insufficient model capacity or poor optimization. It is the loss of latent-state information under marginalization.
The paper then introduces an auxiliary grounding signal. This signal represents retrieval-augmented generation, tool use, programmatic checks, human review, or any external observer that supplies evidence about the omitted latent state. Its fidelity is parameterized by $\gamma\in[1/2,1]$. The resulting posterior update gives a closed-form contextual dominance threshold. If the grounding signal is too noisy relative to the text-only posterior induced by the misleading prefix, the archive-induced structural bias remains dominant. If the signal is sufficiently precise, it reverses posterior odds and reduces the sufficiency gap. Full gap closure, however, requires perfect or effectively perfect access to the latent state.
The contribution is deliberately architecture-neutral. The analysis does not depend on transformers, parameter counts, reinforcement learning, or any particular decoding system. It applies to any predictor trained on realized sequences and evaluated through a text-only or augmented conditional distribution.
## 2 Background: Three Distinct Distributions
Let $X_t$ denote the token at position $t$, and let $X_{\leq t}=(X_1,\ldots,X_t)$ be the observed textual history. Let $Z_t$ denote the non-textual circumstances relevant to producing or evaluating the next token. These may include the world state, facts, speaker beliefs, goals, intentions, task constraints, institutional context, and other latent variables.
Following Corielli ([2026](https://arxiv.org/html/2605.26711#bib.bib10)), it is useful to distinguish three distributions.
1. The *full conditional process* $p_{\mathrm{full}}(x_{t+1}\mid x_{\leq t},z_t)$, which conditions on both text and latent circumstances.
2. The *marginal text-only law* $p_{\mathrm{marg}}(x_{t+1}\mid x_{\leq t})=\int p_{\mathrm{full}}(x_{t+1}\mid x_{\leq t},z)\,p(z\mid x_{\leq t})\,dz$, which averages over latent circumstances conditional on text.
3. The *model-induced predictive distribution* $p_{\theta}(x_{t+1}\mid x_{\leq t})$, which is the distribution actually represented by a trained model.
The present paper assumes an idealized limit in which the model-induced distribution exactly recovers the relevant text-only marginal law, abstracting from finite-sample estimation and optimization issues familiar from statistical learning theory (Bishop, [2006](https://arxiv.org/html/2605.26711#bib.bib5)). This removes finite-data, architecture, and optimization failures from the analysis and isolates the conditional-independence issue rather than ordinary estimation error. If a gap remains under this idealization, it is structural rather than merely empirical.
The usefulness condition is local sufficiency. The textual prefix is sufficient for continuation when
$$X_{t+1}\perp Z_t\mid X_{\leq t}, \tag{1}$$
or approximately when
$$I(X_{t+1};Z_t\mid X_{\leq t})\approx 0. \tag{2}$$
If this condition fails, a correct text-only marginal conditional may remain a poor guide to the actual situation-specific conditional law.
## 3 A Binary Mixed-Regime Process
We now define a minimal process that makes the sufficiency failure explicit. Let the latent regime process $\{Z_t\}_{t\geq 1}$ take values in $\{0,1\}$. We index the latent variable as $Z_{t+1}$ when it governs the emission of $X_{t+1}$. This is notationally equivalent to using $Z_t$ as the continuation-relevant latent state after a one-step reindexing.
Let $\mathcal{V}=\{0,1\}$ be the vocabulary and let $\mathcal{H}_t=\sigma(X_1,\ldots,X_t)$ denote the textual filtration. Assume the latent process is a homogeneous Markov chain with symmetric retention probability $\rho\in(1/2,1)$, using standard stochastic-process notation (Doob, [1953](https://arxiv.org/html/2605.26711#bib.bib11)):
$$\operatorname{P}(Z_{t+1}=i\mid Z_t=i)=\rho,\qquad i\in\{0,1\}. \tag{3}$$
The exact value of $\rho$ is not central to the entropy calculations below; it provides a simple source of regime persistence. The broader interpretation of empirical corpus frequencies as stable process information relies on ergodic-type assumptions of the kind classically formalized by Birkhoff ([1931](https://arxiv.org/html/2605.26711#bib.bib4)).
The full conditional law is defined as follows.
Regime 0: deterministic textual regime. The next token alternates with certainty:
$$\operatorname{P}(X_{t+1}=x\mid\mathcal{H}_t,Z_{t+1}=0)=\mathbf{1}_{\{x=1-X_t\}}. \tag{4}$$
Regime 1: random latent regime. The next token is an independent fair coin:
$$\operatorname{P}(X_{t+1}=x\mid\mathcal{H}_t,Z_{t+1}=1)=\frac{1}{2},\qquad x\in\{0,1\}. \tag{5}$$
The ideal text-only predictor recovers the marginal conditional law by integrating over the posterior distribution of the latent regime:
$$p_{\mathrm{marg}}(X_{t+1}=x\mid\mathcal{H}_t)=\sum_{k\in\{0,1\}}\operatorname{P}(Z_{t+1}=k\mid\mathcal{H}_t)\operatorname{P}(X_{t+1}=x\mid\mathcal{H}_t,Z_{t+1}=k). \tag{6}$$
Define
$$\pi_{t,0}:=\operatorname{P}(Z_{t+1}=0\mid\mathcal{H}_t). \tag{7}$$
For the alternating continuation $x=1-X_t$, the marginal law becomes
$$p_{\mathrm{marg}}(X_{t+1}=1-X_t\mid\mathcal{H}_t)=\pi_{t,0}+\frac{1}{2}(1-\pi_{t,0})=\frac{1}{2}(1+\pi_{t,0}). \tag{8}$$
###### Assumption 1 (Full support of the misleading regime).
The prior over regimes has full support and the observed alternating history has positive likelihood under Regime 0. Hence $\pi_{t,0}>0$ whenever the observed history is compatible with Regime 0.
This assumption avoids treating the deterministic regime as impossible after observing an alternating prefix. It is the minimal support condition needed for the text-only posterior to place positive weight on the misleading structural explanation.
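As a rough illustration of how an ideal text-only predictor would obtain $\pi_{t,0}$ in this toy process, the following Python sketch runs a standard forward filter over the latent Markov chain. Two details are assumptions of the sketch rather than part of the text: the initial regime prior `prior0`, and the convention that the first token carries no information about the regime. The function names and the default $\rho=0.9$ are likewise illustrative.

```python
def regime0_predictive_posterior(prefix, rho=0.9, prior0=0.5):
    """Return pi_{t,0} = P(Z_{t+1} = 0 | x_1, ..., x_t) for a binary prefix.

    Assumptions of this sketch (not stated in the text): prior0 is
    P(Z_1 = 0), and the first token is uninformative about the regime.
    """
    if not prefix:
        raise ValueError("prefix must contain at least one token")
    if not 0.5 < rho < 1.0:
        raise ValueError("rho must lie in (1/2, 1)")
    belief0 = prior0  # filtered P(Z_s = 0 | x_1..x_s), starting at s = 1
    for prev, cur in zip(prefix, prefix[1:]):
        # Predict the regime that emits `cur` from the filtered belief (Eq. 3).
        pred0 = rho * belief0 + (1.0 - rho) * (1.0 - belief0)
        # Emission likelihoods: Regime 0 alternates (Eq. 4),
        # Regime 1 is a fair coin (Eq. 5).
        like0 = 1.0 if cur == 1 - prev else 0.0
        like1 = 0.5
        weight0 = pred0 * like0
        weight1 = (1.0 - pred0) * like1
        belief0 = weight0 / (weight0 + weight1)
    # One more prediction step gives the posterior over the next emitting regime.
    return rho * belief0 + (1.0 - rho) * (1.0 - belief0)


def marginal_alternation_probability(prefix, rho=0.9, prior0=0.5):
    """Text-only probability of the alternating token (Eq. 8)."""
    return 0.5 * (1.0 + regime0_predictive_posterior(prefix, rho, prior0))
```

Under these assumptions, an alternating prefix pushes the returned value above 1/2 and it grows with prefix length, while a single repeated token drives the filtered weight on Regime 0 to zero, so the predictive value falls to `1 - rho`.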
## 4 Entropy of the Text-Only Mixture
Let $H_2:[0,1]\to[0,1]$ denote binary entropy in bits (Cover and Thomas, [2006](https://arxiv.org/html/2605.26711#bib.bib9)):
$$H_2(p)=-p\log_2 p-(1-p)\log_2(1-p). \tag{9}$$
###### Lemma 1 (Entropy of the text-only mixture).
Let $\alpha=(1+\pi)/2$ for $\pi\in[0,1]$. Then $H_2(\alpha)=1$ if and only if $\pi=0$. Moreover, $H_2(\alpha)$ is strictly decreasing in $\pi$ on $(0,1)$.
###### Proof.
By the chain rule,
$$\frac{d}{d\pi}H_2\!\left(\frac{1+\pi}{2}\right)=\frac{1}{2}\log_2\left(\frac{1-\alpha}{\alpha}\right)=\frac{1}{2}\log_2\left(\frac{1-\pi}{1+\pi}\right). \tag{10}$$
For $\pi\in(0,1)$, the ratio $(1-\pi)/(1+\pi)$ lies in $(0,1)$, so the derivative is strictly negative. At $\pi=0$, $\alpha=1/2$ and $H_2(1/2)=1$. For $\pi>0$, strict monotonicity gives $H_2(\alpha)<1$. ∎
## 5 Theorem 1: The Sufficiency Gap and False Authority
###### Theorem 1 (Sufficiency gap under latent-regime mismatch).
Suppose the true continuation regime is Regime 1, so that $Z_{t+1}=1$, but the realized textual history $\mathcal{H}_t$ is compatible with a deterministic alternating sequence and satisfies Assumption [1](https://arxiv.org/html/2605.26711#Thmassumption1). Then the predictive entropy of the ideal text-only marginal model is strictly lower than the true conditional entropy:
$$\Delta H:=H(p_{\mathrm{full}}\mid\mathcal{H}_t,Z_{t+1}=1)-H(p_{\mathrm{marg}}\mid\mathcal{H}_t)>0. \tag{11}$$
###### Proof.
Under Regime 1, the true conditional distribution is uniform over the binary vocabulary. Hence
$$H(p_{\mathrm{full}}\mid\mathcal{H}_t,Z_{t+1}=1)=1\quad\text{bit}. \tag{12}$$
By Eq. ([8](https://arxiv.org/html/2605.26711#S3.E8)), the text-only marginal predictor assigns probability
$$\alpha=\frac{1}{2}(1+\pi_{t,0}) \tag{13}$$
to the alternating continuation and probability $1-\alpha$ to the non-alternating continuation. By Assumption [1](https://arxiv.org/html/2605.26711#Thmassumption1), $\pi_{t,0}>0$. Lemma [1](https://arxiv.org/html/2605.26711#Thmlemma1) therefore implies
$$H(p_{\mathrm{marg}}\mid\mathcal{H}_t)=H_2(\alpha)<1. \tag{14}$$
Consequently,
$$\Delta H=1-H_2(\alpha)>0. \tag{15}$$
∎
The theorem proves that false authority can arise even for a statistically ideal text-only predictor. The model is not making an optimization mistake; it is applying the correct marginal law given the information available in text. The error is epistemic: the current latent state is random, but the observed prefix is compatible with a deterministic structural regime. The model cannot distinguish a genuine invariant from a misleading local accident without additional information.
For this reason, the term *sufficiency gap* is mathematically more precise than *ergodicity gap* in the theorem itself. In the broader interpretation of Corielli ([2026](https://arxiv.org/html/2605.26711#bib.bib10)), heterogeneous non-ergodic archives amplify this problem because they contain many local regimes whose textual signatures can be confused, transferred, or overgeneralized.
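The size of the gap in Theorem 1 depends only on the text-only posterior weight placed on Regime 0. The short Python sketch below evaluates $\Delta H = 1 - H_2((1+\pi_{t,0})/2)$ for a few illustrative values; the function names and the chosen grid are assumptions made for the example.

```python
import math


def binary_entropy(p):
    """Binary entropy H_2(p) in bits, with the convention 0 * log2(0) = 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def sufficiency_gap(pi0):
    """Delta H of Eq. (15) when the true regime is random (Regime 1)."""
    alpha = 0.5 * (1.0 + pi0)  # text-only probability of the alternating token
    return 1.0 - binary_entropy(alpha)


if __name__ == "__main__":
    for pi0 in (0.0, 0.25, 0.5, 0.75, 0.95):
        print(f"pi0={pi0:.2f}  gap={sufficiency_gap(pi0):.4f} bits")
```

The gap is zero only at $\pi_{t,0}=0$ and grows toward one full bit as the posterior weight on the deterministic regime approaches one, which is the monotonicity stated in Lemma 1.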
## 6 Temperature Scaling Cannot Restore Latent Sufficiency
A common response to overconfident generations is to increase sampling temperature, one of several decoding choices known to affect diversity and degeneration in neural text generation (Holtzman et al., [2020](https://arxiv.org/html/2605.26711#bib.bib13)). Temperature, however, reshapes the model-induced distribution; it does not restore omitted latent variables. In the binary case, if the unscaled text-only marginal probability of the alternating token is $\alpha>1/2$, temperature $T>0$ produces
$$p_T(X_{t+1}=1-X_t\mid\mathcal{H}_t)=\frac{\alpha^{1/T}}{\alpha^{1/T}+(1-\alpha)^{1/T}}=\frac{1}{1+\left(\frac{1-\alpha}{\alpha}\right)^{1/T}}. \tag{16}$$
###### Theorem 2 (Temperature increases structural error in the deterministic regime).
Assume the true continuation regime is Regime 0 and $\alpha\in(1/2,1)$. For every $T>1$, temperature scaling strictly increases the probability of sampling the structurally invalid token relative to $T=1$. Moreover, this change does not reduce the residual dependence of $X_{t+1}$ on the latent state.
###### Proof.
In Regime 0, the valid continuation is $1-X_t$ with probability one. The probability of sampling the invalid token at temperature $T$ is
$$\epsilon(T)=1-p_T(X_{t+1}=1-X_t\mid\mathcal{H}_t)=\frac{r^{1/T}}{1+r^{1/T}},\qquad r:=\frac{1-\alpha}{\alpha}\in(0,1). \tag{17}$$
For $r\in(0,1)$, the function $T\mapsto r^{1/T}=\exp(\ln r/T)$ is strictly increasing for $T>0$, because $\ln r<0$. Since $u\mapsto u/(1+u)$ is strictly increasing for $u>0$, $\epsilon(T)$ is strictly increasing in $T$. Thus $T>1$ strictly increases the probability of structural error relative to $T=1$.
Temperature acts only on the predictive probabilities in Eq. ([16](https://arxiv.org/html/2605.26711#S6.E16)). It does not condition on $Z_{t+1}$ and therefore cannot make $X_{t+1}$ independent of $Z_{t+1}$ conditional on the text. It reshapes the marginal distribution but does not restore latent sufficiency. ∎
Lowering temperature has the opposite local effect: it sharpens the marginal distribution and can reduce errors in deterministic regimes. But this comes at the cost of intensifying false authority when the true regime is random. Temperature therefore moves the system along a confidence-diversity frontier. It does not convert a text-only marginal law into the full situation-specific conditional process.
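The effect of temperature on the binary predictor in Eq. (16) can be checked numerically. The sketch below is a minimal Python illustration; the function names and the sample value $\alpha=0.8$ used in the note that follows are assumptions chosen for the example.

```python
def tempered_alternation_probability(alpha, temperature):
    """Probability of the alternating token after temperature scaling (Eq. 16)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    ratio = (1.0 - alpha) / alpha
    return 1.0 / (1.0 + ratio ** (1.0 / temperature))


def structural_error_probability(alpha, temperature):
    """epsilon(T) of Eq. (17): chance of the invalid token when Regime 0 is true."""
    return 1.0 - tempered_alternation_probability(alpha, temperature)


if __name__ == "__main__":
    for temperature in (0.25, 0.5, 1.0, 2.0, 4.0):
        eps = structural_error_probability(0.8, temperature)
        print(f"T={temperature:<4}  epsilon={eps:.4f}")
```

With $\alpha=0.8$, raising $T$ pushes the error probability toward 1/2 and lowering it pushes the error toward 0. At no temperature does the output depend on the latent regime, because `alpha` itself is a function of the text alone.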
## 7 Contextual Augmentation by a $\gamma$-Oracle
We now introduce an auxiliary signal $R_t\in\{0,1\}$ representing retrieved evidence, a tool output, a human annotation, a programmatic check, or another external observer; this abstraction covers retrieval-augmented and tool-augmented language-modeling systems (Lewis et al., [2020](https://arxiv.org/html/2605.26711#bib.bib17); Borgeaud et al., [2022](https://arxiv.org/html/2605.26711#bib.bib7); Karpas et al., [2022](https://arxiv.org/html/2605.26711#bib.bib16); Schick et al., [2023](https://arxiv.org/html/2605.26711#bib.bib23); Gao et al., [2023](https://arxiv.org/html/2605.26711#bib.bib12); Yao et al., [2023](https://arxiv.org/html/2605.26711#bib.bib32)). The signal has fidelity $\gamma\in[1/2,1]$:
$$\operatorname{P}(R_t=k\mid Z_{t+1}=k)=\gamma,\qquad k\in\{0,1\}. \tag{18}$$
When $\gamma=1/2$, the signal is uninformative noise. When $\gamma=1$, it perfectly reveals the latent regime.
###### Assumption 2 (Learned conditional availability).
The augmented predictor has learned the conditional semantics of the auxiliary signal $R_t$. That is, it can treat $R_t$ as evidence about $Z_{t+1}$ according to the emission model above.
This assumption is essential. An external token may be informationally sufficient from the standpoint of an external statistician but useless for a model that has not learned how such tokens constrain continuation. In the terminology of Corielli ([2026](https://arxiv.org/html/2605.26711#bib.bib10)), informational sufficiency and learned conditional availability are distinct requirements. RAG passages, tool outputs, and prompt-injected facts are useful only when they both contain relevant information and belong to a learned conditioning pattern.
###### Theorem 3 (Bayesian updating under imperfect grounding).
Let the true state be Regime 1 and suppose the observed textual history induces the text-only posterior $\pi_{t,0}=\operatorname{P}(Z_{t+1}=0\mid\mathcal{H}_t)$. If the auxiliary signal emits the corrective value $R_t=1$, then the augmented posterior probability assigned to the deterministic regime is
$$q_{t,\gamma}:=\operatorname{P}(Z_{t+1}=0\mid\mathcal{H}_t,R_t=1)=\frac{(1-\gamma)\pi_{t,0}}{(1-\gamma)\pi_{t,0}+\gamma(1-\pi_{t,0})}. \tag{19}$$
Consequently, the augmented probability assigned to the alternating token is
$$\alpha(\gamma)=\frac{1}{2}\left(1+q_{t,\gamma}\right)=\frac{1}{2}\left(1+\frac{(1-\gamma)\pi_{t,0}}{(1-\gamma)\pi_{t,0}+\gamma(1-\pi_{t,0})}\right). \tag{20}$$
###### Proof.
By the law of total probability,
$$\operatorname{P}(X_{t+1}=1-X_t\mid\mathcal{H}_t,R_t=1)=q_{t,\gamma}\cdot 1+(1-q_{t,\gamma})\cdot\frac{1}{2}=\frac{1}{2}(1+q_{t,\gamma}). \tag{21, 22}$$
It remains to compute $q_{t,\gamma}$. By Bayes’ theorem, using the usual posterior-updating logic of graphical and Bayesian models (Wainwright and Jordan, [2008](https://arxiv.org/html/2605.26711#bib.bib29)),
$$q_{t,\gamma}=\frac{\operatorname{P}(R_t=1\mid\mathcal{H}_t,Z_{t+1}=0)\operatorname{P}(Z_{t+1}=0\mid\mathcal{H}_t)}{\sum_{k\in\{0,1\}}\operatorname{P}(R_t=1\mid\mathcal{H}_t,Z_{t+1}=k)\operatorname{P}(Z_{t+1}=k\mid\mathcal{H}_t)}. \tag{23}$$
Conditional on $Z_{t+1}$, the signal emission is independent of $\mathcal{H}_t$. Thus
$$\operatorname{P}(R_t=1\mid Z_{t+1}=0)=1-\gamma,\qquad\operatorname{P}(R_t=1\mid Z_{t+1}=1)=\gamma. \tag{24}$$
Substitution gives Eq. ([19](https://arxiv.org/html/2605.26711#S7.E19)), and Eq. ([20](https://arxiv.org/html/2605.26711#S7.E20)) follows immediately. ∎
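Equations (19) and (20) translate directly into code. The following Python sketch is illustrative only: the function names are assumptions, and the predictor is taken to satisfy Assumption 2, so that the signal is used exactly as the emission model prescribes.

```python
def augmented_regime0_posterior(pi0, gamma):
    """q_{t,gamma} of Eq. (19): P(Z_{t+1} = 0 | text, R_t = 1)."""
    if not 0.0 <= pi0 <= 1.0:
        raise ValueError("pi0 must lie in [0, 1]")
    if not 0.5 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [1/2, 1]")
    weight0 = (1.0 - gamma) * pi0   # signal reports 1 although the regime is 0
    weight1 = gamma * (1.0 - pi0)   # signal reports 1 and the regime is 1
    total = weight0 + weight1
    if total == 0.0:
        raise ValueError("R_t = 1 has zero probability under these inputs")
    return weight0 / total


def augmented_alternation_probability(pi0, gamma):
    """alpha(gamma) of Eq. (20)."""
    return 0.5 * (1.0 + augmented_regime0_posterior(pi0, gamma))
```

At `gamma = 0.5` the function returns the text-only posterior unchanged, and at `gamma = 1.0` it returns zero whenever `pi0 < 1`, matching the two limiting cases described for Eq. (18).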
## 8 The Contextual Dominance Threshold
Assume now that the alternating textual history induces a dominant misleading prior:
$$\pi_{t,0}>\frac{1}{2}. \tag{25}$$
The external signal reverses posterior odds when the posterior assigned to Regime 0 falls below $1/2$:
$$q_{t,\gamma}<\frac{1}{2}. \tag{26}$$
Solving $q_{t,\gamma}=1/2$ using Eq. ([19](https://arxiv.org/html/2605.26711#S7.E19)) gives
$$\gamma_{\mathrm{crit}}=\pi_{t,0}. \tag{27}$$
###### Corollary 1 (Contextual dominance threshold).
Under Eq. ([25](https://arxiv.org/html/2605.26711#S8.E25)), the corrective signal $R_t=1$ reverses the posterior odds in favor of the random regime if and only if
$$\gamma>\pi_{t,0}. \tag{28}$$
If $\gamma<\pi_{t,0}$, the deterministic textual prior remains posteriorly dominant.
###### Proof.
From Eq. ([19](https://arxiv.org/html/2605.26711#S7.E19)), the inequality $q_{t,\gamma}<1/2$ is equivalent to
$$(1-\gamma)\pi_{t,0}<\gamma(1-\pi_{t,0}), \tag{29}$$
which simplifies to $\gamma>\pi_{t,0}$. Equality holds at $\gamma=\pi_{t,0}$. ∎
The threshold in Corollary [1](https://arxiv.org/html/2605.26711#Thmcorollary1) is not a thermodynamic phase transition. The posterior $q_{t,\gamma}$, the predictive probability $\alpha(\gamma)$, and the entropy $H_2(\alpha(\gamma))$ vary continuously in $\gamma$ on $(1/2,1]$ except at degenerate boundary cases of zero denominator. The threshold is instead a posterior-odds crossing. It marks the point at which the external signal becomes stronger than the archive-induced structural prior; in repeated-information settings, posterior convergence and merging of beliefs require stronger conditions on the information stream (Blackwell and Dubins, [1962](https://arxiv.org/html/2605.26711#bib.bib6)).
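A small worked example, with values chosen purely for illustration, makes the crossing concrete. Take $\pi_{t,0}=0.8$ and substitute three fidelities into Eq. (19):

```latex
\begin{aligned}
\gamma = 0.7:&\quad q_{t,\gamma} = \frac{0.3 \cdot 0.8}{0.3 \cdot 0.8 + 0.7 \cdot 0.2} = \frac{0.24}{0.38} \approx 0.632 > \tfrac{1}{2},\\
\gamma = 0.8:&\quad q_{t,\gamma} = \frac{0.2 \cdot 0.8}{0.2 \cdot 0.8 + 0.8 \cdot 0.2} = \frac{0.16}{0.32} = \tfrac{1}{2},\\
\gamma = 0.9:&\quad q_{t,\gamma} = \frac{0.1 \cdot 0.8}{0.1 \cdot 0.8 + 0.9 \cdot 0.2} = \frac{0.08}{0.26} \approx 0.308 < \tfrac{1}{2}.
\end{aligned}
```

The posterior passes smoothly through 1/2 as the fidelity passes 0.8, in line with the threshold being an odds crossing rather than a discontinuity.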
###### Corollary 2 (Gap reduction versus gap closure).
For $\gamma>\pi_{t,0}$, the corrective signal reduces the sufficiency gap by shifting posterior mass toward the true random regime. However, the gap is fully closed only when $\gamma=1$ or, more generally, when $q_{t,\gamma}=0$.
###### Proof.
When $\gamma>\pi_{t,0}$, Corollary [1](https://arxiv.org/html/2605.26711#Thmcorollary1) gives $q_{t,\gamma}<1/2$, so
$$\alpha(\gamma)=\frac{1}{2}(1+q_{t,\gamma})<\frac{3}{4}. \tag{30}$$
This is closer to the true random-regime probability $1/2$ than any prediction made while the deterministic regime remains posterior-dominant, for which $q_{t,\gamma}\geq 1/2$ and hence $\alpha(\gamma)\geq 3/4$. However, if $q_{t,\gamma}>0$, then $\alpha(\gamma)>1/2$ and $H_2(\alpha(\gamma))<1$, so the entropy gap remains positive. Eq. ([19](https://arxiv.org/html/2605.26711#S7.E19)) yields $q_{t,\gamma}=0$ when $\gamma=1$ and $\pi_{t,0}<1$. In that case $\alpha(1)=1/2$ and the entropy equals one bit, matching the true random-regime conditional entropy. ∎
This result gives a precise interpretation of retrieval and tool failures and is also consistent with Bayesian interpretations of in-context learning (Xie et al., [2022](https://arxiv.org/html/2605.26711#bib.bib31)). Adding material to a context window is not enough. The material must be sufficiently reliable relative to the model’s text-induced prior, and the model must have learned to treat that material as evidence. Otherwise the learned archive can absorb, discount, or reinterpret the corrective signal.
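Corollary 2 can also be illustrated numerically. The Python sketch below, whose function name and sample values are assumptions made for the example, computes the residual gap $1-H_2(\alpha(\gamma))$ left after a corrective signal $R_t=1$ when the true regime is random.

```python
import math


def residual_gap(pi0, gamma):
    """Entropy gap 1 - H_2(alpha(gamma)) remaining after observing R_t = 1."""
    if not 0.0 < pi0 < 1.0:
        raise ValueError("pi0 must lie in (0, 1)")
    if not 0.5 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [1/2, 1]")
    weight0 = (1.0 - gamma) * pi0
    weight1 = gamma * (1.0 - pi0)
    q = weight0 / (weight0 + weight1)   # Eq. (19)
    alpha = 0.5 * (1.0 + q)             # Eq. (20); lies in [1/2, 1) here
    entropy = -alpha * math.log2(alpha) - (1.0 - alpha) * math.log2(1.0 - alpha)
    return 1.0 - entropy


if __name__ == "__main__":
    for gamma in (0.5, 0.7, 0.8, 0.9, 0.99, 1.0):
        print(f"gamma={gamma:.2f}  residual gap={residual_gap(0.8, gamma):.4f} bits")
```

For a fixed text-only posterior, the residual gap shrinks as fidelity rises but stays strictly positive for every $\gamma<1$, reaching zero only at perfect revelation.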
## 9 External Observers, Verification, and Self-Blindness
The preceding results clarify why an autoregressive sequence model is structurally limited as a self-evaluator. Its internal confidence is computed from the distribution available to it. Under the false-authority event of Theorem [1](https://arxiv.org/html/2605.26711#Thmtheorem1), the model’s internal entropy is
$$H_{\mathrm{internal}}=H(p_{\mathrm{marg}}\mid\mathcal{H}_t)=H_2\!\left(\frac{1}{2}(1+\pi_{t,0})\right). \tag{31}$$
This value is a function of the text-only posterior. It is not a function of the realized latent state except through whatever information about that state is already encoded in the text.
Thus, the model cannot compute the sufficiency gap from its own text-only distribution alone. The quantity
$$\Delta H=H(p_{\mathrm{full}}\mid\mathcal{H}_t,Z_{t+1})-H(p_{\mathrm{marg}}\mid\mathcal{H}_t) \tag{32}$$
requires access to the latent-state-conditioned law. That information has been integrated out. Softmax confidence, predictive entropy, or logit margins can therefore be calibrated with respect to historical textual patterns while remaining miscalibrated with respect to world-state correspondence, a distinction that connects to the hallucination and uncertainty-estimation literatures (Ji et al., [2023](https://arxiv.org/html/2605.26711#bib.bib14); Kadavath et al., [2022](https://arxiv.org/html/2605.26711#bib.bib15)).
This is the formal role of an external observer. The observer may be a human expert, a retrieval system, a database query, a compiler, a theorem prover, a calculator, a laboratory measurement, or another model with access to different evidence; instruction-following and orchestration mechanisms can help route such evidence into the interaction, but they do not by themselves remove the need for verification (Ouyang et al., [2022](https://arxiv.org/html/2605.26711#bib.bib20)). The crucial property is not that the observer is non-machine, but that it is structurally decoupled from the same text-only marginalization failure.
If an auxiliary architecture can reliably access or verify the relevant latent state, then the sequence model should not be treated as an autonomous epistemic authority. The robust component should constrain, verify, reject, or govern generation. Whether it should replace the sequence model entirely depends on the task. In many applications, language modeling remains useful as an interface, summarizer, or proposal generator, while the observer supplies grounding and verification. In high-stakes settings, however, the decision rule should be organized around the verifier or external state-access mechanism rather than around the unaudited next-token distribution.
## 10 Relation to Programming, RAG, and Tool Use
The binary toy model is intentionally minimal, but it captures the two\-stage criterion developed inCorielli \([2026](https://arxiv.org/html/2605.26711#bib.bib10)\)\. First, the model must identify the relevant local regime from the prefix\. Second, within that regime, the prefix and any augmentation must make the latent circumstances approximately irrelevant:
Ik\(Xt\+1;Zt\(k\)∣X≤t,Rt,At\)≈0,I\_\{k\}\(X\_\{t\+1\};Z\_\{t\}^\{\(k\)\}\\mid X\_\{\\leq t\},R\_\{t\},A\_\{t\}\)\\approx 0,\(33\)whereRtR\_\{t\}denotes retrieved material andAtA\_\{t\}denotes tool output\.
Programming is a favorable domain because syntax, previous code, tests, error messages, documentation, and execution traces often textualize much of the relevant latent state\. This does not make generated code automatically correct, but it makes the sufficiency condition more plausible than in open\-world factual discourse\. Moreover, compilers, unit tests, type checkers, and runtime execution can act as external observers whose objectives are not merely next\-token likelihood\.
RAG is useful under the same condition, as retrieval\-augmented systems are best understood here as attempts to textualize missing state rather than as automatic truth guarantees\(Lewis et al\.,[2020](https://arxiv.org/html/2605.26711#bib.bib17); Borgeaud et al\.,[2022](https://arxiv.org/html/2605.26711#bib.bib7)\)\. Retrieved material helps when it reduces residual dependence on omitted circumstances\. It fails when it merely adds topical text, citation\-like surface form, or misleadingly similar context without capturing the latent facts needed for the continuation\. Tool use is stronger than retrieval when the tool has direct access to non\-textual state or executes a formal procedure, as in program\-aided or action\-augmented language\-modeling systems, but it is still useful only if the tool is appropriate for the latent variable at issue and the model has learned how to condition on the result\(Gao et al\.,[2023](https://arxiv.org/html/2605.26711#bib.bib12); Schick et al\.,[2023](https://arxiv.org/html/2605.26711#bib.bib23); Yao et al\.,[2023](https://arxiv.org/html/2605.26711#bib.bib32)\)\.
This distinction is especially important for prompt injection of facts or rules\. A prompt can select among learned conditional behaviors; it does not by itself install a new conditional law, which is consistent with evidence that in\-context examples often work by activating or selecting learned patterns rather than by unconstrained rule acquisition\(Brown et al\.,[2020](https://arxiv.org/html/2605.26711#bib.bib8); Min et al\.,[2022](https://arxiv.org/html/2605.26711#bib.bib19); Wei et al\.,[2022](https://arxiv.org/html/2605.26711#bib.bib30); Xie et al\.,[2022](https://arxiv.org/html/2605.26711#bib.bib31)\)\. Even when an added context token is informationally sufficient in principle, the model must have learned how that kind of token constrains continuation\. This is exactly the role of Assumption[2](https://arxiv.org/html/2605.26711#Thmassumption2)in the toy model\.
The same sufficiency distinction also matters for recursive training and synthetic contamination\. If generated text is later treated as ordinary training evidence, the next model is no longer trained only on samples from the human text process but also on samples from a model\-induced distribution\. When the generating model operated under an insufficient prefix, the synthetic sample may preserve a plausible marginal continuation rather than the full situation\-specific conditional\. This mechanism complements recent analyses of self\-consuming generative loops and model collapse, where recursively generated data can reduce diversity, erase distributional tails, or move the training target away from the original data\-generating process \(Shumailov et al\., [2023](https://arxiv.org/html/2605.26711#bib.bib26), [2024](https://arxiv.org/html/2605.26711#bib.bib27); Alemohammad et al\., [2024](https://arxiv.org/html/2605.26711#bib.bib1)\)\.
## 11 Decision\-Theoretic Implications
In practice, inferential errors should be evaluated through a loss function\. A wrong continuation in a poem, a wrong code suggestion, a wrong medical recommendation, and a wrong legal interpretation do not have the same cost\. The problem formalized here is that a text\-only sequence model cannot generally estimate the relevant situation\-specific probability of error from its internal distribution alone\.
Let $L(a,z)$ denote the loss of taking action or emitting answer $a$ in latent state $z$\. A rational decision rule requires an estimate of
$$\operatorname{E}[L(a,Z_{t+1})\mid\mathcal{H}_{t},\text{available evidence}]. \tag{34}$$
If the available evidence is only $\mathcal{H}_{t}$, and if $\mathcal{H}_{t}$ is not sufficient for $Z_{t+1}$, then the model’s token probabilities can be poor proxies for expected loss\. External observers are therefore not optional embellishments in high\-stakes environments\. They are the mechanisms through which the missing state variables, verification procedures, and domain\-specific loss assessments enter the decision system\.
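The decision rule in Eq\. \(34\) can be illustrated with a minimal sketch, assuming a discrete latent state and a finite action set\. The state names, action names, loss values, and both posteriors below are hypothetical and chosen only for illustration; they do not come from the toy model\. The sketch shows how the loss\-minimizing action can change once external evidence revises the posterior over the latent state\.

```python
from typing import Dict, Iterable, Tuple

Loss = Dict[Tuple[str, str], float]


def expected_loss(loss: Loss, action: str, posterior: Dict[str, float]) -> float:
    """Return E[L(action, Z) | evidence] for a discrete latent state Z."""
    return sum(prob * loss[(action, state)] for state, prob in posterior.items())


def best_action(loss: Loss, actions: Iterable[str], posterior: Dict[str, float]) -> str:
    """Pick the action with the smallest expected loss under the given posterior."""
    return min(actions, key=lambda action: expected_loss(loss, action, posterior))


# Hypothetical losses: answering costs nothing in the benign state and is
# expensive in the critical state; deferring has a small fixed cost.
LOSS: Loss = {
    ("answer", "benign"): 0.0,
    ("answer", "critical"): 10.0,
    ("defer", "benign"): 1.0,
    ("defer", "critical"): 1.0,
}
ACTIONS = ("answer", "defer")

# Hypothetical posteriors over the latent state.
TEXT_ONLY = {"benign": 0.95, "critical": 0.05}      # conditioned on the prefix alone
WITH_OBSERVER = {"benign": 0.60, "critical": 0.40}  # prefix plus external evidence

if __name__ == "__main__":
    for name, posterior in (("text-only", TEXT_ONLY), ("with observer", WITH_OBSERVER)):
        print(name, best_action(LOSS, ACTIONS, posterior))
```

Under these assumed numbers, the text\-only posterior favors answering, while the posterior revised by external evidence favors deferring\. The token probabilities alone would not have signaled that change\.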
## 12 Conclusion
This paper provides a formal companion to Corielli \([2026](https://arxiv.org/html/2605.26711#bib.bib10)\) by showing that false authority can emerge even under an ideal text\-only predictor\. The binary mixed\-regime process separates statistical optimality relative to a marginal archive from epistemic usefulness in a realized latent state\. The model’s overconfidence is not caused by insufficient capacity or bad optimization\. It follows from conditioning on an insufficient statistic\.
Temperature scaling cannot solve this problem because it modifies the decoding distribution without restoring the omitted latent circumstances\. External grounding can help, but only under two conditions\. First, the grounding signal must be informative enough to overcome the misleading posterior induced by the textual history\. In the binary model, this posterior\-odds threshold is $\gamma > \pi_{t,0}$\. Second, the model must possess learned conditional availability: it must know how to use the grounding signal as evidence\. Otherwise, the signal may be present in the prompt but fail to function as a conditioning variable\.
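The threshold and the temperature claim can be checked numerically with a minimal sketch\. It assumes a symmetric binary signal that points to the correct regime with probability `gamma` and to the misleading regime with probability `1 - gamma`, and it reads the threshold quantity as the text\-only posterior weight on the misleading regime\. The function names and numerical values are hypothetical\.

```python
def posterior_misleading(pi_misleading: float, gamma: float) -> float:
    """Posterior weight on the misleading regime after a corrective signal.

    Assumes 0 < pi_misleading < 1 and a symmetric binary signal of fidelity
    gamma: it points to the correct regime with probability gamma when that
    regime holds, and with probability 1 - gamma when the misleading one holds.
    """
    numerator = pi_misleading * (1.0 - gamma)
    denominator = numerator + (1.0 - pi_misleading) * gamma
    return numerator / denominator


def temperature_scale(p: float, temperature: float) -> float:
    """Temperature-scale a binary probability p (logits divided by temperature)."""
    a = p ** (1.0 / temperature)
    b = (1.0 - p) ** (1.0 / temperature)
    return a / (a + b)


if __name__ == "__main__":
    pi = 0.8  # hypothetical text-only weight on the misleading regime

    # Bayes' rule: the posterior odds flip only when gamma exceeds pi.
    for gamma in (0.7, 0.8, 0.9):
        print(f"gamma={gamma}: posterior={posterior_misleading(pi, gamma):.3f}")

    # Temperature changes confidence but never which regime is favored.
    for temperature in (0.5, 1.0, 2.0, 10.0):
        print(f"T={temperature}: scaled={temperature_scale(pi, temperature):.3f}")
```

With a text\-only weight of 0\.8 on the misleading regime, a signal of fidelity 0\.9 brings the posterior below one half, whereas fidelity 0\.7 leaves the misleading regime favored\. Temperature scaling keeps the scaled probability above one half for every finite positive temperature, since it rescales confidence without adding evidence about the latent state\.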
The strongest conclusion is not that sequence models are useless\. It is that their usefulness is conditional\. They are most reliable in local islands where the textual prefix identifies the regime and textualizes the relevant latent state\. They are least reliable when the correct continuation depends on omitted facts, unobserved circumstances, or verification procedures not represented in the prompt\. In such cases, safe deployment requires structurally decoupled external observers, retrieval systems, tools, formal checks, or human review\. Without them, the system does not take a calculated risk; it acts under an unmeasured sufficiency failure\.
## References
- Alemohammad et al\. \[2024\] Alemohammad, S\., Casco\-Rodriguez, J\., Luzi, L\., Humayun, A\. I\., Babaei, H\., LeJeune, D\., Siahkoohi, A\., and Baraniuk, R\. G\. \(2024\)\. Self\-consuming generative models go MAD\. In *International Conference on Learning Representations*\.
- Bender and Koller \[2020\] Bender, E\. M\. and Koller, A\. \(2020\)\. Climbing towards NLU: On meaning, form, and understanding in the age of data\. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 5185–5198\.
- Bengio et al\. \[2003\] Bengio, Y\., Ducharme, R\., Vincent, P\., and Jauvin, C\. \(2003\)\. A neural probabilistic language model\. *Journal of Machine Learning Research*, 3:1137–1155\.
- Birkhoff \[1931\] Birkhoff, G\. D\. \(1931\)\. Proof of the ergodic theorem\. *Proceedings of the National Academy of Sciences*, 17\(12\):656–660\.
- Bishop \[2006\] Bishop, C\. M\. \(2006\)\. *Pattern Recognition and Machine Learning*\. Springer\.
- Blackwell and Dubins \[1962\] Blackwell, D\. and Dubins, L\. \(1962\)\. Merging of opinions with increasing information\. *The Annals of Mathematical Statistics*, 33\(3\):882–886\.
- Borgeaud et al\. \[2022\] Borgeaud, S\., Mensch, A\., Hoffmann, J\., Cai, T\., Rutherford, E\., Millican, K\., van den Driessche, G\., Lespiau, J\.\-B\., Damoc, B\., Clark, A\., et al\. \(2022\)\. Improving language models by retrieving from trillions of tokens\. In *Proceedings of the 39th International Conference on Machine Learning*\.
- Brown et al\. \[2020\] Brown, T\. B\., Mann, B\., Ryder, N\., Subbiah, M\., Kaplan, J\., Dhariwal, P\., Neelakantan, A\., Shyam, P\., Sastry, G\., Askell, A\., et al\. \(2020\)\. Language models are few\-shot learners\. *Advances in Neural Information Processing Systems*, 33:1877–1901\.
- Cover and Thomas \[2006\] Cover, T\. M\. and Thomas, J\. A\. \(2006\)\. *Elements of Information Theory*\. Wiley\-Interscience, 2nd edition\.
- Corielli \[2026\] Corielli, F\. \(2026\)\. When is next\-token prediction useful? Marginalization, ergodicity, mixture identifiability, local sufficiency, RAG, tools, and programming\. Working paper, May 22, 2026\. [ArXiv Link](https://arxiv.org/abs/2605.23278)
- Doob \[1953\] Doob, J\. L\. \(1953\)\. *Stochastic Processes*\. Wiley\.
- Gao et al\. \[2023\] Gao, L\., Madaan, A\., Zhou, S\., Alon, U\., Liu, P\., Yang, Y\., Callan, J\., and Neubig, G\. \(2023\)\. Program\-aided language models\. In *Proceedings of the 40th International Conference on Machine Learning*\.
- Holtzman et al\. \[2020\] Holtzman, A\., Buys, J\., Du, L\., Forbes, M\., and Choi, Y\. \(2020\)\. The curious case of neural text degeneration\. In *International Conference on Learning Representations*\.
- Ji et al\. \[2023\] Ji, Z\., Lee, N\., Frieske, R\., Yu, T\., Su, D\., Xu, Y\., Ishii, E\., Bang, Y\. J\., Madotto, A\., and Fung, P\. \(2023\)\. Survey of hallucination in natural language generation\. *ACM Computing Surveys*, 55\(12\):Article 248\.
- Kadavath et al\. \[2022\] Kadavath, S\., Conerly, T\., Askell, A\., Henighan, T\., Drain, D\., Perez, E\., Schiefer, N\., Hatfield\-Dodds, Z\., DasSarma, N\., Tran\-Johnson, E\., et al\. \(2022\)\. Language models \(mostly\) know what they know\. *arXiv:2207\.05221*\.
- Karpas et al\. \[2022\] Karpas, E\., Abend, O\., Belinkov, Y\., Lenz, B\., Lieber, O\., Ratner, N\., Shoham, Y\., Bata, H\., Levine, Y\., Leyton\-Brown, K\., Muhlgay, D\., Rozen, N\., Schwartz, E\., Shachaf, G\., Shalev\-Shwartz, S\., Shashua, A\., and Tenenholtz, M\. \(2022\)\. MRKL systems: A modular, neuro\-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning\. *arXiv:2205\.00445*\.
- Lewis et al\. \[2020\] Lewis, P\., Perez, E\., Piktus, A\., Petroni, F\., Karpukhin, V\., Goyal, N\., Kuttler, H\., Lewis, M\., Yih, W\.\-t\., Rocktäschel, T\., et al\. \(2020\)\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. *Advances in Neural Information Processing Systems*, 33:9459–9474\.
- Manning and Schuetze \[1999\] Manning, C\. D\. and Schuetze, H\. \(1999\)\. *Foundations of Statistical Natural Language Processing*\. MIT Press\.
- Min et al\. \[2022\] Min, S\., Lyu, X\., Holtzman, A\., Artetxe, M\., Lewis, M\., Hajishirzi, H\., and Zettlemoyer, L\. \(2022\)\. Rethinking the role of demonstrations: What makes in\-context learning work? In *Proceedings of EMNLP*\.
- Ouyang et al\. \[2022\] Ouyang, L\., Wu, J\., Jiang, X\., Almeida, D\., Wainwright, C\., Mishkin, P\., Zhang, C\., Agarwal, S\., Slama, K\., Ray, A\., et al\. \(2022\)\. Training language models to follow instructions with human feedback\. *Advances in Neural Information Processing Systems*, 35:27730–27744\.
- Radford et al\. \[2019\] Radford, A\., Wu, J\., Child, R\., Luan, D\., Amodei, D\., and Sutskever, I\. \(2019\)\. Language models are unsupervised multitask learners\. OpenAI technical report\.
- Rosenfeld \[2000\] Rosenfeld, R\. \(2000\)\. Two decades of statistical language modeling: Where do we go from here? *Proceedings of the IEEE*, 88\(8\):1270–1278\.
- Schick et al\. \[2023\] Schick, T\., Dwivedi\-Yu, J\., Dessi, R\., Raileanu, R\., Lomeli, M\., Zettlemoyer, L\., Cancedda, N\., and Scialom, T\. \(2023\)\. Toolformer: Language models can teach themselves to use tools\. In *Advances in Neural Information Processing Systems*\.
- Shannon \[1948\] Shannon, C\. E\. \(1948\)\. A mathematical theory of communication\. *Bell System Technical Journal*, 27\(3\):379–423\.
- Shannon \[1951\] Shannon, C\. E\. \(1951\)\. Prediction and entropy of printed English\. *Bell System Technical Journal*, 30\(1\):50–64\.
- Shumailov et al\. \[2023\] Shumailov, I\., Shumaylov, Z\., Zhao, Y\., Gal, Y\., Papernot, N\., and Anderson, R\. \(2023\)\. The curse of recursion: Training on generated data makes models forget\. *arXiv:2305\.17493*\.
- Shumailov et al\. \[2024\] Shumailov, I\., Shumaylov, Z\., Zhao, Y\., Papernot, N\., Anderson, R\., and Gal, Y\. \(2024\)\. AI models collapse when trained on recursively generated data\. *Nature*, 631:755–759\.
- Vaswani et al\. \[2017\] Vaswani, A\., Shazeer, N\., Parmar, N\., Uszkoreit, J\., Jones, L\., Gomez, A\. N\., Kaiser, L\., and Polosukhin, I\. \(2017\)\. Attention is all you need\. In *Advances in Neural Information Processing Systems*, volume 30\.
- Wainwright and Jordan \[2008\] Wainwright, M\. J\. and Jordan, M\. I\. \(2008\)\. Graphical models, exponential families, and variational inference\. *Foundations and Trends in Machine Learning*, 1\(1–2\):1–305\.
- Wei et al\. \[2022\] Wei, J\., Wang, X\., Schuurmans, D\., Bosma, M\., Xia, F\., Chi, E\., Le, Q\. V\., and Zhou, D\. \(2022\)\. Chain\-of\-thought prompting elicits reasoning in large language models\. *Advances in Neural Information Processing Systems*, 35:24824–24837\.
- Xie et al\. \[2022\] Xie, S\. M\., Raghunathan, A\., Liang, P\., and Ma, T\. \(2022\)\. An explanation of in\-context learning as implicit Bayesian inference\. In *International Conference on Learning Representations*\.
- Yao et al\. \[2023\] Yao, S\., Zhao, J\., Yu, D\., Du, N\., Shafran, I\., Narasimhan, K\., and Cao, Y\. \(2023\)\. ReAct: Synergizing reasoning and acting in language models\. In *International Conference on Learning Representations*\.