Discrete Stochastic Localization for Non-autoregressive Generation
Summary
Introduces Discrete Stochastic Localization (DSL), a continuous-state diffusion framework for non-autoregressive text generation that uses unit-sphere token embeddings and a timestep-invariant denoiser, achieving better distributional faithfulness than masked discrete diffusion models on OpenWebText.
# Discrete Stochastic Localization for Non-autoregressive Generation
Source: [https://arxiv.org/html/2605.12836](https://arxiv.org/html/2605.12836)
Yunshu Wu¹, Jiayi Cheng², Longxuan Yu¹, Partha Thakuria¹, Rob Brekelmans, Evangelos E. Papalexakis¹, Greg Ver Steeg¹. ¹University of California Riverside, ²New York University. {ywu380, ylong030, pthakuria, epapalex, greg.versteeg}@ucr.edu, jiayi.cheng@nyu.edu, rob.brekelmans@gmail.com
###### Abstract
Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce *Discrete Stochastic Localization* (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T=128$ to $T=1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T=48$ total steps, without distillation or retraining.
## 1 Introduction
Continuous diffusion has become a standard framework for high-dimensional generation, with strong results in image, video, and audio synthesis (Ho et al., [2020](https://arxiv.org/html/2605.12836#bib.bib97); Song et al., [2020](https://arxiv.org/html/2605.12836#bib.bib105); Ho et al., [2022](https://arxiv.org/html/2605.12836#bib.bib688); Huang et al., [2025](https://arxiv.org/html/2605.12836#bib.bib689); Kong et al., [2020](https://arxiv.org/html/2605.12836#bib.bib690)). For language, however, continuous diffusion over token embeddings has not delivered the same advantage: existing continuous diffusion language models have consistently lagged behind masked discrete diffusion models (MDMs) (Austin et al., [2021](https://arxiv.org/html/2605.12836#bib.bib65); Sahoo et al., [2024](https://arxiv.org/html/2605.12836#bib.bib663); Shi et al., [2024](https://arxiv.org/html/2605.12836#bib.bib664)). This gap is puzzling. Continuous diffusion exposes the model to a continuum of finite-SNR states between the fully uninformative and nearly clean limits, which should in principle provide richer supervision than endpoint-like masked corruptions. Why, then, has this extra continuity not translated into stronger language generation?
We argue that the missing ingredient is not continuity alone, but also representation choice. Standard embedding-space diffusion typically adopts Variance-Preserving (VP) or Variance-Exploding (VE) Gaussian noising. These parameterizations are probabilistically valid, but they represent the zero-information limit as isotropic Gaussian noise rather than as a mask-like token state. Moreover, the raw noisy embedding is generally not sufficient for clean-token prediction: the denoiser must also receive a timestep or noise-level label to calibrate how much token evidence the embedding contains. As a result, denoisers in this representation rely heavily on time conditioning to distinguish masked states from uncertain ones.
This suggests a simple design goal: parameterize the corruption so that the noisy token state alone determines the clean-token posterior. We introduce *Discrete Stochastic Localization* (DSL), a continuous diffusion language model based on stochastic localization over unit-norm token embeddings. In DSL, the zero-SNR state is the zero vector, which can be identified with a [MASK] token, while the high-SNR limit recovers the clean token embedding. More importantly, we show that the unit-norm stochastic-localization parameterization makes the Bayes-optimal clean-token posterior depend only on the noisy token state, not on the nominal SNR. DSL therefore uses a time-invariant denoiser: masked, uncertain finite-SNR, and near-clean token states are all denoised to clean-token posteriors without explicit time conditioning.
This time\-invariant denoiser gives DSL a simple inference\-level consequence\. Because the model is not tied to a global diffusion time, different positions can occupy different SNR levels and follow different update schedules\. Masked refinement, random\-order autoregressive decoding, and hybrid continuous\-then\-discrete sampling are therefore shown to be different ways of querying the same posterior estimator, rather than separate sampler\-specific models\. Thus DSL keeps the endpoint behavior and checkpoint compatibility of masked diffusion while adding continuous finite\-SNR states for richer refinement\.
We instantiate DSL by fine-tuning from a pretrained MDLM checkpoint and evaluate multiple inference paths through the same model on OpenWebText. On a masked-refinement path with a ReMDM-family decoder, DSL substantially improves few-step MAUVE across sampling budgets from $T=128$ to $T=1024$. The same checkpoint also supports random-order autoregressive sampling without retraining and a hybrid continuous-then-discrete sampler that produces coherent generations in as few as $T=48$ steps. Together, these results support the central claim: endpoint revealing, masked refinement, and continuous/hybrid denoising can be served by one posterior estimator rather than by separate sampler-specific models.
(a) DSL dynamics: each $\bm{z}_i(t)$ localizes toward a symbol anchor over time.
(b) State summary: tokens span SNR 1.0–24.3, from masked to hard decoding.
Figure 1: Discrete Stochastic Localization (DSL) dynamically “localizes” to sample discrete tokens. *(Left)* DSL dynamics on a cyclic toy sequence (ABCDE, BCDEA, …); details in Appendix [D](https://arxiv.org/html/2605.12836#A4). *(Right)* State summary at one snapshot: each token occupies a different SNR regime, transitioning from masked to soft to hard decoding as SNR increases.
Contributions.
- A continuous-state framework for discrete generation. We introduce DSL, a stochastic-localization channel over unit-sphere token embeddings whose Bayes-optimal denoiser is, by construction, a function of the noisy state alone rather than the nominal SNR. A single time-agnostic network can therefore support an entire family of per-token SNR paths.
- A practical recipe. Mixed-support SNR sampling trains the same denoiser on endpoint and finite-SNR states, while a mixture-of-tokens converter turns the framework into a trainable objective over a standard Transformer/DiT backbone.
- Evidence across paths. A single DSL checkpoint improves few-step OpenWebText generation under masked refinement, supports random-order AR and hybrid sampling without retraining, and is competitive on Text8 likelihood.
## 2 Discrete Stochastic Localization
This section develops DSL in three steps\. We first set up the simplest version of the framework—a single Gaussian observation channel applied to the whole sentence at one global signal\-to\-noise ratio \(§[2\.1](https://arxiv.org/html/2605.12836#S2.SS1)\)\. We then introduce DSL’s key design choice, placing clean token embeddings on the unit sphere, and show that under this geometry the Bayes\-optimal denoiser depends only on the noisy token and not on the nominal SNR \(§[2\.2](https://arxiv.org/html/2605.12836#S2.SS2)\)\. Because a single denoiser then suffices across all SNR regimes, we can let each token position evolve under its own SNR; this generalizes the channel to a continuous family of per\-token SNR paths \(§[2\.3](https://arxiv.org/html/2605.12836#S2.SS3)\)\. Finally, two exact NLL views justify the state support used by the training objective \(§[2\.4](https://arxiv.org/html/2605.12836#S2.SS4)\)\.
### 2.1 A sentence-level continuous starting point
We start from the simplest version of DSL: a Gaussian observation channel applied to the whole sentence at a single global signal\-to\-noise ratio\. This setup is the natural continuous analogue of standard diffusion and lets us derive the MMSE drift form that the rest of the section will build on\.
Let $\bm{s}=(s_1,\dots,s_L)\in\mathcal{V}^L$ be a sentence of length $L$, and let each token be embedded as $\bm{x}_i=\operatorname{enc}(s_i)\in\mathbb{R}^d$. We write the full sentence embedding as $\bm{x}=[\bm{x}_1;\dots;\bm{x}_L]\in\mathbb{R}^{Ld}$. The sentence-level localization channel is
$$\bm{z}_t=t\,\bm{x}+\sqrt{t}\,\bm{\epsilon},\qquad \bm{x}\sim P(\bm{x}),\;\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I}).$$
Dividing by $t$ gives $\bm{z}_t/t=\bm{x}+\bm{\epsilon}/\sqrt{t}$, so $\bm{z}_t$ is equivalent in information to observing $\bm{x}$ through additive Gaussian noise of variance $1/t$. The signal-to-noise ratio is therefore exactly $t$: at $t=0$ the observation carries no information, and as $t$ grows $\bm{z}_t$ becomes progressively more informative about $\bm{x}$. This channel admits a simple SDE realization,
$$d\bm{z}_t=\bm{x}\,dt+d\mathbf{w},\qquad \bm{x}\sim P(\bm{x}),\tag{1}$$
which conditions on the unknown clean sentence and so cannot be used directly for generation. We seek an equivalent unconditional SDE (equivalent in the sense that both dynamics induce the same marginal on $\bm{z}_t$) whose drift depends only on $\bm{z}_t$:
$$d\bm{z}_t=\hat{\bm{x}}(\bm{z}_t,t)\,dt+d\mathbf{w},\qquad \hat{\bm{x}}(\bm{z}_t,t):=\mathbb{E}_{P_t(\bm{x}\mid\bm{z}_t)}[\bm{x}].\tag{2}$$
The drift that achieves this equivalence is the MMSE denoiser, the posterior mean of $\bm{x}$ given $\bm{z}_t$ (proof in Appendix [A.4](https://arxiv.org/html/2605.12836#A1.SS4)). At this point DSL still resembles a standard continuous-denoising construction: one global SNR, one denoiser that nominally depends on $t$.
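As a minimal sketch of the localization channel above, the following pure-Python snippet draws $\bm{z}_t=t\,\bm{x}+\sqrt{t}\,\bm{\epsilon}$ for a toy two-dimensional embedding and prints the rescaled error $\bm{z}_t/t-\bm{x}$, whose per-coordinate variance is $1/t$. The function name `localize` and the embedding values are illustrative assumptions, not part of the paper.

```python
import math
import random

def localize(x, t, rng):
    """Sample z_t = t*x + sqrt(t)*eps with eps ~ N(0, I); eps is returned for inspection."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    z = [t * xi + math.sqrt(t) * ei for xi, ei in zip(x, eps)]
    return z, eps

rng = random.Random(0)
x = [0.6, 0.8]  # toy clean embedding (hypothetical values, unit norm)
for t in (0.0, 1.0, 100.0):
    z, eps = localize(x, t, rng)
    if t > 0:
        # z/t = x + eps/sqrt(t): additive noise with variance 1/t per coordinate.
        print(t, [zi / t - xi for zi, xi in zip(z, x)])
    else:
        # At t = 0 the observation is the zero vector and carries no information.
        print(t, z)
```

At $t=0$ the observation is exactly the zero vector, which is the state the paper identifies with a [MASK] token.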
### 2.2 Unit-norm token geometry gives a time-invariant denoiser
The key design choice in DSL is to constrain every clean token embedding to lie on the unit sphere, $\|\bm{x}_i\|_2=1$ for all $i$. This is a structural constraint built into the embedding geometry, not a regularizer on the loss. As we now show, the benefit of this construction is to remove the nominal SNR $t$ from the Bayes-optimal denoiser.
Under the localization channel Eq. [1](https://arxiv.org/html/2605.12836#S2.E1), $P_t(\bm{x}\mid\bm{z}_t)\;\propto\;P(\bm{x})\exp\!\Big(\bm{x}\cdot\bm{z}_t\;-\;\tfrac{t}{2}\|\bm{x}\|_2^2\Big)$ is the Bayes posterior factorization. The term $\tfrac{t}{2}\|\bm{x}\|_2^2$ is the only place the nominal SNR enters. Under unit-norm token embeddings, $\|\bm{x}\|_2^2=L$ for every length-$L$ sentence, so this term is sentence-independent and cancels after normalization. The posterior, and therefore the MMSE denoiser, depends only on $\bm{z}_t$:
$$\hat{\bm{x}}(\bm{z},t)=\mathbb{E}_{P_t(\bm{x}\mid\bm{z})}[\bm{x}]=\mathbb{E}_{P(\bm{x})}\!\left[\bm{x}\,e^{\bm{x}\cdot\bm{z}}\right]\Big/\mathbb{E}_{P(\bm{x})}\!\left[e^{\bm{x}\cdot\bm{z}}\right]=\hat{\bm{x}}(\bm{z}).\tag{3}$$
The SNR-invariance in Eq. [3](https://arxiv.org/html/2605.12836#S2.E3) requires both the localization parameterization, which makes $z$ a posterior natural coordinate with $a_\tau/\sigma_\tau^2=1$, and unit-sphere geometry, which makes $\|x_i\|^2$ vocabulary-constant. Either condition alone is insufficient; their composition produces the structural cancellation.
This SNR-invariance is the structural property that drives the rest of the paper. The Bayes denoiser does not need to know which $t$ generated $\bm{z}_t$ to denoise it optimally; it only needs the noisy token. A single network can therefore in principle serve every SNR regime at once. We give the full derivation in Appendix [A.2](https://arxiv.org/html/2605.12836#A1.SS2); geometric consequences (an entropy envelope on the induced token posterior $p(\bm{x}_i\mid\bm{z}_t)$ and Lipschitz smoothness) are deferred to Section [4](https://arxiv.org/html/2605.12836#S4).
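The cancellation can be checked numerically on a toy single-token vocabulary. The sketch below evaluates the posterior $P_t(v\mid\bm{z})\propto\pi_v\exp(\bm{x}_v\cdot\bm{z}-\tfrac{t}{2}\|\bm{x}_v\|_2^2)$ at two very different values of $t$, once with unit-norm embeddings and once with embeddings of mixed norm. The vocabulary, prior, and query point are arbitrary illustrative choices, and `posterior` is a hypothetical helper name.

```python
import math

def posterior(z, embeddings, prior, t):
    """Token posterior P_t(v | z) proportional to prior[v] * exp(x_v.z - (t/2)*||x_v||^2)."""
    logits = [
        math.log(p) + sum(a * b for a, b in zip(x, z)) - 0.5 * t * sum(a * a for a in x)
        for x, p in zip(embeddings, prior)
    ]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return [wi / total for wi in w]

prior = [0.5, 0.3, 0.2]
z = [0.7, -0.4]
# Unit-norm embeddings: three points on the unit circle.
unit = [[math.cos(a), math.sin(a)] for a in (0.0, 2.0, 4.0)]
# Same directions but different norms, to break the cancellation.
scaled = [[r * c for c in x] for r, x in zip((0.5, 1.0, 2.0), unit)]

for name, emb in (("unit-norm", unit), ("mixed-norm", scaled)):
    p_lo = posterior(z, emb, prior, t=0.1)
    p_hi = posterior(z, emb, prior, t=10.0)
    gap = max(abs(a - b) for a, b in zip(p_lo, p_hi))
    print(name, "max posterior change between t=0.1 and t=10:", gap)
```

With unit-norm embeddings the $t$-dependent term is the same constant for every token and drops out of the softmax, so the two posteriors agree to floating-point precision; with mixed norms the same noisy state yields different posteriors at different $t$.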
### 2.3 Arbitrary per-token SNR paths
The SNR-invariance from §[2.2](https://arxiv.org/html/2605.12836#S2.SS2) lets us drop the assumption that all tokens share a single SNR. We can let each position $i$ evolve under its own SNR:
$$\bm{z}_i=\gamma_i\,\bm{x}_i+\sqrt{\gamma_i}\,\bm{\epsilon}_i,\qquad \bm{\epsilon}_i\sim\mathcal{N}(\bm{0},\bm{I}),$$
and we allow each $\gamma_i(t)$ to follow an arbitrary admissible path: continuous, non-decreasing, with $\gamma_i(0)=0$ and $\gamma_i(t)\to\infty$ as $t\to\infty$.
For these per-token paths, the data-conditioned and unconditional dynamics are
$$d\bm{z}_i=\bm{x}_i\,\dot{\gamma}_i\,dt+\sqrt{\dot{\gamma}_i}\,d\mathbf{w}_i,\qquad \bm{x}\sim P(\bm{x}),\tag{4}$$
$$d\bm{z}_i=\hat{\bm{x}}_i(\bm{z})\,\dot{\gamma}_i\,dt+\sqrt{\dot{\gamma}_i}\,d\mathbf{w}_i.\tag{5}$$
We show in Appendix [A.4](https://arxiv.org/html/2605.12836#A1.SS4) that these two SDEs induce the same marginal on $\bm{z}_t$ along any admissible per-token path. Combined with the SNR-invariance of $\hat{\bm{x}}_i$, this means a *single* trained denoiser can be sampled along arbitrary per-token SNR schedules without changing the target distribution.
The path interpretation is then immediate. Driving one token from $\gamma=0$ to $\infty$ while holding others fixed corresponds to revealing that token in random-order autoregressive (ROAR) style; sending selected positions back to $\gamma=0$ before re-denoising corresponds to masked diffusion with remasking; jointly increasing all $\gamma_i$ corresponds to standard continuous diffusion. Random-order AR generation, masked refinement, and continuous diffusion sampling are different paths through one shared family of per-token SNR configurations and, by the argument above, can all be served by one trained DSL model.
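To make the path picture concrete, here is a small sketch that integrates the unconditional dynamics of Eq. 5 with Euler–Maruyama steps under two different per-token schedules while querying the same time-free denoiser. Everything specific is an assumption made for illustration: the prior is uniform over the five cyclic shifts of ABCDE (echoing the toy in Figure 1), embeddings are five points on the unit circle, the exact Bayes denoiser is computed by enumeration instead of a trained network, and a finite `gamma_max` stands in for $\gamma\to\infty$.

```python
import math
import random

VOCAB = "ABCDE"
# Unit-norm embeddings: five points on the unit circle (toy choice).
EMB = [[math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5)] for k in range(5)]
# Toy prior: uniform over the five cyclic shifts of ABCDE.
SEQS = [[(k + j) % 5 for j in range(5)] for k in range(5)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def denoise(z):
    """Bayes posterior mean x_hat_i(z) under the toy prior; it takes no time input."""
    logits = [sum(dot(EMB[s[i]], z[i]) for i in range(5)) for s in SEQS]
    top = max(logits)
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return [[sum(wk * EMB[s[i]][c] for wk, s in zip(w, SEQS)) / total for c in range(2)]
            for i in range(5)]

def sample(schedule, rng):
    """Euler-Maruyama on Eq. 5; each schedule entry lists per-token SNR increments."""
    z = [[0.0, 0.0] for _ in range(5)]  # gamma = 0 everywhere: all tokens masked
    for d_gamma in schedule:
        x_hat = denoise(z)
        for i, dg in enumerate(d_gamma):
            if dg > 0.0:
                for c in range(2):
                    z[i][c] += x_hat[i][c] * dg + math.sqrt(dg) * rng.gauss(0.0, 1.0)
    # Decode each position independently to its nearest token embedding.
    return "".join(VOCAB[max(range(5), key=lambda v: dot(EMB[v], zi))] for zi in z)

def joint_schedule(gamma_max=200.0, steps=2000):
    """All tokens raise gamma together (standard continuous diffusion)."""
    return [[gamma_max / steps] * 5 for _ in range(steps)]

def one_at_a_time_schedule(order, gamma_max=200.0, steps=400):
    """Drive one token from 0 to gamma_max while the others stay fixed (ROAR-style)."""
    return [[gamma_max / steps if i == pos else 0.0 for i in range(5)]
            for pos in order for _ in range(steps)]

rng = random.Random(0)
print("joint path:        ", sample(joint_schedule(), rng))
print("one-at-a-time path:", sample(one_at_a_time_schedule([2, 0, 4, 1, 3]), rng))
```

Both schedules call the same `denoise` function and differ only in which states they visit. A remasking path would additionally reset selected `z[i]` to the zero vector before continuing; that variant is omitted here for brevity.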
### 2.4 Exact NLL views that motivate the training support
The DSL training objective is not chosen as a sampler heuristic\. It is derived from two exact likelihood views of the same per\-token SNR formalism: a continuous path\-integral NLL over finite\-SNR denoising states, and an endpoint ROAR NLL over mask/reveal states\. These identities are the likelihood\-level reason that the mixed\-support loss in Section[3\.1](https://arxiv.org/html/2605.12836#S3.SS1)places training mass on both state families\.
Continuous path-integral view. For any admissible per-token SNR path $C$, DSL admits the path-integral identity
$$-\log P(\bm{x})\;=\;\tfrac{1}{2}\int_{C}\mathbf{E}(\bm{x},\gamma)\cdot d\gamma,\qquad E_i(\bm{x},\gamma):=\mathbb{E}_{p_\gamma(\bm{z}\mid\bm{x})}\!\big[\|\bm{x}_i-\hat{\bm{x}}_i(\bm{z})\|_2^2\big]\tag{6}$$
(derivation in Appendix [A.3](https://arxiv.org/html/2605.12836#A1.SS3)). Under the Bayes-optimal denoiser, this error field is conservative, so the integral is path-independent: every admissible path gives the same $-\log P(\bm{x})$. Intermediate continuous-SNR states are therefore not heuristic interpolants; they are the integration variables of an exact likelihood.
Endpoint (ROAR) view. A second exact view comes from configurations in which some positions are already fully revealed ($\gamma=\infty$) and the rest remain fully masked ($\gamma=0$). In this regime the sentence likelihood reduces to an autoregressive-style decomposition, and averaging over the size of the revealed subset, the choice of revealed positions, and the next index to predict yields the random-order autoregressive estimator
$$-\tfrac{1}{L}\log P(\bm{s})\;=\;-\mathbb{E}_{(k,A,i)}\!\big[\log P(s_i\mid s_A)\big]\tag{7}$$
(derivation in Appendix [A.6](https://arxiv.org/html/2605.12836#A1.SS6)). This is the subset-conditioned objective used by masked-diffusion language models (Sahoo et al., [2024](https://arxiv.org/html/2605.12836#bib.bib663); Nie et al., [2025](https://arxiv.org/html/2605.12836#bib.bib10)); here, it arises directly from the same DSL path formalism.
Bridge to mixed-support training. Eq. [6](https://arxiv.org/html/2605.12836#S2.E6) and Eq. [7](https://arxiv.org/html/2605.12836#S2.E7) are the two likelihood sources of the practical objective in Section [3.1](https://arxiv.org/html/2605.12836#S3.SS1). The former contributes intermediate finite-SNR states; the latter contributes endpoint mask/reveal states. Section [3](https://arxiv.org/html/2605.12836#S3) then implements this NLL-derived support with a single mixed cross-entropy objective.
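The endpoint identity in Eq. 7 can be verified exactly on a tiny example. The sketch below builds an arbitrary joint distribution over binary sequences of length 3, enumerates every triple $(k, A, i)$ with $k$ uniform on $\{0,\dots,L-1\}$, $A$ a uniform size-$k$ subset, and $i$ uniform outside $A$, and compares the resulting average with $-\tfrac{1}{L}\log P(\bm{s})$. The toy distribution and the helper names `marginal` and `roar_nll` are assumptions made for this illustration.

```python
import itertools
import math

L = 3
# Toy joint distribution over binary sequences of length 3 (arbitrary positive weights).
weights = {s: 1.0 + 0.5 * n for n, s in enumerate(itertools.product((0, 1), repeat=L))}
Z = sum(weights.values())
P = {s: w / Z for s, w in weights.items()}

def marginal(s, idx):
    """Probability that the positions in idx take the values they have in s."""
    return sum(p for t, p in P.items() if all(t[j] == s[j] for j in idx))

def roar_nll(s):
    """Exact -E_{(k, A, i)}[log P(s_i | s_A)] with k, A, i uniform as in Eq. 7."""
    total = 0.0
    for k in range(L):                                   # k ~ Unif{0, ..., L-1}
        subsets = list(itertools.combinations(range(L), k))
        for A in subsets:                                # A uniform with |A| = k
            rest = [i for i in range(L) if i not in A]
            for i in rest:                               # i uniform outside A
                log_cond = math.log(marginal(s, A + (i,)) / marginal(s, A))
                total -= log_cond / (L * len(subsets) * len(rest))
    return total

s = (1, 0, 1)
print("ROAR estimator:", roar_nll(s))
print("-(1/L) log P(s):", -math.log(P[s]) / L)
```

The agreement follows from the chain rule applied in a uniformly random order: for each $k$, the pair (first $k$ revealed positions, next position) is distributed exactly as $(A, i)$ above.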
## 3 DSL Training and Architecture
Section[2](https://arxiv.org/html/2605.12836#S2)identifies two exact likelihood views and their corresponding SNR paths\. Turning them into a trainable recipe involves two design decisions: a mixed\-support SNR sampler that covers both families under a single loss \(§[3\.1](https://arxiv.org/html/2605.12836#S3.SS1)\), and a posterior\-view converter that presents the corrupted state to the backbone neural network architecture in a form that matches the underlying geometry \(§[3\.2](https://arxiv.org/html/2605.12836#S3.SS2)\)\.
Algorithm 1: DSL training
1: Input: sequence length $L$; mixing weight $\lambda\in[0,1]$; ROAR endpoint values $\gamma_{\min},\gamma_{\max}$; log-normal parameters $(\mu,\sigma)$; backbone $f_\theta$ with converter; learning rate $\eta$.
2: for each training step do
3: Sample sentence $\bm{s}=(s_1,\dots,s_L)\sim P_{\mathrm{data}}$; embed $\bm{x}_i=\operatorname{enc}(s_i)$.
4: Draw $u\sim\mathrm{Unif}[0,1]$.
5: if $u<1-\lambda$ then ▷ ROAR branch
6: Sample reveal-set size $k\sim\mathrm{Unif}\{0,\dots,L-1\}$, subset $A\subseteq[L]$ with $|A|=k$.
7: Set $\gamma_i\leftarrow\gamma_{\max}$ for $i\in A$, and $\gamma_i\leftarrow\gamma_{\min}$ otherwise.
8: else ▷ Continuous path branch
9: Sample $\gamma_i\sim\mathrm{LogNormal}(\mu,\sigma^2)$ independently for each $i\in[L]$.
10: end if
11: Sample $\bm{\epsilon}_i\sim\mathcal{N}(\bm{0},\bm{I})$; form noisy tokens $\bm{z}_i\leftarrow\gamma_i\bm{x}_i+\sqrt{\gamma_i}\,\bm{\epsilon}_i$.
12: Compute converter outputs $\bm{m}_i^{\mathrm{conv}}$ from each $\bm{z}_i$ via Eq. [11](https://arxiv.org/html/2605.12836#S3.E11).
13: Run backbone: $p_\theta(\cdot\mid\bm{z})\leftarrow f_\theta(\bm{m}_1^{\mathrm{conv}},\dots,\bm{m}_L^{\mathrm{conv}})$.
14: Update $\theta\leftarrow\theta-\eta\,\nabla_\theta\,\sum_{i=1}^{L}-\log p_\theta(s_i\mid\bm{z})$.
15: end for
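Steps 4–10 of Algorithm 1 (the mixed-support SNR draw) can be sketched with the Python standard library alone. The function name `sample_gammas` and all numeric hyperparameters below are placeholders chosen for illustration; the paper lists its actual values in Appendix C.2, which is not reproduced here.

```python
import random

def sample_gammas(L, lam, gamma_min, gamma_max, mu, sigma, rng):
    """Per-token SNRs following steps 4-10 of Algorithm 1."""
    if rng.random() < 1.0 - lam:
        # ROAR branch: k revealed positions at gamma_max, the rest at gamma_min.
        k = rng.randrange(L)                      # k ~ Unif{0, ..., L-1}
        revealed = set(rng.sample(range(L), k))   # uniform subset A with |A| = k
        return [gamma_max if i in revealed else gamma_min for i in range(L)]
    # Continuous-path branch: independent log-normal SNR for every token.
    # random.lognormvariate takes the standard deviation sigma, not the variance.
    return [rng.lognormvariate(mu, sigma) for _ in range(L)]

rng = random.Random(0)
# Placeholder hyperparameters, not the paper's settings.
gammas = sample_gammas(8, lam=0.5, gamma_min=1e-3, gamma_max=50.0, mu=0.0, sigma=1.0, rng=rng)
print(gammas)
```

The returned list feeds directly into step 11, where each token is corrupted as $\bm{z}_i\leftarrow\gamma_i\bm{x}_i+\sqrt{\gamma_i}\,\bm{\epsilon}_i$.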
### 3.1 Mixed-SNR training
To cover both state families in a single model, the DSL objective should place training mass on both, written as a convex combination
$$\mathcal{L}_{\mathrm{theory}}\;=\;\lambda\,\mathcal{L}_{\mathrm{cont\text{-}NLL}}\;+\;(1-\lambda)\,\mathcal{L}_{\mathrm{ROAR\text{-}NLL}}.\tag{8}$$
One CE loss for both branches. $\mathcal{L}_{\mathrm{ROAR\text{-}NLL}}$ is already a token-level categorical log-likelihood, so cross-entropy on $P_\theta(s_i\mid\bm{z})$ is its exact form. $\mathcal{L}_{\mathrm{cont\text{-}NLL}}$ comes from the path-integral identity Eq. [6](https://arxiv.org/html/2605.12836#S2.E6) and is naturally an embedding-space MSE on the Bayes denoiser $\hat{\bm{x}}_i(\bm{z})=\mathbb{E}[\bm{x}_i\mid\bm{z}]$. We parameterize this denoiser through a categorical token posterior,
$$\hat{\bm{x}}_{\theta,i}(\bm{z})=\sum_{v\in\mathcal{V}}P_\theta(s_i=v\mid\bm{z})\,\bm{x}_v.\tag{9}$$
Minimizing token-level cross-entropy makes $P_\theta(\cdot\mid\bm{z})$ match the true posterior $P(\cdot\mid\bm{z})$; at this optimum, the posterior mean above recovers the same Bayes denoiser that minimizes the path-integral MSE. Thus, for the continuous branch, CE is a posterior-matching surrogate for the exact MSE NLL rather than a separate heuristic. It also avoids embedding-space optimization pathologies known from prior continuous diffusion language models (Gulrajani and Hashimoto, [2023](https://arxiv.org/html/2605.12836#bib.bib656)). Full details are in Appendix [A.7](https://arxiv.org/html/2605.12836#A1.SS7).
Implementation as mixed SNR sampling. Since both branches share the same CE loss, the $\lambda$-weighted mixture of losses equals, in expectation, a single CE loss under a $\lambda$-mixture of SNR distributions. We implement the latter directly: with probability $\lambda$ we draw token-wise SNRs from a continuous-path log-normal distribution, and with probability $1-\lambda$ from a ROAR-style endpoint distribution concentrated near the low- and high-SNR extremes. Both branches are trained under
$$\mathcal{L}_{\mathrm{CE}}\;=\;-\,\mathbb{E}_{(\bm{s},\bm{z})\sim q_{\mathrm{train}}}\sum_{i=1}^{L}\log p_\theta(s_i\mid\bm{z}),\tag{10}$$
realizing Eq. [8](https://arxiv.org/html/2605.12836#S3.E8) as a single CE objective over a $\lambda$-mixed SNR support. We summarize the full training procedure in Algorithm [1](https://arxiv.org/html/2605.12836#alg1); hyperparameters are listed in Appendix [C.2](https://arxiv.org/html/2605.12836#A3.SS2).
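The step from a mixture of losses to a mixture of sampling distributions is linearity of expectation. As a short worked statement, with $q_{\mathrm{cont}}$ and $q_{\mathrm{ROAR}}$ used here as shorthand (not the paper's notation) for the joint distributions over $(\bm{s},\bm{z})$ induced by the two SNR branches, and $\ell_\theta$ for the shared per-example CE loss:

```latex
\ell_\theta(\bm{s},\bm{z}) := -\sum_{i=1}^{L}\log p_\theta(s_i\mid\bm{z}),
\qquad
q_{\mathrm{train}} := \lambda\,q_{\mathrm{cont}} + (1-\lambda)\,q_{\mathrm{ROAR}},
\\[4pt]
\lambda\,\mathbb{E}_{q_{\mathrm{cont}}}\!\big[\ell_\theta\big]
+ (1-\lambda)\,\mathbb{E}_{q_{\mathrm{ROAR}}}\!\big[\ell_\theta\big]
= \mathbb{E}_{q_{\mathrm{train}}}\!\big[\ell_\theta\big]
= \mathcal{L}_{\mathrm{CE}}.
```

This equality holds once both branches are scored with the same CE loss; the relation between CE and the original MSE form of the continuous branch is the posterior-matching argument given above, not part of this identity.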
### 3.2 Posterior-view converter and backbone
The DSL posterior lives in token-embedding geometry, while a Transformer backbone expects token-like inputs that interact stably under self-attention. We introduce a converter that maps each noisy token $\bm{z}_i$ to a mixture-of-tokens representation,
$$q_i^{\mathrm{conv}}(v\mid\bm{z}_i)\;=\;\mathrm{softmax}_\tau\!\big(\langle\bm{z}_i,\bm{x}_v\rangle+b_v\big),\qquad \bm{m}_i^{\mathrm{conv}}\;=\;\sum_{v\in\mathcal{V}\cup\{[\text{MASK}]\}}q_i^{\mathrm{conv}}(v\mid\bm{z}_i)\,\bm{x}_v.\tag{11}$$
Here $v$ indexes candidate token values, $\bm{x}_v=\operatorname{enc}(v)$, and $\tau>0$ is a learnable converter temperature hyperparameter. We feed the sequence $(\bm{m}_1^{\mathrm{conv}},\dots,\bm{m}_L^{\mathrm{conv}})$ to a standard Transformer encoder / DiT-style denoiser. Intuitively, $\bm{m}_i^{\mathrm{conv}}$ is a vocabulary-weighted combination of unit-norm token embeddings: when $\bm{z}_i$ is very noisy the mixture is broad, and as the SNR grows it concentrates on a single token.
The converter has two properties required for the analysis in Section [4](https://arxiv.org/html/2605.12836#S4): $\|\bm{m}_i^{\mathrm{conv}}\|_2\leq 1$ for every $\bm{z}_i$ regardless of SNR, and $\bm{m}_i^{\mathrm{conv}}$ concentrates smoothly as SNR grows. Together these make the backbone see a stationary input geometry in which semantic preference is carried by direction and confidence grows by controlled radial motion. This mirrors the norm–direction posterior decomposition made explicit later in [Equation 12](https://arxiv.org/html/2605.12836#S4.E12).
Finally, DSL uses no informative timestep input: for compatibility with pretrained masked-diffusion DiTs (Sahoo et al., [2024](https://arxiv.org/html/2605.12836#bib.bib663)), we keep the timestep embedding module but feed $t=0$ throughout training and sampling, consistent with Eq. [3](https://arxiv.org/html/2605.12836#S2.E3) since $\bm{z}_i$ and $\bm{m}_i^{\mathrm{conv}}$ already encode the corruption regime.
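A rough sketch of the converter in Eq. 11 follows. Two details are assumptions rather than statements from the paper: $\mathrm{softmax}_\tau$ is taken to mean a softmax over logits divided by $\tau$, and the [MASK] row of the embedding table is set to the zero vector. The three-token vocabulary, zero biases, and the noiseless sweep $\bm{z}=\gamma\,\bm{x}_v$ are likewise illustrative.

```python
import math

def converter(z, embeddings, bias, tau):
    """Mixture-of-tokens input: q = softmax((<z, x_v> + b_v) / tau), m = sum_v q_v * x_v."""
    logits = [(sum(a * b for a, b in zip(z, x)) + bv) / tau
              for x, bv in zip(embeddings, bias)]
    top = max(logits)
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    q = [wi / total for wi in w]
    mix = [sum(qv * x[c] for qv, x in zip(q, embeddings)) for c in range(len(z))]
    return q, mix

# Three unit-norm token embeddings plus a [MASK] row (assumed to be the zero vector).
tokens = [[math.cos(a), math.sin(a)] for a in (0.0, 2.0, 4.0)]
embeddings = tokens + [[0.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]

for gamma in (0.0, 1.0, 10.0, 100.0):
    z = [gamma * c for c in tokens[0]]  # noiseless sweep toward token 0
    q, mix = converter(z, embeddings, bias, tau=1.0)
    print("gamma", gamma, "max weight", round(max(q), 4), "norm", round(math.hypot(*mix), 4))
```

Because `mix` is a convex combination of rows whose norms are at most 1, its norm cannot exceed 1 for any input, which is the boundedness property stated above; the maximum weight approaches 1 as $\gamma$ grows.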
## 4 Posterior Geometry and Pathwise Error Control
Sections [2](https://arxiv.org/html/2605.12836#S2)–[3](https://arxiv.org/html/2605.12836#S3) define DSL as a posterior-state framework: exact NLL views identify the continuous-SNR and endpoint state families, and mixed-SNR training places support on both. This section explains what DSL geometry buys us for finite models. In the unit-sphere localization channel, the Bayes target is not a family of time-indexed denoisers $\{\hat{\bm{x}}(\cdot,t)\}_t$, but a single posterior-mean map $\hat{\bm{x}}(\bm{z})=\mathbb{E}[\bm{x}\mid\bm{z}]$. Different inference paths therefore differ only in which realized states $\bm{z}$ they visit, and all query the same denoising map. We analyze this shared map below: its geometry separates token identity from confidence and yields bounded, smooth posterior motion, which helps control path-dependent approximation error.
Norm–direction posterior geometry. Fix a token position. Written as a function of the noisy token $\bm{z}$, the induced token posterior is
$$p(v\mid\bm{z})\;\propto\;\pi_v\exp\!\big(\langle\bm{z},\bm{x}_v\rangle\big)\;=\;\pi_v\exp\!\big(\|\bm{z}\|_2\cos\theta_v\big),\tag{12}$$
where $\bm{x}_v:=\operatorname{enc}(v)$ is the unit-norm embedding of token $v$, $\pi_v$ is its prior, and $\theta_v$ is the angle between $\bm{z}$ and $\bm{x}_v$. Thus direction controls token preference while norm controls posterior concentration; this is a property of the unit-sphere channel, not a calibration objective. Figure [2](https://arxiv.org/html/2605.12836#S4.F2) verifies that the trained converter follows this decomposition: trajectories organize by direction, and increasing SNR primarily increases concentration.
Figure 2: The DSL posterior decomposes into direction and norm axes. (a) t-SNE projection of converter outputs for 25 probe tokens spanning five semantic classes. Trajectories sweep $\bm{z}_i=\gamma\,\bm{e}_v$ from [MASK] across $\gamma\in[10^{-2},80]$ and are colored by SNR. (b) Mean top-1 token recovery and target-token probability as $\gamma$ increases.
Bounded and Lipschitz posterior motion. The converter maps each noisy state to a bounded mixture-of-tokens input, and clean token embeddings are unit-norm. At fixed converter temperature, this bounds the achievable logit range and gives an entropy envelope for the token posterior. Moreover, the Bayes posterior-mean map $\bm{m}(\bm{z}):=\mathbb{E}[\bm{x}\mid\bm{z}]$ is 1-Lipschitz:
$$\|\bm{m}(\bm{z}_1)-\bm{m}(\bm{z}_2)\|_2\;\leq\;\|\bm{z}_1-\bm{z}_2\|_2.\tag{13}$$
Thus local perturbations in noisy state cannot be amplified by the Bayes denoising field. Full entropy bounds are in Appendix [A.9](https://arxiv.org/html/2605.12836#A1.SS9), and the proof of Eq. [13](https://arxiv.org/html/2605.12836#S4.E13) is in Appendix [A.10](https://arxiv.org/html/2605.12836#A1.SS10).
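The Lipschitz bound of Eq. 13 can be probed numerically for a single token position. The sketch below evaluates the posterior mean implied by Eq. 12 at random pairs of noisy states and records the largest ratio of output distance to input distance. The five-token vocabulary, the prior, and the sampling box for $\bm{z}$ are arbitrary choices for illustration; a random probe of this kind is a sanity check, not a proof.

```python
import math
import random

def posterior_mean(z, embeddings, prior):
    """m(z) = E[x | z] with p(v | z) proportional to prior[v] * exp(<z, x_v>), as in Eq. 12."""
    logits = [math.log(p) + sum(a * b for a, b in zip(z, x))
              for x, p in zip(embeddings, prior)]
    top = max(logits)
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return [sum(wi * x[c] for wi, x in zip(w, embeddings)) / total for c in range(len(z))]

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

rng = random.Random(0)
# Five unit-norm embeddings on the circle with a non-uniform prior (toy values).
embeddings = [[math.cos(a), math.sin(a)] for a in (0.3, 1.9, 3.2, 4.4, 5.6)]
prior = [0.1, 0.3, 0.2, 0.25, 0.15]

worst = 0.0
for _ in range(2000):
    z1 = [rng.uniform(-6.0, 6.0) for _ in range(2)]
    z2 = [rng.uniform(-6.0, 6.0) for _ in range(2)]
    ratio = dist(posterior_mean(z1, embeddings, prior),
                 posterior_mean(z2, embeddings, prior)) / dist(z1, z2)
    worst = max(worst, ratio)
print("largest observed ||m(z1) - m(z2)|| / ||z1 - z2||:", worst)
```

One way to see why the ratio cannot exceed 1: the Jacobian of $\bm{m}$ is the posterior covariance of $\bm{x}$, and for unit-norm $\bm{x}$ its trace is $1-\|\bm{m}(\bm{z})\|_2^2\leq 1$, which bounds its largest eigenvalue.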
Table 1: OWT ($L=1024$) unconditional generation. We report MAUVE ↑ as the primary metric; the full table with MAUVE / GenPPL / SentEnt is deferred to Appendix [F](https://arxiv.org/html/2605.12836#A6). Bold marks the best MAUVE in each column among non-reference rows. Rows marked ‡ are taken from (Wang et al., [2025](https://arxiv.org/html/2605.12836#bib.bib678)); rows marked § from (Rout et al., [2025](https://arxiv.org/html/2605.12836#bib.bib682)). All DSL-FT rows use the same DSL-finetuned checkpoint. DSL-FT + ROAR (naive) and DSL-FT + MDLM are non-refinement decoder ablations: tokens are committed once and never remasked. The ReMDM-family rows add iterative remasking refinement.
## 5 Experiments
We evaluate DSL on unconditional OpenWebText generation and Text8 likelihood. For OpenWebText, we separate the evaluation into three stages. First, we use simple non-refinement MDM decoders to isolate the effect of DSL training itself. Second, we add ReMDM-family masked refinement (Wang et al., [2025](https://arxiv.org/html/2605.12836#bib.bib678)) to test whether stronger iterative samplers can further exploit the DSL-trained posterior. Third, we evaluate a hybrid continuous-then-MDM sampler to show that the same checkpoint can compose continuous and discrete inference stages. Unless otherwise stated, all DSL results use the same DSL-finetuned checkpoint.
### 5.1 Setup
We evaluate unconditional OpenWebText (OWT) generation under the MDLM/ReMDM protocol: GPT-2 BPE tokenization, sequence length $L=1024$, 5k samples, nucleus sampling with top-$p=0.9$ where applicable, and MAUVE computed with GPT-2 Large embeddings and $K=500$ clusters (Sahoo et al., [2024](https://arxiv.org/html/2605.12836#bib.bib663); Wang et al., [2025](https://arxiv.org/html/2605.12836#bib.bib678)). OWT is our checkpoint-compatibility setting: we fine-tune DSL from the official MDLM checkpoint, whereas Text8 uses from-scratch training. Full sampling protocols and diagnostic metrics are in Appendix [F](https://arxiv.org/html/2605.12836#A6).
Primary metric. Following ReMDM, we use MAUVE as the primary OWT generation metric and report GenPPL/SentEnt only as diagnostics (Wang et al., [2025](https://arxiv.org/html/2605.12836#bib.bib678)). GenPPL can be reduced by narrowing the generation distribution and sacrificing entropy/diversity; MAUVE instead compares model and human text distributions via divergence frontiers (Pillutla et al., [2021](https://arxiv.org/html/2605.12836#bib.bib692)). Full MAUVE / GenPPL / SentEnt results are in Appendix [F](https://arxiv.org/html/2605.12836#A6).
### 5.2 Training-side ablation with simple MDM decoders
Before evaluating iterative refinement, we first isolate the effect of DSL training itself. We decode the same DSL-finetuned checkpoint with two simple MDM-style samplers that do not remask or refine committed tokens. The first is the vanilla MDLM sampler from the official implementation of (Sahoo et al., [2024](https://arxiv.org/html/2605.12836#bib.bib663)): it monotonically converts masked positions into sampled tokens and never revisits them. The second is ROAR, a random-order autoregressive sampler that reveals one token at a time using the DSL denoiser as an MDM posterior estimator. Full ROAR details are given in Appendix [C.5](https://arxiv.org/html/2605.12836#A3.SS5), and the shared OWT sampling protocol is specified in Appendix [C.4](https://arxiv.org/html/2605.12836#A3.SS4).
Table [1](https://arxiv.org/html/2605.12836#S4.T1) shows that DSL already performs strongly under these non-refinement decoders. Replacing the MDLM checkpoint with the DSL-finetuned checkpoint under the same vanilla MDLM sampler raises MAUVE from 0.015/0.023/0.031/0.042 to 0.402/0.481/0.506/0.495 across $T=128, 256, 512, 1024$. This is the cleanest training-side comparison in the table: the sampler is the standard monotone MDM decoder, so the improvement comes from the DSL-trained posterior rather than from remasking or self-correction.
ROAR provides a complementary endpoint test. It performs a single random-order reveal pass with $T=L=1024$, commits each token exactly once, and performs no confidence-based ordering, remasking, or iterative refinement. Even under this deliberately naive decoding path, DSL-FT + ROAR reaches MAUVE 0.551, exceeding DSL-FT + MDLM at the same budget and also surpassing several prior refinement baselines. Together, the MDLM and ROAR ablations show that DSL improves the underlying clean-token posterior before any ReMDM-family refinement is added.
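The naive ROAR pass described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `toy_posterior`, `roar_sample`, and the `MASK` sentinel are hypothetical names, and the uniform toy posterior only stands in for the DSL denoiser queried as an MDM posterior estimator.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a position that has not been revealed


def toy_posterior(tokens, vocab_size):
    """Stand-in for the denoiser: returns an (L, V) array of per-position
    clean-token probabilities. It is uniform here; a real model would
    condition on the tokens revealed so far."""
    return np.full((len(tokens), vocab_size), 1.0 / vocab_size)


def roar_sample(posterior_fn, length, vocab_size, rng):
    """Random-order reveal: one token per step, each committed exactly once."""
    tokens = np.full(length, MASK, dtype=np.int64)
    order = rng.permutation(length)  # random order, no confidence ranking
    for pos in order:
        probs = posterior_fn(tokens, vocab_size)[pos]
        tokens[pos] = rng.choice(vocab_size, p=probs)  # commit, never remask
    return tokens, order


rng = np.random.default_rng(0)
sample, order = roar_sample(toy_posterior, length=16, vocab_size=50, rng=rng)
```

With `length` equal to the sequence length, the loop uses exactly $T=L$ denoiser calls, which matches the single-pass budget described in the text.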
### 5.3 Masked refinement with ReMDM-family decoders
Algorithm [2](https://arxiv.org/html/2605.12836#alg2) summarizes the primary masked-refinement sampler used for the main OWT results. The sampler starts from the zero-SNR mask state, repeatedly predicts clean-token posteriors, reveals a subset of masked positions, and optionally remasks committed tokens inside a ReMDM-loop window. The ReMDM-loop row in Table [1](https://arxiv.org/html/2605.12836#S4.T1) instantiates this template with the step-budgeted capped schedule $\eta_{\mathrm{cap}}(T)$; the confidence-based row changes only the remasking policy. Vanilla MDLM, ROAR, and hybrid sampling details are deferred to Appendix [C.4](https://arxiv.org/html/2605.12836#A3.SS4), Appendix [C.5](https://arxiv.org/html/2605.12836#A3.SS5), and Appendix [C.6](https://arxiv.org/html/2605.12836#A3.SS6).
Algorithm 2: DSL sampling with ReMDM sampler. Adapted from (Wang et al., [2025](https://arxiv.org/html/2605.12836#bib.bib678), Algorithm 1); in DSL, the ReMDM mask state is represented by zero SNR.

Input: DSL denoiser $\bm{x}_{\theta}$, sampling steps $T$, noise schedule $\alpha_{t}$, clean endpoint $\mathrm{SNR}_{\max}$, remask schedule $\sigma_{t}$.

Initialize: set every token embedding to the mask endpoint, i.e., $\operatorname{SNR}(\bm{z}_{T})=0$.

for $i=T,T-1,\ldots,1$ do

- Set $t=i/T$, $s=(i-1)/T$; set $\alpha_{t},\alpha_{s}$.
- Set $\sigma_{t}\in[0,\sigma_{t}^{\max}]$, where $\sigma_{t}^{\max}=\min\{1,(1-\alpha_{s})/\alpha_{t}\}$.
- Predict clean-token posteriors $\widehat{\bm{x}}=\bm{x}_{\theta}(\bm{z}_{t})$.
- Form the ReMDM endpoint posterior $p_{\theta}(\bm{z}_{s}\mid\bm{z}_{t})=q_{\sigma}^{\mathrm{DSL}}\bigl(\bm{z}_{s}\mid\bm{z}_{t},\bm{x}=\widehat{\bm{x}}\bigr)$, where ReMDM's mask atom is interpreted as $\operatorname{SNR}=0$ and its clean-token atom as $\operatorname{SNR}=\mathrm{SNR}_{\max}$.
- Sample $\bm{z}_{s}\sim p_{\theta}(\bm{z}_{s}\mid\bm{z}_{t})$.
- Realize sampled clean tokens at $\mathrm{SNR}_{\max}$; realize remasked tokens by resetting their SNR to 0.
- Set $\bm{z}_{t}\leftarrow\bm{z}_{s}$.

end for

Output: decoded tokens from the final clean-endpoint latents.
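The role of the cap $\sigma_{t}^{\max}$ in Algorithm 2 can be made concrete with a small sketch. It assumes the per-token ReMDM transition has the form given in Wang et al. (2025): a committed token is remasked with probability $\sigma_{t}$, and a masked token is revealed with probability $(\alpha_{s}-(1-\sigma_{t})\alpha_{t})/(1-\alpha_{t})$. Those formulas, the linear schedule $\alpha_{t}=1-t$, and the function names are assumptions made for illustration and should be checked against the cited paper.

```python
def sigma_max(alpha_s, alpha_t):
    """Largest admissible remasking probability for the step t -> s."""
    if alpha_t == 0.0:  # fully masked state: min{1, +inf} = 1
        return 1.0
    return min(1.0, (1.0 - alpha_s) / alpha_t)


def remdm_step_probs(alpha_s, alpha_t, sigma_t):
    """Per-token transition probabilities for one step (assumed ReMDM form).

    Returns (p_remask, p_reveal):
      p_remask: probability that a committed token returns to SNR = 0
      p_reveal: probability that a masked token is realized at SNR_max
    """
    assert 0.0 <= sigma_t <= sigma_max(alpha_s, alpha_t)
    p_remask = sigma_t
    p_reveal = (alpha_s - (1.0 - sigma_t) * alpha_t) / (1.0 - alpha_t)
    return p_remask, p_reveal


# Illustrative schedule alpha_t = 1 - t with T = 4 steps.
T = 4
schedule = []
for i in range(T, 0, -1):
    t, s = i / T, (i - 1) / T
    a_t, a_s = 1.0 - t, 1.0 - s
    sig = 0.5 * sigma_max(a_s, a_t)  # hypothetical policy: half the cap
    schedule.append((i, *remdm_step_probs(a_s, a_t, sig)))
```

Under this form, the probability that a masked token stays masked is $(1-\alpha_{s}-\sigma_{t}\alpha_{t})/(1-\alpha_{t})$, which is non-negative exactly when $\sigma_{t}\le(1-\alpha_{s})/\alpha_{t}$; that is why the cap appears in the algorithm.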
Having isolated the training-side gain with simple decoders, we next evaluate whether iterative masked refinement can further exploit the same DSL checkpoint. Table [1](https://arxiv.org/html/2605.12836#S4.T1) compares DSL with ReMDM-family decoders against published diffusion baselines under the controlled OWT protocol. Both DSL refinement rows substantially improve MAUVE across step budgets from $T=128$ to $T=1024$. The confidence-based schedule performs best at $T=512$, reaching MAUVE 0.707, while the step-budgeted ReMDM-loop schedule with principled $\eta_{\mathrm{cap}}(T)$ performs best at $T=128$, $T=256$, and $T=1024$, reaching 0.639, 0.661, and 0.722, respectively.
The comparison with the preceding ablation rows separates training and sampling effects. DSL-FT + MDLM and DSL-FT + ROAR show that DSL training already improves non-refinement decoding. The ReMDM-family rows then show that iterative remasking further unlocks the same checkpoint, improving from DSL-FT + MDLM's 0.495 and ROAR's 0.551 at $T=1024$ to 0.722 with ReMDM-loop. Thus DSL improves the denoiser first, and refinement policies further exploit the improved posterior.
Table 2: Few-step OWT generation with the DSL hybrid continuous-then-MDM sampler. All rows use the same DSL-finetuned checkpoint. For each $(T_{\mathrm{cont}},T_{\mathrm{MDM}})$, we report a balanced $\sigma_{\mathrm{switch}}$ from a small handoff sweep. The full sweep with MAUVE / GenPPL / SentEnt is in Appendix [F](https://arxiv.org/html/2605.12836#A6).
### 5.4 Hybrid continuous-then-MDM sampling
We also evaluate a hybrid sampler that composes two inference regimes within the same DSL checkpoint\. The sampler first runs a short continuous denoising stage, where each position carries soft finite\-SNR token evidence, and then switches to an MDM\-style endpoint stage that commits the remaining uncertainty into discrete tokens\. This experiment is designed as a path\-compatibility test: can one DSL checkpoint support both continuous intermediate states and discrete endpoint decoding in a single generation run?
Table [2](https://arxiv.org/html/2605.12836#S5.T2) reports few-step OWT generation with this hybrid path. At 48 total steps, the hybrid sampler reaches MAUVE 0.662, already exceeding the non-refinement DSL-FT + MDLM result at $T=1024$. With 80 total steps, it reaches 0.724, comparable to the best ReMDM-family result in Table [1](https://arxiv.org/html/2605.12836#S4.T1). The allocation between continuous and discrete stages matters: at the same nominal 64-step budget, $(T_{\mathrm{cont}},T_{\mathrm{MDM}})=(16,48)$ reaches 0.702, while $(32,32)$ reaches 0.589. We therefore view hybrid sampling as a promising DSL-specific inference path, but not yet as an exhaustively optimized decoding policy.
Table 3: Bits Per Character (BPC) on the Text8 test set. Continuous diffusion entries (Plaid, DSL) are ELBO upper bounds; discrete diffusion and AR entries are exact NLL, so direct numerical comparison across these categories is not meaningful.
### 5.5 Likelihood-based evaluation on Text8
The DSL framework admits two likelihood estimators following Section [2.4](https://arxiv.org/html/2605.12836#S2.SS4): the path-integral identity Eq. [6](https://arxiv.org/html/2605.12836#S2.E6) and the ROAR estimator Eq. [7](https://arxiv.org/html/2605.12836#S2.E7). Table [3](https://arxiv.org/html/2605.12836#S5.T3) reports BPC on Text8. DSL achieves $\leq 1.45$ BPC, the first continuous-state NLL estimate to approach the masked-discrete-diffusion range (MDLM $\leq 1.40$, MD4 $\leq 1.37$). Continuous-state estimates are upper bounds via ELBO and so cannot match the exact NLL accessible to masked and AR models; the relevant comparison is to other continuous-state baselines, where DSL improves on Plaid by $\geq 0.03$ BPC. Setup details are in Appendix [F](https://arxiv.org/html/2605.12836#A6).
## 6 Related Work
DSL connects two lines of work. Continuous diffusion LMs operate over noisy token embeddings and expose finite-SNR states, but typically rely on timestep/noise conditioning and have trailed masked discrete diffusion on language benchmarks (Li et al., [2022](https://arxiv.org/html/2605.12836#bib.bib660); Strudel et al., [2022](https://arxiv.org/html/2605.12836#bib.bib659); Dieleman et al., [2022](https://arxiv.org/html/2605.12836#bib.bib657); Gulrajani and Hashimoto, [2023](https://arxiv.org/html/2605.12836#bib.bib656)). Masked diffusion and ReMDM decode through endpoint mask/reveal states, with remasking enabling revision of visible tokens (Austin et al., [2021](https://arxiv.org/html/2605.12836#bib.bib65); Sahoo et al., [2024](https://arxiv.org/html/2605.12836#bib.bib663); Shi et al., [2024](https://arxiv.org/html/2605.12836#bib.bib664); Wang et al., [2025](https://arxiv.org/html/2605.12836#bib.bib678)). Related hybrid methods connect discrete and Gaussian processes through loss-level mixtures or alignments ([Sahoo et al.,](https://arxiv.org/html/2605.12836#bib.bib7); Pynadath et al., [2025](https://arxiv.org/html/2605.12836#bib.bib691)). DSL differs in that its stochastic-localization SDE constrains the entire intermediate path: posterior-mean drift plus Brownian noise gives finite-SNR states a path-level identity from masked to clean tokens. This SDE structure underlies the path-integral identity under the Bayes-optimal denoiser, whereas loss-level hybrids do not impose an analogous conservative field or path-independence guarantee. DSL is therefore complementary: unit-sphere posterior coordinates make the Bayes denoiser SNR-invariant, so continuous denoising, ROAR revealing, and masked remasking become different per-token SNR paths through one time-agnostic model.
## 7 Conclusion
We introduced Discrete Stochastic Localization (DSL), a continuous-state framework for discrete sequence generation with unit-sphere token embeddings. By making the clean-token posterior depend only on the noisy state rather than on a nominal timestep, DSL yields a time-agnostic denoiser and a mixture-of-tokens interface for standard Transformer/DiT backbones. A single DSL checkpoint supports masked refinement, random-order AR, and hybrid continuous-then-MDM sampling, improving OpenWebText generation across paths and approaching the MDM likelihood range on Text8. These results suggest parameterizing continuous diffusion for language as posterior-state denoising over noisy tokens.
Limitations. Our evaluation is limited to OpenWebText generation and Text8 likelihood. The reported ReMDM-family and hybrid samplers are not the result of an exhaustive decoding-policy search; they are intended to test whether DSL training improves the underlying posterior and whether one checkpoint can be queried along multiple SNR paths. More optimized remasking schedules, hybrid handoff rules, larger models, longer contexts, and conditional generation remain important directions for future work.
## References
- [1] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2021) Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 34, pp. 17981–17993.
- [2] S. Dieleman, L. Sartran, A. Roshannai, N. Savinov, Y. Ganin, P. H. Richemond, A. Doucet, R. Strudel, C. Dyer, C. Durkan, et al. (2022) Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089.
- [3] I. Gulrajani and T. B. Hashimoto (2023) Likelihood-based diffusion language models. Advances in Neural Information Processing Systems 36, pp. 16693–16715.
- [4] D. Guo, S. Shamai, and S. Verdú (2005) Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory 51(4), pp. 1261–1282.
- [5] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851.
- [6] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet (2022) Video diffusion models. Advances in Neural Information Processing Systems 35, pp. 8633–8646.
- [7] E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans (2021) Autoregressive diffusion models. arXiv preprint arXiv:2110.02037.
- [8] E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling (2021) Argmax flows and multinomial diffusion: learning categorical distributions. Advances in Neural Information Processing Systems 34, pp. 12454–12465.
- [9] X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman (2025) Self forcing: bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009.
- [10] T. Karras, M. Aittala, T. Aila, and S. Laine (2022) Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364.
- [11] X. Kong, R. Brekelmans, and G. Ver Steeg (2023) Information-theoretic diffusion. In International Conference on Learning Representations. [Link](https://arxiv.org/abs/2302.03792)
- [12] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro (2020) Diffwave: a versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761.
- [13] X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto (2022) Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems 35, pp. 4328–4343.
- [14] A. Lou, C. Meng, and S. Ermon (2024) Discrete diffusion modeling by estimating the ratios of the data distribution. In Forty-first International Conference on Machine Learning.
- [15] M. Mahoney (2006) Large Text Compression Benchmark. [https://www.mattmahoney.net/dc/text.html](https://www.mattmahoney.net/dc/text.html). Accessed: 2025-05-11.
- [16] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. arXiv preprint arXiv:2502.09992.
- [17] D. P. Palomar and S. Verdú (2005) Gradient of mutual information in linear vector Gaussian channels. IEEE Transactions on Information Theory 52(1), pp. 141–154.
- [18] K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui (2021) Mauve: measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems 34, pp. 4816–4828.
- [19] P. Pynadath, J. Shi, and R. Zhang (2025) Candi: hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510.
- [20] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, et al. (2020) Hopfield networks is all you need. arXiv preprint arXiv:2008.02217.
- [21] L. Rout, C. Caramanis, and S. Shakkottai (2025) Anchored diffusion language model. arXiv preprint arXiv:2505.18456.
- [22] S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems 37, pp. 130136–130184.
- [23] S. S. Sahoo, J. Deschenaux, A. Gokaslan, G. Wang, J. T. Chiu, and V. Kuleshov. The diffusion duality. In ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy.
- [24] J. Shi, K. Han, Z. Wang, A. Doucet, and M. Titsias (2024) Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems 37, pp. 103131–103167.
- [25] A. Shih, D. Sadigh, and S. Ermon (2022) Training and inference on any-order autoregressive models the right way. Advances in Neural Information Processing Systems 35, pp. 2762–2775.
- [26] J. Song, C. Meng, and S. Ermon (2020) Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
- [27] R. Strudel, C. Tallec, F. Altché, Y. Du, Y. Ganin, A. Mensch, W. Grathwohl, N. Savinov, S. Dieleman, L. Sifre, et al. (2022) Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236.
- [28] Q. Sun, Z. Jiang, H. Zhao, and K. He (2025) Is noise conditioning necessary for denoising generative models? arXiv preprint arXiv:2502.13129.
- [29] D. Tran, K. Vafa, K. Agrawal, L. Dinh, and B. Poole (2019) Discrete flows: invertible generative models of discrete data. Advances in Neural Information Processing Systems 32.
- [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.
- [31] G. Wang, Y. Schiff, S. S. Sahoo, and V. Kuleshov (2025) Remasking discrete diffusion models with inference-time scaling. arXiv preprint arXiv:2503.00307.
- [32] Y. Wu, Y. Luo, X. Kong, E. E. Papalexakis, and G. V. Steeg (2024) Your diffusion model is secretly a noise classifier and benefits from contrastive training. Advances in Neural Information Processing Systems 37, pp. 32370–32399.
- [33] Z. Ziegler and A. Rush (2019) Latent normalizing flows for discrete sequences. In International Conference on Machine Learning, pp. 7673–7682.
## Appendix Overview
[Appendix A: Mathematical Derivations](https://arxiv.org/html/2605.12836#A1)
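- §A.1 Notation and Summary Table
- §A.2 Optimal Denoiser is SNR-Invariant
- §A.3 Exact Likelihood over Arbitrary SNR Paths
- §A.4 Equivalence of Conditional and Unconditional Dynamics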
- §A\.5
- §A\.6
- §A\.7
- §A\.8
- §A\.9
- §A\.10
[Appendix C: Training and Sampling Details](https://arxiv.org/html/2605.12836#A3)
- §C\.1
- §C\.2
- §C\.3Masked Refinement with a DSL Checkpoint
- §C\.4
- §C\.5
- §C\.6
Appendix D: Mechanistic Analysis
- §D\.1Cyclic Toy Setup and Correction Example
- §D\.2
- §D\.3Robustness to Self\-Generated Intermediate Drafts
- §D\.4Sampling Diagnostics and Over\-Refinement
Appendix E: Calibration and Endpoint\-Smoothing Ablation
- §E\.1Atomic vs\. Smoothed ROAR Endpoints
- §E\.2Near\-Clean Calibration
- §E\.3Downstream Step–Quality Tradeoff
[Appendix F: Additional Experiments](https://arxiv.org/html/2605.12836#A6)
- §F\.1OpenWebText Generation
- §F\.2Text8 Likelihood Evaluation
Appendix I: Additional Qualitative Samples
- §I\.1Text8 Samples
- §I\.2OpenWebText Samples
## Appendix A Mathematical Derivations
This appendix collects the derivations underlying the DSL method in Section [2](https://arxiv.org/html/2605.12836#S2). The main text only uses three consequences of these results: (i) the denoiser is realized-state invariant under unit-sphere token geometry, (ii) exact likelihood views motivate both continuous and ROAR-style support in training, and (iii) the induced posterior-mean map is smooth. Here we provide the full proofs and auxiliary technical remarks.
### A.1 Notation and Summary Table
Table 4: Summary of notation and key relations.
### A.2 Optimal Denoiser is SNR-Invariant
We now derive the optimal denoiser for the noise channel with per-token SNR described in the main text. We first rewrite the posterior mean with Bayes' rule, then expand the Gaussian noise channel.

$$
\begin{aligned}
\hat{\bm{x}}(\bm{z},\bm{\gamma}) &\equiv \mathbb{E}_{p_{\bm{\gamma}}(\bm{x}|\bm{z})}[\bm{x}] = \frac{\sum_{\bm{x}} p_{\bm{\gamma}}(\bm{z}|\bm{x})P(\bm{x})\,\bm{x}}{p_{\bm{\gamma}}(\bm{z})} = \frac{\sum_{\bm{x}} p_{\bm{\gamma}}(\bm{z}|\bm{x})P(\bm{x})\,\bm{x}}{\sum_{\bar{\bm{x}}} p_{\bm{\gamma}}(\bm{z}|\bar{\bm{x}})P(\bar{\bm{x}})} \\
&= \frac{\sum_{\bm{x}} \exp\bigl(-\tfrac{1}{2}\|\bm{z}-\bm{\gamma}\bm{x}\|^{2}/\bm{\gamma}\bigr)/\mathcal{Z}(\bm{\gamma})\,P(\bm{x})\,\bm{x}}{\sum_{\bar{\bm{x}}} \exp\bigl(-\tfrac{1}{2}\|\bm{z}-\bm{\gamma}\bar{\bm{x}}\|^{2}/\bm{\gamma}\bigr)/\mathcal{Z}(\bm{\gamma})\,P(\bar{\bm{x}})} \\
&= \frac{\sum_{\bm{x}} \exp(\bm{z}\cdot\bm{x})P(\bm{x})\,\bm{x}}{\sum_{\bar{\bm{x}}} \exp(\bm{z}\cdot\bar{\bm{x}})P(\bar{\bm{x}})} \\
\hat{\bm{x}}(\bm{z},\bm{\gamma}) &= \hat{\bm{x}}(\bm{z}) = \mathbb{E}_{\bm{x}}[\bm{x}\,e^{\bm{x}\cdot\bm{z}}]/\mathbb{E}_{\bm{x}}[e^{\bm{x}\cdot\bm{z}}]
\end{aligned}
$$

The division and multiplication between $\bm{\gamma}$ and $\bm{z}$ should be understood to be applied component-wise, per token.
To see the last simplification, expand the exponent per token: $-\tfrac{1}{2}\|\bm{z}-\gamma\bm{x}\|^{2}/\gamma = -\tfrac{1}{2}\|\bm{z}\|^{2}/\gamma + \bm{z}\cdot\bm{x} - \tfrac{1}{2}\gamma\|\bm{x}\|^{2}$. The $\|\bm{z}\|^{2}$ term does not depend on $\bm{x}$, so it cancels between numerator and denominator, as does $\mathcal{Z}(\bm{\gamma})$. Because the embeddings have unit norm, the $\exp(-\tfrac{1}{2}\bm{\gamma}\|\bm{x}\|^{2})$ terms are constant and also cancel, leaving no dependence on $\bm{\gamma}$ at all. The distribution $P(\bm{x})e^{\bm{z}\cdot\bm{x}}/\mathcal{Z}(\bm{z})$ is sometimes referred to as an exponentially tilted distribution of $P(\bm{x})$. The posterior mean can also be written as $\nabla_{\bm{z}}\log\mathbb{E}_{\bm{x}}[e^{\bm{x}\cdot\bm{z}}]$, the gradient of the cumulant generating function.
The results above also hold for the simpler case where $\gamma_{i}(t)=t$, leading to a denoiser that is invariant to $t$. Recent work has also empirically noted that even for traditional diffusion models, noise conditioning is not always necessary or useful [[28](https://arxiv.org/html/2605.12836#bib.bib28)].
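The invariance can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions: a single token position, a random unit-norm embedding table `emb`, and an arbitrary prior `prior` (all hypothetical). It compares the posterior mean computed from the full Gaussian likelihood at several SNR values with the $\gamma$-free tilted form.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                                          # toy vocabulary size, embedding dim
emb = rng.normal(size=(V, d))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)    # unit-sphere embeddings
prior = rng.dirichlet(np.ones(V))                    # arbitrary prior P(x)


def softmax(logits):
    w = np.exp(logits - logits.max())
    return w / w.sum()


def posterior_mean_gaussian(z, gamma):
    """Posterior mean from the channel z = gamma * x + sqrt(gamma) * eps."""
    log_lik = -0.5 * np.sum((z - gamma * emb) ** 2, axis=1) / gamma
    return softmax(log_lik + np.log(prior)) @ emb


def posterior_mean_tilted(z):
    """Exponentially tilted form; takes no gamma argument."""
    return softmax(emb @ z + np.log(prior)) @ emb


z = rng.normal(size=d)
gaps = [np.abs(posterior_mean_gaussian(z, g) - posterior_mean_tilted(z)).max()
        for g in (0.1, 1.0, 10.0)]
```

The entries of `gaps` should be at the level of floating-point round-off. If the row normalization of `emb` is removed, the two functions disagree, which is the role the unit-sphere constraint plays in the derivation.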
### A.3 Exact Likelihood over Arbitrary SNR Paths
We make use of the following theorem, re-stated with our setting and notation, from [[17](https://arxiv.org/html/2605.12836#bib.bib13)], to prove [Equation 6](https://arxiv.org/html/2605.12836#S2.E6).
###### Theorem A.1 ([[17](https://arxiv.org/html/2605.12836#bib.bib13)], Thm. 5)
Consider the signal model

$$
\bm{z}_{i}=\gamma_{i}\bm{x}_{i}+\sqrt{\gamma_{i}}\,\bm{\epsilon}_{i},
$$

where $\bm{x}$ is arbitrarily distributed with finite second-order moments, and $\bm{\epsilon}_{i}\sim\mathcal{N}(0,\mathbf{I})$ is independent Gaussian noise.
Then the gradient of the Kullback–Leibler divergence between the conditional output distribution $p_{\bm{\gamma}}(\bm{z}|\bm{x})$ and the unconditional output distribution $p_{\bm{\gamma}}(\bm{z})$ is

$$
\nabla_{\bm{\gamma}}D\big(p_{\bm{\gamma}}(\bm{z}|\bm{x})\,\|\,p_{\bm{\gamma}}(\bm{z})\big)=\tfrac{1}{2}\mathbf{E}(\bm{x},\bm{\gamma}),
$$

where $E_{i}(\bm{x},\bm{\gamma})\equiv\mathbb{E}_{p_{\bm{\gamma}}(\bm{z}|\bm{x})}[\|\bm{x}_{i}-\hat{\bm{x}}_{i}(\bm{z})\|^{2}]$, and $\hat{\bm{x}}(\bm{z})=\mathbb{E}_{p_{\bm{\gamma}}(\bm{x}|\bm{z})}[\bm{x}]$ is the MMSE estimator.
We use this theorem in conjunction with the fundamental theorem of calculus for line integrals to obtain the following.

$$
\int_{C}d\bm{\gamma}\cdot\nabla_{\bm{\gamma}}D\big(p_{\bm{\gamma}}(\bm{z}|\bm{x})\,\|\,p_{\bm{\gamma}}(\bm{z})\big)=\tfrac{1}{2}\int_{C}d\bm{\gamma}\cdot\mathbf{E}(\bm{x},\bm{\gamma})=\left.D\big(p_{\bm{\gamma}}(\bm{z}|\bm{x})\,\|\,p_{\bm{\gamma}}(\bm{z})\big)\right|_{\bm{\gamma}_{0}}^{\bm{\gamma}_{1}} \qquad (14)
$$

Here, $\bm{\gamma}_{0},\bm{\gamma}_{1}$ define the endpoints of the contour, which must be piecewise smooth. Next, we consider the two limits.

$$
\begin{aligned}
\lim_{\bm{\gamma}\rightarrow 0}D\big(p_{\bm{\gamma}}(\bm{z}|\bm{x})\,\|\,p_{\bm{\gamma}}(\bm{z})\big)&=0 &&(15)\\
\lim_{\bm{\gamma}\rightarrow\infty}D\big(p_{\bm{\gamma}}(\bm{z}|\bm{x})\,\|\,p_{\bm{\gamma}}(\bm{z})\big)&=-\log P(\bm{x}) &&(16)
\end{aligned}
$$

The second limit can be observed as follows.

$$
\begin{aligned}
D\big(p_{\bm{\gamma}}(\bm{z}|\bm{x})\,\|\,p_{\bm{\gamma}}(\bm{z})\big)&=\mathbb{E}_{p_{\bm{\gamma}}(\bm{z}|\bm{x})}\left[\log\frac{p_{\bm{\gamma}}(\bm{z}|\bm{x})}{p_{\bm{\gamma}}(\bm{z})}\right]\\
&=\mathbb{E}_{p_{\bm{\gamma}}(\bm{z}|\bm{x})}\left[\log\frac{P(\bm{x}|\bm{z})}{P(\bm{x})}\right]\\
&=\mathbb{E}_{p_{\bm{\gamma}}(\bm{z}|\bm{x})}\left[\log P(\bm{x}|\bm{z})\right]-\log P(\bm{x}) &&(17)
\end{aligned}
$$

Because $P(\bm{x})$ is discrete, $\lim_{\bm{\gamma}\to\infty}P(\bm{x}\mid\bm{z})=1$ almost surely, so the expectation term vanishes in the limit and only $-\log P(\bm{x})$ remains.
Combining [Equation 17](https://arxiv.org/html/2605.12836#A1.E17) and [Equation 14](https://arxiv.org/html/2605.12836#A1.E14), with the contour starting at $\bm{\gamma}_{0}=0$ where the divergence vanishes, we obtain the following.

$$
-\log P(\bm{x})=\tfrac{1}{2}\int_{C_{\bm{\gamma}}}d\bar{\bm{\gamma}}\cdot\mathbf{E}(\bm{x},\bar{\bm{\gamma}})-\mathbb{E}_{p_{\bm{\gamma}}(\bm{z}|\bm{x})}\left[\log P(\bm{x}|\bm{z})\right] \qquad (18)
$$

Here the curve $C_{\bm{\gamma}}$ is piecewise smooth, begins at $\bm{\gamma}=0$, and ends at $\bm{\gamma}$. We can use this result to obtain a path that starts at zero and ends at $t$, and [Equation 6](https://arxiv.org/html/2605.12836#S2.E6) when we let the end of the contour go to infinity. $\blacksquare$
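The finite-endpoint identity (Equation 18) can be verified numerically in the simplest possible setting. The sketch below assumes a single position with two tokens embedded at $\pm 1$ (unit norm in one dimension) and a toy prior; the grids, the prior, and the function names are illustrative choices. Expectations over the Gaussian noise and the integral over $\gamma$ are both computed with the trapezoidal rule.

```python
import numpy as np

tokens = np.array([1.0, -1.0])     # unit-norm "embeddings" in d = 1
prior = np.array([0.7, 0.3])       # toy prior P(x), chosen for illustration


def trapezoid(y, dx):
    """Trapezoidal rule along the last axis of a uniformly sampled array."""
    return dx * (y[..., 1:] + y[..., :-1]).sum(axis=-1) / 2.0


def log_posterior(z):
    """log P(x' | z) for both tokens, using the gamma-free tilted form."""
    logits = np.log(prior) + z[..., None] * tokens
    return logits - np.logaddexp(logits[..., 0], logits[..., 1])[..., None]


eps = np.linspace(-9.0, 9.0, 1201)               # noise grid
phi = np.exp(-0.5 * eps**2) / np.sqrt(2 * np.pi)  # standard normal density
d_eps = eps[1] - eps[0]
gammas = np.linspace(0.0, 4.0, 801)              # straight path from 0 to 4
d_gamma = gammas[1] - gammas[0]


def nll_estimate(k):
    """Right-hand side of the identity for token index k."""
    x = tokens[k]
    z = gammas[:, None] * x + np.sqrt(gammas)[:, None] * eps[None, :]
    logp = log_posterior(z)                           # shape (G, E, 2)
    x_hat = (np.exp(logp) * tokens).sum(axis=-1)      # posterior mean
    mmse = trapezoid(phi * (x - x_hat) ** 2, d_eps)   # E(x, gamma) per gamma
    path_term = 0.5 * trapezoid(mmse, d_gamma)
    endpoint_term = trapezoid(phi * logp[-1, :, k], d_eps)
    return path_term - endpoint_term


estimates = np.array([nll_estimate(k) for k in range(2)])
exact = -np.log(prior)
```

Under these assumptions `estimates` should agree with `exact` up to quadrature error. The identity holds for any endpoint, so the path need not be extended to infinite SNR; the residual term accounts for the remaining uncertainty at the chosen endpoint.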
### A\.4Equivalence of Conditional and Unconditional Dynamics
Consider the marginals generated conditioned on the data with the following noise channel\.
$$p_{\bm{\gamma}}(\bm{z}) \equiv \sum_{\bm{x}} p_{\bm{\gamma}}(\bm{z}|\bm{x})\,P(\bm{x})$$
where $p_{\bm{\gamma}}(\bm{z}|\bm{x})$ is a Gaussian noise channel, which can be defined as follows,
$$\bm{z}_i = \gamma_i\,\bm{x}_i + \sqrt{\gamma_i}\,\bm{\epsilon}_i, \qquad \bm{\epsilon}\sim\mathcal{N}(0,I).$$
This is equivalent to the conditional SDE dynamics,
$$d\bm{z}_i = \bm{x}_i\,\dot{\gamma}_i\,dt + \sqrt{\dot{\gamma}_i}\,dW_i, \qquad \bm{x}\sim P(\bm{x}), \tag{19}$$
as can be seen from direct integration\. We can use the Fokker\-Planck equation to understand the marginal dynamics of this process\. Consider a particular contour, $\bm{\gamma}(t)$, as defined in the text, which goes from zero to infinity\. At $\bm{\gamma}(0)=0$, we have
$$p_{\bm{\gamma}(0)}(\bm{z}) = \delta(\bm{z}).$$
The Fokker\-Planck equation gives the following for the conditional dynamics\.
$$\frac{\partial}{\partial t}p_{\bm{\gamma}(t)}(\bm{z}\mid\bm{x}) = -\sum_{i}\nabla_{\bm{z}_i}\cdot\left(\dot{\gamma}_i\,\bm{x}_i\,p_{\bm{\gamma}}(\bm{z}\mid\bm{x})\right) + \frac{1}{2}\sum_{i}\dot{\gamma}_i\,\Delta_{\bm{z}_i}\,p_{\bm{\gamma}}(\bm{z}\mid\bm{x})$$
By marginalizing over $P(\bm{x})$, we get the following\.
\begin{align}
\frac{\partial}{\partial t}p_{\bm{\gamma}(t)}(\bm{z}) &= \sum_{\bm{x}}P(\bm{x})\,\frac{\partial}{\partial t}p_{\bm{\gamma}(t)}(\bm{z}\mid\bm{x}) \tag{20}\\
&= -\sum_{i}\nabla_{\bm{z}_i}\cdot\left(\dot{\gamma}_i\sum_{\bm{x}}P(\bm{x})\,\bm{x}_i\,p_{\bm{\gamma}}(\bm{z}\mid\bm{x})\right) + \frac{1}{2}\sum_{i}\dot{\gamma}_i\,\Delta_{\bm{z}_i}\,p_{\bm{\gamma}}(\bm{z}) \tag{21}\\
&= -\sum_{i}\nabla_{\bm{z}_i}\cdot\left(\dot{\gamma}_i\,\hat{\bm{x}}_i(\bm{z})\,p_{\bm{\gamma}}(\bm{z})\right) + \frac{1}{2}\sum_{i}\dot{\gamma}_i\,\Delta_{\bm{z}_i}\,p_{\bm{\gamma}}(\bm{z}) \tag{22}
\end{align}
In the last line, we used Bayes' rule, $P(\bm{x})\,p_{\bm{\gamma}}(\bm{z}\mid\bm{x}) = P(\bm{x}\mid\bm{z})\,p_{\bm{\gamma}}(\bm{z})$, and the definition of the optimal denoiser as the posterior mean, $\hat{\bm{x}}_i(\bm{z}) = \sum_{\bm{x}}\bm{x}_i\,P(\bm{x}\mid\bm{z})$\.
The goal is to design an unconditional dynamics that obeys the same evolution of the marginal\. The unconditional dynamics are defined as follows\.
$$d\bm{z}_i = \hat{\bm{x}}_i(\bm{z})\,\dot{\gamma}_i\,dt + \sqrt{\dot{\gamma}_i}\,dW_i \tag{23}$$
Applying the Fokker\-Planck equation to this expression directly matches [Equation 22](https://arxiv.org/html/2605.12836#A1.E22)\. $\blacksquare$
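The equivalence of the two dynamics can be illustrated with a small simulation\. The sketch below assumes a scalar token $x\in\{-1,+1\}$ with a uniform prior and the linear contour $\gamma(t)=t$, so that $\hat{x}(z)=\tanh(z)$; it compares the second moment of $z_T$ drawn directly from the noise channel with that of an Euler\-Maruyama simulation of the unconditional SDE\. The toy distribution, step size, and sample count are illustrative choices, not the paper's settings\.

```python
import math
import random

def sample_conditional(T, rng):
    """Draw z_T from the channel z = gamma*x + sqrt(gamma)*eps with gamma(T) = T."""
    x = rng.choice((-1.0, 1.0))
    return T * x + math.sqrt(T) * rng.gauss(0.0, 1.0)

def sample_unconditional(T, dt, rng):
    """Euler-Maruyama for dz = x_hat(z) dt + dW with x_hat(z) = tanh(z), z(0) = 0."""
    z = 0.0
    for _ in range(int(round(T / dt))):
        z += math.tanh(z) * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return z

def second_moment(samples):
    return sum(s * s for s in samples) / len(samples)

rng = random.Random(0)
T, dt, n = 2.0, 0.02, 4000
m2_cond = second_moment([sample_conditional(T, rng) for _ in range(n)])
m2_uncond = second_moment([sample_unconditional(T, dt, rng) for _ in range(n)])
# Analytic value for this toy channel: E[z_T^2] = T^2 + T.
print(m2_cond, m2_uncond, T * T + T)
```

Both estimates should fluctuate around the same analytic value, up to Monte Carlo noise and the discretization error of the Euler scheme\.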
### A\.5Autoregressive Contours as a Special Case
The continuous SNR path formulation also recovers the standard autoregressive factorization as a special contour choice\. In particular, when the contour reveals one token at a time while treating previously revealed tokens as effectively clean and future tokens as maximally corrupted, the path\-integral likelihood reduces to the standard chain rule\. For completeness, we isolate this argument here, since the main text only needs the high\-level consequence\.
Conditional denoisers\. Replacing $P(\bm{x})$ above with $P(\bm{x}|\bm{y})$ gives the conditional denoiser, and we still get a result invariant to SNR\.
$$\hat{\bm{x}}(\bm{z},\bm{y},\bm{\gamma}) = \hat{\bm{x}}(\bm{z},\bm{y}) = \mathbb{E}_{P(\bm{x}|\bm{y})}\left[\bm{x}\,e^{\bm{x}\cdot\bm{z}}\right] / \mathbb{E}_{P(\bm{x}|\bm{y})}\left[e^{\bm{x}\cdot\bm{z}}\right]$$
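The SNR invariance can be checked numerically\. The sketch below computes the posterior mean in two ways: from the explicit Gaussian likelihood of the channel $\bm{z}=\gamma\bm{x}+\sqrt{\gamma}\,\bm{\epsilon}$ at a given $\gamma$, and from the $\gamma$\-free formula above\. The three unit\-norm token embeddings, the prior \(standing in for $P(\bm{x}|\bm{y})$ at one fixed $\bm{y}$\), and the test point are hypothetical values chosen for illustration\.

```python
import math

def _normalized_mean(weights, tokens, dim):
    total = sum(weights)
    return [sum(w * x[j] for w, x in zip(weights, tokens)) / total for j in range(dim)]

def posterior_mean_explicit(z, gamma, tokens, prior):
    """Posterior mean using the Gaussian likelihood N(z; gamma*x, gamma*I)."""
    weights = []
    for x, p in zip(tokens, prior):
        sq_dist = sum((zj - gamma * xj) ** 2 for zj, xj in zip(z, x))
        weights.append(p * math.exp(-sq_dist / (2.0 * gamma)))
    return _normalized_mean(weights, tokens, len(z))

def posterior_mean_invariant(z, tokens, prior):
    """Posterior mean from E[x exp(x.z)] / E[exp(x.z)], with no reference to gamma."""
    weights = [p * math.exp(sum(xj * zj for xj, zj in zip(x, z)))
               for x, p in zip(tokens, prior)]
    return _normalized_mean(weights, tokens, len(z))

# Hypothetical unit-norm embeddings on the circle and a conditional prior P(x | y).
tokens = [(math.cos(a), math.sin(a)) for a in (0.0, 2.0, 4.0)]
prior_given_y = [0.5, 0.3, 0.2]
z = (0.7, -0.4)

reference = posterior_mean_invariant(z, tokens, prior_given_y)
gaps = []
for gamma in (0.1, 1.0, 10.0):
    explicit = posterior_mean_explicit(z, gamma, tokens, prior_given_y)
    gaps.append(max(abs(a - b) for a, b in zip(explicit, reference)))
print(gaps)  # all entries should be at floating-point noise level
```

The agreement relies on the unit\-norm constraint: the $\gamma$\-dependent factors in the Gaussian likelihood are then the same for every token and cancel in the normalization\.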
We used the following result to show the connection between probability estimates using certain contours and autoregressive models\.
$$\begin{aligned}
-\log P(\bm{x}) &= \frac{1}{2}\int_{C_{AR}}\mathbf{E}(\bm{x},\bm{\gamma})\cdot d\bm{\gamma} = \frac{1}{2}\sum_{i=1}^{L}\int_{C_i}\mathbf{E}(\bm{x},\bm{\gamma})\cdot d\bm{\gamma} \\
&= \sum_{i=1}^{L}\frac{1}{2}\int_{0}^{\infty}d\gamma_i\,E_i\big(\bm{x},(\infty,\dotsc,\gamma_i,0,\dotsc)\big) = -\sum_{i=1}^{L}\log P(\bm{x}_i|\bm{x}_{<i})
\end{aligned}$$
To show this, we need to demonstrate the last equality explicitly\.
$$-\log P(\bm{x}_i\mid\bm{x}_{<i}) = \frac{1}{2}\int_{0}^{\infty}d\gamma_i\,E_i\big(\bm{x},\bm{\gamma}=(\infty,\dotsc,\gamma_i,0,\dotsc)\big) \tag{24}$$
First, we rewrite the right\-hand side\.
$$\begin{aligned}
\frac{1}{2}\int_{0}^{\infty}d\gamma_i\,E_i\big(\bm{x},\bm{\gamma}=(\infty,\dotsc,\gamma_i,0,\dotsc)\big) &= \frac{1}{2}\int_{0}^{\infty}d\gamma_i\,\mathbb{E}_{p_{\bm{\gamma}}(\bm{z}\mid\bm{x})}\left[\left\|\bm{x}_i-\hat{\bm{x}}_i(\bm{z})\right\|^2\right] \\
&= \frac{1}{2}\int_{0}^{\infty}d\gamma_i\,\mathbb{E}_{p_{\gamma_i}(\bm{z}_i\mid\bm{x}_i)}\left[\left\|\bm{x}_i-\hat{\bm{x}}_i(\bm{x}_{<i},\bm{z}_i,0,\dotsc)\right\|^2\right]
\end{aligned}$$
In the last line, we note that tokens at SNR 0 carry no information, so the optimal denoiser is conditioned on a constant, zero, at those positions\. For tokens with infinite SNR, $\bm{z}_{<i}$ is equivalent in distribution to $\bm{x}_{<i}$, which we substitute in the argument, with a slight abuse of notation\.
Next, we see what happens when we try to write $P(\bm{x}_i|\bm{x}_{<i})$ in terms of the optimal *conditional denoiser* for denoising only $\bm{z}_i$, using [Equation 6](https://arxiv.org/html/2605.12836#S2.E6), where the contours simplify to 1\-d integrals as we only have one SNR, $\gamma_i$\. In that case we know we can write the probability in terms of the conditional denoiser, which we will call $\tilde{\bm{x}}(\bm{z}_i,\bm{x}_{<i})$ to distinguish it from $\hat{\bm{x}}$\.
$$-\log P(\bm{x}_i|\bm{x}_{<i}) = \frac{1}{2}\int_{0}^{\infty}d\gamma_i\,\mathbb{E}_{p_{\gamma_i}(\bm{z}_i|\bm{x}_i)}\left[\left\|\bm{x}_i-\tilde{\bm{x}}_i(\bm{z}_i,\bm{x}_{<i})\right\|^2\right]$$
We can see that this matches the expression above, except for the denoisers\. These denoisers are conditioned on the same information and are MMSE denoisers for the same denoising problem, so they must be equivalent, and the equality in [Equation 24](https://arxiv.org/html/2605.12836#A1.E24) must hold\.
### A\.6Exact ROAR Likelihood Decomposition and Estimator
This subsection derives the exact random\-order conditional likelihood that motivates the ROAR\-style endpoint branch in DSL training\. The purpose is not to claim that the main experiments directly optimize this objective\. Rather, it provides a second exact likelihood source, complementary to the continuous path\-integral likelihood, and justifies including masking/reveal\-like endpoint states in the training support\.
In DSL, we point out that masked diffusion modeling is equivalent to paths in SNR space where we completely denoise one variable \(going from zero SNR to infinite SNR\), conditioned on zero noise \(infinite SNR\) for the already “unmasked” variables and infinite noise \(zero SNR\) for the “masked” variables\. If we choose our SNR paths to be the ROAR paths, then we just need to model conditional probabilities of the next variable, given previously unmasked variables, similar to masked diffusion models\. However, this idea doesn’t exactly match up with practical implementations of masked diffusion models like \[[16](https://arxiv.org/html/2605.12836#bib.bib10)\] because they write their objective \(Eq\. 3 in \[[16](https://arxiv.org/html/2605.12836#bib.bib10)\], for example\) in terms of probability of an unmasked set, conditioned on a masked set\. Below, we derive a similar expression starting from the perspective that we are trying to model ROAR SNR paths\. Unlike the result in \[[16](https://arxiv.org/html/2605.12836#bib.bib10)\], our derivation is path\-based and makes no explicit mention of time\.
Consider a multivariate probability distribution, $p(x_1,\ldots,x_n)$\. We can always decompose this distribution autoregressively as $p(x_1,\ldots,x_n)=\prod_{i=1}^{n}p(x_i|x_{1:i-1})$\. We could in principle use any ordering of the variables for this decomposition\. Let $\pi$ be a permutation of $\{1,\dots,n\}$, indexed with vector notation\. For $n=3$ we could have $\pi_{1:3}=(3,1,2)$, for example\. For a particular permutation, $\pi$, we could write the AR decomposition as follows\.
$$p(x_1,\ldots,x_n)=\prod_{i=1}^{n}p(x_{\pi_i}|x_{\pi_{1:i-1}})$$
There are $n!$ permutations, so we can take the average over the set of all permutations, $\mathcal{S}_n$, and switch to log probabilities for convenience\.
$$\log p(x_1,\ldots,x_n)=\frac{1}{n!}\sum_{\pi\in\mathcal{S}_n}\sum_{i=1}^{n}\log p(x_{\pi_i}|x_{\pi_{1:i-1}}) \tag{25}$$
Now, note that this expression has many equivalent terms that can be grouped together\. We are summing over a total of $n\cdot n!$ log probability terms, but many of them coincide, e\.g\., $\log p(x_3|x_1,x_2)=\log p(x_3|x_2,x_1)$, i\.e\., the order of the conditioned information doesn’t matter\. This suggests that we should re\-write this sum in terms of sets\. Let $[n]=\{1,\dots,n\}$ and, for $A\subseteq[n]$, write $x_A=\{x_i: i\in A\}$\. We will have terms of the form $\log p(x_i|x_A)$ \(where $i$ is not in the subset $A$\) and we need to count how many such terms appear in the original sum\. The number of permutations that list the $|A|$ indices in $A$ first \(in any order\), then $i$, and then the remaining $(n-|A|-1)$ indices in arbitrary order is $|A|!\,(n-|A|-1)!$\.
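The counting argument can be verified by brute force for a small $n$\. The sketch below enumerates all permutations for $n=4$ and checks, for every pair $(A,i)$ with $i\notin A$, that the number of permutations placing the indices of $A$ first and $i$ immediately after equals $|A|!\,(n-|A|-1)!$\. The helper name and the choice of $n$ are illustrative\.

```python
import itertools
import math

def count_orders(n, A, i):
    """Number of permutations of range(n) that list A first (any order), then i."""
    k = len(A)
    return sum(
        1
        for perm in itertools.permutations(range(n))
        if set(perm[:k]) == set(A) and perm[k] == i
    )

n = 4
checks = []
total_count = 0
for k in range(n):
    for A in itertools.combinations(range(n), k):
        for i in sorted(set(range(n)) - set(A)):
            c = count_orders(n, A, i)
            checks.append(c == math.factorial(k) * math.factorial(n - k - 1))
            total_count += c

total_terms = n * math.factorial(n)  # terms in the double sum over permutations and positions
distinct_terms = len(checks)         # distinct (A, i) pairs
print(all(checks), total_terms, total_count, distinct_terms)
```

Each term of the double sum corresponds to exactly one pair $(A,i)$, which is why the grouped counts add up to the total number of terms\.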
$$\begin{aligned}
\log p(x_1,\ldots,x_n) &= \sum_{A\subsetneq[n]}\frac{|A|!\,(n-|A|-1)!}{n!}\sum_{i\in[n]\setminus A}\log p(x_i|x_A)\\
&= \sum_{k=0}^{n-1}\frac{n}{n}\sum_{A\subseteq[n]:|A|=k}\frac{k!\,(n-k-1)!}{n!}\,\frac{n-k}{n-k}\sum_{i\in[n]\setminus A}\log p(x_i|x_A)\\
&= n\sum_{k=0}^{n-1}\frac{1}{n}\sum_{A\subseteq[n]:|A|=k}\frac{k!\,(n-k)!}{n!}\sum_{i\in[n]\setminus A}\frac{1}{n-k}\log p(x_i|x_A)
\end{aligned} \tag{26}$$
In the second line, we rewrite the sum over sets as a sum over the subset size $k$ followed by a sum over subsets of size $k$\. We also introduce some factors of 1, which are re\-arranged in the third line\. The goal is to re\-write this expression in terms of expectations to get a convenient Monte Carlo estimator\.
$$\frac{1}{n}\log p(x_1,\ldots,x_n)=\underbrace{\sum_{k=0}^{n-1}\frac{1}{n}}_{\text{Uniform over }k}\;\underbrace{\sum_{A\subseteq[n]:|A|=k}\frac{1}{\binom{n}{k}}}_{\text{Uniform over sets }|A|=k}\;\underbrace{\sum_{i\in[n]\setminus A}\frac{1}{n-k}}_{\text{Uniform over }i\in[n]\setminus A}\log p(x_i|x_A) \tag{27}$$
At this point, we can recognize that each sum is an average under a uniform distribution\. The number of indices $i$ in the set $[n]\setminus A$ is $n-k$, the number of subsets of $[n]$ of size $k$ is $\tbinom{n}{k}$, and the number of subset sizes is $n$\.
$$\begin{aligned}
\frac{1}{n}\log p(x_1,\ldots,x_n) &= \mathbb{E}_{P(k,A,i)}[\log p(x_i|x_A)]\\
P(k) &= \mathrm{Unif}\{0,1,\dots,n-1\}\\
P(A|k) &= \mathrm{Unif}\{A\subseteq[n]:|A|=k\}\\
P(i|A,k) &= \mathrm{Unif}\{[n]\setminus A\}
\end{aligned} \tag{28}$$
This gives us a concrete procedure for obtaining an unbiased estimator of the log likelihood that only requires evaluating probabilities of marginals, conditioned on sets of variables\. This is exactly what our denoiser is trained to do\. For discrete random variables, the inner expectation is, up to sign, just the mean cross entropy per variable of the masked variables conditioned on the unmasked ones, as seen in the LLaDA estimator\. We construct a Monte Carlo estimate by sampling the size $k$ and the subset $A$\. Training with this objective is simple, as each Monte Carlo sample is a simple cross entropy\. We summarize the procedure in Alg\. [3](https://arxiv.org/html/2605.12836#alg3)\.
Algorithm 3 ROAR NLL Estimator
1: Input: sample $x$; model giving marginal probabilities $p(x_i|x_A)$ for any subset of unmasked tokens $A$\.
2: Sample $k$ and a set $A$ of size $k$ uniformly
3: Average cross entropy per variable of masked tokens conditioned on unmasked token set, $A$
4: return NLL estimate \(bits per token\)
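The procedure can be sketched in a few lines on a toy problem\. In the sketch below, a hypothetical random joint distribution over three binary variables plays the role of the data distribution, and exact conditionals computed from it stand in for the trained model; both are assumptions made for illustration\. The code reports values in nats per token \(dividing by $\ln 2$ would convert to bits\) and compares the exact expectation over $(k,A)$ with $-\frac{1}{n}\log p(x)$ and with a Monte Carlo average\.

```python
import itertools
import math
import random

def make_joint(n, seed=0):
    """Hypothetical toy joint distribution over n binary variables."""
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=n))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}

def conditional(joint, x, i, A):
    """Exact p(x_i | x_A); stands in for the trained model's marginal prediction."""
    consistent = [(y, p) for y, p in joint.items() if all(y[j] == x[j] for j in A)]
    den = sum(p for _, p in consistent)
    num = sum(p for y, p in consistent if y[i] == x[i])
    return num / den

def masked_cross_entropy(joint, x, A):
    """Mean -log p(x_i | x_A) over the masked positions i not in A (nats per token)."""
    masked = [i for i in range(len(x)) if i not in A]
    return sum(-math.log(conditional(joint, x, i, A)) for i in masked) / len(masked)

def roar_nll_sample(joint, x, rng):
    """One Monte Carlo draw: sample k, then a uniform subset A of size k."""
    n = len(x)
    k = rng.randrange(n)           # k ~ Unif{0, ..., n-1}
    A = rng.sample(range(n), k)    # uniform subset of unmasked positions
    return masked_cross_entropy(joint, x, A)

def roar_nll_exact(joint, x):
    """Exact expectation of roar_nll_sample over k and A."""
    n = len(x)
    total = 0.0
    for k in range(n):
        subsets = list(itertools.combinations(range(n), k))
        for A in subsets:
            total += masked_cross_entropy(joint, x, A) / (n * len(subsets))
    return total

joint = make_joint(3)
x = (1, 0, 1)
exact = roar_nll_exact(joint, x)
target = -math.log(joint[x]) / len(x)
rng = random.Random(1)
mc = sum(roar_nll_sample(joint, x, rng) for _ in range(20000)) / 20000
print(exact, target, mc)
```

With exact conditionals the estimator is unbiased for the per\-token negative log likelihood; with a learned model, the same sampling scheme would estimate the model's own likelihood under this decomposition\.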
### A\.7Cross\-entropy as a surrogate for the path\-integral MSE
This appendix justifies the choice of token\-level cross\-entropy on the continuous\-SNR branch of DSL training\. The path\-integral identity Eq\. [6](https://arxiv.org/html/2605.12836#S2.E6) expresses $-\log P(\bm{x})$ as a contour integral of the squared embedding error
$$E_i(\bm{x},\gamma)\;=\;\mathbb{E}_{p_{\gamma}(\bm{z}\mid\bm{x})}\big[\|\bm{x}_i-\hat{\bm{x}}_i(\bm{z})\|_2^2\big],$$
where $\hat{\bm{x}}_i=\mathbb{E}[\bm{x}_i\mid\bm{z}]$ is the Bayes\-optimal denoiser\. A direct continuous\-NLL training scheme would parameterize a denoiser $\hat{\bm{x}}_{i,\theta}$ in embedding space and minimize this MSE\. We argue that DSL should instead minimize a token\-level cross\-entropy objective that has the same Bayes\-optimal solution but avoids three concrete pathologies of embedding\-space MSE\.
#### Same Bayes\-optimal solution\.
Under the unit\-sphere constraint, the Bayes denoiser admits the closed form
$$\hat{\bm{x}}(\bm{z})\;=\;\mathbb{E}_{P(\bm{x})}\big[\bm{x}\,e^{\bm{x}\cdot\bm{z}}\big]\,/\,\mathbb{E}_{P(\bm{x})}\big[e^{\bm{x}\cdot\bm{z}}\big]\;=\;\sum_{v\in\mathcal{V}}p(\bm{x}{=}v\mid\bm{z})\,\bm{e}_v,$$
i\.e\., a vocabulary\-weighted average of unit\-norm embeddings under the induced token posterior $p(\bm{x}{=}v\mid\bm{z})$\. Embedding\-space MSE is minimized when $\hat{\bm{x}}_{\theta}(\bm{z})$ matches this posterior mean\. Token\-level cross\-entropy on $p_{\theta}(\bm{x}{=}v\mid\bm{z})$ is minimized when $p_{\theta}(\cdot\mid\bm{z})=p(\cdot\mid\bm{z})$\. Both objectives therefore identify the same Bayes\-optimal solution: at the global minimum, the CE\-trained model recovers $p(\cdot\mid\bm{z})$, and reading off its posterior mean recovers $\hat{\bm{x}}(\bm{z})$\. We use cross\-entropy because, away from the global minimum, embedding\-space MSE has structurally worse optimization geometry, as we explain next\.
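The shared optimum can be illustrated at a single noisy state $\bm{z}$\. The sketch below builds the token posterior for three hypothetical unit\-norm embeddings with an assumed prior, then checks that the cross\-entropy is smallest at the true posterior and that the expected squared embedding error is smallest at the posterior mean, by comparing each optimum against a perturbed alternative\. All numerical values are illustrative\.

```python
import math

def posterior(z, embeddings, prior):
    """Token posterior p(v | z), proportional to prior[v] * exp(<e_v, z>)."""
    logits = [math.log(p) + sum(a * b for a, b in zip(e, z))
              for e, p in zip(embeddings, prior)]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [wi / s for wi in w]

def cross_entropy(p, q):
    """Cross-entropy of a predicted distribution q under the true posterior p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def expected_sq_error(p, embeddings, guess):
    """E_{v ~ p} ||e_v - guess||^2 for a candidate embedding-space output."""
    return sum(pi * sum((a - g) ** 2 for a, g in zip(e, guess))
               for pi, e in zip(p, embeddings))

embeddings = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # hypothetical unit-norm tokens
prior = [0.5, 0.3, 0.2]
z = (0.4, 0.9)

p = posterior(z, embeddings, prior)
mean = tuple(sum(pi * e[j] for pi, e in zip(p, embeddings)) for j in range(2))

# Perturbed alternatives: a smoothed distribution and a shifted embedding guess.
q = [0.9 * pi + 0.1 / len(p) for pi in p]
shifted = (mean[0] + 0.05, mean[1] - 0.05)

ce_opt, ce_alt = cross_entropy(p, p), cross_entropy(p, q)
mse_opt, mse_alt = expected_sq_error(p, embeddings, mean), expected_sq_error(p, embeddings, shifted)
print(ce_opt, ce_alt, mse_opt, mse_alt)
```

The posterior mean is obtained from the cross\-entropy optimum by a fixed linear readout, which is the sense in which the two objectives agree at the global minimum\.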
#### Embedding collapse under jointly learned embeddings\.
\[[3](https://arxiv.org/html/2605.12836#bib.bib656)\] observe that direct $\ell_2$ regression in embedding space is ill\-posed when the embedding matrix $W_{\mathrm{emb}}$ is jointly trained with the denoiser: setting $W_{\mathrm{emb}}\to 0$ together with $\hat{\bm{x}}_{\theta}\to 0$ achieves zero loss for any noisy input $\bm{z}$, yielding a degenerate solution\. Prior continuous diffusion language models address this only with handcrafted embeddings or auxiliary regularizers \[[13](https://arxiv.org/html/2605.12836#bib.bib660),[2](https://arxiv.org/html/2605.12836#bib.bib657)\]\. DSL constrains every clean embedding to lie on the unit sphere \($\|\bm{x}_i\|_2=1$\), which already rules out exact collapse, but the corresponding low\-rank near\-collapsed configurations remain reachable by gradient descent on embedding\-space MSE\. Cross\-entropy on a softmax over fixed unit\-norm embeddings does not exhibit this pathology: the cross\-entropy minimum requires the predicted distribution to actually concentrate on the correct token, not merely shrink the predicted embedding\.
#### Low\-noise memorization burden\.
\[[3](https://arxiv.org/html/2605.12836#bib.bib656)\] also note that a denoiser trained directly with embedding MSE must, at low noise \(high SNR\), output the clean embedding vectors to high precision, effectively memorizing $W_{\mathrm{emb}}$ inside the network parameters\. They mitigate this by reparameterizing the denoiser as a softmax over token logits, then averaging embeddings weighted by that softmax, an architectural change motivated by exactly this observation\. The DSL converter Eq\. [11](https://arxiv.org/html/2605.12836#S3.E11) adopts the same architectural shape, and CE training is the natural objective for the resulting categorical head: the network’s job is to assign probability mass over $\mathcal{V}$, not to reconstruct embedding coordinates\.
#### Mean\-seeking shortcuts on multi\-modal posteriors\.
At intermediate SNR, the token posterior $p(\cdot\mid\bm{z})$ is generically multi\-modal: several tokens have non\-trivial probability\. The MMSE estimator $\hat{\bm{x}}=\sum_{v}p(v\mid\bm{z})\,\bm{e}_v$ is the posterior mean, an average of the candidate embeddings, which under unit\-sphere geometry has norm strictly below one and lies inside the convex hull of the token embeddings\. A finite\-capacity denoiser trained on embedding MSE can approach this mean by outputting a generic interior vector that does not commit to any particular token, lowering raw MSE without producing useful posterior structure\. The cross\-entropy objective penalizes such non\-committal outputs because they correspond to entropic predicted distributions; the optimizer is pushed toward distributions that match the actual posterior shape\.
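A two\-token worked example makes the shrinkage of the posterior mean explicit\. Suppose, purely for illustration, that the posterior puts mass $p$ on token $a$ and $1-p$ on token $b$, with unit\-norm embeddings whose inner product is $c=\langle\bm{e}_a,\bm{e}_b\rangle$\. Then

$$\|\hat{\bm{x}}\|_2^2=\|p\,\bm{e}_a+(1-p)\,\bm{e}_b\|_2^2=p^2+(1-p)^2+2p(1-p)\,c=1-2p(1-p)(1-c).$$

For distinct embeddings \($c<1$\) and $0<p<1$ this is strictly below one\. In the orthogonal case $c=0$ with $p=\tfrac{1}{2}$, the norm is $1/\sqrt{2}\approx 0.71$, so the MSE\-optimal output sits well inside the sphere and does not coincide with either token embedding\.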
In summary, cross\-entropy and embedding MSE share the same Bayes\-optimal solution under Eq\. [6](https://arxiv.org/html/2605.12836#S2.E6), but cross\-entropy is preferred for three operational reasons rooted in prior work \[[3](https://arxiv.org/html/2605.12836#bib.bib656)\]: it avoids joint\-embedding collapse, removes the low\-noise \(high\-SNR\) memorization burden, and avoids mean\-seeking shortcuts on multi\-modal posteriors\. This is what makes cross\-entropy a practical surrogate for the path\-integral MSE on the continuous\-SNR branch of DSL training\.
### A\.8Posterior\-State Matching and Inference\-Time Error Correction
A useful way to understand why DSL can open a *correction window* is through a minimal two\-bit manifold example\. Consider sequences $(x_1,x_2)\in\{0,1\}^2$ whose data distribution is supported only on
$$\mathcal{M}=\{(0,0),(1,1)\},\qquad P\bigl((0,0)\bigr)=P\bigl((1,1)\bigr)=\frac{1}{2}. \tag{29}$$
Thus, inconsistent states such as $(0,1)$ or $(1,0)$ are off\-manifold: they may arise as intermediate draft errors, but they are not valid solutions\.
Embed the two symbols on the unit sphere by
$$\operatorname{enc}(0)=-1,\qquad\operatorname{enc}(1)=+1, \tag{30}$$
so the two valid clean sequence embeddings are
$$\bm{x}^{(-)}:=(-1,-1),\qquad\bm{x}^{(+)}:=(+1,+1). \tag{31}$$
Under the DSL localization channel, a corrupted state is
$$\bm{z}=\gamma\,\bm{x}+\sqrt{\gamma}\,\bm{\epsilon},\qquad\bm{\epsilon}\sim\mathcal{N}(0,\bm{I}_2),\qquad\bm{x}\in\{\bm{x}^{(-)},\bm{x}^{(+)}\}. \tag{32}$$
Because the prior is uniform and the clean embeddings are unit norm, the posterior over the two valid hypotheses is
\begin{align}
P\bigl(\bm{x}=\bm{x}^{(+)}\mid\bm{z}\bigr) &= \frac{\exp(\bm{z}^{\top}\bm{x}^{(+)})}{\exp(\bm{z}^{\top}\bm{x}^{(+)})+\exp(\bm{z}^{\top}\bm{x}^{(-)})} = \frac{\exp(z_1+z_2)}{\exp(z_1+z_2)+\exp(-(z_1+z_2))}, \tag{33}\\
P\bigl(\bm{x}=\bm{x}^{(-)}\mid\bm{z}\bigr) &= \frac{\exp(\bm{z}^{\top}\bm{x}^{(-)})}{\exp(\bm{z}^{\top}\bm{x}^{(+)})+\exp(\bm{z}^{\top}\bm{x}^{(-)})} = \frac{\exp(-(z_1+z_2))}{\exp(z_1+z_2)+\exp(-(z_1+z_2))}. \tag{34}
\end{align}
Equivalently,
$$P\bigl(\bm{x}=\bm{x}^{(+)}\mid\bm{z}\bigr)=\sigma\bigl(2(z_1+z_2)\bigr),\qquad P\bigl(\bm{x}=\bm{x}^{(-)}\mid\bm{z}\bigr)=1-\sigma\bigl(2(z_1+z_2)\bigr). \tag{35}$$
Hence the decision boundary is not determined token\-by\-token, but by the *joint* statistic
$$z_1+z_2=0. \tag{36}$$
The posterior\-mean denoiser is
$$\begin{aligned}
\hat{\bm{x}}(\bm{z}) &= \mathbb{E}[\bm{x}\mid\bm{z}] = P\bigl(\bm{x}=\bm{x}^{(+)}\mid\bm{z}\bigr)\,\bm{x}^{(+)}+P\bigl(\bm{x}=\bm{x}^{(-)}\mid\bm{z}\bigr)\,\bm{x}^{(-)} \\
&= \Bigl(P\bigl(\bm{x}=\bm{x}^{(+)}\mid\bm{z}\bigr)-P\bigl(\bm{x}=\bm{x}^{(-)}\mid\bm{z}\bigr)\Bigr)\,(1,1) \\
&= \tanh(z_1+z_2)\,(1,1).
\end{aligned} \tag{37}$$
Equation [37](https://arxiv.org/html/2605.12836#A1.E37) is the key point: the denoising field points toward the correct manifold as long as the corrupted state remains on the correct side of the joint decision boundary\.
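The closed forms for this two\-bit example can be checked against a brute\-force Bayes computation\. The sketch below evaluates the posterior from the Gaussian likelihood of the localization channel at a few values of $\gamma$ and compares it with the sigmoid and $\tanh$ expressions; the test point and the function names are illustrative choices\.

```python
import math

X_PLUS, X_MINUS = (1.0, 1.0), (-1.0, -1.0)

def likelihood(z, x, gamma):
    """Unnormalized Gaussian likelihood of z = gamma*x + sqrt(gamma)*eps."""
    sq_dist = sum((zj - gamma * xj) ** 2 for zj, xj in zip(z, x))
    return math.exp(-sq_dist / (2.0 * gamma))

def posterior_plus_bruteforce(z, gamma):
    """P(x = x_plus | z) under the uniform prior, via Bayes' rule."""
    lp, lm = likelihood(z, X_PLUS, gamma), likelihood(z, X_MINUS, gamma)
    return lp / (lp + lm)

def posterior_plus_closed_form(z):
    """sigma(2 (z1 + z2)); note that gamma does not appear."""
    return 1.0 / (1.0 + math.exp(-2.0 * (z[0] + z[1])))

def denoiser_bruteforce(z, gamma):
    """Posterior mean computed from the brute-force posterior."""
    p = posterior_plus_bruteforce(z, gamma)
    return tuple(p * a + (1.0 - p) * b for a, b in zip(X_PLUS, X_MINUS))

z = (1.5, -0.5)
posterior_gaps = [abs(posterior_plus_bruteforce(z, g) - posterior_plus_closed_form(z))
                  for g in (0.5, 1.0, 2.0)]
denoiser_gap = max(abs(c - math.tanh(z[0] + z[1])) for c in denoiser_bruteforce(z, 1.0))
print(posterior_gaps, denoiser_gap)
```

Both gaps should be at floating\-point noise level, and the brute\-force posterior should be the same for every $\gamma$, consistent with the SNR\-invariance of the denoiser\.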
To make this concrete, suppose the ground\-truth sequence is $(1,1)$, but an intermediate draft contains a local inconsistency resembling $(1,0)$\. In continuous state space, this corresponds to a region such as
$$\mathcal{R}_{11}^{\mathrm{corr}}:=\{\bm{z}\in\mathbb{R}^2:\ z_1>0,\ z_2<0,\ z_1+z_2>0\}. \tag{38}$$
Inside $\mathcal{R}_{11}^{\mathrm{corr}}$, the second coordinate looks locally wrong, but the joint posterior still favors $\bm{x}^{(+)}=(1,1)$ because $z_1+z_2>0$\. Therefore, the denoising update in Eq\. [37](https://arxiv.org/html/2605.12836#A1.E37) continues to point toward $(1,1)$ rather than amplifying the local error\. This is the *correction window*: an off\-manifold draft can still be revised if its induced posterior state remains inside the attraction region of the correct solution\.
This example also clarifies what DSL does *not* claim\. DSL does not guarantee correction for every erroneous draft\. If the state crosses to the wrong side of the boundary, i\.e\.
$$z_1+z_2<0, \tag{39}$$
then the posterior concentrates on $\bm{x}^{(-)}=(-1,-1)$, the embedding of $(0,0)$, and the denoising field points to the wrong manifold\. Thus, posterior\-state matching opens a correction window, but does not guarantee that inference remains inside it\.
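The two regimes can be contrasted on concrete states\. The sketch below uses the $\tanh$ form of the denoiser from the two\-bit example and two hypothetical draft states that share the same local pattern \(first coordinate positive, second negative\) but sit on opposite sides of the joint boundary\. The specific coordinates are illustrative\.

```python
import math

def in_correction_window(z):
    """Membership test for the region z1 > 0, z2 < 0, z1 + z2 > 0."""
    z1, z2 = z
    return z1 > 0 and z2 < 0 and z1 + z2 > 0

def denoising_target(z):
    """Posterior-mean denoiser tanh(z1 + z2) * (1, 1) of the two-bit example."""
    t = math.tanh(z[0] + z[1])
    return (t, t)

recoverable = (1.5, -0.5)  # second coordinate looks wrong, but z1 + z2 = 1.0 > 0
lost = (0.5, -1.5)         # same local pattern, but z1 + z2 = -1.0 < 0

for name, z in (("recoverable", recoverable), ("lost", lost)):
    print(name, in_correction_window(z), denoising_target(z))
```

For the first state both coordinates of the denoiser output are positive, so the locally wrong second coordinate is pulled back toward $(1,1)$; for the second state both are negative and the update moves toward $(-1,-1)$\.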
This toy case explains why mixed\-corruption DSL training can help\. Training does not need to reproduce the exact symbolic error $(1,0)$ as a raw state\. Instead, it is sufficient that the corruption distribution exposes the model to continuous states whose induced posterior geometry overlaps with revisable regions such as Eq\. [38](https://arxiv.org/html/2605.12836#A1.E38)\. In that case, some imperfect self\-generated drafts at inference time are less out\-of\-distribution, and the model is more likely to produce a denoising field that still points back toward a valid solution manifold\.
By contrast, standard autoregressive decoding and plain absorbing masked diffusion do not naturally realize this mechanism\. For autoregressive decoding,
$$\hat{x}_1=\operatorname*{arg\,max}_{x\in\{0,1\}}p_{\theta}(x_1=x),\qquad\hat{x}_2=\operatorname*{arg\,max}_{x\in\{0,1\}}p_{\theta}(x_2=x\mid\hat{x}_1), \tag{40}$$
so once $\hat{x}_1$ is emitted, later steps condition on a hard prefix and do not revise it\. Likewise, in plain absorbing masked diffusion, visible tokens are effectively frozen once unmasked, so a visible\-but\-wrong token is not naturally moved back toward the correct manifold unless an explicit remasking mechanism is introduced\. DSL is therefore complementary to remasking\-based samplers: remasking reopens a visible token for revision, while DSL increases the chance that the reopened state still lies in a posterior region where the denoising field points back toward the correct solution\.
The lesson of this example is that correction is not explained by “many possible contexts” alone\. What matters is whether these possibilities are organized into a posterior geometry whose mass still overlaps the correct solution and whose denoising field still points toward it\. DSL helps precisely by broadening the posterior states seen during training, thereby making some refinement\-time erroneous drafts less out\-of\-distribution and more revisable\.
### A\.9Confidence Geometry: Norm–Direction Decomposition and Bounded Entropy
With the posterior written as a function of the realized state alone, its geometry becomes transparent\. Throughout this subsection we use the unit\-sphere token geometry of DSL: each clean token embedding satisfies $\|\bm{e}_v\|_2=1$ for $v\in\mathcal{V}$\.
###### Proposition A\.1\(Norm–direction decomposition\)
For any nonzero state $\bm{z}$ and any unit\-norm token embedding $\bm{e}_i$, the score of token $i$ induced by $\bm{z}$ satisfies
$$s_i(\bm{z}):=\langle\bm{z},\bm{e}_i\rangle=\|\bm{z}\|_2\cos\theta_i, \tag{41}$$
where $\theta_i$ is the angle between $\bm{z}$ and $\bm{e}_i$\. At $\bm{z}=0$, all scores are zero and the posterior reduces to the base measure\.
###### Proof A\.1
For $\bm{z}\neq 0$, the cosine identity gives
$$\cos\theta_i=\frac{\langle\bm{z},\bm{e}_i\rangle}{\|\bm{z}\|_2\,\|\bm{e}_i\|_2}.$$
Since $\|\bm{e}_i\|_2=1$, this implies $\langle\bm{z},\bm{e}_i\rangle=\|\bm{z}\|_2\cos\theta_i$\.
Proposition [A\.1](https://arxiv.org/html/2605.12836#A1.Thmproposition1) separates two roles that are entangled in unconstrained logits: the angular term determines *which token* is preferred, whereas the radial term determines *how concentrated* the posterior is around that preference\. This yields the discrete hyperspherical posterior
$$p(x=i\mid\bm{z})=\frac{\pi_i\exp\big(\|\bm{z}\|_2\cos\theta_i\big)}{\sum_{j\in\mathcal{V}}\pi_j\exp\big(\|\bm{z}\|_2\cos\theta_j\big)}. \tag{42}$$
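The separation of roles can be seen numerically by fixing the direction of $\bm{z}$ and varying only its norm\. The sketch below assumes four hypothetical unit\-norm embeddings on the circle and a uniform base measure $\pi$; under these assumptions the preferred token is set by the direction alone, while the entropy of the posterior shrinks as the radius grows\.

```python
import math

def hyperspherical_posterior(radius, direction, embeddings, prior):
    """p(i | z) for z = radius * direction, with unit-norm direction and embeddings."""
    cosines = [sum(a * b for a, b in zip(direction, e)) for e in embeddings]
    logits = [math.log(p) + radius * c for p, c in zip(prior, cosines)]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [wi / s for wi in w]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Hypothetical embeddings, uniform base measure, and a fixed direction.
embeddings = [(math.cos(a), math.sin(a)) for a in (0.0, 1.0, 2.5, 4.0)]
prior = [0.25] * 4
direction = (math.cos(0.3), math.sin(0.3))

radii = (0.0, 1.0, 4.0, 16.0)
posteriors = [hyperspherical_posterior(r, direction, embeddings, prior) for r in radii]
entropies = [entropy(p) for p in posteriors]
preferred = [max(range(4), key=lambda i: p[i]) for p in posteriors[1:]]
print(entropies, preferred)
```

At radius zero the posterior equals the base measure, and with a non\-uniform $\pi$ the preferred token would depend on both the direction and the prior weights\.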
#### Entropy envelope at bounded radius\.
For any finite radius cap $\beta<\infty$, write $\bm{z}=\beta\bm{u}$ with $\|\bm{u}\|_2\leq 1$. The corresponding bounded-radius posterior family is
$$q_\beta(v\mid\bm{u})=\frac{\pi_v\exp\!\bigl(\beta\langle\bm{u},\bm{e}_v\rangle\bigr)}{\sum_{j\in\mathcal{V}}\pi_j\exp\!\bigl(\beta\langle\bm{u},\bm{e}_j\rangle\bigr)},\qquad\|\bm{u}\|_2\leq 1,\;\|\bm{e}_v\|_2=1.\tag{43}$$
Since $\langle\bm{u},\bm{e}_v\rangle\in[-1,1]$, the logits can only tilt away from the base measure $\pi$ within a finite band controlled by $\beta$. Hence the achievable posterior entropy over this bounded-radius family is confined to the interval
$$H_{\min}(\beta,\pi)\;\leq\;H\!\left(q_\beta(\cdot\mid\bm{u})\right)\;\leq\;H_{\max}(\beta,\pi),\tag{44}$$
where
$$H_{\min}(\beta,\pi):=\min_{\|\bm{u}\|_2\leq 1}H(q_\beta(\cdot\mid\bm{u})),\qquad H_{\max}(\beta,\pi):=\max_{\|\bm{u}\|_2\leq 1}H(q_\beta(\cdot\mid\bm{u})).\tag{45}$$
Thus, within a finite-SNR or bounded-converter regime, $\beta$ controls how sharp the posterior may become, while $\pi$ sets the baseline non-uniformity of token probabilities. Without such a finite radius cap, the Bayes posterior can become arbitrarily sharp as $\|\bm{z}\|_2$ grows.
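The effect of the radius cap in Eq. 43 can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical three-token vocabulary with unit-norm embeddings in the plane and an arbitrary prior; none of these values come from the experiments. It evaluates $q_\beta(\cdot\mid\bm{u})$ for a fixed unit direction $\bm{u}$ and reports how the entropy shrinks as $\beta$ grows.

```python
import math

def bounded_radius_posterior(beta, u, embeddings, prior):
    """Return q_beta(. | u) with weights prior[v] * exp(beta * <u, e_v>)."""
    logits = [
        math.log(p) + beta * sum(a * b for a, b in zip(u, e))
        for p, e in zip(prior, embeddings)
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(q):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in q if p > 0.0)

# Hypothetical toy setup (not from the paper): three unit-norm embeddings.
embeddings = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
prior = [0.5, 0.3, 0.2]
u = (1.0, 0.0)  # unit direction aligned with token 0

betas = [0.0, 1.0, 4.0, 16.0]
entropies = [
    entropy(bounded_radius_posterior(b, u, embeddings, prior)) for b in betas
]
for b, h in zip(betas, entropies):
    print(f"beta={b:5.1f}  entropy={h:.4f} nats")
```

At $\beta=0$ the posterior equals the prior $\pi$, and for this particular direction the entropy decreases as $\beta$ increases, which is the behavior the envelope in Eq. 44 bounds.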
#### Smoothness as a supporting property\.
The same hyperspherical geometry also induces a smooth Bayes posterior-mean map
$$\bm{m}(\bm{z}):=\mathbb{E}[\bm{e}_x\mid\bm{z}].\tag{46}$$
When the clean-token distribution is supported on unit-norm embeddings, this Bayes posterior-mean map is $1$-Lipschitz:
$$\|\bm{m}(\bm{z}_1)-\bm{m}(\bm{z}_2)\|_2\leq\|\bm{z}_1-\bm{z}_2\|_2.\tag{47}$$
We treat this as a supporting geometric property: it suggests that DSL refinement is well-behaved by construction, but in practice the decisive mechanism is whether confidence remains informative on refinement-time draft errors.
### A\.10Lipschitz Continuity of the Induced Posterior\-Mean Map
The main text treats smoothness as a supporting geometric property rather than a core mechanism. We therefore defer the full proof here. The result shows that the posterior-mean refinement map induced by DSL is $1$-Lipschitz under unit-sphere token geometry.
###### Lemma A\.1
Let the induced posterior-mean map (MMSE denoiser) of the unconditional SDE be defined by
$$\hat{\bm{x}}(\bm{z})=\frac{\mathbb{E}_{p(\bm{x})}\bigl[\bm{x}\,e^{\bm{x}\cdot\bm{z}}\bigr]}{\mathbb{E}_{p(\bm{x})}\bigl[e^{\bm{x}\cdot\bm{z}}\bigr]},\qquad\bm{z}\in\mathbb{R}^d,$$
where the data distribution $p(\bm{x})$ is supported on the unit sphere, $\lVert\bm{x}\rVert=1$. Then $\hat{\bm{x}}(\cdot)$ is $1$-Lipschitz; that is,
$$\forall\,\bm{z}_1,\bm{z}_2\in\mathbb{R}^d:\quad\lVert\hat{\bm{x}}(\bm{z}_1)-\hat{\bm{x}}(\bm{z}_2)\rVert\;\leq\;\lVert\bm{z}_1-\bm{z}_2\rVert.$$
Define the log-partition function
$$\phi(\bm{z})\;=\;\log\mathbb{E}_{p(\bm{x})}\!\bigl[e^{\bm{x}\cdot\bm{z}}\bigr],$$
so that $\hat{\bm{x}}(\bm{z})=\nabla_{\bm{z}}\phi(\bm{z})$.
The Hessian of $\phi$ equals the covariance of $\bm{x}$ under the Gibbs posterior $p(\bm{x}\mid\bm{z})\propto p(\bm{x})\,e^{\bm{x}\cdot\bm{z}}$:
$$H(\bm{z})\;=\;\nabla_{\bm{z}}^2\phi(\bm{z})\;=\;\mathrm{Cov}_{p(\bm{x}\mid\bm{z})}\!\bigl(\bm{x}\bigr).$$
A covariance matrix is positive semidefinite, so $\phi$ is convex and the spectral norm of $H(\bm{z})$ is the maximum of $v^\top H(\bm{z})\,v$ over unit vectors $v$. For any unit vector $v\in\mathbb{R}^d$,
$$v^\top H(\bm{z})\,v\;=\;\mathbb{E}_{p(\bm{x}\mid\bm{z})}\!\bigl[(v\cdot\bm{x})^2\bigr]-\Bigl(\mathbb{E}_{p(\bm{x}\mid\bm{z})}[v\cdot\bm{x}]\Bigr)^{2}\;\leq\;\mathbb{E}_{p(\bm{x}\mid\bm{z})}\!\bigl[(v\cdot\bm{x})^2\bigr]\;\leq\;\lVert v\rVert^2\;=\;1,$$
where the last inequality follows from the Cauchy-Schwarz bound $(v\cdot\bm{x})^2\leq\lVert v\rVert^2\lVert\bm{x}\rVert^2$ together with $\lVert\bm{x}\rVert=1$. Therefore the spectral norm satisfies $\lVert H(\bm{z})\rVert\leq 1$.
Since the spectral norm of $\nabla_{\bm{z}}^2\phi$ is bounded by $1$ everywhere, the gradient $\nabla_{\bm{z}}\phi=\hat{\bm{x}}$ is $1$-Lipschitz. Writing $\hat{\bm{x}}(\bm{z}_1)-\hat{\bm{x}}(\bm{z}_2)=\int_0^1 H\bigl(\bm{z}_2+t(\bm{z}_1-\bm{z}_2)\bigr)(\bm{z}_1-\bm{z}_2)\,dt$, we obtain
$$\lVert\hat{\bm{x}}(\bm{z}_1)-\hat{\bm{x}}(\bm{z}_2)\rVert\;\leq\;\int_0^1\!\bigl\lVert H\bigl(\bm{z}_2+t(\bm{z}_1-\bm{z}_2)\bigr)\bigr\rVert\,\lVert\bm{z}_1-\bm{z}_2\rVert\,dt\;\leq\;\lVert\bm{z}_1-\bm{z}_2\rVert.$$
This proves non-expansiveness of the Bayes posterior-mean map. Empirical estimates with $K<1$ are consistent with this bound, but are stronger than what the lemma guarantees. $\blacksquare$
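The non-expansiveness claim can be spot-checked numerically. The sketch below is illustrative only: it assumes a hypothetical random vocabulary of unit-norm embeddings with a uniform prior, draws random pairs $(\bm{z}_1,\bm{z}_2)$, and records the largest observed ratio $\lVert\hat{\bm{x}}(\bm{z}_1)-\hat{\bm{x}}(\bm{z}_2)\rVert/\lVert\bm{z}_1-\bm{z}_2\rVert$. A finite sample cannot prove the lemma; it can only fail to contradict it.

```python
import math
import random

def posterior_mean(z, embeddings, prior):
    """x_hat(z) = sum_v q(v | z) e_v, with q(v | z) proportional to prior[v] * exp(<z, e_v>)."""
    logits = [
        math.log(p) + sum(a * b for a, b in zip(z, e))
        for p, e in zip(prior, embeddings)
    ]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [
        sum(w * e[k] for w, e in zip(weights, embeddings)) / total
        for k in range(len(z))
    ]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

rng = random.Random(0)
dim, vocab = 4, 12

# Hypothetical toy vocabulary: random directions projected onto the unit sphere.
embeddings = []
for _ in range(vocab):
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = norm(raw)
    embeddings.append([c / scale for c in raw])
prior = [1.0 / vocab] * vocab

worst_ratio = 0.0
for _ in range(2000):
    z1 = [rng.gauss(0.0, 3.0) for _ in range(dim)]
    z2 = [rng.gauss(0.0, 3.0) for _ in range(dim)]
    m1 = posterior_mean(z1, embeddings, prior)
    m2 = posterior_mean(z2, embeddings, prior)
    gap_out = norm([a - b for a, b in zip(m1, m2)])
    gap_in = norm([a - b for a, b in zip(z1, z2)])
    worst_ratio = max(worst_ratio, gap_out / gap_in)

print(f"largest observed ratio: {worst_ratio:.4f}")
```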
## Appendix BContinuous States Are Not Enough: Posterior Coordinates vs\. Timestep\-Conditioned Denoising
This appendix clarifies the distinction between DSL and prior continuous diffusion language models\. Prior continuous diffusion LMs already introduce continuous finite\-SNR states between noise and clean token embeddings\[[13](https://arxiv.org/html/2605.12836#bib.bib660),[2](https://arxiv.org/html/2605.12836#bib.bib657),[3](https://arxiv.org/html/2605.12836#bib.bib656),[27](https://arxiv.org/html/2605.12836#bib.bib659)\]\. Thus, the key difference is not the mere existence of continuous intermediate states\. Rather, the difference is how those states are parameterized and interpreted by the denoiser\.
In prior continuous diffusion LMs, a noisy token state is usually interpreted together with an external nominal noise level or timestep\. Consequently, denoising at different SNRs is presented to the model as a family of timestep\-indexed tasks\. DSL instead represents noisy token states in posterior natural coordinates under unit\-sphere token geometry, so the realized state itself determines the token posterior\. This turns different SNR regimes into different regions of one posterior\-state space, enabling mixed continuous/endpoint training and arbitrary per\-token SNR paths with a single denoiser\.
#### A generic continuous diffusion posterior\.
Consider a generic continuous Gaussian corruption for one token embedding:
$$\bm{y}_\tau=a_\tau\bm{x}+\sigma_\tau\bm{\epsilon},\qquad\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I}),\tag{48}$$
where $\tau$ denotes the nominal diffusion time, and $a_\tau,\sigma_\tau$ are the signal and noise scales. Let $\bm{e}_v=\operatorname{enc}(v)$ be the embedding of token $v\in\mathcal{V}$, and let $\pi_v$ denote the context-dependent prior probability of token $v$. Then the Bayes posterior induced by Eq. [48](https://arxiv.org/html/2605.12836#A2.E48) has the form
$$P(\bm{x}=\bm{e}_v\mid\bm{y}_\tau,\tau)\;\propto\;\pi_v\exp\!\left(\frac{a_\tau}{\sigma_\tau^2}\langle\bm{y}_\tau,\bm{e}_v\rangle-\frac{a_\tau^2}{2\sigma_\tau^2}\|\bm{e}_v\|_2^2\right).\tag{49}$$
Therefore, in general, the posterior is a function of the pair $(\bm{y}_\tau,\tau)$. Even if the token embeddings are normalized so that the norm term in Eq. [49](https://arxiv.org/html/2605.12836#A2.E49) cancels across vocabulary items, the natural posterior coordinate is not the raw noisy state $\bm{y}_\tau$, but the rescaled state
$$\tilde{\bm{z}}_\tau=\frac{a_\tau}{\sigma_\tau^2}\,\bm{y}_\tau.\tag{50}$$
Thus a timestep-conditioned continuous denoiser is naturally written as
$$f_\theta(\bm{y}_\tau,\tau),\tag{51}$$
where the timestep label tells the network how to interpret the scale and semantics of the noisy embedding.
This observation is not a criticism of continuous diffusion as a probabilistic model. The posterior information is present in the pair $(\bm{y}_\tau,\tau)$. The issue is representational: different noise levels are exposed to the neural network as different denoising regimes indexed by $\tau$.
#### DSL uses posterior natural coordinates directly\.
DSL chooses a different parameterization of the Gaussian observation. For each token position,
$$\bm{z}_i=\gamma_i\bm{x}_i+\sqrt{\gamma_i}\,\bm{\epsilon}_i,\qquad\bm{\epsilon}_i\sim\mathcal{N}(\bm{0},\bm{I}).\tag{52}$$
Here the signal coefficient is $\gamma_i$ and the noise variance is also $\gamma_i$. Equivalently, in the notation of Eq. [48](https://arxiv.org/html/2605.12836#A2.E48), DSL sets
$$a_{\gamma_i}=\gamma_i,\qquad\sigma_{\gamma_i}^2=\gamma_i,\qquad\frac{a_{\gamma_i}}{\sigma_{\gamma_i}^2}=1.$$
Thus the realized state $\bm{z}_i$ is already in the posterior natural coordinate. Under the unit-sphere constraint $\|\bm{e}_v\|_2=1$, the token-dependent norm term cancels, giving
$$P(\bm{x}_i=\bm{e}_v\mid\bm{z},\bm{\gamma})=P(\bm{x}_i=\bm{e}_v\mid\bm{z})\;\propto\;\pi_v\exp\!\left(\langle\bm{z}_i,\bm{e}_v\rangle\right).\tag{53}$$
Consequently, the Bayes-optimal denoiser is SNR-invariant:
$$\hat{\bm{x}}(\bm{z},\bm{\gamma})=\mathbb{E}_{P_{\bm{\gamma}}(\bm{x}\mid\bm{z})}[\bm{x}]=\mathbb{E}_{P(\bm{x})}\bigl[\bm{x}\,e^{\bm{x}\cdot\bm{z}}\bigr]\Big/\mathbb{E}_{P(\bm{x})}\bigl[e^{\bm{x}\cdot\bm{z}}\bigr]=\hat{\bm{x}}(\bm{z}).\tag{54}$$
The learning problem is therefore different. Instead of learning a timestep-indexed family of denoising maps, DSL learns one posterior map:
$$f_\theta(\bm{z})\approx P(\bm{x}_i\mid\bm{z})\quad\text{or}\quad\hat{\bm{x}}_i(\bm{z}).\tag{55}$$
Low-SNR, intermediate-SNR, and high-SNR states are no longer separate tasks labeled by time. They are different regions of the same token-posterior space.
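The contrast between Eq. 49 and Eq. 53 can be checked on a toy example. The sketch below assumes a hypothetical three-token vocabulary with unit-norm embeddings and an arbitrary prior. It evaluates the generic posterior of Eq. 49 under the DSL choice $a=\gamma$, $\sigma^2=\gamma$ for several $\gamma$ and compares it with the $\gamma$-free form of Eq. 53. A VP-style channel ($a=\sqrt{\alpha}$, $\sigma^2=1-\alpha$) is included purely as an illustrative contrast and is an assumption of this sketch, not a parameterization taken from any specific prior model.

```python
import math

def softmax(logits):
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def generic_posterior(y, a, sigma2, embeddings, prior):
    """Eq. 49: logit_v = log pi_v + (a / sigma^2) <y, e_v> - a^2 / (2 sigma^2) ||e_v||^2."""
    return softmax([
        math.log(p) + (a / sigma2) * dot(y, e) - (a * a) / (2.0 * sigma2) * dot(e, e)
        for p, e in zip(prior, embeddings)
    ])

def dsl_posterior(z, embeddings, prior):
    """Eq. 53: logit_v = log pi_v + <z, e_v>, with no SNR argument."""
    return softmax([math.log(p) + dot(z, e) for p, e in zip(prior, embeddings)])

# Hypothetical toy vocabulary of unit-norm embeddings (not from the paper).
embeddings = [(1.0, 0.0), (0.0, 1.0), (-0.6, 0.8)]
prior = [0.2, 0.5, 0.3]
state = (0.7, -0.4)

# DSL channel: the nominal gamma drops out once the state is given.
reference = dsl_posterior(state, embeddings, prior)
dsl_gap = max(
    abs(p - q)
    for gamma in (0.1, 1.0, 10.0, 100.0)
    for p, q in zip(generic_posterior(state, gamma, gamma, embeddings, prior), reference)
)

# VP-style contrast: the same raw state means different things at different noise levels.
vp_low = generic_posterior(state, math.sqrt(0.1), 0.9, embeddings, prior)
vp_high = generic_posterior(state, math.sqrt(0.9), 0.1, embeddings, prior)
vp_gap = max(abs(p - q) for p, q in zip(vp_low, vp_high))

print(f"DSL max deviation across gamma: {dsl_gap:.2e}")
print(f"VP-style deviation between two noise levels: {vp_gap:.3f}")
```

Under the DSL channel the deviation is at the level of floating-point rounding, whereas the VP-style posterior changes substantially with the nominal noise level, which is why that denoiser needs $\tau$ as an input.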
#### Comparison summary\.
Table[5](https://arxiv.org/html/2605.12836#A2.T5)summarizes the main conceptual differences\.
Table 5:Prior continuous diffusion LMs vs\. DSL\.Both use continuous finite\-SNR states\. The difference is how those states are parameterized, interpreted, and exposed to the denoiser\.
#### Why this matters for training\.
The distinction above changes what it means to train across corruption levels. In a timestep-conditioned continuous diffusion LM, training across $\tau$ asks the model to learn how to denoise under many externally labeled regimes. In DSL, training across SNRs asks the model to cover a single posterior-state space. This is why continuous finite-SNR states and endpoint-like ROAR states can be combined under the same cross-entropy posterior-matching objective in Section [3.1](https://arxiv.org/html/2605.12836#S3.SS1). Both branches provide training mass for the same map $p_\theta(s_i\mid\bm{z})$, rather than defining separate denoising tasks.
This also explains why per\-token SNR paths are natural in DSL\. Once the denoiser is no longer tied to one global nominal timestep, there is no structural requirement that every token in a sequence share the same SNR\. A sequence can contain positions near the mask endpoint, positions near the clean endpoint, and positions at intermediate posterior states, all interpreted by the same denoiser\. This is the mechanism behind the statement in Section[2\.3](https://arxiv.org/html/2605.12836#S2.SS3)that AR\-style revealing, masked diffusion, and remasking are different paths through a single space of per\-token SNR configurations\.
#### Why the converter is not just a normalization trick\.
The posterior-view converter in Eq. [11](https://arxiv.org/html/2605.12836#S3.E11) should not be understood merely as a fix for exploding input norms. Many VP-style continuous diffusion parameterizations keep noisy embeddings numerically well-scaled. The purpose of the converter is different: DSL's $\bm{z}_i$ is a posterior natural parameter whose scale reflects posterior concentration, and this state must be presented to a Transformer through a stable token-like interface. The converter maps $\bm{z}_i$ into a bounded mixture of token embeddings, preserving the posterior-view interpretation while making the input compatible with self-attention.
#### What this comparison does not claim\.
We do not claim that prior continuous diffusion LMs lack posterior meaning or uncertainty. Any Gaussian corruption model defines a posterior $P(\bm{x}\mid\bm{y}_\tau,\tau)$. We also do not claim that their noisy embeddings are necessarily numerically unstable. The claim is instead that DSL explicitly chooses a posterior natural coordinate in which the realized state alone determines the token posterior, and then builds the training distribution and Transformer interface around this posterior-state view. In short, prior continuous diffusion LMs typically perform *timestep-conditioned continuous denoising*, whereas DSL performs *posterior-state denoising in natural coordinates*.
## Appendix CTraining and Sampling Details
This appendix documents the practical choices used in the main experiments\. The main text focuses on the method\-level design: mixed\-support CE posterior matching and a posterior\-view converter\. Here we give the concrete training and decoding settings needed for reproducibility\.
### C\.1OWT Finetuning Settings
We exactly follow MDLM’s\[[22](https://arxiv.org/html/2605.12836#bib.bib663)\]training configurations\.
#### Data and sequence length\.
We fine-tune on OpenWebText using the `openwebtext-split` dataset. All training samples are truncated or padded to a fixed sequence length of $L=1024$ tokens.
#### Backbone and parameterization\.
We use the DiT backbone from the official MDLM GitHub repository \[[22](https://arxiv.org/html/2605.12836#bib.bib663)\] with token embedding dimension 64. Training is conducted in full precision (FP32).
#### Optimization and batching\.
We train for a maximum of 100,000 optimizer steps with no learning-rate warmup (`num_warmup_steps=0`). All other training settings are the same as in MDLM.
#### Compute resources\.
All experiments were run on NVIDIA GPUs\. Text8 experiments were run on 2 H100s, and sampling/evaluation took approximately 1 hour\. OpenWebText training and sampling evaluation used 2 H100s, with 5000 generated samples per configuration\. The hybrid\-sampler sweeps required approximately 2 hours in total, including preliminary hyperparameter searches\. No additional large\-scale pretraining was performed beyond fine\-tuning/evaluating the checkpoints described above\.
### C\.2Mixed SNR Support Used in Training
This subsection specifies the practical mixed\-support training distribution used in DSL\. A subset of token positions is sampled from a ROAR\-style endpoint branch, while the remainder is sampled from a continuous log\-normal SNR branch\. This construction is the practical instantiation of the two exact likelihood views in the main text\.
We denote by $\gamma$ the signal-to-noise ratio (SNR) used in the training corruption process. We use a *mixed* SNR path: with probability
$$p_{\mathrm{ROAR}}\;=\;\frac{1}{k}\;=\;0.1,\tag{56}$$
we draw $\gamma$ from the ROAR path sampler, where the mask is represented by $\gamma=0$ and the clean token is approximated by $\gamma_{\max}=100$; with probability $1-p_{\mathrm{ROAR}}$ we draw $\gamma$ from a log-normal sampler. Concretely, for the log-normal continuous SNR branch we sample
$$\gamma\sim\mathrm{LogNormal}(\mu,\sigma^2),\qquad\mu=1.65,\;\;\sigma=0.9,\tag{57}$$
and there is no $\gamma_{\max}$ cut-off for this log-normal continuous training schedule. Throughout training, $\gamma$ denotes SNR and is distinct from the decoding progress variable $\tau$ used in sampling.
We summarize DSL training in Algorithm [1](https://arxiv.org/html/2605.12836#alg1). Each step samples a sentence from the data, draws a per-token SNR vector $\gamma$ from one of two branches (ROAR with probability $1-\lambda$, continuous-path with probability $\lambda$, so that $1-\lambda=p_{\mathrm{ROAR}}$ in the notation of Eq. 56), produces noisy embeddings $\bm{z}$, runs the converter and backbone to obtain token posteriors $p_\theta(\cdot\mid\bm{z})$, and updates parameters via cross-entropy against the clean tokens.
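The mixed-support draw described above can be sketched as follows. The constants come from Eqs. 56 and 57; everything else is an assumption made for illustration. In particular, the text does not specify here how the ROAR path sampler chooses which positions are revealed, nor whether the branch is chosen per sequence or per token, so this sketch picks the branch once per sequence and reveals a uniformly random subset. The function names `sample_snr_vector` and `corrupt` are hypothetical.

```python
import random

GAMMA_MAX = 100.0       # clean-token endpoint of the ROAR branch
P_ROAR = 0.1            # probability of the ROAR branch (Eq. 56)
MU, SIGMA = 1.65, 0.9   # log-normal parameters (Eq. 57)

def sample_snr_vector(length, rng):
    """Draw one per-token SNR vector and report which branch produced it."""
    if rng.random() < P_ROAR:
        # ASSUMPTION: reveal a uniformly random number of positions,
        # chosen uniformly without replacement; the rest stay masked.
        n_revealed = rng.randrange(length)
        revealed = set(rng.sample(range(length), n_revealed))
        gammas = [GAMMA_MAX if i in revealed else 0.0 for i in range(length)]
        return "roar", gammas
    # ASSUMPTION: independent log-normal SNR per token, with no upper cut-off.
    return "lognormal", [rng.lognormvariate(MU, SIGMA) for _ in range(length)]

def corrupt(x, gamma, rng):
    """DSL channel for one token embedding: z = gamma * x + sqrt(gamma) * eps."""
    return [gamma * c + (gamma ** 0.5) * rng.gauss(0.0, 1.0) for c in x]

rng = random.Random(0)
branch, gammas = sample_snr_vector(8, rng)
print(branch, [round(g, 2) for g in gammas])
```

With $\gamma=0$ the channel returns the zero vector, which is how the mask endpoint is represented in this parameterization.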
### C\.3Masked Refinement with a DSL Checkpoint
The first decoding family used in the paper pairs a trained DSL checkpoint with a ReMDM\-style masked\-refinement schedule\. Its role is to test the primary empirical claim of the paper: whether mixed\-support DSL training improves refinement robustness and step\-efficiency under a fixed masked decoder family\. Algorithm[4](https://arxiv.org/html/2605.12836#alg4)gives the high\-level procedure, while all schedule hyperparameters are listed in Section[C\.4](https://arxiv.org/html/2605.12836#A3.SS4)\.
Algorithm 4: DSL sampling with ReMDM sampler. Adapted from \[[31](https://arxiv.org/html/2605.12836#bib.bib678), Algorithm 1\]; in DSL, the ReMDM mask state is represented by zero SNR.
Input: DSL denoiser $\bm{x}_\theta$, sampling steps $T$, noise schedule $\alpha_t$, clean endpoint $\mathrm{SNR}_{\max}$, remask schedule $\sigma_t$.
Initialize: set every token embedding to the mask endpoint, i.e., $\operatorname{SNR}(\bm{z}_T)=0$.
for $i=T,T-1,\ldots,1$ do
  $t=i/T,\quad s=(i-1)/T$; set $\alpha_t,\alpha_s$.
  Set $\sigma_t\in[0,\sigma_t^{\max}]$, where $\sigma_t^{\max}=\min\{1,(1-\alpha_s)/\alpha_t\}$.
  Predict clean-token posteriors $\widehat{\bm{x}}=\bm{x}_\theta(\bm{z}_t)$.
  Form the ReMDM endpoint posterior $p_\theta(\bm{z}_s\mid\bm{z}_t)=q_\sigma^{\mathrm{DSL}}\bigl(\bm{z}_s\mid\bm{z}_t,\bm{x}=\widehat{\bm{x}}\bigr)$, where ReMDM's mask atom is interpreted as $\operatorname{SNR}=0$ and its clean-token atom as $\operatorname{SNR}=\mathrm{SNR}_{\max}$.
  Sample $\bm{z}_s\sim p_\theta(\bm{z}_s\mid\bm{z}_t)$.
  Realize sampled clean tokens at $\mathrm{SNR}_{\max}$; realize remasked tokens by resetting their SNR to $0$.
  Set $\bm{z}_t\leftarrow\bm{z}_s$.
end for
Output: decoded tokens from the final clean-endpoint latents.
### C\.4Sampling Settings on OWT
This section fully specifies the decoding schedules used in [Section 5](https://arxiv.org/html/2605.12836#S5) and defines the rewrite-count statistics and diagnostics reported throughout the paper.
#### Common evaluation protocol \(all rows in Table 1\)\.
Unless otherwise stated, all methods are evaluated with the same tokenizer (GPT-2 BPE), sequence length $L=1024$, nucleus sampling (top-$p=0.9$), sample count (5k), and MAUVE configuration (GPT-2 Large embeddings with $K=500$ buckets) on the same held-out OWT split. GenPPL is computed as perplexity under the same GPT-2 Large autoregressive oracle LM. Sentence entropy is computed as the average Shannon entropy of the token-ID histogram within each generated sequence (prior to text decoding). This matches the MDLM/ReMDM evaluation protocol and ensures baseline parity.
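The sentence-entropy statistic described above can be computed in a few lines. This is a minimal sketch of the stated definition; the logarithm base is not specified in the text, so natural logarithms (nats) are an assumption here, and the function names are hypothetical.

```python
import math
from collections import Counter

def sequence_entropy(token_ids):
    """Shannon entropy (nats) of the empirical token-ID histogram of one sequence."""
    counts = Counter(token_ids)
    n = len(token_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def mean_sentence_entropy(sequences):
    """Average of the per-sequence entropies over a set of generated samples."""
    return sum(sequence_entropy(s) for s in sequences) / len(sequences)

# Toy token-ID sequences: one fully repetitive, one with all-distinct tokens.
samples = [[7, 7, 7, 7], [1, 2, 3, 4]]
print(mean_sentence_entropy(samples))
```

A sequence that repeats a single token has entropy zero, while a sequence of $n$ distinct tokens attains the maximum $\log n$, so low values flag degenerate, repetitive generations.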
#### Time parameterization (matches logged $t$).
We index a $T$-step decode by a *0-based* step counter $k=0,\dots,T-1$ and use normalized time
$$t_k\triangleq 1-\frac{k}{T}\in\left[\frac{1}{T},\,1\right].\tag{58}$$
Thus decoding progresses from $t_0=1$ (fully masked/noisy) to $t_{T-1}=1/T$ (near-clean). This convention matches the sampler logs (e.g., for $T=128$, the final logged $t\approx 0.0078\approx 1/128$).
#### Mask ratio schedule and reveal fraction\.
Let $M_k\subseteq[L]$ denote the masked positions at step $k$ (i.e., positions whose visible token is `[MASK]`). We log the *realized* masked ratio
$$r(t_k)\triangleq\frac{|M_k|}{L}.\tag{59}$$
We additionally report the *reveal fraction* (newly unmasked mass) per step,
$$\Delta r(t_k)\triangleq r(t_k)-r(t_{k+1})\quad(k<T-1),\tag{60}$$
which indicates how aggressively the schedule transitions from global drafting (large $\Delta r$ early) to local refinement (small $\Delta r$ late).
#### ReMDM\-loop and the loop window\.
Our main step-budgeted protocol uses the ReMDM-loop variant (for all $T$) with the official loop defaults:
$$t_{\mathrm{on}}=0.55,\qquad t_{\mathrm{off}}=0.05,\qquad\alpha_{\mathrm{loop}}=0.9,\qquad\texttt{refresh\_unmasked}=\texttt{true}.\tag{61}$$
Conceptually, decoding consists of (i) a standard MDLM-style reveal phase, (ii) a *loop phase* over the time window $[t_{\mathrm{on}},t_{\mathrm{off}})$ with $t_{\mathrm{on}}>t_{\mathrm{off}}$, and (iii) a final reveal phase that anneals to the near-clean endpoint. During the loop phase, the noise level is held fixed at $\alpha_{\mathrm{loop}}$ so that repeated self-correction operates under a stable context quality.
#### Capped remasking intensity \(ReMDM\-cap inside the loop\)\.
When remasking is active (i.e., within the loop window), ReMDM remasks a subset of currently unmasked tokens to avoid early lock-in. We use ReMDM's capped schedule:
$$q(t)\;=\;\min\!\left(1,\;\frac{\eta(t)}{1-r(t)}\right),\qquad\eta(t)\;=\;\eta_{\mathrm{cap}}\frac{\alpha(t)}{1-\alpha(t)},\tag{62}$$
where $q(t)$ is the per-token remask probability among unmasked positions and $\alpha(t)$ is ReMDM's noise schedule. Outside the loop window, we set $q(t)=0$ (no remasking).
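Eq. 62 translates directly into a small helper. The sketch below uses the loop defaults $t_{\mathrm{on}}=0.55$, $t_{\mathrm{off}}=0.05$ from Eq. 61 and treats the window as $t_{\mathrm{off}}<t\le t_{\mathrm{on}}$, which is one reading of the half-open interval under decreasing time. Returning zero when every position is masked is an assumption added to avoid division by zero, and the function name is hypothetical.

```python
def remask_probability(t, masked_ratio, alpha, eta_cap, t_on=0.55, t_off=0.05):
    """Per-token remask probability q(t) among unmasked positions (Eq. 62)."""
    # Outside the loop window there is no remasking.
    if not (t_off < t <= t_on):
        return 0.0
    unmasked = 1.0 - masked_ratio
    if unmasked <= 0.0:
        # ASSUMPTION: with no unmasked tokens there is nothing to remask.
        return 0.0
    eta = eta_cap * alpha / (1.0 - alpha)   # requires alpha < 1
    return min(1.0, eta / unmasked)

# Inside the loop the noise level is held at alpha_loop = 0.9.
print(remask_probability(t=0.30, masked_ratio=0.10, alpha=0.9, eta_cap=0.010))
print(remask_probability(t=0.80, masked_ratio=0.10, alpha=0.9, eta_cap=0.010))
```

With $\alpha=0.9$ the factor $\alpha/(1-\alpha)$ equals 9, so $\eta_{\mathrm{cap}}=0.010$ and a masked ratio of 0.1 give $q\approx 0.1$ inside the window and exactly $0$ outside it.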
#### Step-budget-aware choice of $\eta_{\mathrm{cap}}(T)$ (as used in our logs).
For the step-budgeted schedule used in [Table 1](https://arxiv.org/html/2605.12836#S4.T1) (DSL-FT + ReMDM-loop), we set
$$\eta_{\mathrm{cap}}(128)=0.010,\quad\eta_{\mathrm{cap}}(256)=0.008,\quad\eta_{\mathrm{cap}}(512)=0.007,\quad\eta_{\mathrm{cap}}(1024)=0.004,\tag{63}$$
and keep all other ReMDM-loop hyperparameters identical to the official setting. This is a minimal, reproducible adaptation that controls rewrite counts while front-loading corrections.
Table 6:Decoding hyperparameters for OWT experiments under our step\-budgeted ReMDM\-loop protocol\. All settings are identical acrossTTexceptηcap\(T\)\\eta\_\{\\mathrm\{cap\}\}\(T\)\.
#### Rewrite\-count statistics\.
Let $x^{(k)}\in\mathcal{V}^L$ be the sampled sequence after step $k$ (with `[MASK]` treated as a valid symbol during decoding). We define the per-position rewrite count
$$R_i\triangleq\sum_{k=1}^{T-1}\mathbf{1}\!\left[x^{(k)}_i\neq x^{(k-1)}_i\right],\qquad i\in[L],\tag{64}$$
and report the mean rewrite count $\frac{1}{L}\sum_{i=1}^{L}R_i$ (equivalently "Rewrites-per-token" in our logs), averaged over the 5k generated samples.
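The rewrite count of Eq. 64 can be computed from a stored decoding trajectory. The following is a minimal sketch, assuming the trajectory is available as a list of per-step sequences; the function names and the toy trajectory are hypothetical.

```python
def rewrite_counts(history):
    """history[k][i] is the symbol at position i after step k, for k = 0..T-1.

    Returns R_i, the number of steps at which position i changed (Eq. 64).
    """
    length = len(history[0])
    counts = [0] * length
    for prev, curr in zip(history, history[1:]):
        for i in range(length):
            if curr[i] != prev[i]:
                counts[i] += 1
    return counts

def mean_rewrites_per_token(history):
    counts = rewrite_counts(history)
    return sum(counts) / len(counts)

MASK = "[MASK]"
# Toy trajectory with T = 4 snapshots over L = 3 positions.
history = [
    [MASK, MASK, MASK],
    ["a", MASK, MASK],
    ["a", "b", MASK],
    ["a", "c", "d"],
]
print(rewrite_counts(history))           # per-position counts
print(mean_rewrites_per_token(history))  # mean over positions
```

Because `[MASK]` is treated as a valid symbol, the first unmasking of a position counts as one rewrite. In the toy trajectory, position 1 is unmasked and then revised, giving counts `[1, 2, 1]`.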
Table 7:Rewrite statistics for the step\-budgeted ReMDM\-loop protocol \(computed from Eq\.[64](https://arxiv.org/html/2605.12836#A3.E64), averaged over 5k samples\)\.
#### Logged posterior sharpness checkpoints (sanity check for [Section D.4](https://arxiv.org/html/2605.12836#A4.SS4)).
To connect the sampler to the diagnostics in §[D.4](https://arxiv.org/html/2605.12836#A4.SS4), we also log (i) the average per-token maximum probability ("mean_max_prob") and (ii) the average top-$p$ nucleus size ("nucleus size") at selected steps. A representative subset of checkpoints is shown below; the strong monotone sharpening (early diffuse $\rightarrow$ late near-deterministic) is consistent across step budgets.
Table 8:Logged posterior sharpness checkpoints from the sampler logs \(“ReMDM logits stats” and “ReMDM nucleus size”\)\.
### C\.5ROAR Sampler Details
The random\-order autoregressive \(ROAR\) sampler instantiates the ROAR endpoint configurations of Eq\.[7](https://arxiv.org/html/2605.12836#S2.E7)as a single\-pass inference path on the same DSL\-finetuned checkpoint\. The sampler reveals one token at a time in a uniformly random order; each position is denoised exactly once, with no remasking, revisits, or self\-correction\. It requires no retraining or sampler\-specific finetuning—it is an alternative decoding rule applied to the same checkpoint used for iterative refinement\. Algorithm[5](https://arxiv.org/html/2605.12836#alg5)gives the procedure\.
#### State representation\.
At each step, the input $\bm{z}\in\mathbb{R}^{L\times d}$ is composed position-by-position according to the current reveal status. Unrevealed positions $i\notin A$ are set to $\bm{z}_i=\bm{0}$, corresponding to $\gamma_i=0$ in the DSL channel. Revealed positions $i\in A$ are set to $\bm{z}_i=\gamma_{\max}\,\operatorname{enc}(s_i)$ with $\gamma_{\max}=100$, providing the denoiser with the same near-clean signal it sees at training time on the ROAR endpoint branch (§[3.1](https://arxiv.org/html/2605.12836#S3.SS1), Appendix [C.2](https://arxiv.org/html/2605.12836#A3.SS2)). The sampler thus operates on the exact endpoint support the model was explicitly trained to cover; revealed positions act as conditioning context, unrevealed positions act as targets for the next prediction.
#### Reveal order\.
For each generation batch, we draw a single permutation $\pi$ of $[L]$ uniformly at random and process positions in the order $\pi_1,\pi_2,\dots,\pi_L$. The same permutation is shared across all sequences in the batch but resampled across batches. We do not use confidence-based or content-aware reveal orders, preserving the unbiased random-order structure of the ROAR estimator in Eq. [7](https://arxiv.org/html/2605.12836#S2.E7). The sampler also exposes a `causal=True` option that fixes $\pi$ to the identity ($\pi_k=k$), recovering strict left-to-right autoregressive decoding; our reported results use the random-order setting.
#### Per\-step prediction\.
At step $k$, the sampler queries the converter and backbone on the current partial state $\bm{z}$ to obtain token posteriors $p_\theta(\cdot\mid\bm{z})$ at every position, reads the marginal at position $\pi_k$, and samples $s_{\pi_k}$ from this marginal using nucleus sampling. Position $\pi_k$ is then added to $A$, and $\bm{z}_{\pi_k}$ is updated to $\gamma_{\max}\,\operatorname{enc}(s_{\pi_k})$ for use in subsequent steps. Posteriors at the other $L-1$ positions are recomputed in the next forward pass under the updated $\bm{z}$, but each position $i$ is itself only *committed* (sampled) once, namely at the step $k$ such that $\pi_k=i$.
#### Hyperparameters\.
For OWT we use $L=1024$, $\gamma_{\max}=100$, $\text{top-}p=0.9$, and run $T=L=1024$ reveal steps per sample.
Algorithm 5: DSL ROAR sampler (single pass, random-order)
1: Input: sequence length $L$; reveal-time SNR $\gamma_{\max}$; nucleus parameter $\text{top-}p$; `causal` flag (default `False`).
2: $A\leftarrow\emptyset$
3: $\bm{z}_i\leftarrow\bm{0}$ for all $i\in[L]$ $\triangleright$ All positions start at $\gamma_i=0$
4: if `causal` then
5:   $\pi\leftarrow(1,2,\dots,L)$
6: else
7:   $\pi\sim\mathrm{Unif}(\mathfrak{S}_L)$ $\triangleright$ Single permutation shared across the batch
8: end if
9: for $k=1,\dots,L$ do
10:   Compute token posteriors $p_\theta(\cdot\mid\bm{z})$ via converter and backbone
11:   $s_{\pi_k}\sim\text{nucleus}(p_\theta(\cdot\mid\bm{z})_{\pi_k};\;\text{top-}p)$
12:   $\bm{z}_{\pi_k}\leftarrow\gamma_{\max}\,\operatorname{enc}(s_{\pi_k})$
13:   $A\leftarrow A\cup\{\pi_k\}$
14: end for
15: return $\bm{s}=(s_1,\dots,s_L)$
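The control flow of Algorithm 5 can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation: `toy_posterior` and `enc` are hypothetical stand-ins for the converter-plus-backbone and the embedding table, the four-token vocabulary is invented, and the sketch handles a single sequence rather than a batch.

```python
import math
import random

def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)
    kept, mass = [], 0.0
    for v in order:
        kept.append(v)
        mass += probs[v]
        if mass >= top_p:
            break
    threshold, acc = rng.random() * mass, 0.0
    for v in kept:
        acc += probs[v]
        if threshold < acc:
            return v
    return kept[-1]

def roar_sample(posterior_fn, enc, length, gamma_max=100.0, top_p=0.9,
                causal=False, rng=None):
    """Reveal one position per step; each position is committed exactly once."""
    rng = rng or random.Random()
    dim = len(enc(0))
    z = [[0.0] * dim for _ in range(length)]   # gamma_i = 0 everywhere
    order = list(range(length))
    if not causal:
        rng.shuffle(order)                     # uniformly random reveal order
    tokens = [None] * length
    for pos in order:
        probs = posterior_fn(z)[pos]           # marginal at the next position
        tokens[pos] = nucleus_sample(probs, top_p, rng)
        z[pos] = [gamma_max * c for c in enc(tokens[pos])]
    return tokens, order

# Hypothetical stand-ins: a 4-token vocabulary on the unit circle and a toy
# "denoiser" that prefers tokens aligned with the revealed context direction.
EMB = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

def enc(token):
    return EMB[token]

def toy_posterior(z):
    ctx = [sum(row[k] for row in z) for k in range(2)]
    scale = math.hypot(*ctx) or 1.0
    logits = [sum((c / scale) * e for c, e in zip(ctx, ev)) for ev in EMB]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [[w / total for w in weights] for _ in z]

tokens, order = roar_sample(toy_posterior, enc, length=16, rng=random.Random(0))
print(tokens)
```

Setting `causal=True` skips the shuffle and recovers left-to-right decoding with the same loop.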
### C\.6Hybrid Sampler Details
The hybrid sampler combines continuous\-state EDM\-style denoising with discrete masked refinement, both executed on the same DSL\-finetuned checkpoint\. The high\-level idea is that continuous denoising provides a low\-cost global initialization across allLLtoken positions in parallel, after which a short discrete refinement stage commits the final tokens\. Algorithm[6](https://arxiv.org/html/2605.12836#alg6)gives the procedure; we describe the three stages below\.
#### Stage 1: Continuous EDM denoising to $\sigma_{\mathrm{switch}}$.
We follow the standard EDM \[[10](https://arxiv.org/html/2605.12836#bib.bib45)\] parameterization with a Karras $\sigma$ schedule, but truncate the schedule at an intermediate switching point $\sigma_{\mathrm{switch}}$ rather than running it all the way to $\sigma_{\min}$. Throughout the continuous stage we maintain two views of the state: the physical state $\bm{y}_t\in\mathbb{R}^{L\times d}$ and its rescaled form $\bm{z}_t=\bm{y}_t/\sigma_t^2$, which matches the SNR-parameterized input the DSL denoiser was trained on (§[2.1](https://arxiv.org/html/2605.12836#S2.SS1)). Initialization is $\bm{y}_0=\sigma_{\max}\bm{\epsilon}$ with $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$. At each step we form $\bm{z}_t=\bm{y}_t/\sigma_t^2$, query the converter and backbone to obtain $\hat{\bm{x}}(\bm{z}_t)$, and update $\bm{y}_t$ via either an Euler or Heun step,
$$\bm{y}_{t+1}=\bm{y}_t+(\sigma_{t+1}-\sigma_t)\,\frac{\bm{y}_t-\hat{\bm{x}}(\bm{z}_t)}{\sigma_t}\quad(\text{Euler}),\tag{65}$$
with the Heun second-order correction applied analogously. We optionally inject EDM-style stochastic churn \[[10](https://arxiv.org/html/2605.12836#bib.bib45)\] within a configurable $\sigma$ band to improve sample diversity.
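The continuous stage can be sketched with the Euler update of Eq. 65. Several choices here are assumptions for illustration: the schedule applies Karras spacing directly between $\sigma_{\max}$ and $\sigma_{\mathrm{switch}}$ (the text does not say whether truncation is done this way or by cutting a full schedule), churn is omitted, and only Euler is shown although the reported configuration uses Heun. The stand-in denoiser corresponds to a data distribution concentrated on a single point, which makes the result easy to verify by hand.

```python
def karras_sigmas(n_steps, sigma_max, sigma_end, rho=7.0):
    """n_steps + 1 noise levels from sigma_max down to sigma_end, Karras spacing."""
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_end ** (1.0 / rho)
    return [(hi + (i / n_steps) * (lo - hi)) ** rho for i in range(n_steps + 1)]

def euler_step(y, x_hat, sigma, sigma_next):
    """Eq. 65: y_next = y + (sigma_next - sigma) * (y - x_hat) / sigma."""
    return [yi + (sigma_next - sigma) * (yi - xi) / sigma for yi, xi in zip(y, x_hat)]

def continuous_stage(y0, denoiser, sigmas):
    y = list(y0)
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        z = [yi / sigma ** 2 for yi in y]   # rescaled state z_t = y_t / sigma_t^2
        y = euler_step(y, denoiser(z), sigma, sigma_next)
    return y

# Hypothetical stand-in denoiser: if the data distribution is a single point,
# the posterior mean ignores z and always returns that point.
target = [0.6, -0.8]

def point_mass_denoiser(z):
    return list(target)

sigmas = karras_sigmas(n_steps=16, sigma_max=10.0, sigma_end=0.3)
y0 = [10.0 * 0.5, 10.0 * -1.2]              # y_0 = sigma_max * eps for a fixed eps
y_switch = continuous_stage(y0, point_mass_denoiser, sigmas)
print([round(c, 4) for c in y_switch])
```

For a constant denoiser, each Euler step scales the residual $\bm{y}_t-\hat{\bm{x}}$ by $\sigma_{t+1}/\sigma_t$, so the residual at the switching point is the initial residual times $\sigma_{\mathrm{switch}}/\sigma_{\max}$. This gives a quick sanity check for any implementation of the update.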
#### Stage 2: Projection to discrete tokens at $\sigma_{\mathrm{switch}}$.
At the switching point, we read out token logits from the backbone applied to $\bm{z}_T=\bm{y}_T/\sigma_T^2$, where $T$ indexes the final continuous-stage step and $\sigma_T=\sigma_{\mathrm{switch}}$, and sample an initial discrete sequence using nucleus sampling with parameters $\text{top-}p$ and temperature. This produces an initial token sequence $\bm{s}^{(0)}=(s_1^{(0)},\dots,s_L^{(0)})$ that captures the global structure resolved by the continuous stage but may contain locally inconsistent tokens.
#### Stage 3: Discrete masked refinement\.
We refine $\bm{s}^{(0)}$ with $T_{\mathrm{mdlm}}$ steps of MDLM-style remasking. Tokens are mapped to high-SNR continuous states $\bm{z}_i = \gamma_{\max}\,\operatorname{enc}(s_i)$ with $\gamma_{\max} = 100$ to provide the denoiser with a near-clean input at unmasked positions. At each refinement step, the most uncertain tokens, ranked by $1 - \max_v p_\theta(v \mid \bm{z}_i)$, are remasked (their $\bm{z}_i$ set to $\bm{0}$) and resampled conditioned on the remaining tokens. The mask ratio decays from an adaptive initial value $r_0$ to a target $r_{\mathrm{end}}$ via a cosine schedule. The initial ratio $r_0$ is set from the mean per-token uncertainty at the switching point, clipped to $[r_{\min}, r_{\max}]$, so sequences with sharper switching-point posteriors begin refinement with less aggressive remasking.
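The remasking schedule can be sketched as follows, using the cosine form $r_k = r_{\mathrm{end}} + \tfrac{1}{2}(r_0 - r_{\mathrm{end}})(1 + \cos(\pi k / T_{\mathrm{mdlm}}))$ from Algorithm 6. The helper names and the uncertainty values are made up for illustration, and ties between equally uncertain positions are broken by index, which the text does not specify.

```python
import math


def initial_ratio(uncertainties, r_min=0.2, r_max=0.8):
    """r_0: mean per-token uncertainty at the switching point, clipped to [r_min, r_max]."""
    mean_u = sum(uncertainties) / len(uncertainties)
    return min(max(mean_u, r_min), r_max)


def mask_ratio(k, num_steps, r0, r_end=0.0):
    """Cosine decay that starts at r0 for k = 0 and would reach r_end at k = num_steps."""
    return r_end + 0.5 * (r0 - r_end) * (1.0 + math.cos(math.pi * k / num_steps))


def positions_to_remask(uncertainties, ratio):
    """Indices of the ceil(ratio * L) most uncertain positions, in ascending order."""
    count = math.ceil(ratio * len(uncertainties))
    ranked = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i], reverse=True)
    return sorted(ranked[:count])


u = [0.05, 0.90, 0.10, 0.60, 0.02, 0.40, 0.75, 0.30]  # made-up per-token uncertainties
r0 = initial_ratio(u)                                  # mean is 0.39, inside the clip range
schedule = [mask_ratio(k, 4, r0) for k in range(4)]
first_mask = positions_to_remask(u, schedule[0])
```

Because the loop index runs over $k = 0, \dots, T_{\mathrm{mdlm}} - 1$, the last ratio actually used is slightly above $r_{\mathrm{end}}$ rather than equal to it.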
#### Hyperparameters\.
The configurations reported in Table [2](https://arxiv.org/html/2605.12836#S5.T2) use the Karras $\sigma$ schedule with $\sigma_{\max} = 10$, $\sigma_{\min} = 0.01$, $\rho = 7$, the Heun solver, EDM churn $= 1.41$ within the band $[\sigma_{\min}, \sigma_{\max}]$, and $\sigma_{\mathrm{switch}} = 0.3$. The discrete stage uses $\text{top-}p = 0.9$, temperature $0.8$, $\gamma_{\max} = 100$, mask ratio bounds $[r_{\min}, r_{\max}] = [0.2, 0.8]$, and $r_{\mathrm{end}} = 0$.
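For reference, a Karras-style schedule interpolates linearly in $\sigma^{1/\rho}$. The sketch below builds a schedule that starts at $\sigma_{\max}$ and ends exactly at $\sigma_{\mathrm{switch}}$. How the continuous steps are allocated along the truncated schedule is an assumption here: this version simply uses $\sigma_{\mathrm{switch}}$ as the lower endpoint of the interpolation, and the function name is hypothetical.

```python
def karras_sigmas(num_steps, sigma_max=10.0, sigma_end=0.3, rho=7.0):
    """Return num_steps + 1 noise levels, linear in sigma**(1/rho), from sigma_max to sigma_end."""
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_end ** (1.0 / rho)
    return [(hi + (t / num_steps) * (lo - hi)) ** rho for t in range(num_steps + 1)]


# Eight continuous steps is an arbitrary choice for the illustration.
sigmas = karras_sigmas(num_steps=8)
```

Since the interpolation is linear in $\sigma^{1/\rho}$, the resulting points lie on the same curve as a full $\sigma_{\max} \to \sigma_{\min}$ Karras schedule stopped at $\sigma_{\mathrm{switch}}$; only the placement of the steps along that curve depends on this choice.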
Algorithm 6: DSL hybrid continuous-then-discrete sampler

1: Sequence length $L$; continuous step count $T_{\mathrm{cont}}$; discrete step count $T_{\mathrm{mdlm}}$; sigma schedule $\{\sigma_t\}_{t=0}^{T_{\mathrm{cont}}}$ with $\sigma_0 = \sigma_{\max}$ and $\sigma_{T_{\mathrm{cont}}} = \sigma_{\mathrm{switch}}$; nucleus parameters $(\text{top-}p, \tau)$; refinement parameters $(\gamma_{\max}, r_{\min}, r_{\max}, r_{\mathrm{end}})$.

2: $\bm{y}_0 \leftarrow \sigma_{\max}\,\bm{\epsilon}$, $\bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I}_{L \times d})$ $\triangleright$ Stage 1: continuous EDM

3: for $t = 0, \dots, T_{\mathrm{cont}} - 1$ do

4: $\bm{z}_t \leftarrow \bm{y}_t / \sigma_t^2$

5: $\hat{\bm{x}}_t \leftarrow$ DSL denoiser$(\bm{z}_t)$ $\triangleright$ converter $\to$ backbone

6: Update $\bm{y}_{t+1}$ via Heun step on $(\bm{y}_t - \hat{\bm{x}}_t)/\sigma_t$ (with optional churn)

7: end for

8: $\bm{z}_{T_{\mathrm{cont}}} \leftarrow \bm{y}_{T_{\mathrm{cont}}} / \sigma_{T_{\mathrm{cont}}}^2$ $\triangleright$ Stage 2: project to tokens

9: Compute logits $\ell_i$ at each position; set $u_i \leftarrow 1 - \max_v \mathrm{softmax}(\ell_i)_v$

10: Sample $s_i^{(0)} \sim \text{nucleus}(\ell_i; \text{top-}p, \tau)$ for all $i$

11: $r_0 \leftarrow \mathrm{clip}(\bar{u}, r_{\min}, r_{\max})$ where $\bar{u}$ is the mean of $\{u_i\}$

12: for $k = 0, \dots, T_{\mathrm{mdlm}} - 1$ do $\triangleright$ Stage 3: discrete refinement

13: $r_k \leftarrow r_{\mathrm{end}} + \tfrac{1}{2}(r_0 - r_{\mathrm{end}})(1 + \cos(\pi k / T_{\mathrm{mdlm}}))$

14: Select $\lceil r_k L \rceil$ positions with highest current uncertainty; mask them

15: Set $\bm{z}_i \leftarrow \bm{0}$ at masked positions, $\bm{z}_i \leftarrow \gamma_{\max}\,\operatorname{enc}(s_i^{(k)})$ otherwise

16: Resample $s_i^{(k+1)}$ at masked positions from DSL denoiser logits

17: end for

18: return $\bm{s}^{(T_{\mathrm{mdlm}})}$
## Appendix D Mechanistic Analysis
The main text only briefly summarizes why DSL works in practice\. This appendix provides supporting mechanistic evidence\. The central message is that DSL improves refinement when training exposes the denoiser to the posterior states encountered under self\-generated drafts, and that useful uncertainty is more important than mere local smoothness\.
### D\.1 Cyclic Toy Setup and Correction Example
We use a simple synthetic discrete dataset to illustrate how DSL handles both masked tokens and visible-but-wrong tokens within a single framework. Fix an integer $K$ and consider the dataset consisting of all cyclic shifts of a base sequence:

$$x^{(0)} = [0, 1, \ldots, K-1], \qquad x^{(i)} = \mathrm{roll}(x^{(0)}, i), \;\; i = 0, \ldots, K-1.$$

For visualization, each token is embedded in $\mathbb{R}^2$ on the unit circle with equal angular spacing, yielding a symmetric geometry in which posterior trajectories can be plotted directly.
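A minimal sketch of this toy setup is given below. The `roll` helper follows the right-shift convention of `numpy.roll`; the shift direction used in the paper is not stated, but it does not change the dataset, since all $K$ shifts are included either way. Placing token $k$ at angle $2\pi k / K$ is one natural reading of "equal angular spacing" and is an assumption.

```python
import math


def roll(seq, shift):
    """Cyclic right shift of a list, matching numpy.roll on 1-D input."""
    shift %= len(seq)
    return seq[-shift:] + seq[:-shift] if shift else list(seq)


def cyclic_dataset(K):
    """All K cyclic shifts of the base sequence [0, 1, ..., K-1]."""
    base = list(range(K))
    return [roll(base, i) for i in range(K)]


def circle_embedding(token, K):
    """Embed token k at angle 2*pi*k/K on the unit circle in R^2."""
    angle = 2.0 * math.pi * token / K
    return (math.cos(angle), math.sin(angle))


data = cyclic_dataset(7)
points = [circle_embedding(k, 7) for k in range(7)]
```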
To illustrate correction, suppose the true sequence is ABCDEFG, but the input contains both masked and garbled positions:

$$\begin{array}{cccccccc}\text{Original:} & A & B & C & D & E & F & G \\ \text{Input:} & \underbrace{\_}_{\text{masked}} & \underbrace{\_}_{\text{masked}} & \underbrace{C}_{\text{correct}} & \underbrace{D}_{\text{correct}} & \underbrace{B}_{\text{garbled}} & \underbrace{F}_{\text{correct}} & \underbrace{F}_{\text{garbled}}\end{array}$$

In DSL, masked positions can be assigned $\mathrm{SNR}_i = 0$, while visible but uncertain positions can be assigned small positive SNR. This allows the model to use partial signal without forcing visible tokens to remain fixed. In our cyclic toy, this lets the same dynamics both fill masked positions and revise garbled visible ones.
Figure 3: DSL correction under masked and garbled inputs. Panels: (a) masked + garbled input; (b) reconstruction. The input contains both masked positions and visible-but-wrong tokens. DSL can assign zero SNR to masked tokens and small positive SNR to uncertain visible tokens, allowing the same refinement dynamics to both fill missing values and correct garbled ones.
### D\.2 Why Targeted Remasking Matters
The toy reveals a key asymmetry: masked positions are already mutable, whereas visible\-but\-wrong positions must first be remasked before they can be corrected\. This is exactly why refinement\-time uncertainty must remain informative\. If the model cannot identify which visible tokens should become mutable again, then self\-correction stalls\. Conversely, once posteriors have already sharpened, late remasking tends to induce churn rather than useful revision\.
### D\.3 Robustness to Self\-Generated Intermediate Drafts
Two complementary ingredients explain DSL’s robustness to imperfect self\-generated drafts\.
#### \(1\) Exposure to rollout\-like partially corrupted contexts\.
Refinement\-time drafts mix masked gaps with model\-made mistakes that remain visible\. DSL trains the denoiser under a mixed\-support corruption distribution that covers both endpoint\-like masked contexts and intermediate partially corrupted drafts\. This broadens the support of training contexts and reduces brittleness when the model conditions on imperfect self\-generated states\.
#### \(2\) Smoothness can help, but remasking requires informative confidence\.
A plausible supporting factor is geometric smoothness in continuous embedding space: a denoiser that behaves smoothly across nearby contexts may generalize better across partially\-correct drafts\. However, in a ReMDM\-style discrete sampler, correction is bottlenecked by*which tokens get remasked*\. Thus the decisive property is not just smooth interpolation, but whether confidence remains informative enough to identify low\-confidence visible tokens that should be revised\.
### D\.4 Sampling Diagnostics and Over\-Refinement
We track uncertainty, posterior sharpness, and repair workload in order to diagnose when iterative refinement remains productive and when it begins to over\-refine\.
#### Token uncertainty and entropy\.
Let $p_{\theta,t}(x_i)$ be the categorical distribution at token position $i$ and sampling step $t$. Define average uncertainty and entropy by

$$u_t := \frac{1}{L}\sum_{i=1}^{L}\Bigl(1 - \max_v p_{\theta,t}(x_i = v)\Bigr), \tag{66}$$

$$H_t := \frac{1}{L}\sum_{i=1}^{L}\Bigl(-\sum_v p_{\theta,t}(x_i = v)\log p_{\theta,t}(x_i = v)\Bigr). \tag{67}$$

A rapid drop in $(u_t, H_t)$ indicates posterior sharpening and the onset of a local-repair regime.
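Both diagnostics in Eqs. (66) and (67) are simple averages over positions. The sketch below computes them from a list of per-position categorical distributions; the function names are hypothetical, and the natural logarithm (entropy in nats) is an assumption about the base of $\log$.

```python
import math


def mean_uncertainty(dists):
    """u_t: average over positions of 1 - max_v p(x_i = v)."""
    return sum(1.0 - max(p) for p in dists) / len(dists)


def mean_entropy(dists):
    """H_t: average over positions of the Shannon entropy, in nats."""
    def entropy(p):
        # Terms with zero probability contribute nothing (0 * log 0 := 0).
        return -sum(q * math.log(q) for q in p if q > 0.0)
    return sum(entropy(p) for p in dists) / len(dists)


broad = [[0.25, 0.25, 0.25, 0.25]] * 3  # uniform posteriors at three positions
sharp = [[1.0, 0.0, 0.0, 0.0]] * 3      # one-hot posteriors at three positions
```

On the uniform example both quantities take their maximum for a four-token vocabulary ($u_t = 0.75$, $H_t = \log 4$), and on the one-hot example both are zero.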
#### Nucleus size as a sharpness proxy\.
For a fixed top-$p$ threshold (e.g. $p = 0.9$), define

$$k_t := \frac{1}{L}\sum_i \bigl|\mathrm{TopP}(p_{\theta,t}(x_i), p)\bigr|. \tag{68}$$

Large $k_t$ corresponds to broad uncertainty; small $k_t$ indicates sharp, near-deterministic posteriors.
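A possible implementation of Eq. (68) is sketched below. It assumes that $\mathrm{TopP}(\cdot, p)$ denotes the smallest set of tokens whose cumulative probability reaches $p$, which is the usual nucleus definition but is not spelled out in the text; the function names are hypothetical.

```python
def nucleus_size(probs, top_p):
    """Size of the smallest token set whose total probability is >= top_p."""
    mass = 0.0
    for count, q in enumerate(sorted(probs, reverse=True), start=1):
        mass += q
        if mass >= top_p:
            return count
    return len(probs)


def mean_nucleus_size(dists, top_p=0.9):
    """k_t: nucleus size averaged over token positions."""
    return sum(nucleus_size(p, top_p) for p in dists) / len(dists)


dists = [
    [0.60, 0.25, 0.10, 0.05],  # moderately sharp: three tokens needed
    [1.00, 0.00, 0.00, 0.00],  # near-deterministic: one token needed
    [0.25, 0.25, 0.25, 0.25],  # broad: all four tokens needed
]
k_t = mean_nucleus_size(dists)
```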
#### Repair workload and diminishing returns\.
Let $\mathcal{M}_t$ be the set of positions remasked or rewritten at step $t$. We track the remask ratio and realized token-change rate:

$$r_t := |\mathcal{M}_t| / L, \qquad \Delta_t := \frac{1}{L}\sum_{i=1}^{L}\mathbb{I}\bigl[x_i^{(t)} \neq x_i^{(t-1)}\bigr]. \tag{69}$$

Over-refinement is characterized by non-trivial $r_t$ but tiny $\Delta_t$, indicating diminishing returns.
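The two quantities in Eq. (69) can be computed directly from the remasked index set and two consecutive token sequences. The sketch below uses hypothetical function names and a toy pair of drafts based on the ABCDEFG example.

```python
def remask_ratio(remasked_positions, length):
    """r_t = |M_t| / L, counting each remasked position once."""
    return len(set(remasked_positions)) / length


def change_rate(prev_tokens, curr_tokens):
    """Delta_t: fraction of positions whose token differs from the previous step."""
    if len(prev_tokens) != len(curr_tokens):
        raise ValueError("sequences must have the same length")
    changed = sum(a != b for a, b in zip(prev_tokens, curr_tokens))
    return changed / len(curr_tokens)


prev = list("ABCDBFF")  # draft with two garbled positions (indices 4 and 6)
curr = list("ABCDEFF")  # after one step: index 4 repaired, index 6 unchanged
r_t = remask_ratio([4, 6], len(curr))
delta_t = change_rate(prev, curr)
```

Here two of seven positions were remasked but only one token actually changed, so $\Delta_t < r_t$. A persistent gap of this kind, with $r_t$ non-trivial and $\Delta_t$ close to zero, is the over-refinement signature described above.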
Figure 4: Sampling diagnostics under a fixed step budget. (a) Masking and reveal schedule. (b) Remasking intensity and realized rewrites per token. (c) Posterior sharpening measured by mean max-probability and top-$p$ nucleus size.
#### Over\-refinement hurts coverage\.
After posteriors sharpen, the sampler may still apply non\-trivial remasking while realizing little actual change\. This corresponds to diminishing returns: local likelihood proxies may keep improving while distributional coverage degrades\. This explains the sweet spot observed in the main OWT experiments\.
## Appendix E Calibration and Endpoint\-Smoothing Ablation
This appendix supports one of the paper’s main practical conclusions: endpoint\-like support is necessary, but endpoint\-only training is insufficient\. To obtain usable refinement\-time uncertainty, the ROAR\-style endpoint branch should be smoothed rather than concentrated entirely on atomic extremes\.
### E\.1 Atomic vs\. Smoothed ROAR Endpoints
DSL uses a ROAR\-style endpoint branch to expose the denoiser to masking/reveal\-like states\. A natural baseline is*atomic*endpoint sampling, where ROAR tokens are assigned only the two extreme values
$$\gamma \in \{0, \gamma_{\max}\}.$$

In contrast, our *smoothed* endpoint variant samples from two narrow endpoint ranges,

$$\gamma \sim \mathrm{Unif}(0, \gamma_{\min}) \qquad \text{or} \qquad \gamma \sim \mathrm{Unif}(c\,\gamma_{\max}, \gamma_{\max}),$$

with equal probability.
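The two endpoint samplers can be contrasted with a short sketch. Several values here are placeholders rather than settings from the paper: $\gamma_{\min} = 0.5$ and $c = 0.9$ are invented for illustration, $\gamma_{\max} = 100$ is borrowed from the sampler hyperparameters, and the 50/50 split in the atomic sampler is an assumption (the text specifies equal probability only for the smoothed variant).

```python
import random


def sample_atomic(gamma_max, rng):
    """Atomic endpoints: gamma is exactly 0 or exactly gamma_max (assumed 50/50)."""
    return 0.0 if rng.random() < 0.5 else gamma_max


def sample_smoothed(gamma_min, gamma_max, c, rng):
    """Smoothed endpoints: Unif(0, gamma_min) or Unif(c * gamma_max, gamma_max), equal odds."""
    if rng.random() < 0.5:
        return rng.uniform(0.0, gamma_min)
    return rng.uniform(c * gamma_max, gamma_max)


rng = random.Random(0)
# gamma_min = 0.5 and c = 0.9 are placeholder values, not taken from the paper.
smoothed = [sample_smoothed(0.5, 100.0, 0.9, rng) for _ in range(1000)]
atomic = [sample_atomic(100.0, rng) for _ in range(1000)]
```

The atomic sampler only ever produces two distinct SNR values, whereas the smoothed sampler spreads its draws over two short intervals near the same extremes.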
The motivation is practical\. Atomic endpoints preserve the masking/reveal semantics, but make the endpoint branch too degenerate: near the fully masked and near\-clean limits, corruption becomes almost deterministic, reducing local variation in training contexts\. Smoothed endpoints preserve the same semantics while broadening the support near both extremes\.
### E\.2 Near\-Clean Calibration
We evaluate calibration under teacher forcing on held\-out corrupted inputs\. Compared with atomic ROAR endpoints, smoothed endpoints improve calibration precisely in the near\-clean regime where refinement decisions are made\.
Figure 5: Endpoint smoothing improves near-clean calibration. Panels: (a) ECE at large SNRs; (b) reliability at $\mathrm{SNR} = 100$. We compare atomic ROAR endpoints ($\gamma \in \{0, \gamma_{\max}\}$) to smoothed endpoint ranges. Smoothing reduces ECE at large SNR and yields reliability curves closer to the diagonal near the clean limit.

These results indicate that endpoint-only support is insufficient. A denoiser trained only on atomic extremes can become poorly calibrated near the clean limit, even though those are exactly the states where refinement-time confidence must be trusted.
### E\.3 Downstream Step–Quality Tradeoff
Improved near\-clean calibration translates into better downstream refinement under the same decoder family\.
Figure 6: Endpoint smoothing improves the step–quality tradeoff under fixed decoding. Panels: (a) MAUVE vs. sampling steps; (b) GenPPL vs. sampling steps. Using the same ReMDM-style sampler with identical schedules, smoothed-endpoint checkpoints achieve stronger MAUVE across step budgets while maintaining comparable or better GenPPL.

Together, these results support the practical point made in the main text: endpoint coverage is necessary because masking/reveal states are part of the likelihood family used by refinement, but collapsing training entirely onto atomic endpoints hurts calibration on intermediate and near-clean draft states. Smoothing the endpoint branch improves the quality of refinement-time uncertainty and thereby improves downstream step-efficient decoding.
## Appendix F Additional Experiments
This section provides additional experimental details and diagnostics omitted from the main text for space\. We report full OpenWebText generation results for masked\-refinement decoders and hybrid continuous\-then\-discrete sampling, including GenPPL and sentence\-entropy diagnostics in addition to MAUVE\. We also provide supplementary Text8 likelihood results and implementation details for reproducibility\.
### F\.1 OWT
Here we report the full OpenWebText generation tables. Table [9](https://arxiv.org/html/2605.12836#A6.T9) compares DSL-finetuned checkpoints with masked-refinement baselines across sampling budgets. Table [10](https://arxiv.org/html/2605.12836#A6.T10) reports diagnostics for the selected DSL hybrid continuous-then-MDM handoff configurations used in the main text.
Table 9: OWT ($L = 1024$) unconditional generation under masked-refinement decoders. Each entry is MAUVE $\uparrow$ / GenPPL $\downarrow$ / SentEnt $\uparrow$; bold marks per-column best MAUVE among non-reference rows. Rows marked $\ddagger$ are taken from \[[31](https://arxiv.org/html/2605.12836#bib.bib678)\]; rows marked § from \[[21](https://arxiv.org/html/2605.12836#bib.bib682)\]. All other rows use our protocol. The two DSL rows use the same DSL-finetuned checkpoint with different ReMDM-family remasking schedules.

Table 10: Full diagnostics for the selected DSL hybrid continuous-then-MDM sampler configurations on OWT. All rows use the same DSL-finetuned checkpoint. The main text reports MAUVE only; here we include GenPPL and sentence entropy diagnostics for the selected handoff configurations.
### F\.2 Text8
**Dataset.** We test DSL on the Text8 dataset \[[15](https://arxiv.org/html/2605.12836#bib.bib661)\], a relatively small-scale, character-level text modeling benchmark extracted from English Wikipedia.
**Training setup.** Following previous work \[[3](https://arxiv.org/html/2605.12836#bib.bib656),[1](https://arxiv.org/html/2605.12836#bib.bib65),[14](https://arxiv.org/html/2605.12836#bib.bib36),[24](https://arxiv.org/html/2605.12836#bib.bib664)\], we evaluate all models on short text chunks of length 256 and use the same dataset split and transformer model size to parameterize the denoising models. For all models, including our method and the baselines, we follow the common practice of using 12-layer transformers at a scale similar to GPT2-small \[[24](https://arxiv.org/html/2605.12836#bib.bib664)\]. Our transformer has the same number of heads (12) and hidden dimension (784) as in \[[22](https://arxiv.org/html/2605.12836#bib.bib663)\]. Note that all baseline diffusion language models are trained for one million steps, whereas our model is trained for half a million.
**Baselines.** We compare DSL against state-of-the-art continuous and discrete diffusion models and autoregressive models \[[30](https://arxiv.org/html/2605.12836#bib.bib669)\]. Continuous diffusion baselines include Plaid \[[3](https://arxiv.org/html/2605.12836#bib.bib656)\] and CDCD \[[2](https://arxiv.org/html/2605.12836#bib.bib657)\]. Discrete diffusion baselines include Discrete Diffusion Model (D3PM) \[[1](https://arxiv.org/html/2605.12836#bib.bib65)\], Score Entropy Discrete Diffusion (SEDD) \[[14](https://arxiv.org/html/2605.12836#bib.bib36)\], Masked Diffusion Language Model (MDLM) \[[22](https://arxiv.org/html/2605.12836#bib.bib663)\], and MD4 \[[24](https://arxiv.org/html/2605.12836#bib.bib664)\]. For autoregressive models, we choose the any-order autoregressive models ARDM \[[7](https://arxiv.org/html/2605.12836#bib.bib668)\] and MAC \[[25](https://arxiv.org/html/2605.12836#bib.bib671)\]. We also include the flow-based methods IAF/SCF \[[33](https://arxiv.org/html/2605.12836#bib.bib673)\], AR Argmax Flow \[[8](https://arxiv.org/html/2605.12836#bib.bib670)\], and Discrete Flow \[[29](https://arxiv.org/html/2605.12836#bib.bib672)\], as well as Multinomial Diffusion.
## Appendix G Limitations and Broader Impacts
DSL is a foundational method for diffusion\-based language generation\. Potential positive impacts include improving the efficiency, controllability, and flexibility of non\-autoregressive text generation systems\. At the same time, improvements in text generation quality may increase risks associated with synthetic text, including spam, misinformation, impersonation, or other harmful content generation\. We do not release a deployed model or user\-facing system, and we encourage future applications of DSL\-style methods to follow standard safeguards for generative models, including misuse monitoring, content filtering, and careful release practices\.
## Appendix H Assets and Licenses
Table [11](https://arxiv.org/html/2605.12836#A8.T11) summarizes the existing assets used in this work\. We use these assets only for research purposes and follow their stated licenses and terms of use\.
Table 11: Existing assets used in this work\.
## Appendix I Additional Qualitative Samples
### I\.1 Text8
Example 1:\[BOS\]s to some anarchists to view the betrayal of destiay and show that the essential character of dut theme survived felled to before nine zero bc and were replaced most of all knightnote have changed dramatically basing on edmund lynn s name death on body f\[EOS\]
Example 2:\[BOS\]ics at a grant university s trade psychology marketing a statement us by saddam is an example of the abortion of aids most modern reputation says the body agrees to the pope whose arm is considered one of the most intrigued piecses of his body see also c\[EOS\]
Example 3:\[BOS\]h the irish protectors caused margraves and other powerful men and hence to present a confusingly successful tour the quotes more interpreted by mary newton puts the rejection of religion lewis and wats honored having led radcliffe men to question the la\[EOS\]
### I\.2 OWT
The following examples are generated from the DSL-finetuned checkpoint with the ReMDM-loop sampler (principled $\eta_{\text{cap}}(T)$).
Example with $T=128$. BIP that only need the hard press\. Now I think it works very well against your blindness\.
At RT11
Here’s how it works\.
I’ve adjusted the fast slider to go above the 60’s\. Eventually, myshoot rate is now 4 percent\. It works all on the T\-Mobile bands\. If I’m too low, I can cross it up the corner, but if it’s with a stick hand \(Carl Rasmus could cut\), I infrotm a little near the edge of the box\. If I’m going too low, I’ll have the ability to adjust target angles and will still only know what it looks like\.
The perfect, though disappointing feeling\.
Links:
Lettered September 29st, 2014
are your no planning?
no planning is a natural lifestyle
@khovgeksriumki
forgo where the night of day breaks \-go to Eavriumki is going to start the week\.
hahahaha no a bet on the day\. Daily is on the off right right now\. From that I look at the page of his resume and ask if you can break the week, so if the week is fast, I don’t have real hassle asking if he’s going to break it\. So I think he’s going to start to go slow, maybe a couple may be there\.
Signed on September 29st, 2014
and he’s not going to go after that week\.
\(out/of\-in his catalog\)
On this week’s 2nd, I had a lot of green green warning about my start of the weekend so I knew I’m going to lose\. I didn’t want to lose any extra weight or anything else, I just wanted to the rest time\. At a point in the day I got nothing, I could get anything, but the end of the 60’s was not what I wanted to go\. I got myself on track so let’s do a routine at 20am\. Before I could wake 20am, I realized I needed to wake up at 20am\. It didn’t seem a good day\.
Before I had it the next day, I bought a new SIM card, and I spent this week backwards when I was fixed\. It was just not a good thing, but I kept the bell circle off for both days\.
So it took about 7 minutes to go to work, then 5 minutes on the field between 8 and 7, and had a weight roll out around 5 minutes later\. Then the clock on the card still constantly waiting for a delay in the middle of the weight world\.
A couple of times, people say, “You don’t have time for this” or “well\.” I’m able to stream people and online, listening to radio stations and live\. I’m able to tell a lot of people, “Yes, you were able to track people down in front of you today\. And I’m thankful I had told myself to do a few things a year and a half\.” For me, I get hundreds of texts and messages every day, so it’s right to get to people\.
NOLOM feel good
It kind of feels better having spoken to someone now\. I’m hungry, but he’s going to do it already\. He, I found him happy but I’m halfway through it right now\. I’ll bet upon everything from I could buy a couple PTTs from a Gdigy and a T\-Mobile, and I don’t want to buy\.
The coolest thing now is the ability to fly, long distances, making swimmers with taking pictures\. I could do not shoot a flyw if I would like to let it shoot a kmm for me, still out there, still at least where I am\. I’ll have got something to show\. I don’t worry about a flyw that I shoot for when it’s nowhere\.
I could have shoot a group that’s not certain I can shoot, but I come to the angle I have to shoot and where my camera moves it\. When I do shoot, I’m going to search, so I can see where a wave’s coming and see if it’s coming on right\. Writing and writing is a way to be free from ideas and scrubbing from thought\.
I drive a slow car\. I’ll take the drive over it and drag it into a right\. If I can take it it, I bend my head, and I’m going to take it\. And
Example with $T=256$. NASts from age 19 to Cuba for Cuba and Cuba for Cuba with the exception of the house in Havana, a tourism official says\.
Fiesta Fiaza is missing from ages 11 and 12 at the time she was conceived, but she happily adopted her younger daughter and adopted her at age\. “The best question we have right now is how much she wants to give away,” says Arushal Chomam, Managing Director of Case & Privacy with the International Investigations Center\. “We will give a guess and as soon as we are certified certified, please check for us for some time\.”
See also: Cruise Cruise Cruise of Banks Search for MissingFiesta Fiaza – At $15\. Ft Ft : $15\. Ft Ft \- Offer Nov\. Nov\. \(1601\) Grand Competition Winner Contest Winner New Original Siftth\-Officile by an Italian Pararonous Unit
Generally, Fume does handcrafted, and she takes away from her collection\. Even though she has not used her collection’s collection or design, she says she’ll put on any of her first set Grashinaza, something that won’t be used again with her collection for years\. Not too much is said and given what she will be buying in the amount of time\. Fume’s collection is also in an eye of local cultural heritage\. “It’s disappointing and whether there are more or more examples of local heritage, but I just said there is something kind of African heritage there that is indicative of people who love Bayay Surplitia,” she says\.
In addition to an appreciation of culturals and historical traditions, Fume has an appreciation of indigenous cultures as a part of her collection\. She, too, will showcase their cultures across color, ethnicity, art, design, and indigenous traditions\. “When I’ll give out, I hope my collection will be just as unique,” she says\. “You know, when I come out and when I give out, my collection is just as strong,” she continues\. “I’m surprised we have a green color, a dark color, or a black color, or something\. It’s just what I hope is a positive experience\.”
Fume currently holds about 100 pieces of Tokara Kirara Suguiaiko just out of $12,002 in Albany, N\.Y\. “It’s a mixed\-together piece, there are a few strokes of stroke or the nib brush that give an impression, but it’s also so beautiful, it’s so cute and I think is so deeply related\. My style is the inner hybrid of a stylistic mixed\-together piece, not that or anything\. It makes me feel comfortable and relaxed\. It really feels like a marriage of something and the nuance of something you’ve been doing and done\. It feels like fun and it feels like a touch in the room\. It’s not very dark when you’re trying to make this touch with a different color palette as it does and sometimes you’re so distracted when you want to wake up and enjoy it\. It’s fun and it inspires me as much as I want to do, and I am proud of my time and my work\.”
All artwork which comes from a detailed dark coloring color tones to a subtle dark colored background scheme\.
Estimated 1 12,013
Estimated 1 12,n1 12,013 \(approximately $100\)
Approximately 100 pieces of Tokara Suguiaiko Grashinaza hand\-crafted of Tokara Suguiaiko
Details: As follows
Price: All sales, and pricing may vary and is a requirement for normal times\. As soon as this site is on lockdown we will do an FAQ tour in stores for up to $50\.
Cost: Available online \(approx $8 \+ $9\) when you can order order online\. For more details what we can think we know by which orders ship on eBay\. There’s a web form in which you can just print the details, grab something else you can use to sign or keep one of your items on eBay page\. You’ll see some really nice zilzali, Mugabe, Picasso, and a stock, but you can have a strategy of ordering or keeping one of the items shipped directly to eBay\.
• CY CYV Custom Design \(rounded by
This is where you’ll get sketches, drawings, and your own delivery information\. What went wrong? This is where we say I’ll go see my own artist drawings\.
This is where CY CYV Custom Design/aka delivery info here for an overview of what you can choose for whom you can use to construct a shop of the
Example with $T=512$. See also: California is not trying to impose taxes on other states\. Instead, states actually adopt the California tax code\.
The best form of arguments already makes the great news when comparing taxes, local payouts, local profits, and national profits\. But those arguments are too far from explaining how it works and are too far from explaining basic science of how well it works\.
I have heard a lot stories about the Corporate Sales Tax\. It is that big smaller corporations were forced to tax because they had no enough to pay taxes, other big corporations did not pay taxes because they deduct the taxes for other things like sales\.
It is called no taxation: tax, corporate tax, the income is more than a dollar, more than a ‘ with a few multimillionaires and no place in society\. It is that you don’t have to pay your taxes all of the time\.
Hawking Taxes in the US:
Corporations \(usually big corporations\) have learned how no taxation is the best way to live life\. Big companies don’t do things that kind of, people, have to pay the taxes, just by converting them to the US\. They do other things that kind of: big corporations like being bailed out of the US\. Instead of having to make a profit, they get paid\.
There is a way in the world, a baker, not baker at all, where they are in trouble\. It is because people have to think big corporations can get paid, because those companies hate the system and are trying to hide it\.
If you become an investor, you get paid\. Think of 0\.8
Raising is the most important thing in the economy\. It is it it that you aren’t buying for $1\.00 but if you pay the taxes only for the second one, it becomes 25% of your income\. If you pay the taxes only, but according to the law, the rest will not stay on your income until it becomes 25% of your income\. If you deduct a couple of deductions for that number, it won’t stay on your taxable income, or the total, until the income tax is paid\.
You don’t see any gains from the tax until $1 million\. By that time you get 25% of your income, the tax that now has reinvested into your\. You can spend the dividends, save your taxes, and use it to retirement\.
And of course, the money used to pay the tax doesn’t change it needs to go up or higher\. But the more the tax is paid, the more money it needs to spend\.
Then the car gets paid off for the insurance, what the $2\.50 gets and the $2\.50 gets are not additional taxes\.
But the amount that gets paid by less than the person gets charged\. Tax taxes are taxed on some of the earns a person\.
The big difference is that this is how the real income gets taxed\. Corporate income is one of the highest levels of taxation\. The real income income can be easily derived from the excise tax and the chained tax\.
Compensation Ductes: Say you spend money on on a business and you earn the capital to ship it\. You pay taxes on ‘the corporate owner \(or local company\)’ and you earn the same amount of capital costs to ship it\.
You say your industry is money capital which consists of lots of money from other companies\. First you own the profits\. Then you give it to the government and use it to others\. This is not true\. You owe billions of dollars in income tax taxes to them\. No profit government for ‘gas makers’, no taxes for ‘gas makers\.’
The capital dividend is not something that they use\. If they trade with the people of capital, they pay them not, but that capital equals it\.
Digid Benefit Ductes: America float these profits from the coffers of corporations to pay their taxes\. They can and collect at the same time the rate of interest rates that they do\. Let’s wait until we don’t realize how much of a nuisance we don’t need to pay our taxes\. Do we need to pay our taxes with a 2\.5 trillion hole?
A couple of months ago we were actually paying taxes on a company that would provide to certain people id any ill\. Another company had fixed higher tax rates for individual hospitals and other hospitals and hospitals, while another company had fixed 40 times the rate of federal tax rates for individual insurers, services and other health care providers, while another company had fixed more than
Example with $T=1024$. ongs of Africans played a role in determining population status, family structure, political etc\. Traditionally, Africa was political symbol of each person of their own, ethnic or ethnic status\. Instead, Africa was political symbol of goodwill for U\.S\. citizens of all races and cultures\. In this case, African Americans have not looked back to the 1950s\.
The split isn’t because very few Africans \(both present and later\) arrived in the U\.S\. History does not necessarily fix the split\. An analysis of “African Remnantss” discovered that far fewer African Americans arrived in the U\.S than they did in the mid\-9\.01% of today’s population from today\. The split is primarily due to the influx of Chinese people\.
The Chinese people from the past are not migrants who traveled to coast and coast\. It may indeed have been allowed to enter the U\.S\. under open borders\. This is probably partially, since the Chinese people are nowhere fully integrated into the U\.S\., and the borders for the past 25 years are still porous\. Under open borders, the nation still belong to the Republic of Somalia\. I would suggest that it was the nation of its own or its neighbors who made the decision\.
Certainly the nation’s productivity has not been a positive\. One scholar whose family was Awaisan Ahmed told Al\-Sal Al we have a “culture of immigrants originating from the founders of the Somali Republic of Somalia\.” Another former head of Oshala College concluded that the nation’s productivity fell by an average 3\.3% in 2013\. That’s what I’d guess from a fellow Al\-Sahala college scholar and the Al\-Qaeda fundamentalist\.
That’s not to say that the nation is a good neighbor — that the U\.S\. is a state that cares for all\. As adjusted capita employment levels and as indexed adjusted income, productivity has fallen 15% since then\. In addition, the new state provides health assistance for women, the elderly and all except for the state basic health care\. How has the income fallen by nearly 15% in the rest of the country? So has the working class\.
Let me begin with the opportunity of success in the U\.S\. as a nation nation – just from being considered a great nation and having a heart, but being able to get the heart that can go anywhere else\. I’m a proud American born in the U\.S\., my life has gone through periods of war, occupation, occupation, employment as a citizen, as a city, elected as a citizen and a citizen of government\. In World War II, America had only due to Russia and Russia in an effort to preserve, build and expand its bases to bring it up as a nation with a rich tradition of generosity and generosity\. In the U\.S\., I salute the great fathers as an instrumental to bring the world again\. One that truly belongs to all Americans\.
Today, I must formally recognize my power that U\.S\.\-centric solutions must not simply solve the world’s problems\. As many southern American countries where I am live, I must recognize what I am doing – as a citizen – and our God Spirit to bring us closer to America\. The U\.S\. Federal Government \(Fed\-Fed\)
Everything Doesn’t \(CNN Talk Anything\)
Fox News News
If statements from the 9/11/11 and 13/11 attacks are accurate, the hijackers or any false theories are simply going to blow up\.
Shortly after attempting 9/9/11 to obtain documents from the CIA, they began questioning whether the hijackers, or if they wanted the documents\.
The Pentagon 9/11acking conspiracy was a rum being featured on a CNN show\. Why have America’s world stature shift people’s attention to credible answers?
Some of the stories are simply nuts\.
One 9/9/11 theory that has been spun on to be “fabged” by CIA’s secret secret operations \(something U\.S\. authorities were able to track back\), in theory it may have been planted by CIA\. This story remains kept wraps ever after leaked 9/9/11 video Video Goes Be Allenged \- Message to President Obama\.
The narrative of the ongoing 9/9/11 is anything but false and remains as the mainstream media reporting has been consistent\.
On Friday, a statement from the White House said that U\.S\. intelligence personnel were being intercepted by a computer used to collect them between January and October in 9/9/11\.
The statement said that data collected when retrieving them from US intelligence operations operating in Moscow is classified because no records exist\. Why don’t we know that there are so many no records? i\.e\. the CIA’s no records?
It seems like President Obama himself