Drifting Objectives for Refining Discrete Diffusion Language Models

arXiv cs.CL 05/20/26, 04:00 AM Papers
discrete-diffusion language-models drifting-objectives text-generation refinement token-drift sampling
Summary
This paper introduces TokenDrift, a drifting objective that refines discrete diffusion language models by lifting categorical predictions to a continuous semantic space for anti-symmetric drifting, significantly improving generation quality under a fixed number of denoising steps.
arXiv:2605.19470v1 Announce Type: new Abstract: Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.
# Drifting Objectives for Refining Discrete Diffusion Language Models
Source: [https://arxiv.org/html/2605.19470](https://arxiv.org/html/2605.19470)
Daisuke Oba¹, Hiroki Furuta, Naoaki Okazaki¹,²,³

¹Institute of Science Tokyo, ²AIST, ³NII LLMC

{daisuke.oba@nlp., okazaki@}comp.isct.ac.jp

###### Abstract

Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.

Project page: [https://daioba.github.io/tokendrift/](https://daioba.github.io/tokendrift/)

## 1 Introduction

Discrete diffusion language models (DDLMs) provide a non-autoregressive alternative to left-to-right generation by iteratively denoising corrupted token sequences \[[1](https://arxiv.org/html/2605.19470#bib.bib51),[13](https://arxiv.org/html/2605.19470#bib.bib89),[22](https://arxiv.org/html/2605.19470#bib.bib20),[25](https://arxiv.org/html/2605.19470#bib.bib21),[23](https://arxiv.org/html/2605.19470#bib.bib10),[17](https://arxiv.org/html/2605.19470#bib.bib5),[29](https://arxiv.org/html/2605.19470#bib.bib67),[31](https://arxiv.org/html/2605.19470#bib.bib91),[5](https://arxiv.org/html/2605.19470#bib.bib90),[24](https://arxiv.org/html/2605.19470#bib.bib92)\]. Their practical behavior is therefore governed not only by the learned denoiser, but also by the quality achieved under a fixed number of denoising steps. While many works improve sampling by designing new schedules, samplers, or distilled generators, we ask a complementary question: *can the training objective itself refine an existing DDLM so that the same sampler produces better samples at the same inference budget?*

Recent drifting\-based methods for continuous generative models offer a promising training\-side perspective for this question\[[4](https://arxiv.org/html/2605.19470#bib.bib79)\]\. They replace part of the iterative correction normally performed during sampling with a fixed\-point training objective \(Fig\.[1](https://arxiv.org/html/2605.19470#S1.F1); left\): generated samples are moved along an attraction–repulsion field, toward nearby data samples and away from nearby model samples\. This is appealing for DDLM refinement because it targets the generated samples themselves, rather than only the reconstruction of corrupted tokens, and therefore provides a direct objective for improving sample quality under a fixed inference budget\.

However, transferring drifting to DDLMs is not a direct substitution. In continuous drifting, generated samples or their feature representations can be nudged along a drift direction and used as stop-gradient training targets. For text, the generator instead produces categorical distributions over tokens. If these distributions are collapsed to hard tokens before feature extraction, the resulting feature-space loss no longer provides a useful gradient to the model logits. Thus, applying drifting to DDLMs requires a differentiable bridge from categorical predictions to the continuous feature space where the drift is defined.

![Refer to caption](https://arxiv.org/html/2605.19470v1/x1.png)

Figure 1: Overview of our drifting formulation for discrete diffusion language models. Original drifting constructs a stop-gradient target $h^{\star}$ by moving the generated feature $h$ along a drift field $V$. For discrete text, hard token sampling blocks gradients, so we lift token probabilities to soft embeddings, compute the drift target in feature space, and backpropagate the loss to logits.

We formulate TokenDrift, a drifting objective for refining DDLMs through soft-token features (Fig. [1](https://arxiv.org/html/2605.19470#S1.F1); right). Rather than sampling hard tokens, TokenDrift feeds the frozen semantic encoder with expected token embeddings computed from the model’s categorical predictions. This creates a differentiable path from the feature-space drifting loss back to the DDLM logits. In that feature space, we estimate the same attraction–repulsion drift as in continuous drifting and train the generator toward a stop-gradient drifted feature target.

A central part of our study is formulation. There are several plausible ways to connect a feature-space drift to a categorical generator: one can match the drifted target directly in feature space, convert the drift into a mirror teacher in logit space, or combine either objective with the original denoising loss. We compare these alternatives under matched budgets in Section [4.3](https://arxiv.org/html/2605.19470#S4.SS3), and find that direct feature-space drifting provides the most stable trade-off between likelihood-based quality and entropy. We further show that the soft-token lift is essential: replacing it with a hard straight-through token surrogate \[[3](https://arxiv.org/html/2605.19470#bib.bib86)\] severely degrades generation quality, indicating that preserving predictive uncertainty in the semantic feature space is important for effective drifting.

In controlled continual-training experiments on OpenWebText (OWT) \[[7](https://arxiv.org/html/2605.19470#bib.bib84)\], TokenDrift substantially improves fixed-NFE generation quality over both a pretrained masked diffusion language model, MDLM \[[22](https://arxiv.org/html/2605.19470#bib.bib20)\], and ordinary continuation training. The same objective also improves a uniform-state discrete diffusion backbone, DUO \[[23](https://arxiv.org/html/2605.19470#bib.bib10)\], suggesting that the effect is not specific to masked diffusion. Together, these results position TokenDrift as a refinement objective for existing DDLMs: rather than designing a new sampler or distilling a low-NFE student, we keep the model class and sampler fixed and improve the generator through the training objective.

Contributions. In summary, we formulate drifting as a trainable refinement objective for DDLMs, identify the design choices needed to make it effective for discrete text, and show controlled improvements on both masked and uniform-state diffusion language models.

## 2 Preliminaries

We briefly review the ingredients needed to formulate drifting objectives for refining DDLMs: categorical denoising models, drifting\-based fixed\-point learning, and the difficulty of applying continuous drifting directly to text\.

### 2.1 Discrete Diffusion Language Models

Let $\mathcal{V}$ be a vocabulary and let $x=(x_{1},\dots,x_{L})\in\mathcal{V}^{L}$ be a token sequence. A DDLM parameterized by $\theta$ defines a corruption process over token sequences and trains a denoiser $f_{\theta}$ to predict clean tokens from corrupted inputs. At each denoising step, the model outputs logits $\ell_{\theta}=f_{\theta}(x_{t})\in\mathbb{R}^{L\times|\mathcal{V}|}$, which induce position-wise categorical distributions $p_{\theta,t}=\mathrm{softmax}(\ell_{\theta,t})\in\Delta^{|\mathcal{V}|-1}$.

Different DDLMs instantiate the corruption process differently: masked diffusion corrupts tokens toward a mask state, while uniform\-state diffusion corrupts tokens toward a uniform distribution over the vocabulary\. In both cases, the denoiser predicts categorical token distributions and generation proceeds by applying the denoiser for a chosen number of steps\.

### 2.2 Drifting-Based Fixed-Point Learning

Drifting defines a fixed-point training rule for continuous generators \[[4](https://arxiv.org/html/2605.19470#bib.bib79)\]. For a generated sample or feature $y$, let $V(y;P_{\mathrm{data}},P_{\theta})$ be a drift field estimated by attraction toward nearby data samples and repulsion from nearby model samples. Drifting forms a stop-gradient target

$$y^{\star}=\mathrm{sg}\!\left(y+\alpha V(y;P_{\mathrm{data}},P_{\theta})\right),$$

and trains the generator to match this target. When raw sample-space distances are not meaningful, the same construction can be applied in a frozen feature space.

A key structural property is anti-symmetry: swapping the attractive and repulsive distributions reverses the drift direction. Thus, if the model and data distributions coincide, then attraction and repulsion cancel and $V(\cdot;P,P)=0$, so the drift signal vanishes at equilibrium. We aim to preserve this fixed-point structure for DDLMs.

### 2.3 The Discrete-Text Interface

Continuous drifting assumes generated samples or features that can be additively shifted and differentiated through\. DDLMs instead output categorical token distributions: hard tokenization breaks gradients to logits, while direct probability updates must respect the simplex\. Thus, applying drifting to DDLMs requires a differentiable bridge from categorical predictions to the continuous feature space where drift is defined\.

## 3 TokenDrift

We formulate drifting as a refinement objective for DDLMs. Given categorical token predictions, TokenDrift lifts them to soft-token features, computes an anti-symmetric attraction–repulsion drift in a frozen semantic space, and trains the generator toward a stop-gradient drifted feature target. This makes feature-space drifting differentiable with respect to token logits.

Notation. We use $i$ for samples, $t$ for token positions, and $v$ for vocabulary indices. For sample $i$, the generator outputs logits $\ell_{i}\in\mathbb{R}^{L\times|\mathcal{V}|}$ and distributions $p_{i}=\mathrm{softmax}(\ell_{i})\in\mathbb{R}^{L\times|\mathcal{V}|}$, with $p_{i,t}\in\Delta^{|\mathcal{V}|-1}$.

### 3.1 Soft-Token Feature Lift

Let $E\in\mathbb{R}^{|\mathcal{V}|\times d}$ be the embedding matrix used by a frozen semantic encoder $\phi$, let $x_{0}^{(i)}$ be a clean sequence, and let $\bar{x}^{(i)}$ be its corrupted input. Given $\bar{x}^{(i)}$, the generator outputs distributions $p_{i}$. We form encoder inputs by setting $\tilde{e}_{i,t}=p_{i,t}E$ for predicted positions $t\in\mathcal{M}_{i}$ and $\tilde{e}_{i,t}=E[\bar{x}^{(i)}_{t}]$ otherwise, where $\mathcal{M}_{i}$ is defined by the underlying diffusion backbone. The generated feature and the corresponding real feature are

$$h_{i}=\phi(\tilde{e}_{i})\in\mathbb{R}^{m},\qquad u_{i}=\phi(E[x_{0}^{(i)}])\in\mathbb{R}^{m}.$$

Thus, $h_{i}$ is the feature of the model-completed sequence, while $u_{i}$ is the feature of the corresponding clean sequence. Because $\tilde{e}_{i}$ depends on $\ell_{i}$ through $p_{i}$ on predicted positions, feature-space drifting losses can backpropagate to the logits. For backbones without observed-token positions, such as uniform-state diffusion, we set $\mathcal{M}_{i}=\{1,\dots,L\}$ and use soft embeddings at all positions.
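
The lift can be illustrated with a minimal, dependency-free Python sketch. The three-token vocabulary, the embedding matrix `E`, and the helper name `soft_token_lift` are illustrative assumptions rather than the authors' code, and the frozen encoder $\phi$ is omitted, so the sketch stops at the encoder inputs $\tilde{e}_{i,t}$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_token_lift(logits, corrupted_ids, predicted_positions, embedding):
    """Encoder inputs: p_t E on predicted positions, E[x_bar_t] elsewhere."""
    dim = len(embedding[0])
    inputs = []
    for t, position_logits in enumerate(logits):
        if t in predicted_positions:
            probs = softmax(position_logits)
            # Expected embedding: probability-weighted mix of all token embeddings.
            inputs.append([
                sum(p * row[d] for p, row in zip(probs, embedding))
                for d in range(dim)
            ])
        else:
            # Observed token: ordinary embedding lookup.
            inputs.append(list(embedding[corrupted_ids[t]]))
    return inputs

# Toy setup (assumed): |V| = 3 tokens, d = 2, sequence length L = 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
corrupted = [2, 0]  # position 0 is predicted, position 1 keeps observed token 0
lifted = soft_token_lift(logits, corrupted, predicted_positions={0}, embedding=E)
```

With uniform logits at position 0, the lifted input is the average of the three embedding rows, while position 1 is a plain lookup. Passing every position as predicted mirrors the uniform-state case described above.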

### 3.2 Anti-Symmetric Drift Estimation

For each generated feature $h_{i}$, we build a positive reference set $\mathcal{P}_{i}$ from real data features and a negative reference set $\mathcal{N}_{i}$ from generated features. The drift follows the attraction–repulsion structure of drifting models \[[4](https://arxiv.org/html/2605.19470#bib.bib79)\]: positives pull the sample toward the data distribution, while negatives push it away from the current model distribution.

For $u_{j}\in\mathcal{P}_{i}$ and $v_{k}\in\mathcal{N}_{i}$, we compute temperature-scaled affinities $s^{+}_{ij}=-\|h_{i}-u_{j}\|_{2}^{2}/\tau$ and $s^{-}_{ik}=-\|h_{i}-v_{k}\|_{2}^{2}/\tau$. Following the original drifting construction, we jointly normalize positive and negative affinities to obtain weights $W^{+}_{ij}$ and $W^{-}_{ik}$. The corresponding positive and negative barycenters define the temperature-$\tau$ drift field:

$$b_{i}^{+}=\sum\nolimits_{j}W^{+}_{ij}u_{j},\qquad b_{i}^{-}=\sum\nolimits_{k}W^{-}_{ik}v_{k},\qquad V_{i}^{(\tau)}=b_{i}^{+}-b_{i}^{-}.$$

Thus, the drift points from nearby model features toward nearby data features.
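
A small Python sketch of a single-temperature drift follows. The exact joint normalization of the original drifting construction is not spelled out in this section, so the sketch substitutes a separate softmax within each reference set; that substitution, the toy feature vectors, and the function names are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def barycenter(h, references, tau):
    """Affinity-weighted mean of the references, with affinities -||h - r||^2 / tau."""
    # Assumption: weights are normalized within this set only (per-set softmax),
    # standing in for the joint normalization used by the original method.
    weights = softmax([-squared_distance(h, r) / tau for r in references])
    return [sum(w * r[d] for w, r in zip(weights, references)) for d in range(len(h))]

def drift(h, positives, negatives, tau):
    """V = b_plus - b_minus: toward nearby positives, away from nearby negatives."""
    b_plus = barycenter(h, positives, tau)
    b_minus = barycenter(h, negatives, tau)
    return [p - n for p, n in zip(b_plus, b_minus)]

h = [0.0, 0.0]
data_feats = [[1.0, 0.0], [1.0, 1.0]]     # stand-ins for real features u_j
model_feats = [[-1.0, 0.0], [0.0, -1.0]]  # stand-ins for generated features v_k

v = drift(h, data_feats, model_feats, tau=0.5)
v_swapped = drift(h, model_feats, data_feats, tau=0.5)  # roles exchanged
v_equal = drift(h, data_feats, data_feats, tau=0.5)     # identical reference sets
```

Under this stand-in normalization, exchanging the two reference sets negates the drift, and identical sets give a zero drift, which is the anti-symmetry and equilibrium behavior described in Section 2.2.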

Multi-temperature drift. We compute $V_{i}^{(\tau)}$ for each $\tau\in\mathcal{T}$, normalize each temperature by a scalar batch-level RMS scale, and average:

$$s^{(\tau)}=\sqrt{\operatorname{mean}_{i}\|V_{i}^{(\tau)}\|_{2}^{2}+\epsilon},\qquad V_{i}=\frac{1}{|\mathcal{T}|}\sum\nolimits_{\tau\in\mathcal{T}}\frac{V_{i}^{(\tau)}}{s^{(\tau)}}.$$

This prevents any single temperature scale from dominating the final drift.
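
The RMS normalization can be sketched in a few lines of Python. The per-temperature drifts are supplied as precomputed toy values, and the function name and the choice of `eps` are assumptions for illustration.

```python
import math

def combine_temperatures(drifts_by_tau, eps=1e-8):
    """Average per-temperature drifts after dividing each by its batch RMS scale."""
    taus = list(drifts_by_tau)
    batch = len(drifts_by_tau[taus[0]])
    dim = len(drifts_by_tau[taus[0]][0])
    combined = [[0.0] * dim for _ in range(batch)]
    for tau in taus:
        vectors = drifts_by_tau[tau]
        # s^(tau) = sqrt(mean_i ||V_i^(tau)||^2 + eps), one scalar per temperature.
        mean_sq = sum(sum(x * x for x in v) for v in vectors) / batch
        scale = math.sqrt(mean_sq + eps)
        for i, v in enumerate(vectors):
            for d in range(dim):
                combined[i][d] += v[d] / scale / len(taus)
    return combined

# Toy drifts (assumed): the same directions at two very different magnitudes.
drifts = {
    0.02: [[3.0, 4.0], [0.0, 0.0]],
    0.2: [[0.3, 0.4], [0.0, 0.0]],
}
V = combine_temperatures(drifts)
```

Although the raw drifts differ by a factor of ten, each temperature contributes almost equally after normalization, so neither dominates the average.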

### 3.3 Feature-Space Fixed-Point Objective

Our main objective is a direct feature-space fixed-point loss. Given the current generated feature $h_{i}$ and drift $V_{i}$, we form the stop-gradient target $h_{i}^{\star}=\mathrm{sg}\!\left(h_{i}+\alpha V_{i}\right)$, where $\alpha>0$ is a drift scale. The drifting objective is

$$\mathcal{L}_{\mathrm{drift}}=\frac{1}{2B}\sum\nolimits_{i=1}^{B}\left\|h_{i}-h_{i}^{\star}\right\|_{2}^{2}.$$

Because $h_{i}^{\star}$ is frozen, the feature-space gradient is proportional to $-V_{i}$, so minimizing this loss pushes $h_{i}$ in the drift direction. Through the soft-token lift, this feature-space signal backpropagates to the token logits.

Relation to the base objective. The drifting objective can be used either alone or in combination with the original discrete diffusion training objective. Unless otherwise stated, our main method uses the drifting objective as the primary training signal, while the combined variant is evaluated as an ablation (Sec. [4.3](https://arxiv.org/html/2605.19470#S4.SS3)).

### 3.4 Alternative Formulation via Mirror Teacher

As an alternative to direct feature-space matching, we also consider converting the feature-space drift into a token-level teacher distribution. Let $h_{i}=h(\ell_{i})$ be the feature induced by the logits. We define

$$g_{i}=\nabla_{\ell_{i}}\!\left(h(\ell_{i})^{\top}\mathrm{sg}(V_{i})\right),\qquad\ell_{i}^{\star}=\mathrm{sg}(\ell_{i}+\eta g_{i}),\qquad p_{i}^{\star}=\mathrm{softmax}(\ell_{i}^{\star}).$$

The softmax maps the updated logits back to valid categorical distributions, yielding a simplex-aware mirror teacher. We evaluate two matching losses on predicted positions $\mathcal{M}_{i}$, distributional KL and logit-space MSE:

$$\mathcal{L}_{\mathrm{mirror\text{-}KL}}=\frac{1}{B}\sum\nolimits_{i}\sum\nolimits_{t\in\mathcal{M}_{i}}\mathrm{KL}(p_{i,t}^{\star}\,\|\,p_{i,t}),\qquad\mathcal{L}_{\mathrm{mirror\text{-}MSE}}=\frac{1}{B}\sum\nolimits_{i}\sum\nolimits_{t\in\mathcal{M}_{i}}\|\ell_{i,t}^{\star}-\ell_{i,t}\|_{2}^{2}.$$

We use these mirror objectives as alternative formulations in the study of Sec. [4.3](https://arxiv.org/html/2605.19470#S4.SS3). Appendix [D](https://arxiv.org/html/2605.19470#A4) provides the corresponding theoretical justification: the mirror teacher is the KL-proximal simplex update induced by the logit-space direction (Prop. [D.1](https://arxiv.org/html/2605.19470#A4.Thmtheorem1)), this direction locally improves alignment with the feature-space drift (Prop. [D.2](https://arxiv.org/html/2605.19470#A4.Thmtheorem2)), and the construction preserves the equilibrium property of anti-symmetric drifting (Prop. [D.3](https://arxiv.org/html/2605.19470#A4.Thmtheorem3)).
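
For a single position, the mirror teacher and both matching losses can be sketched as follows. The direction `g` is a hypothetical hand-picked vector (in practice it would come from automatic differentiation of $h(\ell)^{\top}\mathrm{sg}(V)$), and the function names and the value of `eta` are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mirror_teacher(logits, direction, eta):
    """Teacher logits l* = l + eta * g and the teacher distribution softmax(l*)."""
    teacher_logits = [l + eta * g for l, g in zip(logits, direction)]
    return teacher_logits, softmax(teacher_logits)

def kl_divergence(p_star, p):
    """KL(p* || p), assuming p is strictly positive (true for softmax outputs)."""
    return sum(a * math.log(a / b) for a, b in zip(p_star, p) if a > 0.0)

logits = [0.0, 0.0, 0.0]
g = [1.0, 0.0, -1.0]  # hypothetical logit-space direction
teacher_logits, p_star = mirror_teacher(logits, g, eta=0.5)
p = softmax(logits)

mirror_kl = kl_divergence(p_star, p)                                    # KL variant
mirror_mse = sum((a - b) ** 2 for a, b in zip(teacher_logits, logits))  # MSE variant
```

The teacher stays on the simplex by construction, and both losses are zero exactly when the direction is zero, which is consistent with the equilibrium property mentioned above.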

### 3.5 Algorithm

Algo. [1](https://arxiv.org/html/2605.19470#alg1) summarizes one training step of TokenDrift, our main feature-space drifting objective. This applies to different DDLM backbones; the corruption process and the set of predicted positions $\mathcal{M}_{i}$ are backbone-specific. Here $\operatorname{gather}(\cdot)$ denotes cross-device feature gathering when enabled.

Algorithm 1: One training step of TokenDrift

1. Inputs: clean mini-batch $x_{0}^{1:B}$, corruption process $q(\bar{x}\mid x_{0})$, frozen encoder $\phi$, queues $\mathcal{Q}_{\mathrm{real}},\mathcal{Q}_{\mathrm{gen}}$
2. sample corrupted inputs $\bar{x}^{(i)}\sim q(\bar{x}\mid x_{0}^{(i)})$ for all $i=1,\dots,B$
3. compute logits $\ell_{i}=f_{\theta}(\bar{x}^{(i)})$ and distributions $p_{i}=\mathrm{softmax}(\ell_{i})$ for all $i$
4. form encoder inputs $\tilde{e}_{i}$ using $p_{i,t}E$ for $t\in\mathcal{M}_{i}$ and $E[\bar{x}_{t}^{(i)}]$ otherwise
5. compute generated and real features $h_{i}=\phi(\tilde{e}_{i})$ and $u_{i}=\phi(E[x_{0}^{(i)}])$ for all $i$
6. form current feature sets $\mathcal{U}_{\mathrm{cur}}=\operatorname{gather}(\{u_{i}\}_{i=1}^{B})$ and $\mathcal{H}_{\mathrm{cur}}=\operatorname{gather}(\{h_{i}\}_{i=1}^{B})$
7. build references $\mathcal{U}=\mathcal{U}_{\mathrm{cur}}\cup\mathcal{Q}_{\mathrm{real}}$ and $\mathcal{H}=\mathcal{H}_{\mathrm{cur}}\cup\mathcal{Q}_{\mathrm{gen}}$
8. compute $V_{i}=\mathrm{Drift}_{\mathcal{T}}\!\left(h_{i},\mathrm{sg}(\mathcal{U}),\mathrm{sg}(\mathcal{H}\setminus\{h_{i}\})\right)$ for all $i$
9. construct feature targets $h_{i}^{\star}=\mathrm{sg}(h_{i}+\alpha V_{i})$ for all $i$
10. compute $\mathcal{L}_{\mathrm{drift}}=\frac{1}{2B}\sum_{i=1}^{B}\|h_{i}-h_{i}^{\star}\|_{2}^{2}$
11. backpropagate through $\mathcal{L}_{\mathrm{drift}}$ and update $\theta$
12. push detached current features $\mathrm{sg}(\{u_{i}\}_{i=1}^{B})$ / $\mathrm{sg}(\{h_{i}\}_{i=1}^{B})$ into $\mathcal{Q}_{\mathrm{real}}$ / $\mathcal{Q}_{\mathrm{gen}}$, evicting the oldest entries
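
The queue bookkeeping in the reference-building and push steps of the algorithm can be sketched with the Python standard library. The class and function names are hypothetical, features are plain lists of floats, and storing tuples stands in for detaching tensors; none of this is taken from the authors' implementation.

```python
from collections import deque

class FeatureQueue:
    """FIFO store of detached feature vectors with a fixed capacity."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item on overflow.
        self._items = deque(maxlen=capacity)

    def push(self, features):
        for feature in features:
            # Store an immutable copy, standing in for a stop-gradient/detach.
            self._items.append(tuple(feature))

    def snapshot(self):
        return list(self._items)

def build_references(real_feats, gen_feats, i, real_queue, gen_queue):
    """Positives: current real features plus queue.
    Negatives: the other current generated features plus queue (h_i itself is excluded)."""
    positives = [tuple(u) for u in real_feats] + real_queue.snapshot()
    negatives = [tuple(h) for k, h in enumerate(gen_feats) if k != i]
    negatives += gen_queue.snapshot()
    return positives, negatives

real_queue = FeatureQueue(capacity=3)
gen_queue = FeatureQueue(capacity=3)
real_queue.push([[1.0], [2.0]])
real_queue.push([[3.0], [4.0]])  # capacity 3, so the oldest entry (1.0,) is evicted
gen_queue.push([[9.0]])

positives, negatives = build_references(
    real_feats=[[5.0], [6.0]], gen_feats=[[7.0], [8.0]], i=0,
    real_queue=real_queue, gen_queue=gen_queue,
)
```

In a training loop, the current features would be pushed only after the parameter update, matching the order of the final step of the algorithm.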

Inference. At inference time, TokenDrift simply uses the original DDLM sampler. All drift-related components are training-time only and add no sampling-time cost.

### 3.6 Theoretical Properties of the TokenDrift Objective

We summarize the main properties of the proposed objective; formal statements and proofs are given in Appendix [C](https://arxiv.org/html/2605.19470#A3). First, the soft-token lift makes the feature-space loss trainable for categorical generators (Prop. [C.1](https://arxiv.org/html/2605.19470#A3.Thmtheorem1)). Since $\ell_{i}\rightarrow p_{i}=\mathrm{softmax}(\ell_{i})\rightarrow\tilde{e}_{i}=p_{i}E\rightarrow h_{i}=\phi(\tilde{e}_{i})$ is differentiable on predicted positions, gradients from the drifting loss can backpropagate to the logits. Hard token selection (e.g., $\mathrm{argmax}$) would break this path and turn the feature-space target into a non-differentiable signal.

Second, the fixed-point loss does more than match features: it induces the drift-following update prescribed by the computed field (Prop. [C.2](https://arxiv.org/html/2605.19470#A3.Thmtheorem2)). For the target $h_{i}^{\star}=\mathrm{sg}(h_{i}+\alpha V_{i})$,

$$\nabla_{h_{i}}\frac{1}{2}\|h_{i}-h_{i}^{\star}\|_{2}^{2}=-\alpha V_{i},\qquad\nabla_{\ell_{i}}\mathcal{L}_{\mathrm{drift}}=-\alpha J_{h_{i}}(\ell_{i})^{\top}V_{i},$$

where the second identity holds up to the $1/B$ batch-averaging factor in $\mathcal{L}_{\mathrm{drift}}$. Thus, the continuous feature-space drift is pulled back through the soft-token feature map to produce a logit-space update direction for the DDLM.
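
The pullback identity can be checked numerically on a toy case. The sketch below assumes a single sample ($B=1$) with one position, an identity encoder so that $h(\ell)=\mathrm{softmax}(\ell)E$, and hand-picked values for `E`, `V`, and `alpha`; all of these are illustrative assumptions. It compares the analytic direction $-\alpha J^{\top}V$ with central finite differences of the loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embedding matrix (assumed): |V| = 3 tokens, d = 2. The encoder is the identity.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

def feature(logits):
    """h(l) = softmax(l) E for a single position."""
    probs = softmax(logits)
    return [sum(p * row[d] for p, row in zip(probs, E)) for d in range(2)]

def drift_loss(logits, target):
    """0.5 * ||h(l) - h*||^2 with the target held fixed (stop-gradient)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(feature(logits), target))

logits0 = [0.2, -0.1, 0.4]
V = [1.0, -0.5]  # hypothetical drift vector
alpha = 0.1
target = [h + alpha * v for h, v in zip(feature(logits0), V)]

# Analytic pullback: -alpha * J^T V, where (J^T V)_u = p_u * ((E V)_u - p . (E V)).
probs = softmax(logits0)
EV = [sum(row[d] * V[d] for d in range(2)) for row in E]
mean_EV = sum(p * ev for p, ev in zip(probs, EV))
analytic = [-alpha * p * (ev - mean_EV) for p, ev in zip(probs, EV)]

# Central finite differences of the loss with respect to each logit.
step = 1e-6
numeric = []
for u in range(3):
    up = list(logits0)
    down = list(logits0)
    up[u] += step
    down[u] -= step
    numeric.append((drift_loss(up, target) - drift_loss(down, target)) / (2 * step))
```

The two gradients agree to numerical precision, and the analytic gradient sums to zero across the vocabulary, as expected for any gradient that passes through a softmax.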

Third, the objective inherits the equilibrium structure of drifting\. If the attraction–repulsion field is anti\-symmetric, then the drift vanishes when the data and model feature distributions match \(Cor\.[C\.3](https://arxiv.org/html/2605.19470#A3.Thmtheorem3)\)\.

These properties do not imply global convergence of the nonconvex generator, but they show that the proposed objective is differentiable, drift\-following, and fixed\-point consistent in the geometry where the drift is defined\.

## 4 Experiments

We evaluate whether drifting objectives improve discrete diffusion language models under matched training and inference budgets. We conduct continual training on OpenWebText (OWT) \[[7](https://arxiv.org/html/2605.19470#bib.bib84)\] across two DDLM parameterizations: masked diffusion and uniform-state diffusion. For each parameterization, we start from a released checkpoint and compare against ordinary continuation training while keeping the backbone, initialization, data, additional compute budget, and inference budget matched. Our experiments focus on two questions: whether drifting improves generation quality beyond extra optimization alone, and which discrete adaptation of drifting is most effective.

### 4.1 Experimental Setup

We provide full implementation details in Appendix[F](https://arxiv.org/html/2605.19470#A6); here we summarize the controlled setup used in the main experiments\.

Benchmarks and backbones. Our main benchmark is OpenWebText (OWT) \[[7](https://arxiv.org/html/2605.19470#bib.bib84)\], where we perform continual training from released DDLM checkpoints. We consider two discrete diffusion parameterizations: masked diffusion, MDLM \[[22](https://arxiv.org/html/2605.19470#bib.bib20)\], using the released 170M MDLM checkpoint, and uniform-state diffusion, DUO \[[23](https://arxiv.org/html/2605.19470#bib.bib10)\], using the released checkpoint. In all controlled comparisons, methods within the same backbone start from the same checkpoint and use the same data, additional training budget, and evaluation protocol.

Controlled comparisons. For each backbone, we compare the released checkpoint, ordinary continuation training with the original DDLM objective, and TokenDrift. These baselines isolate whether improvements come from the drifting objective rather than from additional optimization alone. We also evaluate key components of drifting formulations (Sec. [3.4](https://arxiv.org/html/2605.19470#S3.SS4)), including mirror-teacher variants, to study which discrete adaptation of drifting is most effective.

Drifting implementation. TokenDrift uses a frozen copy of the same pretrained backbone checkpoint as the semantic encoder $\phi$ to compute pooled sequence features. It estimates an anti-symmetric attraction–repulsion drift from real and generated references built from the current distributed micro-batch and FIFO queues of detached features; unless otherwise stated, the default queue size is $1024$ for both real and generated features. The generator is trained with the feature-space drifting objective, and the drift is estimated using the multi-temperature scheme of the original drifting method \[[4](https://arxiv.org/html/2605.19470#bib.bib79)\] with $\mathcal{T}=\{0.02,0.05,0.2\}$.

Evaluation. We focus on unconditional text generation and report generative perplexity (Gen.-PPL) computed by GPT-2 Large \[[20](https://arxiv.org/html/2605.19470#bib.bib82)\]. We also report entropy as a diagnostic for diversity and degeneration. All main comparisons use matched numbers of function evaluations (NFEs).
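
As a rough illustration of the two reported quantities, the Python sketch below computes a perplexity from per-token log-probabilities and a per-sample unigram entropy. It assumes the log-probabilities have already been produced by an external scoring model (no GPT-2 call is made here), and it assumes that entropy means the entropy of each sample's empirical token distribution; the exact entropy definition used in the paper is not stated in this section.

```python
import math
from collections import Counter

def generative_perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over all scored tokens.

    token_log_probs: one list of natural-log token probabilities per generated
    sample, assumed to come from an external evaluator language model.
    """
    total = sum(sum(seq) for seq in token_log_probs)
    count = sum(len(seq) for seq in token_log_probs)
    return math.exp(-total / count)

def unigram_entropy(token_ids):
    """Entropy in nats of the empirical token distribution within one sample."""
    counts = Counter(token_ids)
    n = len(token_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy scores (assumed): every token is assigned probability 0.25 by the evaluator.
scored = [[math.log(0.25)] * 4, [math.log(0.25)] * 2]
ppl = generative_perplexity(scored)

repetitive = unigram_entropy([7, 7, 7, 7])  # a degenerate, fully repeated sample
diverse = unigram_entropy([1, 2, 3, 4])     # four distinct tokens
```

A low perplexity paired with near-zero entropy would indicate repetitive, degenerate text, which is why the two numbers are read together.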

Table 1: OWT few-step generation results for masked diffusion (MDLM) at ckpt. 13k. Non-gray rows are controlled comparisons using the same additional training budget; values are mean ± SD over 3 seeds. Gray rows provide contextual results from distillation methods and are not controlled baselines. Entropy is reported as a diversity diagnostic, and bold denotes the best Gen.-PPL per NFE.

Table 2: OWT few-step generation results for uniform-state diffusion (DUO) at ckpt. 13k. Non-gray rows are controlled comparisons using the same additional training budget; values are mean ± SD over 3 seeds. The gray row provides a contextual DCD reference and is not a controlled baseline. Entropy is reported as a diversity diagnostic, and bold denotes the best Gen.-PPL per NFE.

![Refer to caption](https://arxiv.org/html/2605.19470v1/x2.png)

Figure 2: Training dynamics from released MDLM \[[22](https://arxiv.org/html/2605.19470#bib.bib20)\] and DUO \[[23](https://arxiv.org/html/2605.19470#bib.bib10)\] checkpoints. As drifting training progresses, Gen.-PPL decreases across the NFEs, showing that our drifting objective progressively improves fixed-budget generation quality rather than merely selecting a better final checkpoint.

### 4.2 Main Results

Masked diffusion. Table [1](https://arxiv.org/html/2605.19470#S4.T1) reports controlled OWT continual-training results on top of MDLM in the few-step regime, using 4, 8, and 16 NFEs. The controlled block keeps the checkpoint, backbone, data, training budget, and evaluation protocol fixed across MDLM, ordinary continuation training, and TokenDrift. Under this matched setting, TokenDrift substantially improves Gen.-PPL at every reported NFE. Ordinary continuation training barely changes the original checkpoint, indicating that the gains come from the drifting objective rather than extra optimization alone. Within this few-step range, entropy decreases moderately but remains far from severe collapse.

Uniform-state diffusion. Table [2](https://arxiv.org/html/2605.19470#S4.T2) evaluates the same few-step regime on DUO, a uniform-state discrete diffusion backbone. Starting from the released DUO checkpoint and using the same additional training budget, TokenDrift again improves Gen.-PPL over both the original checkpoint and ordinary continuation training at every reported NFE. The entropy values remain close to the original DUO and continuation baselines in this range, suggesting that the quality gains do not come from an obvious diversity collapse. Together with the MDLM results, this shows that the drifting objective is effective across both masked and uniform-state DDLM parameterizations.

Progressive improvement from drifting. Figure [2](https://arxiv.org/html/2605.19470#S4.F2) tracks Gen.-PPL during the optimization. As the drifting objective is optimized, Gen.-PPL decreases across the evaluated NFEs, and the gap to both the original checkpoint and ordinary continuation training widens over time. This is the key empirical signal: with TokenDrift, optimizing a drifting loss consistently translates into better token-level sample quality under fixed decoding budgets.

Contextual distillation references. The gray rows in Tables [1](https://arxiv.org/html/2605.19470#S4.T1) and [2](https://arxiv.org/html/2605.19470#S4.T2) provide contextual comparisons to specialized distillation methods. These methods use different objectives, training pipelines, and sometimes different teachers or students, so they are not controlled baselines for our claims. Nevertheless, TokenDrift achieves lower Gen.-PPL than the reported references at all listed NFEs.

Table 3: Ablation on the input to the encoder $\phi$ for TokenDrift on top of MDLM at ckpt 13k. *Soft* feeds the expected embedding $p_{t}E$ (continuous mixture over the vocabulary), while *Hard* feeds the embedding of the argmax token $E[\arg\max p_{t}]$. ∗ denotes the default configuration used in the main table.

### 4.3 Formulation Study: How Should Drifting Be Applied to Discrete Diffusion LMs?

Soft-token lift is essential. We explore whether our soft-token lift can be replaced by a hard-token straight-through estimator \[[3](https://arxiv.org/html/2605.19470#bib.bib86)\]. The soft variant feeds the encoder the expected embedding $p_{t}E$, while the hard variant uses the argmax embedding in the forward pass and passes gradients through

$$y_{t}^{\mathrm{ST}}=y_{t}^{\mathrm{hard}}-\mathrm{sg}(p_{t})+p_{t},\qquad y_{t}^{\mathrm{hard}}=\mathrm{onehot}(\arg\max\nolimits_{v}p_{t,v}).$$

Table [3](https://arxiv.org/html/2605.19470#S4.T3) shows that, despite this surrogate gradient path, the hard variant performs much worse and collapses entropy. Thus, differentiability alone is not sufficient: effective drifting requires exposing the semantic encoder to the model’s soft predictive distribution.
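
The difference between the two encoder inputs can be seen in a forward-pass-only Python sketch. It does not model the straight-through gradient, and the embedding matrix and the two example distributions are assumptions chosen for illustration; it only shows that the hard input discards how confident the prediction is.

```python
def expected_embedding(probs, embedding):
    """Soft input: the expected embedding p_t E."""
    dim = len(embedding[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding)) for d in range(dim)]

def argmax_embedding(probs, embedding):
    """Hard input: the embedding of the most likely token."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return list(embedding[best])

# Toy embedding matrix (assumed): |V| = 3 tokens, d = 2.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Two predictions with the same argmax but very different confidence.
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.30, 0.30]

soft_confident = expected_embedding(confident, E)
soft_uncertain = expected_embedding(uncertain, E)
hard_confident = argmax_embedding(confident, E)
hard_uncertain = argmax_embedding(uncertain, E)

hard_inputs_identical = hard_confident == hard_uncertain
soft_gap = [a - b for a, b in zip(soft_confident, soft_uncertain)]
```

Both predictions map to the same hard input, so the encoder cannot tell them apart in the forward pass, whereas the soft inputs differ in proportion to the change in confidence.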

Target space: feature or logit? We compare three drift objectives: direct feature-space L2 matching and two alternatives, logit-space L2 and KL matching through a mirror teacher (Sec. [3.4](https://arxiv.org/html/2605.19470#S3.SS4)). Table [4](https://arxiv.org/html/2605.19470#S4.T4) (upper) shows that all three variants are effective, indicating that the main gain comes from the drifting signal itself rather than a single target construction. However, they trade off likelihood and entropy differently. We therefore use feature-space L2 as the default, stable formulation for TokenDrift.

Interaction with the base denoising objective. The lower block of Table [4](https://arxiv.org/html/2605.19470#S4.T4) shows that adding the original MDLM loss is not consistently beneficial. For logit-space objectives, the base loss largely cancels the effect of drifting; for feature-space L2, it remains competitive but does not improve over drift-only training. This suggests that, in our refinement setting, drifting is not merely a regularizer on top of denoising but can serve as the primary training signal.

Table 4: Design-choice ablation of TokenDrift at ckpt 13k. Space: "*feature*" compares teacher–student hidden representations; "*logit*" compares output distributions in the mirror space. Dist.: L2/KL denotes the matching metric between the anchor and the teacher point. Values are mean ± SD over 3 seeds. Bold = best Gen.-PPL per NFE. ∗ Default configuration used in the main table.

Table 5: Queue-size ablation for TokenDrift at ckpt 5k. $|\mathcal{Q}_{\text{gen,real}}|$ = queue size. Values are mean ± SD over 3 seeds. ∗ Default configuration used in the main table.

Table 6: Ablation of the attraction–repulsion ratio (i.e., $b_i^{+}$ vs. $b_i^{-}$) at ckpt 5k. Values are mean ± SD over 3 seeds. ∗ Default setting carried over from the main table.

### 4.4 Drift Estimation Ablations

Reference-set size. Table [5](https://arxiv.org/html/2605.19470#S4.T5) varies the number of real and generated reference features used to estimate the drift field. Gen.-PPL improves consistently as the queue size increases: small queues give noisy drift estimates, while the default queue size of 1024 gives the best result at every reported NFE. Entropy stays in a comparable range, suggesting that the gain comes from a better-estimated drift direction rather than from a collapse in output diversity. Thus, sufficiently large reference sets are important for making the attraction–repulsion field reliable.
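As a rough illustration of what a bounded reference set can look like, the sketch below keeps the most recent feature vectors in a first-in-first-out buffer. The class name, the FIFO eviction policy, and the use of one buffer per source (real and generated) are assumptions made for this example; the section only states that queues of real and generated features are used and that the default size is 1024.

```python
from collections import deque

class FeatureQueue:
    """Fixed-capacity FIFO buffer of feature vectors (hypothetical helper)."""

    def __init__(self, max_size=1024):
        # deque with maxlen silently drops the oldest items once full
        self._items = deque(maxlen=max_size)

    def push(self, features):
        """Append a batch of feature vectors, evicting the oldest if needed."""
        self._items.extend(features)

    def snapshot(self):
        """Return the current reference set as a list, oldest first."""
        return list(self._items)

    def __len__(self):
        return len(self._items)

# One queue per source, mirroring the real/generated split described above.
real_queue = FeatureQueue(max_size=1024)
gen_queue = FeatureQueue(max_size=1024)
```

With such a buffer, a larger `max_size` means each drift estimate averages over more reference features, which is the quantity varied in Table 5.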

Attraction–repulsion balance. Table [6](https://arxiv.org/html/2605.19470#S4.T6) tests the anti-symmetric structure inherited from drifting \[[4](https://arxiv.org/html/2605.19470#bib.bib79)\]. The balanced field $V_i = b_i^{+} - b_i^{-}$, which preserves the feature-space equilibrium property discussed in Corollary [C.3](https://arxiv.org/html/2605.19470#A3.Thmtheorem3), performs best across the evaluated NFEs. Attraction-only or attraction-heavy variants remain usable but are consistently weaker, while repulsion-only or repulsion-heavy variants fail catastrophically with extremely high Gen.-PPL and inflated entropy. This shows that the attraction–repulsion balance is not just a theoretical convenience: it is necessary for a stable and useful drift estimate in discrete diffusion language models.
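The role of the balance can be illustrated with a small sketch of an attraction–repulsion field. The kernel-weighted mean shift used here for $b^{+}$ and $b^{-}$ is an assumed stand-in (the paper's estimator is defined in Sec. 3.2, outside this section), and `w_attract` and `w_repel` are hypothetical knobs for the ratio studied in Table 6.

```python
import math

def mean_shift(x, refs, tau=1.0):
    """Kernel-weighted average displacement from x toward the points in refs.
    Weights are a softmax over negative squared distances (assumed kernel)."""
    d2 = [sum((r[k] - x[k]) ** 2 for k in range(len(x))) for r in refs]
    m = min(d2)
    w = [math.exp(-(d - m) / tau) for d in d2]
    z = sum(w)
    return [sum(w[j] * (refs[j][k] - x[k]) for j in range(len(refs))) / z
            for k in range(len(x))]

def drift(x, real, gen, w_attract=1.0, w_repel=1.0, tau=1.0):
    """V = w_attract * b_plus - w_repel * b_minus; balanced when both are 1."""
    b_plus = mean_shift(x, real, tau)    # attraction toward real features
    b_minus = mean_shift(x, gen, tau)    # repulsion from generated features
    return [w_attract * a - w_repel * b for a, b in zip(b_plus, b_minus)]

# Toy 2-D features (hypothetical values).
x = [0.0, 0.0]
real = [[2.0, 0.0], [2.0, 0.5]]
gen = [[-1.0, 0.0], [-1.0, 0.5]]
print("balanced drift:", drift(x, real, gen))
print("swapped sets:  ", drift(x, gen, real))
print("identical sets:", drift(x, real, real))
```

With equal weights, swapping the two reference sets flips the sign of the field, and identical sets give zero drift. Unequal weights break both properties, which is one way to read why the unbalanced variants behave differently.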

### 4.5 Qualitative Samples

MDLM: Upon takingHerEven so I was under sprinkler cabin and TommyFire set ablaze fire.There wasMy igies bucket here thaif weand my nose in Pac-Pac, my step ...

TokenDrift: The police were asked to register the case on Thursday after forensic examination showed blood and bruises on the body. File photo: A poison victim was found ...

Figure 3: A generated example: MDLM (top) and TokenDrift (bottom) at $\text{NFE}=16$.

Figure [3](https://arxiv.org/html/2605.19470#S4.F3) shows representative generations from MDLM and TokenDrift at $\text{NFE}=16$. The MDLM sample exhibits the typical failure mode of low-budget diffusion decoding, with fragmented phrases, abrupt topic shifts, and weak local coherence. In contrast, TokenDrift produces a substantially more fluent and document-like continuation with coherent syntax and a plausible news-style structure. This qualitative comparison is consistent with the Gen.-PPL improvements in Tables [1](https://arxiv.org/html/2605.19470#S4.T1) and [2](https://arxiv.org/html/2605.19470#S4.T2).

## 5 Related Work

Distillation for discrete diffusion language models. Distillation methods such as SDTT \[[6](https://arxiv.org/html/2605.19470#bib.bib76)\], Di4C \[[8](https://arxiv.org/html/2605.19470#bib.bib85)\], and DiDi-Instruct \[[30](https://arxiv.org/html/2605.19470#bib.bib75)\] train DDLM-based models for low-NFE generation using pretrained teachers. They remain closely related to our setting, but solve a different optimization problem: their goal is to distill a teacher into a few-step generator, whereas we refine the same starting DDLM checkpoint with a drifting objective under matched additional training budgets and NFEs. We therefore report distillation results as contextual references rather than controlled baselines.

Flow-based formulations for discrete sequences. Another line of work studies flow-based formulations for discrete or simplex-valued sequence generation \[[11](https://arxiv.org/html/2605.19470#bib.bib74), [19](https://arxiv.org/html/2605.19470#bib.bib7), [21](https://arxiv.org/html/2605.19470#bib.bib87), [26](https://arxiv.org/html/2605.19470#bib.bib8), [16](https://arxiv.org/html/2605.19470#bib.bib9)\]. Rather than refining an existing denoising model and sampler, they define generation through continuous denoising dynamics, flow maps, simplex-valued probability paths, or discrete flow-matching objectives. We instead keep the DDLM checkpoint and sampler fixed, and test whether a drifting objective improves the same model under matched training budgets and NFEs.

## 6 Conclusion

We introduced drifting objectives for refining discrete diffusion language models\. By lifting categorical predictions to soft\-token features, our formulation makes feature\-space drifting trainable for DDLMs while preserving the attraction–repulsion structure of continuous drifting\. Controlled experiments with OWT show that this objective improves fixed\-NFE generation quality over matched continuation baselines and transfers beyond masked diffusion to a uniform\-state backbone\. Our results suggest that drifting is a practical training objective for improving existing DDLMs without changing their sampler or relying on specialized distillation\.

Discussions. The main limitation of TokenDrift is that its effectiveness depends on the quality of the frozen semantic feature space used to estimate drift. In addition, our experiments focus on unconditional generation with DDLM backbones; extending the same objective to conditional generation, instruction-following settings, or larger-scale diffusion language models remains an important direction for future work.

## Acknowledgement

This work was partially supported by JSPS KAKENHI Grant Number 25H01137 and JST K Program Japan Grant Number JPMJKP24C3\.

## References

- \[1\] (2021) Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 34, pp. 17981–17993.
- \[2\] A. Beck and M. Teboulle (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31(3), pp. 167–175.
- \[3\] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- \[4\] M. Deng, H. Li, T. Li, Y. Du, and K. He (2026) Generative modeling via drifting. arXiv preprint arXiv:2602.04770.
- \[5\] J. Deschenaux, C. Gulcehre, and S. S. Sahoo (2026) The diffusion duality, chapter II: $\psi$-samplers and efficient curriculum. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=RSIoYWIzaP)
- \[6\] J. Deschenaux and C. Gulcehre (2025) Beyond autoregression: fast LLMs via self-distillation through time. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=uZ5K4HeNwd)
- \[7\] A. Gokaslan and V. Cohen (2019) OpenWebText corpus. Note: [http://Skylion007.github.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus)
- \[8\] S. Hayakawa, Y. Takida, M. Imaizumi, H. Wakaki, and Y. Mitsufuji (2025) Distillation of discrete diffusion through dimensional correlations. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=jCEl0aJpF6)
- \[9\] D. Israel, G. V. d. Broeck, and A. Grover (2025) Accelerating diffusion llms via adaptive parallel decoding. arXiv preprint arXiv:2506.00413.
- \[10\] J. Kivinen and M. K. Warmuth (1997) Exponentiated gradient versus gradient descent for linear predictors. Information and Computation 132(1), pp. 1–63.
- \[11\] C. Lee, J. Yoo, M. Agarwal, S. Shah, J. Huang, A. Raghunathan, S. Hong, N. M. Boffi, and J. Kim (2026) Flow map language models: one-step language modeling via continuous denoising. arXiv preprint arXiv:2602.16813.
- \[12\] Z. Liu, Y. Yang, Y. Zhang, J. Chen, C. Zou, Q. Wei, S. Wang, and L. Zhang (2025) Dllm-cache: accelerating diffusion large language models with adaptive caching. arXiv preprint arXiv:2506.06295.
- \[13\] A. Lou, C. Meng, and S. Ermon (2024) Discrete diffusion modeling by estimating the ratios of the data distribution. In Forty-first International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=CNicRIVIPA)
- \[14\] O. Luxembourg, H. Permuter, and E. Nachmani (2025) Plan for speed–dilated scheduling for masked diffusion language models. arXiv preprint arXiv:2506.19037.
- \[15\] X. Ma, R. Yu, G. Fang, and X. Wang (2025) Dkv-cache: the cache for diffusion language models. arXiv preprint arXiv:2505.15781.
- \[16\] A. K. Monsefi, N. Bhendawade, M. R. Ciosici, D. Culver, Y. Zhang, and I. Belousova (2025) Fs-dfm: fast and accurate long text generation with few-step diffusion language models. arXiv preprint arXiv:2509.20624.
- \[17\] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. External Links: 2502.09992, [Link](https://arxiv.org/abs/2502.09992)
- \[18\] D. Oba, D. Bollegala, M. Kaneko, and N. Okazaki (2026) Stopping computation for converged tokens in masked diffusion-LM decoding. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=PzhNnMepgl)
- \[19\] P. Potaptchik, J. Yim, A. Saravanan, P. Holderrieth, E. Vanden-Eijnden, and M. S. Albergo (2026) Discrete flow maps. arXiv preprint arXiv:2604.09784.
- \[20\] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8), pp. 9.
- \[21\] D. Roos, O. Davis, F. Eijkelboom, M. Bronstein, M. Welling, İ. İ. Ceylan, L. Ambrogioni, and J. van de Meent (2026) Categorical flow maps. arXiv preprint arXiv:2602.12233.
- \[22\] S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems 37, pp. 130136–130184.
- \[23\] S. S. Sahoo, J. Deschenaux, A. Gokaslan, G. Wang, J. T. Chiu, and V. Kuleshov (2025) The diffusion duality. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=9P9Y8FOSOk)
- \[24\] S. S. Sahoo, J. Lemercier, Z. Yang, J. Deschenaux, J. Liu, J. Thickstun, and A. Jukic (2026) Scaling beyond masked diffusion language models. arXiv preprint arXiv:2602.15014.
- \[25\] J. Shi, K. Han, Z. Wang, A. Doucet, and M. Titsias (2024) Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems 37, pp. 103131–103167.
- \[26\] H. Stark, B. Jing, C. Wang, G. Corso, B. Berger, R. Barzilay, and T. Jaakkola (2024) Dirichlet flow matching with applications to dna sequence design. In Proceedings of the 41st International Conference on Machine Learning, pp. 46495–46513.
- \[27\] Q. Wei, Y. Zhang, Z. Liu, P. Zeng, Y. Wang, B. Qi, D. Liu, and L. Zhang (2025) Accelerating diffusion large language models with slowfast sampling: the three golden principles. arXiv preprint arXiv:2506.10848.
- \[28\] C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie (2025) Fast-dllm: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. arXiv preprint arXiv:2505.22618.
- \[29\] J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7b: diffusion large language models. arXiv preprint arXiv:2508.15487.
- \[30\] H. Zheng, X. Liu, X. Kong, N. Jiang, Z. Hu, W. Luo, W. Deng, and G. Lin (2026) Ultra-fast language generation via discrete diffusion divergence instruct. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=mtdyZsa47V)
- \[31\] F. Zhu, R. Wang, S. Nie, X. Zhang, C. Wu, J. Hu, J. Zhou, J. Chen, Y. Lin, J. Wen, et al. (2025) Llada 1.5: variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223.

## Appendix A Broader Impact

This work studies training objectives for improving discrete diffusion language models, and therefore shares the broader impacts of generative language modeling\. Improved generation quality at fixed inference budgets may make non\-autoregressive text generation more practical and efficient, which could benefit applications requiring lower latency or reduced sampling cost\. At the same time, stronger language generators can also be misused to produce misleading, low\-quality, or harmful text at scale\. Our method does not introduce new data sources, deployment mechanisms, or safety filters, and our experiments are limited to standard unconditional generation benchmarks\. Responsible deployment of models trained with drifting objectives should therefore follow existing best practices for language\-model evaluation, including bias, toxicity, memorization, and misuse assessments before use in user\-facing systems\.

## Appendix B Licenses

We use publicly available datasets, checkpoints, and evaluation models in accordance with their released licenses and terms. Table [7](https://arxiv.org/html/2605.19470#A2.T7) summarizes the main external assets used in this work. We do not redistribute the original datasets or pretrained checkpoints; our release will include only our code, configuration files, and instructions for obtaining the external resources from their original providers.

Table 7: Main external assets used in this work.

## Appendix C Theoretical Analysis of TokenDrift

We give three results that justify the feature-space drifting objective used by TokenDrift. The first shows that the soft-token lift makes drifting differentiable for categorical sequence generators. The second shows that the resulting fixed-point loss directly follows the feature-space drift. The third establishes that the equilibrium property of anti-symmetric drifting is preserved in the discrete setting. We provide formal statements and proofs for the properties summarized in Sec. [3.6](https://arxiv.org/html/2605.19470#S3.SS6).

Throughout this section, we omit the sample index when it is clear from context. For logits $\ell \in \mathbb{R}^{L \times |\mathcal{V}|}$, we write

$$p = \operatorname{softmax}(\ell), \qquad \tilde{e} = pE, \qquad h(\ell) = \phi(\tilde{e}),$$

where the softmax is applied row-wise over the vocabulary dimension, $E$ is the token embedding matrix, and $\phi$ is a frozen semantic encoder. Given a feature-space drift $V$, the drifting target and objective are

$$h^{\star} = \operatorname{sg}\!\left(h(\ell) + \alpha V\right), \qquad \mathcal{L}_{\mathrm{drift}}(\ell; V) = \frac{1}{2}\left\|h(\ell) - h^{\star}\right\|_{2}^{2},$$

where $\alpha > 0$ is the drift scale and $\operatorname{sg}(\cdot)$ denotes stop-gradient.

### C.1 Soft-token features make drifting differentiable

The first issue in applying drifting to text is that hard token samples are not differentiable with respect to logits. The soft-token lift resolves this by replacing a hard token embedding with the expected embedding under the model distribution.

###### Proposition C.1 (Differentiable soft-token lift).

Assume that the frozen encoder $\phi$ is differentiable with respect to its input embeddings. Then

$$h(\ell) = \phi(\operatorname{softmax}(\ell) E)$$

is differentiable with respect to $\ell$. Consequently, for any differentiable feature-space objective $R(h)$,

$$\nabla_{\ell} R(h(\ell)) = J_h(\ell)^{\top} \nabla_h R(h),$$

where $J_h(\ell)$ is the Jacobian of $h$ with respect to $\ell$. The Jacobian $J_h(\ell)$ pulls the feature-space gradient back to logit space.

Proof. The row-wise softmax map

$$\ell \mapsto p = \mathrm{softmax}(\ell)$$

is differentiable with respect to $\ell$. The soft-token embedding map

$$p \mapsto \tilde{e} = pE$$

is linear, and hence differentiable. By assumption, the frozen encoder $\phi$ is differentiable with respect to its input embeddings. Therefore, the composition

$$h(\ell) = \phi(\mathrm{softmax}(\ell) E)$$

is differentiable with respect to $\ell$.

Now let $R$ be any differentiable objective defined on the feature representation $h$. Since $R(h(\ell))$ is a composition of differentiable maps, the chain rule gives

$$\nabla_{\ell} R(h(\ell)) = J_h(\ell)^{\top} \nabla_h R(h),$$

where $J_h(\ell)$ is the Jacobian of $h$ with respect to $\ell$. Thus, gradients of feature-space objectives can be pulled back to the generator logits through the soft-token lift. ∎

Interpretation. Proposition [C.1](https://arxiv.org/html/2605.19470#A3.Thmtheorem1) identifies the key bridge from continuous drifting to discrete text. Although the drift loss is defined in a semantic feature space, the soft-token lift makes this loss differentiable with respect to the generator logits. Thus, the feature-space drift can update a categorical sequence generator by standard backpropagation. In contrast, hard token selection would break this path and turn the drift target into a non-differentiable evaluation signal rather than a trainable objective.

### C.2 Connecting feature-space drifting to logits

The fixed-point loss below is the same local mechanism used in continuous drifting: a generated feature is trained toward a stop-gradient target shifted by the drift field. The nontrivial point in our setting is that the optimized variables are not continuous samples, but token logits. Using the soft-token lift (Prop. [C.1](https://arxiv.org/html/2605.19470#A3.Thmtheorem1)), we show that the feature-space drifting signal induces a well-defined update direction for the logits of a categorical generator.

###### Proposition C.2 (Logit-space pullback of the drift).

Fix a drift vector $V$ and define

$$h^{\star} = \operatorname{sg}(h + \alpha V), \qquad \mathcal{L}_{\mathrm{drift}} = \frac{1}{2}\|h - h^{\star}\|_{2}^{2}.$$

Then the feature-space gradient is

$$\nabla_h \mathcal{L}_{\mathrm{drift}} = -\alpha V.$$

Moreover, when $h = h(\ell)$ is the soft-token feature map from Proposition [C.1](https://arxiv.org/html/2605.19470#A3.Thmtheorem1),

$$\nabla_{\ell} \mathcal{L}_{\mathrm{drift}} = -\alpha J_h(\ell)^{\top} V.$$

Thus, the continuous drift field is transferred to the categorical generator as a logit-space update through the Jacobian of the soft-token feature map.

Proof. Since $h^{\star} = \mathrm{sg}(h + \alpha V)$, the target is treated as constant when differentiating with respect to $h$. Therefore,

$$\nabla_h \mathcal{L}_{\mathrm{drift}} = \nabla_h \frac{1}{2}\|h - h^{\star}\|_{2}^{2} = h - h^{\star} = h - \mathrm{sg}(h + \alpha V) = -\alpha V.$$

Now suppose $h = h(\ell)$. By Proposition [C.1](https://arxiv.org/html/2605.19470#A3.Thmtheorem1), $h(\ell)$ is differentiable with respect to $\ell$. Applying the chain rule gives

$$\nabla_{\ell} \mathcal{L}_{\mathrm{drift}} = J_h(\ell)^{\top} \nabla_h \mathcal{L}_{\mathrm{drift}} = -\alpha J_h(\ell)^{\top} V.$$

This proves the stated logit-space pullback of the feature-space drift. ∎
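The pullback identity can be checked numerically on a toy model. The sketch below is an illustration under stated assumptions, not the paper's code: the encoder $\phi(e) = \tanh(We)$, the matrices `E` and `W`, and all numeric values are hypothetical, and stop-gradient is emulated by computing the target once and then holding it fixed. Both the loss gradient and the Jacobian are estimated with central finite differences.

```python
import math

# Toy, hypothetical setup: one position, 3 tokens, 2-dim embeddings and features.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # embedding matrix
W = [[0.5, -0.3], [0.2, 0.8]]              # weights of a stand-in frozen encoder

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def h(logits):
    """Soft-token feature h(l) = phi(softmax(l) E), with phi(e) = tanh(W e)."""
    p = softmax(logits)
    e = [sum(p[v] * E[v][d] for v in range(len(p))) for d in range(2)]
    return [math.tanh(sum(W[i][d] * e[d] for d in range(2))) for i in range(2)]

def drift_loss(logits, target):
    """0.5 * ||h(l) - target||^2 with the target held fixed (stop-gradient)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(h(logits), target))

def shifted(logits, v, delta):
    out = list(logits)
    out[v] += delta
    return out

def loss_grad(logits, target, eps=1e-5):
    """Central finite-difference gradient of the loss w.r.t. the logits."""
    return [(drift_loss(shifted(logits, v, eps), target)
             - drift_loss(shifted(logits, v, -eps), target)) / (2 * eps)
            for v in range(len(logits))]

def pullback(logits, V, alpha, eps=1e-5):
    """-alpha * J^T V, with J estimated by central finite differences."""
    out = []
    for v in range(len(logits)):
        hu, hd = h(shifted(logits, v, eps)), h(shifted(logits, v, -eps))
        col = [(a - b) / (2 * eps) for a, b in zip(hu, hd)]   # dh / dl_v
        out.append(-alpha * sum(c * vi for c, vi in zip(col, V)))
    return out

logits = [0.4, -0.1, 0.7]
V, alpha = [0.3, -0.2], 0.5
target = [a + alpha * b for a, b in zip(h(logits), V)]   # computed once, then frozen
print("loss gradient:", loss_grad(logits, target))
print("-alpha J^T V: ", pullback(logits, V, alpha))
```

The two printed vectors should agree up to finite-difference error. Their entries also sum to approximately zero, because the softmax is invariant to adding a constant to all logits.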

Local effect on generated features. We next check that pulling the drift back to logits does not destroy its intended feature-space effect. Let $\ell^{+} = \ell - \gamma \nabla_{\ell} \mathcal{L}_{\mathrm{drift}}$ be one gradient descent step with step size $\gamma > 0$. A first-order expansion gives

$$\left\langle h(\ell^{+}) - h(\ell),\, V \right\rangle = \gamma \alpha \|J_h(\ell)^{\top} V\|_{2}^{2} + O(\gamma^{2}).$$

Therefore, for sufficiently small $\gamma$, the logit update moves the generated feature in a direction that has positive alignment with the drift, whenever $J_h(\ell)^{\top} V \neq 0$.
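The expansion can be unpacked in three steps. The derivation below writes $J = J_h(\ell)$ and assumes that $h$ is twice continuously differentiable near $\ell$, an assumption not stated in the text that is used only to bound the remainder.

$$
\begin{aligned}
\ell^{+} - \ell &= -\gamma \nabla_{\ell}\mathcal{L}_{\mathrm{drift}} = \gamma\alpha J^{\top} V && \text{(Proposition C.2)} \\
h(\ell^{+}) - h(\ell) &= J(\ell^{+} - \ell) + O(\gamma^{2}) = \gamma\alpha J J^{\top} V + O(\gamma^{2}) && \text{(Taylor expansion)} \\
\left\langle h(\ell^{+}) - h(\ell),\, V \right\rangle &= \gamma\alpha V^{\top} J J^{\top} V + O(\gamma^{2}) = \gamma\alpha \|J^{\top} V\|_{2}^{2} + O(\gamma^{2})
\end{aligned}
$$

The leading term is a squared norm, so it is nonnegative and vanishes only when $J^{\top} V = 0$.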

Interpretation. The identity $\nabla_h \mathcal{L}_{\mathrm{drift}} = -\alpha V$ mirrors the fixed-point construction of continuous drifting. The contribution here is the connection to discrete generators: because the feature map is built from soft token predictions, the same drift-following signal can be pulled back to token logits as $-\alpha J_h(\ell)^{\top} V$. In this sense, the objective is not merely a feature-space matching loss; it is a trainable way to apply continuous drifting dynamics to categorical sequence models.

### C.3 Inherited equilibrium under anti-symmetric drift

The original drifting formulation relies on an anti-symmetric attraction–repulsion field so that the drift vanishes when the data and model distributions match. We here verify that the same fixed-point property is inherited by our discrete objective after replacing continuous samples with soft-token features.

Let $P_{\mathrm{data}}^{\phi}$ denote the distribution of real-text features $u = \phi(E[x])$, and let $P_{\mathrm{model}}^{\phi}$ denote the distribution of generated soft-token features $h(\ell) = \phi(\mathrm{softmax}(\ell) E)$. Let $\bar{V}(h; P, Q)$ be the population drift field in this frozen feature space.

###### Corollary C.3 (No drift signal at feature-space equilibrium).

Assume the feature-space drift operator is anti-symmetric:

$$\bar{V}(h; P, Q) = -\bar{V}(h; Q, P) \qquad \text{for all } h, P, Q.$$

If $P_{\mathrm{data}}^{\phi} = P_{\mathrm{model}}^{\phi}$, then

$$\bar{V}(h; P_{\mathrm{data}}^{\phi}, P_{\mathrm{model}}^{\phi}) = 0.$$

Consequently, the stop-gradient target satisfies

$$h^{\star} = \mathrm{sg}(h + \alpha \bar{V}) = \mathrm{sg}(h),$$

and therefore

$$\mathcal{L}_{\mathrm{drift}} = \frac{1}{2}\|h - h^{\star}\|_{2}^{2} = 0, \qquad \nabla_h \mathcal{L}_{\mathrm{drift}} = 0, \qquad \nabla_{\ell} \mathcal{L}_{\mathrm{drift}} = 0.$$

Proof. Since $P_{\mathrm{data}}^{\phi} = P_{\mathrm{model}}^{\phi}$, write this common feature distribution as $P$. By anti-symmetry,

$$\bar{V}(h; P, P) = -\bar{V}(h; P, P).$$

Hence $2\bar{V}(h; P, P) = 0$, so $\bar{V}(h; P, P) = 0$. Therefore,

$$\bar{V}(h; P_{\mathrm{data}}^{\phi}, P_{\mathrm{model}}^{\phi}) = 0.$$

Substituting this into the target definition gives

$$h^{\star} = \mathrm{sg}(h + \alpha \bar{V}) = \mathrm{sg}(h).$$

Since $\mathrm{sg}(h)$ has the same value as $h$, we have

$$\mathcal{L}_{\mathrm{drift}} = \frac{1}{2}\|h - \mathrm{sg}(h)\|_{2}^{2} = 0.$$

Moreover, using Proposition [C.2](https://arxiv.org/html/2605.19470#A3.Thmtheorem2) with $V = 0$,

$$\nabla_h \mathcal{L}_{\mathrm{drift}} = 0, \qquad \nabla_{\ell} \mathcal{L}_{\mathrm{drift}} = -\alpha J_h(\ell)^{\top} 0 = 0.$$

Thus, at feature-space equilibrium, the drifting objective injects no feature-space or logit-space learning signal. ∎

Interpretation. This result is inherited from the equilibrium property of continuous drifting. Its role is to show that our discrete objective does not introduce a spurious learning signal at feature-space equilibrium. The statement is deliberately made in feature space: matching $P_{\mathrm{data}}^{\phi}$ and $P_{\mathrm{model}}^{\phi}$ does not imply equality of logits or token distributions, since the feature map may be many-to-one. What it guarantees is that, in the geometry where the drift is defined, the objective stops pushing once the model-induced feature distribution matches the data-induced feature distribution.

Summary. Together, these results justify the core design of TokenDrift. The soft-token lift makes feature-space drifting differentiable for categorical generators; the fixed-point loss follows the computed drift direction exactly in feature space; and anti-symmetry ensures that the learning signal vanishes at equilibrium. These results do not claim global convergence of the nonconvex generator, but they characterize the local mechanism and fixed-point structure of the proposed objective.

## Appendix D Theoretical Analysis of Mirror-Teacher Objectives

We provide the theoretical details for the mirror-teacher alternatives used in Sec. [3.4](https://arxiv.org/html/2605.19470#S3.SS4). The first result shows that the mirror teacher is the KL-proximal simplex update induced by a logit-space direction. The second shows that this direction locally improves alignment with the feature-space drift. The third verifies that the equilibrium property of anti-symmetric drifting is preserved by the mirror-teacher loss.

Throughout, for a current logit tensor $\ell$, we write

$$p = \mathrm{softmax}(\ell), \qquad g = \nabla_{\ell}\!\left\langle h(\ell), \mathrm{sg}(V) \right\rangle, \qquad p^{\star} = \mathrm{softmax}(\mathrm{sg}(\ell + \eta g)),$$

where $h(\ell)$ is the soft-token semantic feature, $V$ is the feature-space drift, and $\mathrm{sg}(\cdot)$ denotes stop-gradient.

### D.1 Mirror teacher as a simplex-constrained update

Our mirror teacher has the closed form

$$p^{\star}_{v} \propto p_{v} \exp(\eta g_{v}),$$

which is standard in mirror descent and exponentiated-gradient updates on the simplex \[[2](https://arxiv.org/html/2605.19470#bib.bib80), [10](https://arxiv.org/html/2605.19470#bib.bib81)\]. The following proposition shows that this teacher is not an ad hoc construction: it is the unique KL-regularized update that moves the current distribution in direction $g$ while remaining on the simplex.

###### Proposition D.1 (Variational form of the mirror teacher).

Let $p\in\Delta^{|\mathcal{V}|-1}$ and $g\in\mathbb{R}^{|\mathcal{V}|}$. For any $\eta>0$, the distribution

$$p^{\star}_{v}=\frac{p_{v}\exp(\eta g_{v})}{\sum_{u}p_{u}\exp(\eta g_{u})}$$

is the unique solution of

$$\arg\max_{q\in\Delta^{|\mathcal{V}|-1}}\left\{\langle g,q\rangle-\frac{1}{\eta}\operatorname{KL}(q\,\|\,p)\right\}.$$

Equivalently, $p^{\star}=\operatorname{softmax}(\ell+\eta g)$ whenever $p=\operatorname{softmax}(\ell)$.

Proof. Consider the Lagrangian

$$\mathcal{J}(q,\lambda)=\sum_{v}q_{v}g_{v}-\frac{1}{\eta}\sum_{v}q_{v}\log\frac{q_{v}}{p_{v}}+\lambda\left(\sum_{v}q_{v}-1\right).$$

Differentiating with respect to $q_{v}$ gives

$$g_{v}-\frac{1}{\eta}\left(\log\frac{q_{v}}{p_{v}}+1\right)+\lambda=0.$$

Rearranging,

$$q_{v}=C\,p_{v}\exp(\eta g_{v}),$$

where $C=\exp(\eta\lambda-1)>0$ is a constant independent of $v$. Enforcing the simplex constraint $\sum_{v}q_{v}=1$ yields

$$q_{v}^{\star}=\frac{p_{v}\exp(\eta g_{v})}{\sum_{u}p_{u}\exp(\eta g_{u})}.$$

Since the objective is linear in $q$ plus the strictly concave term $-\frac{1}{\eta}\operatorname{KL}(q\,\|\,p)$, it is strictly concave on the simplex interior whenever $p_{v}>0$ for all $v$. Hence this stationary point is the unique maximizer. Finally, if $p=\mathrm{softmax}(\ell)$, then

$$q_{v}^{\star}=\frac{\exp(\ell_{v})\exp(\eta g_{v})}{\sum_{u}\exp(\ell_{u})\exp(\eta g_{u})}=\mathrm{softmax}(\ell+\eta g)_{v}.$$

Therefore $p^{\star}=\mathrm{softmax}(\ell+\eta g)$, as claimed. ∎
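As a numerical sanity check on Proposition D.1, the following minimal sketch compares the exponentiated-gradient closed form with a softmax over shifted logits and evaluates the KL-regularized objective. The logits, direction `g`, and step size `eta` are arbitrary toy values chosen for illustration, not quantities from the paper, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mirror_teacher(p, g, eta):
    """Closed form of Proposition D.1: p*_v proportional to p_v * exp(eta * g_v)."""
    shift = max(eta * gv for gv in g)  # subtract the largest exponent for stability
    weights = [pv * math.exp(eta * gv - shift) for pv, gv in zip(p, g)]
    total = sum(weights)
    return [w / total for w in weights]

def objective(q, p, g, eta):
    """<g, q> - (1/eta) * KL(q || p), assuming q and p are strictly positive."""
    linear = sum(gv * qv for gv, qv in zip(g, q))
    kl = sum(qv * math.log(qv / pv) for qv, pv in zip(q, p))
    return linear - kl / eta

# Arbitrary toy values (assumptions, not from the paper).
logits = [0.5, -1.0, 2.0, 0.0]
g = [1.0, -0.5, 0.2, 0.3]
eta = 0.7

p = softmax(logits)
p_star = mirror_teacher(p, g, eta)
p_shifted = softmax([l + eta * gv for l, gv in zip(logits, g)])

# Substituting p* into the objective gives (1/eta) * log sum_u p_u exp(eta * g_u).
best = math.log(sum(pv * math.exp(eta * gv) for pv, gv in zip(p, g))) / eta
```

Here `p_star` and `p_shifted` should agree up to floating-point error, and `objective(p_star, p, g, eta)` should match `best` and exceed the objective at any other distribution tried, such as `p` itself or the uniform distribution.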

Interpretation. Proposition [D.1](https://arxiv.org/html/2605.19470#A4.Thmtheorem1) formalizes the role of the mirror step in this alternative mirror formulation of TokenDrift. A Euclidean additive update would generally leave the simplex, whereas the mirror teacher is the KL-proximal update naturally associated with categorical outputs.

### D.2 Feature-space drift induces local semantic improvement

The drift field is computed in semantic feature space, but the generator is updated in logit space. The next result shows that this mismatch is locally well behaved: the logit-space direction used by the mirror formulation of TokenDrift provably improves alignment with the desired feature-space drift.

Define the local alignment objective

$$\Psi(\ell;V)=\left\langle h(\ell),\mathrm{sg}(V)\right\rangle.$$

By construction,

$$g=\nabla_{\ell}\Psi(\ell;V).$$

###### Proposition D.2 (Local ascent in semantic alignment).

Assume that $\Psi(\ell;V)$ is $L$-smooth in $\ell$ for fixed $V$. Let

$$\tilde{\ell}=\ell+\eta g,\qquad g=\nabla_{\ell}\Psi(\ell;V).$$

Then, for any $0<\eta<2/L$,

$$\Psi(\tilde{\ell};V)\geq\Psi(\ell;V)+\eta\left(1-\frac{L\eta}{2}\right)\|g\|_{2}^{2}.$$

In particular, if $g\neq 0$, then the teacher logits $\tilde{\ell}$ strictly improve semantic alignment with the drift direction for sufficiently small $\eta$.

Proof. Since $\Psi(\cdot;V)$ is $L$-smooth in $\ell$, for any perturbation $\Delta$,

$$\Psi(\ell+\Delta;V)\geq\Psi(\ell;V)+\left\langle\nabla_{\ell}\Psi(\ell;V),\Delta\right\rangle-\frac{L}{2}\|\Delta\|_{2}^{2}.$$

Set $\Delta=\eta g$, where $g=\nabla_{\ell}\Psi(\ell;V)$. Then

$$\begin{aligned}
\Psi(\ell+\eta g;V)&\geq\Psi(\ell;V)+\eta\left\langle\nabla_{\ell}\Psi(\ell;V),g\right\rangle-\frac{L\eta^{2}}{2}\|g\|_{2}^{2}\\
&=\Psi(\ell;V)+\eta\|g\|_{2}^{2}-\frac{L\eta^{2}}{2}\|g\|_{2}^{2}\\
&=\Psi(\ell;V)+\eta\left(1-\frac{L\eta}{2}\right)\|g\|_{2}^{2}.
\end{aligned}$$

Since $0<\eta<2/L$, the coefficient

$$\eta\left(1-\frac{L\eta}{2}\right)$$

is positive. Therefore, if $g\neq 0$, the teacher logits $\tilde{\ell}=\ell+\eta g$ strictly increase $\Psi(\ell;V)$. ∎

Feature-space view. Writing $J_{h}(\ell)$ for the Jacobian of $h$ with respect to $\ell$, we have

$$g=J_{h}(\ell)^{\top}V.$$

Hence a first-order expansion gives

$$\left\langle h(\ell+\eta g)-h(\ell),\,V\right\rangle=\eta\|J_{h}(\ell)^{\top}V\|_{2}^{2}+O(\eta^{2}),$$

which makes the geometric role of the update explicit: the mirror-based formulation of TokenDrift chooses the logit-space direction that increases semantic alignment with the feature-space drift at first order.
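The local-ascent statement can be illustrated on a toy problem. The sketch below assumes a deliberately simplified lift, `h(l) = E^T softmax(l)` with a small hypothetical embedding table `E` and no encoder on top, so that the alignment objective and its gradient have closed forms. The paper's actual feature map passes soft tokens through a frozen encoder, so this is only an illustration of the mechanism under that stated assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def psi(logits, scores):
    """Psi(l; V) = <h(l), V> for the toy lift, where scores[v] = <E[v], V>."""
    p = softmax(logits)
    return sum(pv * sv for pv, sv in zip(p, scores))

def grad_psi(logits, scores):
    """Gradient of psi w.r.t. the logits: g_v = p_v * (s_v - sum_u p_u * s_u)."""
    p = softmax(logits)
    mean_score = sum(pv * sv for pv, sv in zip(p, scores))
    return [pv * (sv - mean_score) for pv, sv in zip(p, scores)]

# Hypothetical toy quantities: 3 vocabulary items, 2-dimensional features.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]  # one embedding row per vocabulary item
V = [0.3, -0.2]                            # fixed (stop-gradient) drift vector
scores = [sum(e * v for e, v in zip(row, V)) for row in E]

logits = [0.2, -0.4, 0.1]
eta = 0.1
g = grad_psi(logits, scores)

# Actual change in alignment versus the first-order prediction eta * ||g||^2.
gain = psi([l + eta * gv for l, gv in zip(logits, g)], scores) - psi(logits, scores)
first_order = eta * sum(gv * gv for gv in g)
```

For a small step size, `gain` should be positive and close to `first_order`, with the gap shrinking quadratically in `eta`.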

Interpretation. Proposition [D.2](https://arxiv.org/html/2605.19470#A4.Thmtheorem2) is the main justification for transporting drift from feature space to logits. It shows that the mirror teacher is not merely simplex-valid; it is also locally aligned with the semantic correction prescribed by the drift field.

### D.3 Equilibrium preservation under anti-symmetric drift

A central property of drifting is that the learning signal should disappear once the model and data distributions match \[[4](https://arxiv.org/html/2605.19470#bib.bib79)\]. We now show that this equilibrium property is preserved by the mirror-based formulation of TokenDrift.

Let $P_{\mathrm{data}}^{\phi}$ and $P_{\mathrm{model}}^{\phi}$ denote the distributions of semantic features induced by the data and the current model under the frozen encoder $\phi$. Let $\bar{V}(h;P,Q)$ denote the population drift field in feature space, where $P$ supplies the attractive references and $Q$ supplies the repulsive references.

###### Proposition D.3 (Equilibrium under anti-symmetric drift).

Assume that the population drift field is anti-symmetric:

$$\bar{V}(h;P,Q)=-\,\bar{V}(h;Q,P)\qquad\text{for all }h,P,Q.$$

Then

$$\bar{V}(h;P,P)=0\qquad\text{for all }h.$$

Consequently, if $P_{\mathrm{data}}^{\phi}=P_{\mathrm{model}}^{\phi}$, then the drift term vanishes, the induced logit-space direction satisfies $g=0$, the mirror teacher reduces to $p^{\star}=p$, and the drift loss is zero:

$$\operatorname{KL}(p^{\star}\,\|\,p)=0.$$

Proof. Setting $Q=P$ in the anti-symmetry condition gives

$$\bar{V}(h;P,P)=-\bar{V}(h;P,P).$$

Hence $2\bar{V}(h;P,P)=0$, and therefore

$$\bar{V}(h;P,P)=0.$$

Now suppose $P_{\mathrm{data}}^{\phi}=P_{\mathrm{model}}^{\phi}$. Let this common feature distribution be denoted by $P$. Then the population drift used by the mirror-teacher objective satisfies

$$V=\bar{V}(h;P_{\mathrm{data}}^{\phi},P_{\mathrm{model}}^{\phi})=\bar{V}(h;P,P)=0.$$

By the definition of the logit-space direction,

$$g=\nabla_{\ell}\left\langle h(\ell),\mathrm{sg}(V)\right\rangle,$$

and since $V=0$, we have

$$g=\nabla_{\ell}\left\langle h(\ell),0\right\rangle=0.$$

Therefore the mirror teacher becomes

$$p^{\star}=\mathrm{softmax}(\mathrm{sg}(\ell+\eta g))=\mathrm{softmax}(\mathrm{sg}(\ell)).$$

The stop-gradient operator does not change the value of its argument, so

$$p^{\star}=\mathrm{softmax}(\ell)=p.$$

Thus the mirror KL loss vanishes:

$$\mathrm{KL}(p^{\star}\,\|\,p)=\mathrm{KL}(p\,\|\,p)=0.$$

This proves that, at feature-space equilibrium, the mirror-teacher objective injects no drift signal. ∎

Interpretation. Proposition [D.3](https://arxiv.org/html/2605.19470#A4.Thmtheorem3) shows that the discrete mirror-teacher construction preserves the fixed-point structure of drifting. When the model-induced feature distribution matches the data-induced feature distribution, the mirror-based formulation of TokenDrift injects no spurious learning signal. This is the discrete analogue of the equilibrium property that motivates drifting in the continuous setting.
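The equilibrium argument can be traced numerically with a stand-in drift field. The sketch below uses a hypothetical kernel-weighted "attraction minus repulsion" field, chosen only because it is anti-symmetric by construction; it is not the drift estimator used in the paper. The Jacobian `J`, the temperature `tau`, and all feature values are likewise illustrative assumptions.

```python
import math

def drift(h, attract, repel, tau=0.5):
    """Toy anti-symmetric field: kernel-weighted pull toward `attract` minus pull toward `repel`."""
    def pull(refs):
        total = [0.0] * len(h)
        for y in refs:
            sq_dist = sum((yi - hi) ** 2 for yi, hi in zip(y, h))
            weight = math.exp(-sq_dist / tau)
            for i, (yi, hi) in enumerate(zip(y, h)):
                total[i] += weight * (yi - hi) / len(refs)
        return total
    return [a - r for a, r in zip(pull(attract), pull(repel))]

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qv * math.log(qv / pv) for qv, pv in zip(q, p) if qv > 0)

# Hypothetical 2-dimensional features.
data_feats = [[1.0, 0.0], [0.8, 0.6]]
model_feats = [[0.0, 1.0], [-0.6, 0.8]]
h = [0.6, 0.8]

v_pq = drift(h, data_feats, model_feats)  # attract to data, repel from model
v_qp = drift(h, model_feats, data_feats)  # roles swapped
v_pp = drift(h, data_feats, data_feats)   # matched distributions

# With V = 0, the logit-space direction g = J^T V vanishes for any Jacobian J.
J = [[0.3, -0.1, 0.4], [0.2, 0.5, -0.7]]  # hypothetical Jacobian (features x logits)
g = [sum(J[i][v] * v_pp[i] for i in range(2)) for v in range(3)]

# Mirror teacher with g = 0 reduces to the current distribution p.
p = [0.1, 0.2, 0.7]
eta = 1.0
weights = [pv * math.exp(eta * gv) for pv, gv in zip(p, g)]
p_star = [w / sum(weights) for w in weights]
loss = kl(p_star, p)
```

Swapping the reference sets flips the sign of the field, identical reference sets give a zero field, and the resulting mirror KL loss is zero up to floating-point error.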

### D.4 Summary

Taken together, the three results establish that the mirror-based formulation of TokenDrift is theoretically well aligned with its design goals: the teacher is the correct simplex-constrained update, the chosen direction improves semantic alignment with the drift field at first order, and the resulting learning signal vanishes at equilibrium.

## Appendix E Additional Related Work

#### Inference\-time efficiency for discrete diffusion language models\.

A separate line of work improves the decoding efficiency of discrete diffusion language models at inference time. These methods reduce sampling cost by shortening denoising schedules, reusing or skipping computation across denoising steps, or stopping computation once parts of the sequence have converged \[[14](https://arxiv.org/html/2605.19470#bib.bib18),[9](https://arxiv.org/html/2605.19470#bib.bib16),[27](https://arxiv.org/html/2605.19470#bib.bib17),[15](https://arxiv.org/html/2605.19470#bib.bib13),[12](https://arxiv.org/html/2605.19470#bib.bib14),[28](https://arxiv.org/html/2605.19470#bib.bib15),[18](https://arxiv.org/html/2605.19470#bib.bib93)\]. For example, SureLock \[[18](https://arxiv.org/html/2605.19470#bib.bib93)\] reduces decoding cost by early-stopping token positions whose predictions have stabilized during iterative denoising. These approaches modify the *inference* procedure or computation schedule of an already trained model.

#### Relation to our work\.

Our work addresses a different part of the pipeline: we study drifting as a training-time objective for refining discrete diffusion language models under matched sampling budgets. The semantic encoder, reference queues, and drift targets used by TokenDrift are used only during training; at inference time, the underlying DDLM sampler is unchanged. Thus, inference-time acceleration methods are not controlled baselines for our objective-level study. Instead, they represent a distinct direction that improves how a DDLM is sampled, whereas our work improves the model produced before sampling begins.

## Appendix F Additional Experimental Details

#### Benchmark and continual\-training setup\.

We use OpenWebText (OWT) \[[7](https://arxiv.org/html/2605.19470#bib.bib84)\] for the main continual-training experiments. Unless otherwise stated, all methods start from a released checkpoint and are trained for 13k additional global steps under the same data and compute budget. We focus on unconditional text generation, following prior work on diffusion language models \[[22](https://arxiv.org/html/2605.19470#bib.bib20)\], and evaluate all methods with matched numbers of function evaluations (NFEs).

#### Backbones\.

For the masked-diffusion experiments, we use the released 170M-parameter MDLM checkpoint \[[22](https://arxiv.org/html/2605.19470#bib.bib20)\] (released at [https://github.com/kuleshov-group/mdlm](https://github.com/kuleshov-group/mdlm)). For the uniform-state diffusion experiments, we use the released DUO checkpoint \[[23](https://arxiv.org/html/2605.19470#bib.bib10)\] (released at [https://github.com/s-sahoo/duo](https://github.com/s-sahoo/duo)). Within each backbone, all controlled comparisons share the same architecture, tokenizer, initialization, data, additional training budget, and evaluation protocol.

#### Controlled baselines\.

For each backbone, we compare against two matched baselines. The first is the released checkpoint evaluated without additional training. The second, denoted Continuation, starts from the same checkpoint and is trained for the same number of additional updates using only the original DDLM objective. These baselines isolate the effect of the drifting objective from improvements due merely to additional optimization.

#### Drifting objective variants\.

Our default method uses the direct feature\-space drifting objective described in Sec\.[3\.3](https://arxiv.org/html/2605.19470#S3.SS3)\. We also evaluate variants that add the original denoising objective and mirror\-teacher alternatives described in Sec\.[3\.4](https://arxiv.org/html/2605.19470#S3.SS4)\. Unless otherwise stated, the default setting uses the drifting objective alone; base\-plus\-drift and mirror\-teacher variants are used for the formulation study\.

#### Optimization and drift hyperparameters\.

The drift scale is fixed to $\alpha=1$. We use multi-temperature drift estimation with $\mathcal{T}=\{0.02,0.05,0.2\}$. For each temperature, we compute per-sample drift vectors, normalize them by a scalar batch-level RMS scale, and average the normalized drifts following the original drifting implementation \[[4](https://arxiv.org/html/2605.19470#bib.bib79)\]. We optimize with AdamW using a global batch size of 512 and a learning rate of $3\times 10^{-5}$ or $5\times 10^{-5}$.
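The normalize-then-average step can be sketched as follows. This is a minimal illustration under stated assumptions: the RMS is taken over every entry of the batch for a given temperature, a small `eps` guards against division by zero, and the per-temperature drift vectors are supplied as plain nested lists. The exact normalization in the original drifting implementation may differ, and the function names are hypothetical.

```python
import math

def rms(batch):
    """Scalar root-mean-square over every entry of a batch of vectors (assumed definition)."""
    values = [x for vec in batch for x in vec]
    return math.sqrt(sum(x * x for x in values) / len(values))

def average_normalized_drifts(drifts_by_temperature, eps=1e-8):
    """Divide each temperature's batch of drifts by its scalar RMS, then average over temperatures."""
    num_temps = len(drifts_by_temperature)
    batch_size = len(drifts_by_temperature[0])
    dim = len(drifts_by_temperature[0][0])
    out = [[0.0] * dim for _ in range(batch_size)]
    for batch in drifts_by_temperature:
        scale = rms(batch) + eps  # eps is an assumption, not from the paper
        for b, vec in enumerate(batch):
            for d, x in enumerate(vec):
                out[b][d] += x / scale / num_temps
    return out

# Toy input: the same drift directions at three very different magnitudes,
# standing in for the three temperatures {0.02, 0.05, 0.2}.
base = [[3.0, 4.0], [0.0, 0.0]]
drifts = [[[s * x for x in vec] for vec in base] for s in (0.1, 1.0, 10.0)]
combined = average_normalized_drifts(drifts)
```

Because each temperature is rescaled to unit RMS before averaging, no single temperature dominates the combined drift merely because its raw vectors are larger.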

#### Feature encoder\.

For each backbone, the frozen semantic encoder is a frozen copy of the same released checkpoint used to initialize the generator. We extract sequence-level features by mean-pooling the penultimate and final hidden layers and L2-normalizing their concatenation. Concretely, if $H^{(L-1)},H^{(L)}\in\mathbb{R}^{T\times d}$ are the penultimate and final hidden states, we define

$$h=\operatorname{normalize}\left(\left[\frac{1}{T}\sum_{t=1}^{T}H_{t}^{(L-1)};\frac{1}{T}\sum_{t=1}^{T}H_{t}^{(L)}\right]\right)\in\mathbb{R}^{2d}.$$
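The pooling formula can be written out directly. The sketch below uses plain Python lists in place of tensors; the function name and the `eps` floor on the norm are illustrative assumptions rather than details from the paper.

```python
import math

def pooled_feature(h_penultimate, h_final, eps=1e-12):
    """Mean-pool two T x d hidden-state matrices over T, concatenate, and L2-normalize."""
    def mean_pool(hidden):
        num_positions = len(hidden)
        dim = len(hidden[0])
        return [sum(row[j] for row in hidden) / num_positions for j in range(dim)]

    concat = mean_pool(h_penultimate) + mean_pool(h_final)  # length 2d
    norm = math.sqrt(sum(x * x for x in concat))
    return [x / max(norm, eps) for x in concat]  # eps avoids division by zero

# Toy hidden states with T = 2 positions and d = 2 channels.
h_pen = [[1.0, 2.0], [3.0, 4.0]]   # mean over positions: [2.0, 3.0]
h_fin = [[0.0, 0.0], [2.0, 0.0]]   # mean over positions: [1.0, 0.0]
feature = pooled_feature(h_pen, h_fin)
```

The result has length 2d and unit Euclidean norm, so inner products between such features are cosine similarities.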

#### Drift reference sets\.

To estimate the drift field efficiently, we build positive and negative reference sets from the current distributed micro-batch and feature queues. Each GPU uses a per-device batch size of 2, and training is performed on 4 NVIDIA H100 SXM5 94GB GPUs, so each micro-step provides 8 fresh examples after cross-device all-gather. Generated features are computed from model predictions on corrupted inputs sampled from the underlying diffusion process, and real features are computed from the corresponding clean texts. For each anchor, we use the current real and generated features together with separate FIFO queues of detached real and generated features. The default queue size is 1024 for both queues. The queues store only pooled sequence-level features, not token-level representations. We compute $\mathcal{L}_{\mathrm{drift}}$ at every micro-step and accumulate gradients to reach the global batch size of 512.
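The batch arithmetic and the queue behaviour described here can be sketched in a few lines. The constants come from the text above; the `FeatureQueue` class, its method names, and the choice to copy features on insertion (a stand-in for detaching them from the autograd graph) are hypothetical.

```python
from collections import deque

PER_DEVICE_BATCH = 2
NUM_GPUS = 4
GLOBAL_BATCH = 512
QUEUE_SIZE = 1024

micro_batch = PER_DEVICE_BATCH * NUM_GPUS   # 8 fresh examples per micro-step
accum_steps = GLOBAL_BATCH // micro_batch   # 64 micro-steps per optimizer update

class FeatureQueue:
    """Fixed-size FIFO of pooled sequence-level features (hypothetical helper)."""

    def __init__(self, maxlen=QUEUE_SIZE):
        self._buffer = deque(maxlen=maxlen)  # oldest entries are evicted automatically

    def push(self, features):
        # Store copies so later changes to the inputs cannot alter queued references.
        self._buffer.extend(list(f) for f in features)

    def references(self, fresh):
        """Current features followed by the queued ones, as one reference set."""
        return [list(f) for f in fresh] + list(self._buffer)

    def __len__(self):
        return len(self._buffer)

# Separate queues for real (attractive) and generated (repulsive) features.
real_queue = FeatureQueue()
generated_queue = FeatureQueue()
```

If every micro-step pushes all 8 gathered features (an assumption about the update order), a 1024-entry queue holds the features from the most recent 128 micro-steps, which is two accumulated optimizer updates.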

#### Evaluation protocol\.

Following prior work \[[22](https://arxiv.org/html/2605.19470#bib.bib20)\], we evaluate generated sample quality using generative perplexity (Gen.-PPL) computed by pretrained GPT-2 Large \[[20](https://arxiv.org/html/2605.19470#bib.bib82)\]. We also report entropy as a diagnostic for diversity and degeneration.
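For orientation, the two metrics can be sketched as below. This is an illustration under stated assumptions, not the paper's evaluation code: it assumes the evaluator's per-token log-probabilities (in nats) are already available, that Gen.-PPL averages the negative log-likelihood over all tokens of all samples, and that entropy means the unigram entropy of the token ids within one sample. The exact conventions used in the experiments are not specified in this appendix.

```python
import math
from collections import Counter

def generative_perplexity(token_logprobs_per_sample):
    """exp of the mean negative log-likelihood over all tokens of all samples."""
    total_nll = -sum(sum(logprobs) for logprobs in token_logprobs_per_sample)
    num_tokens = sum(len(logprobs) for logprobs in token_logprobs_per_sample)
    return math.exp(total_nll / num_tokens)

def unigram_entropy(token_ids):
    """Entropy (in nats) of the empirical token distribution within one sample."""
    counts = Counter(token_ids)
    n = len(token_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy inputs: every token assigned probability 0.5 by the evaluator,
# one fully repetitive sample, and one sample with no repeated token.
ppl = generative_perplexity([[math.log(0.5)] * 4, [math.log(0.5)] * 2])
repetitive = unigram_entropy([7, 7, 7, 7])
diverse = unigram_entropy([1, 2, 3, 4])
```

A low Gen.-PPL paired with a collapsed entropy would indicate repetitive, degenerate text rather than genuinely better samples, which is why the two are read together.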