Masked Diffusion Decoding as $x$-Prediction Flow

arXiv cs.CL Papers

Summary

This paper reinterprets masked diffusion language model decoding as continuous clean-state prediction. It introduces a flow-based framework in which tokens are updated continuously and asynchronously based on confidence, reaching 97% of LLaDA's HumanEval performance with 25% of the decoding budget.

arXiv:2606.29066v1 Announce Type: new Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens, but their standard decoder reduces each step to a binary action: a position is either committed to a single token or left fully masked, with no representation of partial belief in between. This all-or-nothing regime discards rich predictive information and forces premature, irrevocable commitments, leading to poor performance under a limited decoding budget. In this paper, we reinterpret mask prediction as clean-state prediction ($x$-prediction) and show that it can be used to induce a continuous flow in input embedding space. Building on this view, we propose a continuous decoding framework for MDLMs where tokens can accumulate partial progress at each diffusion step and remain revisable. To match the uneven contextual constraints across positions in language, we replace the globally synchronous schedule in image diffusion with a confidence-based asynchronous update in which the diffusion progress is token-wise accumulated. Additionally, we introduce a lightweight policy network and formulate its training as a reinforcement learning problem. Applied to pretrained LLaDA, our continuous decoder reaches 97% of its performance on the HumanEval dataset with 25% of decoding budget.

Cached at: 06/30/26, 05:29 AM

# Masked Diffusion Decoding as 𝑥-Prediction Flow
Source: [https://arxiv.org/html/2606.29066](https://arxiv.org/html/2606.29066)
Weitian Wang (1, 2), Lianlei Shan (3), Shubham Rai (1), Cecilia De La Parra (1), Akash Kumar (2)

(1) Robert Bosch GmbH, Germany; (2) Ruhr University Bochum, Germany; (3) University of the Chinese Academy of Sciences, China

###### Abstract

Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens, but their standard decoder reduces each step to a binary action: a position is either committed to a single token or left fully masked, with no representation of partial belief in between. This all-or-nothing regime discards rich predictive information and forces premature, irrevocable commitments, leading to poor performance under a limited decoding budget. In this paper, we reinterpret mask prediction as clean-state prediction ($x$-prediction) and show that it can be used to induce a continuous flow in input embedding space. Building on this view, we propose a continuous decoding framework for MDLMs where tokens can accumulate partial progress at each diffusion step and remain revisable. To match the uneven contextual constraints across positions in language, we replace the globally synchronous schedule in image diffusion with a confidence-based asynchronous update in which the diffusion progress is token-wise accumulated. Additionally, we introduce a lightweight policy network and formulate its training as a reinforcement learning problem. Applied to pretrained LLaDA, our continuous decoder reaches 97% of its performance on the HumanEval dataset with 25% of decoding budget.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.29066v1/x1.png)
Figure 1: By reinterpreting mask prediction as clean-state prediction in embedding space and defining a velocity field that moves the current state from mask embedding (noisy state) toward the predicted clean state, our $x$-prediction flow decoding evolves *all* tokens *continuously* in embedding space.

Large language models built on the autoregressive (AR) paradigm [[1](https://arxiv.org/html/2606.29066#bib.bib13); [7](https://arxiv.org/html/2606.29066#bib.bib14)] have driven recent progress in natural language processing. However, their strictly sequential factorization imposes a fundamental constraint at inference time: tokens must be produced one after another, conditioned only on the previously generated prefix. This sequential dependency incurs high decoding latency, which is further exacerbated in modern reasoning models, where chain-of-thought traces routinely span thousands of tokens [[9](https://arxiv.org/html/2606.29066#bib.bib15)].

Diffusion language models (DLMs) have recently emerged as a promising alternative [[16](https://arxiv.org/html/2606.29066#bib.bib1); [3](https://arxiv.org/html/2606.29066#bib.bib5)]. Instead of committing tokens left-to-right, DLMs cast generation as an iterative denoising process: a fully corrupted response is refined in parallel over a small number of steps, with every position able to attend bidirectionally to the rest of the sequence. This parallelism enables faster generation, supports controllable and non-causal generation patterns, and provides a built-in mechanism for refining earlier predictions in light of later context.

A natural design is to follow image diffusion [[11](https://arxiv.org/html/2606.29066#bib.bib10); [15](https://arxiv.org/html/2606.29066#bib.bib9); [14](https://arxiv.org/html/2606.29066#bib.bib2)] and denoise directly in a continuous embedding space. This recipe, however, cannot be transferred directly to language. Language tokens are discrete and highly context-dependent, with the meaning of a token determined by the identities of its neighbours rather than by its position in any continuous coordinate. Hence, injecting stochastic noise into an embedding produces a perturbation whose magnitude has no semantically meaningful correspondence to a syntactic or lexical change. The dominant DLM family [[17](https://arxiv.org/html/2606.29066#bib.bib8); [16](https://arxiv.org/html/2606.29066#bib.bib1); [3](https://arxiv.org/html/2606.29066#bib.bib5)] therefore adopts *masked* diffusion, building on the long-standing success of mask prediction popularized by BERT [[5](https://arxiv.org/html/2606.29066#bib.bib4)], in which a fraction of tokens are corrupted with a special [M] symbol and the model is asked to recover them from the surrounding context. This formulation has enabled MDLMs to scale to billions of parameters and to match AR baselines on standard benchmarks, while also showing advantages on tasks that benefit from bidirectional structure, such as code generation and reverse reasoning [[16](https://arxiv.org/html/2606.29066#bib.bib1)].

Despite this scalability, the standard decoding procedure of MDLMs makes inefficient use of its compute budget. At each step, the model outputs a categorical distribution over the vocabulary at every masked position, but the sampler reduces this distribution to a binary action: either the position is unmasked to a single committed token, or it is left as [M] and re-predicted from scratch at the next step. From the perspective of any individual position, the per-step state is therefore all-or-nothing, with no representation of partial belief between steps. Two consequences follow. First, the rich predictive information contained in the full output distribution, including the runner-up candidates and their relative likelihoods, is discarded as soon as a position is retained. Second, once a token is unmasked, the commitment is final, even when later updates to neighbouring positions would have favoured a different choice. The model is forced to choose between premature certainty and complete uncertainty, which limits how effectively a fixed number of decoding steps can be used. This limitation motivates the design of a decoding scheme in which the belief about each token can evolve *continuously* across steps, so that confidence accumulates gradually with context and committed predictions remain revisable until generation converges.

In this work, we realize this continuous decoding scheme by reinterpreting mask prediction as clean-state prediction. We treat the model's masked-position output as an estimate of the clean state in embedding space, and use it to define a velocity field that moves the current state from the mask embedding toward the predicted clean state. We refer to the resulting continuous dynamics as an $x$-prediction flow. Unlike image diffusion, which starts from stochastic Gaussian noise, this flow is initialized from the deterministic mask embedding, a state on which MDLMs are explicitly trained. Building on this flow, we further adapt the schedule to the asymmetric, context-dependent nature of language generation, where some tokens must commit early to inform others. Our main contributions are:

- **A continuous decoding paradigm for pretrained MDLMs.** We show that mask-prediction decoding can be reformulated as an $x$-prediction flow in the input embedding space, where each trajectory starts from the mask embedding and is iteratively moved toward the predicted clean state. This turns binary unmasking into continuous, revisable updates and runs on off-the-shelf MDLMs with only a few hundred alignment-training steps.
- **An asynchronous token-wise diffusion schedule.** We replace the globally synchronous schedule in image diffusion with a confidence-based asynchronous update scheme in which each token carries its own decoding progress. This allows high-confidence positions to progress faster, which provides a clearer context for other tokens.
- **A learned token-wise step-size policy.** We parameterize each token's update as a log-scale fraction of its remaining diffusion distance, conditioned on the token's confidence statistics and current decoding progress, and train this policy with GRPO [[18](https://arxiv.org/html/2606.29066#bib.bib11)] using task-level rewards plus a completion regularization term.

## 2 Masked Diffusion Language Model

This section introduces masked diffusion language models (MDLMs), a family of non-autoregressive generative models for text that recover clean tokens from partially masked sequences, and establishes the notation used throughout the rest of the paper. We describe MDLMs in their general form, of which LLaDA [[16](https://arxiv.org/html/2606.29066#bib.bib1)] is a representative instantiation at scale.

Let $\mathcal{V}$ be a discrete vocabulary augmented with a special mask token [M]. A language sequence of length $N$ is $\mathbf{x}_0 = (x_0^1, \ldots, x_0^N) \in \mathcal{V}^N$. Each token $v \in \mathcal{V}$ is associated with a learned embedding via an embedding matrix $\mathbf{W}_e \in \mathbb{R}^{|\mathcal{V}| \times E}$, and we denote the mask embedding as $\mathbf{m} \triangleq \mathbf{W}_e[\texttt{[M]}] \in \mathbb{R}^E$. We assume access to a corruption mechanism that, given a mask ratio $\sigma \in [0,1]$, produces a partially masked sequence $\tilde{\mathbf{x}}$ in which a subset $\mathcal{M} \subseteq \{1, \ldots, N\}$ of positions (with expected size $\sigma N$) are replaced by [M], while the remaining positions $\mathcal{U} = \{1, \ldots, N\} \setminus \mathcal{M}$ retain their original tokens.

#### Training objective

An MDLM is trained to recover the clean tokens at masked positions given the surrounding context. Concretely, the model defines a conditional distribution $p_\theta(x_0^i \mid \tilde{\mathbf{x}})$ for each position $i$, and is optimized by minimizing a cross-entropy loss restricted to the masked positions:

$$\mathcal{L}_{\text{MDLM}}(\theta) = -\,\mathbb{E}_{\sigma,\,\mathbf{x}_0,\,\tilde{\mathbf{x}}}\left[\frac{1}{\sigma}\sum_{i=1}^{N}\mathbf{1}\!\left[i \in \mathcal{M}\right]\log p_\theta\!\left(x_0^i \mid \tilde{\mathbf{x}}\right)\right], \tag{1}$$

where $\sigma \sim \mathcal{U}[0,1]$ and the $\frac{1}{\sigma}$ weighting compensates for the expected fraction of masked tokens. This objective has been shown to be a variational upper bound on the negative log-likelihood of the data distribution [[17](https://arxiv.org/html/2606.29066#bib.bib8); [16](https://arxiv.org/html/2606.29066#bib.bib1)], providing a principled likelihood-based training criterion.
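As a rough illustration of Eq. 1, the following NumPy sketch computes a single-sample estimate of the loss for one corrupted sequence. The function name `mdlm_loss`, the toy sizes, and the random logits standing in for a model are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def mdlm_loss(log_probs, targets, masked, sigma):
    """Single-sample estimate of Eq. 1 (hypothetical helper).

    log_probs: (N, V) array of log p_theta(. | x_tilde) per position
    targets:   (N,) clean token ids x_0
    masked:    (N,) bool array, True where position i is in the mask set M
    sigma:     mask ratio used to corrupt the sequence, in (0, 1]
    """
    # log-probability assigned to the clean token at every position
    token_ll = log_probs[np.arange(len(targets)), targets]
    # only masked positions contribute; 1/sigma reweights by the mask ratio
    return -(token_ll * masked).sum() / sigma

rng = np.random.default_rng(0)
N, V = 6, 5                                   # toy sequence length and vocabulary size
logits = rng.normal(size=(N, V))              # stand-in for model outputs
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
targets = rng.integers(0, V, size=N)
sigma = 0.5
masked = rng.random(N) < sigma                # each position masked with probability sigma
loss = mdlm_loss(log_probs, targets, masked, sigma)
```

Unmasked positions contribute nothing to this loss, which is the property that Sec. 4.1 later has to compensate for.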

#### Mask predictor

The conditional $p_\theta(\cdot \mid \tilde{\mathbf{x}})$ is parameterized by a *mask predictor* $f_\theta$, typically realized as a bidirectional Transformer [[19](https://arxiv.org/html/2606.29066#bib.bib17)] so that every masked position can attend to the full surrounding context. Given $\tilde{\mathbf{x}}$, the predictor outputs per-position logits

$$\mathbf{z}_{\text{pred}}^i = f_\theta(\tilde{\mathbf{x}})^i \in \mathbb{R}^{|\mathcal{V}|}, \qquad p_\theta(\cdot \mid \tilde{\mathbf{x}})^i = \operatorname{softmax}(\mathbf{z}_{\text{pred}}^i), \tag{2}$$

and the predicted token at position $i$ is $x_{\text{pred}}^i = \arg\max_v \, p_\theta(v \mid \tilde{\mathbf{x}})^i$. Crucially, all masked positions are predicted in parallel within a single forward pass, in sharp contrast to the sequential left-to-right factorization of autoregressive language models.

#### Standard discrete sampling

At inference time, the input to the model is the concatenation of a clean prompt $\mathbf{p}$ and a fully masked response of a chosen length, and decoding proceeds over a fixed number of steps until every response position is filled. At each step, $f_\theta$ predicts all currently masked response tokens simultaneously, conditioned on $\mathbf{p}$ and the partially decoded response; a subset of these predictions is then committed (unmasked), while the rest are re-masked for the next step. The prompt $\mathbf{p}$ is never masked. Existing MDLMs typically commit a fixed fraction of tokens per step and rely on heuristics to decide which tokens to commit, such as low-confidence remasking, which retains the most confident predictions and re-masks the least confident ones [[16](https://arxiv.org/html/2606.29066#bib.bib1)]. However, this discrete, synchronous update ignores the continuous structure of the token embedding space and forces premature hard decisions that cannot be revised in later steps. This motivates the continuous decoding framework introduced next.
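The binary commit-or-remask step described above can be sketched as follows. This is a minimal illustration of low-confidence remasking under simplifying assumptions: the function name `discrete_step`, the fixed `n_commit` count, and the toy probabilities are hypothetical, and real implementations differ in how they schedule the number of committed tokens.

```python
import numpy as np

def discrete_step(probs, is_masked, n_commit):
    """One step of a commit/re-mask decoder (illustrative sketch).

    probs:     (N, V) predictive distribution at every position
    is_masked: (N,) bool array, True for positions that are still [M]
    n_commit:  number of masked positions to commit at this step
    """
    pred = probs.argmax(axis=1)                    # predicted token per position
    conf = probs.max(axis=1)                       # confidence of that prediction
    conf = np.where(is_masked, conf, -np.inf)      # only masked positions compete
    n_commit = min(n_commit, int(is_masked.sum()))
    commit = np.argsort(-conf)[:n_commit]          # most confident masked positions
    new_masked = is_masked.copy()
    new_masked[commit] = False                     # the rest stay masked (re-masked)
    return pred, commit, new_masked

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])
is_masked = np.array([True, True, False, True])
pred, commit, new_masked = discrete_step(probs, is_masked, n_commit=1)
```

Note that the full distributions at the positions that stay masked are thrown away after the step; only the mask pattern and the committed token ids carry over.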

## 3 Continuous-State Diffusion for MDLMs

The goal of this paper is to bring continuous-state diffusion to MDLMs. Unlike the discrete update in standard MDLM decoding, where each masked position is either committed to a single token or left fully masked at every step, a continuous-state decoder can carry a soft intermediate estimate of every token across steps and refine it gradually. Preserving this intermediate state instead of collapsing it into a hard decision uses each diffusion step more efficiently and yields higher-quality generations under a limited diffusion budget. To this end, we propose a continuous decoding framework based on an $x$-prediction flow in token embedding space, anchored at the mask embedding rather than at Gaussian noise, which lets a pretrained MDLM be operated as a continuous-state diffusion model with only lightweight alignment training.

### 3.1 Challenges of Transferring Continuous-State Diffusion to Language

We first identify two challenges of transferring continuous-state diffusion from images to language; our method addresses them in the subsequent subsections. Continuous-state diffusion has achieved remarkable success in the image domain [[11](https://arxiv.org/html/2606.29066#bib.bib10); [15](https://arxiv.org/html/2606.29066#bib.bib9); [14](https://arxiv.org/html/2606.29066#bib.bib2)], where a clean state is recovered by gradually denoising a corrupted continuous state through a sequence of small refinement steps. Naively transplanting this recipe to language, however, runs into two obstacles that any continuous decoder for text must confront.

#### Gaussian noise is not a well-defined state for language

In continuous-state image diffusion, the corruption process adds Gaussian noise to a clean image. The result is another image, degraded but still a valid element of the pixel space, and the model is trained on a continuum of such states. Language does not admit an analogous notion of noise. Tokens are discrete, and their meaning is determined by lexical identity rather than by position in any continuous coordinate, so a Gaussian perturbation of a token embedding does not correspond to any vocabulary item, and the magnitude of the perturbation has no semantically interpretable counterpart. As a consequence, a randomly noised embedding is not a state on which a pretrained MDLM has ever been conditioned, and using it as the starting point of a continuous diffusion trajectory leaves the model operating off-distribution from the outset.

#### Language generation is highly context-dependent

Continuous-state image diffusion adopts a synchronous denoising schedule because pixels, although locally correlated, can be refined largely in parallel: most of the information needed to disambiguate one pixel is shared symmetrically with its neighbours and is gradually revealed as noise is removed. Language exhibits a much stronger and more asymmetric form of contextual dependence. The identity of a token can be determined almost entirely by a few specific tokens elsewhere in the sequence, and those informative tokens may themselves be uncertain. A useful continuous decoder must therefore allow some positions to commit early so that they can serve as anchors for the rest, while leaving other positions open to refinement until enough surrounding context has stabilized. A globally synchronous schedule with a single shared step size, as in standard image diffusion, cannot express this asymmetric refinement order and forces every position to be resolved at the same rate regardless of how informative its current context is.

### 3.2 $x$-Prediction Flow Anchored at Mask Embedding

As elaborated above, Gaussian noise is not a well-defined state in language. Hence, we formulate our corrupted state $\mathbf{X}_{\text{in}} \in \mathbb{R}^{N \times E}$ as:

$$\mathbf{X}_{\text{in}} = \mathbf{t} \cdot \mathbf{X}_0 + (1 - \mathbf{t}) \cdot \mathbf{m}, \tag{3}$$

where $N$ is the number of response tokens and $E$ is the embedding dimension. $\mathbf{X}_0$ denotes the clean token embeddings, $\mathbf{m}$ is the mask embedding, which represents a well-defined unknown state for pretrained MDLMs, and $\mathbf{t}$ is the diffusion time, which represents the corruption level in diffusion and the decoding progress in our case.

At the beginning of the diffusion process, where $\mathbf{t} = 0$, the model starts with pure mask embeddings, which aligns with the training setup of MDLMs. During the diffusion process, the model refines the response and increases the decoding progress $\mathbf{t}$ at each step. When $\mathbf{t}$ reaches 1, the tokens are fixed to the final believed clean tokens $\mathbf{X}_0$.

At each diffusion step, the MDLM $f_\theta$ produces per-position logits over the vocabulary from the corrupted input. We take the argmax token at every position and map it back into the input embedding space via the embedding lookup $\operatorname{Embed}(\cdot)$ to obtain the predicted state $\mathbf{X}_{\text{pred}}$:

$$\mathbf{X}_{\text{pred}} = \operatorname{Embed}(\operatorname{argmax}(f_\theta(\mathbf{X}_{\text{in}}))). \tag{4}$$

We interpret this prediction as the current estimate of the clean signal in the input space. Following the $x$-prediction dynamic system defined in JiT [[14](https://arxiv.org/html/2606.29066#bib.bib2)], we derive the velocity field as:

$$\mathbf{V}(\mathbf{X}_{\text{in}}, \mathbf{t}) = \frac{\mathbf{X}_{\text{pred}} - \mathbf{X}_{\text{in}}}{1 - \mathbf{t}}, \tag{5}$$

and we can update the current state $\mathbf{X}_{\text{in}}$ with:

$$\mathbf{X}_{\text{in}} = \mathbf{X}_{\text{in}} + \mathbf{V} \cdot \Delta\mathbf{t}. \tag{6}$$

This formulation allows each token to be continuously refined in the input space of the MDLM.
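To make Eqs. 4 to 6 concrete, here is a small NumPy sketch of the update with a single shared scalar progress `t`. Everything around the three equations is assumed for illustration: `toy_predictor` is a hypothetical stand-in for $f_\theta$ that always prefers a fixed set of target tokens, the embedding table is random, and its last row plays the role of the mask embedding $\mathbf{m}$.

```python
import numpy as np

def flow_step(x_in, t, dt, embed, predict_logits):
    """One x-prediction flow step with a shared scalar progress t < 1."""
    logits = predict_logits(x_in)               # (N, V) per-position logits
    x_pred = embed[logits.argmax(axis=1)]       # Eq. 4: embed the argmax tokens
    v = (x_pred - x_in) / (1.0 - t)             # Eq. 5: velocity toward x_pred
    return x_in + v * dt, t + dt                # Eq. 6, plus progress bookkeeping

rng = np.random.default_rng(0)
V, E, N = 5, 4, 3
embed = rng.normal(size=(V + 1, E))             # toy table; last row acts as m
m = embed[V]
target = np.array([2, 0, 4])

def toy_predictor(x_in):
    # Hypothetical stand-in for f_theta: ignores its input and returns
    # one-hot logits for a fixed target sequence.
    return np.eye(V + 1)[target]

x, t = np.tile(m, (N, 1)), 0.0                  # start from pure mask embeddings
x, t = flow_step(x, t, 0.25, embed, toy_predictor)
x_quarter = x.copy()                            # state after one step, t = 0.25
for _ in range(3):
    x, t = flow_step(x, t, 0.25, embed, toy_predictor)
```

With a predictor that never changes its mind, the state moves along the straight line of Eq. 3 and lands on the predicted embeddings when `t` reaches 1. With a real model, `x_pred` can change between steps, which is what makes the trajectory revisable.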

### 3.3 Confidence-based Asynchronous Decoding

As discussed in Sec. [3.1](https://arxiv.org/html/2606.29066#S3.SS1), language generation is highly context-dependent: some tokens must be fixed first so that the others can be inferred. To accommodate this, we introduce a confidence-based asynchronous update mechanism in which the decoding progress is a per-token vector $\mathbf{t} \in \mathbb{R}^N$.

The step size is defined as:

$$\Delta\mathbf{t} = \mathbf{a} \cdot (\mathbf{1} - \mathbf{t}), \tag{7}$$

where $\mathbf{a} \in [a_{\min}, a_{\max}]^N$ is a per-token step fraction produced by a confidence-based step-size policy that takes the model's predictive distribution and the current decoding progress as input. The policy parameterization and training are detailed in Sec. [4.2](https://arxiv.org/html/2606.29066#S4.SS2). After each step, the decoding progress $\mathbf{t}$ is updated with:

$$\mathbf{t} = \mathbf{t} + \Delta\mathbf{t}. \tag{8}$$

This mechanism allows high-confidence tokens to progress faster, providing a richer context for the refinement of low-confidence tokens.
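One consequence of combining Eq. 7 with Eqs. 5 and 6 is worth spelling out: the factor $(1 - t_i)$ cancels, so each token's state is simply interpolated toward its prediction by the fraction $a_i$. The sketch below illustrates this with fixed, hand-picked step fractions; in the paper, $\mathbf{a}$ comes from the learned policy, and the function name `async_step` and the toy values are assumptions of this example.

```python
import numpy as np

def async_step(x_in, x_pred, t, a):
    """Token-wise asynchronous update (Eqs. 5-8) with per-token fractions a."""
    dt = a * (1.0 - t)                                   # Eq. 7
    # V * dt = (x_pred - x_in) / (1 - t) * a * (1 - t) = a * (x_pred - x_in),
    # so no division by (1 - t) is needed, even for tokens with t = 1.
    x_new = x_in + a[:, None] * (x_pred - x_in)          # Eqs. 5 and 6 combined
    return x_new, t + dt                                 # Eq. 8

rng = np.random.default_rng(1)
N, E = 2, 3
m = rng.normal(size=E)                     # toy mask embedding
x_pred = rng.normal(size=(N, E))           # predictions held fixed for illustration
a = np.array([0.5, 0.1])                   # a confident token and a hesitant one
x, t = np.tile(m, (N, 1)), np.zeros(N)
for _ in range(3):
    x, t = async_step(x, x_pred, t, a)
```

With a constant fraction $a_i$, progress after $k$ steps is $t_i = 1 - (1 - a_i)^k$, so the remaining distance shrinks geometrically and never overshoots 1.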

### 3.4 Confidence-Driven Discrete Adjustments

Standard MDLM decoding makes two confidence-driven decisions at every step: it commits the most confident predictions and re-masks the rest. On top of our continuous flow, we retain two analogous discrete adjustments that inherit the same confidence signal: one rolls back the diffusion state when a previously high-confidence prediction is no longer trusted, and the other commits the most reliable prediction early to anchor the rest. We define the confidence of token $i$ as

$$c_i = \max(\operatorname{softmax}(\mathbf{z}_i)), \tag{9}$$

where $\mathbf{z}_i$ denotes the output logits of token $i$.

#### Re-editing

The step size $\Delta\mathbf{t}$ in Sec. [3.3](https://arxiv.org/html/2606.29066#S3.SS3) is positive, so decoding progress is monotonically increasing. In practice, the model is sometimes no longer confident in a token whose progress has already advanced. To allow such tokens to be re-estimated, when $c_i - t_i < -0.1$ at a given step we re-edit the state of token $i$ as a mixture of the current prediction $\mathbf{x}_{\text{pred}}$ and the mask embedding $\mathbf{m}$:

$$\mathbf{x}_i = c_i \cdot \mathbf{x}_{\text{pred}} + (1 - c_i) \cdot \mathbf{m}, \tag{10}$$

and reset its decoding progress to $t_i = c_i$. Intuitively, when the model's confidence drops well below the position's commitment level, we let the diffusion state jump back toward the current estimate and abandon the previously trusted information.
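The re-editing rule can be written as a vectorized check over all positions. The following is a sketch under the stated rule ($c_i - t_i < -0.1$, Eq. 10, and $t_i = c_i$); the function name `re_edit`, the `margin` parameter name, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def re_edit(x, t, conf, x_pred, m, margin=0.1):
    """Roll back tokens whose confidence fell well below their progress."""
    roll = (conf - t) < -margin                                   # trigger condition
    mixed = conf[:, None] * x_pred + (1.0 - conf[:, None]) * m    # Eq. 10
    x_new = np.where(roll[:, None], mixed, x)                     # replace only flagged rows
    t_new = np.where(roll, conf, t)                               # reset progress to c_i
    return x_new, t_new, roll

m = np.zeros(2)                                    # toy mask embedding
x_pred = np.array([[1.0, 1.0], [2.0, 2.0]])        # current predictions
x = np.array([[0.9, 0.9], [0.6, 0.6]])             # current states
t = np.array([0.9, 0.3])                           # current progress
conf = np.array([0.5, 0.4])                        # current confidence (Eq. 9)
x_new, t_new, roll = re_edit(x, t, conf, x_pred, m)
```

In this toy case only the first token is rolled back: its progress (0.9) exceeds its confidence (0.5) by more than the margin, while the second token's confidence is above its progress.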

#### Hard commitment

At each diffusion step, among all tokens with decoding progress $t_i < 0.99$, we fix the token with the highest confidence and set its decoding progress to $1$. This anchors the most reliable prediction so that it provides a stable context for the remaining tokens, mirroring the commit-by-confidence heuristic of standard MDLM decoding.
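A possible implementation of the hard-commitment rule is sketched below. The text specifies the selection rule and the progress update; setting the committed token's state to its predicted embedding is an assumption of this sketch (it is what Eq. 3 gives at $t_i = 1$), and the names `hard_commit` and `thresh` are hypothetical.

```python
import numpy as np

def hard_commit(x, t, conf, x_pred, thresh=0.99):
    """Commit the most confident token among those with progress below thresh."""
    open_pos = t < thresh                            # tokens not yet (nearly) finished
    if not open_pos.any():
        return x, t, None                            # nothing left to commit
    i = int(np.argmax(np.where(open_pos, conf, -np.inf)))
    x, t = x.copy(), t.copy()
    x[i] = x_pred[i]                                 # assumption: snap state to the prediction
    t[i] = 1.0                                       # progress set to 1 as described
    return x, t, i

x = np.zeros((3, 2))
x_pred = np.ones((3, 2))
t = np.array([1.0, 0.2, 0.5])
conf = np.array([0.99, 0.6, 0.8])
x_new, t_new, committed = hard_commit(x, t, conf, x_pred)
```

The first token has the highest confidence but is already finished, so the rule picks the third token instead.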

## 4 Training

In this section, we describe how we adapt a pretrained MDLM for $x$-prediction flow and how we train the step-size policy.

### 4.1 $x$-Prediction Alignment

As discussed in Sec. [2](https://arxiv.org/html/2606.29066#S2), the optimization goal of MDLMs is to minimize the cross-entropy loss at *masked positions*. For unmasked positions, the output of MDLMs is not explicitly constrained, because standard decoding never uses predictions at positions that are already unmasked.

When we reinterpret mask prediction as $x$-prediction flow as stated in Sec. [3.2](https://arxiv.org/html/2606.29066#S3.SS2), we need the model to:

1. Predict the clean token $\mathbf{x}_0^i$ from $\mathbf{m}$ on all masked positions $i \in \mathcal{M}$.
2. Predict the identical clean token $\mathbf{x}_0^i$ from $\mathbf{x}_0^i$ on all unmasked positions $i \in \mathcal{U}$, because the estimate at a clean diffusion state should remain unchanged in a stable dynamic system.

MDLMs are already trained to meet requirement 1. Hence, we only need to further align the model to meet requirement 2.

For any masked input sequence $\mathbf{X}_{\text{in}} = \begin{cases}\mathbf{m}, & \text{if masked}\\ \mathbf{X}_0, & \text{otherwise}\end{cases}$ and MDLM $f_\theta$, we construct the following loss:

$$\mathbf{X}_{\text{pred}} = \operatorname{softmax}(\mathbf{Z}_{\text{pred}}) \cdot \mathbf{W}_{\text{in}}, \qquad \mathbf{Z}_{\text{pred}} = f_\theta(\mathbf{X}_{\text{in}}), \tag{11}$$

$$\mathcal{L}_{x\text{-pred}} = \left\|\mathbf{X}_{\text{pred}} - \mathbf{X}_0\right\|_2^2. \tag{12}$$

The $x$-prediction loss $\mathcal{L}_{x\text{-pred}}$ constrains the model to predict clean states from corrupted sequences in the embedding space without hurting its original mask prediction ability. Since we use a mean squared error loss instead of a cross-entropy loss to explicitly align the model's output in the embedding space, $\mathcal{L}_{x\text{-pred}}$ is not suitable for guiding the model's generation on unseen inputs. We use self-generated answers as clean data $\mathbf{X}_0$ in experiments to make sure that the loss originates from misalignment rather than from wrong predictions.
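Eqs. 11 and 12 amount to a probability-weighted average of embedding rows followed by a squared distance. The sketch below shows the forward computation only, in NumPy; the toy embedding matrix, the hand-built logits, and the choice to sum (rather than average) the squared errors are assumptions of this example, and actual training would need an autodiff framework.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def x_pred_loss(logits, w_in, x0):
    """Squared L2 distance between soft-predicted and clean embeddings."""
    x_pred = softmax(logits) @ w_in                # Eq. 11: expected embedding under p
    return ((x_pred - x0) ** 2).sum()              # Eq. 12 (sum reduction assumed)

rng = np.random.default_rng(0)
V, E = 6, 4
w_in = rng.normal(size=(V, E))                     # toy input embedding matrix
tokens = np.array([3, 1, 5])                       # clean tokens
x0 = w_in[tokens]                                  # their clean embeddings
sharp = 50.0 * np.eye(V)[tokens]                   # logits sharply peaked on the clean tokens
flat = np.zeros((3, V))                            # uninformative logits
loss_sharp = x_pred_loss(sharp, w_in, x0)
loss_flat = x_pred_loss(flat, w_in, x0)
```

A distribution concentrated on the clean token reproduces the clean embedding almost exactly, while a flat distribution yields the mean embedding and a large loss. This is why the loss reaches zero only when the model predicts the identical clean token at clean positions.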

### 4.2 Step-Size Policy Training

As discussed in Sec. [3.3](https://arxiv.org/html/2606.29066#S3.SS3), decoding requires a step-size policy that produces an adaptive step size for each token from its confidence statistics. We train this policy with reinforcement learning because we want it to optimize for end-task performance rather than per-token matching with a reference, and the former is non-differentiable through the multi-step decoder. Among RL algorithms, we adopt GRPO [[18](https://arxiv.org/html/2606.29066#bib.bib11)] to avoid learning a separate value baseline, which would be unreliable here since per-step confidence statistics are only weakly predictive of the final sequence reward.

At each decoding step, for each token, we construct a compact policy state from the MDLM predictive distribution and the current diffusion progress:

$$\mathbf{s} = \left[p_{(1)}, p_{(2)}, p_{(3)}, p_{(4)}, H_k, p_m, t, \rho\right], \tag{13}$$

where $p_{(j)}$ denotes the $j$-th largest probability from $p = \operatorname{softmax}(\mathbf{z}_{\text{pred}})$, and $p_m = p_{(1)} - p_{(2)}$ is the confidence margin. The scalar $H_k$ is the normalized entropy over the top-$k$ probabilities:

$$H_k = -\frac{1}{\log k}\sum_{i \in S_k}\tilde{p}_i \log \tilde{p}_i, \qquad \tilde{p}_i = \frac{p_i}{\sum_{j \in S_k} p_j}, \quad i \in S_k, \tag{14}$$

where $S_k$ is the set of top-$k$ indices; in our implementation $k = 4$. Finally, $t \in [0,1]$ denotes the token-wise decoding progress, and $\rho = \frac{\ell}{T-1}$ denotes the normalized global diffusion step at step $\ell$ out of $T$ total steps.
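The eight-dimensional state of Eq. 13 can be assembled for one token as in the sketch below. The function name `policy_state` and the clipping constant inside the logarithm are assumptions added for numerical safety; the sketch also assumes $T > 1$ and reuses $k = 4$ for both the top-probability features and the entropy.

```python
import numpy as np

def policy_state(probs, t, step, total_steps, k=4):
    """Build the per-token policy state s of Eq. 13 from one distribution."""
    top = np.sort(probs)[::-1][:k]                       # p_(1) >= ... >= p_(k)
    p_tilde = top / top.sum()                            # renormalize over top-k (Eq. 14)
    log_p = np.log(np.clip(p_tilde, 1e-12, None))        # clip avoids log(0)
    h_k = -(p_tilde * log_p).sum() / np.log(k)           # normalized entropy in [0, 1]
    p_m = top[0] - top[1]                                # confidence margin
    rho = step / (total_steps - 1)                       # normalized global step
    return np.array([*top, h_k, p_m, t, rho])

uniform = np.full(8, 1 / 8)                              # maximally uncertain token
s_uniform = policy_state(uniform, t=0.0, step=0, total_steps=5)
peaked = np.array([0.97, 0.01, 0.01, 0.01])              # confident token
s_peaked = policy_state(peaked, t=0.5, step=4, total_steps=5)
```

For the uniform distribution the normalized entropy is 1 and the margin is 0; for the peaked one the margin is 0.96 and the entropy is well below 1, which is the kind of contrast the policy can use to pick small or large steps.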

#### Log-scale Beta policy

The step policy parameterizes a Beta distribution over a latent variable $y \in [0,1]$, which is later mapped to the action $a$. Concretely, a two-layer MLP with SiLU activations encodes $\mathbf{s}$ into a hidden representation $\mathbf{h}_\theta(\mathbf{s}) \in \mathbb{R}^d$, and two linear heads predict the Beta mean and concentration from it:

$$\mu_\theta(\mathbf{s}) = 0.05 + 0.9 \cdot \sigma\!\left(\mathbf{W}_\mu \mathbf{h}_\theta(\mathbf{s}) + b_\mu\right), \tag{15}$$

$$\kappa_\theta(\mathbf{s}) = \kappa_{\min} + (\kappa_{\max} - \kappa_{\min}) \cdot \sigma\!\left(\mathbf{W}_\kappa \mathbf{h}_\theta(\mathbf{s}) + b_\kappa\right),$$

where $\mathbf{W}_\mu, \mathbf{W}_\kappa \in \mathbb{R}^{1 \times d}$ and $b_\mu, b_\kappa \in \mathbb{R}$ are the parameters of the two heads, and $\sigma$ is the sigmoid. The affine rescalings bound $\mu_\theta(\mathbf{s}) \in [0.05, 0.95]$ and $\kappa_\theta(\mathbf{s}) \in [\kappa_{\min}, \kappa_{\max}]$, keeping the Beta distribution away from degenerate endpoints and extreme concentrations while still allowing the policy to represent both conservative and aggressive updates. The latent variable is then sampled as

$$y \sim \operatorname{Beta}(\alpha_\theta, \beta_\theta), \qquad \alpha_\theta = \mu_\theta(\mathbf{s})\,\kappa_\theta(\mathbf{s}), \qquad \beta_\theta = \left(1 - \mu_\theta(\mathbf{s})\right)\kappa_\theta(\mathbf{s}). \tag{16}$$

Instead of using $y$ directly as the step fraction, we map it to an action $a \in [a_{\min}, a_{\max}]$ on a logarithmic scale:

$$a = 2^{u}, \qquad u = \log_2 a_{\min} + y\left(\log_2 a_{\max} - \log_2 a_{\min}\right). \tag{17}$$

We use $a_{\min} = 1/256$ and $a_{\max} = 1$. The resulting token-wise diffusion step for token $i$ is $\Delta t_i = a_i \cdot (1 - t_i)$ as in Eq. [7](https://arxiv.org/html/2606.29066#S3.E7). Thus, the policy controls the fraction of the remaining diffusion distance to consume at this step. The log-scale action space is important because useful updates span orders of magnitude: uncertain tokens often need very small refinements, while high-confidence tokens can safely consume most of their remaining distance. During deterministic decoding, we replace the sampled action by its policy expectation under this transformed Beta distribution.
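The sampling path of Eqs. 16 and 17 is short enough to sketch directly. Here the MLP and its two heads are omitted: `mu` and `kappa` are passed in as plain numbers chosen for illustration, and the helper names `beta_params` and `to_action` are hypothetical.

```python
import numpy as np

A_MIN, A_MAX = 1 / 256, 1.0                      # action bounds from the text

def beta_params(mu, kappa):
    """Mean/concentration to Beta shape parameters (Eq. 16)."""
    return mu * kappa, (1.0 - mu) * kappa

def to_action(y, a_min=A_MIN, a_max=A_MAX):
    """Map latent y in [0, 1] to a step fraction on a log2 scale (Eq. 17)."""
    u = np.log2(a_min) + y * (np.log2(a_max) - np.log2(a_min))
    return 2.0 ** u

rng = np.random.default_rng(0)
alpha, beta = beta_params(mu=0.3, kappa=10.0)    # illustrative values, not from the paper
y = rng.beta(alpha, beta, size=1000)             # sampled latent variables
a = to_action(y)                                 # per-sample step fractions
```

With these bounds, $y = 0.5$ maps to $a = 2^{-4} = 1/16$ rather than to roughly $0.5$, which shows how the log scale devotes half of the latent range to step fractions below $1/16$.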

#### Reward design

For each promptqbq\_\{b\}\(indexed bybbin the batch\), we sample a group ofGGdecoded trajectories with the current policy, indexed byg∈\{1,…,G\}g\\in\\\{1,\\ldots,G\\\}\. Each token receives a reward that combines a trajectory\-level task term \(e\.g\., pass rate\) and a per\-token completion term:

Rb,g,i=Rb,gtask\+λ⋅tb,g,ifinal,R\_\{b,g,i\}=R^\{\\text\{task\}\}\_\{b,g\}\+\\lambda\\cdot t^\{\\text\{final\}\}\_\{b,g,i\},\(18\)whereRb,gtaskR^\{\\text\{task\}\}\_\{b,g\}is the trajectory\-level task reward and is broadcast to every token in trajectorygg, whiletb,g,ifinal∈\[0,1\]t^\{\\text\{final\}\}\_\{b,g,i\}\\in\[0,1\]is the final decoding progress of tokenii\. The completion term constrains the policy to push every token’s diffusion progress toward11, preventing it from leaving tokens partially decoded at the end of the budget\.

For the within-group advantage construction and the clipped policy update, we follow GRPO \[[18](https://arxiv.org/html/2606.29066#bib.bib11)\].

## 5 Experiments

We evaluate our $x$-prediction flow decoding on the code generation task using LLaDA \[[16](https://arxiv.org/html/2606.29066#bib.bib1)\]. To show that our method scales up to larger models, we further evaluate on LLaDA2.0 \[[3](https://arxiv.org/html/2606.29066#bib.bib5)\].

### 5.1 $x$-Prediction Alignment Training

For $x$-prediction alignment, we first generate self-alignment data from the Tulu3-SFT-Personas-Code dataset \[[13](https://arxiv.org/html/2606.29066#bib.bib7)\] using LLaDA and LLaDA2.0. We then train the model with the alignment objective described in Sec. [4.1](https://arxiv.org/html/2606.29066#S4.SS1): LLaDA is updated with full-parameter training, while LLaDA2.0 is updated with LoRA \[[12](https://arxiv.org/html/2606.29066#bib.bib6)\]. We update LLaDA for 100 steps and LLaDA2.0 for 400 steps. In all experiments below, this alignment training is the only update applied to the pretrained MDLM. The model is not trained on any intermediate states.

Fig\.[2\(a\)](https://arxiv.org/html/2606.29066#S5.F2.sf1)shows the MSE alignment loss at masked and unmasked positions\. After only 100 update steps, the loss on unmasked positions drops close to zero, indicating that the model quickly learns to preserve the clean embedding state when the input token is already clean\. Fig\.[2\(b\)](https://arxiv.org/html/2606.29066#S5.F2.sf2)further shows that this alignment does not degrade the original masked\-token prediction behavior: the CE loss at masked positions remains close to that of the pretrained LLaDA reference\.

These results support the central premise of our formulation. Clean-state prediction in embedding space is not a new capability that must be learned from scratch; rather, it is largely compatible with the representations already learned by the MDLM backbone. The lightweight alignment mainly calibrates the model to output a stable clean state at every position, including positions that are not masked during standard MDLM training. Consequently, our reinterpretation of mask prediction as $x$-prediction flow can reuse pretrained MDLMs with minimal backbone modification.

![Refer to caption](https://arxiv.org/html/2606.29066v1/x2.png)
(a) Mean squared error (MSE) loss.

![Refer to caption](https://arxiv.org/html/2606.29066v1/x3.png)
(b) Cross entropy (CE) loss.

Figure 2: Training losses during $x$-prediction alignment. The MSE curve tracks masked and unmasked positions, while the CE curve compares the aligned model against the pretrained LLaDA reference at masked positions.

### 5.2 GRPO Training

We train the step-size policy on top of LLaDA-8B-Instruct, using the MBPP \[[2](https://arxiv.org/html/2606.29066#bib.bib12)\] training split as the prompt source. Each rollout generates a response of length $256$ over $T = 256$ diffusion steps.

#### Prompt filtering

Rather than training on the full MBPP training split, we first run the pretrained LLaDA-8B-Instruct on every training problem and keep only the $164$ problems that it can already solve. The step-size policy controls only how the diffusion progress accumulates across tokens and steps, not what the underlying model can solve. For problems that the base model cannot solve under any schedule, every trajectory in a group receives zero task reward; consequently, the within-group advantage collapses and the policy receives no useful learning signal. Restricting training to solvable problems therefore concentrates the reward signal on the regime in which the schedule actually matters.
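The collapse of the within-group advantage can be seen in a short sketch. It assumes the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation) and, for simplicity, looks at the task term only; the function name and the `eps` constant are illustrative and not taken from the paper.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantage: (r - mean) / (std + eps) within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    unsolvable = [0.0] * 8                     # every rollout fails the task
    solvable = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    print(group_advantages(unsolvable))        # all zeros: no gradient signal
    print(group_advantages(solvable))          # successes positive, failures negative
```

When all $G = 8$ rollouts of a prompt receive the same task reward, every advantage is zero, so such prompts contribute nothing to the policy update under this normalization.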

#### Optimization

We train the policy with a batch size of $8$ prompts and a group size of $G = 8$ rollouts per prompt for $6$k steps.

### 5.3 Code Generation Evaluation

#### Setup

We evaluate on HumanEval \[[4](https://arxiv.org/html/2606.29066#bib.bib16)\] and MBPP. The generation length is $512$ on HumanEval and $256$ on MBPP. For LLaDA-8B-Instruct, we use a block length equal to the full generation length, i.e., no semi-autoregressive partitioning. LLaDA2.0-mini was pretrained with a blockwise causal mask, so we keep its native semi-autoregressive setup and use a block size of $32$.

Table 1: Accuracy (%) on HumanEval and MBPP under varying decoding budgets (fraction of diffusion steps compared to generation length).

| Budget | Decoding Method | LLaDA-8B-Instruct HumanEval | LLaDA-8B-Instruct MBPP | LLaDA2.0-mini HumanEval | LLaDA2.0-mini MBPP |
| --- | --- | --- | --- | --- | --- |
| 1/4 | mask prediction | 33.54 | 21.20 | 32.32 | 31.40 |
| 1/4 | $x$-prediction flow | 45.12 | 33.00 | 59.76 | 43.80 |
| | (gain) | *+11.58* | *+11.80* | *+27.44* | *+12.40* |
| 1/1 | mask prediction | 46.34 | 39.60 | 79.88 | 63.20 |

#### Results

Table [1](https://arxiv.org/html/2606.29066#S5.T1) reports accuracy under a $1/4$ decoding budget (diffusion steps equal to one-quarter of the generation length), with the full-budget mask-prediction baseline included for reference.

On both base models and both datasets, $x$-prediction flow improves over mask-prediction decoding by $11.58$ to $27.44$ points of accuracy at the same $1/4$ budget. With LLaDA-8B-Instruct on HumanEval, $x$-prediction flow at $1/4$ budget reaches $45.12$ accuracy, recovering $97\%$ of the full-budget mask-prediction baseline ($46.34$) while using a quarter of the diffusion steps. The gain grows further on LLaDA2.0-mini, where HumanEval accuracy jumps from $32.32$ to $59.76$ at the same $1/4$ budget.
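The headline numbers can be rechecked directly from Table 1. The short script below is only an arithmetic check using the reported accuracies and generation lengths; the variable and function names are illustrative.

```python
# Accuracy (%) at the 1/4 budget, copied from Table 1: (mask prediction, x-prediction flow).
quarter_budget = {
    ("LLaDA-8B-Instruct", "HumanEval"): (33.54, 45.12),
    ("LLaDA-8B-Instruct", "MBPP"): (21.20, 33.00),
    ("LLaDA2.0-mini", "HumanEval"): (32.32, 59.76),
    ("LLaDA2.0-mini", "MBPP"): (31.40, 43.80),
}
# Full-budget mask-prediction accuracy for LLaDA-8B-Instruct on HumanEval.
full_budget_mask = 46.34

# Absolute gain of x-prediction flow over mask prediction, in accuracy points.
gains = {key: round(flow - mask, 2) for key, (mask, flow) in quarter_budget.items()}

# Fraction of the full-budget baseline recovered at the 1/4 budget.
recovery = quarter_budget[("LLaDA-8B-Instruct", "HumanEval")][1] / full_budget_mask


def diffusion_steps(generation_length: int, budget_denominator: int) -> int:
    """Number of diffusion steps when the budget is 1/budget_denominator of the length."""
    return generation_length // budget_denominator


if __name__ == "__main__":
    for key, gain in gains.items():
        print(key, f"+{gain:.2f}")
    print(f"recovery: {recovery:.1%}")
    print("HumanEval steps:", diffusion_steps(512, 4), "MBPP steps:", diffusion_steps(256, 4))
```

Under the stated generation lengths, the $1/4$ budget corresponds to $128$ diffusion steps on HumanEval and $64$ on MBPP.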

The LLaDA2\.0\-mini results shown are obtained with the step\-size policy trained on LLaDA\-8B\-Instruct\. Without any policy retraining, the policy still produces a large gain, indicating that it learns general patterns over per\-token confidence statistics rather than model\-specific shortcuts\.

#### Ablations

We ablate the main design components on LLaDA-8B-Instruct using HumanEval under the $1/4$ decoding budget. Table [2](https://arxiv.org/html/2606.29066#S5.T2) reports the accuracy after removing one component at a time. The results show that all our design components are important for the final performance, especially $x$-prediction alignment.

Table 2: Ablation results on HumanEval with LLaDA.

## 6 Related Work

#### Continuous diffusion language models

Continuous DLMs \[[8](https://arxiv.org/html/2606.29066#bib.bib18); [6](https://arxiv.org/html/2606.29066#bib.bib19)\] maintain a continuous state in latent space during the diffusion process and perform a discretization step at the final readout. However, they either target restricted conditional generation settings or require substantially more compute to approach the same performance as autoregressive language models. In contrast, our method exploits existing pretrained MDLMs and focuses on improving their efficiency without introducing substantial training effort.

#### Soft\-Masked Diffusion Language Models

The closest prior effort to ours is Soft-Masked Diffusion Language Models (SM) \[[10](https://arxiv.org/html/2606.29066#bib.bib3)\], which replaces the embedding of every retained mask with a convex combination of the $[\texttt{M}]$ embedding and the top-$k$ predicted token embeddings from the previous step, weighted by their confidence scores. Although SM softens the input representation of retained masks, its decoding process still follows a discrete position-level schedule: each position is either committed to a token or treated as masked, and the soft mixture is recomputed from the current step rather than carried as a persistent continuous state. In contrast, our method defines an explicit dynamical system in the input embedding space, where token-wise decoding progress accumulates across iterations and remains revisable until commitment. Additionally, SM requires more than 10k fine-tuning steps to teach the model to make use of the richer SM feedback, whereas our $x$-prediction flow exploits the pretrained MDLM's existing ability without training it on any mixed inputs. With only hundreds of steps of alignment, our method is able to achieve substantial improvement over the baseline models with the same budget limitation.

## 7 Conclusion

We introduced $x$-prediction flow, a continuous decoding framework that reinterprets mask prediction in MDLMs as clean-state prediction in embedding space. By anchoring the flow at the mask embedding, maintaining token-wise decoding progress, and learning confidence-conditioned step sizes, our method turns binary unmasking into continuous, revisable refinement while remaining compatible with pretrained MDLMs. Experiments on code generation show that this formulation substantially improves low-budget decoding, and ablations confirm the importance of our design components.

#### Limitations\.

Our framework explicitly assumes that linear combinations of token embeddings, including their mixtures with the mask embedding $\mathbf{m}$, still lie within an input region that a pretrained MDLM can interpret. This assumption is supported by our empirical validation on the two evaluated models and datasets, but a rigorous theoretical justification is still missing, and the geometry of the input embedding space that makes pretrained MDLMs tolerant to such continuous inputs remains poorly understood. A principled characterization of this property is an important direction for future work.

## References

1. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
2. J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
3. T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y. Gu, J. Hu, Z. Huang, Z. Lan, et al. (2025). LLaDA2.0: scaling up diffusion language models to 100b. arXiv preprint arXiv:2512.15745.
4. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
5. J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186.
6. S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong (2022). DiffuSeq: sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933.
7. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
8. I. Gulrajani and T. B. Hashimoto (2023). Likelihood-based diffusion language models. Advances in Neural Information Processing Systems 36, pp. 16693–16715.
9. D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645(8081), pp. 633–638.
10. M. Hersche, S. Moor-Smith, T. Hofmann, and A. Rahimi (2025). Soft-masked diffusion language models. arXiv preprint arXiv:2510.17206.
11. J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems 33, pp. 6840–6851.
12. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022). LoRA: low-rank adaptation of large language models. ICLR 1(2), pp. 3.
13. N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. (2024). Tulu 3: pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
14. T. Li and K. He (2025). Back to basics: let denoising generative models denoise. arXiv preprint arXiv:2511.13720.
15. Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
16. S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025). Large language diffusion models. arXiv preprint arXiv:2502.09992.
17. J. Ou, S. Nie, K. Xue, F. Zhu, J. Sun, Z. Li, and C. Li (2024). Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736.
18. Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
19. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. Advances in neural information processing systems 30.
