Adaptive Order Policies for Masked Diffusion

arXiv cs.LG 06/02/26, 04:00 AM Papers
Summary
Proposes learning the unmasking order in masked diffusion models using a lightweight policy network, with a weighted loss that outperforms heuristics on combinatorial tasks and protein design.
arXiv:2606.00295v1 Announce Type: new
# Adaptive Order Policies for Masked Diffusion
Source: [https://arxiv.org/html/2606.00295](https://arxiv.org/html/2606.00295)
Jama Hussein Mohamud$^{1,2}$, Mohsin Hasan$^{1,2}$, Mirco Ravanelli$^{2,3}$, Yoshua Bengio$^{1,2,4}$

$^{1}$Université de Montréal, $^{2}$Mila, $^{3}$Concordia University, $^{4}$LawZero

###### Abstract

Masked diffusion models have seen great success in capturing data distributions over discrete sequences in domains such as text and proteins. These models generate data by iteratively unmasking tokens starting from a fully masked sequence, with the unmasking order typically chosen at random or using a heuristic based on denoiser probabilities. In this work, we propose a scheme for learning the unmasking order using an additional lightweight policy network on top of a diffusion model. Our proposed loss reweights terms in the masked diffusion loss according to policy probabilities, and results in a policy that prefers positions where the denoiser is more likely to be correct. We study this loss in two settings: (i) training solely the policy while using a frozen pre-trained denoiser, and (ii) training the policy and denoiser jointly with the weighted loss to allow for mutual adaptation. We demonstrate that our approach outperforms common heuristics on problems that are sensitive to token ordering, such as combinatorial tasks and proteins.

## 1 Introduction

Diffusion models have established themselves as a powerful paradigm for generative modeling, achieving remarkable success in continuous domains such as images (Ho et al., [2020](https://arxiv.org/html/2606.00295#bib.bib74); Saharia et al., [2022](https://arxiv.org/html/2606.00295#bib.bib31); Rombach et al., [2022](https://arxiv.org/html/2606.00295#bib.bib104)) and molecular structures (Watson et al., [2023](https://arxiv.org/html/2606.00295#bib.bib106); Abramson et al., [2024](https://arxiv.org/html/2606.00295#bib.bib105)). More recently, *discrete* diffusion models – which operate directly on token sequences by iteratively masking and unmasking – have shown strong results in language modeling (Sahoo et al., [2024](https://arxiv.org/html/2606.00295#bib.bib84); Nie et al., [2025](https://arxiv.org/html/2606.00295#bib.bib77); Shi et al., [2024](https://arxiv.org/html/2606.00295#bib.bib85)), protein design (Alamdari et al., [2023](https://arxiv.org/html/2606.00295#bib.bib110); Wang et al., [2024](https://arxiv.org/html/2606.00295#bib.bib27)), and drug discovery (Lee et al., [2025](https://arxiv.org/html/2606.00295#bib.bib111)). (Huang et al., [2022](https://arxiv.org/html/2606.00295#bib.bib1); Chen and Lipman, [2024](https://arxiv.org/html/2606.00295#bib.bib2); Austin et al., [2021](https://arxiv.org/html/2606.00295#bib.bib3); Gat et al., [2024](https://arxiv.org/html/2606.00295#bib.bib39))

A key design choice in masked diffusion models (MDMs) is the *order* in which tokens are unmasked during generation. The standard approach selects positions uniformly at random. However, practitioners have found that heuristic ordering strategies – such as unmasking the most confident position first (Nie et al., [2025](https://arxiv.org/html/2606.00295#bib.bib77)) or the position with the largest probability margin (Kim et al., [2025](https://arxiv.org/html/2606.00295#bib.bib126)) – can dramatically improve sample quality on downstream tasks. This effect is particularly pronounced on constraint satisfaction problems such as Sudoku and Boolean satisfiability (3-SAT), where the unmasking order directly impacts whether the model can propagate constraints correctly.

Despite the empirical success of heuristic orderings, these remain hand-designed and may be suboptimal for a given model and dataset. A natural question arises: *can we learn the unmasking order?* That is, rather than relying on fixed heuristics, can we train a lightweight auxiliary network to predict which positions to unmask, conditioned on the current partially masked sequence?

In this work, we propose a simple approach for learning adaptive unmasking orderings in MDMs. Our approach introduces a policy network $q^{\phi}(i \mid x_t)$ and a modified cross-entropy objective that can be used either to train a lightweight policy layer on top of a pretrained masked diffusion model or to jointly train the policy and denoiser. The objective weights policy probabilities by the cross entropy of the denoiser at each token position, encouraging the policy to select positions that are most informative for generation. Across both the policy-only and joint training settings, we show improvements over existing heuristic orderings on Sudoku, 3-SAT, and protein generation with DPLM. In the policy-only setting, these gains come with very few additional parameters ($<1\%$ of the MDM's total parameters) and require only a few hundred training iterations, compared to the hundreds of thousands typically required for MDM training. In the joint training setting, we additionally introduce a policy-aware denoising objective, and show that it further improves performance on combinatorial tasks while also improving predicted foldability in protein generation and maintaining diversity close to heuristic orderings.

## 2 Method

### 2.1 Masked Diffusion Models

Throughout this work, we denote a sequence of length $L$ as $x = (x^{1}, \dots, x^{L}) \in \mathcal{V}^{L}$, with tokens taking values in some vocabulary set, $x^{i} \in \mathcal{V}$. We consider the case of masked diffusion, where a special masking token $m$ is included in the vocabulary set. Other notation includes: the Kronecker symbol $\delta(i,j)$ (equal to 1 for $i=j$ and 0 otherwise), $\mathrm{Cat}(x;p)$ to denote the categorical distribution with probabilities $p$, and $\Delta^{k}$ to denote the probability simplex over $k$ dimensions.

Masked diffusion models (MDMs) use a noising process to map the data distribution $p_{\mathrm{data}}(x)$ at time $0$ to the delta distribution at the fully masked state $M = (m, \dots, m)$, $p_{1}(x) = \delta(x, M)$, at time $1$. A typical noising process consists of converting a data token $x^{i}_{0}$ into the masked token with some probability $1-\alpha_{t}$, independently over dimensions (Sahoo et al., [2024](https://arxiv.org/html/2606.00295#bib.bib84)): $p(x_{t} \mid x_{0}) = \prod^{L}_{i=1} \left[\alpha_{t}\,\delta(x^{i}_{t}, x^{i}_{0}) + (1-\alpha_{t})\,\delta(x^{i}_{t}, m)\right]$. The parameter $\alpha_{t}$ denotes a decreasing noise schedule, with $\alpha_{0}=1$ and $\alpha_{1}=0$. A typical choice is the linear schedule $\alpha_{t} = 1-t$.
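As a minimal sketch of this noising process, assuming the linear schedule $\alpha_t = 1 - t$ and a placeholder mask symbol (the names `MASK`, `alpha_linear`, and `mask_sequence` are illustrative and do not come from the paper), each token can be masked independently as follows:

```python
import random

MASK = "<m>"  # hypothetical stand-in for the mask token m


def alpha_linear(t: float) -> float:
    """Linear schedule alpha_t = 1 - t, so alpha_0 = 1 and alpha_1 = 0."""
    return 1.0 - t


def mask_sequence(x0, t, rng=random):
    """Sample x_t ~ p(x_t | x_0): keep each token with prob. alpha_t, else mask it."""
    keep_prob = alpha_linear(t)
    return [tok if rng.random() < keep_prob else MASK for tok in x0]


x0 = list("534678912")
print(mask_sequence(x0, 0.0))  # t = 0: every token is kept
print(mask_sequence(x0, 0.5))  # t = 0.5: roughly half the tokens are masked
print(mask_sequence(x0, 1.0))  # t = 1: every token is masked
```

Because positions are masked independently, the number of masked tokens at time $t$ is binomial with mean $(1-\alpha_t)L$.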

For reversing this process, a neural network parameterizes a distribution over clean data $x_{0}$ conditioned on the partially masked sequence $x_{t}$. In particular, the network outputs an independent distribution over each token position $i$, as $\mu^{\theta}(x_{t})[i,\cdot] \in \Delta^{|\mathcal{V}|}$, which satisfies $\mu^{\theta}(x_{t})[i,m] = 0$ (the clean data cannot contain masks) and $\mu^{\theta}(x_{t})[i, x^{i}_{t}] = 1$ if $x^{i}_{t} \neq m$ (the clean data approximation retains unmasked positions in $x_{t}$). The function $\mu^{\theta}$ is referred to as the denoiser.
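One way to enforce the two constraints on $\mu^{\theta}$ is to post-process raw network logits. The sketch below is an assumption about how this could be done, not the authors' implementation: tokens are integer ids, `mask_id` is the id of $m$, and `constrained_probs` is a hypothetical helper.

```python
import math


def constrained_probs(logits, xt, mask_id):
    """Turn per-position logits into distributions satisfying the denoiser constraints.

    - Masked positions: softmax over all tokens except the mask (mu[i, m] = 0).
    - Unmasked positions: one-hot on the token already present in x_t.
    """
    out = []
    for row, tok in zip(logits, xt):
        if tok != mask_id:
            out.append([1.0 if j == tok else 0.0 for j in range(len(row))])
            continue
        # Subtract the max non-mask logit for numerical stability.
        mx = max(v for j, v in enumerate(row) if j != mask_id)
        exps = [0.0 if j == mask_id else math.exp(v - mx) for j, v in enumerate(row)]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out


# Vocabulary {0, 1, 2} with 2 acting as the mask id (illustrative values).
probs = constrained_probs([[0.0, 1.0, 5.0], [2.0, 0.0, 0.0]], xt=[2, 0], mask_id=2)
print(probs)
```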

Given a denoiser, the reverse transition over two nearby time-steps $s < t$ is (Sahoo et al., [2024](https://arxiv.org/html/2606.00295#bib.bib84)):

$$
p^{\theta}(x^{i}_{s} \mid x_{t}) =
\begin{cases}
\mathrm{Cat}\left(x^{i}_{s};\ \frac{1-\alpha_{s}}{1-\alpha_{t}}\,\delta(\cdot, m) + \frac{\alpha_{s}-\alpha_{t}}{1-\alpha_{t}}\,\mu^{\theta}(x_{t})[i,\cdot]\right) & \text{if } x^{i}_{t} = m \\
\mathrm{Cat}\left(x^{i}_{s};\ \delta(\cdot, x^{i}_{t})\right) & \text{if } x^{i}_{t} \neq m
\end{cases}
\tag{1}
$$
Let $\mathrm{CE}(i,p)$ denote the cross entropy loss with sample $i$ and probability $p$: $\mathrm{CE}(i,p) = -\log p[i]$. The goal is to train $\mu^{\theta}(x_{t})$ to match the factored posterior $\prod_{i=1}^{L} p(x_{0}^{i} \mid x_{t})$. For such a denoiser, the transitions in [Equation 1](https://arxiv.org/html/2606.00295#S2.E1) correctly reverse the noising process and recover the data distribution at time $0$ (Sahoo et al., [2024](https://arxiv.org/html/2606.00295#bib.bib84); Shi et al., [2024](https://arxiv.org/html/2606.00295#bib.bib85)). The training objective for $\mu^{\theta}$ is the weighted cross entropy loss:

$$
\mathcal{L}_{\mathrm{MDM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ x_{t} \sim p(x_{t} \mid x_{0}),\ x_{0} \sim p_{\mathrm{data}}}\left[\frac{-\alpha'_{t}}{1-\alpha_{t}} \sum^{L}_{i=1} \delta(x^{i}_{t}, m)\,\mathrm{CE}\left(x^{i}_{0}, \mu^{\theta}(x_{t})[i,\cdot]\right)\right]
\tag{2}
$$

With a trained denoiser, the generation process consists of starting with a completely masked sequence $x_{1} = M$ and iteratively unmasking tokens in the sequence over $T$ steps to obtain a final sample $x_{0}$. One step of this sampling procedure is done by simulating [Equation 1](https://arxiv.org/html/2606.00295#S2.E1), which involves:

1. Sampling an approximation of clean data from the denoiser, $\hat{x}_{0} \sim \mu^{\theta}(x_{t})$.
2. Randomly choosing which (currently masked) positions $I$ in $x_{t}$ to unmask (by replacing $x^{i}_{t} = m$ with $\hat{x}^{i}_{0}$ for all $i \in I$).
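The two steps above can be sketched for the random-order case of Equation 1, where each masked position is unmasked independently with probability $(\alpha_s - \alpha_t)/(1 - \alpha_t)$. This is an illustrative sketch: `denoiser` is assumed to be any callable returning one token-to-probability dictionary per position, and `toy_denoiser` is a made-up stand-in for a trained network.

```python
import random

MASK = None  # hypothetical stand-in for the mask token m


def reverse_step(xt, denoiser, alpha_s, alpha_t, rng=random):
    """One step x_t -> x_s following Equation 1 with random position selection."""
    probs = denoiser(xt)  # probs[i] maps token -> probability at position i
    p_unmask = (alpha_s - alpha_t) / (1.0 - alpha_t)
    xs = list(xt)
    for i, tok in enumerate(xt):
        # Unmasked positions are copied; masked ones unmask with prob. p_unmask.
        if tok is MASK and rng.random() < p_unmask:
            tokens, weights = zip(*probs[i].items())
            xs[i] = rng.choices(tokens, weights=weights, k=1)[0]
    return xs


def toy_denoiser(xt):
    """Toy denoiser: the same two-token distribution at every position."""
    return [{"a": 0.5, "b": 0.5} for _ in xt]


# Going from t = 1 (alpha = 0) straight to s = 0 (alpha = 1) unmasks everything.
print(reverse_step([MASK] * 4, toy_denoiser, alpha_s=1.0, alpha_t=0.0))
```

The heuristic and learned orderings discussed next replace the random choice inside the loop with a score-based selection.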

Properly simulating [Equation 1](https://arxiv.org/html/2606.00295#S2.E1) requires random selection of the set of unmasking positions $I$ (Sahoo et al., [2024](https://arxiv.org/html/2606.00295#bib.bib84)). However, a number of works have found success in selecting $I$ through some heuristic informed by the denoiser logits $\mu^{\theta}(x_{t})$ (Nie et al., [2025](https://arxiv.org/html/2606.00295#bib.bib77); Kim et al., [2025](https://arxiv.org/html/2606.00295#bib.bib126); Ben-Hamu et al., [2025](https://arxiv.org/html/2606.00295#bib.bib127)). These involve calculating a score $s_{i}$ and then prioritizing unmasking positions $i$ with higher score (possibly with added noise).

Some options for the score calculation which have been investigated in previous work include (i) Top probability: the probability of the sampled clean tokens, $s_{i} = \mu^{\theta}(x_{t})[i, \hat{x}^{i}_{0}]$ (Nie et al., [2025](https://arxiv.org/html/2606.00295#bib.bib77)); (ii) Top probability margin: the gap between the highest probability and the second highest probability, $s_{i} = \mu^{\theta}(x_{t})[i, j_{1}] - \mu^{\theta}(x_{t})[i, j_{2}]$, where $j_{1} = \arg\max_{j} \mu^{\theta}(x_{t})[i,j]$ and $j_{2} = \arg\max_{j \neq j_{1}} \mu^{\theta}(x_{t})[i,j]$ (Kim et al., [2025](https://arxiv.org/html/2606.00295#bib.bib126)); and (iii) Entropy: the negative entropy of the position, $s_{i} = -H(\mu^{\theta}(x_{t})[i,\cdot])$ (Ben-Hamu et al., [2025](https://arxiv.org/html/2606.00295#bib.bib127)).
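The three scores can be written down directly from their definitions. In this sketch `p` is assumed to be the denoiser distribution at a single masked position, given as a plain list of probabilities; the function names are illustrative.

```python
import math


def top_prob_score(p, sampled_token):
    """(i) Probability the denoiser assigns to the sampled clean token."""
    return p[sampled_token]


def margin_score(p):
    """(ii) Gap between the largest and second-largest probabilities."""
    top1, top2 = sorted(p, reverse=True)[:2]
    return top1 - top2


def neg_entropy_score(p):
    """(iii) Negative entropy, -H(p) = sum_j p_j log p_j (0 log 0 treated as 0)."""
    return sum(pj * math.log(pj) for pj in p if pj > 0.0)


p = [0.7, 0.2, 0.1]
print(top_prob_score(p, sampled_token=0))
print(margin_score(p))
print(neg_entropy_score(p))
```

All three scores are larger when the distribution is more peaked, so each one favours positions where the denoiser is confident.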

These heuristics have been shown to yield better performance on a number of downstream tasks (such as coding and math), or on tasks such as Sudoku, where generating a valid solution is highly dependent on the order of unmasking (Nie et al., [2025](https://arxiv.org/html/2606.00295#bib.bib77); Kim et al., [2025](https://arxiv.org/html/2606.00295#bib.bib126)).

The problem of selecting the unmasking ordering also informs the efficiency of performing inference, since being able to unmask multiple tokens leads to fewer calls to the denoising model to generate a complete sequence. Intuitively, we expect certain token positions to be independent of others, and we would expect unmasking them in parallel to retain the same performance as unmasking one at a time.

### 2.2 Learnable Adaptive Order Policies

Given the importance of the unmasking ordering, we ask the following question: armed with a dataset, and a pre-trained denoiser $\mu^{\theta}$, can we train a lightweight *policy network* $q^{\phi}(i \mid x_{t})$ which outputs a distribution over the next token position $i$ to unmask for $x_{t}$? The hope is to add a small number of additional trainable parameters, and spend some additional training iterations to obtain better performance on challenging tasks.

MDMs are sensitive to the token ordering precisely because of imperfections in the denoiser model. As argued by Ben-Hamu et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib127)), a perfect denoiser is able to sample from the target regardless of the unmasking order (since each order corresponds to a different factorization of the joint distribution, according to the chain rule of probability). Therefore, a reasonable objective for the policy is one that accounts for the loss of the denoiser model. Based on this, we propose to train the policy to place higher probability on positions which obtain a smaller loss, as measured by the cross entropy. This is captured in the objective:

$$
\mathcal{L}_{\mathrm{ORDER}}(\phi) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ x_{t} \sim p(x_{t} \mid x_{0}),\ x_{0} \sim p_{\mathrm{data}}}\left[\frac{-\alpha'_{t}}{1-\alpha_{t}} \sum^{L}_{i=1} q^{\phi}(i \mid x_{t})\,\mathrm{CE}\left(x^{i}_{0}, \mu^{\theta}(x_{t})[i,\cdot]\right)\right]
\tag{3}
$$

Note that we assume $q^{\phi}(i \mid x_{t}) = 0$ for unmasked positions $x^{i}_{t} \neq m$. This is identical to the MDM objective [Equation 2](https://arxiv.org/html/2606.00295#S2.E2) except for the fact that the sum over masked positions is weighted by the policy $q^{\phi}$. In particular, for a uniform policy over masked tokens, each masked position receives weight $1/|\{i : x^{i}_{t} = m\}|$, so we recover the vanilla MDM loss up to this per-sample multiplicative factor.
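A small numerical check of the uniform-policy remark, written as a single-sample sketch. It assumes the linear schedule $\alpha_t = 1 - t$, for which the time weight $-\alpha'_t/(1-\alpha_t)$ equals $1/t$; the per-position cross entropies in `ce` are made-up numbers and the function names are illustrative.

```python
def mdm_loss(ce, masked, t):
    """Single-sample integrand of Equation 2 under alpha_t = 1 - t."""
    weight = 1.0 / t  # -alpha'_t / (1 - alpha_t) for the linear schedule
    return weight * sum(c for c, is_masked in zip(ce, masked) if is_masked)


def order_loss(ce, q, t):
    """Single-sample integrand of Equation 3; q must be 0 on unmasked positions."""
    weight = 1.0 / t
    return weight * sum(qi * c for qi, c in zip(q, ce))


ce = [0.1, 2.0, 0.0, 0.7]            # hypothetical per-position cross entropies
masked = [True, True, False, True]   # position 2 is already unmasked
n_masked = sum(masked)

q_uniform = [1.0 / n_masked if is_masked else 0.0 for is_masked in masked]
q_peaked = [1.0, 0.0, 0.0, 0.0]      # all mass on the lowest-loss masked position

print(mdm_loss(ce, masked, t=0.5))
print(order_loss(ce, q_uniform, t=0.5) * n_masked)  # matches the MDM value
print(order_loss(ce, q_peaked, t=0.5))              # smaller than the uniform case
```

The last line illustrates why minimizing the objective over $\phi$ pushes the policy toward positions where the denoiser's cross entropy is small.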

We can further justify this choice of objective by relating it to an ELBO bound for the policy-guided unmasking process. Using an ELBO decomposition for such a process, derived by Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12)), we can show that our loss in [Equation 3](https://arxiv.org/html/2606.00295#S2.E3) results from making approximations to avoid computing intractable terms. We present the details of this derivation in [Appendix C](https://arxiv.org/html/2606.00295#A3).

For a frozen denoiser network $\theta$, the policy network $q^{\phi}$ is trained to predict where the current denoiser is most likely correct. The optimal policy places all probability on the position with smallest cross entropy loss. We will refer to this theoretically optimal policy as the oracle: $q^{\mathrm{oracle}}(i \mid x_{t}) = \delta\left(i, \arg\min_{j : x^{j}_{t} = m} \mathrm{CE}(x^{j}_{0}, \mu^{\theta}(x_{t})[j,\cdot])\right)$. We empirically validate the choice of this loss by evaluating the oracle policy (with access to ground truth data $x_{0}$), and confirming that it improves metrics relative to other heuristic samplers in [Section 3](https://arxiv.org/html/2606.00295#S3). Other approaches for training unmasking policies typically rely on more expensive gradient estimation, or RL procedures for tasks involving a reward. These are discussed in [Section 4](https://arxiv.org/html/2606.00295#S4).
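The oracle reduces to an argmin restricted to masked positions. A minimal sketch, assuming the per-position cross entropies against the ground truth $x_0$ have already been computed (`oracle_position` is a hypothetical helper name):

```python
def oracle_position(ce, masked):
    """Return the masked position with the smallest cross entropy.

    Raises ValueError if no position is masked, since the oracle is then undefined.
    """
    candidates = [i for i, is_masked in enumerate(masked) if is_masked]
    return min(candidates, key=lambda i: ce[i])


ce = [0.9, 0.05, 0.0, 1.4]           # hypothetical cross entropies vs. ground truth
masked = [True, True, False, True]   # position 2 has zero loss but is not masked
print(oracle_position(ce, masked))   # 1
```

Note that this requires the ground-truth sequence, so it is only usable as a diagnostic upper bound and not as a sampler for new data.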

Finally, we note that for sampling multiple positions, we can use the policy probabilities as score values $s_{i} = q^{\phi}(i \mid x_{t})$ and use them in a similar way to other heuristics (for instance, unmasking the $k$ positions with the largest policy probabilities).
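Selecting the $k$ highest-scoring masked positions can be sketched as below; the same routine works whether `scores` holds policy probabilities or one of the heuristic scores. The function name and example values are illustrative.

```python
def select_top_k(scores, masked, k):
    """Return the indices of the k masked positions with the largest scores."""
    candidates = [i for i, is_masked in enumerate(masked) if is_masked]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return ranked[:k]


q = [0.1, 0.6, 0.0, 0.3]             # hypothetical policy probabilities
masked = [True, True, False, True]
print(select_top_k(q, masked, k=2))  # [1, 3]
```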

### 2.3 Policy-Aware Denoiser Training

The policy objective in [Equation 3](https://arxiv.org/html/2606.00295#S2.E3) answers where the model should unmask next, but it also raises two related questions: can the policy improve the denoiser itself, and can the denoiser be trained while explicitly knowing that a policy will be used at sampling time? We note that the objective [Equation 3](https://arxiv.org/html/2606.00295#S2.E3) already allows for gradients with respect to the denoiser parameters $\theta$, and admits a sensible interpretation. Namely, viewed from the perspective of training the denoiser, the objective upweights the loss at each position in proportion to the probability that the policy will select it for unmasking.

A slight issue may occur with directly using [Equation 3](https://arxiv.org/html/2606.00295#S2.E3) for training the denoiser, since the policy may quickly collapse to a handful of positions and prevent denoiser training on most masked tokens. We alleviate this with a simple modification: we add the original MDM loss to the policy-weighted objective to ensure better training signal. This is a stability trick used in other works (Peng et al., [2025](https://arxiv.org/html/2606.00295#bib.bib12)).

Putting together these observations, we propose the following denoiser loss as a substitute for the vanilla MDM objective:

$$
\mathcal{L}_{\mathrm{PA\text{-}MDM}}(\theta) = \mathbb{E}\left[\frac{-\alpha'_{t}}{1-\alpha_{t}} \sum_{i=1}^{L} \delta(x_{t}^{i}, m)\left(1 + q^{\phi}(i \mid x_{t})\right)\mathrm{CE}\left(x_{0}^{i}, \mu^{\theta}(x_{t})[i,\cdot]\right)\right].
\tag{4}
$$

In implementation, the policy probabilities are detached inside the denoiser loss, so gradients with respect to $\theta$ do not backpropagate through $q^{\phi}$.
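For a single sample, the integrand of Equation 4 is the MDM term plus the policy-weighted term. The sketch below only evaluates the scalar value under the linear schedule $\alpha_t = 1 - t$ (time weight $1/t$); it has no automatic differentiation, so the detaching of $q^{\phi}$ is represented only by treating `q` as a constant input. Names and numbers are illustrative.

```python
def pa_mdm_loss(ce, q, masked, t):
    """Single-sample integrand of Equation 4 with alpha_t = 1 - t.

    q is treated as a fixed weight vector (the analogue of a detached policy).
    """
    weight = 1.0 / t
    return weight * sum(
        (1.0 + qi) * c for qi, c, is_masked in zip(q, ce, masked) if is_masked
    )


ce = [0.1, 2.0, 0.0, 0.7]            # hypothetical per-position cross entropies
q = [0.7, 0.1, 0.0, 0.2]             # hypothetical policy probabilities
masked = [True, True, False, True]

mdm_term = (1.0 / 0.5) * sum(c for c, is_m in zip(ce, masked) if is_m)
policy_term = (1.0 / 0.5) * sum(qi * c for qi, c in zip(q, ce))
print(pa_mdm_loss(ce, q, masked, t=0.5), mdm_term + policy_term)
```

Since every masked position keeps a weight of at least 1, a policy that collapses onto a few positions cannot remove the training signal from the others.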

The policy itself is still trained with [Equation 3](https://arxiv.org/html/2606.00295#S2.E3), with the denoiser losses detached when optimizing $\phi$. Joint training therefore lets the policy identify consequential positions, while simultaneously teaching the denoiser to allocate more capacity to them. The algorithm for either policy-only training or joint training is summarized in [Algorithm 1](https://arxiv.org/html/2606.00295#algorithm1).

To separate the effect of *learned* policy guidance from the effect of simply reweighting the denoiser loss, we can also compare against heuristic-aware denoiser objectives, similar to those in (Peng et al., [2025](https://arxiv.org/html/2606.00295#bib.bib12)). Concretely, we can replace $q^{\phi}(i \mid x_{t})$ in [Equation 4](https://arxiv.org/html/2606.00295#S2.E4) by normalized weights derived from standard confidence scores computed from the denoiser logits, such as the top log-probability or the top-two margin. This yields a matched control where the denoiser is still trained with emphasis on selected positions, but the emphasis is determined by a fixed heuristic rather than a learned policy.

Such a comparison isolates the main question of interest: whether informing the denoiser about the existence of a learned unmasking policy during training provides benefits beyond the gains obtainable from generic confidence-based reweighting alone.

**Algorithm 1** Training with Policy-Only or Joint Updates

**Input:** dataset $\mathcal{D}$, denoiser $\mu^{\theta}$, policy $q^{\phi}$, noise schedule $\alpha_{t}$

**while** *not converged* **do**

1. Sample $x_{0} \sim \mathcal{D}$, $t \sim \mathcal{U}[0,1]$, and $x_{t} \sim p(x_{t} \mid x_{0})$.
2. Compute denoiser logits $\mu^{\theta}(x_{t})$ and policy probabilities $q^{\phi}(i \mid x_{t})$ over masked positions.
3. Compute token losses $\ell_{i} \leftarrow \delta(x_{t}^{i}, m)\,\mathrm{CE}(x_{0}^{i}, \mu^{\theta}(x_{t})[i,\cdot])$ for $i = 1, \dots, L$.
4. Define $\mathcal{L}_{\mathrm{ORDER}}(\phi) = \frac{-\alpha_{t}'}{1-\alpha_{t}} \sum_{i=1}^{L} q^{\phi}(i \mid x_{t})\,\operatorname{stopgrad}(\ell_{i})$.
5. **if** *joint training is enabled* **then**
    - Define $\mathcal{L}_{\mathrm{PA\text{-}MDM}}(\theta) = \frac{-\alpha_{t}'}{1-\alpha_{t}} \sum_{i=1}^{L} \left(1 + \operatorname{stopgrad}(q^{\phi}(i \mid x_{t}))\right)\ell_{i}$.
    - Define $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{ORDER}}(\phi) + \mathcal{L}_{\mathrm{PA\text{-}MDM}}(\theta)$. (joint training)

   **else**
    - Define $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{ORDER}}(\phi)$. (policy-only training)
6. Update trainable parameters using $\mathcal{L}_{\mathrm{total}}$.

## 3 Experiments

We evaluate our adaptive ordering approach on two constraint satisfaction tasks known to be sensitive to token ordering, Sudoku puzzle solving and 3-SAT (Boolean satisfiability with 3 literals per clause) (Kim et al., [2025](https://arxiv.org/html/2606.00295#bib.bib126); Ye et al., [2024](https://arxiv.org/html/2606.00295#bib.bib125)), and on protein sequence generation with DPLM (Wang et al., [2024](https://arxiv.org/html/2606.00295#bib.bib27)). For Sudoku and 3-SAT we use a 6M-parameter GPT-2 denoiser trained as an MDM with $T=20$ steps; full dataset descriptions and training details, including the DPLM setup, are provided in [Appendix A](https://arxiv.org/html/2606.00295#A1).

##### Policy architecture

We parameterize $q^{\phi}$ as a lightweight per-token MLP that conditions on both the denoiser's confidence scores (max log-probability) and the hidden states from the last layer of the base model. Specifically, each scalar confidence score is projected to the hidden dimension ($d=384$) via a linear layer, summed with the corresponding hidden-state vector, and passed through a two-layer MLP (hidden dimension 128, ReLU activation) that outputs a per-position routing logit. The policy adds $\sim$50K parameters ($<$1% of the base model). In the policy-only setting, it is trained on top of a frozen denoiser using [Equation 3](https://arxiv.org/html/2606.00295#S2.E3), while in the joint-training setting it is optimized together with the denoiser under the objectives described in [Section 2](https://arxiv.org/html/2606.00295#S2). We also experimented with a transformer-based policy variant (Jazbec et al., [2025](https://arxiv.org/html/2606.00295#bib.bib11)), which did not yield further improvement (see [Appendix B](https://arxiv.org/html/2606.00295#A2)).
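The quoted parameter budget can be sanity-checked with a rough count. This is back-of-the-envelope arithmetic under stated assumptions, not the authors' accounting: the two-layer MLP is read as $384 \to 128 \to 1$, every linear layer is assumed to carry a bias, and the base model is taken as exactly 6M parameters.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a dense layer with an optional bias vector."""
    return n_in * n_out + (n_out if bias else 0)


d_model, d_hidden = 384, 128                  # dimensions quoted in the text

proj = linear_params(1, d_model)              # scalar confidence -> hidden dim
mlp_in = linear_params(d_model, d_hidden)     # first MLP layer (followed by ReLU)
mlp_out = linear_params(d_hidden, 1)          # second MLP layer -> routing logit

total = proj + mlp_in + mlp_out
print(total)                                  # 50177
print(f"{100 * total / 6_000_000:.2f}% of a 6M-parameter base model")
```

Under these assumptions the count lands at roughly 50K and below 1% of the base model; dropping some of the biases would shift the total slightly.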

##### Combinatorial baselines and decoding

At inference, we compare against the High conf. (top probability), Margin (top probability margin), and Oracle ordering strategies described in [Section 2](https://arxiv.org/html/2606.00295#S2). Unless otherwise noted, all methods use deterministic top-$k$ decoding with a linear schedule over $T=20$ reverse steps. For our joint training experiments, the heuristic and policy variants use the modified denoiser objective from [Equation 4](https://arxiv.org/html/2606.00295#S2.E4).
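The text does not spell out how many positions are unmasked at each step. One plausible reading of a linear schedule, offered purely as an assumption, is that the cumulative number of unmasked tokens grows linearly with the step index. The sequence length of 81 below is illustrative (a 9×9 grid) and is not taken from the paper.

```python
def unmask_counts(seq_len, num_steps):
    """Tokens to unmask at each step so the cumulative total grows linearly."""
    counts, done = [], 0
    for step in range(1, num_steps + 1):
        target = round(seq_len * step / num_steps)  # cumulative unmasked target
        counts.append(target - done)
        done = target
    return counts


counts = unmask_counts(seq_len=81, num_steps=20)
print(counts)
print(sum(counts))  # every position is unmasked by the final step
```

At each step, the per-step count would play the role of $k$ in top-$k$ selection over the chosen score.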

##### Protein generation with DPLM

To test whether the same ideas transfer beyond combinatorial reasoning, we also evaluate adaptive ordering on DPLM-150M (Wang et al., [2024](https://arxiv.org/html/2606.00295#bib.bib27)), a masked diffusion model for protein sequence generation. We consider both policy-only adaptation and joint policy-denoiser training, and evaluate structure quality, foldability, and diversity using metrics including pLDDT, pTM, pAE, foldability rate, token entropy, and inner-TM; metric definitions are given in [Section A.2.2](https://arxiv.org/html/2606.00295#A1.SS2.SSS2).

### 3.1 Main results

Table 1: Results across policy-only training and joint policy-denoiser training. For Sudoku and 3-SAT, we report deterministic decoding accuracy. For DPLM-150M (Wang et al., [2024](https://arxiv.org/html/2606.00295#bib.bib27)), we report mean pLDDT (and standard deviation) across 3 random seeds after averaging over sequence lengths $\{100, 200, 300, 400, 500\}$. Higher is better; best result within each block and column is shown in bold.

| | Sudoku | 3-SAT | DPLM-150M |
| --- | --- | --- | --- |
| *Policy-only training* | | | |
| High conf. | 89.84% | 75.9% | 82.20 ± 0.76 |
| Margin | 88.67% | 76.0% | 82.02 ± 0.31 |
| Policy | 90.82% | 76.1% | 83.85 ± 0.80 |
| Oracle policy | 100.0% | 82.3% | N/A |
| *Joint training* | | | |
| Baseline | 92.72% | 88.8% | 82.20 ± 0.76 |
| High conf. | 92.68% | 89.8% | 82.85 ± 0.85 |
| Margin | 91.00% | 85.2% | 82.28 ± 0.53 |
| Policy | 92.87% | 90.9% | 84.94 ± 1.00 |

∗ Policy adds <50K params (<1% of the base model).

![Figure 1](https://arxiv.org/html/2606.00295v1/x1.png)

Figure 1: Adaptive ordering on DPLM-150M. Top: policy-only adaptation. Bottom: joint training. Left: mean pLDDT by target sequence length. Right: inner-TM diversity, where lower is better. Representative predicted 3D folds illustrate jointly trained samples across different sequence lengths. The learned policy improves foldability while maintaining diversity close to the heuristic baselines.

##### Policy-only training

The upper block of [Table 1](https://arxiv.org/html/2606.00295#S3.T1) summarizes policy-only adaptation with a frozen denoiser using [Equation 3](https://arxiv.org/html/2606.00295#S2.E3). Across Sudoku, 3-SAT, and protein generation with DPLM, the learned policy outperforms the heuristic baselines. On the combinatorial tasks, it achieves the best non-oracle performance (90.82% on Sudoku and 76.1% on 3-SAT) while still leaving a visible gap to the oracle (100.0% and 82.3%). On DPLM-150M, it improves mean pLDDT to 83.85, compared with 82.20 and 82.02 for the two heuristic baselines, while maintaining diversity close to the heuristic orderings, as also shown in the top row of [Figure 1](https://arxiv.org/html/2606.00295#S3.F1). This establishes that even without modifying the base denoiser, learning the unmasking order alone yields consistent gains across domains.

##### Joint training

The lower block of [Table 1](https://arxiv.org/html/2606.00295#S3.T1) evaluates the policy-aware denoiser objective from [Equation 4](https://arxiv.org/html/2606.00295#S2.E4). Here, *High conf.* and *Margin* replace the learned policy weights with normalized heuristic weights derived from max log-probability and top-two margin, respectively. The main pattern is that learned policy reweighting improves the denoiser more reliably than matched heuristic reweighting. On Sudoku and 3-SAT, policy-aware scaling gives the strongest deterministic results among the compared training objectives. The same comparison also extends to DPLM, where the learned policy-weighted objective improves protein generation relative to the heuristic-weighted controls, as also illustrated in the bottom row of [Figure 1](https://arxiv.org/html/2606.00295#S3.F1). Taken together, these results show that the policy is useful not only as an inference-time ordering policy but also as a training signal for the denoiser.

##### Protein sequence generation

We evaluate our joint policy-denoiser model by generating 100 sequences at lengths 200, 300, …, 800, which are then folded into 3D structures using ESMFold (Lin et al., [2022](https://arxiv.org/html/2606.00295#bib.bib128)). Results are shown in [Table 2](https://arxiv.org/html/2606.00295#S3.T2), which reports structure quality, foldability, and diversity metrics; the precise metric definitions are collected in [Section A.2.2](https://arxiv.org/html/2606.00295#A1.SS2.SSS2). We note that both our results and those of Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12)) reported in the table use the same number of parameters (150M).

Table 2: Protein sequence generation results including our joint policy, with baseline results taken from Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12)). Each model generates 100 sequences at lengths 200, 300, …, 800, which are folded into 3D structures using ESMFold. Structure quality is measured by pLDDT, pTM, and pAE, while diversity is measured by token entropy and sequence uniqueness. Foldability is the percent of sequences with pLDDT $> 80$, pTM $> 0.7$, and pAE $< 10$. Best values in each column are shown in bold.

| Model | pLDDT ↑ | pTM ↑ | pAE ↓ | Foldability (%) ↑ | Entropy ↑ | Diversity (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| *Large* | | | | | | |
| ESM3 | 34.13 | 0.23 | 24.65 | 1.50 | 3.99 | 93.44 |
| ProGen2-medium | 57.94 | 0.38 | 20.81 | 12.75 | 2.91 | 91.45 |
| ProGen2-large | 55.07 | 0.35 | 22.00 | 11.87 | 2.73 | 91.48 |
| DPLM-650M | 79.53 | 0.66 | 11.85 | 49.14 | 3.18 | 92.22 |
| *150M-scale* | | | | | | |
| EvoDiff | 31.84 | 0.21 | 24.76 | 0.43 | 4.05 | 93.19 |
| ProGen2-small | 49.38 | 0.28 | 23.38 | 4.48 | 2.55 | 89.31 |
| DPLM-150M | 80.23 | 0.65 | 12.07 | 48.14 | 3.14 | 92.80 |
| DLM-150M | 81.32 | 0.65 | 12.00 | 42.43 | 3.21 | 92.45 |
| DLM-150M + PAPL | 81.48 | 0.72 | 8.97 | 59.40 | 3.12 | 91.73 |
| Joint policy (ours) | 86.43 | 0.76 | 9.68 | 54.14 | 4.12 | 93.06 |
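The foldability criterion in the caption is a simple conjunction of three thresholds. A minimal sketch, assuming each generated sequence has already been folded and summarized as a dictionary with hypothetical keys `plddt`, `ptm`, and `pae` (the sample values are made up):

```python
def foldability_rate(samples):
    """Percent of samples with pLDDT > 80, pTM > 0.7, and pAE < 10."""
    foldable = [
        s for s in samples
        if s["plddt"] > 80 and s["ptm"] > 0.7 and s["pae"] < 10
    ]
    return 100.0 * len(foldable) / len(samples)


samples = [
    {"plddt": 88.0, "ptm": 0.81, "pae": 6.2},   # passes all three thresholds
    {"plddt": 84.0, "ptm": 0.65, "pae": 8.0},   # fails on pTM
    {"plddt": 79.9, "ptm": 0.75, "pae": 9.0},   # fails on pLDDT
    {"plddt": 91.0, "ptm": 0.90, "pae": 4.1},   # passes all three thresholds
]
print(foldability_rate(samples))  # 50.0
```

Because all three conditions must hold at once, a model can have a higher mean pLDDT than another and still a lower foldability rate, as the last two rows of Table 2 show.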

An important property of policy-aware training is that its benefits are not limited to decoding with the learned policy itself. As shown in [Figure 2](https://arxiv.org/html/2606.00295#S3.F2), the denoiser trained with policy-aware scaling also improves heuristic decoding, most clearly on 3-SAT and in the stochastic Sudoku setting, while remaining competitive in the deterministic Sudoku setting. This suggests that the policy-weighted objective does not merely tailor the denoiser to one decoding rule, but improves the underlying denoiser in a way that transfers across different unmasking heuristics.

### 3.2 Ablations

![Figure 2](https://arxiv.org/html/2606.00295v1/x2.png)

Figure 2: Heuristic transfer under policy-aware training. We compare heuristic-specific training objectives against our policy-weighted objective when decoding with the same heuristic. Policy-aware training improves both high-confidence and margin decoding across tasks, especially in stochastic decoding regimes.

Table 3: Deterministic versus stochastic decoding ablation on combinatorial tasks. Stochastic decoding adds Gumbel noise with scale 0.5. Higher is better. Best result per column within each section is shown in bold. The plot on the right visualizes the gap between deterministic and stochastic decoding.

| | Sudoku (Det.) | Sudoku (Stoch.) | 3-SAT (Det.) | 3-SAT (Stoch.) |
| --- | --- | --- | --- | --- |
| *Policy-only ordering* | | | | |
| High conf. | 89.84% | 18.26% | 75.9% | 72.8% |
| Margin | 88.67% | 88.38% | 76.0% | 75.6% |
| Policy | 90.82% | 90.53% | 76.1% | 75.9% |
| *Joint-training objective* | | | | |
| Baseline | 92.72% | 18.35% | 88.8% | 87.8% |
| High conf. | 92.68% | 19.67% | 89.8% | 88.5% |
| Margin | 91.00% | 90.38% | 85.2% | 84.7% |
| Policy | 92.87% | 93.36% | 90.9% | 90.9% |

![[Uncaptioned image]](https://arxiv.org/html/2606.00295v1/x3.png)

##### Deterministic versus stochastic decoding

We study the effect of decoding noise in [Table 3](https://arxiv.org/html/2606.00295#S3.T3). One notable observation is the discrepancy between our top-probability result on Sudoku under deterministic decoding (89.84%, listed as "High conf." in Table 3) and the 18.51% reported by Kim et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib126)). We find that this gap is primarily an artifact of the *decoding strategy*, not an inherent limitation of the heuristic. When we switch from deterministic to stochastic decoding (adding Gumbel noise with scale 0.5), the top-probability heuristic on Sudoku drops to 18.26%, closely matching the 18.51% of Kim et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib126)). The margin heuristic, by contrast, remains stable under stochastic decoding (88.67% versus 88.38%). This indicates that the large reported advantage of the margin heuristic over top-probability is largely attributable to the latter's sensitivity to stochastic perturbations in decoding, rather than to a fundamentally superior ordering strategy. Under deterministic decoding on Sudoku, the two heuristics perform comparably, with top-probability slightly ahead. The pattern is less extreme on 3-SAT, where top-probability degrades modestly under stochastic decoding (75.9% to 72.8%) while the margin heuristic remains stable. Notably, our learned policy is robust across both decoding regimes and outperforms both heuristics in every setting. The lower block of [Table 3](https://arxiv.org/html/2606.00295#S3.T3), together with [Figure 2](https://arxiv.org/html/2606.00295#S3.F2), shows that the same robustness pattern appears in the joint-training setting: policy-aware scaling avoids the sharp stochastic drop seen on Sudoku for the baseline and high-confidence variants, and it outperforms all alternative training objectives in both decoding regimes.
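The two decoding regimes compared above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes that the Gumbel noise is added to each masked position's heuristic score before taking the argmax (the text does not say where the noise enters), and the function names and toy probabilities are hypothetical.

```python
import math
import random


def gumbel(rng):
    """Draw one standard Gumbel sample via the inverse-CDF trick."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def top_probability_score(probs):
    """High-confidence heuristic: the largest denoiser probability."""
    return max(probs)


def margin_score(probs):
    """Margin heuristic: gap between the two largest probabilities."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def pick_position(denoiser_probs, masked, score_fn, noise_scale=0.0, rng=None):
    """Choose the next masked position to reveal.

    noise_scale=0.0 is deterministic decoding; noise_scale=0.5 mimics the
    stochastic setting (assumption: noise perturbs the heuristic score).
    """
    rng = rng or random.Random(0)
    best_pos, best_val = None, -math.inf
    for pos in masked:
        val = score_fn(denoiser_probs[pos])
        if noise_scale > 0.0:
            val += noise_scale * gumbel(rng)
        if val > best_val:
            best_pos, best_val = pos, val
    return best_pos


# Toy denoiser output: position 0 is a tie between two tokens,
# position 1 is less peaked but has a clearer winner.
toy_probs = {0: [0.5, 0.5, 0.0], 1: [0.45, 0.3, 0.25]}
print(pick_position(toy_probs, [0, 1], top_probability_score))
print(pick_position(toy_probs, [0, 1], margin_score))
```

On this toy input the two heuristics disagree under deterministic decoding: top-probability reveals position 0, while margin reveals position 1. Passing `noise_scale=0.5` makes the choice depend on the random draw.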

##### Efficiency as a function of diffusion steps

We also study how ordering quality interacts with the number of reverse steps $T$, which directly controls inference cost. As shown in [Figure 3](https://arxiv.org/html/2606.00295#S3.F3), learned ordering improves efficiency by achieving higher accuracy for the same step budget. On Sudoku, the learned policy substantially outperforms both heuristic baselines and nearly matches the oracle at $T=100$. On 3-SAT, the learned policy consistently improves over the high-confidence heuristic and is competitive with, or slightly better than, the margin heuristic across moderate and large step budgets. In both tasks, the oracle remains better than the learned policy, indicating that there is still substantial room to improve learned unmasking strategies. Overall, these results reinforce that ordering is not only an accuracy issue, but also an efficiency issue: stronger policies can achieve better performance with fewer denoising steps.
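To make the link between the step budget and the number of tokens committed per step concrete, the sketch below splits a set of masked positions evenly across $T$ reverse steps. The even (linear) split and the function name are assumptions made for illustration; the schedule actually used in the experiments may differ, and 81 is simply the full Sudoku grid length rather than the number of blank cells in any particular puzzle.

```python
def linear_unmask_schedule(num_masked, num_steps):
    """Number of positions to reveal at each of num_steps reverse steps.

    Hypothetical linear schedule: after step s, floor(num_masked * s / num_steps)
    positions have been revealed in total.
    """
    counts = []
    revealed = 0
    for step in range(1, num_steps + 1):
        target = (num_masked * step) // num_steps
        counts.append(target - revealed)
        revealed = target
    return counts


for budget in (20, 100):
    schedule = linear_unmask_schedule(81, budget)
    print(budget, max(schedule), sum(schedule))
```

With a small budget several positions must be committed in the same step, so the quality of the ordering decides which tokens are fixed together; with a budget at least as large as the number of masked positions, at most one position is revealed per step.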

![Refer to caption](https://arxiv.org/html/2606.00295v1/x4.png)

Figure 3: Accuracy as a function of the number of reverse diffusion steps $T$ on Sudoku (left) and 3-SAT (right). Better ordering is especially valuable at small step budgets, where improved unmasking policies can recover substantially more accuracy for the same inference cost.

### 3.3 Discussion

Several observations emerge from these results. First, the learned policy consistently improves over heuristic orderings on both constraint-satisfaction tasks, using a lightweight auxiliary network with less than 1% parameter overhead, demonstrating that the unmasking order can be improved by learning. Second, the remaining gap to the oracle suggests that significantly better policies are still achievable, motivating future work on more expressive policy architectures and training procedures. Third, our analysis of decoding strategies highlights the importance of carefully controlling for inference-time design choices when comparing ordering heuristics, a point that has been underappreciated in prior work. Fourth, the policy-aware denoiser training ablation indicates that the policy is useful not only at inference time, but also as a training signal for the denoiser itself; the learned reweighting consistently outperforms matched heuristic-based controls. Finally, the DPLM results show that the same objective extends beyond logical reasoning tasks to protein generation in both the policy-only and joint-training settings. In all cases, the policy converged within a few hundred iterations, a small fraction of the base model's training budget.

## 4 Related Works

Wang et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib10)) also train a policy for the unmasking order, by treating the order as a latent variable $z$. They propose a variational method for optimizing it, which requires parameterizing the posterior approximation $q^{\phi}(z \mid x)$, in addition to the trainable policy over orders $p^{\theta}(z \mid x)$. The former is only used as part of the training objective for the latter, and not during inference. In addition, the optimization of the variational posterior requires gradient estimation techniques to reduce variance, and complicates the optimization loop. Our loss, by contrast, is much simpler.

Another set of works assumes access to verifiable reward functions rather than a dataset (Hong et al., [2025](https://arxiv.org/html/2606.00295#bib.bib9); Jazbec et al., [2025](https://arxiv.org/html/2606.00295#bib.bib11)). These frame the generation process of a masked diffusion model as a Markov Decision Process, and optimize the unmasking policy using RL objectives. Our work focuses on the setting where data is available, since the aim is to extend to modalities where an explicit reward function is not readily available, such as protein sequence generation.

Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12)) propose a modification of the MDM loss which accounts for using heuristics when determining the unmasking order (rather than random unmasking). The loss resembles our objective in [Equation 4](https://arxiv.org/html/2606.00295#S2.E4), and we outline the connections between our objective and their ELBO in [Appendix C](https://arxiv.org/html/2606.00295#A3). The main difference is that our objective makes the simplifications necessary to enable joint training of the denoiser and policy (as opposed to only training the denoiser, as done in their work).

Recent works also learn auxiliary token-level scoring modules that can be used to remask tokens during inference, to improve generation quality. Huang et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib13)) train a confidence head for self-reflective remasking in diffusion language models: for unmasked tokens in $x_t$, the head predicts whether they are correct, while for masked positions it is supervised by the denoiser's probability of recovering the ground-truth token. Meshchaninov et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib14)) similarly fine-tune a lightweight error-prediction layer on top of a pretrained model, to predict whether tokens in a denoiser-produced $\hat{x}_0$ are incorrect. These works are related to our policy-only setting in that they train an additional scoring layer on top of a pretrained denoiser. However, while their framework is concerned with using scores to remask tokens, our method more directly aims to learn an unmasking order. Our work also differs in the objective and training setup: we directly optimize a policy over unmasking decisions with a cross-entropy-based objective weighted by denoiser losses, rather than training an error classifier, and we additionally study a joint training objective that couples the policy and denoiser, which neither of these approaches explores.

## 5 Conclusion

This work studies a method for training a policy over token orderings for masked diffusion models, by learning to sample positions with lower cross\-entropy loss\. We demonstrate that our approach outperforms common heuristics on logical tasks that are sensitive to the unmasking ordering, and that the same adaptation also transfers to protein generation with DPLM\. In the policy\-only setting, these gains come at the cost of only a few training iterations for a lightweight auxiliary network, while joint training further improves performance through a policy\-aware denoising objective\. Our results suggest that the unmasking order in masked diffusion models should be treated as a learnable component rather than a fixed heuristic\. The proposed objective supports both lightweight policy adaptation and joint policy\-denoiser training, yielding improvements across reasoning and protein\-generation domains while also improving efficiency under limited diffusion\-step budgets\.

For future work, the policy parametrization and training scheme of the method can be improved to close the gap between current performance and oracle performance\.

## Acknowledgements

The research was enabled in part by computational resources provided by the Digital Research Alliance of Canada \([https://alliancecan\.ca](https://alliancecan.ca/)\) and Mila \([https://mila\.quebec](https://mila.quebec/)\)\.

## References

- J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, et al. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630(8016), pp. 493–500.
- S. Alamdari, N. Thakkar, R. van den Berg, N. Tenenholtz, B. Strome, A. Moses, A. X. Lu, N. Fusi, A. P. Amini, and K. K. Yang (2023) Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, pp. 2023–09.
- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2021) Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.). [Link](https://openreview.net/forum?id=h7-XixPCAL)
- H. Ben-Hamu, I. Gat, D. Severo, N. Nolte, and B. Karrer (2025) Accelerated sampling from masked diffusion models via entropy bounded unmasking. arXiv preprint arXiv:2505.24857. [Link](https://arxiv.org/abs/2505.24857)
- R. T. Q. Chen and Y. Lipman (2024) Flow matching on general geometries. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=g7ohDlTITL)
- I. Gat, T. Remez, N. Shaul, F. Kreuk, R. T. Chen, G. Synnaeve, Y. Adi, and Y. Lipman (2024) Discrete flow matching. Advances in Neural Information Processing Systems (NeurIPS) 37, pp. 133345–133385.
- J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
- C. Hong, S. An, M. Kim, and J. C. Ye (2025) Improving discrete diffusion unmasking policies beyond explicit reference policies. arXiv preprint arXiv:2510.05725. [Link](https://arxiv.org/abs/2510.05725)
- C. Huang, M. Aghajohari, J. Bose, P. Panangaden, and A. Courville (2022) Riemannian diffusion models. In Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (Eds.). [Link](https://openreview.net/forum?id=ecevn9kPm4)
- Z. Huang, Y. Wang, Z. Chen, and G. Qi (2025) Don't settle too early: self-reflective remasking for diffusion language models. arXiv preprint arXiv:2509.23653. [Link](https://arxiv.org/abs/2509.23653)
- M. Jazbec, T. X. Olausson, L. Béthune, P. Ablin, M. Kirchhof, J. Monteiro, V. Turrisi, J. Ramapuram, and M. Cuturi (2025) Learning unmasking policies for diffusion language models. arXiv preprint arXiv:2512.09106. [Link](https://arxiv.org/abs/2512.09106)
- J. Kim, K. Shah, V. Kontonis, S. Kakade, and S. Chen (2025) Train for the worst, plan for the best: understanding token ordering in masked diffusions. arXiv preprint arXiv:2502.06768. [Link](https://arxiv.org/abs/2502.06768)
- S. Lee, K. Kreis, S. P. Veccham, M. Liu, D. Reidenbach, Y. Peng, S. Paliwal, W. Nie, and A. Vahdat (2025) GenMol: a drug discovery generalist with discrete diffusion. arXiv preprint arXiv:2501.06158.
- Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli, A. dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido, and A. Rives (2022) Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv. [Document](https://dx.doi.org/10.1101/2022.07.20.500902), [Link](https://www.biorxiv.org/content/early/2022/10/31/2022.07.20.500902), [PDF](https://www.biorxiv.org/content/early/2022/10/31/2022.07.20.500902.full.pdf)
- V. Meshchaninov, E. Shibaev, A. Makoian, I. Klimov, D. Sheshenya, A. Malinin, N. Balagansky, D. Gavrilov, A. Alanov, and D. Vetrov (2025) Guided star-shaped masked diffusion. arXiv preprint arXiv:2510.08369. [Link](https://arxiv.org/abs/2510.08369)
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. arXiv preprint arXiv:2502.09992. [Link](https://arxiv.org/abs/2502.09992)
- F. Z. Peng, Z. Bezemek, J. Rector-Brooks, S. Zhang, A. R. Zhang, M. Bronstein, A. J. Bose, and A. Tong (2025) Planner aware path learning in diffusion language models training. arXiv:2509.23405. [Link](https://arxiv.org/abs/2509.23405)
- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. Conference on Computer Vision and Pattern Recognition (CVPR).
- C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems (NeurIPS) 35, pp. 36479–36494.
- S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524.
- J. Shi, K. Han, Z. Wang, A. Doucet, and M. K. Titsias (2024) Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329.
- X. Wang, Z. Zheng, F. Ye, D. Xue, S. Huang, and Q. Gu (2024) Diffusion language models are versatile protein learners. International Conference on Machine Learning (ICML).
- Z. Wang, J. Shi, N. Heess, A. Gretton, and M. K. Titsias (2025) Learning-order autoregressive models with application to molecular graph generation. arXiv preprint arXiv:2503.05979. [Link](https://arxiv.org/abs/2503.05979)
- J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A. J. Borst, R. J. Ragotte, L. F. Milles, et al. (2023) De novo design of protein structure and function with RFdiffusion. Nature 620(7976), pp. 1089–1100.
- J. Ye, J. Gao, S. Gong, L. Zheng, X. Jiang, Z. Li, and L. Kong (2024) Beyond autoregression: discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157.
- Y. Zhang and J. Skolnick (2004) Scoring function for automated assessment of protein structure template quality. Proteins: Structure 57.

## Appendix A Experimental Details

### A.1 Combinatorial Tasks

##### Tasks and datasets

We consider two tasks. Sudoku: 9×9 puzzles where the model fills in blank cells given a partially completed grid, represented as a flat sequence of 81 tokens (digits 1–9). 3-SAT: Boolean satisfiability instances with 9 variables and 3 literals per clause, where the model must find a satisfying assignment given the clause structure. For Sudoku, we follow the dataset and evaluation protocol of Kim et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib126)); for 3-SAT, we use the dataset of Ye et al. ([2024](https://arxiv.org/html/2606.00295#bib.bib125)). We report instance-level accuracy: an instance is correct only if the entire solution is valid.
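As an illustration of instance-level accuracy on Sudoku, the sketch below checks a flat 81-token prediction against the row, column, and box constraints and against the given clues. It is a hedged example rather than the evaluation code of Kim et al.: the row-major layout, the use of `0` for blank cells in the puzzle, and all function names are assumptions.

```python
def is_valid_sudoku(tokens):
    """True if a flat, row-major list of 81 digits (1-9) is a valid solved grid."""
    if len(tokens) != 81 or any(t not in range(1, 10) for t in tokens):
        return False
    rows = [tokens[r * 9:(r + 1) * 9] for r in range(9)]
    cols = [tokens[c::9] for c in range(9)]
    boxes = [
        [tokens[(3 * br + r) * 9 + 3 * bc + c] for r in range(3) for c in range(3)]
        for br in range(3)
        for bc in range(3)
    ]
    # Every row, column, and 3x3 box must contain nine distinct digits.
    return all(len(set(group)) == 9 for group in rows + cols + boxes)


def solves_puzzle(prediction, puzzle):
    """Valid grid that also keeps every given clue (0 marks a blank cell)."""
    keeps_clues = all(p == 0 or p == q for p, q in zip(puzzle, prediction))
    return keeps_clues and is_valid_sudoku(prediction)


def instance_accuracy(predictions, puzzles):
    """Fraction of instances whose entire solution is valid."""
    solved = sum(solves_puzzle(pred, puz) for pred, puz in zip(predictions, puzzles))
    return solved / len(predictions)
```

A single wrong cell makes the whole instance count as incorrect, which is why this metric is sensitive to early unmasking mistakes.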

##### Base model and training

We use a 6M-parameter GPT-2 architecture (3 layers, 384 hidden dimensions, 12 attention heads, vocabulary size 31) as the denoiser $\mu^{\theta}$, trained as a masked diffusion model (MDM) with $T=20$ diffusion steps. Training uses focal-loss-style token reweighting ($\alpha=0.25$, $\gamma=1$) and linear time reweighting, with a learning rate of $10^{-3}$, a batch size of 1024, a cosine learning rate schedule, and mixed precision (fp16) on a single A100 GPU. We study two training regimes on top of this base model. In the *policy-only* regime, the denoiser is frozen and only the lightweight policy head is optimized using [Equation 3](https://arxiv.org/html/2606.00295#S2.E3); for Sudoku, the policy is trained for 24K steps after 115K denoiser pretraining steps, and for 3-SAT, the policy is trained for 7.5K steps after 58.5K denoiser pretraining steps. In the *joint-training* regime, we optimize the denoiser and policy together with the policy-aware weighted objective from [Equation 4](https://arxiv.org/html/2606.00295#S2.E4).

### A.2 Protein / DPLM Experiments

#### A.2.1 DPLM Setup

##### Evaluation setup

For protein experiments, we adapt the same policy parameterization to DPLM-150M (Wang et al., [2024](https://arxiv.org/html/2606.00295#bib.bib27)). The policy again takes per-position confidence scores together with final-layer hidden states and predicts routing logits over masked positions. Our DPLM setup supports both policy-only adaptation, where the pretrained backbone is frozen and only the lightweight policy head is trained, and joint training of the backbone and policy through the policy-aware weighted objective from [Equation 4](https://arxiv.org/html/2606.00295#S2.E4). In the joint DPLM setting, we initialize from the released DPLM-150M checkpoint of Wang et al. ([2024](https://arxiv.org/html/2606.00295#bib.bib27)) (available at [https://huggingface.co/airkingbd/dplm_150m](https://huggingface.co/airkingbd/dplm_150m)), unfreeze the backbone, and fine-tune with the adaptive DPLM training configuration using a reduced learning rate ($4 \times 10^{-5}$) on A100 and H100 GPUs.

For the DPLM results summarized in [Table 1](https://arxiv.org/html/2606.00295#S3.T1) and visualized in [Figure 1](https://arxiv.org/html/2606.00295#S3.F1), we generate proteins at sequence lengths $\{100, 200, 300, 400, 500\}$; pLDDT is first averaged over sequence lengths within each seed and then averaged across three random seeds, and we additionally report inner-TM diversity in the corresponding diversity plots. For the protein sequence generation comparison in [Table 2](https://arxiv.org/html/2606.00295#S3.T2), we follow Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12)) and generate 100 sequences at lengths 200, 300, …, 800 and report pLDDT, pTM, pAE, foldability, token entropy, and sequence diversity. The heuristic baselines are high-confidence decoding and margin decoding, matching the orderings used in the combinatorial experiments.

#### A.2.2 Protein Sequence Generation Metrics

For protein evaluation, we fold each generated sequence with ESMFold (Lin et al., [2022](https://arxiv.org/html/2606.00295#bib.bib128)) and use the predicted structures to assess quality, foldability, and diversity. For the joint comparison in [Table 2](https://arxiv.org/html/2606.00295#S3.T2), we report three structural confidence metrics, a composite foldability score, and two diversity statistics. The per-sequence structural metrics are:

$$\mathrm{pLDDT}(i) = 100 \times \mathbb{E}_{j \in \mathcal{N}(i)}\left[1 - \frac{|d^{\mathrm{pred}}_{ij} - d^{\mathrm{true}}_{ij}|}{d^{\mathrm{true}}_{ij}}\right], \tag{5}$$

$$\mathrm{pTM} = \max_{u} \frac{1}{L} \sum_{i=1}^{L} \frac{1}{1 + \left(d_{i,u(i)} / d_{0}(L)\right)^{2}}, \tag{6}$$

$$\mathrm{pAE}(i,j) = \mathbb{E}\!\left[\left\| x^{\mathrm{pred}}_{i} - x^{\mathrm{true}}_{j} \right\|_{2}\right]. \tag{7}$$

pLDDT measures local per-residue confidence, pTM measures global structural similarity and is adapted from the TM-score (Zhang and Skolnick, [2004](https://arxiv.org/html/2606.00295#bib.bib122)), and pAE measures expected alignment error across residue pairs. We report pLDDT averaged over residues and pAE averaged over residue pairs; higher pLDDT and pTM are better, while lower pAE is better. We also report a binary *foldability* rate, defined as the fraction of generated sequences satisfying $\mathrm{pLDDT} > 80$, $\mathrm{pTM} > 0.7$, and $\mathrm{pAE} < 10$ (as in Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12))).
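The foldability rate is a simple thresholded count, sketched below. The record layout (a dict per sequence with already-averaged `plddt`, `ptm`, and `pae` values) and the function name are hypothetical; only the three strict thresholds come from the definition above.

```python
def foldability_percent(records, plddt_min=80.0, ptm_min=0.7, pae_max=10.0):
    """Percent of sequences with pLDDT > 80, pTM > 0.7, and pAE < 10.

    Each record is assumed to hold sequence-level averages:
    {"plddt": float, "ptm": float, "pae": float}.
    """
    if not records:
        return 0.0
    passed = sum(
        1
        for r in records
        if r["plddt"] > plddt_min and r["ptm"] > ptm_min and r["pae"] < pae_max
    )
    return 100.0 * passed / len(records)


example = [
    {"plddt": 85.0, "ptm": 0.75, "pae": 8.0},   # passes all three thresholds
    {"plddt": 80.0, "ptm": 0.90, "pae": 5.0},   # fails: pLDDT is not strictly above 80
    {"plddt": 90.0, "ptm": 0.60, "pae": 5.0},   # fails: pTM too low
    {"plddt": 90.0, "ptm": 0.80, "pae": 12.0},  # fails: pAE too high
]
print(foldability_percent(example))
```

Because all three conditions must hold at once, a model can have a high average pLDDT and still a modest foldability rate.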

To assess diversity and possible mode collapse, we report token entropy, sequence diversity, and inner\-TM\. Token entropy and sequence diversity are defined as:

$$H = -\sum_{a \in \mathcal{A}} p(a) \log p(a), \tag{8}$$

$$\mathrm{Id}\!\left(x^{(m)}, x^{(n)}\right) = \frac{1}{L} \sum_{i=1}^{L} \mathbf{1}\!\left[x^{(m)}_{i} = x^{(n)}_{i}\right], \tag{9}$$

$$\mathrm{Diversity} = 1 - \frac{2}{B(B-1)} \sum_{1 \leq m < n \leq B} \mathrm{Id}\!\left(x^{(m)}, x^{(n)}\right). \tag{10}$$

Here, $\mathcal{A}$ is the set of amino acids observed in the generated set, $p(a)$ is the empirical frequency of amino acid $a$, and $B$ is the number of generated sequences being compared. Higher entropy and diversity indicate richer amino-acid usage and lower sequence-level collapse. We additionally report inner-TM, defined as the average pairwise TM-score between predicted structures in the generated set; lower inner-TM indicates that the sampled proteins are more structurally diverse.
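The entropy and diversity statistics can be computed directly from the generated sequences, as in the sketch below. Two points are assumptions rather than statements from the text: the logarithm base (base 2 is used here, since the definition leaves it unspecified) and the requirement that compared sequences have equal length, which the position-wise identity implies. Function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def token_entropy(sequences):
    """Entropy of the pooled amino-acid frequencies (base 2 is an assumption)."""
    counts = Counter(ch for seq in sequences for ch in seq)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def identity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("position-wise identity needs equal-length sequences")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def sequence_diversity(sequences):
    """One minus the mean pairwise identity over all unordered pairs."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return 1.0 - sum(identity(a, b) for a, b in pairs) / len(pairs)


print(token_entropy(["AC", "CA"]), sequence_diversity(["AC", "AA"]))
```

The mean over unordered pairs equals the $\frac{2}{B(B-1)}$-normalized sum, since there are $B(B-1)/2$ such pairs. Table 2 reports diversity as a percentage, which corresponds to multiplying this value by 100.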

## Appendix B Additional Policy Architecture Results

### B.1 Transformer Policy Variant

In the policy\-only setting, we also evaluated a transformer\-based policy architecture as an alternative to the per\-token MLP described in the main text\. This variant replaces the MLP with a single transformer encoder layer, allowing the policy to attend across positions when making ordering decisions\. We tested two configurations:

- Score Transformer: Takes only per-token confidence scores as input, projecting each scalar to $d=128$ before a transformer encoder layer.
- Score + Hidden Transformer: Additionally conditions on the denoiser's hidden states ($d=384$), summing projected scores with hidden representations before the transformer layer.

Table 4: Comparison of policy architectures on Sudoku (deterministic-linear decoding, $T=20$ steps).

| Policy Architecture | Accuracy |
|---|---|
| Per-token MLP (scores + hidden) | 90.82% |
| Score Transformer (scores only) | 89.84% |
| Score + Hidden Transformer (scores + hidden) | 90.23% |

Results are shown in [Table 4](https://arxiv.org/html/2606.00295#A2.T4). Both transformer variants achieve slightly lower accuracy than the simpler per-token MLP (89.84% and 90.23% versus 90.82%). This suggests that cross-position attention in the policy does not provide additional benefit for this task, and that the per-token hidden-state representation already captures sufficient information for effective ordering decisions.

## Appendix C Relation of Objective to ELBO Bounds

We can derive our objective ([Equation 3](https://arxiv.org/html/2606.00295#S2.E3)) as an approximation to an ELBO.

In particular, we make use of Proposition 3.2 from Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12)). This gives an ELBO-style objective for masked diffusion decoding under an unmasking policy $q^{\phi}$. Rewritten in the notation of this paper, the proposition can be stated as follows.

###### Proposition 1 (Proposition 3.2 of Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12))).

Let $q^{\phi}(i \mid x_0, x_t)$ be a policy over masked positions at time $t$, and let $p_{\theta,\phi}(x_0)$ denote the marginal probability of $x_0$ under the unmasking process obtained by decoding according to this policy. For each time $t$, let $r_t^{\phi}(\cdot \mid x_0)$ denote the distribution of the partially masked sequence $x_t$ under the policy-guided unmasking process. Then

$$\log p_{\theta,\phi}(x_0) \geq \mathcal{E}_1^{\theta,\phi}(x_0) + \mathcal{E}_2^{\theta,\phi}(x_0), \tag{11}$$

where

$$\mathcal{E}_1^{\theta,\phi}(x_0) = \mathbb{E}_{t \sim \mathcal{U}[0,1]}\left[c(\alpha_t)\, \mathbb{E}_{x_t \sim r_t^{\phi}(\cdot \mid x_0)}\left[\sum_{i=1}^{L} q^{\phi}(i \mid x_0, x_t)\, \mathrm{CE}(x_0^{i}, \mu^{\theta}(x_t)[i,\cdot])\right]\right], \tag{12}$$

and

$$\mathcal{E}_2^{\theta,\phi}(x_0) = -\mathbb{E}_{t \sim \mathcal{U}[0,1]}\left[c(\alpha_t)\, \mathbb{E}_{x_t \sim r_t^{\phi}(\cdot \mid x_0)}\left[\sum_{i=1}^{L} q^{\phi}(i \mid x_0, x_t)\, \log \frac{q^{\phi}(i \mid x_0, x_t)}{F_{\theta,\phi}(x_t, x_0^{i}, i)}\right]\right]. \tag{13}$$

Here $c(\alpha_t)$ is some constant depending on the schedule, and $F_{\theta,\phi}(x_t, x_0^{i}, i)$ is an auxiliary distribution:

$$F_{\theta,\phi}(x_t, x_0^{i}, i) = \mathbb{E}_{z \sim \mu^{\theta}(x_t)}\left[q^{\phi}(i \mid z^{-i, x_0^{i}}, x_t)\right], \tag{14}$$

where $z^{-i, x_0^{i}}$ is $z$ with token $i$ set to value $x_0^{i}$.

To simplify for our setting, since $x_0$ is sampled from the denoiser, we can consider the policy $q^{\phi}(i \mid x_0 \sim \mu^{\theta}(x_t), x_t)$ as being conditioned on only $x_t$, so that $q^{\phi}(i \mid x_t, x_0) = q^{\phi}(i \mid x_t)$. With this notation, we note that the first term $\mathcal{E}_1$ is very similar to the negative of the policy-weighted loss in [Equation 3](https://arxiv.org/html/2606.00295#S2.E3), evaluated along policy-induced paths $r_t^{\phi}(\cdot \mid x_0)$ rather than the standard noising process $p(x_t \mid x_0)$.

The quantity $F_{\theta,\phi}(x_t, x_0^{i}, i)$ in the second term evaluates the policy probability of unmasking position $i$ to have token $x_0^{i}$ (marginalizing over possible denoiser randomness). Explicitly evaluating the $\mathcal{E}_2$ term requires multiple passes of the policy module.

From the above ELBO, if we make the following approximations, then we can recover our objective in [Equation 3](https://arxiv.org/html/2606.00295#S2.E3):

- We sample $x_t$ from the noising distribution $p(x_t \mid x_0)$ rather than the policy-induced distribution $r_t^{\phi}(\cdot \mid x_0)$, since the former is simple and tractable. Peng et al. ([2025](https://arxiv.org/html/2606.00295#bib.bib12)) also make this simplification.
- We omit the second term $\mathcal{E}_2^{\theta,\phi}(x_0)$, which involves a term $F_{\theta,\phi}(x_t, x_0^{i}, i)$ that is difficult to compute. This term may be omitted when fixing the policy (such as when it is obtained from a heuristic). By contrast, we learn our policy, but argue that omitting $\mathcal{E}_2$ still yields a reasonable objective for training the policy, which we support empirically.

Given these approximations, we obtain our policy-weighted objective as a surrogate of the policy-induced ELBO. We use this objective for both policy-only training (in [Equation 3](https://arxiv.org/html/2606.00295#S2.E3)) and joint training of the policy and denoiser (in [Equation 4](https://arxiv.org/html/2606.00295#S2.E4)).
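For a single noised sample, the surrogate obtained from these two approximations can be sketched as a policy-weighted sum of per-position cross-entropies. This is a sketch under stated assumptions, not the paper's implementation: Equation 3 is not reproduced in this appendix, so the form below mirrors $\mathcal{E}_1$ with $x_t$ drawn from the noising process, drops the schedule weight $c(\alpha_t)$ and the outer expectations, and normalizes the policy with a softmax over masked positions. All function and argument names are hypothetical.

```python
import math


def softmax_over(logits, positions):
    """Softmax of policy logits restricted to the masked positions."""
    peak = max(logits[i] for i in positions)
    exps = {i: math.exp(logits[i] - peak) for i in positions}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def policy_weighted_loss(policy_logits, denoiser_probs, x0, masked):
    """Sum over masked i of q(i | x_t) * CE(x0[i], denoiser_probs[i]).

    denoiser_probs[i][v] is the denoiser's probability of token v at position i.
    """
    q = softmax_over(policy_logits, masked)
    loss = 0.0
    for i in masked:
        cross_entropy = -math.log(denoiser_probs[i][x0[i]])
        loss += q[i] * cross_entropy
    return loss


# Position 0 is predicted confidently and correctly, position 1 is a coin flip.
probs = [[0.9, 0.1], [0.5, 0.5]]
target = [0, 0]
print(policy_weighted_loss([0.0, 0.0], probs, target, [0, 1]))  # uniform policy
print(policy_weighted_loss([2.0, 0.0], probs, target, [0, 1]))  # prefers position 0
```

In this toy case the loss drops when the policy shifts probability toward the position with the smaller cross-entropy, which matches the stated behavior that the learned policy prefers positions where the denoiser is more likely to be correct.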

## Appendix D Broader Impacts

This work proposes a method for improving masked diffusion models by learning the unmasking order. It is a general algorithm with potential uses in biological applications (e.g., generating protein or DNA sequences), as well as potentially harmful uses, for instance generating code with exploitable weaknesses. We discourage such harmful applications.