Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

arXiv cs.CL 04/22/26, 04:00 AM Papers
Summary
Introduces Token-to-Mask (T2M) remasking to fix generation errors in masked diffusion LMs by resetting suspect tokens to mask state instead of overwriting, yielding up to +5.92 accuracy on CMATH without extra training or parameters.
arXiv:2604.18738v1 Announce Type: new
# Token-to-Mask Refinement in Masked Diffusion Language Models

Code available at https://github.com/synsis/remasked_DLM.
Source: [https://arxiv.org/html/2604.18738](https://arxiv.org/html/2604.18738)
## Remask, Don’t Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

Lin Yao (1, 2)

(1) School of Computer Science, Shanghai Jiao Tong University, Shanghai, 200240, China; (2) Zhongguancun Academy, Beijing, 100097, China

lin.yao@sjtu.edu.cn

###### Abstract

Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose *Token-to-Mask* (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to *last-mile corruption* (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.

## 1 Introduction

Discrete masked diffusion language models (dLLMs) generate text by starting from a fully masked sequence and iteratively filling it, predicting multiple tokens in parallel at each denoising step ([1](https://arxiv.org/html/2604.18738#bib.bib11), [14](https://arxiv.org/html/2604.18738#bib.bib9), [10](https://arxiv.org/html/2604.18738#bib.bib10)). At scale, LLaDA ([12](https://arxiv.org/html/2604.18738#bib.bib1)) and Dream ([21](https://arxiv.org/html/2604.18738#bib.bib25)) now attain accuracy comparable to autoregressive models of similar size while decoding several positions at once. Parallelism has a known cost. Tokens filled in a single step are predicted independently of each other and can be mutually inconsistent ([8](https://arxiv.org/html/2604.18738#bib.bib26)), and any resulting errors compound in subsequent steps, since the model conditions on its own previous commitments ([2](https://arxiv.org/html/2604.18738#bib.bib2)).

To mitigate these errors, LLaDA2.1 ([2](https://arxiv.org/html/2604.18738#bib.bib2)) introduces *Token-to-Token* (T2T) editing. After every M2T (Mask-to-Token) step, each committed token is re-examined and is overwritten whenever a different token’s predicted probability exceeds a threshold $\tau_{\text{t2t}}$. The mechanism helps LLaDA2.1 match the accuracy of autoregressive models at similar scale ([2](https://arxiv.org/html/2604.18738#bib.bib2)). It also introduces three structural failure modes:

![Refer to caption](https://arxiv.org/html/2604.18738v1/x1.png)

Figure 1: Three failure modes of T2T editing and how T2M recovers, illustrated with LLaDA2.1-mini. (a) Correction inertia: a multimodal posterior prevents T2T from acting despite an obvious error. (b) Premature replacement: T2T swaps the correct “8” for an incorrect “6” under incomplete context; T2M recovers “8” once context converges. (c) Delayed commitment: T2M iteratively remasks uncertain tokens so that “Jon Kitna” settles jointly, while T2T greedily destroys the first name. Numerical probabilities are annotated in the figure.

1. Detection–replacement coupling. Detection and correction share one confidence test: the rule fires only when some alternative is both available and confident. If the posterior is multimodal (say “sad”: 0.12, “happy”: 0.11, …), no candidate crosses $\tau_{\text{t2t}}$, and an obviously incorrect token survives editing. We refer to this as *correction inertia* (Figure [1](https://arxiv.org/html/2604.18738#S1.F1)a).
2. Context pollution. The replacement is the argmax under a context that may itself be corrupted elsewhere. A confident but wrong substitution then propagates, biasing predictions at other positions and, over successive iterations, at the replaced position itself (Figures [1](https://arxiv.org/html/2604.18738#S1.F1)b, [2](https://arxiv.org/html/2604.18738#S5.F2)).
3. Train–inference noise mismatch. During training, the T2T stream perturbs tokens uniformly at random. The errors encountered at inference are nothing like that: they are semantically plausible and locally coherent with their neighbours (Figure [2](https://arxiv.org/html/2604.18738#S5.F2)). The test-time noise distribution therefore sits outside the support of the training distribution.

All three failure modes stem from a single design choice: that detection and correction share one confidence test. We decouple them. When a detector flags a suspect token, the token is reset to `[M]`, and the next M2T step re-predicts the position under the updated context. We call this *Token-to-Mask* (T2M) remasking. A mask is semantically inert; it introduces no directional bias at the remasked position and does not perturb the predictions that depend on it. When several positions are remasked at once, the next M2T step re-predicts them jointly under a conditioning context from which the incorrect tokens have been removed (Figure [1](https://arxiv.org/html/2604.18738#S1.F1)c). The procedure runs at inference time, requires no retraining, and is parameterised only by a choice of detection rule (LowProb, T2T-Remask, or LogitDiff).

We complement the method with a theoretical analysis built around two ideas: a three-level context signal hierarchy (aligned $>$ null $\gg$ adversarial), and a *stuck set* on which T2T provably cannot initiate correction while LowProb always can. A manual audit of every baseline error on CMATH finds that 79.9% are not reasoning mistakes but instances of *last-mile corruption*, in which correct reasoning is followed by a garbled final answer. T2M repairs 41.3% of these, raising CMATH accuracy by +5.92 points. Across 8 benchmarks spanning knowledge, reasoning, math, and instruction following, T2M improves accuracy on tasks that require exact token-level output.

## 2 Related Work

#### Discrete masked diffusion language models\.

D3PM ([1](https://arxiv.org/html/2604.18738#bib.bib11)) introduced discrete diffusion with absorbing states; MDLM ([14](https://arxiv.org/html/2604.18738#bib.bib9)) and SEDD ([10](https://arxiv.org/html/2604.18738#bib.bib10)) gave efficient training objectives. LLaDA ([12](https://arxiv.org/html/2604.18738#bib.bib1)) scaled the approach to 8B parameters, matching autoregressive models of comparable size; Dream ([21](https://arxiv.org/html/2604.18738#bib.bib25)) showed the recipe continues to work further out. LLaDA2.1 ([2](https://arxiv.org/html/2604.18738#bib.bib2)) adds semi-autoregressive block generation and the T2T editing phase we target here, keeping both training recipe and architecture unchanged.

#### Remasking strategies\.

Two prior methods also remask during masked diffusion. ReMDM ([18](https://arxiv.org/html/2604.18738#bib.bib3)) adds a uniform remasking probability $\sigma_t$ to the reverse posterior, applied independently to every committed token; correct and incorrect tokens are therefore removed in equal proportion. In the signal taxonomy of Section [5](https://arxiv.org/html/2604.18738#S5), this eliminates adversarial signals only at the cost of destroying an equal fraction of aligned ones. Appendix [C](https://arxiv.org/html/2604.18738#A3) shows that any targeted remasker whose precision exceeds the base error rate dominates the random baseline (Eqs. [12](https://arxiv.org/html/2604.18738#A3.E12)–[14](https://arxiv.org/html/2604.18738#A3.E14)). CORE ([23](https://arxiv.org/html/2604.18738#bib.bib4)) takes a different route and flags “context-brittle” tokens via their sensitivity under masked perturbations; the test costs $O(k)$ additional forward passes per query, and its signal conflates intrinsic token unreliability with the transient mask pattern, since a token’s sensitivity changes depending on which of its neighbours happen to be filled (Appendix [C](https://arxiv.org/html/2604.18738#A3)). Our LogitDiff rule captures a related trajectory-level signal at no extra forward-pass cost.

#### Training\-based self\-correction\.

A parallel line of work trains the model to self-correct. RemeDi ([6](https://arxiv.org/html/2604.18738#bib.bib5)) learns per-token confidence scores via supervised fine-tuning followed by reinforcement learning; ProSeCo ([15](https://arxiv.org/html/2604.18738#bib.bib6)) adds a corrector cross-entropy loss so that the model recovers from synthetic mistakes; PRISM ([9](https://arxiv.org/html/2604.18738#bib.bib8)) fine-tunes a lightweight self-correction adapter; MDPO ([4](https://arxiv.org/html/2604.18738#bib.bib7)) formulates denoising as a sequential decision problem and applies policy optimisation, also proposing Running Confidence Remasking as a training-free baseline. These methods all improve the detector through additional training. Our contribution is orthogonal: we modify only the correction action (replacement versus remasking), leaving the detector and the training recipe unchanged, and the two directions compose. The context signal hierarchy (Remark [1](https://arxiv.org/html/2604.18738#Thmremark1)) provides a unified perspective for comparing random remasking, perturbation-based sensitivity, and learned detectors.

## 3 Preliminaries

LLaDA2.1 ([2](https://arxiv.org/html/2604.18738#bib.bib2)) generates text via semi-autoregressive block diffusion. The response is partitioned into blocks of $B$ tokens produced left to right; within each block, all `[M]` positions are filled in parallel by iterative Mask-to-Token (M2T) denoising. After each M2T step, a T2T editing phase re-evaluates every committed token and replaces it whenever the argmax prediction exceeds a threshold:

$$x_i \leftarrow x_i^{*} \quad \text{if} \quad p_\theta(x_i^{*} \mid \mathbf{z}) > \tau_{\text{t2t}} \;\land\; x_i^{*} \neq x_i^{\mathrm{old}}, \tag{1}$$

where $x_i^{*} = \arg\max_{v} p_\theta(v \mid \mathbf{z})$. The inner loop iterates until no masks remain and no further edits are triggered, and generation then advances to the next block. Appendix [A](https://arxiv.org/html/2604.18738#A1) gives a self-contained description of masked diffusion, block generation, and T2T.
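The editing rule in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the LLaDA2.1 implementation: the function name `t2t_edit` and the input `probs` (a per-position list of token probabilities standing in for $p_\theta(\cdot \mid \mathbf{z})$) are hypothetical, and every position is assumed to be already committed.

```python
from typing import List, Sequence


def t2t_edit(tokens: Sequence[int],
             probs: Sequence[Sequence[float]],
             tau_t2t: float = 0.5) -> List[int]:
    """Apply the T2T rule of Eq. (1) to already-committed tokens.

    probs[i][v] stands in for p_theta(v | z) at position i (hypothetical input).
    """
    edited = list(tokens)
    for i, row in enumerate(probs):
        best = max(range(len(row)), key=lambda v: row[v])  # argmax token x_i^*
        if row[best] > tau_t2t and best != tokens[i]:
            edited[i] = best  # overwrite the committed token
    return edited


tokens = [0, 2]
probs = [[0.1, 0.8, 0.1],    # token 1 is confident: position 0 is overwritten
         [0.4, 0.35, 0.25]]  # no alternative clears 0.5: position 1 is kept
print(t2t_edit(tokens, probs))  # [1, 2]
```

Position 1 in the toy input keeps a token whose own probability is only 0.25, because no single alternative exceeds the threshold.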

## 4 Method: Token-to-Mask Remasking

### 4.1 Overview

T2M modifies only the editing rule. At every position $i$ that T2T would re-examine, the action is changed from replacement to reset:

$$x_i \leftarrow \texttt{[M]} \quad \text{if} \quad \textsc{ShouldRemask}(i, \mathbf{z}, \theta), \tag{2}$$

after which the next M2T step re-predicts $i$ under the updated context. Model weights, the M2T fill rule, block scheduling, and KV caching are all retained (Algorithm [1](https://arxiv.org/html/2604.18738#alg1)).

There are two reasons to expect this to help. The M2T predictor has been trained on inputs containing `[M]`, so a re-prediction triggered by T2M is made under in-distribution conditioning. Replacement, by contrast, commits the model to an output that has to be interpretable by the T2T training stream, whose uniform random perturbations do not resemble the mistakes the model actually produces at inference (Section [5](https://arxiv.org/html/2604.18738#S5)). The second reason concerns coverage: once the detector is freed from the requirement of also supplying a replacement candidate, it can flag a position on the basis of the current token’s implausibility alone. This is the regime in which T2T is silent.

### 4.2 Error detection

We consider three instantiations of ShouldRemask, each built around a different signal.

#### LowProb\.

The simplest signal is the model’s own probability for the currently committed token\. We re\-score every committed token under the current context and remask those that fall below threshold,

$$\textsc{Remask}_{\textsc{lp}}(i) = \mathbf{1}\!\left[p_\theta(x_i^{\mathrm{old}} \mid \mathbf{z}_{-i}) < \tau_{\text{lp}}\right], \tag{3}$$

with no dependence on any replacement candidate.
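A minimal sketch of the LowProb test in Eq. (3), under the assumption that `probs[i]` holds whatever distribution the model returns for position `i` when the committed tokens are re-scored. The name `lowprob_flags` is hypothetical.

```python
from typing import List, Sequence


def lowprob_flags(tokens: Sequence[int],
                  probs: Sequence[Sequence[float]],
                  tau_lp: float = 0.3) -> List[bool]:
    """Flag positions whose committed token has probability below tau_lp."""
    return [row[tok] < tau_lp for tok, row in zip(tokens, probs)]


tokens = [1, 2]
probs = [[0.1, 0.8, 0.1],    # committed token 1 has p = 0.8: kept
         [0.4, 0.35, 0.25]]  # committed token 2 has p = 0.25: flagged
print(lowprob_flags(tokens, probs))  # [False, True]
```

The second position is flagged even though its best alternative has probability only 0.4, which would not clear a replacement threshold of 0.5.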

#### T2T\-Remask\.

To isolate the effect of changing only the correction action, we reuse T2T’s trigger and swap its action for remasking:

$$\textsc{Remask}_{\textsc{t2t}}(i) = \mathbf{1}\!\left[p_\theta(x_i^{*} \mid \mathbf{z}) > \tau_{\text{tr}} \;\land\; x_i^{*} \neq x_i^{\mathrm{old}}\right]. \tag{4}$$

Any difference between this rule and T2T is attributable to “remask vs. replace” under a fixed detector.
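The following sketch illustrates Eq. (4): the trigger is the T2T test, but the action writes the mask id instead of the argmax token. The names `t2t_remask` and `MASK_ID` are hypothetical, and the default `tau_tr=0.9` is only an illustrative value.

```python
from typing import List, Sequence

MASK_ID = -1  # hypothetical mask id, chosen only for this illustration


def t2t_remask(tokens: Sequence[int],
               probs: Sequence[Sequence[float]],
               tau_tr: float = 0.9,
               mask_id: int = MASK_ID) -> List[int]:
    """Reuse the T2T trigger of Eq. (4) but reset the position to the mask."""
    out = list(tokens)
    for i, row in enumerate(probs):
        best = max(range(len(row)), key=lambda v: row[v])  # argmax token x_i^*
        if row[best] > tau_tr and best != tokens[i]:
            out[i] = mask_id  # remask instead of writing `best`
    return out


print(t2t_remask([0, 1], [[0.02, 0.95, 0.03], [0.3, 0.6, 0.1]]))  # [-1, 1]
```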

#### LogitDiff\.

Both of the above use the current state only. LogitDiff instead tracks how the model’s confidence in $x_i^{\mathrm{old}}$ evolves between consecutive iterations, and fires on a drop:

$$\textsc{Remask}_{\textsc{ld}}(i) = \mathbf{1}\!\left[p_\theta^{(t-1)}(x_i^{\mathrm{old}} \mid \mathbf{z}^{(t-1)}) - p_\theta^{(t)}(x_i^{\mathrm{old}} \mid \mathbf{z}^{(t)}) > \tau_{\text{ld}}\right]. \tag{5}$$

A falling confidence says that the converging neighbourhood has stopped supporting the token; a rising one says it has been corroborated. The signal therefore depends on the denoising trajectory rather than on a single snapshot, which makes it complementary to LowProb. There is no predecessor state at the first iteration of a block, so LogitDiff abstains there.
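A small sketch of the LogitDiff test in Eq. (5), including the abstention at the first iteration. The function name, the use of `None` for a missing predecessor state, and the threshold value `tau_ld=0.2` are assumptions made for illustration; the text does not state a default for $\tau_{\text{ld}}$.

```python
from typing import List, Optional, Sequence


def logitdiff_flags(tokens: Sequence[int],
                    prev_probs: Optional[Sequence[Sequence[float]]],
                    curr_probs: Sequence[Sequence[float]],
                    tau_ld: float = 0.2) -> List[bool]:
    """Flag positions whose committed-token probability dropped by more than tau_ld."""
    if prev_probs is None:  # first iteration of a block: no predecessor state
        return [False] * len(tokens)
    return [prev[tok] - curr[tok] > tau_ld
            for tok, prev, curr in zip(tokens, prev_probs, curr_probs)]


tokens = [1, 0]
prev = [[0.2, 0.8], [0.6, 0.4]]
curr = [[0.7, 0.3], [0.65, 0.35]]
print(logitdiff_flags(tokens, prev, curr))  # [True, False]
print(logitdiff_flags(tokens, None, curr))  # [False, False]
```

The first position loses 0.5 of probability between iterations and is flagged; the second gains slightly and is left alone.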

### 4.3 Safety caps

Remasking can in principle oscillate: a position is remasked, refilled with a similar token, remasked again. Two caps prevent this. A per-position budget $C_{\max}$ (default 1) limits how many times a single position may be remasked within one block, and a per-step ratio cap $\rho_{\max}$ (default 0.25) limits the fraction of editable positions that can be remasked in one M2T iteration. When the ratio cap binds, the least-confident positions are selected first.

Algorithm 1: Token-to-Mask (T2M) Remasking (replacing T2T editing in LLaDA2.1)

Require: current block tokens $\mathbf{z}$, model $p_\theta$, mask id `[M]`, strategy $\mathcal{S}$, threshold $\tau$, budget $C_{\max}$, ratio cap $\rho_{\max}$

1: // Standard M2T step (unchanged from LLaDA2.1)

2: for each mask position $i$ where $z_i = \texttt{[M]}$ do

3: if $\max_{v} p_\theta(v \mid \mathbf{z}) > \tau_{\text{m2t}}$ then

4: $z_i \leftarrow \arg\max_{v} p_\theta(v \mid \mathbf{z})$

5: end if

6: end for

8: // T2M remasking step (replaces T2T editing)

9: $\mathcal{E} \leftarrow \{i : z_i \neq \texttt{[M]} \text{ and } i \notin \text{prompt}\}$ $\triangleright$ editable positions

10: $\mathcal{R} \leftarrow \{i \in \mathcal{E} : \textsc{ShouldRemask}_{\mathcal{S}}(i, \mathbf{z}, \tau) \text{ and } c_i < C_{\max}\}$

11: if $|\mathcal{R}| > \rho_{\max} \cdot |\mathcal{E}|$ then

12: $\mathcal{R} \leftarrow$ top-$k$ by lowest confidence score, $k = \lfloor \rho_{\max} \cdot |\mathcal{E}| \rfloor$

13: end if

14: for each $i \in \mathcal{R}$ do

15: $z_i \leftarrow \texttt{[M]}$; $c_i \leftarrow c_i + 1$

16: end for

17: return $\mathcal{R} = \emptyset$ $\triangleright$ converged if no remasking occurred
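The remasking half of Algorithm 1 (steps 9 to 17) can be sketched as follows. This is an illustrative version under stated assumptions, not the released code: `MASK`, `t2m_remask_step`, and the convention that the prompt occupies the first `prompt_len` positions are hypothetical, and the detector output (`flagged`) and per-position `confidence` scores are passed in rather than computed from a model.

```python
import math
from typing import List, Sequence

MASK = -1  # hypothetical mask id


def t2m_remask_step(z: List[int], flagged: Sequence[bool],
                    confidence: Sequence[float], counts: List[int],
                    prompt_len: int, c_max: int = 1,
                    rho_max: float = 0.25) -> List[int]:
    """Remask flagged positions in place, subject to both safety caps."""
    # Editable positions: committed (non-mask) tokens outside the prompt.
    editable = [i for i in range(prompt_len, len(z)) if z[i] != MASK]
    # Candidates: flagged by the detector and still within the per-position budget.
    remask = [i for i in editable if flagged[i] and counts[i] < c_max]
    # Ratio cap: when it binds, keep only the k least-confident candidates.
    if len(remask) > rho_max * len(editable):
        k = math.floor(rho_max * len(editable))
        remask = sorted(remask, key=lambda i: confidence[i])[:k]
    for i in remask:
        z[i] = MASK
        counts[i] += 1
    return remask  # an empty list means no remasking occurred


z = [10, 11, 5, 6, 7, 8, 9, 12, 13, 14]  # the first two tokens are the prompt
flagged = [False, False, True, True, True, False, False, False, False, False]
confidence = [1.0, 1.0, 0.05, 0.20, 0.10, 0.9, 0.9, 0.9, 0.9, 0.9]
counts = [0] * len(z)
print(t2m_remask_step(z, flagged, confidence, counts, prompt_len=2))  # [2, 4]
```

With eight editable positions and $\rho_{\max} = 0.25$, at most two positions may be remasked, so the three flagged candidates are cut down to the two with the lowest confidence.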

## 5 Theoretical Analysis

We analyse T2M from four angles. Three of them concern the state of a single denoising step: decoupling detection from correction, the quality of the conditioning context, and the match between training and inference noise. The fourth concerns the behaviour of the method across iterations.

### 5.1 Decoupling detection from correction

A single confidence test answers two questions at once in T2T’s trigger: whether $x_i^{\mathrm{old}}$ is wrong, and whether a confident replacement exists. The two questions come apart whenever the posterior is multimodal: the answer to the first can be “yes” while the answer to the second is “no”. We formalise this situation as the *stuck set*

$$\mathcal{S}_{\mathrm{stuck}} = \{\, i : p_\theta(x_i^{\mathrm{old}} \mid \mathbf{z}_{-i}) < \epsilon \text{ and } \max_{v} p_\theta(v \mid \mathbf{z}_{-i}) < \tau_{\text{t2t}} \,\}$$

(Definition [1](https://arxiv.org/html/2604.18738#Thmdefinition1) and Proposition [1](https://arxiv.org/html/2604.18738#Thmproposition1) in Appendix [B](https://arxiv.org/html/2604.18738#A2)). By construction, T2T cannot fire at any $i \in \mathcal{S}_{\mathrm{stuck}}$, whereas LowProb always does. Figure [1](https://arxiv.org/html/2604.18738#S1.F1)(a) instantiates this: the committed token “purple” has probability ${\sim}2 \times 10^{-5}$ and the highest alternatives (“sad” at 0.12, “happy” at 0.11) fail to clear $\tau_{\text{t2t}}$, so T2T takes no action. Remasking does not guarantee a correct re-prediction, but it is a necessary condition for any re-prediction.
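The Figure 1(a) numbers can be run through both triggers directly. The sketch below is illustrative only: it lists just the three probabilities quoted in the text (the remaining mass is assumed to be spread over tokens that each score below “sad”), and it uses the thresholds $\tau_{\text{t2t}} = 0.5$ and $\tau_{\text{lp}} = 0.3$ from the experimental setup. The helper names are hypothetical.

```python
from typing import Dict

# Probabilities quoted for Figure 1(a); other tokens are omitted (assumed < 0.12).
posterior: Dict[str, float] = {"purple": 2e-5, "sad": 0.12, "happy": 0.11}


def t2t_fires(p: Dict[str, float], committed: str, tau_t2t: float = 0.5) -> bool:
    """T2T trigger: a different token must exceed tau_t2t."""
    best = max(p, key=p.get)
    return p[best] > tau_t2t and best != committed


def lowprob_fires(p: Dict[str, float], committed: str, tau_lp: float = 0.3) -> bool:
    """LowProb trigger: the committed token itself must fall below tau_lp."""
    return p[committed] < tau_lp


print(t2t_fires(posterior, "purple"))      # False: no alternative clears 0.5
print(lowprob_fires(posterior, "purple"))  # True: 2e-5 is far below 0.3
```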

### 5.2 Context signal hierarchy

The primary difference between T2T and T2M lies not at the edited position but in what the other positions receive as conditioning context. Each position can contribute one of three qualitatively different signals: aligned (a correct token), null (`[M]`), or adversarial (a plausible but incorrect token), ordered aligned $>$ null $\gg$ adversarial.

Under this ordering, remasking converts an adversarial signal to a null signal, which is never worse. Replacement converts an adversarial signal to either an aligned or another adversarial signal, depending on whether the argmax under the (possibly polluted) context is correct. Appendix [C.1](https://arxiv.org/html/2604.18738#A3.SS1) establishes the analogous dominance of targeted over random remasking.

Figure [2](https://arxiv.org/html/2604.18738#S5.F2) provides an empirical check on the ordering. In the template “I went to [X] and visited the `[M]` Tower”, we vary [X] and query LLaDA2.1-mini for the second mask. With [X] = France, the model predicts “Eiffel” with probability 0.97 (aligned); with [X] = `[M]`, it predicts “Eiffel” with probability 0.82 (null); with [X] = Japan, it predicts “Tokyo” with probability 0.91 (adversarial, high-confidence incorrect); with [X] = banana, it predicts “Eiffel” with probability 0.33 (out-of-context noise, low-confidence correct). The adversarial case is qualitatively worse than both the null and the noise cases, consistent with the hierarchy.

![Refer to caption](https://arxiv.org/html/2604.18738v1/x2.png)

Figure 2: Context signal hierarchy on LLaDA2.1-mini. Aligned context (France) is best; null context (`[M]`) preserves correctness at lower confidence; adversarial context (Japan) misleads the model to a confident wrong answer; unrelated noise (banana) merely degrades confidence.

### 5.3 Train–inference noise mismatch

LLaDA2.1 sees two training noise distributions ([2](https://arxiv.org/html/2604.18738#bib.bib2)): `[M]` in the M2T stream, and uniformly random tokens in the T2T stream. Neither matches what the model actually encounters at inference. The non-mask tokens that appear in an inference-time context are the model’s own earlier mistakes; such mistakes tend to be semantically plausible and locally coherent with their neighbours, properties that uniform random tokens lack. The induced test-time distribution therefore falls outside the support of either training stream. The effect is analogous to the familiar shift, in continuous diffusion, between the Gaussian noise seen during training and the structured corruptions encountered at inference ([5](https://arxiv.org/html/2604.18738#bib.bib13), [16](https://arxiv.org/html/2604.18738#bib.bib14)). Figure [2](https://arxiv.org/html/2604.18738#S5.F2) makes the discrete case concrete. Replacing “France” with “banana” produces an incoherent context that only lowers the confidence of the target prediction ($p(\text{Eiffel}) = 0.33$); this is what the T2T training stream contains. Replacing “France” with “Japan” yields a context that is coherent with the prompt and drives the model to a confident but incorrect prediction ($p(\text{Tokyo}) = 0.91$); this is the kind of error that actually arises at inference. T2M avoids the mismatch rather than correcting it:

$$\underbrace{\{\mathbf{z}_{\text{correct}}, \mathbf{z}_{\text{error}}\}}_{\text{out-of-distribution}} \xrightarrow{\text{T2M}} \underbrace{\{\mathbf{z}_{\text{correct}}, \texttt{[M]}\}}_{\text{in-distribution}}. \tag{6}$$

The detector only needs to flag the erroneous token; it does not need to identify the correct one. The resulting context lies within the distribution the M2T predictor is trained on.

### 5.4 Delayed commitment

T2T acts greedily: as soon as one alternative crosses the threshold, the replacement is committed, and the commitment enters the context of every other position in the block. If the committed token was the wrong argmax under a multimodal posterior, the error becomes difficult to reverse. T2M defers the same decision. A suspect position held at `[M]` stays open until more of its neighbourhood has converged; when several positions are remasked at the same step, the subsequent M2T step re-predicts them jointly under a purified context:

$$\hat{\mathbf{x}}_{\mathcal{R}} = \arg\max_{\mathbf{x}_{\mathcal{R}}} \prod_{i \in \mathcal{R}} p_\theta(x_i \mid \mathbf{z}_{\text{correct}}, \texttt{[M]}^{|\mathcal{R}|}). \tag{7}$$

The individual predictions remain conditionally independent, but their common conditioning context no longer contains mutually reinforcing errors, so the joint argmax is typically more globally consistent. The mechanism has the same flavour as simulated annealing: temporarily raising local uncertainty to escape a configuration that greedy local updates would lock in. Figures [1](https://arxiv.org/html/2604.18738#S1.F1)(b,c) give two DROP instances.

## 6 Experiments

### 6.1 Experimental Setup

#### Model and setup.

All runs use LLaDA2.1-mini (16B MoE) ([2](https://arxiv.org/html/2604.18738#bib.bib2)) in its Q Mode defaults: $\tau_{\text{m2t}} = 0.7$, $\tau_{\text{t2t}} = 0.5$, block length $B = 32$, greedy decoding (temperature 0). We compare T2T editing against T2M with LowProb at $\tau_{\text{lp}} = 0.3$, $C_{\max} = 1$, $\rho_{\max} = 0.25$, which is the configuration selected by the ablation in Figure [3](https://arxiv.org/html/2604.18738#S6.F3). Model weights, inference parameters, and evaluation code are identical across the two conditions.

#### Benchmarks\.

Eight benchmarks grouped into four categories: Knowledge (TriviaQA ([7](https://arxiv.org/html/2604.18738#bib.bib20)), MMLU-Pro ([19](https://arxiv.org/html/2604.18738#bib.bib28))); Reasoning (HellaSwag ([22](https://arxiv.org/html/2604.18738#bib.bib21)), DROP ([3](https://arxiv.org/html/2604.18738#bib.bib23)), BBH ([17](https://arxiv.org/html/2604.18738#bib.bib29))); Math (CMATH ([20](https://arxiv.org/html/2604.18738#bib.bib17)), AIME 2025 ([11](https://arxiv.org/html/2604.18738#bib.bib19))); Instruction Following (IFEval ([24](https://arxiv.org/html/2604.18738#bib.bib24))). Sample counts are given in Appendix [E](https://arxiv.org/html/2604.18738#A5).

### 6.2 Main Results

Table 1: Main results across 8 benchmarks, evaluated on the full test set of each benchmark (sample counts in Appendix [E](https://arxiv.org/html/2604.18738#A5)). All methods use LLaDA2.1-mini with identical inference parameters per benchmark. T2M uses LowProb ($\tau = 0.3$, $C_{\max} = 1$, $\rho_{\max} = 0.25$). Metrics are accuracy or pass@1 (%) unless noted. Best per-benchmark in bold.

| Category | Benchmark | Original (T2T) | T2M (ours) | $\Delta$ |
|---|---|---|---|---|
| Knowledge | TriviaQA (EM) | 43.71 | **44.98** | +1.27 |
| | MMLU-Pro | 58.78 | **58.84** | +0.06 |
| Reasoning | HellaSwag | 78.57 | **78.67** | +0.10 |
| | DROP (EM) | 53.98 | **54.41** | +0.43 |
| | BBH | **75.50** | 75.38 | −0.12 |
| Math | CMATH | 82.33 | **88.25** | +5.92 |
| | AIME 2025 | 30.00 | 30.00 | ±0.00 |
| Instruction | IFEval (Strict) | 73.01 | **74.12** | +1.11 |

Grouping the benchmarks in Table [1](https://arxiv.org/html/2604.18738#S6.T1) by output format makes the pattern easy to read. The four largest gains all occur on benchmarks with short, exact answers: CMATH (+5.92), TriviaQA (+1.27), IFEval (+1.11), and DROP (+0.43). Seven of the eight benchmarks see T2M match or exceed T2T, and the improvements concentrate on the tasks the method is designed to address.

### 6.3 Ablation Studies

Running the full sweep of 109 hyperparameter configurations on the complete CMATH test set is computationally impractical. We instead sweep on a fixed random subset of 100 CMATH problems (seed 42); the T2T baseline accuracy on this subset is 81%, as opposed to 82.33% on the full set, so subset numbers in this subsection are not directly comparable to Table [1](https://arxiv.org/html/2604.18738#S6.T1). Figure [3](https://arxiv.org/html/2604.18738#S6.F3) plots every configuration in the (cost, accuracy) plane; Appendix [D](https://arxiv.org/html/2604.18738#A4) describes the protocol in full. Every T2M configuration lies strictly above the 81% T2T baseline. The best trade-off in the sweep is LowProb with $\tau_{\text{lp}} = 0.3$, $C_{\max} = 1$, $\rho_{\max} = 0.25$, which reaches 93% at 69.2 remasks per sample; this is the configuration carried over to the main experiments. At the other end of the cost axis, T2T-Remask with $\tau_{\text{tr}} = 0.9$ reaches 87% with 1.7 remasks per sample, showing that even a very small volume of targeted remasks is effective. Aggressive settings (large $\tau$, unbounded $C_{\max}$, $\rho_{\max} = 1$) push the count past $10^{3}$ per sample while driving accuracy back below 88%, which motivates the safety caps.

![Refer to caption](https://arxiv.org/html/2604.18738v1/x3.png)

Figure 3: Cost vs. accuracy. Each point is one (strategy, $\tau$, $C_{\max}$, $\rho_{\max}$) configuration on CMATH. All T2M strategies surpass T2T ($\star$, 81%). Most efficient: LowProb $\tau = 0.3$ (93%, 69.2 remasks). Cheapest: T2T-Remask $\tau = 0.9$ (87%, 1.7 remasks).

### 6.4 Analysis

#### Last-mile corruption is the dominant error mode.

We inspect all 194 baseline errors on CMATH by hand. The split between corruption and reasoning is stark: 155 of the 194 errors (79.9%) are not reasoning mistakes at all. In every one of these, the chain of thought reaches the correct numerical result, and only the final answer is garbled during denoising, for instance a digit dropped or appended, or the answer marker repeated. The remaining 39 errors are genuine reasoning failures. Under T2M, 64 of the 155 corruption errors (41.3%) are repaired, while the reasoning errors stay essentially constant ($39 \to 38$). Table [2](https://arxiv.org/html/2604.18738#S6.T2) reports the full breakdown; Appendix [G](https://arxiv.org/html/2604.18738#A7) gives a representative case.

Table 2: Error breakdown on CMATH. “Corruption” = correct reasoning but wrong final extraction.
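The percentages quoted in the error analysis follow directly from the stated counts. The short check below recomputes them; it uses only the numbers given in the text, and the variable names are illustrative.

```python
# Counts stated in the CMATH error analysis.
baseline_errors = 194   # all baseline (T2T) errors
corruption = 155        # last-mile corruption errors
repaired = 64           # corruption errors repaired under T2M

reasoning = baseline_errors - corruption                     # genuine reasoning failures
corruption_share = round(100 * corruption / baseline_errors, 1)
repair_rate = round(100 * repaired / corruption, 1)

print(reasoning)         # 39
print(corruption_share)  # 79.9
print(repair_rate)       # 41.3
```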

## 7 Conclusion

We introduced Token-to-Mask (T2M) remasking, an inference-time modification to the editing phase of masked diffusion language models. Rather than overwriting a suspect token with a fresh guess, T2M resets it to `[M]`. The conditioning context then returns to the distribution on which the M2T predictor was trained, and the commitment at that position is deferred until its neighbourhood has converged. The method is training-free, architecture-preserving, and changes only a single rule in the generation loop. The empirical pattern is consistent with the analysis. On CMATH, where 79.9% of baseline errors are last-mile corruption, T2M repairs 41.3% of those errors and lifts accuracy by +5.92 points. On the remaining benchmarks, tasks that require exact token-level output improve, while tasks with longer free-form outputs are largely unchanged. The underlying signal hierarchy suggests a more general principle for inference-time correction in parallel generative models: when the model distrusts a committed token, a mask is a strictly better conditioning signal than a replacement computed under a potentially polluted context.

#### Limitations\.

T2M assumes an explicit editing phase in the generation loop; dLLMs without one would require a different integration point. Its advantage over T2T diminishes in regimes where the base error rate is already low. The safety caps $C_{\max}$ and $\rho_{\max}$ are fixed from a small pilot sweep and may require retuning per task. Our experiments use a single model (LLaDA2.1-mini); extending the evaluation to other masked diffusion language models is left for future work.

#### Outlook\.

The training-inference noise mismatch identified in Section [5](https://arxiv.org/html/2604.18738#S5) suggests a complementary training-time intervention, which we call *noise-aware continued training*. The idea is to close the gap directly: instead of sampling T2T perturbations uniformly from the vocabulary, one populates the training stream with the model’s own inference-time mistakes, harvested by running the generation pipeline and recording the erroneous commitments it produces. Exposure to such realistic noise should improve both M2T prediction under polluted context and T2T correction accuracy. The intervention is orthogonal to T2M and composes with it: T2M changes the correction mechanism at inference time, while noise-aware training would make the model more robust to the errors it actually makes. Further directions include combining T2M with process-level remasking ([18](https://arxiv.org/html/2604.18738#bib.bib3)), learning the remask threshold adaptively ([9](https://arxiv.org/html/2604.18738#bib.bib8)), and extending the analysis to other dLLM architectures.

Acknowledgments. Anonymous for review.

## References

- [1] (2021) Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems.
- [2] T. Bie, M. Cao, X. Cao, B. Chen, et al. (2026) LLaDA2.1: speeding up text diffusion via token editing. arXiv preprint arXiv:2602.08676.
- [3] D. Dua et al. (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161.
- [4] H. He, K. Renz, Y. Cao, and A. Geiger (2025) MDPO: overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148.
- [5] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems.
- [6] Z. Huang, Y. Wang, Z. Chen, and G. Qi (2025) Don’t settle too early: self-reflective remasking for diffusion language models. arXiv preprint arXiv:2509.23653.
- [7] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- [8] W. Kang, K. Galim, S. Oh, M. Lee, Y. Zeng, S. Zhang, C. Hooper, Y. Hu, H. I. Koo, N. I. Cho, and K. Lee (2025) ParallelBench: understanding the trade-offs of parallel decoding in diffusion LLMs. arXiv preprint arXiv:2510.04767.
- [9] J. Kim, S. Kim, T. Lee, D. Z. Pan, H. Kim, S. Kakade, and S. Chen (2025) Fine-tuning masked diffusion for provable self-correction. arXiv preprint arXiv:2510.01384.
- [10] A. Lou, C. Meng, and S. Ermon (2024) Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning.
- [11] Mathematical Association of America (2025) American invitational mathematics examination 2025.
- [12] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Zhou, J. Wen, and C. Li (2025) LLaDA: large language diffusion with masking. In International Conference on Machine Learning.
- [13] J. Ou, S. Nie, K. Xue, F. Zhu, J. Wen, C. Li, and Z. Zhang (2025) Your absorbing discrete diffusion secretly models the conditional distributions of clean data. In International Conference on Learning Representations.
- [14] S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, and V. Kuleshov (2024) Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems.
- [15] Y. Schiff, O. Belhasin, R. Uziel, G. Wang, M. Arriola, G. Turok, M. Elad, and V. Kuleshov (2026) Learn from your mistakes: self-correcting masked diffusion models. arXiv preprint arXiv:2602.11590.
- [16] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021) Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
- [17] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei (2023) Challenging BIG-Bench tasks and whether chain-of-thought can solve them. Findings of ACL.
- [18] G. Wang, Y. Schiff, S. S. Sahoo, and V. Kuleshov (2025) Remasking discrete diffusion models with inference-time scaling. In Advances in Neural Information Processing Systems.
- [19] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024) MMLU-Pro: a more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.
- [20] T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang (2023) CMATH: can your language model pass Chinese elementary school math test? arXiv preprint arXiv:2306.16636.
- [21] J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7B: diffusion large language models. arXiv preprint arXiv:2508.15487.
- [22] R. Zellers et al. (2019) HellaSwag: can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- [23] K. Zhai, S. Mollah, Z. Wang, and M. Shah (2026) CORE: context-robust remasking for diffusion language models. arXiv preprint arXiv:2602.04096.
- [24] J. Zhou et al. (2023) Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

## Appendix A Background

### A.1 Masked Diffusion Language Models

A masked diffusion language model defines a forward noising process that progressively masks tokens and a reverse denoising process that recovers them ([1](https://arxiv.org/html/2604.18738#bib.bib11), [13](https://arxiv.org/html/2604.18738#bib.bib12), [14](https://arxiv.org/html/2604.18738#bib.bib9)). Given a clean sequence $\mathbf{x} = (x_1, \ldots, x_L)$, the forward process at time $t \in [0,1]$ independently replaces each token with the mask token [M] with probability $\alpha_t$ (where $\alpha_0 = 0$ and $\alpha_1 = 1$), yielding the noisy sequence $\mathbf{z}_t$.
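The forward process can be illustrated with a few lines of Python. This is a minimal sketch, assuming tokens are plain strings and using a hypothetical `forward_mask` helper; it is not taken from any released implementation.

```python
import random

MASK = "[M]"  # stand-in for the mask token

def forward_mask(tokens, alpha_t, rng):
    """Independently replace each token with MASK with probability alpha_t."""
    return [MASK if rng.random() < alpha_t else tok for tok in tokens]

rng = random.Random(0)
tokens = ["the", "answer", "is", "31"]

z_0 = forward_mask(tokens, 0.0, rng)     # alpha_0 = 0: nothing is masked
z_1 = forward_mask(tokens, 1.0, rng)     # alpha_1 = 1: everything is masked
z_half = forward_mask(tokens, 0.5, rng)  # intermediate noise level
```

At the two endpoints the result is deterministic: `z_0` equals the clean sequence and `z_1` is fully masked. Intermediate values of `alpha_t` give a random mixture of clean tokens and masks.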

The model $p_\theta(\mathbf{x} \mid \mathbf{z}_t)$ is trained to predict all masked positions simultaneously via a cross-entropy loss:

$$
\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x},t}\left[\frac{1}{\sum_{i}\mathbf{1}[z_t^i = \texttt{[M]}]} \sum_{i=1}^{L} \mathbf{1}[z_t^i = \texttt{[M]}] \log p_\theta(x_i \mid \mathbf{z}_t)\right]. \tag{8}
$$

At inference time, generation starts from a fully masked sequence and proceeds iteratively: at each step, the model (i) predicts a distribution at every masked position and (ii) commits the predictions whose confidence exceeds a threshold, subject to a per-step budget determined by the unmasking schedule. When more predictions qualify than the budget permits, only the top-$k$ most confident are committed and the remaining qualifying positions stay masked ([14](https://arxiv.org/html/2604.18738#bib.bib9)).

### A.2 Block Diffusion

LLaDA2.1 ([2](https://arxiv.org/html/2604.18738#bib.bib2)) adopts a *semi-autoregressive* block diffusion architecture. The response is divided into blocks of $B$ tokens (typically $B = 32$), generated sequentially from left to right; within each block, the tokens are produced in parallel via iterative denoising.

#### Prompt and response handling\.

Given a prompt–response pair, the prompt tokens are placed into the sequence unmasked; only the response positions are initialised as [M]. In blocks that partially overlap with the prompt, the prompt positions are marked as non-editable and remain frozen throughout generation.

#### Block-causal attention.

To maintain left-to-right coherence while preserving parallel generation within each block, LLaDA2.1 employs a *block-causal* attention mask. Within a block, all positions attend to each other bidirectionally, so the model can capture intra-block dependencies. Across blocks, attention is strictly causal: block $j$ attends to blocks $0, 1, \ldots, j$ but not to any future block $j' > j$. The resulting mask is $\mathbf{M}_{\text{attn}} = \text{tril}(\mathbf{1}_{N_b \times N_b})$ at the block level, expanded to token-level resolution. Previously generated blocks are frozen and their key–value states cached for reuse in subsequent blocks.
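The token-level expansion of the block-level lower-triangular mask can be sketched as follows. The function name and the boolean list-of-lists layout are assumptions made for illustration; a real implementation would build a tensor instead.

```python
def block_causal_mask(num_tokens, block_size):
    """Return mask[q][k], True when query position q may attend to key position k.

    A query attends to every key whose block index is not larger than its own,
    which gives bidirectional attention inside a block and causal attention
    across blocks.
    """
    block_of = [pos // block_size for pos in range(num_tokens)]
    return [
        [block_of[k] <= block_of[q] for k in range(num_tokens)]
        for q in range(num_tokens)
    ]

# Three blocks of two tokens each.
mask = block_causal_mask(num_tokens=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each printed row is one query position. Rows belonging to the same block are identical, which is the token-level signature of a lower-triangular mask at the block level.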

#### Iterative M2T denoising\.

Within each block, the model iteratively fills [M] positions via Mask-to-Token (M2T) steps. At each iteration, the model predicts token distributions $p_\theta(x_i \mid \mathbf{z})$ for all masked positions. Positions whose confidence exceeds a threshold $\tau_{\text{m2t}}$ are committed; when fewer positions qualify than the per-step budget $n_{\text{transfer}}$, the top-$n_{\text{transfer}}$ most confident predictions are committed instead. The inner loop repeats until all masks in the block are filled.
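One M2T step, as described above, might look like the sketch below. The `predictions` mapping (position to argmax token and confidence) and the function name are hypothetical and do not reflect LLaDA2.1's actual interface.

```python
MASK = "[M]"

def m2t_step(tokens, predictions, tau_m2t, n_transfer):
    """Commit masked positions for one Mask-to-Token step.

    predictions maps each masked position to (argmax_token, confidence).
    """
    masked = [i for i, tok in enumerate(tokens) if tok == MASK]
    confident = [i for i in masked if predictions[i][1] > tau_m2t]
    if len(confident) >= n_transfer:
        chosen = confident
    else:
        # Too few positions clear the threshold: fall back to the
        # n_transfer most confident masked positions.
        ranked = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        chosen = ranked[:n_transfer]
    out = list(tokens)
    for i in chosen:
        out[i] = predictions[i][0]
    return out

tokens = ["The", MASK, MASK, MASK]
predictions = {1: ("cat", 0.9), 2: ("sat", 0.4), 3: ("down", 0.6)}

by_threshold = m2t_step(tokens, predictions, tau_m2t=0.7, n_transfer=1)
by_fallback = m2t_step(tokens, predictions, tau_m2t=0.7, n_transfer=2)
```

With a budget of one, the single confident position is committed by the threshold rule. With a budget of two, only one position clears the threshold, so the fallback commits the two most confident positions instead.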

### A.3 Token-to-Token (T2T) Editing

Beyond M2T, LLaDA2.1 introduces a *Token-to-Token* (T2T) editing phase at each denoising step. For positions that already hold committed (non-mask) tokens, the model checks whether the argmax differs from the current token with confidence exceeding $\tau_{\text{t2t}}$. When this holds, the committed token is overwritten by the argmax (Eq. [1](https://arxiv.org/html/2604.18738#S3.E1)). Prompt positions within a block are excluded from T2T editing. The inner loop continues until all masks are filled and no further T2T edits are triggered, at which point generation advances to the next block with the current block’s tokens frozen as context.
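The T2T rule described above can be sketched in the same style. The `editable` list models the exclusion of prompt positions; the data layout and names are again illustrative assumptions rather than the reference implementation.

```python
MASK = "[M]"

def t2t_edit(tokens, predictions, editable, tau_t2t):
    """Overwrite committed tokens whose argmax differs with confidence > tau_t2t.

    predictions maps each position to (argmax_token, confidence).
    Returns the edited sequence and the number of edits made.
    """
    out = list(tokens)
    n_edits = 0
    for i in editable:
        if tokens[i] == MASK:
            continue  # T2T only applies to committed (non-mask) tokens
        candidate, confidence = predictions[i]
        if candidate != tokens[i] and confidence > tau_t2t:
            out[i] = candidate
            n_edits += 1
    return out, n_edits

tokens = ["Answer", ":", "8"]
predictions = {0: ("Result", 0.9), 1: (":", 0.99), 2: ("6", 0.64)}
editable = [2]  # positions 0 and 1 belong to the prompt and stay frozen

edited, n_edits = t2t_edit(tokens, predictions, editable, tau_t2t=0.5)
unchanged, n_none = t2t_edit(tokens, predictions, editable, tau_t2t=0.7)
```

Position 0 has a confident alternative but is never touched because it is not editable. Raising the threshold above 0.64 suppresses the edit at position 2, which illustrates how the rule depends on a single confident alternative.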

## Appendix B Formal Theoretical Results

### B.1 Stuck Set and Its Consequence for T2T

###### Definition 1 (Stuck set).

For a fixed threshold $\tau_{\text{t2t}}$ and a small confidence floor $\epsilon < \tau_{\text{t2t}}$, the *stuck set* $\mathcal{S}_{\mathrm{stuck}}$ consists of positions where the currently committed token $x_i^{\mathrm{old}}$ has very low probability but no single alternative is confident enough to trigger replacement:

$$
\mathcal{S}_{\mathrm{stuck}} = \left\{ i : p_\theta(x_i^{\mathrm{old}} \mid \mathbf{z}_{-i}) < \epsilon \;\text{and}\; \max_{v} p_\theta(v \mid \mathbf{z}_{-i}) < \tau_{\text{t2t}} \right\}. \tag{9}
$$

###### Proposition 1 (T2T is powerless on the stuck set).

For every $i \in \mathcal{S}_{\mathrm{stuck}}$, T2T editing leaves $x_i^{\mathrm{old}}$ unchanged. In contrast, T2M with the LowProb strategy remasks every $i \in \mathcal{S}_{\mathrm{stuck}}$ whenever the threshold $\tau_{\text{lp}}$ satisfies $\tau_{\text{lp}} > \epsilon$ (a condition trivially met by our default $\tau_{\text{lp}} = 0.3$ with $\epsilon \ll 0.3$).

###### Proof\.

T2T’s trigger condition (Eq. [1](https://arxiv.org/html/2604.18738#S3.E1)) requires $p_\theta(x_i^{*} \mid \mathbf{z}) > \tau_{\text{t2t}}$ for some $x_i^{*} \neq x_i^{\mathrm{old}}$, which implies $\max_{v} p_\theta(v \mid \mathbf{z}) > \tau_{\text{t2t}}$. This contradicts the second condition in the definition of $\mathcal{S}_{\mathrm{stuck}}$, so T2T never fires at $i \in \mathcal{S}_{\mathrm{stuck}}$.

For the second claim, LowProb’s trigger (Eq. [3](https://arxiv.org/html/2604.18738#S4.E3)) is $p_\theta(x_i^{\mathrm{old}} \mid \mathbf{z}_{-i}) < \tau_{\text{lp}}$. By Eq. [9](https://arxiv.org/html/2604.18738#A2.E9), $p_\theta(x_i^{\mathrm{old}} \mid \mathbf{z}_{-i}) < \epsilon < \tau_{\text{lp}}$, so the condition is satisfied and $i$ is remasked. ∎

Proposition [1](https://arxiv.org/html/2604.18738#Thmproposition1) isolates one concrete mechanism through which T2M exercises a capability that T2T lacks: it enables the initiation of correction at any position whose committed token the model assigns low probability to, independently of whether a confident replacement exists. The proposition does not guarantee that the subsequent M2T step will fill the remasked position correctly; it guarantees only that such correction becomes possible. Figure [1](https://arxiv.org/html/2604.18738#S1.F1)(a) is a direct instance: “purple” satisfies the definition of $\mathcal{S}_{\mathrm{stuck}}$ for any $\tau_{\text{t2t}} > 0.12$ and any $\epsilon > 2 \times 10^{-5}$, and LowProb with $\tau_{\text{lp}} = 0.3$ remasks it.
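The three conditions in Definition 1 and Proposition 1 are simple threshold tests, sketched below. The probabilities used for “purple” are assumed values read off from the bounds quoted above ($2 \times 10^{-5}$ for the committed token and $0.12$ for the best alternative); the value of `eps` is likewise an arbitrary choice that satisfies $\epsilon > 2 \times 10^{-5}$.

```python
def in_stuck_set(p_old, p_max, eps, tau_t2t):
    """Definition 1: committed token is unlikely, yet no alternative is confident."""
    return p_old < eps and p_max < tau_t2t

def t2t_fires(p_best_alternative, tau_t2t):
    """T2T overwrites only when some alternative exceeds tau_t2t."""
    return p_best_alternative > tau_t2t

def lowprob_fires(p_old, tau_lp):
    """LowProb remasks whenever the committed token falls below tau_lp."""
    return p_old < tau_lp

# Assumed values for the "purple" example.
p_old, p_max = 2e-5, 0.12
eps, tau_t2t, tau_lp = 1e-3, 0.5, 0.3

stuck = in_stuck_set(p_old, p_max, eps, tau_t2t)
t2t = t2t_fires(p_max, tau_t2t)
t2m = lowprob_fires(p_old, tau_lp)
print(stuck, t2t, t2m)
```

Under these assumed values the position is in the stuck set, T2T does not fire, and LowProb does, which is the pattern Proposition 1 states in general.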

### B.2 Context Purification: A More Detailed Account

This appendix expands the context signal hierarchy of Remark [1](https://arxiv.org/html/2604.18738#Thmremark1), and its consequence that T2M converts adversarial signals to null signals, into a more detailed argument.

At every denoising step, the prediction at position $i$ is conditioned on the values at all other positions, which can be decomposed as

$$
p_\theta(x_i \mid \mathbf{z}_{-i}) = p_\theta\!\left(x_i \;\middle|\; \underbrace{\mathbf{z}_{\text{correct}}}_{\text{correct tokens}},\; \underbrace{\mathbf{z}_{\text{error}}}_{\text{erroneous tokens}},\; \underbrace{\texttt{[M]}, \ldots, \texttt{[M]}}_{\text{still unknown}}\right). \tag{10}
$$

The erroneous positions in $\mathbf{z}_{\text{error}}$ arise from two sources: an M2T step may commit an incorrect fill, or a T2T step may replace a token with a different but still incorrect one. In either case, the result is indistinguishable from a correct token to the model and is processed as informative context.

#### Replacement propagates errors\.

When T2T substitutes the suspected error with the current argmax, the replacement is computed under a context that may itself contain errors at other positions\. A high replacement confidence therefore does not imply correctness: the argmax under a polluted context can be confidently wrong\. Once committed, the incorrect replacement is indistinguishable from a valid commitment, and it biases both the predictions at other positions and the re\-evaluation of the same position in subsequent iterations\. In the worst case, several erroneous positions reinforce one another, each sustaining the others across iterations\.

#### Remasking returns context to a neutral state\.

T2M substitutes the suspected error with [M] rather than with a guess. The mask itself carries no semantic bias (Remark [1](https://arxiv.org/html/2604.18738#Thmremark1), type 2); it does not inject misleading information at other positions, and it does not propagate into the re-evaluation of the reset position. Over successive denoising steps, type-(3) adversarial signals in the context are converted to type-(2) null signals, until either the detector stops firing or the safety caps $C_{\max}, \rho_{\max}$ are reached.

#### Joint remasking breaks the error-propagation cycle.

The self-reinforcing cycle is a joint effect of several adversarial signals acting simultaneously: an error at position $j$ biases the prediction at $i$, and that biased prediction at $i$ in turn distorts the re-evaluation at $j$. Converting several such signals to null in a single T2M step severs the dependencies at once: once a position holds [M], it supplies no biasing signal to any other position’s re-prediction. Figure [1](https://arxiv.org/html/2604.18738#S1.F1)(c) is a concrete instance on DROP: six consecutive remasks allow “Jon” to stay in place, while “Kit” and “na” converge to $p \geq 0.98$ under cleaner mutual context; T2T, by contrast, overwrites “Jon” at the first high-confidence alternative.

### B.3 Noise Mismatch: Analogy with Continuous Diffusion

Section [5](https://arxiv.org/html/2604.18738#S5) argues that inference-time errors constitute a third type of noise, systematic and locally coherent, whose support overlaps with neither of the two LLaDA2.1 training distributions. The phenomenon has a structural counterpart in continuous diffusion, which we describe here.

In continuous (Gaussian) diffusion models ([5](https://arxiv.org/html/2604.18738#bib.bib13), [16](https://arxiv.org/html/2604.18738#bib.bib14)), the denoising network learns the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ under a fixed Gaussian noise kernel $\mathcal{N}(0, \sigma_t^2 I)$. When the test-time noise deviates from this training distribution (for example, structured perturbations or out-of-distribution corruptions), the learned score no longer points along the true data gradient, and the sampling trajectory drifts away from $p_0$.

The discrete case has the same structure with two training noise types instead of one. LLaDA2.1 is trained under (i) [M] noise in the M2T stream and (ii) uniform-vocabulary perturbations in the T2T stream. Inference-time errors form a third noise type whose support is disjoint from both. T2M maps a position under this third noise back to the first type ([M]), so that the M2T predictor sees inputs drawn from the distribution it was trained on. The effect of the method is therefore not to change the predictor, but to reduce the distribution shift it is exposed to.

### B.4 Delayed Commitment and Joint Re-prediction

T2T editing is greedy in two respects, each of which contributes to sub-optimal behaviour:

1. Premature locking. Once committed, a replacement becomes context for every other position in the current block. If the committed token is sub-optimal (for instance, the argmax under a multimodal posterior), the error propagates irreversibly, since T2T does not revisit a position at which no alternative exceeds $\tau_{\text{t2t}}$.
2. Per-position independence. T2T treats each candidate independently via $\hat{x}_i = \arg\max_{v} p_\theta(v \mid \mathbf{z}_{-i})$, ignoring the joint structure across positions. When a group of positions is jointly uncertain (for instance, the tokens of a multi-token entity name), per-position argmax produces locally confident but globally incoherent commitments.

T2M addresses both effects by deferring commitment. Resetting a suspect position to [M] keeps its final value open until the surrounding positions have converged; when multiple positions are remasked in the same step, the subsequent M2T step performs a joint re-prediction,

$$
\hat{\mathbf{x}}_{\mathcal{R}} = \arg\max_{\mathbf{x}_{\mathcal{R}}} \prod_{i \in \mathcal{R}} p_\theta\!\left(x_i \;\middle|\; \mathbf{z}_{\text{correct}},\, \texttt{[M]}^{|\mathcal{R}|}\right), \tag{11}
$$

in which the errors in $\mathcal{R}$ are removed from the conditioning context. The individual predictions remain conditionally independent, but the context they share has been jointly purified, so the resulting commitments tend to be more globally consistent than the outcome of sequential greedy replacement.
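Because the product in Eq. 11 factorises over positions and every factor shares the same conditioning context, the joint argmax reduces to a separate argmax at each remasked position. The restatement below is a sketch of that step under the idealisation of Eq. 11; the superscripts T2M and T2T are labels introduced here for comparison, and the fact that an M2T step may commit only a subset of positions is ignored.

$$
\hat{x}_i^{\text{T2M}} = \arg\max_{v}\, p_\theta\!\left(x_i = v \;\middle|\; \mathbf{z}_{\text{correct}},\, \texttt{[M]}^{|\mathcal{R}|}\right) \;\; \text{for each } i \in \mathcal{R},
\qquad
\hat{x}_i^{\text{T2T}} = \arg\max_{v}\, p_\theta(v \mid \mathbf{z}_{-i}).
$$

The two rules share the same per-position form and differ only in the conditioning set: $\mathbf{z}_{-i}$ may still contain $\mathbf{z}_{\text{error}}$, whereas the T2M context has those positions replaced by [M].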

The mechanism is analogous to simulated annealing: a controlled, temporary increase in local uncertainty allows the joint re\-prediction to escape configurations in which sequential greedy updates would otherwise be trapped\.

### B.5 Illustrative Examples Annotated

The following two cases annotate Figures [1](https://arxiv.org/html/2604.18738#S1.F1)(b,c) with the relevant probabilities and the mechanism each instantiates.

#### DROP 160 (gold = 857): premature locking.

At an early denoising step, only “8” is committed ($p = 0.11$); “5” and “7” remain [M]. Under this incomplete context, the top non-“8” alternative is “6” at $p = 0.64 > \tau_{\text{t2t}} = 0.5$, so T2T commits “8” $\to$ “6” and produces the incorrect answer “657”. T2M (LowProb, $\tau_{\text{lp}} = 0.3$) instead remasks “8” based on $p = 0.11 < 0.3$. The next M2T step fills “5” and “7”; once both are present, the first digit is re-predicted as “8” at $p = 0.94$, recovering the correct answer “857”. The per-step trajectory is shown in Figure [5](https://arxiv.org/html/2604.18738#A8.F5).

#### DROP (gold = “Jon Kitna”): per-position independence.

The target spans three tokens, with committed values “Jon” ($p = 0.32$), “Kit” ($p = 0.57$), and “na” ($p = 0.45$) before correction. T2T’s per-position argmax fires on “Jon”, whose top alternative “Kit” exceeds $\tau_{\text{t2t}}$, and commits “Jon” $\to$ “Kit”. T2M with LowProb flags all three tokens as low-probability and, subject to the ratio cap, remasks them across six denoising rounds. As the surrounding context converges, each token is re-predicted at high confidence: “Jon” at $p = 0.96$, “Kit” at $p = 0.99$, and “na” at $p = 0.98$. Joint re-prediction under the cleaner context recovers the globally coherent “Jon Kitna” that per-position greedy editing cannot.

## Appendix C Extended Comparison with Prior Remasking Methods

Section [2](https://arxiv.org/html/2604.18738#S2) summarises the qualitative contrast with prior remasking methods. This appendix supplies the two formal arguments: a dominance result for targeted over random remasking (ReMDM), and an information-theoretic critique of perturbation-based sensitivity (CORE).

### C.1 Targeted vs. Random Remasking: A Dominance Result

ReMDM ([18](https://arxiv.org/html/2604.18738#bib.bib3)) remasks each committed token independently with probability $\sigma_t$ at every reverse-diffusion step. Using the signal hierarchy of Remark [1](https://arxiv.org/html/2604.18738#Thmremark1), we now quantify the disadvantage of random remasking relative to targeted remasking. Let there be $N$ committed (non-mask) positions, of which $N_c$ are correct (aligned, contribution $s_+ > 0$) and $N_e$ are erroneous (adversarial, contribution $s_- < 0$). Define the context quality as $Q = N_c s_+ + N_e s_-$. Under random remasking with per-token probability $\sigma$, each position is removed independently of correctness, so

$$
Q_{\text{random}}(\sigma) = (1 - \sigma)(N_c s_+ + N_e s_-). \tag{12}
$$

Increasing $\sigma$ eliminates adversarial signals, but in equal proportion destroys aligned signals; there is no $\sigma^{*}$ at which the trade-off is uniformly favourable. Under targeted remasking with perfect detection, only the $N_e$ erroneous positions are removed, yielding

$$
Q_{\text{targeted}} = N_c s_+. \tag{13}
$$

The difference is strictly positive for all $\sigma \in (0,1]$ whenever $N_c > 0$:

$$
Q_{\text{targeted}} - Q_{\text{random}}(\sigma) = \underbrace{\sigma N_c s_+}_{\text{aligned signals preserved}} + \underbrace{(1 - \sigma)(-N_e s_-)}_{\text{adversarial signals removed}} > 0. \tag{14}
$$

The first term is strictly positive for every $\sigma > 0$, and the second is non-negative, vanishing only at $\sigma = 1$ or when $N_e = 0$. For $\sigma \in (0,1)$ with $N_e > 0$ both terms are strictly positive, so targeted remasking preserves strictly more correct context and removes strictly more errors than the random alternative. The dominance extends to imperfect detection provided the detector’s precision exceeds the base error rate $N_e / N$, a condition empirically satisfied by each of our three strategies.
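The decomposition in Eq. 14 can be checked numerically. The sketch below evaluates both sides for one arbitrary choice of counts and signal strengths; all numeric values are illustrative assumptions, not measurements from the paper.

```python
def context_quality_gap(n_c, n_e, s_plus, s_minus, sigma):
    """Evaluate Eqs. 12-14 for perfect-detection targeted vs. random remasking."""
    q_random = (1 - sigma) * (n_c * s_plus + n_e * s_minus)   # Eq. 12
    q_targeted = n_c * s_plus                                 # Eq. 13
    gap = q_targeted - q_random                               # left side of Eq. 14
    preserved = sigma * n_c * s_plus                          # aligned signals preserved
    removed = (1 - sigma) * (-n_e * s_minus)                  # adversarial signals removed
    return q_random, q_targeted, gap, preserved, removed

# Illustrative values: 90 correct and 10 erroneous positions.
q_random, q_targeted, gap, preserved, removed = context_quality_gap(
    n_c=90, n_e=10, s_plus=1.0, s_minus=-2.0, sigma=0.25
)
print(q_random, q_targeted, gap, preserved + removed)
```

The gap equals the sum of the two labelled terms, as the algebra requires. Setting `sigma=1.0` makes the second term zero while the first stays positive, which is the boundary case of the inequality.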

### C.2 Trajectory Sensitivity vs. Perturbation Sensitivity (CORE)

CORE ([23](https://arxiv.org/html/2604.18738#bib.bib4)) identifies unreliable tokens via the conditional mutual information $I(X_i; Z_S \mid Z_{-i,-S})$ under masked perturbations $Z_S$, which measures how much the prediction at position $i$ changes when a subset $S$ of the context is hidden. We raise two concerns with this signal.

#### Confounding with the transient mask pattern.

CORE’s sensitivity is measured under additional masking applied on top of the current denoising state, but that state is itself a partially-filled context, and the sensitivity depends strongly on which neighbours happen to be filled at the time of probing. A correct token may exhibit low sensitivity while its semantically informative neighbours are still [M] (because the disambiguating context is already absent), and much higher sensitivity once those neighbours have been filled, irrespective of whether the fills are themselves correct. Conversely, an erroneous token may appear stable simply because its correlated neighbours remain unfilled. The resulting signal conflates intrinsic token unreliability with the transient mask geometry.

#### An information-theoretic upper bound.

The static posterior $p_\theta(x_i \mid \mathbf{z})$ is computed under the full current context. By the data processing inequality, any statistic derived from an artificially degraded version of that context (as in CORE’s masked perturbations) contains at most as much information about the correctness of $X_i$ as the full-context posterior itself. CORE pays $O(k)$ additional forward passes per probe to compute a signal that cannot exceed in information content a quantity already available from the standard forward pass.

Our LogitDiff rule captures a related but more principled form of sensitivity: the change in $p_\theta^{(t)}(x_i^{\mathrm{old}})$ between consecutive denoising iterations. This quantity tracks the evolution of the model’s confidence along the actual denoising trajectory, under fully informed context, and incurs no additional forward-pass cost.

## Appendix D Ablation Studies

We conduct a full cross-ablation of all hyperparameters on CMATH. We randomly sample 100 problems from the CMATH test set (1,098 problems, fixed seed=42) and run single-sample greedy inference (batch size 1, temperature 0) on a single A100-80GB GPU per configuration. All shared inference parameters follow the Q Mode defaults from our main experiments: M2T confidence threshold $\tau_{\text{m2t}} = 0.7$, T2T editing threshold $\tau_{\text{t2t}} = 0.5$, block length $B = 32$, generation length 16,384.

The M2T fill threshold $\tau_{\text{m2t}} = 0.7$ is shared across all configurations (including the baseline). We sweep three remask-specific hyperparameters:

1. The *remask threshold* $\tau$ governs how aggressively each strategy triggers remasking. The sign of the effect depends on the strategy (Table [3](https://arxiv.org/html/2604.18738#A4.T3)).

   Table 3: Effect of increasing $\tau$ on remask volume per strategy.
2. The *per-position budget* $C_{\max} \in \{1, 3, 5\}$ caps the number of times a single position may be remasked within a block.
3. The *ratio cap* $\rho_{\max} \in \{0.25, 0.50, 1.0\}$ caps the fraction of editable positions remasked in a single step; $\rho_{\max} = 1.0$ corresponds to no cap.
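The two safety caps in items 2 and 3 can be sketched as a filter applied to the detector's candidate positions. How candidates are ranked when the ratio cap binds, and how the cap is rounded, are not specified here; the sketch assumes ranking by a hypothetical suspicion score and rounding down.

```python
import math

def apply_safety_caps(candidates, scores, remask_counts, n_editable, c_max, rho_max):
    """Select the positions to remask in one step and update remask_counts in place.

    candidates: positions flagged by a detector
    scores: position -> suspicion score, higher means more suspect (assumed ranking)
    remask_counts: position -> number of remasks so far in this block
    """
    # Per-position budget: drop positions already remasked c_max times.
    allowed = [i for i in candidates if remask_counts.get(i, 0) < c_max]
    # Ratio cap: at most rho_max of the editable positions (rounded down, assumed).
    limit = math.floor(rho_max * n_editable)
    allowed.sort(key=lambda i: scores[i], reverse=True)
    chosen = allowed[:limit]
    for i in chosen:
        remask_counts[i] = remask_counts.get(i, 0) + 1
    return chosen

scores = {2: 0.9, 5: 0.4, 7: 0.7}
counts = {5: 1}  # position 5 has already used its single remask
first_round = apply_safety_caps([2, 5, 7], scores, counts, n_editable=8, c_max=1, rho_max=0.25)
second_round = apply_safety_caps([2, 5, 7], scores, counts, n_editable=8, c_max=1, rho_max=0.25)
```

With $C_{\max} = 1$ every position can be remasked only once, so the second round selects nothing and the inner loop is guaranteed to stop triggering remasks.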

The baseline uses unmodified T2T editing at $\tau_{\text{t2t}} = 0.5$, with all inference parameters taken from the LLaDA2.1-mini Q Mode defaults ([2](https://arxiv.org/html/2604.18738#bib.bib2)). The sweep consists of $1 + (5 + 3 + 4) \times 3 \times 3 = 109$ configurations, each evaluated on the same 100 samples.
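The configuration count can be reproduced by enumerating the grid. Only the threshold counts 5, 3 and 4 and the two cap sets come from the text; the strategy labels and the use of a threshold index in place of actual threshold values are placeholders.

```python
from itertools import product

# Number of remask thresholds swept per strategy. Which strategy has which
# count is not stated above, so the keys are placeholder labels.
thresholds_per_strategy = {"strategy_a": 5, "strategy_b": 3, "strategy_c": 4}
c_max_values = [1, 3, 5]
rho_max_values = [0.25, 0.50, 1.0]

# One baseline run with unmodified T2T editing.
configs = [("baseline_t2t", None, None, None)]
for strategy, n_tau in thresholds_per_strategy.items():
    for tau_index, c_max, rho_max in product(range(n_tau), c_max_values, rho_max_values):
        configs.append((strategy, tau_index, c_max, rho_max))

print(len(configs))
```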

The 109 configurations are plotted in Figure [3](https://arxiv.org/html/2604.18738#S6.F3) (Section [6](https://arxiv.org/html/2604.18738#S6)). The $x$-axis reports the average number of token modifications per sample (T2T edits for the baseline point $\star$, at 12.4 edits/sample, and remask count for T2M strategies), and the $y$-axis reports accuracy. Color shade encodes $\tau$ (lighter = smaller). The vertical dotted line marks the average output length ($\sim 334$ tokens). The most efficient configuration is LowProb at $\tau = 0.3$, $C = 1$, $\rho = 0.25$ (93%, 69.2 remasks); the cheapest is T2T-Remask at $\tau = 0.9$, $C = 1$, $\rho = 0.5$ (87%, 1.7 remasks).

## Appendix E Benchmark Details

Table [4](https://arxiv.org/html/2604.18738#A5.T4) lists all benchmarks used in our evaluation with dataset details.

Table 4: Benchmark details.

## Appendix F Hyperparameter Details

All remask-specific hyperparameters and their defaults are listed in the main text. Shared inference parameters follow LLaDA2.1 Q Mode: `threshold` $= 0.7$, `editing_threshold` $= 0.5$, `block_length` $= 32$, `temperature` $= 0.0$. The denoising inner loop runs until convergence (all masks filled and no edits/remasks triggered), not for a fixed number of steps.

## Appendix G Representative Case Studies on CMATH

Figure [4](https://arxiv.org/html/2604.18738#A7.F4) presents a representative CMATH case drawn from the corruption analysis of Section [6](https://arxiv.org/html/2604.18738#S6). The model’s chain-of-thought arithmetic is correct under both methods, but the T2T run produces a corrupted final answer; T2M, by resetting the low-confidence tokens in the final-answer line and re-predicting them, recovers the correct answer.

Leading digits dropped + answer marker repeated (Gold: 31)

Q: Xiaofang is 8 years old and her mother is 39. How many years younger is Xiaofang than her mother?

T2T: Reasoning: $39 - 8 = 31$ ✓ $\to$ “The answer is the answer is 1” ✗

T2M: Reasoning: $39 - 8 = 31$ ✓ $\to$ “The answer is 31” ✓

Figure 4: Representative CMATH case illustrating last-mile token corruption. Both methods produce correct chain-of-thought reasoning. The T2T run produces a corrupted final answer (a dropped leading digit and a repeated answer marker), while the T2M run outputs the correct result. The answer-marker phrase is translated from the Chinese original for readability.

## Appendix H Token Denoising Trajectory

Figure [5](https://arxiv.org/html/2604.18738#A8.F5) shows the full token-level denoising trajectory for the DROP 160 example (gold $= 857$) referred to in Section [6.4](https://arxiv.org/html/2604.18738#S6.SS4). Each cell reports the token and its model probability at the corresponding denoising step. The figure expands Figure [1](https://arxiv.org/html/2604.18738#S1.F1)(b) into per-step detail: T2T commits the error “8” $\to$ “6” at $t = 1$, whereas T2M remasks the position at $t = 1$ and recovers “8” at $t = 2$.

![Refer to caption](https://arxiv.org/html/2604.18738v1/x4.png)

Figure 5: Token denoising trajectory for DROP example 160 (gold = 857). Each cell shows the token and its probability. (a) T2T: “8” ($p = 0.11$) is replaced by “6” ($p = 0.64$) at $t = 1$, producing 657. (b) T2M: “8” is remasked at $t = 1$, re-predicted as “8” ($p = 0.94$) at $t = 2$ under converged context (“5”, “7” committed), producing 857.
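The difference between the two editing rules on this example can be sketched with the probabilities reported in Figure 5. The function names and the mask symbol are hypothetical, the low-probability detector is an assumed reading of a confidence-based heuristic, and $\tau = 0.3$ is used only for illustration; the source does not state which detector or threshold produced this trajectory.

```python
MASK = "<mask>"  # placeholder mask symbol for this sketch


def t2t_edit(committed: str, top_token: str, top_prob: float,
             editing_threshold: float = 0.5) -> str:
    """T2T: overwrite the committed token when a different token crosses the threshold."""
    if top_token != committed and top_prob > editing_threshold:
        return top_token
    return committed


def t2m_low_prob_remask(committed: str, committed_prob: float, tau: float) -> str:
    """T2M with an assumed low-probability detector: reset a suspect token to the mask."""
    return MASK if committed_prob < tau else committed


# Step t = 1 of the DROP example: "8" is committed with p = 0.11, "6" has p = 0.64.
after_t2t = t2t_edit("8", "6", 0.64)                # overwritten, giving 657
after_t2m = t2m_low_prob_remask("8", 0.11, tau=0.3)  # reset to the mask state

# Step t = 2 under T2M: the masked position is re-predicted as "8" with p = 0.94,
# which clears the commit threshold of 0.7 quoted in Appendix F.
recovered = "8" if after_t2m == MASK and 0.94 >= 0.7 else after_t2m
```

The T2T rule replaces the token using a prediction made while the surrounding context is still unsettled, whereas the T2M rule defers the decision by one step, until "5" and "7" are committed.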