The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

arXiv cs.AI Papers

Summary

This paper identifies a failure mode in masked diffusion language models where confidence-based decoding leads to high-confidence errors on complex reasoning tasks, and shows that confidence-aligned training exacerbates this issue while random masking preserves reasoning performance.

arXiv:2605.29123v1 Announce Type: new Abstract: Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de facto standard inference policy. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation. However, we argue that confidence-based decoding is inherently misaligned with the logical-flow trajectories required for complex reasoning, and that confidence-aligned training actively entrenches this misalignment. We make this concrete using multi-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long-range dependencies, producing high-confidence errors on challenging inputs. While traditional random masking keeps the failure rate low on this challenging tail, confidence-aligned training amplifies the error rate by an order of magnitude. Across five distinct reasoning tasks, this same pattern emerges with task-dependent severity: confidence-based decoding induces failures on highly complex inputs, and confidence-aligned training exacerbates them. In contrast, random masking -- despite its perceived inefficiency -- robustly preserves the reasoning-trajectory conditionals essential for solving the challenging tail.
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:13 AM

# The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
Source: [https://arxiv.org/html/2605.29123](https://arxiv.org/html/2605.29123)
###### Abstract

Masked diffusion language models \(MDMs\) uniquely support any\-order generation, with confidence\-based decoding currently serving as the de facto standard inference policy\. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation\. However, we argue that confidence\-based decoding is inherently misaligned with the logical\-flow trajectories required for complex reasoning, and that confidence\-aligned training actively entrenches this misalignment\. We make this concrete using multi\-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long\-range dependencies, producing high\-confidence errors on challenging inputs\. While traditional random masking keeps the failure rate low on this challenging tail, confidence\-aligned training amplifies the error rate by an order of magnitude\. Across five distinct reasoning tasks, this same pattern emerges with task\-dependent severity: confidence\-based decoding induces failures on highly complex inputs, and confidence\-aligned training exacerbates them\. In contrast, random masking—despite its perceived inefficiency—robustly preserves the reasoning\-trajectory conditionals essential for solving the challenging tail\.

22footnotetext:Correspondence to: Albert No<albertno@yonsei\.ac\.kr\>\.## 1Introduction

Discrete diffusion models\(Austinet al\.,[2021](https://arxiv.org/html/2605.29123#bib.bib19); Louet al\.,[2023](https://arxiv.org/html/2605.29123#bib.bib10); Campbellet al\.,[2022](https://arxiv.org/html/2605.29123#bib.bib20)\)have emerged as an alternative to autoregressive language modeling\. Among discrete diffusion variants, masked diffusion models \(MDMs\)\(Sahooet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib12); Shiet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib11); Ouet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib13)\)are particularly prominent: tokens transition between an absorbing\[MASK\]state and the original text, and the model is trained to reconstruct masked positions from a partially observed context\. Recent work has shown that MDMs can scale to large\-scale language modeling\(Nieet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib5); Yeet al\.,[2025b](https://arxiv.org/html/2605.29123#bib.bib8); Gonget al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib21)\)and achieve competitive performance on reasoning, planning, and code\-generation benchmarks\(Yeet al\.,[2025a](https://arxiv.org/html/2605.29123#bib.bib7); Zhaoet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib22)\)\.

A central appeal of MDMs is their decoding flexibility\. Unlike autoregressive models, which commit to a fixed left\-to\-right factorization, an MDM can support arbitrary generation orders: different positions can be unmasked at different times, and each decoding order induces a different factorization of the same sequence distribution\(Changet al\.,[2022](https://arxiv.org/html/2605.29123#bib.bib15); Kimet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib3); Yeet al\.,[2025b](https://arxiv.org/html/2605.29123#bib.bib8)\)\. The dominant inference heuristic is*confidence\-based decoding*, which reveals the position with the largest top\-1 probability, margin, or negative entropy score\. This heuristic is intuitively appealing for generation, as it first commits to tokens that appear easiest under the current context\. It has also motivated recent confidence\-aligned training schemes\. PAPL\(Penget al\.,[2026](https://arxiv.org/html/2605.29123#bib.bib17)\)reweights the per\-token loss toward positions the model already predicts confidently, while PUMA\(Kimet al\.,[2026b](https://arxiv.org/html/2605.29123#bib.bib4)\)modifies the masking process so that training states resemble confidence\-based inference trajectories\.

For reasoning tasks, however, generation order is not merely a matter of presentation\. A reasoning problem usually has a*logical\-flow order*: an order in which intermediate facts become justified and later facts become determinate\. In multi\-digit addition, the stable order is least\-significant\-digit first, because each carry must be propagated from lower digits before higher digits are fully determined\. More generally, the useful generation order is the order in which a competent solver would establish the solution\. An MDM decoded in a different order must predict later facts while their prerequisites are still masked, forcing the model to marginalize over unresolved reasoning states\. Confidence\-based decoding can therefore diverge from the reasoning order: it prefers locally easy tokens, not necessarily tokens whose dependencies have been resolved\.

This distinction matters most on the hard tail of reasoning distributions\. The inputs on which the dependency order is long or rigid are often rare, but they are not pathological outliers\. They are controlled versions of the cases for which reasoning models are most valuable: long carry chains, narrow maze corridors, deeply nested expressions, or ultimately hard mathematical and scientific problems whose dependency structure cannot be shortened by local heuristics\. A model that performs well on common instances by following confidence shortcuts may still fail on precisely the inputs that distinguish complex reasoning from mere interpolation\.

We use multi\-digit addition as the cleanest setting in which to expose this divergence\. Addition is simple enough that the correct dependency structure is known exactly: each digit’s value depends on the carry propagated from below, so the unique reasoning order is least\-significant\-digit first\. At the same time, it admits a strong distributional shortcut\. Under typical digit sampling, long carry chains are rare, and high\-order digits can usually be predicted from a short local window without traversing the full chain\. These two ingredients—a known reasoning order and an available shortcut effective on most inputs—allow us to directly measure the decoding behavior\. Specifically, we can determine whether confidence\-based decoding follows the logical reasoning order or the shortcut, and observe how a model behaves when these two pathways diverge\.

Our empirical study compares uniform random masking with two confidence\-aligned training schemes, PAPL and PUMA, on five reasoning tasks: addition, maze, ListOps, Countdown, and Sudoku\. We hold architecture and compute fixed within each domain; only the training intervention varies\. We evaluate confidence\-based decoding, random decoding, and task\-specific logical\-flow or solver\-derived orders where available, and stratify results by structural difficulty\. The results show that confidence alignment can amplify the mismatch between locally easy predictions and the true reasoning\-order dependencies\. Depending on the task, this alignment can even produce critical overall failure\.

Our contributions are:

- •We formulate a reasoning\-order view of MDM decoding and explain why confidence\-based decoding can be suboptimal when confidence differs from logical\-flow dependency order\.
- •We give a concrete analysis of the confidence shortcut on multi\-digit addition, characterizing how confidence\-aligned training schemes amplify the failure\.
- •We provide a controlled five\-task empirical study showing that confidence\-aligned training can amplify reasoning\-order coverage gaps in qualitatively different ways across tasks\.

## 2Preliminaries

#### Notation\.

Let𝐱=\(x1,…,xL\)∈𝒱L\\mathbf\{x\}=\(x\_\{1\},\\dots,x\_\{L\}\)\\in\\mathcal\{V\}^\{L\}denote a clean sequence over vocabulary𝒱\\mathcal\{V\}, augmented with a special mask token𝙼\\mathtt\{M\}\. For a subset𝐌⊆\[L\]=\{1,…,L\}\\mathbf\{M\}\\subseteq\[L\]=\\\{1,\\dots,L\\\}, we write𝐌¯=\[L\]∖𝐌\\overline\{\\mathbf\{M\}\}=\[L\]\\setminus\\mathbf\{M\}for the complement and𝐱𝐌¯\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}for the sub\-sequence of𝐱\\mathbf\{x\}at the positions in𝐌¯\\overline\{\\mathbf\{M\}\}\. We refer to𝐌\\mathbf\{M\}as the masked indices and𝐌¯\\overline\{\\mathbf\{M\}\}as the visible indices\.

### 2\.1Masked Diffusion Models

A masked diffusion model \(MDM\) learns to reconstruct tokens from partially masked contexts\. Given𝐱∼pdata\\mathbf\{x\}\\sim p\_\{\\text\{data\}\}, a masking rateλ∼Unif​\(0,1\)\\lambda\\sim\\mathrm\{Unif\}\(0,1\)is sampled, and each position in\[L\]\[L\]is independently masked with probabilityλ\\lambda\. Let𝐌⊆\[L\]\\mathbf\{M\}\\subseteq\[L\]be the resulting masked set\. The model observes𝐱𝐌¯\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}and assigns a categorical distribution to each masked position:

pθ​\(xi∣𝐱𝐌¯\)∈Δ​\(𝒱\),i∈𝐌\.p\_\{\\theta\}\(x\_\{i\}\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\)\\in\\Delta\(\\mathcal\{V\}\),\\quad i\\in\\mathbf\{M\}\.The standard MDM denoising objective is

ℒ​\(θ\)=−𝔼𝐱,λ,𝐌​\[1λ​∑i∈𝐌log⁡pθ​\(xi∣𝐱𝐌¯\)\],\\mathcal\{L\}\(\\theta\)=\-\\,\\mathbb\{E\}\_\{\\mathbf\{x\},\\,\\lambda,\\,\\mathbf\{M\}\}\\left\[\\frac\{1\}\{\\lambda\}\\sum\_\{i\\in\\mathbf\{M\}\}\\log p\_\{\\theta\}\(x\_\{i\}\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\)\\right\],where1/λ1/\\lambdanormalizes for the expected fraction of masked tokens\.

#### Order\-agnostic interpretation\.

The MDM objective is equivalent to a uniform expectation over generation orders\(Kimet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib3)\):

ℒ​\(θ\)∝−𝔼π∼Unif​\(𝕊L\)​\[∑j=1Llog⁡pθ​\(xπ​\(j\)\|𝐱π⁣\(:j\)\)\],\\mathcal\{L\}\(\\theta\)\\;\\propto\\;\-\\,\\mathbb\{E\}\_\{\\pi\\sim\\mathrm\{Unif\}\(\\mathbb\{S\}\_\{L\}\)\}\\left\[\\sum\_\{j=1\}^\{L\}\\log p\_\{\\theta\}\\bigl\(x\_\{\\pi\(j\)\}\\,\\big\|\\,\\mathbf\{x\}\_\{\\pi\(\\,:\\,j\)\}\\bigr\)\\right\],where𝕊L\\mathbb\{S\}\_\{L\}denotes the symmetric group of permutationsπ\\pion\[L\]\[L\]andπ\(:j\)=\{π\(1\),…,π\(j−1\)\}\\pi\(\\,:\\,j\)=\\\{\\pi\(1\),\\dots,\\pi\(j\-1\)\\\}is the prefix unmasked before stepjj\. Every generation order is trained uniformly; the order used at inference time is determined by the decoding policy\.

### 2\.2Decoding policies

MDM inference starts from the fully masked sequence and iteratively unmasks tokens\. At each step, let𝐌⊆\[L\]\\mathbf\{M\}\\subseteq\[L\]be the current masked set; the model produces a distributionpθ\(⋅∣𝐱𝐌¯\)p\_\{\\theta\}\(\\cdot\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\)for eachi∈𝐌i\\in\\mathbf\{M\}\. A decoding policy chooses a reveal setR⊆𝐌R\\subseteq\\mathbf\{M\}and fills those positions, typically by greedy prediction:

xi←arg⁡maxv∈𝒱⁡pθ​\(v∣𝐱𝐌¯\),i∈R\.x\_\{i\}\\;\\leftarrow\\;\\arg\\max\_\{v\\in\\mathcal\{V\}\}\\,p\_\{\\theta\}\(v\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\),\\quad i\\in R\.
#### Confidence\-based decoding\.

The most common decoding policy selects positions where the current model is most confident\. For a masked positioni∈𝐌i\\in\\mathbf\{M\}, define the top\-1 confidence

cθi=maxv∈𝒱⁡pθ​\(v∣𝐱𝐌¯\)\.c\_\{\\theta\}^\{i\}=\\max\_\{v\\in\\mathcal\{V\}\}\\,p\_\{\\theta\}\(v\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\)\.Confidence\-based decoding selects the highest\-scoring positions according tocθic\_\{\\theta\}^\{i\}and decodes them first\. Variants use related uncertainty measures, such as the margin between the top two probabilities or negative predictive entropy\(Changet al\.,[2022](https://arxiv.org/html/2605.29123#bib.bib15); Kimet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib3); Yeet al\.,[2025b](https://arxiv.org/html/2605.29123#bib.bib8)\)\.

#### Confidence\-aligned training\.

Recent work modifies MDM training so that the training distribution better matches confidence\-based inference\.PAPL\(Planner Aware Path Learning\)\(Penget al\.,[2026](https://arxiv.org/html/2605.29123#bib.bib17)\)keeps the i\.i\.d\. random masking process but reweights the per\-token loss across masked positions: positions that the model already predicts confidently receive larger loss weight\.PUMA\(Progressive UnMAsking\)\(Kimet al\.,[2026b](https://arxiv.org/html/2605.29123#bib.bib4)\)replaces the masking process itself: starting from a fully masked sequence, it iteratively unmasks ground\-truth tokens in the order selected by the model’s confidence scores and uses the resulting intermediate states as denoising contexts\. Both schemes shift the training distribution toward the inference\-time confidence trajectory, on the rationale that aligning the two should improve generation quality\. For the details of each method, please refer to Appendix[D](https://arxiv.org/html/2605.29123#A4)\.

## 3Addition as a Controlled Lens on Reasoning Order

Addition serves as an exceptionally clean diagnostic for reasoning in masked diffusion models\. The task is elementary, its exact dependency structure is known, and the gap between genuine reasoning and high\-accuracy shortcuts can be characterized in closed form\. This makes addition an ideal motivating case before examining tasks with more complex and ambiguous dependencies\.

### 3\.1Task setup

We consider3232\-digit addition\. Each example consists of two operands and a sum: “a\+b=ca\+b=c\.”

The prompt contains the two operands and the equality sign, and the answer region contains the3333output digits of the sum, including the possible carry\-out digit\. During generation, the answer region is initially masked and the model must fill in all output digits\.

We evaluate three training schemes—standard uniform random masking, PAPL, and PUMA—across two decoding policies\. The first is confidence\-based decoding, which unmasks the highest\-confidence digit at each step\. The second is an addition\-specific, least\-significant\-digit\-first policy that strictly enforces the arithmetic dependency order\. We stratify test instances by the length of their longest carry\-propagation chain\.

### 3\.2The optimal reasoning order is LSB\-first

Although the input and output strings are conventionally written in most\-significant\-first order, the arithmetic computation flows in the opposite direction\. Indexing from the least significant digit, leta0,b0a\_\{0\},b\_\{0\}, andc0c\_\{0\}be the least significant digits, withc32c\_\{32\}representing the final carry\-out\. Letrir\_\{i\}denote the carry into positionii\. The arithmetic resolves as follows:

r0=0,ci=\(ai\+bi\+ri\)mod10ri\+1=𝟏​\{ai\+bi\+ri≥10\}\(i=0,…,31\),c32=r32r\_\{0\}=0,\\qquad\\begin\{aligned\} c\_\{i\}&=\(a\_\{i\}\+b\_\{i\}\+r\_\{i\}\)\\bmod 10\\\\ r\_\{i\+1\}&=\\mathbf\{1\}\\\{a\_\{i\}\+b\_\{i\}\+r\_\{i\}\\geq 10\\\}\\end\{aligned\}\\quad\(i=0,\\ldots,31\),\\qquad c\_\{32\}=r\_\{32\}Here,cic\_\{i\}is fully determined oncerir\_\{i\}is known, andrir\_\{i\}is established only after resolving the lower\-order positions\. Therefore, unmasking the answer fromc0c\_\{0\}toc32c\_\{32\}never forces the model to predict a digit before its arithmetic prerequisites are available\. This makes least\-significant\-digit\-first \(LSB\-first\) decoding the optimal logical\-flow order for addition\.

### 3\.3The tempting shortcut

Addition also admits a powerful shortcut hidden within these same mechanics\. Define the position\-wise sumsi=ai\+bis\_\{i\}=a\_\{i\}\+b\_\{i\}\. Ifsi≤8s\_\{i\}\\leq 8, the carry out of positioniiis*killed*:ri\+1=0r\_\{i\+1\}=0regardless ofrir\_\{i\}\. Ifsi≥10s\_\{i\}\\geq 10, a carry is*generated*:ri\+1=1r\_\{i\+1\}=1regardless ofrir\_\{i\}\. Only whensi=9s\_\{i\}=9does the position*propagate*the incoming carry, yieldingri\+1=rir\_\{i\+1\}=r\_\{i\}\. We classify these askk,gg, andppcells, respectively\. A maximal sequence of consecutiveppcells forms a*carry chain*, which is strictly bounded by the nearest lower\-orderggorkkcell that terminates it\.

To determine the carry into a high\-order position, it is usually sufficient to look a short distance downward until a non\-propagating digit \(ggorkk\) is encountered\. True long\-range dependency arises only across an unbroken run ofsi=9s\_\{i\}=9\. Under uniform digit sampling,ℙ​\(si=9\)=0\.1\\mathbb\{P\}\(s\_\{i\}=9\)=0\.1\. Consequently, a bounded heuristic that looks only at thewwpreceding lower\-order positions fails for a specific digit only if allwwpositions propagate the carry:

ℙ​\(w​\-digit lookahead is insufficient for position​i\)=ℙ​\(si−1=⋯=si−w=9\)=10−w\.\\mathbb\{P\}\\bigl\(w\\text\{\-digit lookahead is insufficient for position \}i\\bigr\)=\\mathbb\{P\}\(s\_\{i\-1\}=\\cdots=s\_\{i\-w\}=9\)=10^\{\-w\}\.
Applying a union bound over all 33 answer digits, the probability that any digit requires a lookahead greater thanwwis bounded by33⋅10−w33\\cdot 10^\{\-w\}\. With just a window ofw=8w=8, this probability falls below10−610^\{\-6\}\. Thus, a model can achieve near\-perfect average accuracy by learning a finite\-window lookahead heuristic instead of the rigorous LSB\-first computation\. This shortcut is neither weak nor artificial; it almost perfectly aligns with the standard training distribution\.

This phenomenon demonstrates exactly why addition is such a useful diagnostic tool\. Average test accuracy can appear nearly perfect, even while the model completely fails to implement the true reasoning order required for rare but structurally demanding carry chains\.

### 3\.4Confidence decoding fails on the hard tail

Each model is trained for300​k300\\mathrm\{k\}iterations \(batch size256256\) on20,00020\{,\}000random instances of3232\-digit addition \(≈4,000\\approx 4\{,\}000epochs\)\. Because this is an elementary task, all three training schemes easily achieve≥99\.9%\\geq 99\.9\\%exact\-match accuracy on the standard test set \(n=10,000n=10\{,\}000\) under confidence\-based decoding\. To probe behavior on the rare long\-chain inputs, we construct test strata of500500instances each, guaranteeing a minimum carry\-chain length\.

Table 1:Exact\-match accuracy on3232\-digit addition by carry\-chain length \(3\-seed average\)\. “Random” denotes standard MDM training with uniform random masking\. PAPL is reported withα=1\\alpha=1; PAPL withα=5\\alpha=5collapses under both decode policies\.[Table˜1](https://arxiv.org/html/2605.29123#S3.T1)reports exact\-match accuracy by chain length\. Random masking remains nearly perfect under confidence\-based decoding—scoring0\.9920\.992on the hardest stratum \(chain≥28\\geq 28\)—and hits exactly1\.0001\.000under LSB\-first decoding across all lengths\. PUMA, however, drops to0\.9080\.908on the hardest stratum under confidence\-based decoding, though it completely recovers to1\.0001\.000when forced to use LSB\-first decoding\. PAPL suffers a far more severe collapse\. Its loss reweighting is controlled byα\\alpha\(where larger values shift more weight to high\-confidence positions\)\. Even at a modest setting ofα=1\\alpha=1, accuracy drops to0\.0260\.026under confidence\-based decoding for chain≥28\\geq 28, and LSB\-first decoding barely improves this to0\.0300\.030\. Under the defaultα=5\\alpha=5, accuracy collapses to zero across all chain lengths regardless of the decoding policy\.

![Refer to caption](https://arxiv.org/html/2605.29123v1/x1.png)Figure 1:Two confidence\-decode failures from random masking on long\-carry\-chain inputs\. The chain\-MSB cell is decoded before the chain interior is resolved\. The committed value differs from the answer by exactly±1\\pm 1, with the direction determined by the cell’s role: atkkcells the model assumes no carry\-in arrives \(−1\-1\); atggcells the model assumes one does \(\+1\+1\)\.#### Wrong commit profile\.

Of the406406confidence\-decoding failures \(3939from random masking,367367from PUMA\),99\.8%99\.8\\%occur at the chain\-MSB cell\. In these cases, the model misses the true answer by exactly±1\\pm 1, predictably reflecting the cell’s arithmetic role: overshooting by\+1\+1at nearly everyggcell and undershooting by−1\-1at everykkcell \([Figure˜1](https://arxiv.org/html/2605.29123#S3.F1)\)\. Remarkably, the model is highly confident in these errors, assigning the wrong token a top\-1 probability of0\.9970\.997on average with the correct token consistently ranked second\. Crucially, these wrong commits occur prematurely relative to the dependent chain\. At the moment of the erroneous commit, the median fraction of the carry chain’s interior that remains masked is0\.680\.68for random masking and1\.001\.00for PUMA\. On the hardest stratum \(chain≥28\\geq 28\), this commit happens as early as decoding step1212of3333for random masking, and at step33for PUMA\. In short, confidence\-based decoding selects the chain\-MSB cell long before the carry chain is resolved, a flaw that PUMA massively exacerbates by committing while the entire chain is completely masked\.

PAPL exhibits a different failure mode\. Its wrong commits are scattered across bothgg/kkcells \(exhibiting both over\- and under\-predictions without a consistent sign\) and chain\-interiorppcells \(missing by±9\\pm 9, indicative of a wrong\-neighbor carry flip\)\. Furthermore, these errors occur very late in the decoding trajectory—at a median stage of2828out of3333, when only28%28\\%of the chain remains unresolved—indicating that the failure arises after the chain has mostly been decoded\.

### 3\.5Discussion: What addition teaches us

The chain\-MSB cell is the worst possible target for confidence\-based decoding\. While local context \(ad\+bda\_\{d\}\+b\_\{d\}\) easily determines the carry\-*out*, predicting the cell’s exact digit still requires the unresolved carry\-*in*from the chain below\. Forced to predict prematurely, the model uses the cell’s own arithmetic role as a local proxy\. Since akkcell does not propagate a carry, the model assumes no carry arrives; conversely, for aggcell, it assumes one does\. This substitution explains the predictable±1\\pm 1errors\. It is a typical*confidence shortcut*: a locally valid heuristic that succeeds on the bulk of the distribution but fails entirely when a distant anchor dictates the carry\-in\. Both random masking and PUMA share this exact failure profile—location, direction, and probability collapse—differing only in frequency\.

![Refer to caption](https://arxiv.org/html/2605.29123v1/x2.png)Figure 2:Training loss \(solid, left axis\) and generation accuracy \(dashed, right axis\) on the natural test distribution by training scheme\.Training dynamics \([Figure˜2](https://arxiv.org/html/2605.29123#S3.F2)\) reveal the trade\-off behind this: by aligning training mask states to the confidence\-based inference trajectory, PUMA converges on the natural distribution an order of magnitude faster than random masking\. However, this alignment rigidly commits the model to the shortcut trajectory, severely amplifying the latent failure rate on long\-chain inputs\. Importantly, the underlying capacity to compute the correct carry\-in remains intact, as LSB\-first oracle decoding perfectly recovers100%100\\%accuracy\. PUMA does not degrade what the model can represent, but rather biases which representations its inference path queries\.

In contrast, PAPL exhibits a fundamentally different, unrecoverable failure mode\. Its errors scatter unpredictably across cells and cannot be rescued by oracle decoding\. This collapse highlights a fatal flaw in its loss reweighting: positions lacking early confidence \(i\.e\., those with deep sequential dependencies\) are continuously downweighted and effectively starved of training signal\. PAPL’s training loss flatlines for most of the process before rising late \([Figure˜2](https://arxiv.org/html/2605.29123#S3.F2)\), reflecting that these deeply dependent representations are never properly learned, leaving the long\-chain conditionals completely broken\.

## 4Experiments on Other Reasoning Tasks

### 4\.1Setup

We extend the analysis to four reasoning tasks: maze navigation, ListOps, Countdown, and Sudoku\. For each task, all three schemes \(random masking, PAPL, PUMA\) are trained under a single architecture and compute budget per domain and evaluated under confidence\-based decoding and a task\-specific dependency\-respecting decoding where available\. Difficulty stratification is task\-specific\. For detailed data format and hyperparameter setting, please refer to Appendix[C](https://arxiv.org/html/2605.29123#A3)and[D](https://arxiv.org/html/2605.29123#A4)\.

### 4\.2Maze

We evaluate a10×1010\\times 10maze rendered on a21×2121\\times 21wall/corridor grid, with marked start and goal\. The model labels every corridor cell as on\-path or off\-path\. The dependency\-respecting oracle is dead\-end\-filling, the standard polynomial\-time maze solver: it iteratively eliminates dead\-end branches until only the start\-to\-goal path remains\. We stratify the test mazes by the longest corridor in the backbone path; long corridors require information to propagate from both ends\. Each stratum contains300300mazes\.

[Table˜2](https://arxiv.org/html/2605.29123#S4.T2)reports puzzle\-level exact\-match accuracy by corridor length\. Model trained with random masking is at or above0\.990\.99under both decode policies across all strata\. PUMA’s confidence\-decode accuracy stays near0\.880\.88at every corridor length and is recovered to≥0\.91\\geq 0\.91by the dead\-end filling, with the gap widening at longer corridors \(0\.8930\.893confidence vs\.0\.9870\.987oracle at corridor≥30\\geq 30\)\. PAPL falls between the two:∼0\.90\\sim\\\!0\.90under confidence decoding and≥0\.97\\geq 0\.97under the oracle\. Uniform random decoding performs much worse than either confidence or the oracle for all three schemes, with PUMA the weakest by a wide margin\. PUMA’s training distribution covers only mask states reachable along its self\-induced confidence trajectory, so the random decoding trajectory exposes the model to mask configurations it has rarely seen during training\.

Unlike in the addition task, PAPL does not exhibit a complete collapse on mazes: oracle decoding recovers its performance to≥0\.97\\geq 0\.97, indicating that the underlying conditionals distinguishing the labels remain intact\. The maze grid imposes redundant constraints on each cell through its neighbors, so the same conditional is reachable through many partial\-mask configurations, and a confidence\-weighted loss does not strand any single dependency\. The addition chain admits no such redundancy\.

Table 2:Maze puzzle exact\-match accuracy by maximum corridor length on21×2121\\times 21grids\.![Refer to caption](https://arxiv.org/html/2605.29123v1/x3.png)Figure 3:Wrong\-commit region geometry on the maze grid\. Cells are labeled by the color: walls \(black\), off\-path corridors \(light gray\), the answer path \(green\) connecting two start/end cells \(blue circles\)\. Wrong commits are overlaid as a diagonal half: an off\-path cell labeled on\-path \(orange\) or an on\-path cell labeled off\-path \(red\)\.#### Wrong commit profile\.

We examined all394394confidence\-decoding failures \(1111from random masking,209209from PUMA,174174from PAPL\)\. Without exception, the wrong cells in each failing instance form a single contiguous,11\-cell\-wide path rather than a 2D cluster\. Topologically,96%96\\%of wrong cells have exactly two wrong neighbors, and none have more than two\. Local statistics confirm this spatial propagation: an orthogonal neighbor of a wrong cell is itself wrong with probability0\.450\.45on average, far exceeding thei\.i\.d\.baseline of0\.120\.12–0\.150\.15\. The nature of the error differs across training schemes\. PAPL and PUMA mostly make*path\-removal errors*: they take cells that truly lie on the start\-to\-goal path and label them as off\-path\. PUMA makes this error in202202of its209209failures, and PAPL makes it in all174174of its failures\. Geometrically \(red regions,[Figure˜3](https://arxiv.org/html/2605.29123#S4.F3)\), the model cuts out a contiguous segment, severing the start\-to\-goal connection\.

Random masking has far fewer failures\. Among its1111failures,88are instead*path\-addition errors*: the model labels an off\-path corridor as on\-path\. Geometrically, this creates a spurious side corridor or parallel false path rather than cutting the true path\.

This shared shape arises because a cell’s label is strongly biased toward extending its committed neighbors\. Once the model makes an incorrect commit, this flawed state acts as a confident local proxy for adjacent cells, cascading the error along the corridor until a dead end\. This deeply parallels the chain\-MSB substitution in addition: both exploit a locally consistent heuristic—carry symmetry there, corridor extension here—as a proxy for a globally determined value, propagating the error along the very structure that makes the shortcut tempting\.

### 4\.3ListOps, Countdown, Sudoku

Table 3:Accuracy on three reasoning tasks\. “Recovery” is task\-specific: layered post\-order decoding for ListOps,50%50\\%answer equation chain reveal for Countdown, and not available for Sudoku\. “mm” for Countdown is solution multiplicity \(number of distinct equations reaching the target\)\. ListOps and Countdown report exact\-match; Sudoku reports cell\-level accuracy on blank cells\.#### ListOps\.

ListOps consists of nested arithmetic expressions overmin,max,median, andsum\-mod\-10operators\(Nangia and Bowman,[2018](https://arxiv.org/html/2605.29123#bib.bib29)\)\. Its reasoning structure is hierarchical: inner sub\-expressions must be evaluated before the outer operators become resolvable\. We therefore stratify test instances by expression depth and use a layered post\-order decoding policy as a bottom\-up diagnostic order\.

As depth increases, confidence\-aligned training increasingly underperforms standard random masking\. At depth33, accuracy under confidence\-based decoding is0\.4260\.426for random masking,0\.3020\.302for PAPL, and0\.2060\.206for PUMA\. At depth55, the gap becomes even more pronounced: random masking reaches0\.0420\.042, PAPL0\.0240\.024, and PUMA only0\.0020\.002\([Table˜3](https://arxiv.org/html/2605.29123#S4.T3)\)\. Thus, the same pattern observed on long carry chains emerges in a hierarchical domain: as the dependency structure deepens, confidence\-aligned training loses coverage faster than standard random masking\.

Unlike in the addition and maze tasks, layered post\-order decoding does not substantially recover model performance\. At depth55, random masking improves only from0\.0420\.042to0\.0540\.054, PAPL from0\.0240\.024to0\.0280\.028, and PUMA from0\.0020\.002to0\.0040\.004\. Uniform random decoding yields similar numbers \(0\.0340\.034,0\.0320\.032, and0\.0020\.002\)\. These results suggest that the failure is not merely due to a bad confidence trajectory\. In this small\-model regime, the deep bottom\-up conditionals themselves appear underlearned, especially for PUMA\.

#### Countdown\.

Countdown asks the model to combine four numbers with\{\+,−,×,/\}\\\{\+,\-,\\times,/\\\}to reach a target\. We stratify examples by solution multiplicitymm, the number of distinct equation chains that reach the target; lowmmindicates a highly constrained instance, while highmmindicates that many valid solutions exist\(Katzet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib36)\)\.

The aggregate results reveal a surprising trend\. PUMA is much weaker overall under confidence\-based decoding: achieving0\.2940\.294exact\-match accuracy, compared with0\.5180\.518for random masking and0\.5250\.525for PAPL\. However, this deficit is not concentrated on the rare, highly constrained cases\. On the low\-multiplicity stratum \(m∈\[1,3\]m\\in\[1,3\]\), PUMA slightly exceeds random masking \(0\.2110\.211vs\.0\.1820\.182\), although this small gap should not be over\-interpreted\. The decisive failure occurs on the common, high\-multiplicity stratum: form≥11m\\geq 11, PUMA drops to0\.3310\.331, while random masking and PAPL reach0\.9730\.973and0\.9810\.981, respectively\.

Uniform random decoding does not repair this gap: onm≥11m\\geq 11, PUMA improves only to0\.4740\.474, remaining far below random masking and PAPL\. In contrast, unmasking50%50\\%of the answer equation chain recovers all three methods to at least0\.990\.99overall\. Thus, PUMA has not lost the arithmetic conditionals needed to complete a solution once a useful scaffold is visible\. Its failure is instead a mask\-state coverage failure: the self\-induced confidence training trajectory has narrowed away from the partial equation states needed to solve many common, easy Countdown instances from scratch\.

#### Sudoku\.

We evaluate standard9×99\\times 9Sudoku puzzles and report cell\-level accuracy on initially blank cells\. We stratify test instances by puzzle rating tier—where the top\-1%1\\%denotes the hardest one percent of puzzles based on a solver’s guess count—and by the fraction of TL4 cells, which require search beyond standard deductive techniques\(t\-dillon,[2019](https://arxiv.org/html/2605.29123#bib.bib33); Stuart,[2008](https://arxiv.org/html/2605.29123#bib.bib34)\)\.

Sudoku represents a success case for confidence\-aligned training\. Under confidence\-based decoding, PUMA performs best across every difficulty level: overall accuracy is0\.7530\.753for PUMA versus0\.6290\.629for random masking, and on the top\-1%1\\%tier, PUMA reaches0\.6090\.609versus0\.3570\.357for random masking \([Table˜3](https://arxiv.org/html/2605.29123#S4.T3)\)\. This contrasts with the addition and maze tasks\. In Sudoku, highly confident predictions often correspond to logically ready cells: because each blank cell is constrained by its row, column, and3×33\\times 3box, unmasking one correct cell immediately strengthens many other predictions\. Confidence\-based decoding therefore resembles true constraint propagation rather than a flawed shortcut that bypasses unresolved dependencies\.

Uniform random decoding confirms that PUMA’s gain is primarily trajectory\-driven\. On the top\-1%1\\%tier, PUMA’s lead over random masking shrinks from\+0\.252\+0\.252under confidence\-based decoding \(0\.6090\.609vs\.0\.3570\.357\) to\+0\.062\+0\.062under random decoding \(0\.4320\.432vs\.0\.3700\.370\)\. For instances with a TL4 fraction≥0\.95\\geq 0\.95, the lead similarly shrinks from\+0\.132\+0\.132to\+0\.032\+0\.032\. Thus, PUMA does not simply learn a uniformly stronger global representation; rather, it benefits because its confidence\-trained trajectory is inherently well aligned with Sudoku’s constraint\-propagation structure\.

## 5Discussion

#### Confidence is a readiness heuristic, not a guarantee\.

Confidence\-based decoding treats high token probability as evidence that a position is ready to be unmasked\. This heuristic is useful when local predictability matches logical readiness, but it fails when a token is predictable via a shortcut while its true dependencies remain masked\. The addition and maze tasks make this distinction clear: a digit or corridor cell can appear highly confident well before the underlying carry state or global path connectivity has been resolved\. Sudoku represents the opposite case, where high\-confidence predictions often perfectly align with the cells a constraint solver would logically fill next\.

#### Confidence alignment narrows the states the model learns\.

Standard random masking trains the model across a diverse range of partial\-observation states\. In contrast, PAPL and PUMA bias training toward states that are likely to occur under confidence\-based decoding\. This bias can improve efficiency when the confidence trajectory is task\-aligned, as seen in Sudoku\. However, when confidence follows a shortcut, this same narrowing drastically reduces coverage of critical reasoning states\. The result is either a*trajectory failure*, where a better decoding order can recover the model’s performance, or a*representation failure*, where the underlying conditionals were never learned sufficiently to be recovered\.

#### The main lesson\.

The core issue is not confidence\-based decoding itself, but whether its decoding order faithfully matches the task’s dependency structure\. It succeeds when confidence correctly tracks logical readiness, but fails when confidence latches onto a locally correct yet globally incomplete shortcut\. Therefore, evaluation should look beyond average accuracy under a single confidence decoder\. Models must be tested on structurally hard cases, using solver\-derived unmasking orders as diagnostics wherever possible\.

## 6Conclusion

Masked diffusion language models offer flexible, any\-order generation, but complex reasoning tasks fundamentally require the decoding order to follow the logical flow of resolving facts\. While confidence\-based decoding serves as a useful local proxy, it catastrophically fails when reflecting a superficial shortcut rather than logical readiness\. We observe this failure across multiple domains: on long carry chains in addition, on long corridors in maze, on deep expression trees in ListOps, and through narrowed mask\-state coverage in Countdown\. Confidence\-aligned training amplifies these failures when the confidence trajectory diverges from the task’s true dependency structure, yet provides a benefit when they coincide, as in Sudoku\. Ultimately, training should not merely align with the inference policy; rather, the inference policy itself must fundamentally align with the underlying reasoning order\.

## References

- S\. Aman \(2026\)LogicDiff: logic\-guided denoising improves reasoning in masked diffusion language models\.arXiv preprint arXiv:2603\.26771\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px2.p1.1)\.
- H\. Asano, T\. Kozuno, K\. Saito, and Y\. Baba \(2026\)Where\-to\-unmask: ground\-truth\-guided unmasking order learning for masked diffusion language models\.arXiv preprint arXiv:2602\.09501\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px2.p1.1)\.
- J\. Austin, D\. D\. Johnson, J\. Ho, D\. Tarlow, and R\. Van Den Berg \(2021\)Structured denoising diffusion models in discrete state\-spaces\.InNeurIPS,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- Y\. Bengio, J\. Louradour, R\. Collobert, and J\. Weston \(2009\)Curriculum learning\.InICML,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px3.p1.1)\.
- C\. Cai and G\. Li \(2026\)Confidence\-based decoding is provably efficient for diffusion language models\.arXiv preprint arXiv:2603\.22248\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px2.p1.1)\.
- A\. Campbell, J\. Benton, V\. De Bortoli, T\. Rainforth, G\. Deligiannidis, and A\. Doucet \(2022\)A continuous time framework for discrete denoising models\.InNeurIPS,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- H\. Chang, H\. Zhang, L\. Jiang, C\. Liu, and W\. T\. Freeman \(2022\)Maskgit: masked generative image transformer\.InCVPR,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.29123#S2.SS2.SSS0.Px1.p1.2)\.
- K\. Gandhi, D\. Lee, G\. Grand, M\. Liu, W\. Cheng, A\. Sharma, and N\. D\. Goodman \(2024\)Stream of search \(sos\): learning to search in language\.arXiv preprint arXiv:2404\.03683\.Cited by:[§C\.4](https://arxiv.org/html/2605.29123#A3.SS4.SSS0.Px1.p1.1)\.
- R\. Geirhos, J\. Jacobsen, C\. Michaelis, R\. Zemel, W\. Brendel, M\. Bethge, and F\. A\. Wichmann \(2020\)Shortcut learning in deep neural networks\.Nature Machine Intelligence\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px4.p1.1)\.
- S\. Gong, R\. Zhang, H\. Zheng, J\. Gu, N\. Jaitly, L\. Kong, and Y\. Zhang \(2025\)Diffucoder: understanding and improving masked diffusion models for code generation\.arXiv preprint arXiv:2506\.20639\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- Z\. Huang, Z\. Chen, Z\. Wang, T\. Li, and G\. Qi \(2025\)Reinforcing the diffusion chain of lateral thought with diffusion language models\.arXiv preprint arXiv:2505\.10446\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px3.p1.1)\.
- M\. I\. Ivanitskiy, R\. Shah, A\. F\. Spies, T\. Räuker, D\. Valentine, C\. Rager, L\. Quirke, C\. Mathwin, G\. Corlouer, C\. D\. Behn,et al\.\(2023\)A configurable library for generating and manipulating maze datasets\.arXiv preprint arXiv:2309\.10498\.Cited by:[§C\.2](https://arxiv.org/html/2605.29123#A3.SS2.SSS0.Px1.p1.1)\.
- M\. Katz, H\. Kokel, and S\. Sreedharan \(2025\)Seemingly simple planning problems are computationally challenging: the countdown game\.arXiv preprint arXiv:2508\.02900\.Cited by:[§4\.3](https://arxiv.org/html/2605.29123#S4.SS3.SSS0.Px2.p1.4)\.
- B\. Kim, D\. Jeon, D\. Kim, W\. Jeung, and A\. No \(2026a\)Rainbow padding: mitigating early termination in instruction\-tuned diffusion LLMs\.InICLR,Cited by:[§C\.6](https://arxiv.org/html/2605.29123#A3.SS6.p1.1)\.
- J\. Kim, J\. Geuter, D\. Alvarez\-Melis, S\. Kakade, and S\. Chen \(2026b\)Stop training for the worst: progressive unmasking accelerates masked diffusion training\.arXiv preprint arXiv:2602\.10314\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px3.p1.1),[§D\.4](https://arxiv.org/html/2605.29123#A4.SS4.p1.6),[§1](https://arxiv.org/html/2605.29123#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.29123#S2.SS2.SSS0.Px2.p1.1)\.
- J\. Kim, K\. Shah, V\. Kontonis, S\. M\. Kakade, and S\. Chen \(2025\)Train for the worst, plan for the best: understanding token ordering in masked diffusions\.InICML,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.29123#S2.SS1.SSS0.Px1.p1.6),[§2\.2](https://arxiv.org/html/2605.29123#S2.SS2.SSS0.Px1.p1.2)\.
- N\. Lee, K\. Sreenivasan, J\. D\. Lee, K\. Lee, and D\. Papailiopoulos \(2024\)Teaching arithmetic to small transformers\.InICLR,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px4.p1.1),[§C\.1](https://arxiv.org/html/2605.29123#A3.SS1.SSS0.Px1.p1.1)\.
- A\. Lou, C\. Meng, and S\. Ermon \(2023\)Discrete diffusion modeling by estimating the ratios of the data distribution\.arXiv preprint arXiv:2310\.16834\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- S\. McLeish, A\. Bansal, A\. Stein, N\. Jain, J\. Kirchenbauer, B\. R\. Bartoldson, B\. Kailkhura, A\. Bhatele, J\. Geiping, A\. Schwarzschild,et al\.\(2024\)Transformers can do arithmetic with the right embeddings\.InNeurIPS,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px4.p1.1)\.
- N\. Nangia and S\. Bowman \(2018\)Listops: a diagnostic dataset for latent tree learning\.InNAACL,Cited by:[§C\.3](https://arxiv.org/html/2605.29123#A3.SS3.SSS0.Px1.p1.2),[§4\.3](https://arxiv.org/html/2605.29123#S4.SS3.SSS0.Px1.p1.1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li \(2025\)Large language diffusion models\.arXiv preprint arXiv:2502\.09992\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- J\. Ou, S\. Nie, K\. Xue, F\. Zhu, J\. Sun, Z\. Li, and C\. Li \(2024\)Your absorbing discrete diffusion secretly models the conditional distributions of clean data\.arXiv preprint arXiv:2406\.03736\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- F\. Z\. Peng, Z\. Bezemek, J\. Rector\-Brooks, S\. Zhang, M\. M\. Bronstein, A\. Zhang, J\. Bose, and A\. Tong \(2026\)Planner aware path learning in diffusion language models training\.InICLR,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px3.p1.1),[§D\.3](https://arxiv.org/html/2605.29123#A4.SS3.p1.10),[§D\.3](https://arxiv.org/html/2605.29123#A4.SS3.p1.9),[§1](https://arxiv.org/html/2605.29123#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.29123#S2.SS2.SSS0.Px2.p1.1)\.
- S\. S\. Sahoo, M\. Arriola, Y\. Schiff, A\. Gokaslan, E\. Marroquin, J\. T\. Chiu, A\. Rush, and V\. Kuleshov \(2024\)Simple and effective masked diffusion language models\.InNeurIPS,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- K\. Shah, N\. Dikkala, X\. Wang, and R\. Panigrahy \(2024\)Causal language modeling can elicit search and reasoning capabilities on logic puzzles\.InNeurIPS,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px4.p1.1),[§C\.5](https://arxiv.org/html/2605.29123#A3.SS5.SSS0.Px4.p3.1)\.
- J\. Shi, K\. Han, Z\. Wang, A\. Doucet, and M\. K\. Titsias \(2024\)Simplified and generalized masked diffusion for discrete data\.InNeurIPS,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- E\. J\. St Sauver \(2026\)Think first, diffuse fast: improving diffusion language model reasoning via autoregressive plan conditioning\.arXiv preprint arXiv:2603\.13243\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px2.p1.1)\.
- A\. Stuart \(2008\)SudokuWiki: solving strategies\.Note:Online resourceExternal Links:[Link](https://www.sudokuwiki.org/Strategy_Families)Cited by:[§C\.5](https://arxiv.org/html/2605.29123#A3.SS5.SSS0.Px4.p3.1),[§4\.3](https://arxiv.org/html/2605.29123#S4.SS3.SSS0.Px3.p1.2)\.
- t\-dillon \(2019\)Tdoku: a fast sudoku solver and generator\.Note:GitHub repositoryExternal Links:[Link](https://github.com/t-dillon/tdoku)Cited by:[§C\.5](https://arxiv.org/html/2605.29123#A3.SS5.SSS0.Px1.p1.5),[§4\.3](https://arxiv.org/html/2605.29123#S4.SS3.SSS0.Px3.p1.2)\.
- G\. Wang, J\. Li, Y\. Sun, X\. Chen, C\. Liu, Y\. Wu, M\. Lu, S\. Song, and Y\. A\. Yadkori \(2025a\)Hierarchical reasoning model\.arXiv preprint arXiv:2506\.21734\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px4.p1.1),[§C\.5](https://arxiv.org/html/2605.29123#A3.SS5.SSS0.Px1.p1.5)\.
- G\. Wang, G\. Turok, Y\. Schiff, M\. Arriola, and V\. Kuleshov \(2025b\)D2: improved techniques for training reasoning diffusion language models\.arXiv preprint arXiv:2509\.21474\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px3.p1.1)\.
- J\. Ye, J\. Gao, S\. Gong, L\. Zheng, X\. Jiang, Z\. Li, and L\. Kong \(2025a\)Beyond autoregression: discrete diffusion for complex reasoning and planning\.InICLR,Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[§C\.4](https://arxiv.org/html/2605.29123#A3.SS4.SSS0.Px1.p1.1),[§C\.4](https://arxiv.org/html/2605.29123#A3.SS4.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.
- J\. Ye, Z\. Xie, L\. Zheng, J\. Gao, Z\. Wu, X\. Jiang, Z\. Li, and L\. Kong \(2025b\)Dream 7b: diffusion large language models\.arXiv preprint arXiv:2508\.15487\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.29123#S2.SS2.SSS0.Px1.p1.2)\.
- S\. Zhao, D\. Gupta, Q\. Zheng, and A\. Grover \(2025\)D1: scaling reasoning in diffusion large language models via reinforcement learning\.arXiv preprint arXiv:2504\.12216\.Cited by:[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px1.p1.1),[Appendix B](https://arxiv.org/html/2605.29123#A2.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.29123#S1.p1.1)\.

## Appendix ALimitations

Two methodological assumptions deserve mention\. First, we train small task\-specific transformers \(≈0\.4\\approx 0\.4M to≈21\\approx 21M parameters\) and use greedy \(deterministic\) decoding throughout\. The confidence\-shortcut mechanism plausibly persists at larger scales and under stochastic decoding—locally\-easy positions whose values depend on long unresolved chains pervade extended reasoning—but in those settings the dependency structure is harder to identify exactly and individual wrong commits become probabilistic rather than deterministic, so the failure mode is harder to characterize cleanly\.

Second, our quantitative comparisons assume access to a clean dependency\-respecting reveal order\. This is exact for addition and maze, but for Sudoku and Countdown we approximate it with solver\-derived orders involving search and backtracking, so the gap between confidence decoding and these orders conflates the confidence shortcut with the inherent suboptimality of backtracking\-based sequences\.

## Appendix BRelated Works

#### Discrete and masked diffusion models\.

Discrete diffusion models provide an alternative to autoregressive language modeling by iteratively denoising categorical sequences\(Austinet al\.,[2021](https://arxiv.org/html/2605.29123#bib.bib19); Campbellet al\.,[2022](https://arxiv.org/html/2605.29123#bib.bib20); Louet al\.,[2023](https://arxiv.org/html/2605.29123#bib.bib10)\)\. Masked diffusion models \(MDMs\) instantiate this with an absorbing\[MASK\]state and a denoiser trained to reconstruct masked tokens\(Sahooet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib12); Shiet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib11); Ouet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib13)\), and have recently scaled to language\-model settings competitive on reasoning, planning, and code\-generation benchmarks\(Nieet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib5); Yeet al\.,[2025b](https://arxiv.org/html/2605.29123#bib.bib8); Gonget al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib21); Yeet al\.,[2025a](https://arxiv.org/html/2605.29123#bib.bib7); Zhaoet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib22)\)\.

#### Decoding policies and order\-aware inference\.

The de facto MDM inference default is confidence\-based parallel decoding—revealing positions by top probability\(Changet al\.,[2022](https://arxiv.org/html/2605.29123#bib.bib15)\), top\-two margin\(Kimet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib3)\), or negative entropy\(Yeet al\.,[2025b](https://arxiv.org/html/2605.29123#bib.bib8)\)\.Kimet al\.\([2025](https://arxiv.org/html/2605.29123#bib.bib3)\)relate this to the order\-agnostic training objective, andCai and Li \([2026](https://arxiv.org/html/2605.29123#bib.bib40)\)show an entropy\-sum variant is provably efficient under conditions unrelated to logical\-flow structure\. Structural alternatives to confidence decoding have also been proposed: logic\-role classifiers that unmask premises first\(Aman,[2026](https://arxiv.org/html/2605.29123#bib.bib23)\), planners that imitate ground\-truth oracles\(Asanoet al\.,[2026](https://arxiv.org/html/2605.29123#bib.bib39)\), and autoregressive plans prepended as frozen scaffolds\(St Sauver,[2026](https://arxiv.org/html/2605.29123#bib.bib24)\)\. We treat these decoding policies as fixed inference choices and ask how training\-time alignment with them shapes the trained representation\.

#### Planner\-aligned training\.

The closest methods bias the MDM loss or mask distribution toward an inference\-time planner\. PAPL\(Penget al\.,[2026](https://arxiv.org/html/2605.29123#bib.bib17)\)reweights the per\-token loss by the model’s own confidence; PUMA\(Kimet al\.,[2026b](https://arxiv.org/html/2605.29123#bib.bib4)\)replaces the i\.i\.d\. forward process with a teacher\-forced confidence\-based chain, asymptotically matching inference\-time mask\-state marginals\. Reinforcement\-learning approaches likewise concentrate compute along inference\-aligned trajectories\(Zhaoet al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib22); Wanget al\.,[2025b](https://arxiv.org/html/2605.29123#bib.bib26); Huanget al\.,[2025](https://arxiv.org/html/2605.29123#bib.bib25)\)\. Our results identify a complementary failure mode: when the inference policy diverges from the task’s logical\-flow order, planner alignment trades away the exploration random masking provides, and the conditionals lost are precisely those needed on adversarial reasoning inputs\. Curriculum\-style training distributions\(Bengioet al\.,[2009](https://arxiv.org/html/2605.29123#bib.bib1)\)face an analogous coverage\-vs\-concentration trade\-off in a different setting\.

#### Algorithmic reasoning and shortcut learning\.

The lookahead shortcut on addition instantiates the broader pattern of shortcut learning\(Geirhoset al\.,[2020](https://arxiv.org/html/2605.29123#bib.bib37)\), where models exploit distributionally\-correlated cues that fail out\-of\-distribution\. Multi\-digit addition is a long\-standing length\-generalization probe for transformers\(Leeet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib30); McLeishet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib38)\), and recent benchmarks expose long\-range dependency failures in masked generation\(Wanget al\.,[2025a](https://arxiv.org/html/2605.29123#bib.bib28); Shahet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib6)\)\. We identify training\-distribution narrowing under confidence\-aligned objectives as a specific mechanism producing such shortcut\-like failures, and provide diagnostic signatures distinguishing failure modes within this class\.

## Appendix CData formats and generation

This appendix specifies the token\-level format, generation procedure, and difficulty stratification for each domain\. For each domain we indicate which positions of the sequence constitute the*prompt region*\(visible to the model from the start of generation, never masked\) and which constitute the*answer region*\(masked at training time and predicted by the model\)\. Probe NLL and generation accuracy are computed over the answer region only\.

### C\.1Addition

#### Task and source\.

We follow the addition task formulation ofLeeet al\.\([2024](https://arxiv.org/html/2605.29123#bib.bib30)\): given two operands, the model produces their sum digit\-by\-digit\. We use3232\-digit operands\.

#### Vocabulary\.

The vocabulary consists of digits0–99, the symbols\+and=, plusMandPAD\.

#### Sequence format\.

A single training example has the form

a0​a1​⋯​a31​\+​b0​b1​⋯​b31​=⏟prompt​c0​c1​⋯​c32⏟answer,\\underbrace\{a\_\{0\}a\_\{1\}\\cdots a\_\{31\}\\ \\texttt\{\+\}\\ b\_\{0\}b\_\{1\}\\cdots b\_\{31\}\\ \\texttt\{=\}\}\_\{\\text\{prompt\}\}\\ \\underbrace\{c\_\{0\}c\_\{1\}\\cdots c\_\{32\}\}\_\{\\text\{answer\}\},where digits are written in standard most\-significant\-first order with zero\-padding to fixed width\. The prompt occupies6666characters; the answer occupies3333characters \(one extra digit accommodates the carry\-out of the most significant addition\)\. For example,a=47a=47,b=38b=38\(displayed at44\-digit width\) yields the sequence0047\+0038=00085\.

#### Generation\.

Operandsa,ba,bare sampled uniformly from\{0,…,1032−1\}\\\{0,\\ldots,10^\{32\}\-1\\\}\. The sumc=a\+bmod1032c=a\+b\\mod 10^\{32\}is computed deterministically\.

#### Difficulty stratification\.

For each digit positionii, letsi=ai\+bis\_\{i\}=a\_\{i\}\+b\_\{i\}\. By Equation[3\.2](https://arxiv.org/html/2605.29123#S3.Ex6), the carry dependency propagates through positioniiif and only ifsi=9s\_\{i\}=9; forsi≤8s\_\{i\}\\leq 8the carry is killed \(becomes0\) and forsi≥10s\_\{i\}\\geq 10the carry is generated \(becomes11\), regardless of the incoming carry\. The carry chain length of an instance is the longest run of consecutive positions withsi=9s\_\{i\}=9, bounded by a generate or kill event\. This equals the length of the dependency window required to resolve the carry at the end of the chain, matching the rare\-event probability in Equation[3\.3](https://arxiv.org/html/2605.29123#S3.Ex8)\. Under uniform operand sampling, chains of length≥28\\geq 28appear in well under1%1\\%of training samples\.

#### Set sizes\.

Train:20,00020\{,\}000instances\. Test:10,00010\{,\}000random instances plus500500instances per stratification level for the carry\-chain sweep\.

### C\.2Maze

#### Task and source\.

We adapt the maze\-completion task ofIvanitskiyet al\.\([2023](https://arxiv.org/html/2605.29123#bib.bib31)\): given the start and goal positions of a10×1010\\times 10logical maze, the model fills in the wall/corridor assignment of every cell so that a single connected path links start to goal\.

#### Vocabulary\.

Cell\-content tokens\#\(wall\),\.\(corridor cell, in puzzle\),S\(start\),E\(end\), and0/1\(solution:0=0=corridor not on path,1=1=corridor on path\), plus=,\[MASK\],\[PAD\]\.

#### Sequence format\.

A maze is encoded as a21×2121\\times 21wall\-and\-corridor grid \(every other row and column is a wall layer between corridor cells\), serialized in row\-major order without separators\. The prompt half shows the maze structure with all corridor cells unmarked \(\.\); the answer half shows the same grid with each corridor cell labeled0or11depending on whether it lies on the start\-to\-end path\. A short example illustrates the format on a3×33\\times 3logical maze \(rendered in7×77\\times 7grid, linebreaks added for clarity—actual sequences are flat strings\):

```
prompt:     answer:
#######     #######
#S.#..#     #S0#00#
#.#.#.#     #1#0#0#
#.#.#.#     #1#0#0#
#.....#     #11111#
#####.#     #####1#
#....E#     #0000E#
#######     #######
```

The full sequence ispuzzle=solutionwherepuzzleandsolutionare the row\-major flattened grids \(441441cells each for the10×1010\\times 10logical maze used in our experiments\)\.

#### Generation\.

Mazes are generated by randomized depth\-first search starting from a random corridor cell, producing a connected tree of corridors\. Start and goal are chosen as the most distant pair of corridor cells along the resulting tree, ensuring a unique simple path between them\.

#### Difficulty stratification\.

The backbone is the simple corridor path from start to goal\. Branching points are corridor cells on the backbone with three or more accessible neighbors\. Corridors are maximal runs of consecutive backbone cells between branching points\. We stratify test instances by maximum corridor length\.

#### Set sizes\.

Train:10,00010\{,\}000mazes\. Test:50005000mazes with300300mazes per corridor\-length stratum\.

### C\.3ListOps

#### Task and source\.

We use the ListOps task introduced byNangia and Bowman \([2018](https://arxiv.org/html/2605.29123#bib.bib29)\)with operatorsmin,max,medianandsum\-mod\-10, restricted to operands0–99\. We modify the original formulation to require the model to produce intermediate values for every sub\-expression, not only the root, so the answer region exposes the full computation graph\.

#### Vocabulary\.

Digits0–99, operator charactersX\(max\),N\(min\),D\(median\),S\(sum\-mod\-10\), brackets\[,\], equality marker=, plus\[MASK\]and\[PAD\]\. The trace rainbow\-pads to a fixed maximum length \(Appendix[C\.6](https://arxiv.org/html/2605.29123#A3.SS6)\)\.

#### Sequence format\.

The expression appears in prefix\-bracketed form as the prompt, followed by the post\-order evaluation trace \(one digit per sub\-expression value, inner\-to\-outer\)\. For the input\[X 3 \[N 2 5\] 7\]\(max of\{3,7,min⁡\(2,5\)\}=7\\\{3,7,\\min\(2,5\)\\\}=7\):

\[X 3 \[N 2 5\] 7\] =⏟prompt​27⏟trace: inner=2, outer=7\.\\underbrace\{\\texttt\{\[X 3 \[N 2 5\] 7\] =\}\}\_\{\\text\{prompt\}\}\\ \\underbrace\{\\texttt\{27\}\}\_\{\\text\{trace: inner=2, outer=7\}\}\.The trace gives the result of each sub\-expression in evaluation order: the inner\[N 2 5\]evaluates to22, then the outer\[X …\]to77\. The model predicts every digit of the trace\.

#### Generation\.

Trees are sampled recursively: at each non\-leaf, an operator is chosen uniformly; the number of children is sampled from a fixed distribution peaked at22–44\. Leaf operands are sampled uniformly from0–99\. We generate trees of target depth11–66to match the stratification axis\.

#### Difficulty stratification\.

Tree depth \(longest leaf\-to\-root path\)\.

#### Set sizes\.

Train:20,00020\{,\}000trees, depths11–55\. We apply depth\-based decay with a 0\.5 ratio to intentionally make difficult examples rare in the training distribution\. Test:1,0001\{,\}000trees per depth bin\.

### C\.4Countdown

#### Task and source\.

The countdown task asks the model to combine four input numbers using\{\+,−,×,/\}\\\{\+,\-,\\times,/\\\}to reach a target value\. We use the dataset ofYeet al\.\([2025a](https://arxiv.org/html/2605.29123#bib.bib7)\), with the random generation procedure and multiplicity annotations followingGandhiet al\.\([2024](https://arxiv.org/html/2605.29123#bib.bib32)\)\.

#### Vocabulary\.

Digits0–99, operators\+,−,∗,/\+,\-,\*,/, equality marker=, comma,, plus\[MASK\]and\[PAD\], and rainbow padding characters\.

#### Sequence format\.

Inputs and target form the prompt; the equation chain forms the answer\. The format followsYeet al\.\([2025a](https://arxiv.org/html/2605.29123#bib.bib7)\): four input numbers and one target, comma\-separated, followed by an equation chain of three binary operations:

86,28,13,31,96⏟prompt: 4 inputs, target​=​86\+28=114,31\-13=18,114\-18=96⏟answer: 3\-step equation chain\.\\underbrace\{\\texttt\{86,28,13,31,96\}\}\_\{\\text\{prompt: 4 inputs, target\}\}\\ \\texttt\{=\}\\ \\underbrace\{\\texttt\{86\+28=114,31\-13=18,114\-18=96\}\}\_\{\\text\{answer: 3\-step equation chain\}\}\.Each step in the chain consumes two operands \(drawn from the input pool or freshly produced intermediates\) and writes a new number\. The final step’s result equals the target\. Sequence length varies across instances and is rainbow\-padded\.

#### Difficulty stratification\.

Solution multiplicitymmis the number of distinct equation chains that reach the target, computed by the dataset’s annotator\. We use binsm∈\[1,3\]m\\in\[1,3\]\(rare\-hard\),m∈\[4,10\]m\\in\[4,10\]\(medium\), andm≥11m\\geq 11\(common easy\)\.

#### Sub\-sampling\.

Following the rare\-hard framing, we retain only10%10\\%of training puzzles withm∈\[1,3\]m\\in\[1,3\], while higher\-multiplicity puzzles are kept at natural frequency\. The test set retains the natural distribution\. This puts the rare\-hard stratum at∼5%\\sim 5\\%of training samples but∼48%\\sim 48\\%of test samples, isolating each method’s ability to generalize from under\-represented training patterns\.

#### Set sizes\.

Train: approximately400,000400\{,\}000puzzles after sub\-sampling\. Test:1,0001\{,\}000puzzles at natural distribution\.

### C\.5Sudoku

#### Task and source\.

Standard9×99\\times 9sudoku\. We use thesudoku\-extremedataset\(Wanget al\.,[2025a](https://arxiv.org/html/2605.29123#bib.bib28)\), originally compiled for the Hierarchical Reasoning Model\. The dataset contains approximately3\.83\.8million puzzles spanning a wide range of difficulties, each annotated with a rating computed by the tdoku solver\(t\-dillon,[2019](https://arxiv.org/html/2605.29123#bib.bib33)\), defined as the number of guesses \(binary decision points along the search tree\) required by an MRV\-heuristic backtracking solver\. Rating zero indicates the puzzle is solvable by constraint propagation alone; the maximum rating in the dataset is465465, with99%99\\%of puzzles below rating101101\.

#### Vocabulary\.

Digits11–99, blank token\., equality marker=, plus\[MASK\]and\[PAD\]\. Total1313tokens\.

#### Sequence format\.

The puzzle and its solution are flattened in row\-major order\. The prompt shows the puzzle with blank cells marked by a placeholder; the answer is the complete8181\-digit solution:

prompt:53\.\.7\.\.\.\.6\.\.195\.\.\.\.98\.\.\.\.6\.8\.\.\.6\.\.\.34\.\.8\.3\.\.17\.\.\.2\.\.\.6\.6\.\.\.\.28\.\.\.\.419\.\.5\.\.\.\.8\.\.79answer:534678912672195348198342567859761423426853791713924856961537284287419635345286179where the prompt’s8181cells include blanks marked\.and the answer is the corresponding8181\-cell solution\. The model sees the puzzle layout including which cells are blank, and predicts the complete solution\. Generation accuracy is measured over the8181\-cell solution; probe NLL and stratified accuracy are restricted to the originally blank cells \(averaging∼56\\sim 56per puzzle\)\.

#### Difficulty stratification\.

We use two complementary axes\.

*Rating tiers\.*Six tiers determined from the rating distribution ofsudoku\-extremevia quantile boundaries on non\-zero ratings\. The median rating is1717; the7575th,9090th,9595th, and9999th percentiles are3434,5151,6565, and101101respectively\. Tier boundaries are placed at these quantiles, yielding:

- •*easy*: rating0\(singles only\)
- •*medium*: rating11–1717\(≤p50\\leq p\_\{50\}of non\-zero\)
- •*hard*: rating1818–3434\(p50p\_\{50\}–p75p\_\{75\}\)
- •*very hard*: rating3535–5151\(p75p\_\{75\}–p90p\_\{90\}\)
- •*extreme*: rating5252–101101\(p90p\_\{90\}–p99p\_\{99\}\)
- •*top 1%*: rating≥102\\geq 102\(\>p99\>p\_\{99\}, up to dataset max465465\)\.

*Per\-cell technique levels\.*For decoding\-strategy analysis, we classify each blank cell by the deepest solving technique required to deduce its value, following the standard Sudoku\-solving technique hierarchy\(Stuart,[2008](https://arxiv.org/html/2605.29123#bib.bib34); Shahet al\.,[2024](https://arxiv.org/html/2605.29123#bib.bib6)\)\. We use a five\-level hierarchy:

- •TL0 \(singles\): Naked singles and hidden singles—cells whose value is forced by direct row/column/box elimination\.
- •TL1 \(subsets and intersections\): Naked/hidden pairs and triples, pointing pairs, box\-line reductions\.
- •TL2 \(fish and wings\): X\-Wing, Swordfish, XY\-Wing patterns\.
- •TL3 \(forcing chains\): Single\-step hypothesis\-contradiction chains\.
- •TL4 \(search\): Cells that cannot be deduced by any of the above and require bifurcation\. This is the true “no stepping stone” regime\.

The technique\-difficulty decoding strategy reveals cells in increasing TL order; the constraint\-propagation reveals them in the order an implementation of the above hierarchy uncovers them\.

#### Set sizes\.

Train: a balanced sample with exponential decay across rating tiers \(decay factor0\.010\.01, yielding approximately570,000570\{,\}000puzzles dominated by*easy*and*medium*but with at least one puzzle from each tier\)\. Test:1,0001\{,\}000each for*easy*and*medium*;500500for*hard*;200200for*very hard*;100100for*extreme*;500500for*top 1%*\.

### C\.6Rainbow padding for variable\-length answers

For domains with variable answer length \(ListOps, countdown\), the answer region is padded to a fixed length to support batched training\. Rather than a single repeated pad token, we use a deterministic cyclic pattern over a small set of distinct pad tokens, followingKimet al\.\([2026a](https://arxiv.org/html/2605.29123#bib.bib35)\)\. The cyclic structure distributes probability mass across the padding region, preventing pad positions from dominating early steps of confidence\-based decoding\.

## Appendix DMore Details for the Experiments

### D\.1Architecture

Table[4](https://arxiv.org/html/2605.29123#A4.T4)describes the model architectures used in the experiments\. All architectures are pre\-norm transformers with bidirectional self\-attention, GELU activations, learned absolute positional embeddings, weight\-tied input/output projections, and an MLP hidden dimension of3×embedding dim3\\times\\text\{embedding dim\}\.

Table 4:Model architecture per domain\.
### D\.2Training

Table[5](https://arxiv.org/html/2605.29123#A4.T5)describes the training configuration used in the experiments\.

Table 5:Training hyperparameters\.All models trained with AdamW \(β=\(0\.9,0\.95\)\\beta=\(0\.9,0\.95\)\), weight decay0\.010\.01, gradient clipping at1\.01\.0\. Early stopping is disabled across all runs to remove convergence speed as a confound between methods\. Training was performed on a single A100 GPU, and each experiment took approximately 4 hours for addition, maze, and ListOps, and 10 hours for Countdown and Sudoku\.

### D\.3PAPL

PAPL\(Penget al\.,[2026](https://arxiv.org/html/2605.29123#bib.bib17)\)is a one\-line modification to the standard MDM loss\. Given a batch of partially masked sequences and the model’s predicted log\-probabilities, PAPL replaces the uniform per\-token weighting with

wi=1\|𝐌\|​\(1\+α⋅w~i\),w~i=softmaxj∈𝐌​\[1τ​log⁡pθ​\(xj∣𝐱𝐌¯\)\]i,w\_\{i\}=\\frac\{1\}\{\|\\mathbf\{M\}\|\}\\Big\(1\+\\alpha\\cdot\\tilde\{w\}\_\{i\}\\Big\),\\quad\\tilde\{w\}\_\{i\}=\\mathrm\{softmax\}\_\{j\\in\\mathbf\{M\}\}\\\!\\left\[\\frac\{1\}\{\\tau\}\\,\\log p\_\{\\theta\}\\bigl\(x\_\{j\}\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\\bigr\)\\right\]\_\{i\},where the softmax is taken over masked positions only,α≥0\\alpha\\geq 0controls the strength of confidence\-aligned reweighting \(recovering standard MDM atα=0\\alpha=0\), andτ\\taucontrols the sharpness of the planner distribution\. The reweighting requires only the per\-position predicted log\-probability of the ground\-truth token—already available during the standard loss computation—so the entire change consists of constructingw~i\\tilde\{w\}\_\{i\}from these log\-probabilities and multiplying the per\-position cross\-entropy by1\+α​w~i1\+\\alpha\\tilde\{w\}\_\{i\}before averaging\. We useα=5\\alpha=5andτ=1\\tau=1, the values used in the main experiments ofPenget al\.\([2026](https://arxiv.org/html/2605.29123#bib.bib17)\), for maze, ListOps, countdown, and sudoku\. For addition,α=5\\alpha=5collapses representation to zero accuracy at every chain length, so we report PAPL withα=1\\alpha=1for addition\.

### D\.4PUMA

PUMA\(Kimet al\.,[2026b](https://arxiv.org/html/2605.29123#bib.bib4)\)replaces the i\.i\.d\. forward process with teacher\-forced trajectories generated under the model’s current confidence\-based policy\. Concretely, given an integer hyperparameterKK, the answer lengthLLis partitioned intoKKstages, each unmasking roughlyL/KL/Ktokens; smallerKKyields coarser stages \(more tokens unmasked per step\), and largerKKyields finer stages closer to the inference\-time confidence\-based reveal trajectory\.

#### Streaming buffer\.

A naive implementation of PUMA would require generating each teacher\-forced trajectory from scratch per training example, costingKKforward passes per gradient step\. Instead, following the original implementation, we maintain a streaming buffer ofBBteacher\-forced chains \(one per batch slot\)\. At each gradient step, the model is trained on the current state of each chain, after which one stage of unmasking is applied to advance each chain to its next state\. When a chain reaches full unmasking \(end of itsKKstages\), it is replaced with a fresh fully\-masked sequence\. This amortizes the chain\-generation cost acrossKKtraining steps, making PUMA’s per\-step compute essentially identical to standard MDM training—one forward pass per gradient step, with the chain\-advancement forward pass being the same one used to compute confidence scores\. The training distribution thus consists of intermediate states sampled uniformly across theKKstages of confidence\-based reveal trajectories\.

#### Stochastic stage size\.

The number of tokens unmasked at each stage is not fixed at⌊L/K⌋\\lfloor L/K\\rfloor\. Following the original implementation, we add a small amount of randomness to avoid imbalanced exposure to particular mask patterns: at each stage the actual number of tokens revealed is drawn from a small range aroundL/KL/K\.

#### KK\-schedule\.

Because the model’s confidence policy is unreliable early in training, the original PUMA paper uses aKK\-schedule that ramps from a smallerKstartK\_\{\\text\{start\}\}to a largerKendK\_\{\\text\{end\}\}over the first portion of training and is held constant thereafter\. The original paper usesK=8K=8fixed for sudoku \(L=81L=81, so∼10\\sim 10tokens revealed per step\) and a ramp fromK=12K=12toK=42K=42for TinyGSM \(max lengthL=512L=512, with the upper end giving∼12\\sim 12tokens per step\)\. The authors note these settings trade some training\-inference alignment for compute savings during pretraining\. Since our experiments are designed to study the consequences of such alignment rather than to optimize compute, we useKKvalues that more faithfully realize PUMA’s design intent \(a few tokens per step\), scaled to each domain’s answer length\. Due to computational constraints, the maze setting follows approximately the same schedule as TinyGSM in the original paper\.

Table 6:PUMAKK\-schedule per domain\. TheKK\-schedule ramps over the first third of training and is held constant thereafter, following the original PUMA recommendation\. TheL/KendL/K\_\{\\text\{end\}\}column gives the asymptotic tokens revealed per step at the schedule’s tightest setting\.

### D\.5Decoding strategies

Given the model’s predicted distributionpθ\(⋅∣𝐱𝐌¯\)p\_\{\\theta\}\(\\cdot\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\)at each masked positioni∈𝐌i\\in\\mathbf\{M\}, we evaluate the following strategies for selecting the next positioni′i^\{\\prime\}to reveal:

Confidence:i′=arg⁡maxi∈𝐌⁡cθi,\\displaystyle i^\{\\prime\}=\\arg\\max\_\{i\\in\\mathbf\{M\}\}\\,c\_\{\\theta\}^\{i\},Random:i′∼Unif​\(𝐌\),\\displaystyle i^\{\\prime\}\\sim\\mathrm\{Unif\}\(\\mathbf\{M\}\),Algorithmic Optimal:i′=πtask​\(𝐌,𝐱𝐌¯\),\\displaystyle i^\{\\prime\}=\\pi\_\{\\mathrm\{task\}\}\\bigl\(\\mathbf\{M\},\\,\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\\bigr\),wherecθi=maxv∈𝒱⁡pθ​\(v∣𝐱𝐌¯\)c\_\{\\theta\}^\{i\}=\\max\_\{v\\in\\mathcal\{V\}\}\\,p\_\{\\theta\}\(v\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\)is the top\-11confidence at positionii\([Section˜2\.2](https://arxiv.org/html/2605.29123#S2.SS2)\) andπtask\\pi\_\{\\mathrm\{task\}\}is a polynomial\-time deterministic function for addition \(LSB\-first\) and maze \(dead\-end\-filling\)\. For ListOps, countdown, and sudoku, where the underlying task is NP\-hard or admits multiple valid orderings, we evaluate solver\-derived strategies—layered post\-order for ListOps \(depth\-first, ties broken by confidence\), step\-sequential for countdown \(left\-to\-right within the equation chain\), and constraint\-propagation order or technique\-difficulty order for sudoku\.

Once the positioni′i^\{\\prime\}is chosen, the token to commit is the argmax of the predictive distribution at that position:

xi′←arg⁡maxv∈𝒱⁡pθ​\(v∣𝐱𝐌¯\)\.x\_\{i^\{\\prime\}\}\\;\\leftarrow\\;\\arg\\max\_\{v\\in\\mathcal\{V\}\}\\,p\_\{\\theta\}\\bigl\(v\\mid\\mathbf\{x\}\_\{\\overline\{\\mathbf\{M\}\}\}\\bigr\)\.We do not introduce stochasticity into token selection \(no Gumbel noise, top\-pp/top\-kksampling, or temperature\) in order to isolate the effect of the training scheme from the confounding effect of decoding randomness\. All decoding produces one sample per puzzle; generation accuracy is exact\-match against the answer, except for sudoku \(cell accuracy\)\.

## Appendix EAdditional results

The main body reports each task under its most informative decode pair: typically confidence\-based decoding versus an efficient algorithmic optimal where one exists, or versus uniform random reveal where it does not\. For three tasks we report additional decoding results here\.

For addition, we extend[Table˜1](https://arxiv.org/html/2605.29123#S3.T1)with uniform random reveal to align the presentation with the other domains; the same three schemes trained for the addition experiment are evaluated under random decode \([Table˜7](https://arxiv.org/html/2605.29123#A5.T7)\)\.

For countdown and sudoku, we report*solver\-order*decoding strategies—reveal orders derived from a backtracking solver’s traversal of the solution rather than from an efficient algorithmic optimal\. Because the underlying tasks are NP\-hard or admit multiple valid orderings, no polynomial\-time oracle exists; the solver orders we use are deterministic and reproducible, but their reveal sequence corresponds to a search\-with\-backtracking traversal, not an efficient one\.

Table 7:Accuracy on3232\-digit addition under uniform random decoding, by carry\-chain length\.Table 8:Solver\-order decoding results, compared with confidence and random\-reveal numbers in[table˜3](https://arxiv.org/html/2605.29123#S4.T3)\. \(a\) Countdown under step\-sequential decode, by solution multiplicity\. \(b\) Sudoku under constraint\-propagation and technique\-difficulty orders, by rating tier; cell\-level accuracy\.\(a\)Countdown step\-sequential decode\.
\(b\)Sudoku solver\-order reveals\.

### E\.1Addition: uniform random decoding

Random decoding collapses accuracy for all three schemes\. Even random masking—which trains on the full i\.i\.d\. mask distribution—drops from0\.9920\.992under confidence decode to0\.4240\.424at chain≥4\\geq 4and to near zero at longer chains\. The collapse is structural: addition’s carry chain has no redundancy, so any reveal order that commits a chain\-MSB cell before its carry\-in is resolved produces a wrong digit that propagates through the rest of the answer\.

### E\.2Countdown: step\-sequential reveal

The step\-sequential reveal resolves one equation step at a time, left\-to\-right within the equation chain\. This is a task\-natural generation order: high\-multiplicity instances admit multiple valid chains, so left\-to\-right is one consistent choice among several\. The order tracks how a forward solver would write out the chain once it has committed to a set of operands at each step, and is backtracking\-based in the sense that an arbitrary commit at one step may require a different chain to reach the target on harder instances\. Results across all multiplicity bins are within a few points of confidence decode\.

### E\.3Sudoku: solver\-order reveals

We evaluate two solver\-derived reveal orders, neither of which is an efficient algorithmic oracle\. The*constraint\-propagation order*reveals cells in the order a backtracking solver uncovers them\. The*technique\-difficulty order*reveals cells in increasing order of the deepest solving technique required \(TL0–TL44; Appendix[C](https://arxiv.org/html/2605.29123#A3)\), which is itself derived from a backtracking\-based solver\.

Both reveal orders produce strictly worse accuracy than confidence decode for every training scheme at every difficulty tier \([Table˜8\(b\)](https://arxiv.org/html/2605.29123#A5.T8.st2)vs\.[Table˜3](https://arxiv.org/html/2605.29123#S4.T3)\), with constraint\-propagation order ahead of technique\-difficulty order throughout\. The relative pattern across training schemes is preserved: PUMA leads random masking at every tier above easy\.

Similar Articles

Adaptive Order Policies for Masked Diffusion

arXiv cs.LG

Proposes learning the unmasking order in masked diffusion models using a lightweight policy network, with a weighted loss that outperforms heuristics on combinatorial tasks and protein design.