When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs
Summary
This paper investigates the effectiveness of top-1 collapse rate as a stability monitor for short-horizon LoRA fine-tuning of discrete diffusion language models, finding it has zero precision, and proposes max gradient norm as a more reliable alternative with higher precision and F1 score on LLaDA-family models.
# Calibrating LoRA Monitors for Masked Diffusion LMs
Source: [https://arxiv.org/html/2606.24119](https://arxiv.org/html/2606.24119)
Lucky Verma (Independent Researcher, luckyv1@umbc.edu) and Pratik Yadav (University of Maryland, Baltimore County, pratiky1@umbc.edu)
###### Abstract
Discrete diffusion language model (DLM) fine-tuning inherits inexpensive diagnostics from denoising-time confidence monitors, but their meaning under PEFT training is untested. We test top-1 argmax concentration as a collapse warning. Across 816 LoRA/PEFT configurations from three DLM families, the warning fires for every configuration while logs record 0/816 actual collapses at the 200-step horizon, giving zero precision. The cause is pre-equilibrium saturation: top-1 concentration is already high before optimization and quickly becomes insensitive to final training stability. We then evaluate max LoRA gradient norm, a parameter-side signal that samples gradient routing rather than token concentration. On a pooled held-out LLaDA-family split, a train-optimized threshold identifies top-decile final-loss configurations with precision 0.68 and $F_1=0.79$, above the all-positive top-1 baseline even at the lower split-bootstrap confidence bound. Autoregressive controls and cross-family threshold failures bound the result to short-horizon DLM-LoRA inspection rather than a universal collapse detector. Workflow: drop top-1 as a PEFT alarm, log max-gradient early in training, and calibrate thresholds per DLM family before routing runs for inspection.
Figure 1: The transferred top-1 warning has zero precision, while max-gradient gives a LLaDA-family triage signal. (A) Across the 816 DLM PEFT configurations, the top-1 warning fires in every configuration and observed collapse is 0/816; AR controls have 0/360 collapses and no top-1 warning by definition. (B) Stable-vs-unstable max-gradient effect sizes are large in the LLaDA-family DLM cohorts ($3.23\times$, $362\times$ on the source-scale method-comparison set, and $1.48\times$; Mann–Whitney $U$ with Bonferroni $m=6$ and bootstrap CIs), while AR controls are smaller or non-portable. (C) On a fixed held-out LLaDA-family split ($n=671$), max-gradient precision is 0.68 with split-bootstrap 95% CI $[0.500, 0.947]$, compared with the all-positive top-1 baseline ceiling 0.148 (recall 0.94, $F_1=0.79$).

## 1 Introduction
Discrete diffusion language models (DLMs) (Nie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib21); Sahoo et al., [2024](https://arxiv.org/html/2606.24119#bib.bib25); Ye et al., [2025](https://arxiv.org/html/2606.24119#bib.bib32)) reconstruct fully masked sequences through iterative denoising, using bidirectional context rather than left-to-right prediction. As DLM checkpoints and fine-tuning recipes spread (Zhang et al., [2024](https://arxiv.org/html/2606.24119#bib.bib34), [2026](https://arxiv.org/html/2606.24119#bib.bib36); Wu et al., [2026](https://arxiv.org/html/2606.24119#bib.bib29); Kuiper et al., [2025](https://arxiv.org/html/2606.24119#bib.bib14); Yang et al., [2026](https://arxiv.org/html/2606.24119#bib.bib31)), practitioners need low-cost monitors for short-run LoRA training. A tempting candidate is already exposed by the audited DLM runners: the top-1 collapse rate, which measures whether argmax predictions concentrate on a small token vocabulary. This signal is logged for denoising/remasking diagnostics, but its meaning under short-horizon PEFT is unclear. We test the transfer directly: can top-1 collapse serve as a PEFT stability warning, and if not, what family-local monitor is more useful for inspection?
The transfer fails. Across three DLM model families spanning 816 DLM PEFT configurations (LLaDA-family, four cohorts, $n=671$; Dream-7B boundary cohort, $n=100$; MDLM-OWT 130M boundary cohort, $n=45$), the warning fires in $\mathbf{816/816}$ ($\mathbf{100\%}$) configurations, while actual training collapse, logged by the same training loop's `collapsed` flag, occurs in $\mathbf{0/816}$ ($\mathbf{0\%}$) at the 200-step horizon. The diagnostic has zero precision. Matched AR controls on Pythia $\{410\text{M}, 1\text{B}, 2.8\text{B}, 6.9\text{B}\}$ and Qwen3.5-9B (360 audited configurations; App. [C](https://arxiv.org/html/2606.24119#A3)) also show 0/360 actual collapses, so the result does not indicate a generic masked-CE collapse phenomenon. The warning fails to transfer into the tested DLM-LoRA PEFT setting.
The failure has a measured explanation. Across the same 671 configurations, top-1 token frequency is $\mathbf{0.83 \pm 0.13}$ at training step 0; every configuration is already above 0.5, the median configuration crosses 0.95 within $\mathbf{4}$ optimizer steps, and the legacy fire-step is stability-agnostic (Mann–Whitney $U$: $p=0.20$, n.s.; Fig. [3](https://arxiv.org/html/2606.24119#A2.F3)). A parameter-side check at the worst rank-amplification corner gives the complementary measurement: per-token CE gradients are only modestly concentrated (Gini 0.29, largest evaluated token-position CE-gradient share 1.5%), while LoRA-parameter gradients are concentrated (Gini 0.46, one matrix carries 63.0% of gradient mass; App. [D](https://arxiv.org/html/2606.24119#A4)). Top-1 tracks token-side pre-equilibrium concentration; max gradient norm samples the parameter-side routing that separates stable from unstable runs.
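The two concentration statistics quoted above (a Gini coefficient and a largest-share figure) can be illustrated with a minimal sketch. It assumes that "gradient mass" is a list of non-negative per-item magnitudes, one per token position or per LoRA matrix; the source does not state whether norms or squared norms were used, and the function names and example values here are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = uniform, near 1 = concentrated)."""
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0.0:
        return 0.0
    # Standard closed form on ascending-sorted values, with ranks i = 1..n:
    # G = 2 * sum_i(i * x_i) / (n * sum_i x_i) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def largest_share(values):
    """Fraction of the total mass carried by the single largest item."""
    total = sum(values)
    return max(values) / total if total > 0 else 0.0


# Hypothetical per-matrix gradient masses (illustration only, not logged data).
per_matrix_mass = [0.63, 0.12, 0.10, 0.08, 0.07]
concentration = gini(per_matrix_mass)
top_share = largest_share(per_matrix_mass)
```

Reporting both numbers is useful because the Gini coefficient summarizes the whole distribution while the largest share exposes a single dominant matrix or token position.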
We evaluate max gradient norm as a family-local triage signal with Mann–Whitney $U$ tests and Bonferroni correction across six analyzable families ($m=6$). On LLaDA2.0-mini ($n=144$), unstable configurations have $3.23\times$ higher median max-gradient norm than stable configurations ($p_{\text{Bonf}}=2.7\times 10^{-7}$, bootstrap 95% CI $[2.76, 3.97]$); on the method-comparison set ($n=395$), the ratio is $362\times$ (stable median 99.3 vs. unstable median 35,960.4 in the source scale; $p_{\text{Bonf}}=5\times 10^{-21}$, CI $[202, 779]$). The key check is held-out performance. On a fixed 80/20 split of the 671-configuration LLaDA-family corpus, a threshold selected on training configurations predicts top-decile final-loss configurations on held-out configurations with precision $\mathbf{0.68}$, recall $\mathbf{0.94}$, and $F_1=0.79$, versus 0.13 precision for the all-positive top-1 baseline on this fixed split. A separate split-bootstrap gives 95% CI $[0.500, 0.947]$, disjoint from the split-bootstrap baseline ceiling 0.148; each bootstrap replicate resamples configurations, redraws the train/test split, and reselects the threshold on train. Even the lower CI bound exceeds $3\times$ the baseline, and the supported use is inspection and routing rather than a high-precision gate (Limitations). Separately, a $B=200$ random-split step-$k$ sweep shows max-gradient precision stabilizing from step $\sim 25$ onward, while loss-at-step-$k$ is non-monotonic: loss is stronger at step 11 for extreme high-loss configurations but trails max-gradient at steps 25–100 (App. [B.1](https://arxiv.org/html/2606.24119#A2.SS1)). Cross-family thresholds do not transfer; calibration is per family, not a global constant.
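The split-bootstrap described above (resample configurations, redraw the train/test split, reselect the threshold on train) can be sketched as follows. This is an illustrative reading of the procedure rather than the released reference script: the function names, the percentile-interval construction, and the choice of candidate thresholds (observed training scores) are assumptions.

```python
import random


def f1_at(scores, labels, t):
    """F1 of the rule `score >= t` -> flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def precision_at(scores, labels, t):
    """Precision of the rule `score >= t`; 0.0 when nothing is flagged."""
    flagged = [y for s, y in zip(scores, labels) if s >= t]
    return sum(flagged) / len(flagged) if flagged else 0.0


def split_bootstrap_precision(scores, labels, n_boot=1000, train_frac=0.8, seed=0):
    """Percentile CI for held-out precision under the three-step replicate."""
    rng = random.Random(seed)
    n = len(scores)
    cut = int(train_frac * n)
    precisions = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # 1) resample configurations
        rng.shuffle(idx)                            # 2) redraw the train/test split
        train, test = idx[:cut], idx[cut:]
        tr_s = [scores[i] for i in train]
        tr_y = [labels[i] for i in train]
        # 3) reselect the F1-optimal threshold on the training part only
        t = max(sorted(set(tr_s)), key=lambda c: f1_at(tr_s, tr_y, c))
        te_s = [scores[i] for i in test]
        te_y = [labels[i] for i in test]
        precisions.append(precision_at(te_s, te_y, t))
    precisions.sort()
    return precisions[int(0.025 * n_boot)], precisions[int(0.975 * n_boot) - 1]
```

One caveat of this sketch: because resampling is done with replacement before splitting, a duplicated configuration can land on both sides of the split. Whether the original analysis handles duplicates differently is not stated in the text.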
DLM-LoRA triage workflow. The audited workflow is three steps: drop top-1 as a PEFT alarm at this horizon, log max-gradient by step $\sim 25$, and calibrate thresholds per DLM family before routing high-gradient configurations to inspection or separately validated follow-up sweeps. Three findings follow from existing data: top-1 is not a PEFT warning at this horizon, max-gradient is a family-local inspection trigger inside LLaDA-family runs, and mask ratio should be tuned per model rather than exported as a single operating window. Mask ratio is the strongest tested low-cost covariate in the mask-ratio holdout probes; max-gradient is the supported early inspection signal while preserving mask-ratio design as a separate tuning axis.
#### Contributions\.
1. An 816-configuration refutation: top-1 fires in 816/816 DLM PEFT configurations while 0/816 actual collapses occur (§[4.1](https://arxiv.org/html/2606.24119#S4.SS1)).
2. A two-level saturation characterization showing why the warning fails, with token-side pre-equilibrium saturation and parameter-side gradient routing evidence (§[4.5](https://arxiv.org/html/2606.24119#S4.SS5)).
3. A family-calibrated max-gradient triage protocol with held-out precision 0.68 (CI $[0.500, 0.947]$) on the pooled LLaDA-family corpus (§[4.1](https://arxiv.org/html/2606.24119#S4.SS1)).
4. Thirteen falsification probes and matched AR controls that bound the claim to short-horizon DLM-LoRA PEFT (App. [D](https://arxiv.org/html/2606.24119#A4)).
Manuscript values are source-mapped through local run manifests, claim-bearing aggregates, and verification summaries; public paper source, reference scripts, and the sanitized aggregate result artifacts that back the tables and figures are released at the [GitHub repository](https://github.com/lucky-verma/top1-fails-dlm-lora-monitors) ([result artifacts](https://github.com/lucky-verma/top1-fails-dlm-lora-monitors/tree/main/results)).
## 2 Background
### 2.1 Discrete Diffusion Language Models
Discrete diffusion language models (DLMs) train by adding discrete noise to token sequences (masking tokens at rate $\rho$) and learning to reconstruct the original tokens from the noisy input. At inference, DLMs iteratively denoise a fully masked sequence over $T$ steps, using bidirectional attention at each step (Nie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib21); Sahoo et al., [2024](https://arxiv.org/html/2606.24119#bib.bib25)). We write the masked-diffusion training objective in the per-masked-token form used by our implementation:
$$\mathcal{L}(\theta) = -\mathbb{E}_{t,\mathbf{x}_t}\left[\frac{1}{|\mathcal{M}_t|}\sum_{i\in\mathcal{M}_t}\log p_\theta\left(x_0^i \mid \mathbf{x}_t\right)\right], \tag{1}$$

where $\mathcal{M}_t=\{i : x_t^i = \texttt{[MASK]}\}$ and $x_0^i$ is the clean token at position $i$. This differs fundamentally from AR next-token prediction. The density of gradient signal scales with $\rho$ and the prediction entropy grows with the number of tokens jointly predicted, which together drive the rank–mask interaction we characterize in Sec. [4](https://arxiv.org/html/2606.24119#S4).¹

¹ We encountered five silent-failure modes in the standard HuggingFace + PEFT stack when running LoRA on LLaDA/Dream (loss API returning `None`, generation API kvcache assertion, target-module auto-detection, Dream model-class loader, Dream attention-mask dtype). Drop-in fixes appear in Appendix [A](https://arxiv.org/html/2606.24119#A1); public release artifacts are linked in Appendix [A](https://arxiv.org/html/2606.24119#A1).
## 3 Methodology
### 3.1 Correct Training Objective
Standard HuggingFace PEFT training assumes a model-internal supervised loss, but LLaDA-style DLM forward passes return logits only because the caller defines the masking distribution. Following Sahoo et al. ([2024](https://arxiv.org/html/2606.24119#bib.bib25)), we mask tokens externally and use Eq. [1](https://arxiv.org/html/2606.24119#S2.E1) with loss computed only over masked positions. Appendix [A](https://arxiv.org/html/2606.24119#A1) gives the drop-in API fixes needed to reproduce this objective.
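The per-sample term inside the expectation of Eq. 1 can be sketched in plain Python. This is a toy illustration of "loss computed only over masked positions", not the Appendix A fix: the `MASK` string stands in for the model's mask token ID, and `log_probs` is a hypothetical per-position mapping from candidate tokens to model log-probabilities.

```python
import math

MASK = "[MASK]"  # placeholder symbol; real runs compare against the mask token ID


def masked_ce(log_probs, x0, xt):
    """Mean negative log-likelihood of clean tokens, over masked positions only.

    log_probs[i] maps candidate tokens to log p_theta(token | x_t) at position i.
    x0 is the clean sequence and xt is the externally masked input.
    """
    masked = [i for i, tok in enumerate(xt) if tok == MASK]
    if not masked:
        return 0.0  # no supervised positions in this sample
    return -sum(log_probs[i][x0[i]] for i in masked) / len(masked)


# Toy example: two of three positions are masked.
x0 = ["a", "b", "c"]
xt = ["a", MASK, MASK]
log_probs = [{}, {"b": math.log(0.5)}, {"c": math.log(0.25)}]
loss = masked_ce(log_probs, x0, xt)
```

Averaging this quantity over sampled noise levels and masked inputs gives a Monte Carlo estimate of the expectation in Eq. 1. Unmasked positions never contribute, which is why position 0 needs no log-probabilities in the toy example.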
### 3.2 Experimental Setup
#### Models\.
We evaluate LoRA fine\-tuning in three roles\. LLaDA\-family DLMs provide the primary top\-1 refutation and max\-gradient separation; Pythia/Qwen causal models under matched masked\-CE serve as diagnostic controls; Dream, MDLM\-OWT, and LLaDA\-MoE runs act as boundary cohorts\. The primary DLM setup is:
- LLaDA-8B-Instruct (Nie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib21)): 8B-parameter masked diffusion LM. Mask token ID: 126336. Architecture: LLaDAModel (custom, non-HF-standard).
- LLaDA2.0-mini (Bie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib2)): 15.93B MoE masked diffusion LM. Mask token ID: 156895. This model provides the 60-configuration rank $\times$ mask surface and the $2\times 2$ task-performance factorial.
- Dream-7B (Ye et al., [2025](https://arxiv.org/html/2606.24119#bib.bib32)): 7B-parameter masked diffusion LM. Loaded via `AutoModel`. Requires boolean attention mask.
#### LoRA configuration\.
All primary DLM runs adapt attention projections only (`q_proj`, `k_proj`, `v_proj`, `o_proj`). The LLaDA2.0-mini surface uses ranks $\{4, 8, 16, 32, 64\}$ and 12 mask ratios spanning $\rho\in[0.05, 0.95]$; the task-performance factorial uses ranks $\{4, 64\}$ and masks $\{0.40, 0.90\}$ for 3 seeds per configuration, and the operating-cell method comparison (App. [E](https://arxiv.org/html/2606.24119#A5)) uses $n=10$ seeds per method at the learning rate selected by the $\alpha$-sweep. The older LLaDA-Instruct pilot uses the coarser $5\times 4$ grid with $\rho\in\{0.3, 0.5, 0.7, 0.9\}$, and the Dream-7B pilot uses a learning-rate-resolved rank $\times$ mask grid described in Appendix [C](https://arxiv.org/html/2606.24119#A3).
#### Training\.
Short pilot runs use 30–40 steps for API validation; the LLaDA2.0-mini surface uses 200 steps at lr $=10^{-4}$ (reported as an observed-prefix diagnostic because the legacy top-1 detector early-stops all 60 traces at step 11). The $2\times 2$ factorial uses 1000 steps on a 152-example hand-written arithmetic corpus with 20 held-out prompts and is reported as masked-CE convergence evidence; generation-quality evaluation is separated from this diagnostic claim. Batch size 4, AdamW, gradient norms recorded pre-clipping (0.5 threshold). LLaDA2.0-mini runs on an H100 NVL (96GB) workstation; pilots on CHIP HPC (UMBC) NVIDIA L40S (48GB). Implementation: HuggingFace `transformers` (Wolf et al., [2020](https://arxiv.org/html/2606.24119#bib.bib28)) + PEFT (Mangrulkar et al., [2022](https://arxiv.org/html/2606.24119#bib.bib18)). Gradient norms are global $\ell_2$ norms over trainable LoRA parameters. Family-canonical hyperparameters are used across model families (LLaDA lr $=10^{-4}$, eff. batch 64; Dream-default lr $=2\times 10^{-6}$; MDLM-OWT lr $=10^{-4}$, batch 1); the 816/816 fire-rate identity is an empirical aggregate under these family-specific settings, not a hyperparameter-invariance proof (App. [C](https://arxiv.org/html/2606.24119#A3)).
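A minimal sketch of the logged quantity, assuming gradients are available as flat lists of floats per trainable LoRA parameter: the global $\ell_2$ norm is taken over all entries jointly and recorded before the 0.5 clipping threshold is applied. The helper names are hypothetical and the clipping rule is a simplified textbook form, not necessarily the exact rule in the training stack.

```python
import math


def global_l2_norm(grads):
    """Global l2 norm over all parameters: sqrt of the sum of squared entries."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_scale(total_norm, max_norm=0.5):
    """Multiplier applied by norm clipping; 1.0 when already within the bound."""
    return min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0


def track_max_grad(per_step_grads, max_norm=0.5):
    """Return (pre-clip norm, clip multiplier, running max) for each step."""
    running_max = 0.0
    history = []
    for grads in per_step_grads:
        norm = global_l2_norm(grads)          # recorded BEFORE clipping
        running_max = max(running_max, norm)  # the triage signal accumulates here
        history.append((norm, clip_scale(norm, max_norm), running_max))
    return history
```

Recording the norm before clipping matters for triage: after clipping every large step would report a norm near 0.5, and the separation between stable and unstable runs would be lost. In PyTorch, `torch.nn.utils.clip_grad_norm_` returns the total norm computed before clipping, which is one way to obtain this value; the text does not state which call the authors used.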
#### Scope note on AR baselines\.
Two AR controls play distinct roles. Training-stack sanity: a Mistral-7B LoRA baseline under standard next-token cross-entropy (Appendix [A](https://arxiv.org/html/2606.24119#A1)) verifies that the implementation itself is not the instability source. Masked-CE control: Pythia-1B (Biderman et al., [2023](https://arxiv.org/html/2606.24119#bib.bib1)) on the same $5\times 12$ grid (180 runs, $n=3$ seeds, §[4.3](https://arxiv.org/html/2606.24119#S4.SS3)) tests the loss-vs-architecture confound by holding the loss fixed while varying architecture and pretraining; Qwen3.5-9B (Qwen Team, Alibaba, [2026](https://arxiv.org/html/2606.24119#bib.bib24)) adds a larger matched control in §[4.3](https://arxiv.org/html/2606.24119#S4.SS3).
## 4 Experiments and Results
The experiments answer a diagnostic question, not a method-comparison question: can a low-cost monitor identify DLM-LoRA configurations that should be inspected before the late training loss is known? We first test the transferred top-1 warning, then evaluate max gradient norm under the same held-out label, and finally use the LLaDA2.0-mini rank $\times$ mask surface, AR controls, token/gradient measurements, and task probe to mark the boundary of the claim. Preliminary LLaDA-8B-Instruct experiments across a $5\times 4$ rank $\times$ mask grid motivated the denser LLaDA2.0-mini study but are omitted from the main body.
#### Evaluation object and baselines\.
The object under evaluation is the *training monitor*. The main baselines are therefore diagnostic: the transferred top-1 warning, max-gradient-up-to-step-$k$, loss-at-step-$k$, mask-ratio covariates, and matched AR masked-CE controls. PEFT variants enter as boundary and method-comparison cohorts, but the claim-bearing question stays fixed: whether an early DLM-LoRA monitor can route top-decile final-loss configurations to inspection better than the transferred top-1 warning under family-specific thresholds. Table [1](https://arxiv.org/html/2606.24119#S4.T1) summarizes the action-facing verdict.
### 4.1 Top-1 Has Zero Precision; Max Gradient Norm Provides Calibrated Triage
The audited LLaDA-family runner exposes a top-1-frequency collapse heuristic: it emits `top1_warning_detected` when more than 50% of predicted argmax tokens concentrate on a single token within a short observation window. This makes it a plausible but unvalidated short-run PEFT stability monitor; the Dream and MDLM boundary cohorts use harmonized logging fields for the same test.
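The heuristic just described can be sketched in a few lines. This is an assumed reconstruction from the one-sentence description above, not the runner's code: the function names are hypothetical, and the "short observation window" is represented simply as the list of argmax token IDs passed in.

```python
from collections import Counter


def top1_frequency(argmax_tokens):
    """Share of predictions taken by the single most frequent argmax token."""
    if not argmax_tokens:
        return 0.0
    _, count = Counter(argmax_tokens).most_common(1)[0]
    return count / len(argmax_tokens)


def top1_warning(argmax_tokens, threshold=0.5):
    """Fire when MORE than `threshold` of predictions share one token."""
    return top1_frequency(argmax_tokens) > threshold
```

With a threshold of 0.5, any window in which one token already dominates the argmax predictions fires the warning regardless of how training later behaves, which is the failure mode examined in §4.5.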
#### Denominator and result\.
We aggregate across three DLM model families totalling $\mathbf{816}$ configurations: LLaDA family ($n=671$; four cohorts from 2–3 model checkpoints), a Dream-7B dense boundary cohort ($n=100$; App. [C](https://arxiv.org/html/2606.24119#A3)), and an MDLM-OWT 130M dense boundary cohort ($n=45$; App. [C](https://arxiv.org/html/2606.24119#A3)). The cohorts use harmonized logging fields for `top1_warning_detected` and post-hoc `collapsed`; MDLM measures top-1 from the training-time masked-input forward pass, while LLaDA-family runs use the corresponding runner proxy (App. [C](https://arxiv.org/html/2606.24119#A3)). The top-1 collapse warning fires in $\mathbf{816/816}$ ($\mathbf{100\%}$) configurations; actual collapse occurs in $\mathbf{0/816}$ ($\mathbf{0\%}$). The diagnostic has zero precision at this horizon. In PEFT fine-tuning at $\leq 200$ steps, the warning fires on a pre-equilibrium artifact of LoRA updates rather than on divergence dynamics. AR controls (0/360 collapses across the audited main and extended-mask grids; §[4.3](https://arxiv.org/html/2606.24119#S4.SS3)) show that the tested masked-CE controls do not produce an analogous collapse pattern. Source-level provenance is recorded in Appendix [A](https://arxiv.org/html/2606.24119#A1).
#### Max gradient norm separates stable from unstable configurations\.
We report the maximum LoRA gradient $\ell_2$ norm over the training trajectory (pre-clipping at the standard 0.5 threshold) as the triage signal. Within DLM family, the median max-gradient ratio between unstable (top-decile final-loss) and stable (sub-median final-loss) configurations is $\mathbf{3.23\times}$ on the LLaDA2.0-mini full surface ($n=144$, Mann–Whitney $U$, $p_{\text{Bonf}}=2.7\times 10^{-7}$, $m=6$, bootstrap 95% CI $[2.76, 3.97]$); $\mathbf{362\times}$ on the LLaDA method-comparison set ($n=395$; stable median 99.3, unstable median 35,960.4 in the source scale; $p_{\text{Bonf}}=5\times 10^{-21}$, CI $[202, 779]$); and $1.48\times$ ($p_{\text{Bonf}}=0.036$) on the 10-seed critical expansion ($n=120$, compressed dynamic range). AR controls show smaller, inconsistent separation (Table [2](https://arxiv.org/html/2606.24119#A1.T2); verification summaries in App. [A](https://arxiv.org/html/2606.24119#A1)), supporting family calibration rather than a global threshold.
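The effect-size summary above combines a median ratio, a bootstrap interval, and a Bonferroni-adjusted $p$-value. A minimal sketch of the first two, plus the adjustment itself, is shown here under stated assumptions: a percentile bootstrap that resamples each group independently, and helper names that are hypothetical. The Mann–Whitney $U$ statistic is not reimplemented; a library routine such as SciPy's `scipy.stats.mannwhitneyu` would typically supply the raw $p$-value.

```python
import random
import statistics


def median_ratio(unstable, stable):
    """Ratio of group medians: unstable over stable max-gradient norm."""
    return statistics.median(unstable) / statistics.median(stable)


def bootstrap_ratio_ci(unstable, stable, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median ratio (groups resampled separately)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        u = rng.choices(unstable, k=len(unstable))
        s = rng.choices(stable, k=len(stable))
        ratios.append(median_ratio(u, s))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def bonferroni(p_value, m=6):
    """Bonferroni adjustment across m tests, capped at 1."""
    return min(1.0, p_value * m)
```

The stable group must have a strictly positive median for the ratio to be defined, which holds for gradient norms in practice.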
#### Held\-out precision check\.
A fixed 80/20 split over the full 671-configuration LLaDA-family corpus ($n_{\text{train}}=536$, $n_{\text{test}}=135$), including the method-comparison cohort ($n=395$), with the max-gradient threshold $F_1$-optimized on train predicts top-decile final-loss on test with precision $\mathbf{0.68}$, recall $\mathbf{0.94}$, and $F_1=\mathbf{0.79}$.² A separate $B=1000$ split-bootstrap, which resamples configurations, redraws the train/test split, and reselects the threshold on train in each replicate, gives 95% precision CI $[\mathbf{0.500}, \mathbf{0.947}]$, disjoint from the always-positive baseline ceiling 0.148. This is a roughly $5\times$ precision lift over the fixed-split baseline; even the lower confidence bound is more than $3\times$ the baseline ceiling, but the absolute precision remains moderate. The supported use is therefore inspection and routing rather than an automatic decision rule; the pooled evaluation is appropriate because the primary grid alone is underpowered for threshold calibration (Limitations). A late-vs-early gradient-ratio rule and its conjunction with the threshold are less precise (App. [A](https://arxiv.org/html/2606.24119#A1)); the next paragraph reports a separate $B=200$ random-split timing sweep.

² The top-decile final-loss label is defined on the full corpus; by chance the test-split unstable fraction is 13.3% (18/135) vs. 9.3% in train.
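The fixed-split check reduces to two steps: choose the threshold that maximizes $F_1$ on the training configurations, then score that frozen threshold on the held-out configurations. The sketch below illustrates this under assumptions not stated in the text: candidate thresholds are the observed training scores, a configuration is flagged when its max-gradient value is at or above the threshold, and all names are hypothetical.

```python
def prf(scores, labels, threshold):
    """Precision, recall, F1 for the rule `score >= threshold` -> flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def fit_threshold(train_scores, train_labels):
    """Pick the observed training score that maximizes F1 on train."""
    candidates = sorted(set(train_scores))
    return max(candidates, key=lambda t: prf(train_scores, train_labels, t)[2])


# Toy data (hypothetical): label 1 marks a top-decile final-loss configuration.
train_scores = [1.0, 2.0, 3.0, 50.0, 60.0, 4.0]
train_labels = [0, 0, 0, 1, 1, 0]
threshold = fit_threshold(train_scores, train_labels)
test_metrics = prf([55.0, 2.0, 70.0, 1.0], [1, 0, 0, 0], threshold)
```

As a consistency check on the reported figures: with 18 unstable configurations in the 135-configuration test split, counts of 17 true positives, 8 false positives, and 1 false negative reproduce precision 0.68, recall 0.94, and $F_1 \approx 0.79$. These counts are an inference from the rounded metrics, not values reported in the text.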
#### Early\-warning timing: stable inspection before late loss settles\.
The practical case for max gradient norm is timing, not absolute precision or compute saving. We sweep three predictors (max-gradient-up-to-step-$k$, loss-at-step-$k$, and max top-1 token-frequency-up-to-step-$k$) across $k\in\{5, 10, 11, 25, 50, 100, 200\}$ on the same 671-configuration corpus with $B=200$ random 80/20 splits and $F_1$-optimized thresholds on train (Fig. [2](https://arxiv.org/html/2606.24119#S4.F2); Tab. [3](https://arxiv.org/html/2606.24119#A2.T3)). Max-gradient precision stabilizes at $\mathbf{0.73}$–$\mathbf{0.75}$ from step $\mathbf{25}$ onward. Loss-at-step-$k$ is non-monotonic: it spikes to 0.79 at step 11, dips to 0.50–0.65 at steps 50–100, and becomes tautological at step 200 because loss is then the label. A practitioner reading only loss at step 11 would observe higher single-point precision (0.79) but would lose reliable signal for any inspection triggered between steps 25 and 100; max gradient norm accumulates in the same training logs without requiring an additional forward pass and remains predictive across the full step-25–100 window. Max top-1-up-to-step-$k$ never exceeds 0.27, consistent with the pre-equilibrium-artifact framing.
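The predictors in this sweep differ only in how a per-step trace is summarized at step $k$: a running maximum for the gradient signal and a point value for the loss. The sketch below shows that feature construction, assuming traces are lists indexed so that element 0 is step 1; this indexing convention and the function names are assumptions made for illustration.

```python
STEPS = (5, 10, 11, 25, 50, 100, 200)


def max_up_to_step(trace, k):
    """Max of a per-step trace over steps 1..k (trace[0] is step 1)."""
    return max(trace[:k])


def value_at_step(trace, k):
    """Point value of a per-step trace at step k."""
    return trace[k - 1]


def step_features(grad_trace, loss_trace, steps=STEPS):
    """Per-k features, skipping any k beyond the observed prefix."""
    return {
        k: {
            "max_grad": max_up_to_step(grad_trace, k),
            "loss": value_at_step(loss_trace, k),
        }
        for k in steps
        if k <= len(grad_trace) and k <= len(loss_trace)
    }
```

Because the gradient feature is a running maximum, it can only grow with $k$ and retains any early spike, whereas the loss feature forgets earlier steps. That difference is one plausible reading of why the two predictors behave differently across the step 25–100 window.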
Figure 2: Step-$k$ precision separates stable inspection from final-loss hindsight. On the 671-configuration LLaDA-family corpus, max-gradient precision stabilizes through the step-25–100 window where loss-at-step-$k$ is least reliable; the step-200 loss point is the final-loss label. Colored ribbons are bootstrap 95% CIs; the gray band marks the inspection window; the dotted line is the all-positive precision ceiling. The top-1 line (blue) exceeds this ceiling at $k\geq 10$ because random splits can correlate top-1 with loss by chance; the zero-precision result is on the fixed full-corpus split (0/816 collapses).
#### DLM\-LoRA triage workflow\.
The operational recipe is deliberately narrow. Drop the top-1 collapse warning as a PEFT early-warning signal at $\leq 1$K step horizons. Log max LoRA gradient norm at every training step and inspect it by step $\sim 25$. Calibrate high-gradient thresholds per family, e.g., absolute max-grad values above 50–100 in the LLaDA2.0-mini logging scale or above a locally calibrated high quantile. These thresholds are inspection triggers, not prospectively validated cross-family cutoffs or compute-saving policies. Appendix [A](https://arxiv.org/html/2606.24119#A1) gives the logging fields needed to implement this protocol.
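The "calibrate per family" step can be sketched as a lookup of family-specific thresholds derived from each family's own max-gradient distribution. The quantile level (0.9 here), the family keys, and the routing labels are all assumptions for illustration; the text specifies only "a locally calibrated high quantile".

```python
def quantile(values, q):
    """Linear-interpolation quantile for 0 <= q <= 1."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values to calibrate on")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def calibrate(max_grad_by_family, q=0.9):
    """One inspection threshold per family, from that family's own runs."""
    return {fam: quantile(vals, q) for fam, vals in max_grad_by_family.items()}


def route(family, max_grad, thresholds):
    """Inspection trigger only; families without calibration are never thresholded."""
    if family not in thresholds:
        return "uncalibrated"
    return "inspect" if max_grad >= thresholds[family] else "continue"
```

Returning `"uncalibrated"` for an unseen family, rather than borrowing another family's cutoff, mirrors the finding that cross-family thresholds do not transfer.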
Table 1: Actionable monitor verdict. The deliverable is not a new PEFT method; it is a claim-matched triage protocol for DLM-LoRA training monitors at the tested horizons.

| Decision | Evidence-backed use |
| --- | --- |
| Drop warning | Top-1 fires in 816/816 DLM configurations with 0/816 actual collapse; not a PEFT collapse detector at this horizon. |
| Use triage | Fixed-split max-gradient precision 0.68, recall 0.94; $B=200$ random-split sweep stable by step 25; inspection trigger only. |
| Keep baseline | Loss-at-$k$ precision reaches 0.79 at step 11 but falls to 0.50–0.65 at steps 50–100; step 200 is the label by construction. |
| DLM-LoRA only | AR masked-CE controls have 0/360 collapse and smaller or inconsistent separation, so the warning failure is scoped to the tested DLM-LoRA monitor transfer. |
| No global cutoff | Cross-family thresholds do not transfer; high-gradient values are inspection triggers only after per-family calibration. |
### 4.2 A U-Shaped Gradient Instability Profile Across Mask Ratio
We extend the analysis to LLaDA2.0-mini (Bie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib2)) (`inclusionAI/LLaDA2.0-mini`, 15.93B MoE, mask token ID 156895), a more recent and larger masked diffusion model. We configure 60 unique rank $\times$ mask combinations ($n=144$ total runs including multi-seed replications): ranks $\{4, 8, 16, 32, 64\}$ $\times$ 12 mask ratios spanning $\rho\in[0.05, 0.95]$ (raw grid in Appendix Table [4](https://arxiv.org/html/2606.24119#A3.T4)), with a 200-step budget at lr $=10^{-4}$. The legacy top-1 collapse detector early-stops all 60 raw traces at step 11, so this surface should be read as an observed-prefix short-run diagnostic rather than a completed 200-step trajectory.
#### U\-shaped instability profile\.
Unlike the non-monotone rank-optimum flip observed in LLaDA-Instruct (§[4](https://arxiv.org/html/2606.24119#S4)), LLaDA2.0-mini reveals a U-shaped gradient instability profile across mask ratio (companion to Figure [1](https://arxiv.org/html/2606.24119#S0.F1) panel B). Because top-1 fires uniformly, this surface explains where max-gradient triage from §[4.1](https://arxiv.org/html/2606.24119#S4.SS1) becomes useful. The high-mask arm corresponds to high-mask gradient amplification: in this observed-prefix grid, fine-tuning LoRA on a DLM at $\rho>0.70$ produces gradient magnitudes up to $6.0\times$ larger than the operating-window maximum (34.8 vs. 5.8), in proportion to LoRA rank. The left arm is sparse-signal variance. We use these two names throughout:
Two instability mechanisms:
- Left arm (mask $<0.15$): Sparse supervision, only 5–15% of tokens are masked per sequence. The per-batch gradient estimate has high variance (few prediction targets, noisy signal). Gradient norm spikes reach 7.7–23.6 across ranks.
- Right arm (mask $>0.70$): High-mask gradient amplification, predicting 70–95% of tokens simultaneously produces a high-entropy prediction task with large loss and gradient magnitudes. Rank amplifies this arm directionally: in the 1-seed surface (Table [4](https://arxiv.org/html/2606.24119#A3.T4)) $r=4$ at $\rho=0.95$ reaches 2.7 and $r=64$ reaches 34.8 ($12.9\times$); in the 3-seed replication (Table [4](https://arxiv.org/html/2606.24119#A3.T4)) the same configurations give $34.5\pm 9.7$ and $41.2\pm 17.5$ respectively ($1.19\times$, $n=3$ seeds), so we keep the high-mask asymmetry as a directional finding and the 3-seed values as the canonical magnitude.
- Low-mid operating region (mask $\in[0.30, 0.40]$): in the direct one-seed observed-prefix grid, these configurations have low gradient norms across all five ranks, with $\rho=0.45$ supported only by a narrower $r=64$ boundary run. In the 3-seed completed grid, the lowest mean gradient norms shift toward low-mid masks and the $r=64$ values at $\rho\in\{0.30, 0.40\}$ are noisy; we therefore base the practical recommendation on convergence and held-out CE evidence rather than on a replicated global gradient minimum.

Practical recommendation: avoid $\rho>0.70$ for LLaDA2.0-mini LoRA at lr $=10^{-4}$ in the tested setup; treat $\rho=0.30$–$0.40$ as a conservative low-mid default, not a global optimum.
LLaDA2 (Bie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib2)) independently reports high gradient variance at extreme masking during pre-training and clips their noise-schedule coefficient within $[\alpha_{\min}, \alpha_{\max}]$, a bandwidth that maps to our operating window. Our LoRA characterization adds the rank dimension and shows amplification concentrates on the high-mask arm.
#### Replication correction\.
The 1-seed observed-prefix $12.9\times$ high-mask amplification contracts to $1.19\times$ under the 3-seed full-grid replication (Table [4](https://arxiv.org/html/2606.24119#A3.T4)), with high-mask std as large as 60. We therefore use the replicated surface as the canonical magnitude estimate and keep symbolic-regression descriptors as appendix-only exploratory summaries, not decision rules.
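The amplification factors in this subsection follow directly from the reported gradient-norm values. As a small arithmetic check using only numbers quoted in the text (the helper and variable names are illustrative):

```python
def fold(numerator, denominator):
    """Fold-change between two gradient-norm magnitudes."""
    return numerator / denominator


# r=64 vs r=4 at rho=0.95, 1-seed observed-prefix surface
one_seed_amplification = fold(34.8, 2.7)

# Same two configurations, 3-seed means (41.2 for r=64, 34.5 for r=4)
three_seed_amplification = fold(41.2, 34.5)

# High-mask maximum vs operating-window maximum on the observed-prefix grid
window_amplification = fold(34.8, 5.8)
```

Rounded, these give about 12.9, 1.19, and 6.0 respectively. The first two differ mainly because the $r=4$ value at $\rho=0.95$ rises from 2.7 in the single-seed surface to a 3-seed mean of 34.5, while the $r=64$ value changes comparatively little.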
#### Standard AR PEFT does not traverse this surface\.
Standard AR LoRA fine\-tuning uses dense next\-token supervision over all non\-first positions, has no mask\-ratio dimension, and therefore cannot exhibit a rank–mask interaction in its standard recipe\. The natural follow\-up question is whether the U\-shape we observe on LLaDA2\.0\-mini is a property of DLM bidirectional architecture or of the masked\-CE objective itself\. We answer that with a paired AR control in the next subsection\.
### 4.3 AR Baseline Control (summary; full grids in App. [C](https://arxiv.org/html/2606.24119#A3))
To isolate the masked-CE objective from DLM bidirectional attention, we ran matched random-mask cross-entropy controls on Pythia-1B, Pythia $\{410\text{M}, 2.8\text{B}, 6.9\text{B}\}$, and Qwen3.5-9B, with 360 audited configurations across main grids and extended-mask supplements. At matched configuration $(r=64, \rho=0.40)$, the max-grad-norm magnitude is $\mathbf{2.0\times}$–$\mathbf{2.5\times}$ smaller on AR than on LLaDA2.0-mini (16.61 on Pythia-1B vs 33.4 on LLaDA2.0-mini); the high-vs-mid ratio at $r=64$ is $1.20\times$ on the Pythia-1B AR control vs $2.54\times$ on LLaDA2.0-mini, and Qwen3.5-9B shows a cross-family mid-mask peak rather than the U-shape. AR controls report 0/360 actual collapses, supporting a DLM-family-scoped interpretation rather than a masked-CE-generic one. The grid, denominator, and cross-architecture detail live in App. [C](https://arxiv.org/html/2606.24119#A3); we keep here only the body-essential conclusion: masked-CE alone is not sufficient to reproduce the DLM-family rank-amp magnitude, so the max-gradient triage signal in §[4.1](https://arxiv.org/html/2606.24119#S4.SS1) is calibrated against the DLM family it serves.
### 4.4 DLM Scale-Architecture Boundary (summary; full grids in App. [C](https://arxiv.org/html/2606.24119#A3))
The LLaDA2.0-mini operating window survives unevenly across DLM scales and architectures. The loss-side high-mask disadvantage replicates on Dream-7B (7B dense, lr-calibrated; App. [C](https://arxiv.org/html/2606.24119#A3)) and LLaDA2.1-mini (4-configuration transfer; App. [C](https://arxiv.org/html/2606.24119#A3)), but the rank-amplification direction is mixed on MDLM-OWT-130M ($n{=}3$ replication; the earlier single-seed lr-modulation pattern does not survive replication) and softens on LLaDA-MoE-A1B (1.4B small-MoE: gradient-side amplification $1.89$–$2.47\times$, loss-side flat; App. [C](https://arxiv.org/html/2606.24119#A3)). We claim DLM-family scope rather than an architecture-general window, with per-model lr calibration required; the full scale-boundary table and per-model lr discussion are reported in App. [C](https://arxiv.org/html/2606.24119#A3).
### 4.5 Why Top-1 Fires in Every DLM Configuration: A Two-Level Characterization
The $816/816$ fire vs $0/816$ collapse asymmetry reflects a structural mismatch between what the metric measures and what training stability requires. We characterize it with two corpus-wide measurements that decouple token-space concentration from parameter-space gradient routing.
#### Level 1 \(token\-side\): top\-1 is saturated before training\.
Across all $671$ LLaDA-family configurations, the top-1 token frequency at training step $0$ has mean $0.83$ and standard deviation $0.13$; $100\%$ of configurations are already above $0.5$ at step $0$, and $65\%$ are already above $0.8$. The median configuration crosses $0.95$ within $\mathbf{4}$ optimizer steps. The legacy detector samples at step $11$ and fires in every configuration because the threshold ($0.5$) is below the corpus-wide initialization distribution. Stable configurations (sub-median final loss) and unstable configurations (top-decile final loss) have indistinguishable median fire-step ($11$ vs $11$; Mann–Whitney $U$ two-sided $p{=}0.20$, $n_{\text{stable}}{=}336$, $n_{\text{unstable}}{=}68$; the remaining $267$ mid-band configurations are excluded from this stability contrast). The complementary saturation-step diagnostic (first step where top-1 crosses $0.95$) is significant in the *opposite* direction: unstable configurations saturate *faster* (median $1.0$) than stable configurations (median $4.0$), $p{=}4.7\times 10^{-5}$ ($n_{\text{stable}}{=}178$, $n_{\text{unstable}}{=}68$; conditioned on configurations that crossed $0.95$ by step $200$). A signal that saturates before training, fires faster on unstable runs, and is uniform across the corpus cannot discriminate stability; it measures a pre-equilibrium argmax-concentration artifact of LoRA’s small-magnitude initialization plus a few masked-CE updates against an already-confident pre-trained DLM. Figure [3](https://arxiv.org/html/2606.24119#A2.F3) reports the aggregate timing evidence.
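The two timing diagnostics in this paragraph (the legacy fire rule and the first crossing of a stricter level) can be sketched in a few lines. This is a minimal illustration rather than the authors' released script: the names `first_crossing` and `legacy_fires` are hypothetical, and the two traces are made-up numbers chosen only to mimic the reported pattern.

```python
def first_crossing(trace, threshold):
    """Return the first step whose top-1 frequency exceeds `threshold`, else None."""
    for step, value in enumerate(trace):
        if value > threshold:
            return step
    return None


def legacy_fires(trace, sample_step=11, threshold=0.5):
    """Legacy rule as described in the text: sample one step, compare to a fixed threshold."""
    return len(trace) > sample_step and trace[sample_step] > threshold


# Hypothetical traces (NOT paper data): top-1 token frequency per optimizer step.
stable = [0.83, 0.88, 0.91, 0.94, 0.96, 0.97] + [0.98] * 10
unstable = [0.90, 0.97, 0.99] + [0.99] * 13

for name, trace in (("stable", stable), ("unstable", unstable)):
    # Both traces fire the legacy rule; only the stricter crossing step differs.
    print(name, legacy_fires(trace), first_crossing(trace, 0.95))
```

Because both traces start above $0.5$, the legacy rule cannot separate them; only the stricter crossing step varies, which is the quantity the Mann–Whitney contrast above compares.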
#### Level 2 \(parameter\-side\): rank\-amp is optimization routing, not token routing\.
The token-side concentration above coexists with a near-uniform per-position information density, so the warning signal must measure something other than token-distribution concentration. At the worst rank-amplification corner ($r{=}64$, $\rho{=}0.95$, LLaDA2.0-mini, $n{=}3$ seeds, last-$10$ steps; App. [D](https://arxiv.org/html/2606.24119#A4)), the per-token cross-entropy-gradient distribution has Gini $0.287 \pm 0.056$ and the largest evaluated token position contributes only $1.54\% \pm 0.17\%$ of total CE-gradient mass (uniform baseline $0.8\%$). In the same runs, the LoRA-parameter gradient distribution has Gini $0.463 \pm 0.031$ and a *single* LoRA matrix carries $63.0\% \pm 3.6\%$ of total parameter-side gradient mass. At this high-mask corner, rank-amplification is therefore an *optimization-routing* phenomenon: the masked-CE signal arrives spread across token positions but is funnelled through a small subset of high-rank LoRA adapters in the late trajectory. Max gradient norm samples that late-trajectory routing, supporting why it carries discriminative information that top-1 does not.
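Both concentration statistics used here, the Gini coefficient and the share carried by the largest entry, are simple to compute from a list of non-negative gradient masses. The sketch below assumes the masses have already been aggregated to one number per unit (per token position or per LoRA matrix); the function names and the example masses are hypothetical, and the granularity the authors used for the parameter-side Gini is not stated in this section.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even, (n-1)/n = one entry holds all)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on ascending-sorted values, ranks starting at 1.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n


def top_share(values):
    """Fraction of the total mass carried by the single largest entry."""
    total = sum(values)
    return max(values) / total if total else 0.0


# Hypothetical per-matrix gradient masses (NOT paper data).
masses = {"q_proj": 0.63, "k_proj": 0.12, "v_proj": 0.15, "o_proj": 0.10}
print(gini(list(masses.values())), top_share(list(masses.values())))
```

Applying the same two functions to per-token and per-matrix masses from the same run is what allows the token-side and parameter-side concentrations to be compared on a common scale.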
#### What this characterization predicts\.
The useful monitor should depend on late-trajectory parameter dynamics, not early token confidence: max-gradient fits this pattern inside the calibrated LLaDA-family split, while the always-positive top-1 warning does not. The pathology is scoped to LoRA-on-pretrained-DLM regimes; AR controls (App. [C](https://arxiv.org/html/2606.24119#A3)) and the DLM scale-boundary check (App. [C](https://arxiv.org/html/2606.24119#A3)) support this boundary. Huang and Mirzasoleiman ([2026](https://arxiv.org/html/2606.24119#bib.bib11)) study masked-diffusion signal/noise decomposition in a different generalization regime.
The full pre\-equilibrium trajectory and timing breakdown are shown in Fig\.[3](https://arxiv.org/html/2606.24119#A2.F3)\.
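A parameter-side monitor of this kind can be sketched as a running maximum over per-step LoRA gradient norms. The version below is an assumption-laden illustration, not the authors' logging code: it takes the global L2 norm over all parameters whose name contains `lora_` (the substring PEFT uses in LoRA weight names) and operates on plain Python lists so that it runs without a deep-learning framework. The class and function names are hypothetical, and the exact reduction used in the paper is not specified in this section.

```python
import math


def lora_grad_norm(grads, marker="lora_"):
    """Global L2 norm over gradients whose parameter name contains `marker`."""
    squared = 0.0
    for name, values in grads.items():
        if marker in name:  # skip frozen base-model parameters
            squared += sum(v * v for v in values)
    return math.sqrt(squared)


class MaxGradMonitor:
    """Tracks the running maximum of the per-step LoRA gradient norm."""

    def __init__(self):
        self.max_norm = 0.0
        self.argmax_step = None

    def update(self, step, grads):
        norm = lora_grad_norm(grads)
        if norm > self.max_norm:
            self.max_norm, self.argmax_step = norm, step
        return norm


# Hypothetical gradients for two steps (NOT paper data).
monitor = MaxGradMonitor()
monitor.update(0, {"attn.q_proj.lora_A.weight": [3.0, 4.0], "attn.q_proj.weight": [100.0]})
monitor.update(1, {"attn.q_proj.lora_A.weight": [1.0, 0.0]})
print(monitor.max_norm, monitor.argmax_step)
```

In a real training loop the dictionary would be built from each trainable parameter's gradient after the backward pass and before any clipping, so that the logged value reflects the raw routing rather than the clipped update.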
#### Why no single\-axis intervention prevents saturation\.
The empirical $816/816$ identity is consistent with a masked-CE convergence argument: if fitting increases expected top-1 mass before optimization settles, then convergence-preserving single-axis interventions should preserve the legacy fire event. Two probes show the boundary. A loss-level entropy bonus on MDLM-OWT with $\lambda \in \{0.5, 1.0, 2.0, 5.0, 10.0\}$ does not reduce top-1 mass at this horizon ($-0.008$ at $\lambda{=}0.5$, $+0.051$ at $\lambda{=}10$; App. [D](https://arxiv.org/html/2606.24119#A4)); canonical PiSSA improves MDLM-OWT final loss ($-0.43$ at $200$ steps; $1.82 \to 1.24$, paired delta $-0.57$ at step $1000$) without changing the fire identity (App. [D](https://arxiv.org/html/2606.24119#A4)). App. [D](https://arxiv.org/html/2606.24119#A4) reports all thirteen probes; the bound is explanatory scaffolding, not a load-bearing theorem.
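For readers unfamiliar with a loss-level entropy bonus, one common form subtracts $\lambda$ times the predictive entropy from the cross-entropy at each supervised position, so that flatter predictions are rewarded. The sketch below shows that form for a single position. It is an assumption about the general shape of such a term, not the authors' exact probe; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ce_with_entropy_bonus(logits, target, lam):
    """Cross-entropy minus lam * predictive entropy for one masked position."""
    probs = softmax(logits)
    ce = -math.log(probs[target])
    return ce - lam * entropy(probs)


# Hypothetical logits (NOT paper data): a confident prediction on token 0.
logits = [4.0, 0.5, 0.2, 0.1]
for lam in (0.0, 0.5, 10.0):
    print(lam, ce_with_entropy_bonus(logits, target=0, lam=lam))
```

The bonus changes the loss value at every position, but as the probe results above indicate, that alone did not lower top-1 mass at the tested horizon.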
#### Scope refinements\.
The low-mid operating region does not define an architecture-general optimum: some final-loss probes prefer lower masks, while the convergence and held-out CE probes mainly support avoiding high-mask regimes in LLaDA-family settings (App. [A](https://arxiv.org/html/2606.24119#A1)). The worst rank-amplification corner shifts from $\rho{=}0.95$ ($12.9\times$, one seed) to $\rho{=}0.90$ ($84.7{\pm}60.4$, three seeds); high-mask capacity effects and LLaDA2.1-mini transfer remain underpowered. We therefore state a scoped diagnostic, not an architecture-general recipe.
#### Task\-performance sanity check\.
A small in-domain masked-CE convergence probe checks whether the gradient surface predicts downstream loss reduction, not generation accuracy. LLaDA2.0-mini is trained for $1000$ steps on $152$ hand-written arithmetic examples, crossing rank $\{4, 64\}$ with mask ratio $\{0.40, 0.90\}$ and evaluating masked-CE on $20$ disjoint prompts (App. [C](https://arxiv.org/html/2606.24119#A3)).
Finding. Table [5](https://arxiv.org/html/2606.24119#A3.T5) matches the surface ordering: operating-window configurations ($\rho{=}0.40$) reach lower final and holdout losses than high-mask configurations ($\rho{=}0.90$). The within-window rank gap is not significant (paired $t$, $p{=}0.40$), and Table [6](https://arxiv.org/html/2606.24119#A3.T6) shows no Bonferroni-corrected rank-64 advantage.
## 5 Related Work
LoRA/PEFT work introduces low-rank and quantized adapters (Hu et al., [2022](https://arxiv.org/html/2606.24119#bib.bib10); Dettmers et al., [2023](https://arxiv.org/html/2606.24119#bib.bib7); Liu et al., [2024](https://arxiv.org/html/2606.24119#bib.bib17)) plus rank-allocation and optimizer-side variants (Zhao et al., [2024](https://arxiv.org/html/2606.24119#bib.bib37); Zhang et al., [2025](https://arxiv.org/html/2606.24119#bib.bib35); Chang et al., [2025](https://arxiv.org/html/2606.24119#bib.bib3); Park et al., [2025](https://arxiv.org/html/2606.24119#bib.bib22)), but these works study generic or AR adaptation regimes rather than DLM mask-ratio monitor transfer. DLM work studies objectives and decoding (Sahoo et al., [2024](https://arxiv.org/html/2606.24119#bib.bib25); Nie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib21); Ye et al., [2025](https://arxiv.org/html/2606.24119#bib.bib32)), scaling and surveys (Bie et al., [2025](https://arxiv.org/html/2606.24119#bib.bib2); Li et al., [2025a](https://arxiv.org/html/2606.24119#bib.bib15)), mask-agnostic fine-tuning (Piskorz et al., [2025](https://arxiv.org/html/2606.24119#bib.bib23)), and recent systems or adapters including noise-aware LoRA (Kuiper et al., [2025](https://arxiv.org/html/2606.24119#bib.bib14); Yang et al., [2026](https://arxiv.org/html/2606.24119#bib.bib31); Xu et al., [2025](https://arxiv.org/html/2606.24119#bib.bib30); Wang et al., [2026](https://arxiv.org/html/2606.24119#bib.bib27)); these improve dLLM adaptation or inference but do not test whether top-1 collapse warnings transfer into supervised LoRA fine-tuning.
We use the term collapse for training-time top-1 argmax saturation, distinct from the representational layer collapse reported in fully-trained DLMs (Conzelmann et al., [2026](https://arxiv.org/html/2606.24119#bib.bib6)); auditing warning-signal precision under matched false-positive control has precedent outside language modeling (Mullett, [2026](https://arxiv.org/html/2606.24119#bib.bib20)). Like Schaeffer et al. ([2023](https://arxiv.org/html/2606.24119#bib.bib26)), we show that a familiar metric changes meaning outside its calibration regime; App. [E](https://arxiv.org/html/2606.24119#A5) gives the fuller taxonomy.
## 6 Conclusion
Top-1 fires in $816/816$ configurations while observed collapse is $0/816$ across three DLM families because the token-side signal saturates before training stability is observable. Max gradient norm instead gives a family-local inspection signal: precision $0.68$ on the pooled LLaDA-family split and stable step-$25$–$100$ behavior. The scoped recommendation is to drop top-1 as a PEFT collapse warning, log max-gradient for inspection, and recalibrate mask ratio per model before reusing inference-time confidence monitors as training alarms.
## Limitations
Budget and seeds. The primary $60$-configuration rank$\times$mask grid uses $n{=}3$ seeds at $200$ training steps, expanded to $n{=}10$ at twelve critical configurations. Power analysis (App. [C](https://arxiv.org/html/2606.24119#A3)) places adequate detection of $2\times$ ratios at $n \geq 30$ for high-mask configurations, so rank-amplification magnitudes are directional estimates; the top-1 refutation and max-gradient triage claims rest on the larger audited denominators.
Architecture and adapter scope. The max-gradient precision claim is calibrated on the pooled LLaDA-family corpus ($n{=}671$); the primary rank$\times$mask grid alone ($n{=}264$) contains too few unstable configurations to calibrate a held-out threshold reliably, so the pooled evaluation is the appropriate unit. The zero-precision top-1 denominator additionally includes Dream-7B and MDLM-OWT-130M boundary cohorts alongside the four LLaDA-family cohorts (2–3 model checkpoints). Adapters are placed on attention projections (q, k, v, o); MLP, embedding, and LM-head LoRA placement, quantization-mask interaction (Zhang et al., [2026](https://arxiv.org/html/2606.24119#bib.bib36); Wu et al., [2026](https://arxiv.org/html/2606.24119#bib.bib29)), and fully matched dense LLaDA-8B replication are follow-up axes.
Task and use scope. The task probe is an in-domain masked-CE convergence check rather than an accuracy-grade generation benchmark. Max-gradient is therefore presented as the tested LLaDA-family alternative to top-1 for early inspection, while coupled $(\rho, r, \text{family})$ intervention design and generation-quality gains remain separate claims for future work. Low-mid mask ratios are a conservative LLaDA-family default in the tested setup, not a global optimum; per-architecture validation is required.
Diagnostic horizon. The $816/816$ zero-precision result is bounded to short-run PEFT diagnostics at the tested horizon. We test the inherited legacy warning threshold ($>50\%$ argmax concentration at step $11$); recalibrated thresholds or alternative top-1-derived statistics could behave differently and remain unvalidated. Separate $2000$-step sidecars on Dream-7B ($27/27$ fire, $0/27$ collapse) and LLaDA2.0-mini MoE ($9/9$ fire, $0/9$ collapse) are consistent with this warning-failure pattern, but remain outside the $816$-configuration headline denominator. The result should not be read as a claim about full fine-tuning, DLM pretraining from scratch, or budgets beyond these bounded sidecars.
## Ethical considerations
All training and evaluation data are publicly released English-language benchmarks under permissive licenses (GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.24119#bib.bib5)), HumanEval (Chen et al., [2021](https://arxiv.org/html/2606.24119#bib.bib4)), MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.24119#bib.bib9)): MIT; MetaMathQA-5K: CC-BY-NC-SA-4.0); the $152$-example instruction corpus is hand-written, with no PII and no scraped third-party content. Backbone weights are publicly released (LLaDA-family per model cards; LLaDA-MoE-7B-A1B per Zhu et al., [2025](https://arxiv.org/html/2606.24119#bib.bib38); Pythia (Biderman et al., [2023](https://arxiv.org/html/2606.24119#bib.bib1)) and Qwen3.5-9B (Qwen Team, Alibaba, [2026](https://arxiv.org/html/2606.24119#bib.bib24)) under Apache 2.0). Aggregate compute is $\sim 119$ kg CO2eq total, estimated from reported GPU-hours and US grid-intensity context (Electricity Maps, [2024](https://arxiv.org/html/2606.24119#bib.bib8)). The max-gradient triage protocol operates only on training diagnostics and produces no model outputs; we do not anticipate disproportionate or novel harms beyond those already present in supervised LoRA fine-tuning. AI assistants were used for coding support, layout repair, audit checklists, and prose editing; all claims, numbers, and experimental results were author-verified against local run artifacts. Public artifacts include the arXiv source, reference logging scripts, and sanitized aggregate result JSON/CSV files backing the tables and figures ([GitHub repository](https://github.com/lucky-verma/top1-fails-dlm-lora-monitors); [result artifacts](https://github.com/lucky-verma/top1-fails-dlm-lora-monitors/tree/main/results)). The public artifacts intentionally exclude raw per-run prompts/completions, W&B metadata, local paths, checkpoints, and adapter weights.
## References
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. [Pythia: A suite for analyzing large language models across training and scaling](https://arxiv.org/abs/2304.01373). In *ICML*.
- Bie et al. (2025) Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, and 12 others. 2025. [LLaDA2.0: Scaling up diffusion language models to 100B](https://arxiv.org/abs/2512.15745). *arXiv preprint arXiv:2512.15745*.
- Chang et al. (2025) Yupeng Chang, Chenlu Guo, Yi Chang, and Yuan Wu. 2025. [LoRA-MGPO: Mitigating double descent in low-rank adaptation via momentum-guided perturbation optimization](https://doi.org/10.18653/v1/2025.findings-emnlp.34). In *Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 648–659. Association for Computational Linguistics.
- Chen et al. (2021) Mark Chen and 1 others. 2021. [Evaluating large language models trained on code](https://arxiv.org/abs/2107.03374). *arXiv preprint arXiv:2107.03374*.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. [Training verifiers to solve math word problems](https://arxiv.org/abs/2110.14168). *arXiv preprint arXiv:2110.14168*.
- Conzelmann et al. (2026) Alexander Conzelmann, Albert Catalan-Tatjer, and Shiwei Liu. 2026. [Layer collapse in diffusion language models](https://arxiv.org/abs/2605.06366). *arXiv preprint arXiv:2605.06366*.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. [QLoRA: Efficient finetuning of quantized LLMs](https://arxiv.org/abs/2305.14314). *arXiv preprint arXiv:2305.14314*.
- Electricity Maps (2024) Electricity Maps. 2024. Electricity map: Live CO2 emissions of electricity consumption. [https://app.electricitymaps.com](https://app.electricitymaps.com/).
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. [Measuring massive multitask language understanding](https://arxiv.org/abs/2009.03300). In *International Conference on Learning Representations (ICLR)*.
- Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. [LoRA: Low-rank adaptation of large language models](https://arxiv.org/abs/2106.09685). In *International Conference on Learning Representations (ICLR)*.
- Huang and Mirzasoleiman (2026) Jianhao Huang and Baharan Mirzasoleiman. 2026. [Tuning the implicit regularizer of masked diffusion language models: Enhancing generalization via insights from $k$-Parity](https://arxiv.org/abs/2601.22450). *arXiv preprint arXiv:2601.22450*.
- Jung et al. (2025) Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, and Eunhyeok Park. 2025. [GraLoRA: Granular low-rank adaptation for parameter-efficient fine-tuning](https://openreview.net/forum?id=8wvOMQ2Olw). In *Advances in Neural Information Processing Systems*.
- Kalajdzievski (2023) Damjan Kalajdzievski. 2023. [A rank stabilization scaling factor for fine-tuning with LoRA](https://arxiv.org/abs/2312.03732). *arXiv preprint arXiv:2312.03732*.
- Kuiper et al. (2025) Ruurd Jan Anthonius Kuiper, Lars de Groot, Bram van Es, Maarten van Smeden, and Ayoub Bagheri. 2025. [LAD: LoRA-adapted diffusion](https://doi.org/10.18653/v1/2025.emnlp-demos.8). In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*.
- Li et al. (2025a) Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. 2025a. [A survey on diffusion language models](https://arxiv.org/abs/2508.10875). *arXiv preprint arXiv:2508.10875*.
- Li et al. (2025b) Zhizhong Li, Sina Sajadmanesh, Jingtao Li, and Lingjuan Lyu. 2025b. [StelLA: Subspace learning in low-rank adaptation using Stiefel manifold](https://openreview.net/forum?id=55Lv1unlUL). In *Advances in Neural Information Processing Systems*.
- Liu et al. (2024) Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. 2024. [DoRA: Weight-decomposed low-rank adaptation](https://arxiv.org/abs/2402.09353). *arXiv preprint arXiv:2402.09353*.
- Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. PEFT: State-of-the-art parameter-efficient fine-tuning methods. [https://github.com/huggingface/peft](https://github.com/huggingface/peft). Software library.
- Meng et al. (2024) Fanxu Meng, Zhaohui Wang, and Muhan Zhang. 2024. [PiSSA: Principal singular values and singular vectors adaptation of large language models](https://arxiv.org/abs/2404.02948). In *Advances in Neural Information Processing Systems*.
- Mullett (2026) David Mullett. 2026. [Benchmarking recursive-collapse warning claims under matched false-positive control](https://arxiv.org/abs/2606.00329). *arXiv preprint arXiv:2606.00329*.
- Nie et al. (2025) Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. [Large language diffusion models](https://arxiv.org/abs/2502.09992). *arXiv preprint arXiv:2502.09992*.
- Park et al. (2025) JuneYoung Park, Minjae Kang, Seongbae Lee, Haegang Lee, Seongwan Kim, and Jaeho Lee. 2025. [Riemannian optimization for LoRA on the Stiefel manifold](https://doi.org/10.18653/v1/2025.findings-emnlp.1143). In *Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 20971–20985. Association for Computational Linguistics.
- Piskorz et al. (2025) Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, and Christos Louizos. 2025. [Masks can be distracting: On context comprehension in diffusion language models](https://arxiv.org/abs/2511.21338). *arXiv preprint arXiv:2511.21338*.
- Qwen Team, Alibaba (2026) Qwen Team, Alibaba. 2026. [Qwen3.5-9B model card](https://huggingface.co/Qwen/Qwen3.5-9B). Hugging Face model card.
- Sahoo et al. (2024) Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. 2024. [Simple and effective masked diffusion language models](https://arxiv.org/abs/2406.07524). In *Advances in Neural Information Processing Systems (NeurIPS)*.
- Schaeffer et al. (2023) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. 2023. [Are emergent abilities of large language models a mirage?](https://arxiv.org/abs/2304.15004) In *NeurIPS*.
- Wang et al. (2026) Shuaidi Wang, Zhan Zhuang, Ruping HUANG, and Yu Zhang. 2026. [NaRA: Noise-aware LoRA for parameter-efficient fine-tuning of diffusion LLMs](https://arxiv.org/abs/2605.29716). *arXiv preprint arXiv:2605.29716*.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. [Transformers: State-of-the-art natural language processing](https://aclanthology.org/2020.emnlp-demos.6). In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 38–45. Association for Computational Linguistics.
- Wu et al. (2026) Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. 2026. [Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding](https://openreview.net/forum?id=3Z3Is6hnOT). In *International Conference on Learning Representations (ICLR)*.
- Xu et al. (2025) Guowei Xu, Wenxin Xu, Jiawang Zhao, and Kaisheng Ma. 2025. [GIFT: Guided importance-aware fine-tuning for diffusion language models](https://arxiv.org/abs/2509.20863). *arXiv preprint arXiv:2509.20863*.
- Yang et al. (2026) Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi, and Jing Shao. 2026. [Dare: Diffusion large language models alignment and reinforcement executor](https://arxiv.org/abs/2604.04215). *arXiv preprint arXiv:2604.04215*.
- Ye et al. (2025) Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. 2025. [Dream 7B: Diffusion large language models](https://arxiv.org/abs/2508.15487). *arXiv preprint arXiv:2508.15487*.
- Yu et al. (2024) Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. [Language models are Super Mario: Absorbing abilities from homologous models as a free lunch](https://proceedings.mlr.press/v235/yu24p.html). In *Proceedings of the 41st International Conference on Machine Learning*, pages 57755–57775.
- Zhang et al. (2024) Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024. [When scaling meets LLM finetuning: The effect of data, model and finetuning method](https://arxiv.org/abs/2402.17193). In *ICLR*.
- Zhang et al. (2025) Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, and Hao Xu. 2025. [Sensitivity-LoRA: Low-load sensitivity-based fine-tuning for large language models](https://doi.org/10.18653/v1/2025.findings-emnlp.709). In *Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 13185–13199. Association for Computational Linguistics.
- Zhang et al. (2026) Tianao Zhang, Zhiteng Li, Xianglong Yan, Haotong Qin, Yong Guo, and Yulun Zhang. 2026. [Quant-dLLM: Post-training extreme low-bit quantization for diffusion large language models](https://openreview.net/forum?id=HD7tuVakmR). In *International Conference on Learning Representations (ICLR)*.
- Zhao et al. (2024) Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. 2024. [GaLore: Memory-efficient LLM training by gradient low-rank projection](https://arxiv.org/abs/2403.03507). *arXiv preprint arXiv:2403.03507*.
- Zhu et al. (2025) Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, and 7 others. 2025. [LLaDA-MoE: A sparse MoE diffusion language model](https://arxiv.org/abs/2509.24389). *arXiv preprint arXiv:2509.24389*.
## Appendix A Reproducibility and Source Trace
#### Reproducibility scope\.
The arXiv source package contains the manuscript source, bibliography, and rendered figures. The public artifact release contains paper source, reference scripts, and the sanitized aggregate result JSON/CSV files that back the tables and figures at the [GitHub repository](https://github.com/lucky-verma/top1-fails-dlm-lora-monitors) ([result artifacts](https://github.com/lucky-verma/top1-fails-dlm-lora-monitors/tree/main/results)). The manuscript values are source-mapped through local run manifests and claim-bearing aggregates rather than copied from tracker prose. The release excludes raw prompts/completions, W&B metadata, local paths, checkpoints, and adapter weights.
#### Compute and setup\.
The primary LLaDA-family experiments use H100 NVL-class GPUs; Dream, preliminary LLaDA, and some AR controls use L40S-class GPUs. The paper accounts for approximately $396$ GPU-hours across reported experiment groups and estimates $\approx 119$ kg CO2eq. Models are used through HuggingFace `transformers` and PEFT with explicit masked cross-entropy on masked positions. API pitfalls needed for reproduction are: DLM forward passes may not return a supervised loss, `generate()` is not the training-time denoising loop, target modules must be explicit, Dream-7B loads through `AutoModel`, and Dream attention masks must be boolean.
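Since DLM forward passes may not return a supervised loss, the masked cross-entropy has to be computed explicitly from the logits. The following framework-free sketch shows the core computation as a plain mean over masked positions. It is an illustration under stated assumptions, not the authors' training code: the function name is hypothetical, and any timestep or mask-ratio weighting used in the released scripts is not shown in this section and is omitted here.

```python
import math


def masked_cross_entropy(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    logits: list of per-position logit lists
    targets: list of target token ids, one per position
    is_masked: list of booleans, True where the input token was replaced by the mask token
    """
    total, count = 0.0, 0
    for row, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked positions carry no supervised loss in this sketch
        m = max(row)
        log_z = m + math.log(sum(math.exp(z - m) for z in row))  # stable log-sum-exp
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Hypothetical 3-position example with a 2-token vocabulary (NOT paper data).
logits = [[0.0, 0.0], [5.0, 0.0], [1.0, 3.0]]
targets = [0, 1, 1]
is_masked = [True, False, True]
print(masked_cross_entropy(logits, targets, is_masked))
```

In an actual run the logits would come from the model's forward pass and the boolean mask from the noising step that replaced input tokens with the mask token.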
Table 2: Top-1 warning and max-gradient summary. The top-1 warning fires in every audited DLM-family configuration while actual collapse is zero at the tested horizon. Max-gradient separation is family-local, not a global threshold.

| Cohort | Top-1 / coll. | Max-grad evidence |
| --- | --- | --- |
| DLM LLaDA2.0-mini-full ($n{=}144$) | 144/144; 0/144 | $3.23\times$ $[2.76, 3.97]$; $2.7{\times}10^{-7}$ |
| DLM LLaDA2.0-mini-crit12 ($n{=}120$) | 120/120; 0/120 | $1.48\times$ $[1.12, 1.76]$; $0.036$ |
| DLM LLaDA-method-comp ($n{=}395$) | 395/395; 0/395 | $362\times$ $[202, 779]$; $5{\times}10^{-21}$ (source scale) |
| DLM LLaDA2.1-mini ($n{=}12$) | 12/12; 0/12 | small $n$ |
| DLM Dream-7B boundary ($n{=}100$) | 100/100; 0/100 | boundary cohort |
| DLM MDLM-OWT-130M boundary ($n{=}45$) | 45/45; 0/45 | boundary cohort |
| AR Pythia/Qwen masked-CE controls ($n{=}360$) | –; 0/360 | smaller or inconsistent |
## Appendix B Top-1 Saturation and Step-$k$ Precision Sweep
Figure 3: Top-1 collapse is a pre-equilibrium artifact. (A) Across $671$ LLaDA-family configurations, top-1 mass starts high and crosses the legacy threshold before the detector samples. (B) The legacy fire-step has no stable/unstable split, while the stricter $0.95$ saturation step points in the opposite direction: unstable configurations saturate faster.

#### Timing.
Top-1 token frequency is $0.83 \pm 0.13$ at step $0$ on the LLaDA-family corpus. All configurations are already above $0.5$ at step $0$, and the median crosses $0.95$ within four optimizer steps. Stable and unstable configurations have the same median legacy fire-step ($11$ vs. $11$, Mann–Whitney $p{=}0.20$). The stricter $0.95$ crossing points in the wrong direction for a collapse detector: unstable configurations saturate faster.
### B.1 Step-$k$ Precision Sweep
Table 3: Step-$k$ held-out precision sweep. Median precision over $B{=}200$ random $80/20$ splits on $671$ LLaDA-family configurations. Max-gradient is stable from step $25$ onward; step-$200$ loss is the label by construction.

| $k$ | max-grad | loss-at-$k$ | max top-1 |
| --- | --- | --- | --- |
| 5 | $0.22$ $[0.09, 0.36]$ | $0.13$ $[0.08, 0.18]$ | $0.14$ $[0.06, 0.24]$ |
| 10 | $0.69$ $[0.45, 0.88]$ | $0.69$ $[0.50, 0.91]$ | $0.19$ $[0.11, 0.31]$ |
| 11 | $0.71$ $[0.50, 0.92]$ | $0.79$ $[0.57, 0.94]$ | $0.23$ $[0.13, 0.35]$ |
| 25 | $0.73$ $[0.53, 0.92]$ | $0.56$ $[0.36, 0.88]$ | $0.26$ $[0.17, 0.38]$ |
| 50 | $0.75$ $[0.53, 0.93]$ | $0.64$ $[0.40, 0.87]$ | $0.26$ $[0.17, 0.38]$ |
| 100 | $0.74$ $[0.53, 0.92]$ | $0.50$ $[0.26, 0.82]$ | $0.25$ $[0.17, 0.38]$ |
| 200 | $0.74$ $[0.53, 0.92]$ | $1.00$ $[0.86, 1.00]$ | $0.26$ $[0.17, 0.38]$ |
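The split-based protocol behind this table (fit a threshold on the training portion, score precision on the held-out portion, repeat over random splits, report the median) can be sketched as follows. This is a minimal illustration under assumptions that are not confirmed by the text: the threshold is chosen by maximizing F1 on the training split, candidates are restricted to observed scores, and splits that flag nothing in the held-out portion are skipped. All function names and the synthetic data are hypothetical.

```python
import random
from statistics import median


def precision_at(threshold, scores, labels):
    """Precision of the rule 'flag when score >= threshold'; None if nothing is flagged."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else None


def f1_at(threshold, scores, labels):
    """F1 of the same flagging rule, computed from raw counts."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_threshold(scores, labels):
    """Pick the observed score that maximizes F1 on the training split (lowest on ties)."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))


def split_precision(scores, labels, n_splits=200, train_frac=0.8, seed=0):
    """Median held-out precision over random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(scores)))
    results = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        t = fit_threshold([scores[i] for i in train], [labels[i] for i in train])
        p = precision_at(t, [scores[i] for i in test], [labels[i] for i in test])
        if p is not None:
            results.append(p)
    return median(results) if results else None


# Synthetic, perfectly separable data (NOT paper data): top decile is 'unstable'.
scores = [float(i) for i in range(100)]
labels = [s >= 90 for s in scores]
print(fit_threshold(scores, labels), split_precision(scores, labels))
```

Running the same routine with the score taken at different steps $k$ gives one row of the sweep per $k$; bracketed intervals like those in the table would come from the spread of the per-split values rather than from the median alone.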
## Appendix C Surfaces, Controls, and Task Probe
Table 4: LLaDA-family surface summaries. The table preserves the source-traced values used in the body; sanitized per-configuration grids are released in the public artifact.

| Evidence slice | Source-traced contrast | Reading |
| --- | --- | --- |
| 1-seed $r{=}64$ surface | max-grad $\rho{=}0.30/0.40$: $4.6/5.8$; $\rho{=}0.90/0.95$: $15.0/34.8$ | tail |
| r4 replicated rows | mean max-grad (n10/n3): $\rho{=}0.40$: $16.4{\pm}1.3$; $\rho{=}0.90/0.95$: $63.3{\pm}60.0/34.5{\pm}9.7$ | noisy |
| r64 replicated rows | mean max-grad (n10/n3): $\rho{=}0.40$: $33.4{\pm}11.0$; $\rho{=}0.90/0.95$: $84.7{\pm}60.4/41.2{\pm}17.5$ | elevated |
| 3-seed rank ratio | $r4{\to}r64$ ratio $\rho{=}0.40$: $2.04\times$; $\rho{=}0.90/0.95$: $1.34/1.19\times$ | corrected |
| 10-seed critical configs | $\rho{=}0.90$ vs $0.40$: max-grad ratios $3.87/1.69/2.53\times$ and final-loss deltas $+0.17/+0.17/+0.18$ for ranks $4/16/64$ | desc. |
| LLaDA2.1 transfer | $\rho{=}0.90$ vs $0.40$: max-grad ratios $1.76/3.52\times$ and final-loss deltas $+1.11/+0.99$ for ranks $4/64$ | scoped |

#### Scale boundary.
The LLaDA2\.0\-mini low\-mid mask recommendation is not architecture\-general\. Loss\-side high\-mask disadvantage replicates on Dream\-7B after learning\-rate calibration and in a small LLaDA2\.1\-mini transfer check, but rank\-amplification direction is mixed on MDLM\-OWT\-130M and softens on LLaDA\-MoE\-A1B\. This is why the body states a DLM\-family diagnostic and requires per\-model calibration\.
Table 5: In-domain convergence probe. Values are mean $\pm$ std over 3 seeds; lower held-out CE is better.

| Regime | Rank | $\rho$ | Max $\lVert\nabla\rVert$ | Holdout CE |
|---|---|---|---|---|
| stable | 4 | 0.40 | 18.0 | $0.43 \pm 0.08$ |
| stable | 64 | 0.40 | 21.0 | $0.42 \pm 0.10$ |
| high mask | 4 | 0.90 | 73.9 | $0.57 \pm 0.06$ |
| high mask | 64 | 0.90 | 96.4 | $0.48 \pm 0.04$ |

Table 6: Operating-window multi-benchmark masked-CE check. The $\rho=0.40$ rank contrast is not significant after correction. These null results bound the low-mid-mask recommendation to the DLM-LoRA training diagnostic and do not support a downstream generation-quality claim.

| Benchmark | n | r4 CE | r64 CE | $\Delta$ (r4 $-$ r64) | $p_{\mathrm{Bonf}}$ |
|---|---|---|---|---|---|
| GSM8K-test | 1319 | $2.66 \pm 2.30$ | $1.92 \pm 0.32$ | $+0.74$ | 1.00 |
| HumanEval | 164 | $1.10 \pm 0.03$ | $1.64 \pm 0.31$ | $-0.54$ | 1.00 |
| MMLU-subset | 250 | $1.10 \pm 0.04$ | $1.27 \pm 0.11$ | $-0.17$ | 1.00 |
#### AR control\.
Pythia and Qwen masked-CE controls show 0/360 actual collapses and smaller or inconsistent max-gradient separation. The denominator is the Pythia-1B main grid of $5 \times 12 \times 3$ (180 configurations), the Pythia-410M and Pythia-6.9B matched grids (45 each), plus five 18-configuration Pythia/Qwen sweep or extended-mask blocks ($180 + 2 \cdot 45 + 5 \cdot 18 = 360$). At matched $(r=64, \rho=0.40)$, Pythia-1B max-gradient is 16.61 versus 33.4 on LLaDA2.0-mini, and Qwen3.5-9B shows a mid-mask peak rather than the LLaDA-family U-shape.
## Appendix D Mechanism and Boundary Audit
#### Gradient concentration\.
At the worst rank-amplification corner ($r=64$, $\rho=0.95$, LLaDA2.0-mini, $n=3$ seeds, last 10 steps), per-token CE gradients are only modestly concentrated (Gini $0.287 \pm 0.056$; the largest evaluated token position contributes $1.54\% \pm 0.17\%$ of CE-gradient mass). LoRA-parameter gradients are much more concentrated (Gini $0.463 \pm 0.031$; one LoRA matrix carries $63.0\% \pm 3.6\%$ of gradient mass). This supports the body interpretation that max-gradient samples parameter-side routing while top-1 samples token-side pre-equilibrium concentration.
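The two concentration summaries quoted above (a Gini coefficient and the share carried by the single largest unit) can be computed as in the sketch below. It assumes each unit, whether a token position or a LoRA matrix, is summarized by one non-negative "mass" such as a gradient norm; the section does not state the exact mass definition, so treat that choice and the function names as illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly uniform)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = sum_i (2i - n - 1) * x_i / (n * sum x), i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def top_share(values):
    """Fraction of the total mass carried by the single largest entry."""
    total = sum(values)
    return max(values) / total if total else 0.0


# Hypothetical per-matrix gradient masses, for illustration only.
per_matrix_mass = [0.2, 0.3, 0.5, 4.0]
print(gini(per_matrix_mass), top_share(per_matrix_mass))
```

Reporting both numbers is useful because they answer different questions: Gini describes inequality across all units, while the top share isolates whether one unit dominates.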
Table 7: Single-axis boundary audit. No tested single-axis intervention prevents the fire-rate identity. The paper therefore remains diagnostic rather than a prospective controller paper.

| Axis and probe | Observed outcome | Fire? |
|---|---|---|
| Activation timing: gating window and learning-rate trigger | no timing shift | no |
| Magnitude: learning-rate warm-up $N \in \{10, 20, 50\}$ steps | no timing shift | no |
| Init amplitude: LoRA-$B$ perturbation | no timing shift | no |
| Init direction: spectral-init only (no weight subtraction) | shifts $11 \to 33$, but with first-update overshoot | no |
| Spectral-init with weight subtraction (Meng et al., [2024](https://arxiv.org/html/2606.24119#bib.bib19)) | improves loss while preserving step-0 identity | no |
| Adapter/optimizer geometry | Low-rank group bottleneck ($G=4$) (Jung et al., [2025](https://arxiv.org/html/2606.24119#bib.bib12)) and Stiefel projection do not remove fire | no |
| Loss-level entropy bonus | $\lambda \in \{0.5, 1, 2, 5, 10\}$ does not reduce fire | no |
| Portability checks | normalized thresholds remain family-specific | – |
#### Definitions and non\-portability\.
The logged top-1 warning is an argmax mode-frequency statistic, not mean maximum probability: for runner input $z_t$, evaluated positions $\mathcal{I}_t$, and $a_t(i) = \arg\max_{u \in V} p_{\theta_t}(u \mid z_t, i)$, it uses $\widehat{S}_t = \max_v \lvert\{i \in \mathcal{I}_t : a_t(i) = v\}\rvert / \lvert\mathcal{I}_t\rvert$ and $S_t = \mathbb{E}[\widehat{S}_t]$. For LLaDA-family runs, $z_t$ is the clean-batch proxy and $\mathcal{I}_t$ all positions; for MDLM-OWT, $z_t$ is the masked training input and $\mathcal{I}_t$ masked positions. Crossing a fixed threshold can therefore indicate pre-equilibrium argmax concentration rather than divergence. The corresponding max-gradient sketch is only a family-local scale heuristic:
$$G_T := \max_{0 \le t \le T} \lVert \nabla_{\theta_{\text{LoRA}}} \mathcal{L}(\theta_t) \rVert_2, \tag{2}$$

$$G_T \le C_{\text{fam}} \, \frac{\alpha_L}{r} \sqrt{T \log T} \, \sigma_{\text{fam}}(\rho, r, V),$$

where $C_{\text{fam}}$ absorbs model/data constants, $\alpha_L$ is the LoRA scaling factor, $r$ is LoRA rank, $V$ is the output vocabulary, and $\sigma_{\text{fam}}$ denotes the empirical gradient-scale term induced by mask ratio, rank, and family. We do not assert a universal closed-form bound for $\sigma_{\text{fam}}$. The sketch is scaffolding, not the basis for the claim: empirically, cross-family normalization reduces raw scale variance but loses portable precision because correlations sign-flip by family, especially on the small MDLM-OWT cohort.
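The two logged statistics defined in this subsection, the argmax mode frequency $\widehat{S}_t$ and the running maximum $G_T$ of the LoRA gradient norm, can be sketched in plain Python as below. This is an illustration under assumptions: the names are hypothetical, inputs are plain lists rather than framework tensors, and the norm is taken as a single global L2 norm over all LoRA gradient entries.

```python
import math
from collections import Counter


def argmax_mode_frequency(prob_rows):
    """S-hat_t: share of evaluated positions whose argmax is the most common argmax.

    prob_rows holds one probability (or logit) vector over the vocabulary
    for each evaluated position in I_t.
    """
    if not prob_rows:
        raise ValueError("need at least one evaluated position")
    argmaxes = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    _, count = Counter(argmaxes).most_common(1)[0]
    return count / len(argmaxes)


def lora_grad_norm(grads_by_param):
    """Global L2 norm over all LoRA gradient entries at one step (assumed form)."""
    return math.sqrt(sum(g * g for grad in grads_by_param.values() for g in grad))


class MaxGradTracker:
    """Running G_T: max over steps 0..T of the LoRA gradient L2 norm."""

    def __init__(self):
        self.value = 0.0

    def update(self, grads_by_param):
        self.value = max(self.value, lora_grad_norm(grads_by_param))
        return self.value
```

Note that `argmax_mode_frequency([[0.51, 0.49]] * 3)` returns 1.0 even though the mean maximum probability is only 0.51, which is the distinction the definition draws: the statistic can saturate without the model being confident.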
## Appendix E Method Comparison and Related Work
#### Method references\.
Named probes follow PiSSA, GraLoRA, StelLA, rsLoRA, Yu-DARE, and NaRA (Meng et al., [2024](https://arxiv.org/html/2606.24119#bib.bib19); Jung et al., [2025](https://arxiv.org/html/2606.24119#bib.bib12); Li et al., [2025b](https://arxiv.org/html/2606.24119#bib.bib16); Kalajdzievski, [2023](https://arxiv.org/html/2606.24119#bib.bib13); Yu et al., [2024](https://arxiv.org/html/2606.24119#bib.bib33); Wang et al., [2026](https://arxiv.org/html/2606.24119#bib.bib27)).
Table 8: Operating-cell method comparison. Source-mapped masked-CE summary.

| Protocol | Claim-facing conclusion |
|---|---|
| Default learning rate | rsLoRA is higher CE on all three benches; Yu-DARE trends similarly with high seed variance, so we treat this as learning-rate mismatch. |
| Best learning rate ($n=10$) | rsLoRA remains higher CE ($+3.7$–$4.2\%$); NaRA is lower ($-1.0$–$4.7\%$), with only MMLU Bonferroni-significant. This is learning-rate-dependent, not a method-quality claim. |
| GSM8K gen. check | Exact match is $0/20$, so generation quality is excluded from paper claims. |
#### Related work taxonomy\.
LoRA-family and AR-side PEFT stability work assumes dense next-token supervision and does not expose a mask-ratio axis. DLM work covers objectives, scaling, decoding, masking schedules, train-inference mismatch, and systems that use fixed LoRA-like adapters, but we are not aware of prior work that tests top-1 warning precision as a DLM-LoRA PEFT monitor with matched AR masked-CE controls. The closest genre is metric refutation: a familiar diagnostic changes meaning outside its calibration regime.
#### Scope\.
Claim-bearing denominators are 816 DLM PEFT configurations, 671 LLaDA-family configurations, and 360 AR masked-CE controls; longer horizons, generation quality, full fine-tuning, and coupled controllers remain future work.