Learnability-Informed Fine-Tuning of Diffusion Language Models

arXiv cs.CL 05/25/26, 04:00 AM Papers
diffusion-language-models fine-tuning reasoning learnability sft nlp icml
Summary
We propose LIFT, a learnability-informed fine-tuning algorithm for diffusion language models that aligns training with token difficulty and time step, achieving substantial gains on reasoning benchmarks.
arXiv:2605.22939v1 Announce Type: new Abstract: We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.
Cached at: 05/25/26, 08:55 AM
# Learnability-Informed Fine-Tuning of Diffusion Language Models
Source: [https://arxiv.org/html/2605.22939](https://arxiv.org/html/2605.22939)
Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji

###### Abstract

We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks *learnability*, namely *what* and *when* tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a $3\times$ relative gain on AIME’24 and AIME’25. Our code is publicly available at [https://github.com/divelab/LIFT](https://github.com/divelab/LIFT).

Machine Learning, ICML


## 1 Introduction

Diffusion models have shown impressive performance in image (Song and Ermon, [2019](https://arxiv.org/html/2605.22939#bib.bib11); Nichol and Dhariwal, [2021](https://arxiv.org/html/2605.22939#bib.bib10)) and video (Ho et al., [2022](https://arxiv.org/html/2605.22939#bib.bib26)) generation. Recently, diffusion models have been successfully applied to textual data, leading to a surge of interest in Diffusion Language Models (DLMs) (Austin et al., [2021a](https://arxiv.org/html/2605.22939#bib.bib12); Sahoo et al., [2024](https://arxiv.org/html/2605.22939#bib.bib14)). A central promise of DLMs over autoregressive language models (ARLMs) is their ability to generate multiple tokens in parallel per model call, yielding substantial gains in inference throughput (Khanna et al., [2025](https://arxiv.org/html/2605.22939#bib.bib2); Wu et al., [2026](https://arxiv.org/html/2605.22939#bib.bib5)). Several open-weight DLMs, such as LLaDA (Nie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib15)) and Dream (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16)), are now available, and they largely match the performance of similarly sized ARLM counterparts.

![Refer to caption](https://arxiv.org/html/2605.22939v1/x1.png)

Figure 1: Performance on AIME benchmarks. Pass@16 accuracy comparison on AIME’24 and AIME’25 for LLaDA-8B-Instruct, vanilla SFT, and LIFT. LIFT achieves substantial relative improvements over vanilla SFT on both challenging mathematical reasoning datasets, demonstrating the effectiveness of learnability-informed training.

![Refer to caption](https://arxiv.org/html/2605.22939v1/x2.png)

![Refer to caption](https://arxiv.org/html/2605.22939v1/x3.png)

(a) Frequency vs. confidence.

![Refer to caption](https://arxiv.org/html/2605.22939v1/x4.png)

(b) Token-level confidence across timesteps.

Figure 2: Token Analysis with LLaDA. Using data collated from 4 post-training corpora (Muennighoff et al., [2025](https://arxiv.org/html/2605.22939#bib.bib23); Bercovich and others, [2025](https://arxiv.org/html/2605.22939#bib.bib30); Open-R1, [2025](https://arxiv.org/html/2605.22939#bib.bib31); Team OLMo and others, [2025](https://arxiv.org/html/2605.22939#bib.bib32)), we analyze 0.5B masked tokens and aggregate token-level confidence and frequencies. (a) We bin tokens by log-scaled frequency and plot the mean model confidence against the average frequency. The marginalized plot (top) reveals that rare tokens have lower confidence on average, demonstrating that certain tokens are more difficult to predict (*what* dimension). We perform a more nuanced analysis by breaking down the marginalized plot by diffusion timestep $t$ (bottom), revealing an interaction between the *what* and *when* dimensions. Specifically, we observe a $t$-induced bias: at large $t$, when many of the model inputs are masked, low-frequency tokens become disproportionately difficult to predict, suggesting that the information content of heavily masked inputs arising later in the forward diffusion process, as diffusion time $t \to 1^{+}$, is insufficient to learn certain tokens reliably. Conversely, as $t \to 0^{-}$, less frequent tokens become more learnable, whereas predicting frequent tokens becomes trivial. (b) We sample representative high- and low-frequency tokens, visualizing their (average) confidence across diffusion time. Rare tokens increasingly suffer as $t \to 1^{+}$, and experience more extreme drops in confidence than high-frequency tokens.

Following the success of post-training of ARLMs to improve reasoning, recent works have explored post-training of DLMs using supervised or instruction fine-tuning (SFT) (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16); Nie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib15)) and reinforcement learning (RL) (Zhao et al., [2025](https://arxiv.org/html/2605.22939#bib.bib21)). However, in contrast to ARLMs, RL in DLMs is substantially more challenging both technically and algorithmically due to intractable sequence-level likelihoods, and most works on RL for DLMs propose approximations to overcome this challenge (Zhao et al., [2025](https://arxiv.org/html/2605.22939#bib.bib21); Kunde et al., [2026](https://arxiv.org/html/2605.22939#bib.bib3); Wang et al., [2025](https://arxiv.org/html/2605.22939#bib.bib4)). SFT has been studied less thoroughly, and to date no work has systematically examined the challenges involved in applying SFT to DLMs. Recent results suggest that SFT can in fact degrade model performance relative to pretraining (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16)). This motivates the central question of our work, which we decompose into two sub-questions: (i) what are the major factors that influence SFT post-training of DLMs, and (ii) how can we design an SFT algorithm that accounts for them to effectively post-train DLMs?

As our first contribution, we address (i) by analyzing SFT in DLMs and characterizing its failure cases. Specifically, we conduct an extensive analysis in Fig. [2(a)](https://arxiv.org/html/2605.22939#S1.F2.sf1) spanning 0.5B tokens collated from four popular post-training reasoning datasets (Muennighoff et al., [2025](https://arxiv.org/html/2605.22939#bib.bib23); Bercovich and others, [2025](https://arxiv.org/html/2605.22939#bib.bib30); Team OLMo and others, [2025](https://arxiv.org/html/2605.22939#bib.bib32); Open-R1, [2025](https://arxiv.org/html/2605.22939#bib.bib31)). Across several pre-trained DLMs (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16); Nie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib15)), our findings reveal two crucial considerations whose interplay governs SFT dynamics: *what* tokens are learned, and *when* tokens are learned in the diffusion process. Our findings show that rare tokens in the corpus are more difficult to predict than frequent tokens (*what*). Additionally, rare tokens become more learnable when more context is available, corresponding to early forward diffusion times. However, at later forward diffusion times, the reduced information in the input disproportionately lowers the model’s confidence on rare tokens, in some cases making them effectively unlearnable (*when*). These findings suggest that as forward diffusion time $t \to 1^{+}$, rare tokens often become unlearnable, making it more effective to focus compute on frequent tokens. In contrast, as forward diffusion time $t \to 0^{-}$, frequent tokens are easy to predict, while rare tokens become more learnable.
While prior works have proposed heuristics partially adhering to these guidelines by considering either the *what* or the *when* dimension in isolation (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16); Xu et al., [2026](https://arxiv.org/html/2605.22939#bib.bib25)), our study is the first to systematically analyze their combined effect during supervised fine-tuning. We show that modeling the interaction between token difficulty and diffusion time is critical for improving training.

As our second contribution, motivated by these insights, we propose and develop LIFT, the first post-training approach to target the interaction between *what* and *when* during DLM training. LIFT trains the model on masked tokens that are most appropriate to learn at each diffusion time given the available context. We obtain state-of-the-art results among various SFT training frameworks across two DLM base models on four reasoning benchmarks. We also evaluate LIFT on the challenging AIME-24 (AIME, [2024](https://arxiv.org/html/2605.22939#bib.bib33)) and AIME-25 (Math-AI Team and Zhang, [2025](https://arxiv.org/html/2605.22939#bib.bib34)), where it achieves up to a $3\times$ improvement over SFT baselines. Remarkably, LIFT attains performance close to the RLVR baseline d1 (Zhao et al., [2025](https://arxiv.org/html/2605.22939#bib.bib21)) while using roughly $500\times$ fewer GPU hours, establishing a new Pareto frontier for DLM post-training.

![Refer to caption](https://arxiv.org/html/2605.22939v1/figures/training_framework_diagram/LIFT_framework.png)

Figure 3: Learnability-Informed Fine-Tuning (LIFT). LIFT increases learnability by using model confidence and diffusion time to construct a learnability-informed mask so as to train on the highest-utility tokens at each point in the diffusion process. Utility is estimated as a function of model confidence and diffusion time. In the first stage, a mask is sampled with rate $t+\rho$ and used to estimate model confidences $p_{\theta}(x_{0} \mid x_{t+\rho})$ over all masked positions. LIFT then selects a subset of masked tokens from $x_{t+\rho}$ to supervise based on model confidences and diffusion time. Depending on the diffusion time, subset selection is either the top-$K$ most confident tokens, the bottom-$K$ least confident tokens, or vanilla (random). The mapping from diffusion time to subset selection method is chosen so as to increase the learnability and utility of each training step according to the insights from our analysis in Sec. [4](https://arxiv.org/html/2605.22939#S4).

## 2 Related Work

#### Diffusion Language Models

extend the success of diffusion models in continuous domains like image generation (Ho et al., [2020](https://arxiv.org/html/2605.22939#bib.bib1); Nichol and Dhariwal, [2021](https://arxiv.org/html/2605.22939#bib.bib10); Song and Ermon, [2019](https://arxiv.org/html/2605.22939#bib.bib11)) to language. However, applying continuous diffusion to discrete text is inherently difficult (Austin et al., [2021a](https://arxiv.org/html/2605.22939#bib.bib12)). To tackle this, Masked Diffusion Language Models (Sahoo et al., [2024](https://arxiv.org/html/2605.22939#bib.bib14)) offer a discrete alternative by leveraging masked language modeling (Devlin et al., [2019](https://arxiv.org/html/2605.22939#bib.bib13)), wherein tokens are randomly masked and the model learns to unmask them. Recent models (Nie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib15); Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16)) have shown performance competitive with autoregressive LLMs (ARMs) in mathematical reasoning, code generation (Zhu et al., [2025](https://arxiv.org/html/2605.22939#bib.bib17)), and multi-modal tasks (Li et al., [2025](https://arxiv.org/html/2605.22939#bib.bib18)), indicating that DLMs can perform complex reasoning. This makes DLM post-training a natural next step, with the goal of achieving reasoning gains similar to those seen in ARMs.

#### Post\-Training

of DLMs mirrors that of autoregressive models, following one of two approaches, namely reinforcement learning with verifiable rewards (RLVR) (Guo et al., [2025](https://arxiv.org/html/2605.22939#bib.bib19); Parashar et al., [2025](https://arxiv.org/html/2605.22939#bib.bib20); Zhao et al., [2025](https://arxiv.org/html/2605.22939#bib.bib21)), or supervised fine-tuning (SFT). SFT with high-quality chain-of-thought data can achieve performance comparable to RL-based methods (Zelikman et al., [2022](https://arxiv.org/html/2605.22939#bib.bib22); Muennighoff et al., [2025](https://arxiv.org/html/2605.22939#bib.bib23)). For DLMs, recent work with SFT has explored difficulty-informed training by considering *what* is being predicted (Li et al., [2025](https://arxiv.org/html/2605.22939#bib.bib18); Bie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib24); Xu et al., [2026](https://arxiv.org/html/2605.22939#bib.bib25)), since some tokens are inherently harder to predict, and *when* it is predicted (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16)), as inputs with heavier masking make prediction more challenging. In this work, we investigate how jointly accounting for the interaction between *what* and *when* can improve the effectiveness of DLM post-training in enhancing reasoning performance.

## 3 Preliminaries

MDLMs (Sahoo et al., [2024](https://arxiv.org/html/2605.22939#bib.bib14); Nie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib15); Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16)) define a forward diffusion process on an input sequence $x_{0}$ from $p_{\text{data}}$, producing continuously indexed corrupted sequences $\{x_{t}\}_{t\in[0,1]}$ by progressively replacing tokens with *\[MASK\]*. The amount of information present in $x_{t}$ decreases monotonically with $t$ such that $x_{1}$ has all tokens masked. To generate a new sequence, MDLMs parameterize a bi-directional predictor $p_{\theta}$ to reverse the diffusion process starting from $x_{1}$. $p_{\theta}$ is trained by sampling a diffusion time $t \sim \pi(\cdot)$ with $t \in [0,1]$ (commonly $t \sim \mathrm{Uniform}(0,1)$). To sample $x_{t}$, each token in $x_{0}$ is masked with probability $1-\alpha_{t}$. Here, we follow the same setup as LLaDA (Nie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib15)), wherein $\alpha_{t} = 1-t$. Given the corrupted input $x_{t}$, $p_{\theta}$ learns to recover the original tokens from $x_{0}$ at the masked positions. The MDLM training objective is the negative evidence lower bound (NELBO), which upper bounds the negative log-likelihood of the data. For a masked sequence $x_{t}$, the NELBO is given as

$$-\mathbb{E}_{t\sim\mathcal{U}[0,1],\,x_{0}\sim p_{\text{data}}}\left[\frac{1}{t}\sum_{k=1}^{|x_{0}|}\mathbf{1}\left\{x_{t}^{k}=\text{[MASK]}\right\}\log p_{\theta}\left(x_{0}^{k}\mid x_{t}\right)\right] \tag{1}$$

where $|x_{0}|$ denotes the sequence length of $x_{0}$, $x_{t}^{k}$ is the token at position $k$ in the corrupted input, and $\mathbf{1}\{x_{t}^{k}=\text{[MASK]}\}$ restricts the loss to masked positions (predicting the corresponding $x_{0}^{k}$ given $x_{t}$). In vanilla SFT, the same loss is optimized directly on a supervised training set, with prompt tokens left unmasked.
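As a rough illustration of Eq. (1), the following NumPy sketch samples $x_t$ with masking probability $t$ and computes a single-sample estimate of the loss for one sequence. The names `MASK_ID`, `forward_mask`, and `nelbo_term` are hypothetical and not taken from the paper's code, and the model is replaced by a precomputed array of log-probabilities.

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def forward_mask(x0, t, rng, prompt_len=0):
    """Sample x_t: mask each non-prompt token independently with probability t."""
    x0 = np.asarray(x0)
    masked = rng.random(x0.shape[0]) < t
    masked[:prompt_len] = False  # prompt tokens are left unmasked in SFT
    return np.where(masked, MASK_ID, x0)


def nelbo_term(log_probs, x0, xt, t):
    """Single-sample estimate of Eq. (1) for one sequence.

    log_probs has shape (seq_len, vocab) and holds log p_theta(. | x_t)
    at every position; only masked positions contribute to the loss.
    """
    x0 = np.asarray(x0)
    is_masked = np.asarray(xt) == MASK_ID
    token_ll = log_probs[np.arange(x0.shape[0]), x0]  # log-prob of ground truth
    return -(token_ll * is_masked).sum() / t
```

With a uniform predictor over a vocabulary of size 4, two masked positions, and $t = 0.5$, this evaluates to $-(2 \log \tfrac{1}{4}) / 0.5 = 4 \log 4$.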

## 4 Analysis

In this section, we analyze token difficulty around the central question (Fig. [2(a)](https://arxiv.org/html/2605.22939#S1.F2.sf1)): *what* tokens should be learned, and *when* in the diffusion process?

#### Which tokens are difficult?

We investigate this question by analyzing denoising confidence, defined as the probability $p_{\theta}(x_{0}^{k}\mid x_{t})$ assigned to the ground truth token $x_{0}^{k}$ at a masked position $k$, for a given noisy sequence $x_{t}$. Prior work in ARMs has shown that rare tokens, due to limited exposure during training, are harder to learn and consequently to predict (Kandpal et al., [2023](https://arxiv.org/html/2605.22939#bib.bib36); Parashar et al., [2024](https://arxiv.org/html/2605.22939#bib.bib37); Udandarao et al., [2024](https://arxiv.org/html/2605.22939#bib.bib38)). We test this in DLMs by masking inputs at random time steps (excluding prompt tokens) and measuring prediction confidence for the masked tokens. We then group tokens by their corpus frequency to analyze how difficulty varies with rarity.

#### When do tokens become difficult to predict?

In DLMs, prediction difficulty depends not only on token identity but also on when the token is recovered during the denoising process. As the forward diffusion progresses, more of the input is masked, reducing the available context and making prediction harder. To analyze how difficulty evolves over time, we quantize the diffusion time $t$ into logarithmic bins ranging from $2^{-2}$ to $2^{-1/4}$ and measure average prediction confidence within each bin.
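The time-binned measurement described above can be sketched as follows. The text specifies only the range $2^{-2}$ to $2^{-1/4}$, so the bin count (seven bins of width $1/4$ in $\log_2 t$) and the function name `mean_confidence_by_time_bin` are assumptions made for illustration.

```python
import numpy as np


def mean_confidence_by_time_bin(t, conf, lo=-2.0, hi=-0.25, n_bins=7):
    """Average confidence within bins that are uniform in log2(t).

    Assumption: 7 bins of width 1/4 in log2(t); the paper states only the range.
    Samples with t outside [2**lo, 2**hi) are ignored.
    """
    t = np.asarray(t, dtype=float)
    conf = np.asarray(conf, dtype=float)
    edges = 2.0 ** np.linspace(lo, hi, n_bins + 1)
    idx = np.digitize(t, edges) - 1  # bin i covers [edges[i], edges[i+1])
    means = np.full(n_bins, np.nan)  # NaN marks an empty bin
    for i in range(n_bins):
        in_bin = idx == i
        if in_bin.any():
            means[i] = conf[in_bin].mean()
    return edges, means
```

Here `conf` would hold one denoising confidence $p_{\theta}(x_{0}^{k}\mid x_{t})$ per masked token and `t` the diffusion time at which that token was masked. Repeating the call per frequency bucket would give a breakdown along both the *what* and *when* dimensions.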

#### Models and Datasets\.

We conduct our analysis using two diffusion language models, LLaDA (Nie et al., [2025](https://arxiv.org/html/2605.22939#bib.bib15)) and Dream (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16)), chosen for their differences in architecture and pre-training data. For the analysis, we use arithmetic reasoning post-training datasets that contain both questions and detailed reasoning traces: s1K (Muennighoff et al., [2025](https://arxiv.org/html/2605.22939#bib.bib23)), the Nemotron Post-Training Dataset (Bercovich and others, [2025](https://arxiv.org/html/2605.22939#bib.bib30)), Mixture of Thoughts (Open-R1, [2025](https://arxiv.org/html/2605.22939#bib.bib31)), and DociThink-RL (Team OLMo and others, [2025](https://arxiv.org/html/2605.22939#bib.bib32)). Following the filtering procedure from Dream (Ye et al., [2025](https://arxiv.org/html/2605.22939#bib.bib16)), we select examples where the combined length of the question and answer is fewer than 4096 tokens. This results in a dataset of approximately one million examples, totaling around 500 million tokens analyzed.

#### Analysis Insights\.

We first confirm that, on average, rarer tokens are harder to predict than more frequent tokens (Fig. [2](https://arxiv.org/html/2605.22939#S1.F2)a). Conditioning on diffusion time reveals a more nuanced pattern. As $t \to 0^{-}$, when substantial context remains unmasked, even rare tokens are comparatively easy to recover. As $t$ increases, the available information decreases. For approximately $t \geq 2^{-1}$, prediction difficulty rises for all tokens, with rare tokens becoming the most challenging (Fig. [2](https://arxiv.org/html/2605.22939#S1.F2)). Overall, these results show that difficulty is jointly determined by *what* is being predicted (token frequency) and *when* it is predicted (diffusion time). This suggests that with the limited context accompanying $t \to 1^{+}$, model capacity and training iterations may not be optimally utilized by attempting to denoise rare tokens, and that efforts should instead be directed towards predicting tokens that are more feasible to learn. As information increases with decreasing $t$, rare tokens become more learnable, whereas the prediction of more frequent tokens is trivial and of limited benefit to train. We therefore propose to incorporate both dimensions so that training emphasizes targets that maximize learnability under the available context.

## 5 Methods

Algorithm 1: LIFT: Learnability-Informed Fine-Tuning

Require: Dataset $p_{\text{data}}$, parameter $H \geq 2$, chosen variant (LIFT or LIFT-A), learning rate $\eta$

1: repeat

2: $x_{0}\sim p_{\text{data}},\quad t\sim\mathcal{U}[0,1],\quad \rho\sim\mathcal{U}[0,1-t]$ $\triangleright$ Sample input, timestep, and secondary ratio.

3: $x_{t+\rho}\sim q(x_{t+\rho}\mid x_{0}),\quad c_{k}\leftarrow p_{\theta}(x_{0}^{k}\mid x_{t+\rho})\quad\forall k\in\mathcal{M}_{t+\rho}$ $\triangleright$ Mask input and compute confidences.

4: $\mathcal{S}_{t}\leftarrow$ Eq. ([2](https://arxiv.org/html/2605.22939#S5.E2)) with $K=\lfloor t\cdot|x_{0}|\rfloor$ $\triangleright$ Select tokens to supervise.

5: if LIFT then

6: $x_{t}\leftarrow x_{t+\rho}$ $\triangleright$ Create $x_{t}$ based on learnability.

7: for $k\in\mathcal{M}_{t+\rho}\setminus\mathcal{S}_{t}$ do

8: $x_{t}^{k}\leftarrow x_{0}^{k}$ $\triangleright$ Unmask unsupervised masked tokens.

9: end for

10: else if LIFT-A then

11: $x_{t}\leftarrow x_{t+\rho}$

12: $t\leftarrow t+\rho$

13: end if

14: $\theta\leftarrow\theta-\eta\nabla_{\theta}\left[-\frac{1}{t}\sum_{k\in\mathcal{S}_{t}}\log p_{\theta}(x_{0}^{k}\mid x_{t})\right]$ $\triangleright$ Take gradient descent step.

15: until converged

In this section, we present LIFT, a supervised fine-tuning method for efficient post-training of diffusion language models. LIFT is motivated by our analysis, which shows that the difficulty of predicting tokens depends on the interaction between *what* and *when*, i.e., token frequency and the number of unmasked tokens available in the input. Following this principle, LIFT adaptively selects which tokens to learn at each timestep, focusing on easy and frequent tokens when the input is heavily masked, and on rare and difficult tokens when more context is available. This enhances the information gained in each training step by simultaneously ensuring that target tokens are learnable and non-trivial to predict.

#### Which tokens to select for training?

Instead of training directly on the input $x_{t}$ randomly masked at timestep $t$, LIFT applies learnability-informed masking to maximize the learning signal of training targets. This is done by first sampling a secondary masking ratio $\rho\sim\mathcal{U}(0,1-t)$ to construct a more corrupted input $x_{t+\rho}$ from which learnability can be estimated. For example, if $t=0.4$ and $\rho=0.3$, we create $x_{t+\rho}$ where 70% of the tokens are masked. Having created $x_{t+\rho}$, we obtain confidence scores for the ground truth token, $c_{k}=p_{\theta}(x_{0}^{k}\mid x_{t+\rho})$, at each masked position $k$. We define token difficulty simply as the corresponding loss, $\ell_{k}=-\log c_{k}$, where lower confidence naturally indicates a harder token.

LIFT then constructs a learnability-informed mask by selecting a subset of the masked tokens in $x_{t+\rho}$ to supervise, depending on diffusion time, e.g., the top-$K$ tokens (highest confidence, easy tokens) or the bottom-$K$ tokens (lowest confidence, hard tokens), where $K=t\cdot|x_{0}|$. The remaining masked positions, which are not selected for training, are filled in using the original tokens from the clean input $x_{0}$, giving us $x_{t}$. LIFT then uses $x_{t}$ as input to $p_{\theta}$ for computing the NELBO in Eq. ([1](https://arxiv.org/html/2605.22939#S3.E1)).
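The mask construction described above can be sketched with a small worked example mirroring the $t=0.4$, $\rho=0.3$ case on a 10-token sequence (7 masked positions, $K=4$). The function name `build_lift_input` and the sentinel `MASK_ID` are hypothetical, and the confidences are made-up numbers rather than model outputs.

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def build_lift_input(x0, x_noisy, conf, k, easiest=True):
    """Keep k masked positions of x_noisy as targets and reveal the rest.

    conf[i] is the model's probability of the ground-truth token at position i;
    only entries at masked positions are used. Returns (x_t, selected indices).
    """
    x0 = np.asarray(x0)
    conf = np.asarray(conf, dtype=float)
    masked = np.flatnonzero(np.asarray(x_noisy) == MASK_ID)
    order = masked[np.argsort(conf[masked], kind="stable")]  # ascending confidence
    k = min(k, order.size)
    # Top-K takes the most confident (easy) tokens, Bottom-K the least confident.
    selected = order[order.size - k:] if easiest else order[:k]
    x_t = x0.copy()              # unselected masked positions are filled from x0
    x_t[selected] = MASK_ID      # only the selected positions stay masked
    return x_t, np.sort(selected)


x0 = np.arange(10, 20)
x_noisy = x0.copy()
x_noisy[[0, 2, 3, 5, 6, 8, 9]] = MASK_ID          # 70% masked, as in t + rho = 0.7
conf = [0.05, 0, 0.80, 0.10, 0, 0.60, 0.30, 0, 0.95, 0.40]
x_t_easy, top4 = build_lift_input(x0, x_noisy, conf, k=4, easiest=True)
x_t_hard, bottom4 = build_lift_input(x0, x_noisy, conf, k=4, easiest=False)
```

In this toy case the top-4 selection keeps positions 2, 5, 8, and 9 masked, while the bottom-4 selection keeps positions 0, 3, 6, and 9, so the resulting $x_t$ has exactly $K$ masked tokens in either mode.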

#### When should tokens be learned?

As demonstrated in Sec. [4](https://arxiv.org/html/2605.22939#S4), learnability of tokens depends on token difficulty and the amount of context available, i.e., the proportion of unmasked tokens, which is a function of diffusion time. When $t\to 1^{+}$, the hardest tokens to predict can be unlearnable due to insufficient information, whereas when $t\to 0^{-}$, the easiest tokens are trivial to predict. Both cases are undesirable, as neither provides a high-utility learning signal.

LIFT addresses this by selecting the subset of masked tokens from $x_{t+\rho}$ according to the diffusion time. Let $\mathcal{M}_{t}$ and $\mathcal{M}_{t+\rho}$ denote the sets of masked token indices at times $t$ and $t+\rho$, respectively. We define operators $\operatorname{Top}_{K}(\mathcal{S},c)$ and $\operatorname{Bottom}_{K}(\mathcal{S},c)$ that return the subset of $K$ indices from a set $\mathcal{S}$ corresponding to the highest and lowest confidence scores $c$, respectively. To control the scheduling behavior, we introduce a new parameter $H\geq 2$ that partitions $t$ into three regimes to define the selected subset for supervision, denoted $\mathcal{S}_{t}\subseteq\mathcal{M}_{t+\rho}$:

$$\mathcal{S}_{t}=\begin{cases}\operatorname{Bottom}_{K}(\mathcal{M}_{t+\rho},c)&\text{if }t\in\left(0,\frac{1}{H}\right)\\[10pt] \mathcal{M}_{t}&\text{if }t\in\left[\frac{1}{H},1-\frac{1}{H}\right)\\[10pt] \operatorname{Top}_{K}(\mathcal{M}_{t+\rho},c)&\text{if }t\in\left[1-\frac{1}{H},1\right]\end{cases} \tag{2}$$

This selection reflects the insights drawn from our analysis: when the input has many masked positions ($t\to 1^{+}$), we train on easy tokens using Top-$K$, where learnability is highest despite limited context. When corruption is moderate, we revert to standard vanilla SFT. When the input has low corruption ($t\to 0^{-}$), we learn the hardest tokens using Bottom-$K$. This ensures that tokens are learned when they are most appropriate to learn, based on the level of context available at each timestep. By replacing the standard masking indicator with $\mathbf{1}\{k\in\mathcal{S}_{t}\}$, the modified NELBO restricts the loss exclusively to this subset:

$$\mathcal{L}_{\text{LIFT}}=-\mathbb{E}_{\begin{subarray}{c}t\sim\mathcal{U}[0,1]\\ x_{0}\sim p_{\text{data}}\end{subarray}}\left[\frac{1}{t}\sum_{k=1}^{|x_{0}|}\mathbf{1}\left\{k\in\mathcal{S}_{t}\right\}\log p_{\theta}\left(x_{0}^{k}\mid x_{t}\right)\right] \tag{3}$$

In our experiments, we find that integer values $H=2$ or $H=3$ work well in practice. As $H$ increases beyond 3, LIFT behaves increasingly like vanilla SFT, since the middle region $\left[\frac{1}{H},1-\frac{1}{H}\right)$ dominates the training.
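To make the role of $H$ concrete, the sketch below maps a diffusion time to one of the three regimes of Eq. (2) and computes how much of $t\sim\mathcal{U}[0,1]$ falls into the vanilla regime, which has length $1-\frac{2}{H}$. The function names and regime labels are illustrative choices, not identifiers from the paper's code.

```python
def lift_regime(t, H):
    """Return which branch of Eq. (2) applies at diffusion time t."""
    if not (0.0 < t <= 1.0):
        raise ValueError("t must lie in (0, 1]")
    if H < 2:
        raise ValueError("H must be at least 2")
    if t < 1.0 / H:
        return "bottom_k"  # low corruption: supervise the hardest tokens
    if t < 1.0 - 1.0 / H:
        return "vanilla"   # moderate corruption: standard SFT masking
    return "top_k"         # heavy corruption: supervise the easiest tokens


def vanilla_fraction(H):
    """Probability that t ~ U[0, 1] lands in the middle (vanilla) regime."""
    return 1.0 - 2.0 / H
```

Under this reading, $H=2$ leaves no vanilla regime at all, $H=3$ assigns one third of the sampled times to vanilla SFT, and $H=10$ assigns 80%, which is consistent with the observation that larger $H$ behaves increasingly like vanilla SFT.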

#### Approximate Variant of LIFT\.

Since LIFT selects tokens for training based on confidence under the model $p_{\theta}$, it requires two forward passes: one to obtain token confidences $p_{\theta}(x_{0}^{k}\mid x_{t+\rho})$, and another, $p_{\theta}(x_{0}^{k}\mid x_{t})$, to compute the final loss. To reduce this computational overhead, our lightweight variant, LIFT-A, performs only a single forward pass at $t+\rho$ and applies a gating mask that zeroes out the loss for tokens not selected for supervision (i.e., those outside $\mathcal{S}_{t}$). Because the loss is evaluated at $t+\rho$ rather than the true diffusion timestep $t$, this objective is a biased NELBO. The approximation thus trades loss accuracy for efficiency by avoiding the second forward pass at $t$.

$$\mathcal{L}_{\text{LIFT-A}}=-\mathbb{E}_{\begin{subarray}{c}t\sim\mathcal{U}[0,1]\\ \rho\sim\mathcal{U}[0,1-t]\\ x_{0}\sim p_{\text{data}}\end{subarray}}\left[\frac{1}{t+\rho}\sum_{k=1}^{|x_{0}|}\mathbf{1}\left\{k\in\mathcal{S}_{t}\right\}\log p_{\theta}\left(x_{0}^{k}\mid x_{t+\rho}\right)\right] \tag{4}$$
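The gating in Eq. (4) can be illustrated with a short NumPy sketch. It assumes the per-position log-probabilities from the single forward pass at $t+\rho$ are already available and that the supervised index set has been chosen; `lift_a_loss` is a hypothetical name and this is not the authors' implementation.

```python
import numpy as np


def lift_a_loss(token_log_probs, selected, t, rho):
    """Single-sample estimate of Eq. (4) for one sequence.

    token_log_probs[k] = log p_theta(x0^k | x_{t+rho}) from the one forward pass.
    selected holds the indices in S_t; every other position is gated to zero.
    """
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    gate = np.zeros(token_log_probs.shape[0], dtype=bool)
    gate[np.asarray(selected, dtype=int)] = True
    # The weight uses t + rho, the masking rate at which the loss is evaluated.
    return -(token_log_probs * gate).sum() / (t + rho)
```

For instance, with two selected positions that each have confidence 0.5, $t=0.4$, and $\rho=0.1$, the loss is $-(2\log 0.5)/0.5 = 4\log 2$. Compared with Eq. (3), no second pass on $x_t$ is needed, at the cost of conditioning on the more heavily masked $x_{t+\rho}$.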

#### Connection to Curriculum Learning\.

While our method is motivated by the notion of token difficulty, it does not follow conventional curriculum learning (Bengio et al., [2009](https://arxiv.org/html/2605.22939#bib.bib39)), where data is presented in increasing order of difficulty. Instead, LIFT adaptively performs token selection based on learnability, accounting for both the available unmasked context at each timestep and the model’s improving capacity throughout training. However, recent work has shown that curriculum learning and adaptive sampling can offer complementary benefits (Parashar et al., [2025](https://arxiv.org/html/2605.22939#bib.bib20); Yu et al., [2025](https://arxiv.org/html/2605.22939#bib.bib40); Chen et al., [2025](https://arxiv.org/html/2605.22939#bib.bib41)), and future work could explore integrating the two.

## 6 Experiments

In this section, we evaluate LIFT on a suite of mathematical reasoning tasks spanning a range of difficulty levels. We demonstrate that LIFT consistently outperforms all baseline methods. The results indicate that the difficulty-informed training of LIFT is a simple yet effective approach for SFT-based post-training of diffusion language models. We begin by describing the datasets and baseline methods. We then explain the evaluation metrics and main experimental results, followed by detailed ablations. We include the training implementation details in the Appendix.

### 6.1 Setup

#### Training Datasets\.

We use s1K (Muennighoff et al., [2025](https://arxiv.org/html/2605.22939#bib.bib23)), which comprises 1,000 high-quality chain-of-thought (CoT) traces generated by Gemini. Prior work has shown the effectiveness of supervised fine-tuning on s1K (Xu et al., [2026](https://arxiv.org/html/2605.22939#bib.bib25); Zhao et al., [2025](https://arxiv.org/html/2605.22939#bib.bib21)), making it a strong baseline for comparison. To explore the effect of varying fine-tuning data on training, we also construct a larger dataset of approximately 12,000 problems by randomly sampling the collated datasets used in our analysis, namely, Nemotron Post-training Dataset (Bercovich and others, [2025](https://arxiv.org/html/2605.22939#bib.bib30)), Mixture of Thoughts (Open-R1, [2025](https://arxiv.org/html/2605.22939#bib.bib31)), and DociThink R1 (Team OLMo and others, [2025](https://arxiv.org/html/2605.22939#bib.bib32)). We refer to this dataset as LIFT-SFT-12K. While s1K (Muennighoff et al., [2025](https://arxiv.org/html/2605.22939#bib.bib23)) is a highly curated dataset with clean, expert-crafted CoT traces, LIFT-SFT-12K is less specialized and has a more heterogeneous training distribution.

Table 1: LIFT outperforms baselines on LLaDA\-8B\-Instruct and LLaDA\-1\.5\. Across 4 math and reasoning benchmarks \(Cobbe et al\., [2021](https://arxiv.org/html/2605.22939#bib.bib42); Hendrycks et al\., [2021](https://arxiv.org/html/2605.22939#bib.bib44); Gandhi et al\., [2024](https://arxiv.org/html/2605.22939#bib.bib43); [Cordero,](https://arxiv.org/html/2605.22939#bib.bib45)\), LIFT with $H \in \{2, 3\}$ outperforms the post\-training baselines Vanilla SFT, GIFT \(Xu et al\., [2026](https://arxiv.org/html/2605.22939#bib.bib25)\), and CART \(Ye et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib16)\)\. Additionally, LIFT demonstrates a $3\times$ relative gain in pass@16 accuracy with LLaDA on AIME’24 \(AIME, [2024](https://arxiv.org/html/2605.22939#bib.bib33)\) and AIME’25 \(Math\-AI Team and Zhang, [2025](https://arxiv.org/html/2605.22939#bib.bib34)\)\. Percent deltas denote relative change versus the corresponding pre\-trained model\.

| Base model | Name | GSM8K | MATH | Countdown | Sudoku | AIME’24 | AIME’25 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaDA | LLaDA | 78.1 | 36.1 | 19.6 | 11.2 | 3.3 | 3.3 |
| LLaDA | Vanilla | 78.7 | 34.1 | 20.7 | 16.8 | 6.7 | 3.3 |
| LLaDA | GIFT | 79.2 | 34.2 | 21.7 | 17.3 | 16.7 | 0.0 |
| LLaDA | CART | 78.8 | 35.5 | 23.0 | 14.6 | 10.0 | 3.3 |
| LLaDA | $\mathbf{LIFT}_2$ | 79.8 (**↑2.1%**) | 37.9 (**↑4.9%**) | 27.9 (**↑42.3%**) | 16.5 (**↑47.3%**) | 10.0 (**↑203.0%**) | 3.3 (**↑0.0%**) |
| LLaDA | $\mathbf{LIFT}_3$ | 79.4 (**↑1.6%**) | 38.4 (**↑6.3%**) | 26.4 (**↑34.6%**) | 17.4 (**↑55.3%**) | 16.7 (**↑406.0%**) | 6.7 (**↑103.0%**) |
| LLaDA\-1.5 | LLaDA 1.5 | 80.9 | 37.8 | 22.6 | 12.1 | 13.3 | 3.3 |
| LLaDA\-1.5 | Vanilla | 79.2 | 32.6 | 22.0 | 14.4 | 6.7 | 3.3 |
| LLaDA\-1.5 | GIFT | 79.5 | 36.0 | 20.7 | 17.6 | 6.7 | 3.3 |
| LLaDA\-1.5 | CART | 80.4 | 35.8 | 21.5 | 17.0 | 6.7 | 0.0 |
| LLaDA\-1.5 | $\mathbf{LIFT}_2$ | 79.5 (**↓1.7%**) | 39.8 (**↑5.2%**) | 31.3 (**↑38.4%**) | 15.6 (**↑28.9%**) | 13.3 (**↑0.0%**) | 6.7 (**↑103.0%**) |
| LLaDA\-1.5 | $\mathbf{LIFT}_3$ | 82.2 (**↑1.6%**) | 38.8 (**↑2.6%**) | 31.2 (**↑38.0%**) | 18.2 (**↑50.4%**) | 13.3 (**↑0.0%**) | 6.7 (**↑103.0%**) |

Table 2: LIFT is robust to training datasets\. Benchmark performance when training on LIFT\-SFT\-12K, a math\-focused dataset assembled by randomly sampling from multiple post\-training sources\. LIFT consistently improves performance, demonstrating strong generalization across training datasets\.

| Name | GSM8K | MATH | Countdown | Sudoku | AIME’24 | AIME’25 |
| --- | --- | --- | --- | --- | --- | --- |
| Instruct | 78.2 | 36.8 | 20.0 | 11.8 | 3.3 | 3.3 |
| Vanilla | 82.9 | 34.6 | 19.9 | 9.3 | 3.3 | 3.3 |
| GIFT | 82.4 | 34.4 | 25.0 | 6.6 | 6.7 | 0.0 |
| CART | 80.0 | 34.6 | 24.6 | 11.6 | 6.7 | 0.0 |
| $\mathbf{LIFT}_2$ | 81.8 (**↑4.6%**) | 38.0 (**↑3.2%**) | 25.8 (**↑29%**) | 12.5 (**↑5.9%**) | 6.7 (**↑103.0%**) | 3.3 (**↑0.0%**) |
| $\mathbf{LIFT}_3$ | 81.4 (**↑4.1%**) | 38.6 (**↑4.9%**) | 20.7 (**↑3.5%**) | 10.3 (**↓12.7%**) | 10.0 (**↑203.0%**) | 3.3 (**↑0.0%**) |

Table 3: Compute–performance trade\-off\. We compare methods using H100 GPU hours alongside benchmark performance, including an RLVR oracle \(d1\), and the single\-forward\-pass approximation LIFT\-A\. LIFT \(and LIFT\-A\) delivers substantial gains at much lower compute\.

| Name | H100 Hours | GSM8K | MATH | Countdown | Sudoku | AIME’24 | AIME’25 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 1.0 | 78.7 | 34.1 | 20.7 | 16.8 | 6.7 | 3.3 |
| CART | 1.0 | 78.8 | 35.5 | 23.0 | 14.6 | 10.0 | 3.3 |
| $\mathbf{LIFT}_2$\-A | 1.0 | 78.7 | 36.8 | 33.2 | 11.1 | 6.7 | 3.3 |
| $\mathbf{LIFT}_3$\-A | 1.0 | 79.0 | 34.0 | 23.1 | 16.2 | 13.4 | 3.3 |
| GIFT | 1.8 | 79.2 | 34.2 | 21.7 | 17.3 | 16.7 | 0.0 |
| $\mathbf{LIFT}_2$ | 1.8 | 79.8 | 37.9 | 27.9 | 16.5 | 10.0 | 3.3 |
| $\mathbf{LIFT}_3$ | 1.8 | 79.4 | 38.4 | 26.4 | 17.4 | 16.7 | 6.7 |
| d1 \(oracle\) | 2303 | 81.9 | 39.2 | 37.1 | 18.4 | n/a | n/a |

![Refer to caption](https://arxiv.org/html/2605.22939v1/x5.png) \(a\) GSM8K
![Refer to caption](https://arxiv.org/html/2605.22939v1/x6.png) \(b\) MATH500

Figure 4: LIFT lies on the compute\-efficient Pareto frontier, measured in H100 GPU hours\. When applied to LLaDA, LIFT requires only 2 hours of training and already outperforms baselines on GSM8K and MATH\. We also evaluate LIFT\-A, an approximate variant of our method, which performs comparably at half the compute budget of LIFT\. Finally, when LIFT is applied to LLaDA 1\.5, which requires approximately 405 H100 hours of pretraining, LIFT\(1\.5\) adds just 2 hours, performing similarly on MATH and outperforming d1 \(Zhao et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib21)\) on GSM8K, while using nearly 50% less total compute\.

#### Evaluation\.

We follow the evaluation setup of d1 \(Zhao et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib21)\) and assess LIFT on GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.22939#bib.bib42)\), MATH \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.22939#bib.bib44)\), Countdown \(Gandhi et al\., [2024](https://arxiv.org/html/2605.22939#bib.bib43)\), and Sudoku \([Cordero,](https://arxiv.org/html/2605.22939#bib.bib45)\)\. We use the same evaluation code, prompts, and inference settings as d1, and report accuracy \(pass@1\)\. In addition, we evaluate on the AIME’24 \(AIME, [2024](https://arxiv.org/html/2605.22939#bib.bib33)\) and AIME’25 \(Math\-AI Team and Zhang, [2025](https://arxiv.org/html/2605.22939#bib.bib34)\) datasets to measure LIFT on advanced mathematical reasoning; given their difficulty, we report pass@16 for AIME\. We include pass@8, avg@8, and avg@16 results in the Appendix \(see Table [12](https://arxiv.org/html/2605.22939#A5.T12)\)\.
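To make the pass@k numbers concrete, the following minimal sketch computes pass@k with the standard unbiased estimator popularized for code generation benchmarks. It is an illustration only: the text does not state which estimator the evaluation code uses, and the 30-problem benchmark and the per-problem correct counts below are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts, n, k):
    """Average the per-problem estimate over a benchmark and return a percentage."""
    scores = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical 30-problem benchmark with 16 samples per problem:
# one problem is solved by 2 of its 16 samples, the others are never solved.
counts = [2] + [0] * 29
pass16 = benchmark_pass_at_k(counts, n=16, k=16)
pass1 = benchmark_pass_at_k(counts, n=16, k=1)
print(f"pass@16 = {pass16:.1f}%, pass@1 = {pass1:.2f}%")
```

Under these assumptions, solving a single problem out of 30 yields a pass@16 of about 3.3%, which is the granularity visible in the AIME columns of the tables.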

#### Baselines\.

Since we fine\-tune LLaDA Instruct \(Nie et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib15)\) and LLaDA 1\.5 \(Zhu et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib17)\), these pre\-trained models are our first set of baselines\. We also train with the vanilla masked\-DLM objective \(Sahoo et al\., [2024](https://arxiv.org/html/2605.22939#bib.bib14)\) \(Vanilla\)\. We additionally consider the methods of Xu et al\. \([2026](https://arxiv.org/html/2605.22939#bib.bib25)\) and Ye et al\. \([2025](https://arxiv.org/html/2605.22939#bib.bib16)\)\. Context\-Adaptive noise Rescheduling at Token\-level \(CART\) \(Ye et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib16)\) re\-weights each masked token in the NELBO objective such that targets with fewer unmasked tokens in their immediate neighborhood receive less weight, as these tokens are harder to denoise\. This accounts for the variable amount of context across diffusion time \(*when*\); however, it is applied independently of token identity and thus does not consider *what*\. Guided Importance\-Aware Fine\-Tuning \(GIFT\) \(Xu et al\., [2026](https://arxiv.org/html/2605.22939#bib.bib25)\) instead accounts for *what* without *when*\. Similar to LIFT, GIFT estimates token\-level uncertainty using an initial forward pass of the model with all non\-prompt tokens masked, as $p_\theta(\cdot \mid x_1)$\. Each response token is then masked with probability proportional to the square root of the token\-level entropy, such that tokens with high uncertainty are more likely to be masked\. While this shares connections with the bottom\-$K$ loss, it is independent of time, since uncertainty is always estimated conditioned on $x_1$\. The inclusion of both GIFT and CART serves to compare jointly accounting for the interaction between the *what* and *when* dimensions, as done by LIFT, against modeling only one dimension in isolation\.
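The what-only masking rule described for GIFT can be sketched as follows. This is a rough illustration of "mask with probability proportional to the square root of the entropy", not the GIFT implementation: the function names, the toy distributions, and in particular the rescaling that ties the average masking probability to a target `mask_rate` are assumptions made for this example.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sqrt_entropy_mask_probs(token_dists, mask_rate):
    """Per-token masking probabilities proportional to sqrt(entropy).

    Rescaling so that the mean probability is about `mask_rate` (before
    clipping at 1) is an assumption of this sketch.
    """
    scores = [math.sqrt(entropy(d)) for d in token_dists]
    total = sum(scores)
    if total == 0.0:
        # No uncertainty signal at all: fall back to uniform masking.
        return [mask_rate] * len(scores)
    scale = mask_rate * len(scores) / total
    return [min(1.0, scale * s) for s in scores]

def sample_mask(mask_probs, rng):
    """Draw an independent Bernoulli mask decision for each response token."""
    return [rng.random() < p for p in mask_probs]

# Toy predictive distributions over a 4-token vocabulary for 3 response tokens.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.50, 0.30, 0.10, 0.10],  # moderately uncertain
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
]
probs = sqrt_entropy_mask_probs(dists, mask_rate=0.5)
mask = sample_mask(probs, random.Random(0))
print([round(p, 3) for p in probs], mask)
```

Because the distributions are computed once from the fully masked response, the resulting probabilities do not change with the diffusion time step, which is the sense in which this rule captures *what* but not *when*.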

Table 4: LIFT is robust across generation lengths\. We follow the evaluation setup of d1 \(Zhao et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib21)\) and compare performance across generation lengths of 128, 256, and 512 tokens on different datasets\. LIFT is robust across lengths and generally benefits from longer generations, except on Sudoku\. Best results are in **bold**\.

| Name | GSM8K 128 | GSM8K 256 | GSM8K 512 | MATH 128 | MATH 256 | MATH 512 | Countdown 128 | Countdown 256 | Countdown 512 | Sudoku 128 | Sudoku 256 | Sudoku 512 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Instruct | 68.5 | 76.1 | 78.1 | 26.4 | 32.4 | 36.1 | 19.6 | 19.6 | 17.1 | 11.2 | 6.5 | 5.5 |
| Vanilla | 67.1 | **78.5** | 78.7 | 27.0 | 32.8 | 34.1 | 20.1 | 16.2 | 20.7 | 16.8 | 7.3 | 4.7 |
| GIFT | 66.4 | 78.0 | 79.2 | 27.2 | 32.7 | 34.2 | 21.7 | 16.4 | 17.3 | 16.0 | **8.7** | 5.2 |
| CART | 67.2 | 76.6 | 78.9 | 24.9 | 30.5 | 35.5 | **23.0** | 19.0 | 18.7 | 14.6 | 8.5 | 4.7 |
| $\mathbf{LIFT}_2$ | **70.9** | 78.2 | **79.8** | **29.0** | 35.5 | 37.9 | 22.8 | 17.9 | **28.0** | 16.5 | 7.2 | 6.9 |
| $\mathbf{LIFT}_3$ | 69.5 | 78.4 | 79.4 | 28.2 | **37.6** | **38.4** | 20.1 | **21.3** | 26.4 | **17.4** | 7.9 | **8.5** |

### 6\.2Results

We now present the results of LIFT, reporting the mean performance across three training runs with different random seeds\. The same procedure is applied to all baselines in Table 1\. Additionally, we highlight the relative gains over the base model in green\. Confidence intervals are included in Appendix [B\.2](https://arxiv.org/html/2605.22939#A2.SS2)\.

#### LIFT consistently outperforms baselines across both LLaDA\-8B\-Instruct and LLaDA 1\.5\.

Table [1](https://arxiv.org/html/2605.22939#S6.T1) presents the performance of LIFT with two values of $H$, namely 2 and 3, denoted as $\text{LIFT}_2$ and $\text{LIFT}_3$, respectively\. On LLaDA\-8B\-Instruct, our method shows notable improvements over the baseline, especially on the harder AIME 2024 and 2025 benchmarks, where LIFT improves base model performance by more than $2\times$\. Finally, we find that $\text{LIFT}_3$ offers more consistent improvements across benchmarks and base models compared to $\text{LIFT}_2$\.
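As a worked example of how the percent deltas in Table 1 relate to the reported multiplicative gains, consider the AIME’24 entry for $\text{LIFT}_2$ on LLaDA (10.0 versus a base score of 3.3). The symbols $a_{\text{LIFT}}$, $a_{\text{base}}$, and $\Delta_{\text{rel}}$ are notation introduced here for illustration; small mismatches with other table entries may come from rounding.

```latex
\Delta_{\text{rel}}
  = \frac{a_{\text{LIFT}} - a_{\text{base}}}{a_{\text{base}}} \times 100\%,
\qquad
\frac{10.0 - 3.3}{3.3} \times 100\% \approx 203\%,
\qquad
\frac{a_{\text{LIFT}}}{a_{\text{base}}} = \frac{10.0}{3.3} \approx 3.0 .
```

A relative change of about 203% therefore corresponds to roughly a $3\times$ ratio between the fine-tuned and base scores.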

#### Training Distribution Robustness of LIFT\.

To assess the generality of LIFT across fine\-tuning datasets, we conduct experiments on the LIFT\-SFT\-12K dataset described in Sec\. [6\.1](https://arxiv.org/html/2605.22939#S6.SS1)\. As shown in Table [2](https://arxiv.org/html/2605.22939#S6.T2), LIFT improves over the base model on nearly all evaluation tasks \(the exception being $\text{LIFT}_3$ on Sudoku\), indicating that its effectiveness is not limited to s1K \(Muennighoff et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib23)\)\. These results suggest that LIFT generalizes well across training distributions and could potentially serve as a scalable objective beyond supervised post\-training, for example in broader pre\-training or instruction\-tuning settings\.

Table 5: Ablation of interaction between *what* and *when*\. To ablate the importance of *what*, we introduce Top\-$K$ and Bottom\-$K$ as baselines, which train on the most and least confident masked tokens, respectively\. Furthermore, we ablate time\-independent variants of LIFT that randomly select between Top\-$K$ and Bottom\-$K$ \($\text{Random}_2$\), or among Top\-$K$, Bottom\-$K$, and Vanilla \($\text{Random}_3$\)\. As seen below, improvements from these baselines are not consistent across tasks\. By accounting for both *what* and *when*, LIFT achieves robust performance across all tasks, empirically validating the consideration of both *what* and *when* during SFT training of DLMs\.

| Name | GSM8K | MATH | Countdown | Sudoku | AIME’24 | AIME’25 |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 78.7 | 34.1 | 20.7 | 16.8 | 6.7 | 3.3 |
| Top\-$K$ | 77.2 | 34.6 | 30.3 | 18.0 | 3.3 | 0.0 |
| Bottom\-$K$ | 77.5 | 34.8 | 26.0 | 18.0 | 10.0 | 3.3 |
| $\text{Random}_2$ | 80.1 | 37.8 | 23.0 | 16.8 | 0.0 | 0.0 |
| $\text{Random}_3$ | 80.0 | 35.8 | 23.4 | 18.0 | 0.0 | 0.0 |
| $\mathbf{LIFT}_2$ | 79.8 | 37.9 | 28.0 | 16.5 | 10.0 | 3.3 |
| $\mathbf{LIFT}_3$ | 79.4 | 38.4 | 26.4 | 17.4 | 16.7 | 6.7 |

#### Compute–Performance Trade\-offs and the Pareto Frontier\.

Table [3](https://arxiv.org/html/2605.22939#S6.T3) presents results on LLaDA\-8B\-Instruct for LIFT\-A, a compute\-efficient variant that requires only a single forward pass\. Despite its lower computational cost, LIFT\-A consistently outperforms baselines with comparable budgets, such as Vanilla and CART, highlighting a favorable trade\-off between efficiency and performance\.

Remarkably, when compared to d1 \(Zhao et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib21)\), a reinforcement learning\-based post\-training method that fine\-tunes separately for each task and requires over 2,000 H100 GPU hours, LIFT achieves similar performance while using only 1\.8 H100 hours \(Table [3](https://arxiv.org/html/2605.22939#S6.T3)\)\. This roughly $1000\times$ reduction in compute demonstrates the strength of our learnability\-informed training approach in improving training efficiency\. As illustrated in Figure [4](https://arxiv.org/html/2605.22939#S6.F4), LIFT establishes a new compute\-efficient Pareto frontier\. These findings suggest that while RL has been effective for ARMs \(Guo et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib19)\), efficient RL\-based post\-training for DLMs remains an open challenge\.
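The notion of a compute-efficient Pareto frontier can be illustrated with the GSM8K column of Table 3. The sketch below is not the procedure used to produce Figure 4 (which also accounts for pretraining compute); it only takes the fine-tuning hours and accuracies listed in Table 3 and keeps the methods that no other method beats on both cost and accuracy. The function name and the tuple layout are choices made for this example.

```python
def pareto_frontier(points):
    """Return the names of methods not dominated in (lower cost, higher accuracy).

    A method is dominated if another method is at least as cheap and at least
    as accurate, and strictly better on one of the two criteria.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, c, a in points
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (method, H100 GPU hours, GSM8K accuracy) copied from Table 3.
table3_gsm8k = [
    ("Vanilla", 1.0, 78.7),
    ("CART", 1.0, 78.8),
    ("LIFT_2-A", 1.0, 78.7),
    ("LIFT_3-A", 1.0, 79.0),
    ("GIFT", 1.8, 79.2),
    ("LIFT_2", 1.8, 79.8),
    ("LIFT_3", 1.8, 79.4),
    ("d1", 2303.0, 81.9),
]
print(pareto_frontier(table3_gsm8k))  # ['LIFT_3-A', 'LIFT_2', 'd1']
print(round(2303.0 / 1.8))            # 1279, the d1-to-LIFT compute ratio
```

On these numbers, the frontier consists of a LIFT-A variant, a LIFT variant, and d1, and the compute ratio between d1 and LIFT is on the order of $1000\times$.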

Table 6: Ablations of $H$ for LIFT\. We ablate the value of $H$, which controls whether Top\-$K$, Bottom\-$K$, or Vanilla SFT is applied during training\. Mathematically, as $H \to \infty$, LIFT converges to vanilla SFT\. Empirically, $H = 3$ achieves the best average performance across benchmarks\.

| Name | GSM8K | MATH | Countdown | Sudoku | AIME’24 | AIME’25 |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 78.7 | 34.1 | 20.7 | 16.8 | 6.7 | 3.3 |
| $\mathbf{LIFT}_2$ | 79.8 | 37.9 | 28.0 | 16.5 | 10.0 | 3.3 |
| $\mathbf{LIFT}_3$ | 79.4 | 38.4 | 26.4 | 17.4 | 16.7 | 6.7 |
| $\mathbf{LIFT}_4$ | 78.0 | 37.2 | 30.4 | 14.9 | 6.7 | 3.3 |
| $\mathbf{LIFT}_5$ | 78.2 | 35.4 | 22.7 | 15.7 | 6.7 | 3.3 |
| $\mathbf{LIFT}_{10}$ | 78.2 | 34.2 | 21.7 | 16.1 | 6.7 | 3.3 |
| $\mathbf{LIFT}_{15}$ | 77.8 | 34.1 | 22.6 | 15.8 | 3.3 | 3.3 |
| $\mathbf{LIFT}_{20}$ | 78.0 | 33.4 | 21.0 | 15.5 | 6.7 | 3.3 |

### 6\.3Ablation Studies

To better understand the design and performance of LIFT, we conduct a series of ablations focusing on its key components\. All ablations were carried out by training LLaDA\-8B\-Instruct on s1K\.

#### Ablation on Generation Length\.

Following the setup used in d1 and LLaDA \(Zhao et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib21); Nie et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib15)\), we ablate the response length across 128, 256, and 512 tokens, with the number of diffusion steps set to half the generation length\. Results are shown in Table [4](https://arxiv.org/html/2605.22939#S6.T4), and we report the best\-performing value in all main tables, consistent with prior work \(Zhu et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib17); Xu et al\., [2026](https://arxiv.org/html/2605.22939#bib.bib25)\)\. LIFT performs robustly across lengths, with performance generally improving as generation length increases, except on Sudoku, which exhibits the opposite trend\.

Table 7: Extension of LIFT to Dream\-7B \(Ye et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib16)\)\. LIFT demonstrates robust performance gains across mathematical and reasoning benchmarks\.

| Method | GSM8K | MATH | Countdown | Sudoku |
| --- | --- | --- | --- | --- |
| Instruct | 76.7 | 39.8 | 21.1 | 8.2 |
| Vanilla | 76.1 | 30.6 | 25.0 | 14.8 |
| GIFT | 78.5 | 40.0 | 23.4 | 16.0 |
| CART | 77.8 | 38.9 | 22.3 | 17.2 |
| $\mathbf{LIFT}_2$ | 77.9 | 40.8 | 33.6 | 22.5 |
| $\mathbf{LIFT}_3$ | 79.1 | 40.6 | 25.6 | 17.5 |

#### Ablating the Interaction between What and When\.

LIFT builds directly on our analysis, where we demonstrated the substantial effect of the interaction between *what* and *when* on the loss landscape of DLMs\. We next study this key design choice by considering ablated versions of LIFT that only account for *what* tokens are learned, without any constraint on *when* they are learned in the diffusion process\. To construct frameworks that only consider *what* is learned and are independent of diffusion time, we introduce bottom\-$K$ and top\-$K$ training as standalone baselines\. Bottom\-$K$ trains only on hard tokens, while top\-$K$ focuses only on easy ones\. Additionally, to analyze whether a mixture of the top\-$K$ and bottom\-$K$ losses is sufficient without considering *when* these losses are applied in the diffusion process, we design time\-independent variants of LIFT that randomly select one of the bottom\-$K$, vanilla, or top\-$K$ losses at each training step\. We refer to these as $\text{Random}_2$ and $\text{Random}_3$, where $\text{Random}_2$ samples between bottom\-$K$ and top\-$K$, and $\text{Random}_3$ samples from all three\.
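The ablated baselines above can be sketched in a few lines. This is an illustration under stated assumptions rather than the training code: "confidence" is taken here to be the model's probability of the ground-truth token at each masked position, `k` is passed in directly, and the names `select_tokens` and `random_variant_mode` are hypothetical.

```python
import random

def select_tokens(confidences, k, mode):
    """Pick the indices of masked tokens that contribute to the loss.

    mode "top" keeps the k most confident (easy) tokens, "bottom" keeps the
    k least confident (hard) tokens, and "vanilla" keeps every masked token.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    if mode == "vanilla":
        return sorted(order)
    if mode == "bottom":
        return sorted(order[:k])
    if mode == "top":
        return sorted(order[-k:]) if k > 0 else []
    raise ValueError(f"unknown mode: {mode}")

def random_variant_mode(n_choices, rng):
    """Time-independent choice of loss, as in the Random_2 / Random_3 ablations."""
    modes = ["bottom", "top"] if n_choices == 2 else ["bottom", "top", "vanilla"]
    return rng.choice(modes)

rng = random.Random(0)
conf = [0.91, 0.05, 0.40, 0.77, 0.12]  # toy confidences for 5 masked tokens
print(select_tokens(conf, k=2, mode="top"))     # [0, 3]
print(select_tokens(conf, k=2, mode="bottom"))  # [1, 4]
mode = random_variant_mode(3, rng)
print(mode, select_tokens(conf, k=2, mode=mode))
```

The key point of the ablation is visible in `random_variant_mode`: the loss type is drawn without looking at the diffusion time step, so the selection captures *what* but ignores *when*.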

Results for this ablation are shown in Tab\. [5](https://arxiv.org/html/2605.22939#S6.T5)\. With the exception of Countdown and Sudoku, LIFT offers substantial gains over both the top\-$K$ and bottom\-$K$ loss variants\. While the Random variants are competitive with LIFT on the other benchmarks, they achieve a pass@16 of 0 on the challenging AIME benchmarks, suggesting that accounting for the interaction of *what* tokens are learned and *when* is crucial for success in real\-world tasks that require multi\-step reasoning and the use of tokens in the tails of the base model distribution\.

#### Ablation of $H$\.

We ablate the hyperparameter $H$ in LIFT, which determines the rate at which top\-$K$, bottom\-$K$, or vanilla is used during training \(see Algorithm [1](https://arxiv.org/html/2605.22939#alg1)\)\. Mathematically, as $H \to \infty$, LIFT approaches vanilla SFT\. As shown in Table [6](https://arxiv.org/html/2605.22939#S6.T6), LIFT is robust to this parameter, with $H = 3$ yielding the best average performance across benchmarks\.
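The claim about the best average can be checked directly from the numbers in Table 6. The snippet below assumes that "average performance" means the unweighted mean over the six benchmarks, which the text does not state explicitly.

```python
# Rows copied from Table 6: GSM8K, MATH, Countdown, Sudoku, AIME'24, AIME'25.
table6 = {
    "Vanilla": [78.7, 34.1, 20.7, 16.8, 6.7, 3.3],
    "LIFT_2": [79.8, 37.9, 28.0, 16.5, 10.0, 3.3],
    "LIFT_3": [79.4, 38.4, 26.4, 17.4, 16.7, 6.7],
    "LIFT_4": [78.0, 37.2, 30.4, 14.9, 6.7, 3.3],
    "LIFT_5": [78.2, 35.4, 22.7, 15.7, 6.7, 3.3],
    "LIFT_10": [78.2, 34.2, 21.7, 16.1, 6.7, 3.3],
    "LIFT_15": [77.8, 34.1, 22.6, 15.8, 3.3, 3.3],
    "LIFT_20": [78.0, 33.4, 21.0, 15.5, 6.7, 3.3],
}

# Unweighted mean over the six benchmarks (an assumption of this sketch).
averages = {name: sum(row) / len(row) for name, row in table6.items()}
best = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name:8s} {avg:.2f}")
print("best:", best)
```

Under this averaging, $H = 3$ comes out on top, and the averages for large $H$ fall back toward the Vanilla row, in line with the limiting behavior as $H \to \infty$.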

#### Extension to Dream\-7B\.

To further evaluate the generalizability of LIFT, we extend it to Dream\-7B \(Ye et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib16)\)\. As demonstrated in Table [7](https://arxiv.org/html/2605.22939#S6.T7), LIFT yields performance gains consistent with those observed on the other models\.

#### Alternate sampling strategies for $\rho$\.

To evaluate the impact of alternative sampling strategies for $\rho$, we compare against a fixed schedule \($\rho = \min(k, 1 - t)$\) and a variance\-reduced uniform distribution \($\rho \sim \mathcal{U}(k, 1 - t)$\)\. As shown in Table [8](https://arxiv.org/html/2605.22939#S6.T8), our default approach, i\.e\., $\rho \sim \mathcal{U}(0, 1 - t)$, performs best on every benchmark except Sudoku\. By avoiding the deterministic constraints of the fixed schedule and the truncated intervals of the variance\-reduced distributions, uniform sampling of $\rho$ maximizes the diversity of masking patterns the model encounters during training\. During fine\-tuning, this diversity acts as implicit data augmentation, mirroring an effect previously observed in image diffusion \(Kingma et al\., [2021](https://arxiv.org/html/2605.22939#bib.bib9)\)\.
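The three strategies compared here differ only in how a scalar $\rho$ is drawn given the diffusion time $t$. The sketch below treats $\rho$ purely as a number; its role in the algorithm is defined earlier in the paper. The function name, the strategy labels, and the handling of the case $k > 1 - t$ in the truncated variant are assumptions made for this example.

```python
import random

def sample_rho(t, strategy="uniform", k=0.1, rng=random):
    """Sample rho for diffusion time t in [0, 1].

    "uniform":   rho ~ U(0, 1 - t), the default described in the text.
    "fixed":     rho = min(k, 1 - t).
    "truncated": rho ~ U(k, 1 - t), the variance-reduced variant.
    """
    upper = 1.0 - t
    if strategy == "uniform":
        return rng.uniform(0.0, upper)
    if strategy == "fixed":
        return min(k, upper)
    if strategy == "truncated":
        # Assumption: when k exceeds 1 - t, the interval collapses to 1 - t.
        lower = min(k, upper)
        return rng.uniform(lower, upper)
    raise ValueError(f"unknown strategy: {strategy}")

rng = random.Random(0)
for t in (0.0, 0.5, 0.95):
    draws = {s: sample_rho(t, s, k=0.3, rng=rng) for s in ("uniform", "fixed", "truncated")}
    print(t, {s: round(v, 3) for s, v in draws.items()})
```

The fixed schedule is deterministic given $t$, and the truncated variant never produces values below $k$ (unless $1 - t$ is smaller), so the default uniform draw covers the widest range of values for each $t$.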

Table 8: Ablation of alternative sampling strategies for $\rho$\. We compare the uniform sampling of $\rho \sim \mathcal{U}(0, 1 - t)$ in LIFT against fixed schedules \($\rho = \min(k, 1 - t)$\) and variance\-reduced distributions \($\rho \sim \mathcal{U}(k, 1 - t)$\)\. We experimented with bounds $k \in \{0.1, 0.3\}$ for a maximum generation length of 256 tokens\.

| Strategy | GSM8K | MATH | Countdown | Sudoku |
| --- | --- | --- | --- | --- |
| $\rho = \min(0.1, 1 - t)$ | 78.2 | 36.4 | 21.1 | 7.6 |
| $\rho = \min(0.3, 1 - t)$ | 77.9 | 34.6 | 21.1 | 8.5 |
| $\rho \sim \mathcal{U}(0.1, 1 - t)$ | 78.9 | 34.4 | 20.0 | 4.5 |
| $\rho \sim \mathcal{U}(0.3, 1 - t)$ | 78.4 | 36.2 | 22.8 | 6.5 |
| $\mathbf{LIFT}_3$: $\rho \sim \mathcal{U}(0, 1 - t)$ | 79.4 | 37.6 | 26.4 | 7.9 |

## 7Conclusion

We propose LIFT, a learnability\-informed fine\-tuning method for post\-training DLMs\. LIFT builds on the insight that certain tokens are inherently harder to learn \(*what*\), and that their learnability depends on*when*they are predicted during the diffusion process\. To elucidate this relationship, we analyze over 0\.5B tokens across common post\-training datasets, revealing consistent patterns in token frequency and the dependence on diffusion timestep\. These findings inform the design of LIFT, which achieves state\-of\-the\-art performance across arithmetic reasoning tasks, with particularly strong gains on challenging benchmarks such as AIME\. Notably, LIFT establishes a new compute\-efficient Pareto frontier, matching the performance of RL\-based methods while requiring orders of magnitude less compute\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## Acknowledgments

This work was supported in part by ARPA\-H under grant 1AY1AX000053, NIH under grant U01AG070112, and NSF under grant CNS\-2328395\.

## References

- AIME \(2024\)Aime\_2024\.Note:Hugging Face DatasetsAccessed 2026\-01\-21External Links:[Link](https://huggingface.co/datasets/HuggingFaceH4/aime_2024)Cited by:[§1](https://arxiv.org/html/2605.22939#S1.p4.2),[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2)\.
- J\. Austin, D\. D\. Johnson, J\. Ho, D\. Tarlow, and R\. Van Den Berg \(2021a\)Structured denoising diffusion models in discrete state\-spaces\.Advances in neural information processing systems34,pp\. 17981–17993\.Cited by:[§1](https://arxiv.org/html/2605.22939#S1.p1.1),[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, and C\. Sutton \(2021b\)Program synthesis with large language models\.arXiv preprint arXiv:2108\.07732\.Cited by:[Appendix F](https://arxiv.org/html/2605.22939#A6.p1.1)\.
- Y\. Bengio, J\. Louradour, R\. Collobert, and J\. Weston \(2009\)Curriculum learning\.InProceedings of the 26th annual international conference on machine learning,pp\. 41–48\.Cited by:[§5](https://arxiv.org/html/2605.22939#S5.SS0.SSS0.Px4.p1.1)\.
- A\. Bercovichet al\.\(2025\)Llama\-nemotron: efficient reasoning models\.arXiv preprint arXiv:2505\.00949\.External Links:[Link](https://arxiv.org/abs/2505.00949)Cited by:[Appendix C](https://arxiv.org/html/2605.22939#A3.p1.1),[Figure 2](https://arxiv.org/html/2605.22939#S1.F2),[Figure 2](https://arxiv.org/html/2605.22939#S1.F2.12.6.7),[§1](https://arxiv.org/html/2605.22939#S1.p3.2),[§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px3.p1.1),[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px1.p1.1)\.
- T\. Bie, M\. Cao, K\. Chen, L\. Du, M\. Gong, Z\. Gong, Y\. Gu, J\. Hu, Z\. Huang, Z\. Lan,et al\.\(2025\)Llada2\. 0: scaling up diffusion language models to 100b\.arXiv preprint arXiv:2512\.15745\.Cited by:[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[Appendix F](https://arxiv.org/html/2605.22939#A6.p1.1)\.
- X\. Chen, J\. Lu, M\. Kim, D\. Zhang, J\. Tang, A\. Piché, N\. Gontier, Y\. Bengio, and E\. Kamalloo \(2025\)Self\-evolving curriculum for llm reasoning\.arXiv preprint arXiv:2505\.14970\.Cited by:[§5](https://arxiv.org/html/2605.22939#S5.SS0.SSS0.Px4.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2)\.
- A\. Cordero Arel’s sudoku generator\.Note:[https://www\.ocf\.berkeley\.edu/~arel/sudoku/main\.html](https://www.ocf.berkeley.edu/~arel/sudoku/main.html)Cited by:[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Gandhi, D\. H\. J\. Lee, G\. Grand, M\. Liu, W\. Cheng, A\. Sharma, and N\. Goodman \(2024\)Stream of search \(sos\): learning to search in language\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=2cop2jmQVL)Cited by:[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.Cited by:[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1),[§6\.2](https://arxiv.org/html/2605.22939#S6.SS2.SSS0.Px3.p2.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.NeurIPS\.Cited by:[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Ho, T\. Salimans, A\. Gritsenko, W\. Chan, M\. Norouzi, and D\. J\. Fleet \(2022\)Video diffusion models\.Advances in neural information processing systems35,pp\. 8633–8646\.Cited by:[§1](https://arxiv.org/html/2605.22939#S1.p1.1)\.
- N\. Kandpal, H\. Deng, A\. Roberts, E\. Wallace, and C\. Raffel \(2023\)Large language models struggle to learn long\-tail knowledge\.InInternational conference on machine learning,pp\. 15696–15707\.Cited by:[§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px1.p1.4)\.
- S\. Khanna, S\. Kharbanda, S\. Li, H\. Varma, E\. Wang, S\. Birnbaum, Z\. Luo, Y\. Miraoui, A\. Palrecha, S\. Ermon,et al\.\(2025\)Mercury: ultra\-fast language models based on diffusion\.arXiv e\-prints,pp\. arXiv–2506\.Cited by:[§1](https://arxiv.org/html/2605.22939#S1.p1.1)\.
- D\. Kingma, T\. Salimans, B\. Poole, and J\. Ho \(2021\)Variational diffusion models\.Advances in neural information processing systems34,pp\. 21696–21707\.Cited by:[§6\.3](https://arxiv.org/html/2605.22939#S6.SS3.SSS0.Px5.p1.5)\.
- V\. T\. Kunde, F\. Doudi, M\. Farahbakhsh, D\. Kalathil, K\. Narayanan, and J\. Chamberland \(2026\)Reinforcement learning for diffusion llms with entropy\-guided step selection and stepwise advantages\.arXiv preprint arXiv:2603\.12554\.Cited by:[§1](https://arxiv.org/html/2605.22939#S1.p2.1)\.
- S\. Li, K\. Kallidromitis, H\. Bansal, A\. Gokul, Y\. Kato, K\. Kozuka, J\. Kuen, Z\. Lin, K\. Chang, and A\. Grover \(2025\)LaViDa: a large diffusion model for vision\-language understanding\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=6WnBITpnzD)Cited by:[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1)\.
- Math\-AI Team and Y\. Zhang \(2025\)Aime25\.Note:Hugging Face DatasetsAccessed 2026\-01\-21External Links:[Link](https://huggingface.co/datasets/math-ai/aime25)Cited by:[§1](https://arxiv.org/html/2605.22939#S1.p4.2),[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1),[Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2)\.
- N\. Muennighoff, Z\. Yang, W\. Shi, X\. L\. Li, L\. Fei\-Fei, H\. Hajishirzi, L\. Zettlemoyer, P\. Liang, E\. Candès, and T\. B\. Hashimoto \(2025\)S1: simple test\-time scaling\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 20286–20332\.Cited by:[Figure 2](https://arxiv.org/html/2605.22939#S1.F2),[Figure 2](https://arxiv.org/html/2605.22939#S1.F2.12.6.7),[§1](https://arxiv.org/html/2605.22939#S1.p3.2),[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px3.p1.1),[§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px1.p1.1),[§6\.2](https://arxiv.org/html/2605.22939#S6.SS2.SSS0.Px2.p1.1)\.
- A\. Q\. Nichol and P\. Dhariwal \(2021\)Improved denoising diffusion probabilistic models\.InInternational conference on machine learning,pp\. 8162–8171\.Cited by:[§1](https://arxiv.org/html/2605.22939#S1.p1.1),[§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li \(2025\) Large language diffusion models\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=KnqiC0znVF) Cited by: [§1](https://arxiv.org/html/2605.22939#S1.p1.1), [§1](https://arxiv.org/html/2605.22939#S1.p2.1), [§1](https://arxiv.org/html/2605.22939#S1.p3.2), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1), [§3](https://arxiv.org/html/2605.22939#S3.p1.20), [§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px3.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px3.p1.3), [§6\.3](https://arxiv.org/html/2605.22939#S6.SS3.SSS0.Px1.p1.1)\.
- Open\-R1 \(2025\) Mixture\-of\-thoughts\. Note: Hugging Face Datasets, accessed 2026\-01\-21\. External Links: [Link](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) Cited by: [Appendix C](https://arxiv.org/html/2605.22939#A3.p1.1), [Figure 2](https://arxiv.org/html/2605.22939#S1.F2), [Figure 2](https://arxiv.org/html/2605.22939#S1.F2.12.6.7), [§1](https://arxiv.org/html/2605.22939#S1.p3.2), [§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px3.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px1.p1.1)\.
- S\. Parashar, S\. Gui, X\. Li, H\. Ling, S\. Vemuri, B\. Olson, E\. Li, Y\. Zhang, J\. Caverlee, D\. Kalathil, et al\. \(2025\) Curriculum reinforcement learning from easy to hard tasks improves llm reasoning\. arXiv preprint arXiv:2506\.06632\. Cited by: [§B\.3](https://arxiv.org/html/2605.22939#A2.SS3.p1.1), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.22939#S5.SS0.SSS0.Px4.p1.1)\.
- S\. Parashar, Z\. Lin, T\. Liu, X\. Dong, Y\. Li, D\. Ramanan, J\. Caverlee, and S\. Kong \(2024\) The neglected tails in vision\-language models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 12988–12997\. Cited by: [§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px1.p1.4)\.
- S\. Sahoo, M\. Arriola, Y\. Schiff, A\. Gokaslan, E\. Marroquin, J\. Chiu, A\. Rush, and V\. Kuleshov \(2024\) Simple and effective masked diffusion language models\. Advances in Neural Information Processing Systems 37, pp\. 130136–130184\. Cited by: [§1](https://arxiv.org/html/2605.22939#S1.p1.1), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1), [§3](https://arxiv.org/html/2605.22939#S3.p1.20), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px3.p1.3)\.
- Y\. Song and S\. Ermon \(2019\) Generative modeling by estimating gradients of the data distribution\. Advances in Neural Information Processing Systems 32\. Cited by: [§1](https://arxiv.org/html/2605.22939#S1.p1.1), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1)\.
- Team OLMo et al\. \(2025\) Olmo 3\. arXiv preprint arXiv:2512\.13961\. External Links: [Link](https://arxiv.org/abs/2512.13961) Cited by: [Appendix C](https://arxiv.org/html/2605.22939#A3.p1.1), [Figure 2](https://arxiv.org/html/2605.22939#S1.F2), [Figure 2](https://arxiv.org/html/2605.22939#S1.F2.12.6.7), [§1](https://arxiv.org/html/2605.22939#S1.p3.2), [§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px3.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px1.p1.1)\.
- V\. Udandarao, A\. Prabhu, A\. Ghosh, Y\. Sharma, P\. Torr, A\. Bibi, S\. Albanie, and M\. Bethge \(2024\) No "zero\-shot" without exponential data: pretraining concept frequency determines multimodal model performance\. Advances in Neural Information Processing Systems 37, pp\. 61735–61792\. Cited by: [§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px1.p1.4)\.
- G\. Wang, G\. Turok, Y\. Schiff, M\. Arriola, and V\. Kuleshov \(2025\) D2: improved techniques for training reasoning diffusion language models\. arXiv preprint arXiv:2509\.21474\. Cited by: [§1](https://arxiv.org/html/2605.22939#S1.p2.1)\.
- C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie \(2025\) Fast\-dLLM: training\-free acceleration of diffusion LLM by enabling KV cache and parallel decoding\. External Links: 2505\.22618, [Link](https://arxiv.org/abs/2505.22618) Cited by: [§D\.2](https://arxiv.org/html/2605.22939#A4.SS2.SSS0.Px1.p2.2)\.
- C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie \(2026\) Fast\-dLLM: training\-free acceleration of diffusion LLM by enabling KV cache and parallel decoding\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=3Z3Is6hnOT) Cited by: [§1](https://arxiv.org/html/2605.22939#S1.p1.1)\.
- G\. Xu, W\. Xu, J\. Zhao, and K\. Ma \(2026\) GIFT: guided importance\-aware fine\-tuning for diffusion language models\. External Links: 2509\.20863, [Link](https://arxiv.org/abs/2509.20863) Cited by: [§1](https://arxiv.org/html/2605.22939#S1.p3.2), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px1.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px3.p1.3), [§6\.3](https://arxiv.org/html/2605.22939#S6.SS3.SSS0.Px1.p1.1), [Table 1](https://arxiv.org/html/2605.22939#S6.T1), [Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2)\.
- Z\. Xu, Y\. Liu, Y\. Yin, M\. Zhou, and R\. Poovendran \(2025\) KodCode: a diverse, challenging, and verifiable synthetic dataset for coding\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 6980–7008\. External Links: [Link](https://aclanthology.org/2025.findings-acl.365/) Cited by: [Appendix F](https://arxiv.org/html/2605.22939#A6.p1.1)\.
- J\. Ye, Z\. Xie, L\. Zheng, J\. Gao, Z\. Wu, X\. Jiang, Z\. Li, and L\. Kong \(2025\) Dream 7b: diffusion large language models\. arXiv preprint arXiv:2508\.15487\. Cited by: [§B\.1](https://arxiv.org/html/2605.22939#A2.SS1.p1.1), [§1](https://arxiv.org/html/2605.22939#S1.p1.1), [§1](https://arxiv.org/html/2605.22939#S1.p2.1), [§1](https://arxiv.org/html/2605.22939#S1.p3.2), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1), [§3](https://arxiv.org/html/2605.22939#S3.p1.20), [§4](https://arxiv.org/html/2605.22939#S4.SS0.SSS0.Px3.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px3.p1.3), [§6\.3](https://arxiv.org/html/2605.22939#S6.SS3.SSS0.Px4.p1.1), [Table 1](https://arxiv.org/html/2605.22939#S6.T1), [Table 1](https://arxiv.org/html/2605.22939#S6.T1.4.2.2), [Table 7](https://arxiv.org/html/2605.22939#S6.T7.4.1), [Table 7](https://arxiv.org/html/2605.22939#S6.T7.6.2)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, et al\. \(2025\) Dapo: an open\-source llm reinforcement learning system at scale\. arXiv preprint arXiv:2503\.14476\. Cited by: [§5](https://arxiv.org/html/2605.22939#S5.SS0.SSS0.Px4.p1.1)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. Goodman \(2022\) Star: bootstrapping reasoning with reasoning\. Advances in Neural Information Processing Systems 35, pp\. 15476–15488\. Cited by: [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Zhao, D\. Gupta, Q\. Zheng, and A\. Grover \(2025\) D1: scaling reasoning in diffusion large language models via reinforcement learning\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=7ZVRlBFuEv) Cited by: [§D\.2](https://arxiv.org/html/2605.22939#A4.SS2.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.22939#S1.p2.1), [§1](https://arxiv.org/html/2605.22939#S1.p4.2), [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px2.p1.1), [Figure 4](https://arxiv.org/html/2605.22939#S6.F4), [Figure 4](https://arxiv.org/html/2605.22939#S6.F4.4.2.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px1.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px2.p1.1), [§6\.2](https://arxiv.org/html/2605.22939#S6.SS2.SSS0.Px3.p2.1), [§6\.3](https://arxiv.org/html/2605.22939#S6.SS3.SSS0.Px1.p1.1), [Table 4](https://arxiv.org/html/2605.22939#S6.T4), [Table 4](https://arxiv.org/html/2605.22939#S6.T4.78.2.1)\.
- F\. Zhu, R\. Wang, S\. Nie, X\. Zhang, C\. Wu, J\. Hu, J\. Zhou, J\. Chen, Y\. Lin, J\. Wen, et al\. \(2025\) LLaDA 1\.5: variance\-reduced preference optimization for large language diffusion models\. arXiv preprint arXiv:2505\.19223\. Cited by: [§2](https://arxiv.org/html/2605.22939#S2.SS0.SSS0.Px1.p1.1), [§6\.1](https://arxiv.org/html/2605.22939#S6.SS1.SSS0.Px3.p1.3), [§6\.3](https://arxiv.org/html/2605.22939#S6.SS3.SSS0.Px1.p1.1)\.

Learnability\-Informed Fine\-Tuning of Diffusion Language Models Appendix

## Appendix A Pareto Frontier for Countdown and Sudoku

![Refer to caption](https://arxiv.org/html/2605.22939v1/x7.png)\(a\) Countdown
![Refer to caption](https://arxiv.org/html/2605.22939v1/x8.png)\(b\) Sudoku

Figure 5: Accuracy vs\. H100 hours \(log scale\) on Countdown and Sudoku\.

We show the Pareto frontier for Countdown and Sudoku in Fig\. [5](https://arxiv.org/html/2605.22939#A1.F5)\.

## Appendix B Additional Analysis and Ablations

![Refer to caption](https://arxiv.org/html/2605.22939v1/x9.png)\(a\) Dream confidence vs\. token frequency \(global\)\.
![Refer to caption](https://arxiv.org/html/2605.22939v1/x10.png)\(b\) Dream confidence vs\. token frequency \(timestep separated\)\.

Figure 6: Dream Token Analysis\. For each token, we compute Dream’s mean confidence when the token is the masked target and plot it against the token’s frequency in our collated post\-training corpus\. To reduce noise, tokens are grouped into shared log\-spaced frequency bins \(with a final tail bin for the most frequent tokens\), and we plot the bin\-wise average confidence versus the bin’s mean frequency\. We show the marginalized global trend \(left\) and the same relationship stratified by diffusion timestep \(right\)\. The analysis covers 1\.22e8 tokens\.

### B\.1 Dream Token Analysis

We extend the analysis visualized in Figure [2](https://arxiv.org/html/2605.22939#S1.F2) to other DLMs\. We analyze masked token frequencies and confidences for Dream \(Ye et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib16)\) in Figure [6](https://arxiv.org/html/2605.22939#A2.F6)\. On average, the same trend holds: at higher timesteps, the confidence distribution favors frequent tokens\.
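The binning described in the caption of Figure 6 can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `binned_confidence`, the choice of bin edges at powers of ten, and the toy inputs are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def binned_confidence(freqs, confs, edges):
    """Return (mean frequency, mean confidence) for each non-empty bin.

    freqs: corpus frequency (count) of each token
    confs: mean model confidence for each token when it is the masked target
    edges: increasing bin edges; tokens with freq >= edges[-1] share a tail bin
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # bin -> [freq sum, conf sum, count]
    for f, c in zip(freqs, confs):
        b = bisect_right(edges, f)  # b such that edges[b-1] <= f < edges[b]
        if b == 0:
            continue  # below the first edge: dropped in this sketch
        s = sums[b]
        s[0] += f
        s[1] += c
        s[2] += 1
    return [(sums[b][0] / sums[b][2], sums[b][1] / sums[b][2]) for b in sorted(sums)]


# Hypothetical log-spaced edges 10^1 ... 10^5; the last bin is the tail bin.
edges = [10.0 ** k for k in range(1, 6)]
freqs = [12, 50, 300, 5000, 20000, 400000]      # toy counts
confs = [0.60, 0.70, 0.72, 0.78, 0.84, 0.88]    # toy confidences
rows = binned_confidence(freqs, confs, edges)
```

To stratify by diffusion timestep, as in the right panel, the same function could be applied separately to the tokens observed in each timestep range.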

Additionally, we sample tokens from each frequency bin in s1K and visualize them alongside the corresponding average confidence that LLaDA exhibits for tokens in that bin in Table [9](https://arxiv.org/html/2605.22939#A2.T9)\. Again, we observe a clear frequency–confidence trend: high\-frequency tokens are associated with higher average confidence, while rare tokens tend to receive lower confidence, consistent with the patterns in our aggregate plots\.

Table 9: Word clouds of sampled tokens from s1K within each frequency bin, alongside the average LLaDA confidence computed over *all* tokens in that bin\.

| Frequency Bin | Tokens | Confidence |
| --- | --- | --- |
| $10^{1}$–$10^{2}$ | ![[Uncaptioned image]](https://arxiv.org/html/2605.22939v1/x11.png) | 0\.6754 |
| $10^{2}$–$10^{3}$ | ![[Uncaptioned image]](https://arxiv.org/html/2605.22939v1/x12.png) | 0\.7270 |
| $10^{3}$–$10^{4}$ | ![[Uncaptioned image]](https://arxiv.org/html/2605.22939v1/x13.png) | 0\.7788 |
| $10^{4}$–$10^{5}$ | ![[Uncaptioned image]](https://arxiv.org/html/2605.22939v1/x14.png) | 0\.8431 |
| $10^{5}+$ | ![[Uncaptioned image]](https://arxiv.org/html/2605.22939v1/x15.png) | 0\.8829 |

### B\.2 Confidence Intervals

We report the confidence intervals for our experiments on different datasets\.

![Refer to caption](https://arxiv.org/html/2605.22939v1/x16.png)\(a\) GSM8K
![Refer to caption](https://arxiv.org/html/2605.22939v1/x17.png)\(b\) Math500
![Refer to caption](https://arxiv.org/html/2605.22939v1/x18.png)\(c\) Countdown
![Refer to caption](https://arxiv.org/html/2605.22939v1/x19.png)\(d\) Sudoku

Figure 7: Confidence intervals for our experiments, obtained from three runs with different seeds\. The box plots illustrate the distribution of accuracy scores across seeds for five methods\. The central horizontal lines represent the median, while the boxes and whiskers show the confidence intervals and performance range for \(a\) GSM8K, \(b\) Math500, \(c\) Countdown, and \(d\) Sudoku\.

### B\.3 Compute\-matched comparison with baselines

To evaluate training efficiency, we conducted a compute\-matched experiment in which we halved the epochs for GIFT and LIFT to compare them directly against Vanilla and CART at equivalent compute scales \(e\.g\., 1 Vanilla epoch corresponds to 0\.5 LIFT epochs\)\. As shown in Table [10](https://arxiv.org/html/2605.22939#A2.T10), LIFT performs comparably to the baselines at low epoch counts but scales significantly better as training progresses\. This sustained improvement stems from LIFT dynamically adjusting its training target difficulty based on the noise schedule\. Ultimately, this confidence\-based token selection acts as an effective adaptive curriculum \(Parashar et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib20)\), systematically optimizing the learning process as the model improves\.

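Table 10: Compute\-matched comparison\. We compare GIFT and LIFT against Vanilla and CART at equivalent compute scales\. LIFT scales favorably at higher compute budgets\.

| Epochs \(Vanilla/CART\) | Epochs \(GIFT/LIFT\) | GSM8K Vanilla | GSM8K CART | GSM8K GIFT | GSM8K $\text{LIFT}_{3}$ | MATH Vanilla | MATH CART | MATH GIFT | MATH $\text{LIFT}_{3}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0\.5 | 76\.6 | 78\.8 | 76\.3 | 76\.8 | 33\.6 | 35\.2 | 35\.0 | 34\.0 |
| 2 | 1 | 79\.6 | 78\.9 | 76\.9 | 75\.8 | 33\.8 | 37\.0 | 34\.0 | 34\.6 |
| 4 | 2 | 77\.1 | 78\.4 | 78\.5 | 78\.9 | 33\.8 | 34\.8 | 34\.6 | 35\.9 |
| 8 | 4 | 77\.4 | 76\.4 | 76\.6 | 77\.7 | 32\.6 | 32\.0 | 33\.8 | 35\.7 |
| 16 | 8 | 77\.3 | 76\.8 | 77\.8 | 80\.2 | 32\.2 | 31\.4 | 31\.4 | 36\.1 |
| 20 | 10 | 77\.3 | 76\.5 | 77\.7 | 80\.2 | 32\.6 | 29\.2 | 31\.8 | 36\.6 |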

## Appendix C Dataset Construction Details for LIFT\-SFT\-12K

We constructed the dataset by mining and consolidating math\-focused samples from three publicly available post\-training corpora: NVIDIA/Llama\-Nemotron\-Post\-Training\-Dataset \(Bercovich et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib30)\), open\-r1/Mixture\-of\-Thoughts \(Open\-R1, [2025](https://arxiv.org/html/2605.22939#bib.bib31)\), and AllenAI/Dolci\-Think\-RL\-32B \(Team OLMo et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib32)\)\. From each source, we filtered instances specifically related to mathematical problem solving and reasoning tasks\. The filtered subsets were then merged and randomly sampled to obtain a balanced collection of 12,000 examples\. This curated dataset was used to fine\-tune LLaDA\-8B\-Instruct\.
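The filter, merge, and sample steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function `build_subset`, the `is_math` predicate, the `domain` field, and the toy corpora are hypothetical, and the text does not specify how the math filter or the balancing was implemented. The sketch samples uniformly from the merged pool.

```python
import random


def build_subset(sources, is_math, n_total, seed=0):
    """Filter each source, merge the filtered subsets, then randomly sample."""
    merged = []
    for name, examples in sources.items():
        # Keep only math-related examples and remember where they came from.
        merged.extend({**ex, "source": name} for ex in examples if is_math(ex))
    rng = random.Random(seed)
    # Sample without replacement; cap at the pool size for small toy inputs.
    return rng.sample(merged, min(n_total, len(merged)))


# Toy stand-ins for the three corpora (hypothetical records and field names).
sources = {
    "corpus_a": [{"text": "Solve x + 1 = 3.", "domain": "math"},
                 {"text": "Write a poem.", "domain": "chat"}],
    "corpus_b": [{"text": "Compute 7 * 8.", "domain": "math"}],
    "corpus_c": [{"text": "Prove that 2 is prime.", "domain": "math"},
                 {"text": "Sort a list.", "domain": "code"}],
}
subset = build_subset(sources, lambda ex: ex["domain"] == "math", n_total=2)
```

For the dataset described here, `n_total` would be 12,000.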

Table 11: Hyperparameters used for training the model\.

| Hyperparameter | Value |
| --- | --- |
| Learning rate scheduler type | Linear |
| Adam $\beta$ parameters | $\beta_{1}=0.9,\ \beta_{2}=0.999$ |
| Gradient accumulation steps | 4 |
| Per device train batch size | 2 |
| Epochs | 20 |
| Maximum sequence length | 4096 |
| Precision | bf16 |
| LoRA $r$ | 128 |
| LoRA $\alpha$ | 256 |
| Weight decay | 0\.1 |
| Maximum gradient norm | 1\.0 |

## Appendix D Implementation Details

### D\.1 Training

All methods are trained using the common hyperparameters listed in Table [11](https://arxiv.org/html/2605.22939#A3.T11), with method\-specific learning rates\. For Vanilla, we find that a learning rate of $1\text{e}{-5}$ yields the best performance\. CART uses the same setting for consistency\. For GIFT, we use the recommended learning rate of $2\text{e}{-5}$ on s1K, while a lower rate of $1\text{e}{-6}$ performs better on LIFT\-SFT\-12K\. Across all settings, LIFT uses a learning rate of $5\text{e}{-6}$\.
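As a compact summary, the shared settings from Table 11 and the method-specific learning rates above could be written as a configuration like the one below. The key names and the `learning_rate` helper are hypothetical and are not taken from the released code; only the values come from the text.

```python
# Shared hyperparameters from Table 11 (key names are illustrative).
COMMON = {
    "lr_scheduler_type": "linear",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "gradient_accumulation_steps": 4,
    "per_device_train_batch_size": 2,
    "num_train_epochs": 20,
    "max_seq_length": 4096,
    "precision": "bf16",
    "lora_r": 128,
    "lora_alpha": 256,
    "weight_decay": 0.1,
    "max_grad_norm": 1.0,
}


def learning_rate(method, dataset):
    """Method-specific learning rate as described in Section D.1."""
    if method in ("Vanilla", "CART"):
        return 1e-5
    if method == "GIFT":
        return {"s1K": 2e-5, "LIFT-SFT-12K": 1e-6}[dataset]
    if method == "LIFT":
        return 5e-6
    raise ValueError(f"unknown method: {method}")


# Sequences contributing to one optimizer step on a single device.
per_device_effective_batch = (
    COMMON["per_device_train_batch_size"] * COMMON["gradient_accumulation_steps"]
)
# Standard LoRA scaling factor alpha / r.
lora_scaling = COMMON["lora_alpha"] / COMMON["lora_r"]
```

With these values, each device accumulates 8 sequences per optimizer step and the LoRA update is scaled by 2. The number of devices is not stated, so the global batch size cannot be derived from the table alone.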

### D\.2 Inference

#### Evaluation Hyperparameters\.

We follow the evaluation setup of d1 \(Zhao et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib21)\) for all experiments\. The model generates 2 tokens per diffusion step and is evaluated with generation lengths of 128, 256, and 512 tokens\. Decoding is performed with temperature $\tau=0$\.

For AIME, we use temperature $\tau=0.1$ for AIME’24 and $\tau=0.2$ for AIME’25\. The generation length is fixed to 512 and the number of evaluation steps is 256\. Additionally, to speed up evaluation, we implement prefix caching \(Wu et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib46)\)\.
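The relationship between generation length and the number of diffusion steps implied by these settings is simple arithmetic, sketched below. The helper name `num_diffusion_steps` is an assumption for illustration.

```python
def num_diffusion_steps(gen_length, tokens_per_step=2):
    """Number of diffusion steps needed to unmask gen_length tokens."""
    if gen_length % tokens_per_step != 0:
        raise ValueError("gen_length must be a multiple of tokens_per_step")
    return gen_length // tokens_per_step


# Generation lengths used in the main evaluation, at 2 tokens per step.
steps = {length: num_diffusion_steps(length) for length in (128, 256, 512)}
```

At 2 tokens per step, the three generation lengths correspond to 64, 128, and 256 steps, and the AIME setting (length 512, 256 steps) matches the same rate.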

## Appendix E Additional Results on AIME’24 and AIME’25

Table 12: Performance comparison on AIME’24 and AIME’25 under different avg@$K$ and pass@$K$ values\.

| Method | AIME’24 Avg@8 | AIME’24 Pass@8 | AIME’24 Avg@16 | AIME’24 Pass@16 | AIME’25 Avg@8 | AIME’25 Pass@8 | AIME’25 Avg@16 | AIME’25 Pass@16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Instruct | 0\.4 | 3\.3 | 0\.4 | 3\.3 | 0\.4 | 3\.3 | 0\.4 | 3\.3 |
| Vanilla | 0\.4 | 3\.3 | 0\.8 | 6\.6 | 0\.0 | 0\.0 | 0\.23 | 3\.3 |
| GIFT | 1\.3 | 6\.7 | 2\.1 | 16\.7 | 0\.0 | 0\.0 | 0\.0 | 0\.0 |
| CART | 0\.4 | 3\.3 | 1\.5 | 10\.0 | 0\.4 | 3\.3 | 0\.2 | 3\.3 |
| $\mathbf{LIFT}_{2}$ | 0\.8 | 6\.7 | 1\.1 | 10\.0 | 0\.0 | 0\.0 | 0\.4 | 6\.7 |
| $\mathbf{LIFT}_{3}$ | 1\.0 | 10\.0 | 1\.7 | 16\.7 | 0\.8 | 6\.7 | 0\.8 | 6\.7 |

![Refer to caption](https://arxiv.org/html/2605.22939v1/x20.png)\(a\) AIME 2024
![Refer to caption](https://arxiv.org/html/2605.22939v1/x21.png)\(b\) AIME 2025

Figure 8: Confidence intervals for AIME 2024 and 2025\.

In Table [12](https://arxiv.org/html/2605.22939#A5.T12), we provide expanded results on AIME’24 and AIME’25 for pass@$K$ and avg@$K$ at $K=8,16$, with confidence intervals for AIME in Fig\. 8\.
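For readers unfamiliar with the two metrics in Table 12, the sketch below shows one common way to compute them from per-sample correctness. It assumes avg@K is the mean accuracy over all K samples per problem and pass@K is the fraction of problems with at least one correct sample among K; the text does not spell out the exact estimator, so these definitions and the function names are assumptions.

```python
def avg_at_k(correct):
    """Mean accuracy (%) over all samples; correct[i][j] is True if sample j of problem i is right."""
    total = sum(len(row) for row in correct)
    return 100.0 * sum(sum(row) for row in correct) / total


def pass_at_k(correct):
    """Share of problems (%) with at least one correct sample."""
    return 100.0 * sum(any(row) for row in correct) / len(correct)


# Toy case: 30 problems, K = 8 samples each, exactly one correct sample overall.
K = 8
correct = [[False] * K for _ in range(30)]
correct[0][3] = True

avg8 = avg_at_k(correct)    # 100 * 1 / 240
pass8 = pass_at_k(correct)  # 100 * 1 / 30
```

Under these assumed definitions, a single correct sample across 30 problems gives values that round to 0.4 and 3.3, which illustrates how small the absolute counts behind such entries can be.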

## Appendix F Additional Results on HumanEval and MBPP

We extend our evaluation to the domain of code generation, assessing model performance on MBPP \(Austin et al\., [2021b](https://arxiv.org/html/2605.22939#bib.bib8)\) and HumanEval \(Chen et al\., [2021](https://arxiv.org/html/2605.22939#bib.bib7)\)\. For this testing, models were first fine\-tuned on the KodCode \(Xu et al\., [2025](https://arxiv.org/html/2605.22939#bib.bib6)\) dataset for 5 epochs and then evaluated with a maximum generation length of 256 tokens\. As presented in Table [13](https://arxiv.org/html/2605.22939#A6.T13), LIFT demonstrates strong performance on this task, achieving the best overall results compared to the baselines\.

Table 13: Evaluation on Code Generation\. Models were fine\-tuned on the KodCode dataset for 5 epochs\. We report performance on MBPP and HumanEval with a maximum generation length of 256 tokens\. LIFT variants achieve the strongest overall results compared to existing baselines\.

| Method | MBPP \(256\) | HumanEval \(256\) |
| --- | --- | --- |
| Instruct \(Base\) | 41\.1 | 34\.8 |
| Vanilla | 43\.2 | 31\.1 |
| CART | 41\.9 | 32\.9 |
| GIFT | 44\.4 | 35\.2 |
| $\mathbf{LIFT}_{2}$ | 43\.6 | 37\.4 |
| $\mathbf{LIFT}_{3}$ | 44\.0 | 36\.3 |
