Masked Language Flow Models

arXiv cs.CL 06/29/26, 04:00 AM Papers
Summary
This paper introduces Masked Language Flow Models (MLFMs), which incorporate masking into flow-based language models to enable continuous flow for conditional generation and allow pretrained Masked Diffusion Models to be converted. The authors propose a novel sampler that alternates continuous denoising with discrete unmasking, demonstrating for the first time that flow-based language models can scale to downstream reasoning and instruction-following tasks.
arXiv:2606.27617v1 Announce Type: new Abstract: Masked Diffusion Models (MDMs) promise fast, parallel language generation, but their reverse transition factorises across token positions -- an approximation that breaks down in the few-step sampling regime where parallel generation ought to provide the greatest efficiency gains. Flow Language Models (FLMs) sidestep this limitation by learning a continuous flow that transports noise toward clean sequences represented in Euclidean space, inducing a flow map that can be distilled for single-step generation. However, this makes complex tasks requiring multi-step reasoning problematic for FLMs, as FLMs are forced to decode every token during generation. To address this, we introduce Masked Language Flow Models (MLFMs), which incorporate masking into FLMs using a continuous stochastic interpolant to bridge partially masked and clean sequences. This design enables conditional generation via continuous flows and allows pretrained MDMs to be converted into MLFMs through a simple, lightweight adaptation. Leveraging this flexibility, we propose a novel sampler that alternates continuous denoising with the discrete unmasking of confident tokens to better support multi-step reasoning. We evaluate our approach on GSM8K and MT-Bench and find, for the first time, that flow-based language models can be scaled to solve downstream reasoning and instruction-following tasks.
Cached at: 06/29/26, 05:23 AM
# Masked Language Flow Models
Source: [https://arxiv.org/html/2606.27617](https://arxiv.org/html/2606.27617)
###### Abstract

*Masked Diffusion Models* (MDMs) promise fast, parallel language generation, but their reverse transition factorises across token positions, an approximation that breaks down in the few-step sampling regime where parallel generation ought to provide the greatest efficiency gains. *Flow Language Models* (FLMs) sidestep this limitation by learning a continuous flow that transports noise toward clean sequences represented in Euclidean space, inducing a flow map that can be distilled for single-step generation. However, this makes complex tasks requiring multi-step reasoning problematic for FLMs, as FLMs are forced to decode every token during generation. To address this, we introduce Masked Language Flow Models (MLFMs), which incorporate masking into FLMs using a continuous stochastic interpolant to bridge partially masked and clean sequences. This design enables conditional generation via continuous flows and allows pretrained MDMs to be converted into MLFMs through a simple, lightweight adaptation. Leveraging this flexibility, we propose a novel sampler that alternates continuous denoising with the discrete unmasking of confident tokens to better support multi-step reasoning. We evaluate our approach on GSM8K and MT-Bench and find, for the first time, that flow-based language models can be scaled to solve downstream reasoning and instruction-following tasks.

Code is available at: [github.com/imbirik/mlfm](https://github.com/imbirik/mlfm).

## 1 Introduction

*Autoregressive Models* (ARMs) have driven much of the recent progress in language modelling by framing language generation as next-token prediction (Brown et al., [2020](https://arxiv.org/html/2606.27617#bib.bib6)). However, their left-to-right factorisation of the joint distribution makes this process inherently sequential, as each token must be conditioned on all preceding ones. Consequently, inference costs scale linearly with output length, creating a bottleneck for long sequences. This limitation has motivated growing interest in *Masked Diffusion Models* (MDMs) (Austin et al., [2021](https://arxiv.org/html/2606.27617#bib.bib4); Campbell et al., [2022](https://arxiv.org/html/2606.27617#bib.bib7); Lou et al., [2024](https://arxiv.org/html/2606.27617#bib.bib29); Shi et al., [2024](https://arxiv.org/html/2606.27617#bib.bib39); Sahoo et al., [2024](https://arxiv.org/html/2606.27617#bib.bib38)), which replace sequential generation with the parallel decoding of masked tokens.

Although MDMs have demonstrated strong performance across language modelling and downstream tasks (Nie et al., [2025a](https://arxiv.org/html/2606.27617#bib.bib30), [b](https://arxiv.org/html/2606.27617#bib.bib31)), they typically rely on a factorised reverse transition across masked positions, making inference tractable in the combinatorially large discrete state space. This approximation is accurate in the infinitesimal-step limit but becomes highly restrictive in the few-step regime (Deschenaux and Gulcehre, [2024](https://arxiv.org/html/2606.27617#bib.bib16)), where each step must independently decode many masked tokens at once, ignoring the dependencies amongst them. Consequently, the very regimes where parallel decoding promises the greatest speedups (Dieleman, [2023](https://arxiv.org/html/2606.27617#bib.bib17); Zheng et al., [2024](https://arxiv.org/html/2606.27617#bib.bib47)) are precisely those where this independence assumption most severely compromises generation quality.

*Flow Language Models* (FLMs) (Roos et al., [2026](https://arxiv.org/html/2606.27617#bib.bib37); Lee et al., [2026](https://arxiv.org/html/2606.27617#bib.bib25); Potaptchik et al., [2026](https://arxiv.org/html/2606.27617#bib.bib35); Davis et al., [2026](https://arxiv.org/html/2606.27617#bib.bib13); Chen et al., [2026](https://arxiv.org/html/2606.27617#bib.bib10); Hu et al., [2026](https://arxiv.org/html/2606.27617#bib.bib23)) address this issue by moving from a discrete state space into a continuous one. Specifically, FLMs learn a flow (Song et al., [2021](https://arxiv.org/html/2606.27617#bib.bib41); Lipman et al., [2023](https://arxiv.org/html/2606.27617#bib.bib27); Albergo et al., [2025](https://arxiv.org/html/2606.27617#bib.bib3)) transporting noise to continuous embeddings of token sequences, allowing latent states to evolve jointly across token positions. A significant advantage of FLMs is that they also induce a flow map which can be distilled to support few-step and even one-step generation. However, collapsing generation to a single flow map may be too restrictive for language tasks that require multi-step reasoning. Indeed, many language tasks benefit from iterative generation where partially completed intermediate states provide context for subsequent refinements (Ghazvininejad et al., [2019](https://arxiv.org/html/2606.27617#bib.bib19); Nye et al., [2021](https://arxiv.org/html/2606.27617#bib.bib32); Wei et al., [2022](https://arxiv.org/html/2606.27617#bib.bib44)).

To address this limitation, we propose Masked Language Flow Models (MLFMs). MLFMs incorporate masking from MDMs into FLMs by using a Brownian bridge as a stochastic interpolant connecting embedded partially masked sequences with embedded clean sequences. As a result, MLFMs retain the exact any-position conditioning structure of MDMs while learning a coupled continuous flow for generating clean tokens at masked positions.

Moreover, this special structure of MLFMs gives us a natural strategy for training and inference. For training, we propose adapting pretrained MDMs into MLFMs: this is possible because, at the boundary of the Brownian bridge, the stochastic interpolant reduces to an embedded partially masked sequence, matching the setup of MDMs. This allows us to warm-start MLFM training from large pretrained MDMs, greatly reducing the compute required to train our models.

For inference, we leverage the flexibility MLFMs provide through any-position conditional generation. Specifically, we introduce a novel sampling scheme that combines a new approach to classifier-free guidance with the online promotion of confident tokens. Under this guidance method, clean tokens are noised in the reference velocity to isolate their contribution. Concurrently, if the posterior mode at a specific token position reaches a probability of at least $1-\epsilon$, we immediately commit to that token and substitute in its clean embedding. By promoting these resolved tokens early, they instantly become useful context for subsequent generation steps, rather than remaining noisy until the final step.

Empirically, we adapt a pretrained MDM with 1028M parameters from Nie et al. ([2025a](https://arxiv.org/html/2606.27617#bib.bib30)) into an MLFM and evaluate the resulting model on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.27617#bib.bib12)) and MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2606.27617#bib.bib48)). MLFM improves over SMDM and similarly sized AR baselines on MT-Bench and obtains encouraging performance on GSM8K, demonstrating its ability to handle both instruction-following and reasoning tasks. Across these benchmarks, we also find that our novel sampling scheme provides significant improvements. To the best of our knowledge, this is the first time that flow-based language models have been scaled to downstream reasoning and instruction-following tasks.

## 2 Background

In this section, we provide an overview of the background required for our approach. Section [2.1](https://arxiv.org/html/2606.27617#S2.SS1) discusses Masked Diffusion Models and Section [2.2](https://arxiv.org/html/2606.27617#S2.SS2) discusses Flow Language Models, in particular LangFlow (Chen et al., [2026](https://arxiv.org/html/2606.27617#bib.bib10)), which we take as the basis for our approach.

##### Notation.

Throughout, we let $\mathcal{V}$ denote a vocabulary of tokens, $|\mathcal{V}|$ the vocabulary size, and $L$ the sequence length. We will identify elements in $\mathcal{V}$ with the natural numbers $\{1,\ldots,|\mathcal{V}|\}$. A (clean) sequence $X\sim p$, where $p$ denotes our data distribution, is written as $X=(X^{1},\ldots,X^{L})\in\mathcal{V}^{L}$. We use $s\in[0,1]$ for the masking probability and $t\in[0,1]$ for the continuous noising time.

### Masked Diffusion Models

*Masked Diffusion Models* (MDMs) construct a non-autoregressive generative model for $p$ by introducing a special *mask* token `[MASK]` to the vocabulary $\mathcal{V}$ and defining a continuous-time Markov chain (CTMC) (Del Moral and Penev, [2017](https://arxiv.org/html/2606.27617#bib.bib14)) transporting $p$ to the Dirac measure on fully masked sequences. The forward process $X_{s}\sim p_{s|0}(X_{s}\mid X)$ of the CTMC at masking probability $s\in[0,1]$, for $X\sim p$, is given by the following sampling procedure:

$$X_{s}^{\ell}=\begin{cases}X^{\ell},&\text{if }B^{\ell}=0,\\ \texttt{[MASK]},&\text{if }B^{\ell}=1,\end{cases}\qquad\text{where }B^{\ell}\stackrel{\mathrm{i.i.d.}}{\sim}\operatorname{Bernoulli}(s).\tag{1}$$

We let $\mathcal{M}_{s}=\{\ell:X_{s}^{\ell}=\texttt{[MASK]}\}$ denote the positions of mask tokens in $X_{s}$ and $\mathcal{U}_{s}=\{\ell:X_{s}^{\ell}\neq\texttt{[MASK]}\}$ the positions of clean tokens in $X_{s}$.
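The forward process in Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the function name `mask_sequence`, the string placeholder `MASK`, and the toy sentence are assumptions of this sketch and do not come from the paper's code.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol used only in this sketch


def mask_sequence(tokens, s, rng):
    """Mask each position independently with probability s, as in Eq. (1).

    Returns the partially masked sequence X_s together with the masked
    positions M_s and the clean positions U_s.
    """
    x_s, masked_pos, clean_pos = [], [], []
    for pos, tok in enumerate(tokens):
        if rng.random() < s:  # B = 1 with probability s
            x_s.append(MASK)
            masked_pos.append(pos)
        else:                 # B = 0: keep the clean token
            x_s.append(tok)
            clean_pos.append(pos)
    return x_s, masked_pos, clean_pos


rng = random.Random(0)
x = ["the", "cat", "sat", "on", "the", "mat"]
x_s, M_s, U_s = mask_sequence(x, s=0.5, rng=rng)
```

By construction, `M_s` and `U_s` partition the positions, and the extreme values `s=0` and `s=1` give the clean and the fully masked sequence respectively.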

We can generate approximate samples from $p$ by simulating the backward process of the CTMC. This involves gradually unmasking tokens to reveal clean tokens, which are then fixed for later steps. This relies on access to the ground-truth factorised posterior $p_{\text{data}}^{\ell}(X^{\ell}\mid X_{s}^{\mathcal{U}_{s}})$ of clean tokens given unmasked tokens (Zheng et al., [2024](https://arxiv.org/html/2606.27617#bib.bib47); Ou et al., [2025](https://arxiv.org/html/2606.27617#bib.bib33)), where $X_{s}^{\mathcal{U}_{s}}$ denotes the subcollection of $X_{s}$ consisting of clean tokens. This can be learned from data via the following likelihood-based objective:

$$\mathcal{L}_{\mathrm{MDM}}(\theta)=\mathbb{E}_{X\sim p,\,X_{s}\sim p_{s|0}(\cdot|X),\,s\sim U(0,1)}\left[\frac{1}{s}\sum_{\ell\in\mathcal{M}_{s}}-\log p_{\theta}^{\ell}(X^{\ell}\mid X_{s}^{\mathcal{U}_{s}})\right],\tag{2}$$

where we parametrise $p_{\theta}^{\ell}$ with a neural network to approximate $p_{\text{data}}^{\ell}$ and $U(0,1)$ denotes the uniform distribution over $[0,1]$.

During inference, MDMs can unmask multiple tokens in parallel in a single step. Because $p_{\theta}^{\ell}$ is factorised, this is most effective when the tokens to be unmasked are close to conditionally independent given the current masked state. However, this typically holds only for a small subset of tokens: the more tokens are unmasked at once, the more likely it is that some of them depend strongly on each other. As a result, MDMs still require many refinement steps for high-quality samples despite their parallel decoding structure.

### Flow Language Models

We consider the *LangFlow* (Chen et al., [2026](https://arxiv.org/html/2606.27617#bib.bib10)) approach to *Flow Language Models* (FLMs), which operates within a learnt embedding space instead of one-hot embeddings.

Let $H$ be the embedding dimension. For each token $a\in\mathcal{V}$ in the vocabulary, we represent the token with its corresponding embedding vector $E_{a}\in\mathbb{R}^{H}$ and we let $E\in\mathbb{R}^{|\mathcal{V}|\times H}$ denote the overall embedding matrix. Moreover, for a sequence $X\sim p$, we represent $X$ in Euclidean space by the embedding:

$$z_{1}=(E_{X^{1}},\ldots,E_{X^{L}})\in\mathbb{R}^{L\times H}.$$

With this continuous representation $z_{1}$ of discrete sequences, we can construct a generative model for $p$ by considering the Gaussian-based stochastic interpolant:

$$z_{t}=tz_{1}+(1-t)\epsilon,\tag{3}$$

where $\epsilon\sim\mathcal{N}(0,I_{L\times H})$ and $t\in[0,1]$. We note that $z_{0}$ is Gaussian noise and $z_{1}$ is an embedded sequence from $p$. To sample from the flow induced by this stochastic interpolant, we parametrise the denoiser $p_{\phi}$ which approximates the factorised posterior over clean tokens:

$$p_{\phi}(\cdot\mid z_{t},t)\in\Delta(\mathcal{V})^{L}\subset\mathbb{R}^{L\times|\mathcal{V}|}.$$

This is trained with the following cross-entropy objective:

$$\mathcal{L}_{\mathrm{FLM}}(\phi)=\mathbb{E}_{t\sim\pi_{t},X,\epsilon}\left[\frac{1}{L}\sum_{\ell=1}^{L}-\log p_{\phi}^{\ell}(X^{\ell}\mid z_{t},t)\right],\tag{4}$$

where $\pi_{t}$ denotes some distribution over $[0,1]$. Although the supervision target $X^{\ell}$ is discrete, the learned denoiser induces a continuous endpoint estimate:

$$\widehat{z}_{1,\phi}(z_{t},t):=P_{\phi}(z_{t},t)E,\qquad\text{where}\qquad P_{\phi}(z_{t},t)=p_{\phi}(\cdot\mid z_{t},t)\in\mathbb{R}^{L\times|\mathcal{V}|},\tag{5}$$

and $\widehat{z}_{1,\phi}(z_{t},t)$ estimates the posterior mean embedding $\mathbb{E}[z_{1}\mid z_{t},t]$. Additionally, $\widehat{z}_{1,\phi}(z_{t},t)$ allows us to estimate the velocity field for the flow:

$$v_{\phi}(z_{t},t):=\frac{\widehat{z}_{1,\phi}(z_{t},t)-z_{t}}{1-t},\qquad t<1.\tag{6}$$

Therefore, we can generate approximate samples from $p$ by integrating the ODE:

$$\frac{d}{dt}z_{t}=v_{\phi}(z_{t},t).$$

In practice, the ODE is discretized over a finite time grid, and the final token sequence is obtained by decoding the denoiser probabilities from $p_{\phi}$ in vocabulary space.
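To make the sampling loop concrete, the following is a minimal sketch of explicit Euler integration of the ODE using the endpoint estimate of Eq. (5) and the velocity of Eq. (6). Everything here is illustrative: `toy_denoiser` is a hand-written stand-in for a trained $p_{\phi}$ (a softmax over negative squared distances to each embedding), the 3-token embedding matrix is made up, and argmax decoding at the end is an assumption rather than the decoding rule used by LangFlow.

```python
import math
import random

# Toy embedding matrix with |V| = 3 tokens and H = 2 dimensions (illustrative values).
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]


def toy_denoiser(z, t):
    """Hypothetical stand-in for p_phi: per-position softmax over
    negative squared distances, sharpened as t approaches 1."""
    rows = []
    for z_l in z:
        scale = max(1.0 - t, 1e-3)
        logits = [-sum((a - b) ** 2 for a, b in zip(z_l, e)) / scale for e in E]
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def endpoint_estimate(P):
    """Eq. (5): z_hat_1 = P E, the probability-weighted mean embedding."""
    return [[sum(p * e[h] for p, e in zip(P_l, E)) for h in range(len(E[0]))]
            for P_l in P]


def euler_sample(L, n_steps, rng):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 on a uniform grid."""
    z = [[rng.gauss(0.0, 1.0) for _ in E[0]] for _ in range(L)]  # z_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # always < 1, so the velocity in Eq. (6) is defined
        z_hat = endpoint_estimate(toy_denoiser(z, t))
        z = [[a + dt * (b - a) / (1.0 - t) for a, b in zip(z_l, zh_l)]
             for z_l, zh_l in zip(z, z_hat)]
    P = toy_denoiser(z, 1.0)
    tokens = [max(range(len(E)), key=P_l.__getitem__) for P_l in P]  # argmax decode
    return z, tokens


z_final, tokens = euler_sample(L=4, n_steps=10, rng=random.Random(0))
```

On a uniform grid the last Euler step has `dt` equal to `1 - t` up to rounding, so the final state is essentially the last endpoint estimate, a convex combination of token embeddings.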

Continuous flows are attractive because they define a joint trajectory from noise to clean sequences. This trajectory can be distilled into a direct flow map or a consistency-style solver (Song et al., [2023](https://arxiv.org/html/2606.27617#bib.bib42); Boffi et al., [2026](https://arxiv.org/html/2606.27617#bib.bib5)), reducing the number of generation steps. However, an exact one-shot map is an overly demanding object for language models: some problems (such as maths or coding) may naturally require iterative reasoning in which intermediate commitments are used as context for later decisions.

## 3 Masked Language Flow Models

We introduce the core framework of MLFMs in Section [3.1](https://arxiv.org/html/2606.27617#S3.SS1). In Section [3.2](https://arxiv.org/html/2606.27617#S3.SS2), we show how pretrained MDMs can naturally be adapted to MLFMs, including the architectural changes required for this adaptation. Finally, Section [3.3](https://arxiv.org/html/2606.27617#S3.SS3) describes how we can apply supervised fine-tuning on instruction-response data to MLFMs, an aspect that, to the best of our knowledge, has not been explored in prior work on FLMs.

### Framework

We retain the use of mask tokens and token embeddings $E$ from MDMs and FLMs. We begin by defining a forward process $z_{s,t}\in\mathbb{R}^{L\times H}$ in Euclidean space. This process is parametrised by $(s,t)\in[0,1]^{2}$ where $s$ denotes the masking probability and $t$ denotes the noising time.

Let $m=E_{\texttt{[MASK]}}$ denote the embedding of the `[MASK]` token. For $X\sim p$, $s\sim\pi_{s}$, $t\sim\pi_{t}$, where $\pi_{s}$, $\pi_{t}$ are distributions on $[0,1]$, we sample a partially masked sequence $X_{s}\sim p_{s|0}(\cdot\mid X)$ as in the MDM forward process ([1](https://arxiv.org/html/2606.27617#S2.E1)) and we construct $z_{s,t}\in\mathbb{R}^{L\times H}$ as follows. For all positions $\ell\in\mathcal{U}_{s}$ corresponding to clean tokens in $X_{s}$, we fix $z^{\ell}_{s,t}$ to be the clean embedding:

$$z_{s,t}^{\ell}=E_{X^{\ell}},\qquad\forall\ell\in\mathcal{U}_{s},\ t\in[0,1].$$

For $\ell\in\mathcal{M}_{s}$, we construct $z_{s,t}^{\ell}$ via the stochastic interpolant formed by the Brownian bridge connecting $m$ and $E_{X^{\ell}}$:

$$z_{s,t}^{\ell}\mid X^{\ell}\sim\mathcal{N}\Big((1-t)m+tE_{X^{\ell}},\ \sigma^{2}t(1-t)I_{H}\Big),\qquad\forall\ell\in\mathcal{M}_{s},\ t\in[0,1],\tag{7}$$

where $\sigma>0$ is some choice of noise scale on masked positions. Since the variance vanishes at both endpoints, we see that the stochastic interpolant satisfies $z_{s,0}^{\ell}=m$ and $z_{s,1}^{\ell}=E_{X^{\ell}}$ almost surely. Thus $z_{s,0}$ recovers the embedding of the partially masked sequence $X_{s}$ and $z_{s,1}$ recovers the embedding of $X$. Additionally, for $t\in(0,1)$, the masked positions are modelled as noisy continuous states that provide partial information about their underlying clean token embeddings.
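As a sketch of how a training input $z_{s,t}$ could be built from Eq. (7), the snippet below clamps clean positions to their embeddings and draws masked positions from the Brownian-bridge marginal. The function name `mlfm_state`, the toy embedding matrix, and the choice of mask embedding `m` are assumptions of this illustration.

```python
import math
import random


def mlfm_state(token_ids, masked_pos, E, m, t, sigma, rng):
    """Build z_{s,t}: clean embeddings on unmasked positions and the
    Brownian-bridge marginal of Eq. (7) on masked positions."""
    std = sigma * math.sqrt(t * (1.0 - t))  # vanishes at t = 0 and t = 1
    masked = set(masked_pos)
    z = []
    for pos, tok in enumerate(token_ids):
        e = E[tok]
        if pos in masked:
            z.append([(1.0 - t) * m_h + t * e_h + std * rng.gauss(0.0, 1.0)
                      for m_h, e_h in zip(m, e)])
        else:
            z.append(list(e))  # clamped to the clean embedding for every t
    return z


# Toy setup (illustrative values): 3 tokens, H = 2, and a separate mask embedding.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
m = [0.5, 0.5]
token_ids = [0, 2, 1]
z_start = mlfm_state(token_ids, [1, 2], E, m, t=0.0, sigma=1.0, rng=random.Random(0))
z_end = mlfm_state(token_ids, [1, 2], E, m, t=1.0, sigma=1.0, rng=random.Random(0))
```

At `t=0` the masked rows equal the mask embedding and at `t=1` they equal the clean embeddings, which mirrors the endpoint behaviour stated in the text.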

This construction is related to recent hybrid discrete-continuous diffusion models for language (Pynadath et al., [2025](https://arxiv.org/html/2606.27617#bib.bib36); Zheng et al., [2025](https://arxiv.org/html/2606.27617#bib.bib46); Zhou et al., [2026](https://arxiv.org/html/2606.27617#bib.bib49)), which also combine discrete masking and continuous noising. The aforementioned works construct a forward process that masks and noises tokens simultaneously, with the motivation that continuous noising provides additional information to help scaffold the generation of discrete tokens. However, in MLFMs, we *decouple* these components: for any choice of sequence masking, we learn a continuous flow to generate clean tokens to fill in the masked tokens. This separation is distillation-friendly and enables the flexible inference procedures described in Section [4](https://arxiv.org/html/2606.27617#S4), where tokens can be revealed adaptively rather than at a fixed rate, as well as our adaptation of pretrained MDMs to accelerate training.

Following the same reasoning as in Chen et al. ([2026](https://arxiv.org/html/2606.27617#bib.bib10)), we sample from the flow induced by this stochastic interpolant by parametrising the MLFM denoiser $p_{\theta}$ to output token probabilities from the factorised posterior over clean tokens:

$$p_{\theta}(\cdot\mid z_{s,t},t)\in\Delta(\mathcal{V})^{L}\subset\mathbb{R}^{L\times|\mathcal{V}|}.$$

This is trained via cross-entropy on only *masked* subsets of tokens:

$$\mathcal{L}_{\mathrm{MLFM}}(\theta)=\mathbb{E}_{s,t,X,\epsilon}\left[\frac{1}{|\mathcal{M}_{s}|}\sum_{\ell\in\mathcal{M}_{s}}-\log p_{\theta}^{\ell}(X^{\ell}\mid z_{s,t},t)\right].\tag{8}$$

As in Chen et al. ([2026](https://arxiv.org/html/2606.27617#bib.bib10)), we find that the choice of $\pi_{t}$ is critical for performance. We therefore follow their entropy-based schedule, with a small adaptation to our setting. Further details are provided in Appendix [A](https://arxiv.org/html/2606.27617#A1), in the context of the experimental setup in Section [5](https://arxiv.org/html/2606.27617#S5).
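For a single draw of $(s,t,X,\epsilon)$, the bracketed term of Eq. (8) is a cross-entropy averaged over masked positions only. A minimal sketch follows; the function name, the convention of returning zero for an empty mask set, and the toy probabilities are assumptions of this example.

```python
import math


def mlfm_loss_sample(probs, token_ids, masked_pos):
    """One-sample estimate of the bracketed term in Eq. (8).

    probs[l][a] is the denoiser probability of token a at position l,
    evaluated at the noisy state z_{s,t}. Unmasked positions are ignored.
    """
    if not masked_pos:  # assumption of this sketch: no masked tokens, no loss
        return 0.0
    nll = [-math.log(probs[l][token_ids[l]]) for l in masked_pos]
    return sum(nll) / len(masked_pos)  # normalised by |M_s|, unlike the 1/s weight in Eq. (2)


probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
token_ids = [0, 0, 0]
loss = mlfm_loss_sample(probs, token_ids, masked_pos=[0, 2])
```

Here the loss is the mean of $-\log 0.5$ and $-\log 0.25$; the prediction at the unmasked position 1 does not enter.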

For sampling, we have that $p_{\theta}$ induces an estimate of the posterior mean embedding

$$\widehat{z}_{1,\theta}(z_{s,t},t):=P_{\theta}(z_{s,t},t)E,\qquad P_{\theta}(z_{s,t},t)=p^{\ell}_{\theta}(\cdot\mid z_{s,t},t),\tag{9}$$

which can then be used to estimate the velocity field of the flow:

$$v_{\theta}^{\ell}(z_{s,t},t):=\begin{cases}\dfrac{\widehat{z}_{1,\theta}^{\ell}(z_{s,t},t)-z_{s,t}^{\ell}}{1-t},&\ell\in\mathcal{M}_{s},\\[6pt] 0,&\ell\in\mathcal{U}_{s},\end{cases}\qquad t<1.\tag{10}$$

With this, we can generate samples by integrating the ODE defined by $v_{\theta}$. This evolves only the masked positions, while the positions in $\mathcal{U}_{s}$ remain clamped to their clean embeddings. Note that the incorporation of masks naturally enables *conditional* generation in MLFMs, which has received limited attention in prior work on FLMs. Moreover, this additional flexibility allows us to design more advanced sampling schemes for MLFMs, as discussed in Section [4](https://arxiv.org/html/2606.27617#S4).
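The masked velocity of Eq. (10) and one explicit Euler step can be sketched as follows. The helper names and toy numbers are assumptions; the endpoint estimate `z_hat` is taken as given here rather than computed from a trained denoiser.

```python
def mlfm_velocity(z, z_hat, masked_pos, t):
    """Eq. (10): flow velocity on masked positions, zero on clean positions."""
    masked = set(masked_pos)
    return [
        [(zh - zl) / (1.0 - t) for zl, zh in zip(z_l, zh_l)]
        if pos in masked else [0.0] * len(z_l)
        for pos, (z_l, zh_l) in enumerate(zip(z, z_hat))
    ]


def euler_step(z, v, dt):
    """One explicit Euler update z <- z + dt * v."""
    return [[zl + dt * vl for zl, vl in zip(z_l, v_l)] for z_l, v_l in zip(z, v)]


z = [[1.0, 0.0], [0.0, 0.0]]      # position 0 is clean context, position 1 is masked
z_hat = [[9.0, 9.0], [1.0, 2.0]]  # endpoint estimate; ignored at the clean position
v = mlfm_velocity(z, z_hat, masked_pos=[1], t=0.5)
z_next = euler_step(z, v, dt=0.5)
```

The clean position is left untouched regardless of the endpoint estimate there, while the masked position moves toward its estimated clean embedding.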

### MDMs to MLFMs

It is easy to see that we can view the MLFM loss as a continuous extension of the MDM loss: at the endpoint $t=0$, every unresolved position $\ell\in\mathcal{M}_{s}$ is represented exactly by the mask embedding $m$, while every observed position $\ell\in\mathcal{U}_{s}$ is kept at its clean embedding. Thus, $z_{s,0}$ contains the same information as the partially masked sequence $X_{s}$. Consequently, the cross-entropy objective $\mathcal{L}_{\mathrm{MLFM}}$ at $t=0$ has the same minimiser as the MDM objective $\mathcal{L}_{\mathrm{MDM}}$: in both cases, the optimal predictor is the factorised posterior distribution of each masked token given the clean tokens. We formalise this endpoint equivalence in the following proposition.

###### Proposition 3.1 (Mask-endpoint equivalence).

Assume that the unmasked token identities are either provided directly or are recoverable from their embeddings under $E$. At the mask endpoint $t=0$, the MLFM prediction problem coincides with the MDM prediction problem: for any $\ell\in\mathcal{M}_{s}$ and $a\in\mathcal{V}$,

$$\mathbb{P}(X^{\ell}=a\mid z_{s,0})=\mathbb{P}(X^{\ell}=a\mid X_{s}^{\mathcal{U}_{s}},\mathcal{M}_{s})=\mathbb{P}(X^{\ell}=a\mid X_{s}).\tag{11}$$

###### Proof.

See Appendix [E.1](https://arxiv.org/html/2606.27617#A5.SS1). ∎

This observation motivates adapting pretrained MDMs to MLFMs: because a pretrained MDM has already learned the denoising problem at the endpoint $z_{s,0}$, it provides a strong initialisation for MLFM training and avoids the cost of training a capable MLFM from scratch. Concretely, we take the pretrained bidirectional transformer of an MDM and prepare it for MLFM training by modifying three main components of the architecture:

1. i) The MDM token embedding layer $E\in\mathbb{R}^{|\mathcal{V}|\times H}$, which maps tokens into the embedding space, is used as the MLFM embedding matrix and is kept fixed during adaptation, similar to Hu et al. ([2026](https://arxiv.org/html/2606.27617#bib.bib23)).
2. ii) The MDM transformer blocks initialise the MLFM denoising transformer and are frozen. We augment the normalisations in these blocks with AdaLN (Nie et al., [2025a](https://arxiv.org/html/2606.27617#bib.bib30)) to condition on the continuous corruption time $t$, and attach LoRA adapters (Hu et al., [2022](https://arxiv.org/html/2606.27617#bib.bib22)) to the linear layers to adapt the pretrained transformer to work with continuous corrupted embeddings.
3. iii) The MDM head, which maps the final activations to vocabulary logits, is adapted with its own LoRA adapter.

With these architectural changes, we can continue training the pretrained MDM under the MLFM objective $\mathcal{L}_{\mathrm{MLFM}}(\theta)$, thereby adapting it into an MLFM. We refer to this MDM-to-MLFM training procedure as *adaptation* and we provide a summary in Algorithm [3](https://arxiv.org/html/2606.27617#alg3).

### Supervised Fine\-Tuning

In this section, we describe how to perform supervised fine-tuning (SFT) with MLFMs. Let $\mathcal{D}_{\mathrm{FT}}=\{(p_{i},a_{i})\}_{i=1}^{n}$ denote an SFT dataset, where $p_{i}$ is a prompt and $a_{i}$ is the corresponding target answer. After tokenisation, let $X=X(p_{i},a_{i})$ denote the sequence obtained from the concatenation of the prompt and answer in a suitable format and let $\mathcal{P}$ denote the positions of the prompt and $\mathcal{A}$ the positions of the answer in $X$. We fine-tune our MLFM $p_{\theta}$ with the following cross-entropy objective:

$$\mathcal{L}_{\mathrm{FT}}(\theta)=\mathbb{E}_{(p,a)\sim\mathcal{D}_{\mathrm{FT}},\,s\sim\pi_{s},\,t\sim\pi_{t},\,z_{s,t}}\left[\frac{1}{|\mathcal{M}_{s}|}\sum_{\ell\in\mathcal{M}_{s}}-\log p_{\theta}^{\ell}(X^{\ell}\mid z_{s,t},t)\right].\tag{12}$$

For a sequence $X$ constructed from $(p,a)$, we form $z_{s,t}$ as follows. With probability $\alpha$, we mask all answer positions in $\mathcal{A}$, and with probability $1-\alpha$ we instead sample a subset of answer positions by applying $p_{s|0}$ only to $\mathcal{A}$. The prompt positions $\mathcal{P}$ are always kept clean to encourage conditional generation from the prompt. After the masked positions $\mathcal{M}_{s}$ are sampled, we apply the same noising process ([7](https://arxiv.org/html/2606.27617#S3.E7)) at time $t$ to these positions. We provide a summary of our SFT procedure in Algorithm [4](https://arxiv.org/html/2606.27617#alg4).
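The SFT masking rule described above can be sketched as a small sampler over answer positions. The function name and argument layout (a prompt occupying the first `prompt_len` positions, followed by the answer) are assumptions of this illustration, not the paper's Algorithm 4.

```python
import random


def sample_sft_mask(prompt_len, answer_len, s, alpha, rng):
    """Sample the masked positions M_s for one SFT example.

    With probability alpha every answer position is masked; otherwise each
    answer position is masked independently with probability s.
    Prompt positions are never masked.
    """
    answer_positions = range(prompt_len, prompt_len + answer_len)
    if rng.random() < alpha:
        return list(answer_positions)
    return [pos for pos in answer_positions if rng.random() < s]


rng = random.Random(0)
masked_pos = sample_sft_mask(prompt_len=3, answer_len=5, s=0.5, alpha=0.2, rng=rng)
```

The returned positions would then be noised with the Brownian-bridge marginal at time $t$, while all other positions keep their clean embeddings.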

## 4 Sampling from MLFMs

In this section, we design a sampling scheme that exploits the additional flexibility of MLFMs for conditioning and guidance. The main idea behind our sampler is that high-confidence token predictions are more useful as clean context than as noisy latent states. In a flow-only sampler, each generated position remains latent until the final decoding step, even when its posterior distribution has already concentrated on a single token. MLFMs can use such positions more effectively: once a token becomes highly probable, we *promote* it to the observed-token state, so that later denoising steps condition on its clean embedding rather than on a corrupted representation of the same position.
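A minimal sketch of the promotion step, using the $1-\epsilon$ confidence criterion stated in the introduction. The function name, the return layout, and the toy probabilities are assumptions of this example; the paper's sampler may schedule or batch promotions differently.

```python
def promote_confident(z, probs, masked_pos, E, eps):
    """Promote masked positions whose posterior mode has probability >= 1 - eps.

    Promoted positions are overwritten with the clean embedding of the mode
    and removed from the masked set, so later steps treat them as context.
    """
    z = [list(row) for row in z]  # copy so the caller's state is not mutated
    still_masked, promoted = [], {}
    for pos in masked_pos:
        tok = max(range(len(probs[pos])), key=probs[pos].__getitem__)
        if probs[pos][tok] >= 1.0 - eps:
            z[pos] = list(E[tok])
            promoted[pos] = tok
        else:
            still_masked.append(pos)
    return z, still_masked, promoted


E = [[1.0, 0.0], [0.0, 1.0]]
z = [[0.3, 0.2], [0.4, 0.6]]
probs = [[0.99, 0.01], [0.6, 0.4]]
z_new, still_masked, promoted = promote_confident(z, probs, [0, 1], E, eps=0.05)
```

In this toy case position 0 is committed to token 0 and clamped to its embedding, while position 1 stays in the latent state because its mode only has probability 0.6.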

We can naturally combine this strategy with a variant of classifier-free guidance (Ho and Salimans, [2022](https://arxiv.org/html/2606.27617#bib.bib20)). Specifically, we compare two predicted vector fields: a guided field, which conditions on the promoted tokens as clean observed context, and a reference field, which uses the same tokens and positions but corrupts their embeddings to the current time. The difference between these fields isolates the effect of clean context on the model's predicted direction. We then sample from the guided flow while promoting high-confidence tokens online, yielding a single sampling procedure that combines FLM sampling with MDM-style unmasking.

We begin by recalling the DDPM sampler in Section [4.1](https://arxiv.org/html/2606.27617#S4.SS1), then introduce our variant of classifier-free guidance (Section [4.2](https://arxiv.org/html/2606.27617#S4.SS2)) and online promotion strategy (Section [4.3](https://arxiv.org/html/2606.27617#S4.SS3)).

### DDPM Sampler

We first describe the standard DDPM sampler (Ho et al., [2020](https://arxiv.org/html/2606.27617#bib.bib21)) for conditional generation. Let $\mathcal{U}_{0}\subseteq[L]$ be the unmasked context positions and let $\mathcal{M}_{0}=[L]\setminus\mathcal{U}_{0}$ be the masked positions we want to generate. The sampler is initialised at the masked endpoint $z_{0}^{\ell}$:

$$z_{0}^{\ell}=\begin{cases}E_{X^{\ell}},&\ell\in\mathcal{U}_{0},\\ m,&\ell\in\mathcal{M}_{0}.\end{cases}$$

During sampling, the context positions remain fixed, so that $z_{t}^{\ell}=E_{X^{\ell}}$ for $\ell\in\mathcal{U}_{0}$, while the masked positions in $\mathcal{M}_{0}$ are jointly denoised by simulating the SDE

$$dz^{\ell}_{t}=v^{\ell}_{\theta}(z_{s,t},t)\,dt+\sigma\,dW_{t}^{\ell},\quad\ell\in\mathcal{M}_{0}.$$

We discretize this process over a mesh $0=t_{0}<t_{1}<\cdots<t_{N}=1$. Between $t_{i}$ and $t_{i+1}$, the process is approximated using the DDPM transition

$$z_{t_{i+1}}^{\ell}\mid z_{t_{i}}\sim\mathcal{N}\left(\frac{1-t_{i+1}}{1-t_{i}}z_{t_{i}}^{\ell}+\frac{t_{i+1}-t_{i}}{1-t_{i}}\widehat{z}_{1}^{\ell}(z_{t_{i}},t_{i}),\ \sigma^{2}\frac{(t_{i+1}-t_{i})(1-t_{i+1})}{1-t_{i}}I_{H}\right),\tag{13}$$

which samples from the Brownian bridge connecting $z_{t_{i}}$ to the current endpoint estimate $\widehat{z}_{1}(z_{t_{i}},t_{i})$, given by

$$\widehat{z}_{1}(z_{t_{i}},t_{i})=z_{t_{i}}+(1-t_{i})v_{\theta}(z_{t_{i}},t_{i}).$$

Finally, at $t_{N}=1$, each generated position $\ell$ is decoded by sampling $X^{\ell}\sim p_{\theta}^{\ell}(\cdot\mid z_{t_{N}},t_{N})$, yielding the sample $X$.
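The transition in Eq. (13) for a single masked position can be written directly from its mean and variance. The helper name `ddpm_step` and the toy values are assumptions of this sketch; the endpoint estimate is passed in rather than computed from a model.

```python
import math
import random


def ddpm_step(z_l, z_hat_l, t_i, t_next, sigma, rng):
    """Sample z_{t_next} for one masked position from the Brownian bridge
    between z_{t_i} and the endpoint estimate, following Eq. (13)."""
    keep = (1.0 - t_next) / (1.0 - t_i)   # weight on the current state
    pull = (t_next - t_i) / (1.0 - t_i)   # weight on the endpoint estimate
    std = sigma * math.sqrt((t_next - t_i) * (1.0 - t_next) / (1.0 - t_i))
    return [keep * z + pull * zh + std * rng.gauss(0.0, 1.0)
            for z, zh in zip(z_l, z_hat_l)]


rng = random.Random(0)
halfway = ddpm_step([0.0, 0.0], [2.0, 4.0], t_i=0.0, t_next=0.5, sigma=0.0, rng=rng)
final = ddpm_step([0.3, -0.7], [2.0, 4.0], t_i=0.5, t_next=1.0, sigma=1.0, rng=rng)
```

With `sigma=0` the step reduces to linear interpolation toward the endpoint estimate, and a step that ends at `t_next=1` lands on the endpoint estimate because both the weight on the current state and the noise scale vanish.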

### Classifier\-Free Guidance With Corrupted Context

We now describe our variant of classifier-free guidance mentioned earlier. Let $z_t$ be the state at time $t$. For $\ell \in \mathcal{U}_0$, the observed token is represented, as before, by its clean embedding,

$$z_t^\ell = E_{X^\ell}, \qquad \ell \in \mathcal{U}_0.$$

We construct a corrupted-context reference state $z_{t,\mathrm{corr}}$ by applying the forward corruption marginal in ([7](https://arxiv.org/html/2606.27617#S3.E7)) only to context positions, while leaving all other positions unchanged:

$$z_{t,\mathrm{corr}}^\ell = \begin{cases} \widetilde{z}_t^\ell, & \ell \in \mathcal{U}_0, \\ z_t^\ell, & \ell \notin \mathcal{U}_0, \end{cases}$$

where, for $\ell \in \mathcal{U}_0$,

$$\widetilde{z}_t^\ell \mid X^\ell \sim \mathcal{N}\Big((1 - t)\, m + t E_{X^\ell},\ \sigma^2 t (1 - t) I_H\Big).$$

Let $v_\theta$ denote the MLFM velocity defined in ([10](https://arxiv.org/html/2606.27617#S3.E10)). We define context-corrupted guidance with scale $w$ by

$$v_w(z_t, t) = v_\theta(z_t, t) + w \bigl(v_\theta(z_t, t) - v_\theta(z_{t,\mathrm{corr}}, t)\bigr).$$

The difference term captures the effect of clean context on the predicted velocity: both model calls use the same tokens at the same positions, but only $v_\theta(z_t, t)$ observes the context through exact clean embeddings. In effect, this encourages the sampler to follow directions that are specific to the clean-context prediction, rather than directions that persist when the context is corrupted. Note that both $z_t$ and $z_{t,\mathrm{corr}}$ remain on the support of the forward process. We refer to this sampling strategy as *context-corrupted classifier-free guidance* (CCFG) and summarise it in Algorithm [1](https://arxiv.org/html/2606.27617#alg1).

**Algorithm 1** Context Corrupted Classifier-Free Guidance (CCFG)

**Input:** State $z_t$, time $t$, clean token positions $\mathcal{U}_0$, clean tokens $X^{\mathcal{U}_0}$, guidance scale $w$.

1. Initialize $z_{t,\mathrm{corr}} \leftarrow z_t$.
2. **for** $\ell \in \mathcal{U}_0$ **do**
3. Draw $\widetilde{z}_t^\ell \mid X^\ell$ from the corruption marginal in ([7](https://arxiv.org/html/2606.27617#S3.E7)).
4. Set $z_{t,\mathrm{corr}}^\ell \leftarrow \widetilde{z}_t^\ell$.
5. **end for**
6. **return** $v_w = v_\theta(z_t, t) + w \bigl(v_\theta(z_t, t) - v_\theta(z_{t,\mathrm{corr}}, t)\bigr)$.
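The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: `toy_velocity` is a hypothetical stand-in for $v_\theta$, the function and argument names are invented for the example, and states are represented as lists of per-position embedding vectors.

```python
import math
import random


def corrupt_context(z, context, clean_emb, mask_emb, t, sigma, rng):
    """Copy z, then redraw each context row from
    N((1 - t) * m + t * E_x, sigma^2 * t * (1 - t) * I)."""
    z_corr = [list(row) for row in z]
    std = sigma * math.sqrt(t * (1.0 - t))
    for pos in context:
        z_corr[pos] = [(1.0 - t) * m + t * e + std * rng.gauss(0.0, 1.0)
                       for m, e in zip(mask_emb, clean_emb[pos])]
    return z_corr


def ccfg_velocity(velocity_fn, z, t, context, clean_emb, mask_emb, w, sigma, rng):
    """Guided velocity v_w = v(z) + w * (v(z) - v(z_corr))."""
    z_corr = corrupt_context(z, context, clean_emb, mask_emb, t, sigma, rng)
    v_clean = velocity_fn(z, t)
    v_corr = velocity_fn(z_corr, t)
    return [[a + w * (a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(v_clean, v_corr)]


def toy_velocity(z, t):
    """Hypothetical stand-in for v_theta: pull each row toward the sequence mean."""
    dim = len(z[0])
    mean = [sum(row[d] for row in z) / len(z) for d in range(dim)]
    return [[mean[d] - row[d] for d in range(dim)] for row in z]


state = [[1.0, 0.0], [0.0, 0.0], [0.3, -0.1]]   # position 0 is clean context
clean = {0: [1.0, 0.0]}                          # clean embedding per context position
mask = [0.0, 0.0]                                # mask embedding m
guided = ccfg_velocity(toy_velocity, state, 0.5, [0], clean, mask,
                       w=2.0, sigma=0.2, rng=random.Random(0))
```

Each guided step costs two evaluations of the velocity, one on the clean-context state and one on the corrupted-context state. Setting `w=0.0` returns the unguided velocity.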

### Online Token Promotion

**Algorithm 2** CCFG with Online Token Promotion (CCFG w/ OTP)

**Input:** Prompt tokens $X^{\mathcal{U}_0}$, unresolved positions $\mathcal{M}_0$, mesh $0 = t_0 < \cdots < t_N = 1$, guidance scale $w$, tolerance $\varepsilon$.

1. Initialize $z_{t_0}$ with clean embeddings on $\mathcal{U}_0$ and mask embeddings on $\mathcal{M}_0$.
2. **for** $i = 0, \ldots, N - 1$ **do**
3. Set $\widehat{X}_i^\ell = \operatorname*{arg\,max}_{a \in \mathcal{V}} p_\theta^\ell(a \mid z_{t_i}, t_i)$ for $\ell \in \mathcal{M}_i$.
4. Promote $\mathcal{P}_i = \{\ell \in \mathcal{M}_i : p_\theta^\ell(\widehat{X}_i^\ell \mid z_{t_i}, t_i) \geq 1 - \varepsilon\}$.
5. For $\ell \in \mathcal{P}_i$, set $X^\ell = \widehat{X}_i^\ell$ and $z_{t_i}^\ell = E_{X^\ell}$.
6. Set $\mathcal{U}_{i+1} = \mathcal{U}_i \cup \mathcal{P}_i$ and $\mathcal{M}_{i+1} = \mathcal{M}_i \setminus \mathcal{P}_i$.
7. Compute guided velocity $v_w$ using [Algorithm 1](https://arxiv.org/html/2606.27617#alg1).
8. Make one DDPM step ([13](https://arxiv.org/html/2606.27617#S4.E13)) on $\mathcal{M}_{i+1}$ using $v_w$ to obtain $z_{t_{i+1}}$.
9. **end for**
10. Decode any remaining unresolved positions and **return** $X$.

Both the standard DDPM sampler and CCFG treat the prompt as clean context but keep every generated position latent until the final decoding step. This can be inefficient as different positions resolve at different times: indeed, positions adjacent to the prompt, deterministic formatting tokens, and padding tokens in short answers often have sharply peaked posteriors well before $t = 1$. If such positions remain latent, however, later model calls still observe them through corrupted representations rather than through the clean embeddings of the predicted tokens.

We therefore *promote* high-confidence positions online, fixing a token to its clean embedding as soon as the model is sufficiently confident in it. Formally, let $\mathcal{U}_i$ and $\mathcal{M}_i = [L] \setminus \mathcal{U}_i$ denote the sets of unmasked (context) positions and masked positions at step $i$, respectively. For each masked position, the posterior $p_\theta(\cdot \mid z_{t_i}, t_i)$ gives the predicted token

$$\widehat{X}_i^\ell = \operatorname*{arg\,max}_{a \in \mathcal{V}} p_\theta^\ell(a \mid z_{t_i}, t_i),$$

and, given a tolerance $\varepsilon > 0$, we promote the positions

$$\mathcal{P}_i = \left\{\ell \in \mathcal{M}_i : p_\theta^\ell(\widehat{X}_i^\ell \mid z_{t_i}, t_i) \geq 1 - \varepsilon\right\}.$$

For each $\ell \in \mathcal{P}_i$ we set $X^\ell = \widehat{X}_i^\ell$ and fix $z_{t_i}^\ell = E_{X^\ell}$. We then apply the DDPM transition ([13](https://arxiv.org/html/2606.27617#S4.E13)) only to the still-masked positions, holding all positions in $\mathcal{U}_i \cup \mathcal{P}_i$ fixed, and update the sets $\mathcal{U}_i$ and $\mathcal{M}_i$:

$$\mathcal{U}_{i+1} = \mathcal{U}_i \cup \mathcal{P}_i, \qquad \mathcal{M}_{i+1} = \mathcal{M}_i \setminus \mathcal{P}_i.$$

The sampler terminates once $t_N = 1$ is reached or all positions have been promoted. We call this sampling strategy *online token promotion* (OTP) and in practice combine it with CCFG. Algorithm [2](https://arxiv.org/html/2606.27617#alg2) summarises our full sampling procedure. Additionally, we note that this sampling procedure can be further accelerated with the use of flow map distillation, as promotion can be done at any time $t$.
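The promotion rule and the set update can be sketched as follows. This is an illustrative sketch only: the posteriors are hard-coded lists over a three-token vocabulary rather than network outputs, and the helper names `promote` and `update_sets` are assumptions made for the example.

```python
def promote(posteriors, masked, eps):
    """Return {position: token} for masked positions whose posterior mode
    has probability at least 1 - eps."""
    promoted = {}
    for pos in sorted(masked):
        probs = posteriors[pos]
        token = max(range(len(probs)), key=lambda a: probs[a])  # posterior mode
        if probs[token] >= 1.0 - eps:
            promoted[pos] = token
    return promoted


def update_sets(unmasked, masked, promoted):
    """U_{i+1} = U_i | P_i and M_{i+1} = M_i - P_i."""
    newly_fixed = set(promoted)
    return unmasked | newly_fixed, masked - newly_fixed


# Hypothetical per-position posteriors over a 3-token vocabulary.
posteriors = {
    2: [0.97, 0.02, 0.01],   # confident: promoted
    3: [0.60, 0.30, 0.10],   # uncertain: stays masked
    4: [0.01, 0.98, 0.01],   # confident: promoted
}
unmasked, masked = {0, 1}, {2, 3, 4}
promoted = promote(posteriors, masked, eps=0.05)
unmasked, masked = update_sets(unmasked, masked, promoted)
```

After this step, positions 2 and 4 would be fixed to the clean embeddings of their predicted tokens, and only position 3 would receive the next DDPM transition.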

#### Error from Online Token Promotion

We note that OTP can introduce errors into the sampling process by promoting tokens too early\. Indeed, even when the posterior mode has high probability, it may still be incorrect, causing the sampler to fix the position to the clean embedding of an incorrect token before terminal time\. The following result upper bounds the error caused by such early promotions\.

###### Proposition 4.1 (Promotion Error).

Let $p$ be the target distribution on $\mathcal{V}^L$, and assume that each denoising step samples exactly from $p(z_{t_{i+1}} \mid z_{t_i})$. Let $\widetilde{p}$ be the output law of the corresponding sampler that uses the promotion rule above with the true posteriors under $p$, and then continues with the same exact denoising dynamics conditioned on promoted values. Then

$$\operatorname{TV}(p, \widetilde{p}) \leq \varepsilon L.$$

###### Proof\.

See Appendix[E\.2](https://arxiv.org/html/2606.27617#A5.SS2)\. ∎

The above result shows that the probability of making at least one incorrect promotion over the whole sampling run is bounded by $\varepsilon L$. Note that this bound does not depend on the number of discretization steps, and can be made arbitrarily small by taking $\varepsilon$ small enough.
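To make the scale of the bound concrete, the short sketch below evaluates $\varepsilon L$ for a few settings. The helper names are hypothetical, and capping the bound at 1 uses only the fact that total variation distance never exceeds 1.

```python
def promotion_tv_bound(eps, seq_len):
    """Upper bound eps * L on TV(p, p_tilde), capped at the trivial bound 1."""
    return min(1.0, eps * seq_len)


def tolerance_for(target_tv, seq_len):
    """Largest eps for which the bound eps * L does not exceed target_tv."""
    return target_tv / seq_len


bound_loose = promotion_tv_bound(0.05, 512)    # eps * L = 25.6, so the cap applies
bound_tight = promotion_tv_bound(1e-4, 512)    # eps * L = 0.0512
eps_needed = tolerance_for(0.05, 512)          # roughly 9.8e-5
```

The bound is a worst-case guarantee: it is informative only when $\varepsilon < 1/L$, so for a sequence length of 512 a tolerance of order $10^{-4}$ is needed before it says anything beyond $\operatorname{TV} \leq 1$.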

As a result, we can view Online Token Promotion as a signal amplifier, similar to self-conditioning (Chen et al., [2023](https://arxiv.org/html/2606.27617#bib.bib9)). We note that this is not available to previous work in hybrid discrete-continuous approaches to language modelling (Pynadath et al., [2025](https://arxiv.org/html/2606.27617#bib.bib36); Zheng et al., [2025](https://arxiv.org/html/2606.27617#bib.bib46); Zhou et al., [2026](https://arxiv.org/html/2606.27617#bib.bib49)).

## 5 Experiments

Most prior work on FLMs has evaluated unconditional generation, typically using metrics such as generative perplexity and entropy. While these metrics aim to measure distributional modelling quality, they are unreliable indicators of actual performance (Franca and Tong, [2026](https://arxiv.org/html/2606.27617#bib.bib18)) and they do not establish whether flow-based language models can serve as useful conditional generators in practical settings. We therefore evaluate MLFM on more demanding downstream tasks that require mathematical reasoning and instruction following. Concretely, we adapt the pretrained SMDM model of Nie et al. ([2025a](https://arxiv.org/html/2606.27617#bib.bib30)) into an MLFM, following Section [3.1](https://arxiv.org/html/2606.27617#S3.SS1), and evaluate it on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.27617#bib.bib12)) and MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2606.27617#bib.bib48)).

Below, we first describe our experimental setup in Section [5.1](https://arxiv.org/html/2606.27617#S5.SS1), then present our main results and ablations in Sections [5.2](https://arxiv.org/html/2606.27617#S5.SS2) and [5.3](https://arxiv.org/html/2606.27617#S5.SS3), respectively.

### Experimental Setup

#### Adaptation Setup

For all of our experiments, we initialise our model from a pretrained SMDM with 1028M parameters (more specifically, we use the official SMDM 1028M checkpoint, `mdm-1028M-3300e18-rsl-0.01-bs-1024.safetensors`, from the [https://huggingface.co/nieshen/SMDM](https://huggingface.co/nieshen/SMDM) Hugging Face repository). The pretrained backbone and input embedding matrix are kept frozen throughout all training stages. We train only a small set of adapters: LoRA adapters on attention and MLP modules, an output-head LoRA adapter, and AdaLN adapters for time conditioning. The backbone LoRA rank is 256 with $\alpha = 512$ and dropout 0.05; the output-head LoRA rank is 256 with $\alpha = 256$ and no dropout. The MLP that produces the AdaLN time-conditioning parameters has hidden dimension 512. The total number of trainable parameters in our model is 319M. For further details, see Appendix [B](https://arxiv.org/html/2606.27617#A2).

We note that aside from Davis et al. ([2026](https://arxiv.org/html/2606.27617#bib.bib13)), who train a 1.7B-parameter categorical flow model at trillion-token scale, MLFM is, to the best of our knowledge, the first flow-language model at billion-parameter scale.

#### Adaptation Training

We train our model on SlimPajama (Soboleva et al., [2023](https://arxiv.org/html/2606.27617#bib.bib40)), tokenized with the LLaMA-2 tokenizer (Touvron et al., [2023](https://arxiv.org/html/2606.27617#bib.bib43)) using sequences of length 1024. We use a training budget of $\approx$100B processed sequence positions; this corresponds to 200k optimiser updates with an effective batch size of 512 sequences, obtained by accumulating two global batches of 256 sequences. We use the AdamW optimiser (Loshchilov and Hutter, [2017](https://arxiv.org/html/2606.27617#bib.bib28)) with 3k warmup steps followed by cosine decay to 10% of the peak learning rate. The learning rate is $10^{-4}$ for the backbone LoRA, output-head LoRA, and AdaLN parameters. We apply weight decay of 0.01 and clip the global gradient norm at 1.0. We also maintain an exponential moving average (EMA) of the adapter weights with decay 0.999, and use these EMA weights for validation.
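The budget arithmetic and the learning-rate schedule described above can be checked with a short sketch. The shape of the warmup is not stated in the text, so the linear ramp below is an assumption, as are the function names.

```python
import math


def processed_positions(updates, batch_size, seq_len):
    """Total sequence positions seen: updates x sequences per update x length."""
    return updates * batch_size * seq_len


def learning_rate(step, peak_lr=1e-4, warmup=3_000, total=200_000, floor_frac=0.1):
    """Warmup (assumed linear) followed by cosine decay to floor_frac * peak_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (floor_frac + (1.0 - floor_frac) * cosine)


budget = processed_positions(updates=200_000, batch_size=512, seq_len=1024)
lr_peak = learning_rate(3_000)      # first step after warmup
lr_final = learning_rate(200_000)   # end of training
```

Here `budget` evaluates to 104,857,600,000 positions, consistent with the stated budget of roughly 100B.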

Moreover, following Chen et al. ([2026](https://arxiv.org/html/2606.27617#bib.bib10)), we sample the time $t$ in terms of the log noise-to-signal ratio (NSR) $\gamma$ rather than sampling $t$ directly. We sample $\gamma$ from a three-component mixture consisting of uniform, normal, and fitted generalised-logistic components, with mixture weights $0.1/0.2/0.7$ respectively (see Appendix [A](https://arxiv.org/html/2606.27617#A1) for further details). For our distribution $\pi_s$ over masking probabilities $s$, we use the MaskGIT cosine schedule (Chang et al., [2022](https://arxiv.org/html/2606.27617#bib.bib8)) clipped to $[0.05, 1.0]$. For our bridge ([7](https://arxiv.org/html/2606.27617#S3.E7)), we set $\sigma = 0.2$. In addition, we use an auxiliary embedding loss with weight $10$: from the predicted token distribution, we form the corresponding posterior-mean embedding and penalise its distance to the clean token embedding. Our adaptation stage took approximately three days on 16 NVIDIA GH200 GPUs.
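The auxiliary embedding loss can be illustrated with a small sketch. The text does not specify the distance, so squared Euclidean distance is assumed here, and the function names and two-token embedding table are hypothetical.

```python
def posterior_mean_embedding(probs, embedding_table):
    """Probability-weighted average of token embeddings: sum_a p(a) * E_a."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]


def embedding_loss(probs, embedding_table, target_token, weight=10.0):
    """Weighted squared distance (an assumption) between the posterior-mean
    embedding and the clean embedding of the target token."""
    mean = posterior_mean_embedding(probs, embedding_table)
    target = embedding_table[target_token]
    return weight * sum((m - e) ** 2 for m, e in zip(mean, target))


table = [[1.0, 0.0], [0.0, 1.0]]                    # toy embedding matrix E
loss_sure = embedding_loss([1.0, 0.0], table, 0)    # all mass on the right token
loss_unsure = embedding_loss([0.5, 0.5], table, 0)  # mass split across tokens
```

The loss is zero when the predicted distribution puts all its mass on the clean token and grows as mass moves to tokens whose embeddings lie far from it.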

#### Supervised Fine\-Tuning

After the adaptation stage, we further train the adapted MLFM model on supervised fine-tuning data consisting of prompt-response pairs. The data mixture includes general instruction data from first-turn ShareGPT examples (the dataset can be accessed from [https://sharegpt.com/](https://sharegpt.com/)), mathematical reasoning data from NuminaMath-CoT (LI et al., [2024](https://arxiv.org/html/2606.27617#bib.bib26)), GSM8K-Aug-NL (Cobbe et al., [2021](https://arxiv.org/html/2606.27617#bib.bib12); Deng et al., [2023](https://arxiv.org/html/2606.27617#bib.bib15)), and MetaMathQA (Yu et al., [2024](https://arxiv.org/html/2606.27617#bib.bib45)), and code data from OpenCodeInstruct (Ahmad et al., [2025](https://arxiv.org/html/2606.27617#bib.bib2)). We sample these three groups with weights 0.2, 0.6, and 0.2, respectively.

As discussed in Section [3.3](https://arxiv.org/html/2606.27617#S3.SS3), prompt tokens are always kept fixed during supervised fine-tuning and the MLFM objective is applied only to response tokens. With probability 0.5, all response tokens are masked; otherwise, the response mask ratio is sampled using the same MaskGIT cosine schedule as in Section [5.1.2](https://arxiv.org/html/2606.27617#S5.SS1.SSS2). We use an SFT budget of $\approx$15B processed sequence positions. We use AdamW for 50k optimiser updates, using an effective batch size of 512 sequences obtained by accumulating two global batches of size 256. We use 3k warmup steps followed by cosine learning-rate decay, with peak learning rate $5 \times 10^{-5}$ for the LoRA, output-head LoRA, and AdaLN parameters. Our SFT took $\approx$18 hours on 16 NVIDIA GH200 GPUs.
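The response-masking scheme used during SFT can be sketched as below. Two details are assumptions, not statements from the paper: the cosine schedule is taken to be $\cos(\pi r / 2)$ with $r$ uniform on $[0, 1)$, and each response token is masked independently with the sampled ratio. All names are illustrative.

```python
import math
import random


def cosine_mask_ratio(rng, low=0.05, high=1.0):
    """Cosine-schedule mask ratio cos(pi * r / 2), r ~ U[0, 1), clipped to [low, high]."""
    r = rng.random()
    return min(high, max(low, math.cos(0.5 * math.pi * r)))


def sample_response_mask(is_response, rng, full_mask_prob=0.5):
    """Boolean mask over positions; prompt positions are never masked."""
    if rng.random() < full_mask_prob:
        ratio = 1.0                       # mask the entire response
    else:
        ratio = cosine_mask_ratio(rng)    # partially mask the response
    return [resp and rng.random() < ratio for resp in is_response]


# Two prompt tokens followed by three response tokens.
flags = [False, False, True, True, True]
example_mask = sample_response_mask(flags, random.Random(0))
```

Masking the full response half of the time matches the inference setting, where generation starts from a fully masked response conditioned on a clean prompt.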

#### Datasets

We evaluate our SFT model on two conditional generation benchmarks: GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.27617#bib.bib12)) and MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2606.27617#bib.bib48)). GSM8K is a benchmark of grade-school mathematical word problems, where each example consists of a natural-language question with a numerical answer. We use GSM8K to evaluate mathematical reasoning, and measure performance using exact-match accuracy after extracting the final numerical answer from the model output.
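Exact-match scoring of this kind can be sketched as follows. The extraction rule is not specified in the text, so taking the last number that appears in the output is an assumption, as are the regular expression and function names.

```python
import re

# A number with optional sign, thousands separators, and decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in the text as a string without commas, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(output, gold):
    """Exact match on numeric value between the extracted and reference answers."""
    pred = extract_final_number(output)
    return pred is not None and float(pred) == float(gold.replace(",", ""))


def accuracy(outputs, golds):
    """Exact-match accuracy in percent."""
    hits = sum(is_correct(o, g) for o, g in zip(outputs, golds))
    return 100.0 * hits / len(golds)


outputs = ["48 / 2 = 24, so 48 + 24 = 72. The answer is 72.", "She pays 10 dollars."]
golds = ["72", "11"]
score = accuracy(outputs, golds)
```

Comparing numeric values rather than raw strings treats answers such as `18.5` and `18.50` as equal.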

MT-Bench is an open-ended instruction-following benchmark consisting of multi-turn user queries spanning diverse categories (Zheng et al., [2023](https://arxiv.org/html/2606.27617#bib.bib48)). Since open-ended responses have no single reference answer, exact-match scoring does not apply; following standard practice, model responses are evaluated by a strong language-model judge, which assigns a scalar score reflecting the quality of the answer. In our experiments, we report the first-turn MT-Bench score with GPT-4o (Achiam et al., [2023](https://arxiv.org/html/2606.27617#bib.bib1)) as our judge.

For both datasets, we prompt the model using the Vicuna prompt template (Chiang et al., [2023](https://arxiv.org/html/2606.27617#bib.bib11)).

#### Baselines

We compare our approach with the 1.1B MDM model of Nie et al. ([2025a](https://arxiv.org/html/2606.27617#bib.bib30)), which we refer to as SMDM, as well as the AR models considered in their work: LLaMA-2 (Touvron et al., [2023](https://arxiv.org/html/2606.27617#bib.bib43)) for GSM8K, and their similarly sized AR model for MT-Bench.

For inference, we follow Algorithm [2](https://arxiv.org/html/2606.27617#alg2) with $\varepsilon = 0.05$. We use a maximum sequence length of 512 tokens and 256 sampling steps for GSM8K, and 1024 tokens and 128 sampling steps for MT-Bench. Similar to Nie et al. ([2025a](https://arxiv.org/html/2606.27617#bib.bib30)), we report the results with the best guidance scale $w \in \{0, 2, 4, 6\}$ and ablate over different $w$ in Section [5.3](https://arxiv.org/html/2606.27617#S5.SS3).

### Main Results

Table [1](https://arxiv.org/html/2606.27617#S5.T1) shows that MLFM improves substantially over both baselines on MT-Bench, achieving a first-turn score of 2.27 compared with 1.60 for SMDM and 1.57 for the similarly sized AR baseline. Notably, this gain is obtained with 128 sampling steps, half of the 256 steps used by SMDM. On GSM8K, however, MLFM remains significantly behind both LLaMA-2 and SMDM, obtaining 31.24% accuracy compared with 58.6% and 58.5%, respectively. One plausible reason is the difference in fine-tuning protocol: the SMDM GSM8K result is obtained after task-specific fine-tuning on augmented GSM8K data for 40 epochs, whereas our MLFM is fine-tuned on a broader instruction mixture. Nevertheless, these results are encouraging: to the best of our knowledge, they provide the first evidence that flow-based language models can be scaled beyond unconditional generation to downstream reasoning and instruction-following tasks. The qualitative examples in Figures [2](https://arxiv.org/html/2606.27617#A4.F2) and [3](https://arxiv.org/html/2606.27617#A4.F3) provide further evidence that MLFM can produce coherent, multi-step responses.

Table 1: Main results on GSM8K and MT-Bench comparing MLFM with SMDM and the AR and LLaMA-2 baselines of Nie et al. ([2025a](https://arxiv.org/html/2606.27617#bib.bib30)). Note that SMDM uses 256 sampling steps for both datasets.

| Approach | GSM8K (accuracy % ↑) | MT-Bench (first-turn score ↑) |
|---|---|---|
| LLaMA-2 (Touvron et al., [2023](https://arxiv.org/html/2606.27617#bib.bib43)) | 58.6 | – |
| AR baseline (Nie et al., [2025a](https://arxiv.org/html/2606.27617#bib.bib30)) | – | 1.57 |
| SMDM (Nie et al., [2025a](https://arxiv.org/html/2606.27617#bib.bib30)) | 58.5 | 1.60 |
| MLFM | 31.24 | 2.27 |

### Ablations

MT–Bench

![Refer to caption](https://arxiv.org/html/2606.27617v1/x1.png)
![Refer to caption](https://arxiv.org/html/2606.27617v1/x2.png)
GSM8K

![Refer to caption](https://arxiv.org/html/2606.27617v1/x3.png)
![Refer to caption](https://arxiv.org/html/2606.27617v1/x4.png)

Figure 1: MT-Bench and GSM8K results across different guidance scales (left) and sampler steps (right). The plots on the right use the optimal guidance scales given by the corresponding plots on the left.

Here, we study the effect of different guidance scales $w$, different numbers of sampling steps, and different sampling strategies.

##### Different sampling strategies\.

Table [2](https://arxiv.org/html/2606.27617#S5.T2) shows the GSM8K and MT-Bench results across three sampling settings: no guidance (the standard DDPM sampler from Section [4.1](https://arxiv.org/html/2606.27617#S4.SS1)), CCFG (Algorithm [1](https://arxiv.org/html/2606.27617#alg1)), and CCFG with Online Token Promotion (Algorithm [2](https://arxiv.org/html/2606.27617#alg2)). On both datasets, CCFG and online token promotion each boost performance significantly: GSM8K accuracy rises from 13.19% without guidance to 21.38% with CCFG and 31.24% with CCFG and OTP, while the MT-Bench first-turn score rises from 1.22 to 1.85 to 2.27. This is not surprising: OTP commits high-confidence posterior modes as clean observed tokens, giving later denoising steps more reliable context than corrupted continuous states.

Table 2: Results for GSM8K and MT-Bench for different sampling strategies.

| Sampling | GSM8K (accuracy % ↑) | MT-Bench (first-turn score ↑) |
|---|---|---|
| No guidance | 13.19 | 1.22 |
| CCFG (Algorithm [1](https://arxiv.org/html/2606.27617#alg1)) | 21.38 | 1.85 |
| CCFG w/ OTP (Algorithm [2](https://arxiv.org/html/2606.27617#alg2)) | 31.24 | 2.27 |

##### Different guidance scales\.

Figure [1](https://arxiv.org/html/2606.27617#S5.F1) shows the GSM8K and MT-Bench results for different guidance scales. In general, performance improves as the guidance scale increases. This too is not surprising, as stronger guidance makes the sampler rely more heavily on the clean observed context when resolving the remaining tokens.

##### Different numbers of sampling steps\.

Similarly, Figure [1](https://arxiv.org/html/2606.27617#S5.F1) shows the GSM8K and MT-Bench results for different numbers of sampling steps. In general, performance improves as the number of sampling steps increases. Moreover, we note that MLFM still outperforms the SMDM and AR baselines on MT-Bench even at 16 sampling steps.

## 6 Conclusion

In this work, we introduced Masked Language Flow Models, which integrate masking from Masked Diffusion Models into Flow Language Models via a Brownian bridge connecting partially masked sequences with clean sequences. This enables exact, any-position conditional generation, allowing MLFMs to anchor continuous generation on partially masked sequences. Paired with our novel sampler, this facilitates complex, multi-step reasoning. Additionally, MLFMs support efficient training through a lightweight adaptation of pretrained MDMs. Directions for future work include scaling MLFMs further and distilling such models.

## Acknowledgments

IA, KA and LZ would like to thank Jinwoo Kim and Pete Patterson for helpful conversations\.

IA is supported by the Engineering and Physical Sciences Research Council \[grant number EP/T517811/1\]\. LZ and KA are supported by the EPSRC CDT in Modern Statistics and Statistical Machine Learning \(EP/S023151/1\)\. This work was supported by the UKRI AI Research Resource \(AIRR\) through Isambard\-AI \(project AIRR\-GW \- Diffusion Models for Language Modelling\) and by an Amazon Research Award awarded to Patrick Rebeschini \(Fall 2024\)\. SV and PR are funded by UK Research and Innovation \(UKRI\) under the UK government’s Horizon Europe funding guarantee \[grant number EP/Y028333/1\]\.

## References

- Achiam et al\. \(2023\)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al\.Gpt\-4 technical report\.*arXiv preprint arXiv:2303\.08774*, 2023\.
- Ahmad et al\. \(2025\)Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg\.Opencodeinstruct: A large\-scale instruction tuning dataset for code llms\.*arXiv preprint arXiv:2504\.04030*, 2025\.
- Albergo et al\. \(2025\)Michael Albergo, Nicholas M Boffi, and Eric Vanden\-Eijnden\.Stochastic interpolants: A unifying framework for flows and diffusions\.*Journal of Machine Learning Research*, 26\(209\):1–80, 2025\.
- Austin et al\. \(2021\)Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg\.Structured denoising diffusion models in discrete state\-spaces\.*Advances in neural information processing systems*, 34:17981–17993, 2021\.
- Boffi et al\. \(2026\)Nicholas Boffi, Michael Albergo, and Eric Vanden\-Eijnden\.How to build a consistency model: Learning flow maps via self\-distillation\.*Advances in Neural Information Processing Systems*, 38:33346–33382, 2026\.
- Brown et al\. \(2020\)Tom B\. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert\-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M\. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei\.Language models are few\-shot learners, 2020\.URL[https://arxiv\.org/abs/2005\.14165](https://arxiv.org/abs/2005.14165)\.
- Campbell et al\. \(2022\)Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet\.A continuous time framework for discrete denoising models\.*Advances in Neural Information Processing Systems*, 35:28266–28279, 2022\.
- Chang et al\. \(2022\)Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman\.Maskgit: Masked generative image transformer\.In*Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 11315–11325, 2022\.
- Chen et al\. \(2023\)Ting Chen, Ruixiang Zhang, and Geoffrey Hinton\.Analog bits: Generating discrete data using diffusion models with self\-conditioning, 2023\.URL[https://arxiv\.org/abs/2208\.04202](https://arxiv.org/abs/2208.04202)\.
- Chen et al\. \(2026\)Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu\.Langflow: Continuous diffusion rivals discrete in language modeling\.*arXiv preprint arXiv:2604\.11748*, 2026\.
- Chiang et al\. \(2023\)Wei\-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E\. Gonzalez, Ion Stoica, and Eric P\. Xing\.Vicuna: An open\-source chatbot impressing gpt\-4 with 90%\* chatgpt quality, March 2023\.URL[https://lmsys\.org/blog/2023\-03\-30\-vicuna/](https://lmsys.org/blog/2023-03-30-vicuna/)\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\.
- Davis et al\. \(2026\)Oscar Davis, Anastasiia Filippova, Pierre Ablin, Victor Turrisi, Amitis Shidani, Marco Cuturi, and Louis Béthune\.Scaling categorical flow maps, 2026\.URL[https://arxiv\.org/abs/2605\.07820](https://arxiv.org/abs/2605.07820)\.
- Del Moral and Penev \(2017\)Pierre Del Moral and Spiridon Penev\.*Stochastic processes: From applications to theory*\.Chapman and Hall/CRC, 2017\.
- Deng et al\. \(2023\)Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber\.Implicit chain of thought reasoning via knowledge distillation\.*arXiv preprint arXiv:2311\.01460*, 2023\.
- Deschenaux and Gulcehre \(2024\)Justin Deschenaux and Caglar Gulcehre\.Beyond autoregression: Fast llms via self\-distillation through time\.*arXiv preprint arXiv:2410\.21035*, 2024\.
- Dieleman \(2023\)Sander Dieleman\.Diffusion language models\.[https://benanne\.github\.io/2023/01/09/diffusion\-language\.html](https://benanne.github.io/2023/01/09/diffusion-language.html), 2023\.Accessed: 2026\-01\-25\.
- Franca and Tong \(2026\)Antonio Franca and Alexander Tong\.Hacking generative perplexity: Why unconditional text evaluation needs distributional metrics\.*arXiv preprint arXiv:2606\.08417*, 2026\.
- Ghazvininejad et al\. \(2019\)Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer\.Mask\-predict: Parallel decoding of conditional masked language models\.In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)*, pages 6112–6121, Hong Kong, China, November 2019\. Association for Computational Linguistics\.doi:10\.18653/v1/D19\-1633\.URL[https://aclanthology\.org/D19\-1633/](https://aclanthology.org/D19-1633/)\.
- Ho and Salimans \(2022\)Jonathan Ho and Tim Salimans\.Classifier\-free diffusion guidance\.*arXiv preprint arXiv:2207\.12598*, 2022\.
- Ho et al\. \(2020\)Jonathan Ho, Ajay Jain, and Pieter Abbeel\.Denoising diffusion probabilistic models\.*Advances in neural information processing systems*, 33:6840–6851, 2020\.
- Hu et al\. \(2022\)Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al\.Lora: Low\-rank adaptation of large language models\.In*International Conference on Learning Representations*, 2022\.
- Hu et al\. \(2026\)Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, and Kaiming He\.Elf: Embedded language flows, 2026\.URL[https://arxiv\.org/abs/2605\.10938](https://arxiv.org/abs/2605.10938)\.
- Kingma et al\. \(2021\)Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho\.Variational diffusion models\.*Advances in neural information processing systems*, 34:21696–21707, 2021\.
- Lee et al\. \(2026\)Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M\. Boffi, and Jinwoo Kim\.Flow map language models: One\-step language modeling via continuous denoising, 2026\.URL[https://arxiv\.org/abs/2602\.16813](https://arxiv.org/abs/2602.16813)\.
- LI et al\. \(2024\)Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu\.Numinamath\.[https://huggingface\.co/AI\-MO/NuminaMath\-CoT](https://huggingface.co/AI-MO/NuminaMath-CoT), report at [https://github\.com/project\-numina/aimo\-progress\-prize/blob/main/report/numina\_dataset\.pdf](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf), 2024\.
- Lipman et al\. \(2023\)Yaron Lipman, Ricky T\. Q\. Chen, Heli Ben\-Hamu, Maximilian Nickel, and Matthew Le\.Flow matching for generative modeling\.In*The Eleventh International Conference on Learning Representations*, 2023\.URL[https://openreview\.net/forum?id=PqvMRDCJT9t](https://openreview.net/forum?id=PqvMRDCJT9t)\.
- Loshchilov and Hutter \(2017\)Ilya Loshchilov and Frank Hutter\.Decoupled weight decay regularization\.*arXiv preprint arXiv:1711\.05101*, 2017\.
- Lou et al\. \(2024\)Alex Lou, Chenlin Meng, and Stefano Ermon\.Discrete diffusion modeling by estimating the ratios of the data distribution\.*arXiv preprint arXiv:2310\.16834*, 2024\.
- Nie et al\. \(2025a\)Shengqi Nie, Fenglin Zhu, Chengpeng Du, Tianyu Pang, Qi Liu, Gang Zeng, Min Lin, and Chenguang Li\.Scaling up masked diffusion models on text\.*arXiv preprint arXiv:2410\.18514*, 2025a\.
- Nie et al\. \(2025b\)Shengqi Nie, Fenglin Zhu, Zhen You, Xin Zhang, Jing Ou, Jing Hu, Jun Zhou, Yichang Lin, Ji\-Rong Wen, and Chenguang Li\.Large language diffusion models\.*arXiv preprint arXiv:2502\.09992*, 2025b\.
- Nye et al\. \(2021\)Maxwell Nye, Anders Johan Andreassen, Guy Gur\-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al\.Show your work: Scratchpads for intermediate computation with language models\.2021\.
- Ou et al\. \(2025\)Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li\.Your absorbing discrete diffusion secretly models the conditional distributions of clean data\.In*International Conference on Learning Representations*, volume 2025, pages 64972–65009, 2025\.
- Peebles and Xie \(2023\)William Peebles and Saining Xie\.Scalable diffusion models with transformers\.In*Proceedings of the IEEE/CVF international conference on computer vision*, pages 4195–4205, 2023\.
- Potaptchik et al\. \(2026\)Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden\-Eijnden, and Michael S\. Albergo\.Discrete flow maps, 2026\.URL[https://arxiv\.org/abs/2604\.09784](https://arxiv.org/abs/2604.09784)\.
- Pynadath et al\. \(2025\)Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang\.Candi: Hybrid discrete\-continuous diffusion models\.*arXiv preprint arXiv:2510\.22510*, 2025\.
- Roos et al\. \(2026\)Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan\-Willem van de Meent\.Categorical flow maps, 2026\.URL[https://arxiv\.org/abs/2602\.12233](https://arxiv.org/abs/2602.12233)\.
- Sahoo et al\. \(2024\)Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov\.Simple and effective masked diffusion language models\.*Advances in Neural Information Processing Systems*, 37:130136–130184, 2024\.
- Shi et al\. \(2024\)Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias\.Simplified and generalized masked diffusion for discrete data\.*Advances in neural information processing systems*, 37:103131–103167, 2024\.
- Soboleva et al\. \(2023\)Daria Soboleva, Faisal Al\-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey\.SlimPajama: A 627B token cleaned and deduplicated version of RedPajama\.[https://cerebras\.ai/blog/slimpajama\-a\-627b\-token\-cleaned\-and\-deduplicated\-version\-of\-redpajama](https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama), 2023\.URL[https://huggingface\.co/datasets/cerebras/SlimPajama\-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B)\.
- Song et al\. \(2021\)Yang Song, Jascha Sohl\-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole\.Score\-based generative modeling through stochastic differential equations\.In*International Conference on Learning Representations*, 2021\.URL[https://openreview\.net/forum?id=PxTIG12RRHS](https://openreview.net/forum?id=PxTIG12RRHS)\.
- Song et al\. \(2023\)Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever\.Consistency models\.In*International Conference on Machine Learning*, pages 32211–32252\. PMLR, 2023\.
- Touvron et al\. \(2023\)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al\.Llama 2: Open foundation and fine\-tuned chat models\.*arXiv preprint arXiv:2307\.09288*, 2023\.
- Wei et al\. \(2022\)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*, 35:24824–24837, 2022\.
- Yu et al\. \(2024\)Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu\.Metamath: Bootstrap your own mathematical questions for large language models\.In*International Conference on Learning Representations*, volume 2024, pages 45040–45061, 2024\.
- Zheng et al\. \(2025\)Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang\.Continuously augmented discrete diffusion model for categorical generative modeling\.*arXiv preprint arXiv:2510\.01329*, 2025\.
- Zheng et al\. \(2024\)Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming\-Yu Liu, Jun Zhu, and Qinsheng Zhang\.Masked diffusion models are secretly time\-agnostic masked models and exploit inaccurate categorical sampling\.*arXiv preprint arXiv:2409\.02908*, 2024\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.*Advances in neural information processing systems*, 36:46595–46623, 2023\.
- Zhou et al\. \(2026\)Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, and Dinghuai Zhang\.Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner, 2026\.URL[https://arxiv\.org/abs/2510\.03206](https://arxiv.org/abs/2510.03206)\.


## Appendix A Gamma Schedule and Fitting

To sample the bridge time during training, we use an approach similar in spirit to LangFlow (Chen et al., [2026](https://arxiv.org/html/2606.27617#bib.bib10)). Rather than sampling $t$ directly, we sample the log noise-to-signal ratio $\gamma$ induced by the Brownian bridge in ([7](https://arxiv.org/html/2606.27617#S3.E7)):

$$\gamma=\log\left(\sigma^{2}\frac{1-t}{t}\right).$$

The inverse map is

$$t=\frac{\sigma^{2}}{e^{\gamma}+\sigma^{2}}.\tag{14}$$

We clip $\gamma$ to $[\gamma_{\min},\gamma_{\max}]=[-10,6]$ to avoid numerically extreme values near the bridge endpoints and to keep sampling within the range covered by our empirical observations.
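The change of variables between $t$ and $\gamma$ can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and $\sigma^{2}$ is passed in as an argument because its value is not stated in this appendix.

```python
import math

# Clipping range for gamma, as stated in the text.
GAMMA_MIN, GAMMA_MAX = -10.0, 6.0


def gamma_from_t(t: float, sigma2: float) -> float:
    """Log noise-to-signal ratio gamma = log(sigma^2 * (1 - t) / t), clipped.

    Assumes 0 < t < 1 so that the logarithm is defined.
    """
    gamma = math.log(sigma2 * (1.0 - t) / t)
    return min(max(gamma, GAMMA_MIN), GAMMA_MAX)


def t_from_gamma(gamma: float, sigma2: float) -> float:
    """Inverse map of Eq. (14): t = sigma^2 / (exp(gamma) + sigma^2)."""
    gamma = min(max(gamma, GAMMA_MIN), GAMMA_MAX)
    return sigma2 / (math.exp(gamma) + sigma2)
```

Because $\gamma$ is clipped before inversion, the resulting $t$ always stays strictly inside $(0,1)$, which is one way to read the remark about avoiding the bridge endpoints.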

As in Chen et al. ([2026](https://arxiv.org/html/2606.27617#bib.bib10)), the main component of our time-sampling distribution is fitted from the empirical difficulty of denoising at different noise levels. However, we found that using only a fitted component was less stable in our setting. We therefore sample $\gamma$ from the three-component mixture

$$q(\gamma)=0.1\,q_{\mathrm{unif}}(\gamma)+0.2\,q_{\mathrm{norm}}(\gamma)+0.7\,q_{\mathrm{fit}}(\gamma),\qquad\gamma\in[-10,6],$$

and then convert $\gamma$ to $t$ using ([14](https://arxiv.org/html/2606.27617#A1.E14)). The uniform component $q_{\mathrm{unif}}$ is supported on $[-10,6]$. The normal component $q_{\mathrm{norm}}$ is a Gaussian with location $-2.5$ and scale $2.0$, with samples clipped to $[-10,6]$; these values were chosen from empirical observations of stable training regions.
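A minimal sketch of drawing from this mixture is given below. The helper names are hypothetical, and the fitted component is represented by a caller-supplied `sample_fit` function (an assumption made for illustration) so that the sketch does not depend on how $q_{\mathrm{fit}}$ is parameterised.

```python
import random

GAMMA_MIN, GAMMA_MAX = -10.0, 6.0
# Mixture weights from the text: uniform, normal, fitted.
WEIGHTS = (0.1, 0.2, 0.7)


def clip(x, lo=GAMMA_MIN, hi=GAMMA_MAX):
    """Clamp x to the gamma range."""
    return min(max(x, lo), hi)


def sample_gamma(rng, sample_fit):
    """Draw one gamma from 0.1*q_unif + 0.2*q_norm + 0.7*q_fit.

    `sample_fit` is a hypothetical callable taking the RNG and returning
    one draw from the fitted component.
    """
    r = rng.random()
    if r < WEIGHTS[0]:
        return rng.uniform(GAMMA_MIN, GAMMA_MAX)
    if r < WEIGHTS[0] + WEIGHTS[1]:
        # Gaussian with location -2.5 and scale 2.0, clipped to the range.
        return clip(rng.gauss(-2.5, 2.0))
    return clip(sample_fit(rng))
```

Selecting a component first and then sampling from it is the standard way to draw from a finite mixture; the small uniform share guarantees that every part of $[-10,6]$ is visited occasionally.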

The fitted component $q_{\mathrm{fit}}$ is constructed from cross-entropy diagnostics as a function of $\gamma$. This is the relevant diagnostic because the MLFM denoiser is trained with masked-token cross-entropy, and the average CE at a fixed noise level measures the remaining uncertainty in predicting the clean token. Thus, changes in CE across $\gamma$ indicate where the model gains information along the bridge.

Concretely, we bin training examples by $\gamma$ and compute the mean masked-token CE in each bin. After smoothing, we normalize this CE curve to obtain an empirical CDF-like curve on $[-10,6]$. We then fit the generalized-logistic CDF

$$F_{\mathrm{glog}}(\gamma)=\sigma\left(\frac{\gamma-\mu}{b}\right)^{a},$$

where $a>0$ is a shape parameter, $\mu$ is a location parameter, $b>0$ is a scale parameter, and $\sigma(\cdot)$ denotes the logistic sigmoid. The corresponding density is

$$f_{\mathrm{glog}}(\gamma)=\frac{a}{b}\,\sigma\left(\frac{\gamma-\mu}{b}\right)^{a}\left(1-\sigma\left(\frac{\gamma-\mu}{b}\right)\right),$$

and we take $q_{\mathrm{fit}}=f_{\mathrm{glog}}$. We sample from this component by inverse transform sampling, using the quantile function

$$F_{\mathrm{glog}}^{-1}(u)=\mu-b\log\left(u^{-1/a}-1\right),\qquad u\in(0,1).$$

The CE–$\gamma$ summaries are updated every 200 batches using exponential smoothing with coefficient $0.999$. We accept a new generalized-logistic fit only when its fit quality satisfies $R^{2}\geq 0.95$; otherwise, we keep the previous fitted component. For numerical stability, the fitted scale is lower bounded by $0.05$, and the shape parameter is clipped to $[0.05,20]$. We also use low-discrepancy stratification (Kingma et al., [2021](https://arxiv.org/html/2606.27617#bib.bib24)) for both $\gamma$ samples and mask-ratio samples within each batch.
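The generalized-logistic component and its inverse-transform sampler can be sketched as follows. Function names are hypothetical. The stratified uniforms follow the common form $u_i=(u_0+i/n)\bmod 1$, which is our reading of the low-discrepancy sampler of Kingma et al. (2021) and should be treated as an assumption, as should the small `eps` guard on $u$.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def glog_cdf(gamma, a, mu, b):
    """F_glog(gamma) = sigmoid((gamma - mu) / b) ** a."""
    return sigmoid((gamma - mu) / b) ** a


def glog_pdf(gamma, a, mu, b):
    """f_glog(gamma) = (a / b) * s**a * (1 - s), with s = sigmoid((gamma - mu) / b)."""
    s = sigmoid((gamma - mu) / b)
    return (a / b) * s ** a * (1.0 - s)


def glog_quantile(u, a, mu, b, eps=1e-12):
    """Inverse CDF: mu - b * log(u**(-1/a) - 1); u is kept away from 0 and 1."""
    u = min(max(u, eps), 1.0 - eps)
    return mu - b * math.log(u ** (-1.0 / a) - 1.0)


def stratified_uniforms(n, rng):
    """One shared offset u0, then n evenly spaced points wrapped into [0, 1)."""
    u0 = rng.random()
    return [(u0 + i / n) % 1.0 for i in range(n)]


def sample_fit_batch(n, a, mu, b, rng, lo=-10.0, hi=6.0):
    """Stratified inverse-transform draws from q_fit, clipped to [lo, hi]."""
    return [min(max(glog_quantile(u, a, mu, b), lo), hi)
            for u in stratified_uniforms(n, rng)]
```

With $a=1$ the distribution reduces to the ordinary logistic, so `glog_quantile(0.5, 1.0, mu, b)` returns `mu`, which is a convenient sanity check on the formula.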

## Appendix B Additional Experimental Details

Here, we provide further details on the experimental setup used in Section [5](https://arxiv.org/html/2606.27617#S5). Unless stated otherwise, the settings below are shared between the MLFM adaptation and SFT phases.

### Model and Adapter Hyperparameters

The base MDM model we adapt is the official 1028M-parameter SMDM checkpoint of Nie et al. ([2025a](https://arxiv.org/html/2606.27617#bib.bib30)), which uses a Diff-LLaMA architecture and the LLaMA-2 tokenizer (Touvron et al., [2023](https://arxiv.org/html/2606.27617#bib.bib43)). More specifically, we use the official SMDM 1028M checkpoint, `mdm-1028M-3300e18-rsl-0.01-bs-1024.safetensors`, from the [https://huggingface.co/nieshen/SMDM](https://huggingface.co/nieshen/SMDM) Hugging Face repository. The resulting vocabulary size, hidden dimension, and number of transformer layers are 32000, 1792, and 20, respectively.

The pretrained backbone and token embedding matrix are kept frozen. We attach LoRA adapters (Hu et al., [2022](https://arxiv.org/html/2606.27617#bib.bib22)) to attention and projection modules, and use a separate additive LoRA adapter for the output head. The time conditioning is provided by DiT-style AdaLN adapters (Peebles and Xie, [2023](https://arxiv.org/html/2606.27617#bib.bib34)). Specifically, we apply AdaLN to the normalisation layers inside each transformer block and to the final normalisation layer before the output head. We do not apply AdaLN to the token embedding layer or to the output-head module.

Table 3: Adapter hyperparameters.

| Component | Rank / Width | $\alpha$ | Dropout |
|---|---|---|---|
| Backbone LoRA | 256 | 512 | 0.05 |
| Output-head LoRA | 256 | 256 | 0 |
| AdaLN time embedding | 256 | – | – |
| AdaLN hidden layer | 512 | – | – |

Our total number of trainable parameters is 319M.

### Optimisation Hyperparameters

Table [4](https://arxiv.org/html/2606.27617#A2.T4) lists the optimisation settings for the adaptation and SFT phases of our MLFM. We use the AdamW optimiser (Loshchilov and Hutter, [2017](https://arxiv.org/html/2606.27617#bib.bib28)) with $(\beta_1,\beta_2)=(0.9,0.95)$. Weight decay is applied only to matrix-valued trainable weights. We exclude bias terms and all one-dimensional affine parameters from weight decay, including AdaLN scale/shift modulation parameters. Learning rates are linearly warmed up and then decayed by a cosine schedule to 10% of their peak value. We maintain an adapter-only EMA with decay 0.999 and use EMA weights for validation.
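The warmup-plus-cosine schedule described above might look like the sketch below. The exact functional form is an assumption (warmup starting from zero, half-cosine decay over the remaining steps); only the linear warmup, the cosine shape, and the 10% floor come from the text. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, floor_frac=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to floor_frac * peak_lr."""
    if step < warmup_steps:
        # Assumed: warmup rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return peak_lr * (floor_frac + (1.0 - floor_frac) * cosine)
```

For example, `lr_at(step, 1e-4, 3000, 200000)` uses the adaptation-phase peak learning rate, warmup length, and step count listed in Table 4.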

Table 4: Optimisation hyperparameters. We abbreviate learning rate to "LR" here.

| Hyperparameter | Adaptation | SFT |
|---|---|---|
| Global batch size | 256 | 256 |
| Gradient accumulation | 2 | 2 |
| Optimiser steps | 200k | 50k |
| Warmup steps | 3k | 3k |
| LoRA LR | $10^{-4}$ | $5\times 10^{-5}$ |
| Output-head LoRA LR | $10^{-4}$ | $5\times 10^{-5}$ |
| AdaLN LR | $10^{-4}$ | $5\times 10^{-5}$ |
| Weight decay | 0.01 | 0.01 |
| Gradient clipping | 1.0 | 1.0 |
| EMA decay | 0.999 | 0.999 |

The adaptation stage uses 200k optimiser updates with effective batch size 512 and maximum sequence length 1024, corresponding to approximately 100B processed token positions. The SFT stage uses 50k optimiser updates with effective batch size 512. The final, realised budget is approximately 15B processed token positions.
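As a quick check of the stated budgets, the arithmetic can be written out as below. Reading the effective batch size as global batch size times gradient accumulation is our interpretation of the table, and the SFT line is only an upper bound that assumes every sequence is at the full length of 1024.

```latex
\begin{align*}
\text{effective batch size} &= 256 \times 2 = 512,\\
\text{adaptation positions} &= 200{,}000 \times 512 \times 1024
  = 104{,}857{,}600{,}000 \approx 1.05 \times 10^{11},\\
\text{SFT positions (upper bound)} &= 50{,}000 \times 512 \times 1024
  = 26{,}214{,}400{,}000 \approx 2.6 \times 10^{10}.
\end{align*}
```

The adaptation figure matches the quoted 100B. The realised SFT budget of roughly 15B sits below the 26B upper bound, which would be consistent with shorter SFT sequences and dynamic cropping, though the text does not break this down.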

### Masking Details

We sample the masking probability $s$ from a cosine schedule inspired by MaskGIT (Chang et al., [2022](https://arxiv.org/html/2606.27617#bib.bib8)). Specifically, we draw $u\sim\operatorname{Unif}(0,1)$ and set

$$s=\rho_{\min}+(\rho_{\max}-\rho_{\min})\cos\!\left(\frac{\pi u}{2}\right),$$

with $\rho_{\min}=0.05$ and $\rho_{\max}=1.0$. Padding tokens, special tokens, and invalid positions are excluded from masking.

During SFT, masking is restricted to answer tokens: prompt tokens are always kept clean. With probability $0.5$, the full answer span is masked. Otherwise, the answer-token masking probability is sampled from the same cosine schedule.
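The two mask-ratio samplers described in this subsection can be sketched as follows. Function names are hypothetical; the constants are the ones given in the text.

```python
import math
import random

RHO_MIN, RHO_MAX = 0.05, 1.0


def mask_prob(u: float) -> float:
    """Cosine schedule: s = rho_min + (rho_max - rho_min) * cos(pi * u / 2)."""
    return RHO_MIN + (RHO_MAX - RHO_MIN) * math.cos(math.pi * u / 2.0)


def sample_adaptation_mask_prob(rng) -> float:
    """Adaptation: draw u ~ Unif(0, 1) and map it through the cosine schedule."""
    return mask_prob(rng.random())


def sample_sft_mask_prob(rng, p_full: float = 0.5) -> float:
    """SFT: with probability p_full mask the whole answer, else use the cosine schedule."""
    if rng.random() < p_full:
        return 1.0
    return mask_prob(rng.random())
```

Since $\cos(\pi u/2)$ is concave on $[0,1]$, a uniform $u$ produces masking probabilities that are skewed toward the heavily masked end of $[0.05, 1.0]$.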

### Dataset Details

The adaptation stage uses SlimPajama (Soboleva et al., [2023](https://arxiv.org/html/2606.27617#bib.bib40)) tokenised with the LLaMA-2 tokenizer (Touvron et al., [2023](https://arxiv.org/html/2606.27617#bib.bib43)) using sequences of length 1024. We use the provided train and validation splits. During adaptation, with probability $0.01$ all sequences in the batch are cropped to a shared random prefix length sampled uniformly from $\{1,\ldots,1024\}$.

The SFT stage uses a mixture of general instruction, math, and code data, with mixture weights 0.20, 0.60, and 0.20 respectively. The general instruction data come from first-turn ShareGPT conversations (the dataset can be accessed from [https://sharegpt.com/](https://sharegpt.com/)). The math data come from the GSM8K-Aug-NL (Cobbe et al., [2021](https://arxiv.org/html/2606.27617#bib.bib12); Deng et al., [2023](https://arxiv.org/html/2606.27617#bib.bib15)), MetaMathQA (Yu et al., [2024](https://arxiv.org/html/2606.27617#bib.bib45)), and NuminaMath-CoT (LI et al., [2024](https://arxiv.org/html/2606.27617#bib.bib26)) datasets with sub-mixture weights 0.60, 0.20, and 0.20 respectively. The code data come from the short OpenCodeInstruct (Ahmad et al., [2025](https://arxiv.org/html/2606.27617#bib.bib2)) split, with the prefix fraction sampled from $[0.25,0.75]$.

SFT examples are truncated to a maximum length of 1024 tokens for MT-Bench and 512 tokens for GSM8K. To reduce padding, batches are dynamically cropped to sequence lengths that are multiples of 64. All SFT datasets use the same Vicuna prompt template (Chiang et al., [2023](https://arxiv.org/html/2606.27617#bib.bib11)), `USER:\n{prompt}\nASSISTANT:\n`, followed by the response and an EOS token. For math examples, when a separate final-answer field is available, we append it to the target as `### {answer}`. For code-instruction examples, when unit tests are available, we include a sampled subset of them in the prompt.
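A small sketch of how an SFT example and the cropped batch length could be formed is given below. Several details are assumptions that the text does not pin down: the newline joining the response and the `### {answer}` field, the placement of the EOS token after the answer field, the use of `</s>` (the LLaMA-2 EOS string) as a literal, and rounding the batch length up rather than down. The function names are hypothetical.

```python
def build_sft_example(prompt, response, answer=None, eos="</s>"):
    """Return (prompt_text, target_text) using the Vicuna-style template."""
    prompt_text = f"USER:\n{prompt}\nASSISTANT:\n"
    target = response
    if answer is not None:
        # Assumed separator: a newline before the final-answer field.
        target = f"{target}\n### {answer}"
    return prompt_text, target + eos


def cropped_length(longest_in_batch, multiple=64, cap=1024):
    """Round the longest sequence up to a multiple of `multiple`, capped at `cap`."""
    rounded = -(-longest_in_batch // multiple) * multiple  # ceiling division
    return min(rounded, cap)
```

For instance, a batch whose longest example has 100 tokens would be padded to 128 under this rounding rule, instead of to the full 1024.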

### SFT Details

The SFT stage is initialised from the EMA\-smoothed adapter weights obtained after adaptation\. We reset the optimizer and learning\-rate schedule, keep the pretrained backbone and token embeddings frozen, and train only the LoRA adapters and AdaLN time\-conditioning parameters\.

During SFT, masking and continuous noising are applied only to response tokens; prompt tokens are always kept clean and visible\. With probability 0\.5, all response tokens are masked\. Otherwise, the response mask ratio is sampled from the same MaskGIT cosine schedule used during adaptation\.

We also alter the $\gamma$-sampling distribution during SFT, as we find that the approach for $\gamma$ used during adaptation does not work as well for SFT due to the shift in the learning objective. In particular, unlike the adaptation stage, which uses a fitted generalized-logistic component, SFT uses an active empirical gamma curve. The SFT sampler draws from

$$q_{\mathrm{SFT}}(\gamma)=0.1\,q_{\mathrm{unif}}(\gamma)+0.9\,q_{\mathrm{active}}(\gamma),\qquad\gamma\in[-10,6],$$

with a normal component $\mathcal{N}(-2.5,2.0^{2})$ used as a fallback until the active curve is initialized. The active component is estimated from the high-mask SFT diagnostics with response mask ratios in $[0.95,1.0]$. Because SFT contains many easy EOS and padding targets, the diagnostic curve uses response-token CE with EOS targets removed. We form an empirical inverse CDF over $\gamma$, smooth it with isotonic regression, represent it with 101 quantile knots, and update the active inverse CDF by EMA:

$$Q_{k+1}(u)=(1-\eta)\,Q_{k}(u)+\eta\,\widehat{Q}_{k}(u),\qquad\eta=0.05.$$

An update is applied only after at least 8 populated gamma bins and 4096 diagnostic examples are available. [Algorithm 4](https://arxiv.org/html/2606.27617#alg4) summarizes one SFT learning step.
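The EMA update of the active inverse CDF and sampling from it can be sketched as below. This sketch assumes the 101 knots sit at evenly spaced quantile levels $u=0,0.01,\ldots,1$ and that values between knots are linearly interpolated; neither detail is stated in the text. It also takes the new estimate $\widehat{Q}_k$ as already monotone, so the isotonic-regression step is not shown. All function names are hypothetical.

```python
def ema_update_quantiles(q_old, q_new, eta=0.05):
    """Knot-wise EMA: Q_{k+1}(u) = (1 - eta) * Q_k(u) + eta * Qhat_k(u)."""
    if len(q_old) != len(q_new):
        raise ValueError("knot vectors must have the same length")
    return [(1.0 - eta) * a + eta * b for a, b in zip(q_old, q_new)]


def maybe_update(q_old, q_new, n_populated_bins, n_examples,
                 min_bins=8, min_examples=4096, eta=0.05):
    """Apply the EMA only when enough diagnostics are available, else keep q_old."""
    if n_populated_bins < min_bins or n_examples < min_examples:
        return q_old
    return ema_update_quantiles(q_old, q_new, eta)


def sample_from_knots(knots, u):
    """Inverse-transform draw: linearly interpolate the quantile knots at level u in [0, 1]."""
    pos = u * (len(knots) - 1)
    i = min(int(pos), len(knots) - 2)
    frac = pos - i
    return knots[i] + frac * (knots[i + 1] - knots[i])
```

A convex combination of two nondecreasing knot vectors is itself nondecreasing, so the EMA keeps $Q_{k+1}$ a valid inverse CDF as long as both inputs are.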

## Appendix C Algorithm Blocks

Algorithms [3](https://arxiv.org/html/2606.27617#alg3) and [4](https://arxiv.org/html/2606.27617#alg4) summarise the corresponding MLFM learning steps for adaptation and supervised fine-tuning.

Algorithm 3: One MLFM Adaptation Step

Require: Clean-text distribution $\mathcal{D}$, model $p_{\theta}$, mask distribution $\pi_{s}$, corruption-level distribution $\pi_{t}$, optimizer.

1. Draw a minibatch $X\sim\mathcal{D}$.
2. Sample a mask pattern $\mathcal{M}_{s}$ using $s\sim\pi_{s}$, and set $\mathcal{U}_{s}=[L]\setminus\mathcal{M}_{s}$.
3. Sample $\gamma\sim\pi_{t}$ and then set $t$ using ([14](https://arxiv.org/html/2606.27617#A1.E14)).
4. Construct $z_{s,t}$: set $z_{s,t}^{\ell}=E_{X^{\ell}}$ for $\ell\in\mathcal{U}_{s}$, and sample $z_{s,t}^{\ell}\mid X^{\ell}$ from ([7](https://arxiv.org/html/2606.27617#S3.E7)) for $\ell\in\mathcal{M}_{s}$.
5. Compute $\widehat{\mathcal{L}}_{\mathrm{MLFM}}=\frac{1}{|\mathcal{M}_{s}|}\sum_{\ell\in\mathcal{M}_{s}}-\log p_{\theta}^{\ell}(X^{\ell}\mid z_{s,t},t)$.
6. Update $\theta$ with one optimizer step on $\widehat{\mathcal{L}}_{\mathrm{MLFM}}$.
7. Update the distribution $\pi_{t}$ using the batch diagnostics at $\gamma$.
8. Return updated $\theta$ and $\pi_{t}$.
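Steps 2 and 5 of Algorithm 3 can be illustrated without a model by working on plain Python lists. In this sketch the mask pattern is drawn by masking each position independently with probability $s$, and a single position is forced when the draw comes out empty so that the loss is defined; both choices are assumptions, since the algorithm does not specify how $\mathcal{M}_s$ is sampled given $s$. The per-position log-probabilities stand in for the model output $\log p_\theta^{\ell}(\cdot\mid z_{s,t},t)$.

```python
import math
import random


def sample_mask(length, s, rng):
    """Return the sorted list of masked positions (assumed i.i.d. Bernoulli(s) per position)."""
    masked = [l for l in range(length) if rng.random() < s]
    if not masked:
        # Assumption: force one masked position so the loss has a nonzero denominator.
        masked = [rng.randrange(length)]
    return masked


def masked_cross_entropy(log_probs, targets, masked):
    """Mean over masked positions of -log p(X^l), with log_probs[l][v] = log p^l(v)."""
    return -sum(log_probs[l][targets[l]] for l in masked) / len(masked)
```

With a uniform predictor over a vocabulary of size $V$, the loss equals $\log V$ regardless of which positions are masked, which gives a simple reference value when debugging.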

Algorithm 4: One MLFM Supervised Fine-Tuning Step

Require: Prompt–answer dataset $\mathcal{D}_{\mathrm{FT}}$, model $p_{\theta}$, response-mask distribution $\pi_{s}^{\mathrm{FT}}$, corruption-level distribution $\pi_{t}$, optimizer.

1. Draw a minibatch $(p,a)\sim\mathcal{D}_{\mathrm{FT}}$ and tokenize it as $X=X(p,a)$.
2. Let $\mathcal{P}$ be prompt positions and $\mathcal{A}$ be response positions.
3. Sample a nonempty response mask $\mathcal{M}_{s}\sim\pi_{s}^{\mathrm{FT}}(\cdot\mid\mathcal{A})$, and set $\mathcal{U}_{s}=[L]\setminus\mathcal{M}_{s}$.
4. Sample $\gamma\sim\pi_{t}$ and set $t$ using ([14](https://arxiv.org/html/2606.27617#A1.E14)).
5. Construct $z_{s,t}$: keep all positions in $\mathcal{U}_{s}$ clean, including every prompt position in $\mathcal{P}$, and sample $z_{s,t}^{\ell}\mid X^{\ell}$ from ([7](https://arxiv.org/html/2606.27617#S3.E7)) for $\ell\in\mathcal{M}_{s}$.
6. Compute the response-token objective $\widehat{\mathcal{L}}_{\mathrm{FT}}=\frac{1}{|\mathcal{M}_{s}|}\sum_{\ell\in\mathcal{M}_{s}}-\log p_{\theta}^{\ell}(X^{\ell}\mid z_{s,t},t)$.
7. Update $\theta$ with one optimizer step on $\widehat{\mathcal{L}}_{\mathrm{FT}}$.
8. Update the distribution $\pi_{t}$ using the response-token diagnostics at $\gamma$.
9. Return updated $\theta$ and $\pi_{t}$.
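The partition of positions used in steps 2, 3, and 5 of Algorithm 4 can be sketched as follows. As an illustrative assumption, the prompt occupies the first `prompt_len` positions and each response position is masked independently with probability $s$; a single response position is forced when the draw is empty, matching the "nonempty" requirement. The function name is hypothetical.

```python
import random


def sample_response_partition(prompt_len, total_len, s, rng):
    """Return (masked, unmasked) position lists; prompt positions are never masked."""
    if not 0 <= prompt_len < total_len:
        raise ValueError("need at least one response position")
    response = range(prompt_len, total_len)
    masked = [l for l in response if rng.random() < s]
    if not masked:
        # Enforce a nonempty response mask.
        masked = [rng.randrange(prompt_len, total_len)]
    masked_set = set(masked)
    unmasked = [l for l in range(total_len) if l not in masked_set]
    return masked, unmasked
```

Every prompt position ends up in the unmasked list, so the loss in step 6 is computed over response tokens only.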

## Appendix D Additional Results

### Samples

Figures [2](https://arxiv.org/html/2606.27617#A4.F2) and [3](https://arxiv.org/html/2606.27617#A4.F3) show two qualitative examples of MLFM performing conditional generation on MT-Bench and GSM8K respectively. These examples complement our quantitative results by demonstrating that MLFM can successfully handle mathematical reasoning and instruction-following tasks, which have largely remained outside the scope of prior FLM evaluations due to their inability to handle conditional generation.

MT-Bench

Prompt: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

Answer: As I returned to the tropical state of California, I was struck by the incred diversity and beauty of the islands. I arrived in the island and met with the people of my country, and I was amazed by the beauty of the blue beaches, the lush green plants, and the crystal blue waters.

One of the most famous attractions of Hawaii is the unique natural beauty of the island. From the red mountains to the tropical forests, the island is a master of natural beauty. The Kilauea’s Palace, which is the largest in the United States, serves as a museum of science and technology.

As the sunseted over the island, I was curious about the rich culture of the Hawaii. The Duke Palace is the oldest and the largest museum of the islands. It is the largest collection of historical and cultural art in the world, and it is a stunning destination for tourists.

As the sunseted in the mountains, I had a fresh perspective on the history of Hawaii, a country with a rich history in the Pacific. One of the most well-known attractions in the world is the Hualaiian Cathedrals, located in the footsteps of the island’s Mau National Park. The park is located in the tropical mountains, with a beautiful lana and views on the ocean.

In addition, there was a diverse culture in the island, with different customs and traditions. For example, the fire dance is a popular dance, and the Hawaii are also interested in the cone dance, which is a traditional dance that still maintains in the culture of the islands.

Another place to visit is the Hawaiolian National Zoo. It is the largest animal park in the world, where a visitor can see a variety of animals from around the world.

Overall, my trip to Hawaii was a unique experience that cannot be beat, with its unique attractions, beautiful natural beauty, and diverse cultural experiences.

If you are interested in the culture of the islands, I would recommend the Hawaii Museum and the Hawaiolian National Museum, and if you want to learn more, this assistant is very helpful.

Figure 2: Qualitative example of MLFM performing conditional generation on MT-Bench.

GSM8K

Prompt: In a dance class of 20 students, 20% enrolled in contemporary dance, 25% of the remaining enrolled in jazz dance, and the rest enrolled in hip-hop dance. What percentage of the entire students enrolled in hip-hop dance?

Answer: 20% of 20 students is 20\*0.2 = 4. The remaining number after enrolling in contemporary dance is 20-4 = 16 students. 25% of 16 students enrolled in jazz dance, which is .25\*16 = 4 students. So, the number of students who are enrolled in hip-hop dance is 16-4 = 12. Therefore, the percentage of students enrolled in hip-hop dance is 12/20\*100 = 60

Figure 3: Qualitative example of MLFM performing conditional generation on GSM8K.

## Appendix E Proofs

### Proof of Proposition [3.1](https://arxiv.org/html/2606.27617#S3.Thmtheorem1)

###### Proof.

At $t=0$, every corrupted coordinate is equal to the deterministic mask embedding $m$, while every uncorrupted coordinate is equal to its clean embedding $E_{X^{\ell}}$. Therefore $z_{s,0}$ contains exactly the same information as the partially masked sequence $X_{s}$, up to the deterministic embedding map. Conditioning on either representation induces the same posterior over masked token identities. ∎

### Proof of Proposition [4.1](https://arxiv.org/html/2606.27617#S4.Thmtheorem1)

Couple $X\sim p$ with the promoted sampler. Let

$$E=\left\{\exists i,\ \exists\ell\in P_{i}:\widehat{X}_{i}^{\ell}\neq X^{\ell}\right\}$$

be the event that some promoted token is wrong. Since $P_{i}$ is $\mathcal{F}_{i}$-measurable,

$$\begin{aligned}
\mathbb{P}(E)&\leq\sum_{i}\sum_{\ell=1}^{L}\mathbb{P}\!\left(\ell\in P_{i},\,\widehat{X}_{i}^{\ell}\neq X^{\ell}\right)\\
&=\sum_{i}\sum_{\ell=1}^{L}\mathbb{E}\!\left[\mathbf{1}_{\{\ell\in P_{i}\}}\,\mathbb{P}\!\left(\widehat{X}_{i}^{\ell}\neq X^{\ell}\mid\mathcal{F}_{i}\right)\right]\\
&\leq\varepsilon\,\mathbb{E}\!\left[\sum_{i}|P_{i}|\right]\leq\varepsilon L,
\end{aligned}$$

because each coordinate is promoted at most once.

On $E^{c}$, every promoted value agrees with the corresponding coordinate of $X$. By exact conditional dynamics and exact terminal decoding, the remaining randomness can be coupled so that the promoted sampler finishes with output $\widetilde{X}=X$. Therefore

$$\operatorname{TV}(p,\widetilde{p})\leq\mathbb{P}(\widetilde{X}\neq X)\leq\mathbb{P}(E)\leq\varepsilon L.$$
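To give a feel for the scale of this bound, consider a purely illustrative instance; the numbers below are hypothetical and are not taken from the experiments.

```latex
\begin{align*}
L = 1024,\quad \varepsilon = 10^{-4}
  &\;\Longrightarrow\; \operatorname{TV}(p,\widetilde{p}) \le \varepsilon L = 0.1024,\\
L = 1024,\quad \varepsilon = 10^{-3}
  &\;\Longrightarrow\; \varepsilon L = 1.024 \ge 1 \quad\text{(vacuous, since } \operatorname{TV}\le 1\text{)}.
\end{align*}
```

In other words, the union bound is informative only when the per-token promotion error satisfies $\varepsilon<1/L$, so longer sequences require a stricter confidence threshold for the same guarantee.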