Self-conditioned Flow Map Language Models via Fixed-point Flows
Summary
Introduces fixed-point flows, a self-conditioned flow language model that treats self-conditioning as a fixed-point iteration, enabling distillation into a few-step flow map language model (FMLM⋆) that outperforms prior work on OpenWebText.
Cached at: 07/02/26, 05:38 AM
# Self-conditioned Flow Map Language Models via Fixed-point Flows
Source: [https://arxiv.org/html/2607.00714](https://arxiv.org/html/2607.00714)
Jaehoon Yoo¹, Wonjung Kim¹, Floor Eijkelboom², Chanhyuk Lee¹, Nicholas M. Boffi³, Seunghoon Hong¹†, Jinwoo Kim¹†

¹KAIST, ²University of Amsterdam, ³Carnegie Mellon University
###### Abstract
Self-conditioning is a core technique that enhances continuous flow-based language models, where the model learns to denoise generated text by conditioning on its own denoising estimate. While empirically successful, its performance improvements are poorly understood. Moreover, there is growing interest in few-step generators based on flow maps, for which it is unclear how to leverage self-conditioning. Here, we show that flow language models with self-conditioning solve a fixed-point iteration that bootstraps the performance of the learned denoiser. We use this viewpoint to formulate fixed-point flows, a two-dimensional class of self-conditioned flows, where the first dimension represents the flow process and the second represents the fixed-point iteration. We show that fixed-point flows define valid flow maps, and that they can be distilled from self-conditioned flow models by compressing both the fixed-point iterations and the flow process, the former with fixed-point distillation and the latter with flow map distillation. Our resulting flow map language model, FMLM⋆, outperforms state-of-the-art self-conditioned models and few-step models in one- and few-step generation on OpenWebText. Code is available at [https://github.com/Ugness/self-conditioned-fmlm](https://github.com/Ugness/self-conditioned-fmlm).
## 1 Introduction
Language models (LMs) based on continuous flows have recently emerged as a promising paradigm for non-autoregressive text generation (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2); Chemseddine et al., [2026](https://arxiv.org/html/2607.00714#bib.bib13); Deschenaux and Gulcehre, [2026](https://arxiv.org/html/2607.00714#bib.bib14)). By learning to denoise tokens in a continuous space, these models perform parallel iterative generation through a deterministic evolution driven by a velocity field. Importantly, such processes define a unique flow map, the solution operator that directly transports noise to data in as few as one function evaluation. This advantage has sparked recent breakthroughs in distilling flow language models into flow map language models that can generate text in one to a few inference steps (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2); Roos et al., [2026](https://arxiv.org/html/2607.00714#bib.bib5); Potaptchik et al., [2026](https://arxiv.org/html/2607.00714#bib.bib6)).
Recently, the performance of flow language models has improved through the incorporation of a self-conditioning mechanism (Chen et al., [2022](https://arxiv.org/html/2607.00714#bib.bib7)). Unlike in usual flow training, a self-conditioned flow model learns to denoise an input by conditioning on its own denoising prediction. Then, during generation, it performs denoising at each flow timestep by conditioning on the previously denoised outcome. Self-conditioning has empirically proven very effective (Dieleman et al., [2022](https://arxiv.org/html/2607.00714#bib.bib8); Strudel et al., [2022](https://arxiv.org/html/2607.00714#bib.bib9)), and has been widely adopted in the latest state-of-the-art flow language models (Chen et al., [2026](https://arxiv.org/html/2607.00714#bib.bib4); Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3); Batzolis et al., [2026](https://arxiv.org/html/2607.00714#bib.bib11); Meshchaninov et al., [2026](https://arxiv.org/html/2607.00714#bib.bib12); Yang et al., [2026](https://arxiv.org/html/2607.00714#bib.bib15)). Yet, despite its success, why self-conditioning leads to improvements remains poorly understood. Furthermore, it is unclear how to distill flows into flow maps under self-conditioning, as it introduces additional dependencies across generation timesteps.
In this work, we introduce fixed-point flows, a mathematical framework for self-conditioned flows and their associated flow maps ([Figure 1](https://arxiv.org/html/2607.00714#S2.F1)). Our key observation is that self-conditioned flow language models solve a fixed-point iteration that bootstraps the performance of learned denoising. With this insight, we characterize a two-dimensional class of self-conditioned flows, where the first dimension represents the original flow, and the second represents fixed-point iterations. We use this view to distill self-conditioned flow language models into flow maps by compressing both the fixed-point iterations and the flow, achieving state-of-the-art one- and few-step language modeling. Our main contributions are:
- **Fixed-point view of self-conditioning.** We show that flow language models with self-conditioning implicitly learn a fixed-point iteration that refines the denoising estimate. We theoretically show that such behavior can emerge from self-conditioned training under a contractivity assumption.
- **Fixed-point flows and their flow maps.** We introduce fixed-point flows, a formalization of self-conditioned flows. We show that fixed-point flows define a valid flow map, which can be learned by compressing both the fixed-point iterations and the flow with the respective few-step distillation objectives.
- **Empirical results.** We distill self-conditioned flow language models into flow map language models, achieving state-of-the-art one- and few-step generation on OpenWebText. We also show that it is possible to distill self-conditioned models into self-conditioning-free ones without degradation.
## 2 Preliminary
[Figure 1 panels: Flow (flow training, flow sampling); Self-conditioned flow (SC training, warm-start SC sampling, cold-start SC sampling); Distillation (flow map distillation + fixed-point distillation = SC flow map distillation). Legend: flow state, denoising estimate, fixed-point iteration, warm start, flow-map jump, fixed-point jump.]

Figure 1: Overview. We show that flow language models with self-conditioning solve a fixed-point iteration that refines the denoising estimate. We leverage this insight to formulate fixed-point flows, a class of self-conditioned flows that run fixed-point iterations at each flow timestep. Fixed-point flows yield valid flow maps, which we learn by compressing both the flow and the fixed-point iterations.

#### Flow language models.
Let $V$ be a vocabulary of tokens, and denote text data of length $L$ by $\mathbf{y} = (\mathbf{y}^{l})_{l=1}^{L} \in V^{L}$. The goal of language modeling is to learn the data distribution $p(\mathbf{y})$ so that we can draw a new text sample efficiently. Flow language models choose a continuous embedding $\mathbf{y} \mapsto \mathbf{x} \in \mathbb{R}^{L \times d}$ and a decoder $\mathbf{x} \mapsto \mathbf{y}$, and model the induced distribution $p(\mathbf{x})$ on the embedding space, generating $\hat{\mathbf{x}} \sim p(\mathbf{x})$ and then rounding into discrete language through $\hat{\mathbf{x}} \mapsto \hat{\mathbf{y}}$.

To model the now-continuous data distribution $p(\mathbf{x})$, flow language models use flow matching over a stochastic interpolant (Lipman et al., [2022](https://arxiv.org/html/2607.00714#bib.bib17); Albergo et al., [2025](https://arxiv.org/html/2607.00714#bib.bib18)). They specify a probability path $p_t(\mathbf{x}_t)$ over $t \in [0,1]$ as the density of an interpolant $I_t \coloneqq (1-t)\mathbf{x}_0 + t\mathbf{x}_1$ between noise $\mathbf{x}_0 \sim p_0$ and data $\mathbf{x}_1 \sim p_1$. The path $p_t$ admits a deterministic evolution equation that can be used to draw a sample $\mathbf{x}_t \sim p_t$ at inference time:

$$\dot{\mathbf{x}}_t = b_t(\mathbf{x}_t), \quad b_t(\mathbf{x}) = \mathbb{E}[\mathbf{x}_1 - \mathbf{x}_0 \mid I_t = \mathbf{x}], \quad \mathbf{x}_0 \sim p_0, \quad t \in [0,1], \tag{1}$$

where $b_t$ is the velocity field of the flow. If a model $\hat{b}$ of the velocity is available, a sample $\hat{\mathbf{x}}_1 \sim p_1$ can be approximately drawn by numerically integrating ([1](https://arxiv.org/html/2607.00714#S2.E1)) across a time grid $0 = t_0 < \ldots < t_N = 1$. A simple choice for the integration is the forward Euler scheme:

$$\hat{\mathbf{x}}_{t_{i+1}} = \hat{\mathbf{x}}_{t_i} + (t_{i+1} - t_i)\,\hat{b}_{t_i}(\hat{\mathbf{x}}_{t_i}), \quad \hat{\mathbf{x}}_0 \sim p_0. \tag{2}$$

Instead of learning the velocity directly, it is more common in language modeling to learn the denoiser $D_t$ (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2); Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)), which outputs the mean of clean data, and from which one can recover the velocity (Albergo et al., [2025](https://arxiv.org/html/2607.00714#bib.bib18); Li and He, [2026](https://arxiv.org/html/2607.00714#bib.bib24)):

$$D_t(\mathbf{x}) \coloneqq \mathbb{E}[\mathbf{x}_1 \mid I_t = \mathbf{x}], \quad b_t(\mathbf{x}) = \frac{D_t(\mathbf{x}) - \mathbf{x}}{1 - t}. \tag{3}$$

The ideal denoiser $D$ can be learned in practice by solving a regression problem $D = \operatorname*{argmin}_{\hat{D}} \mathcal{L}(\hat{D})$ that predicts clean data by minimizing the following loss:

$$\mathcal{L}(\hat{D}) \coloneqq \int_0^1 \mathbb{E}\,|\hat{D}_t(I_t) - \mathbf{x}_1|^2 \,\mathrm{d}t. \tag{4}$$

While we have used the square loss above, our discussions and results readily transfer to the cross-entropy setting of Dieleman et al. ([2022](https://arxiv.org/html/2607.00714#bib.bib8)); Eijkelboom et al. ([2024](https://arxiv.org/html/2607.00714#bib.bib23)); Lee et al. ([2026](https://arxiv.org/html/2607.00714#bib.bib2)). With a learned denoiser $\hat{D}$, generation can be done by turning it into an estimate of the velocity $\hat{b}$ through ([3](https://arxiv.org/html/2607.00714#S2.E3)) and then applying the forward Euler scheme ([2](https://arxiv.org/html/2607.00714#S2.E2)).
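The generation recipe in (2) and (3) can be sketched in a few lines. This is a minimal illustration under a toy assumption that is not part of the paper: the data distribution is a point mass at a hypothetical vector `mu`, so the ideal denoiser is constant. The names `toy_denoiser`, `velocity_from_denoiser`, and `euler_sample` are illustrative only.

```python
import numpy as np

def toy_denoiser(t, x, mu):
    """Toy ideal denoiser D_t(x) = E[x1 | I_t = x] when all data equals mu (assumption)."""
    return np.broadcast_to(mu, x.shape)

def velocity_from_denoiser(t, x, d):
    """Eq. (3): b_t(x) = (D_t(x) - x) / (1 - t), defined for t < 1."""
    return (d - x) / (1.0 - t)

def euler_sample(x0, grid, mu):
    """Forward Euler, eq. (2), over a grid 0 = t_0 < ... < t_N = 1."""
    x = x0
    for t_cur, t_next in zip(grid[:-1], grid[1:]):
        d = toy_denoiser(t_cur, x, mu)
        x = x + (t_next - t_cur) * velocity_from_denoiser(t_cur, x, d)
    return x

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])      # hypothetical clean embedding
x0 = rng.standard_normal(3)          # noise sample from p_0 = N(0, I)
grid = np.linspace(0.0, 1.0, 9)      # N = 8 Euler steps
x1 = euler_sample(x0, grid, mu)
```

Because the last step runs from $t_{N-1}$ to $1$, the update reduces to $\hat{\mathbf{x}}_1 = \hat{D}_{t_{N-1}}(\hat{\mathbf{x}}_{t_{N-1}})$, so the final sample is simply the last denoiser output.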
#### Flow map language models\.
A key advantage of continuous flows for language modeling is that the deterministic generative process ([1](https://arxiv.org/html/2607.00714#S2.E1)) driven by the velocity $b_t$ defines a unique flow map $X_{s,t} : \mathbb{R}^{L \times d} \to \mathbb{R}^{L \times d}$, the solution operator that transports a sample directly, $\mathbf{x}_t = X_{s,t}(\mathbf{x}_s)$, between any pair of times $(s,t) \in [0,1]^2$. It satisfies the following integral equation:

$$X_{s,t}(\mathbf{x}_s) = \mathbf{x}_s + \int_s^t b_\tau(\mathbf{x}_\tau)\,\mathrm{d}\tau. \tag{5}$$

If available, a flow map allows for efficient few-step generation by sequentially evaluating $\hat{\mathbf{x}}_{t_{i+1}} = X_{t_i,t_{i+1}}(\hat{\mathbf{x}}_{t_i})$ on any grid $0 = t_0 < \ldots < t_N = 1$, and even one-step generation via $\hat{\mathbf{x}}_1 = X_{0,1}(\hat{\mathbf{x}}_0)$.

Recent works have demonstrated that flow map language models can be learned by distilling the flow velocity learned by flow language models (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2); Roos et al., [2026](https://arxiv.org/html/2607.00714#bib.bib5); Potaptchik et al., [2026](https://arxiv.org/html/2607.00714#bib.bib6)). These methods leverage a set of mathematical relations between the flow velocity and the flow map (Boffi et al., [2026](https://arxiv.org/html/2607.00714#bib.bib22); Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2)), which are derivable from the integral equation ([5](https://arxiv.org/html/2607.00714#S2.E5)), and train the flow map to satisfy these relations with respect to a given flow velocity.
## 3 Theoretical framework
In this section, we formalize self\-conditioning and propose that it learns a fixed\-point iteration which refines the denoising prediction\. We then develop fixed\-point flows, a framework for self\-conditioned flows, along with the associated flow maps for few\-step generation\. Finally, we introduce distillation methods for turning self\-conditioned flow language models into few\-step language models\.
### 3.1 Self-conditioned flow language models
Self-conditioning is a technique introduced in Chen et al. ([2022](https://arxiv.org/html/2607.00714#bib.bib7)) where the denoiser learns to predict conditioned on its own denoising estimates. In these approaches, the model of the denoiser ([3](https://arxiv.org/html/2607.00714#S2.E3)) takes an additional conditioning input, $\hat{D}_t(\mathbf{x}, \mathbf{z})$, and effectively reduces to a usual denoiser model when $\mathbf{z} = \boldsymbol{0}$. The denoiser is trained using the following loss, where $\mu$ is a choice of training-time distribution of $\mathbf{z}$:

$$\mathcal{L}_\mu(\hat{D}) \coloneqq \int_0^1 \mathbb{E}_{\mathbf{x}_0,\mathbf{x}_1}\mathbb{E}_\mu\,|\hat{D}_t(I_t, \mathbf{z}) - \mathbf{x}_1|^2 \,\mathrm{d}t. \tag{6}$$

A common choice of $\mu$ is the equal-weight mixture of delta peaks at zero, $\mathbf{z} = \boldsymbol{0}$, and at the model's own denoising prediction, $\mathbf{z} = \mathsf{sg}(\hat{D}_t(I_t, \boldsymbol{0}))$, where $\mathsf{sg}$ is the stop-gradient operator. This leads to the following:

$$\mathcal{L}_\mu(\hat{D}) = \frac{1}{2}\int_0^1 \mathbb{E}\,|\hat{D}_t(I_t, \boldsymbol{0}) - \mathbf{x}_1|^2 \,\mathrm{d}t + \frac{1}{2}\int_0^1 \mathbb{E}\,|\hat{D}_t(I_t, \mathsf{sg}(\hat{D}_t(I_t, \boldsymbol{0}))) - \mathbf{x}_1|^2 \,\mathrm{d}t, \tag{7}$$

where the first term is the usual denoising loss ([4](https://arxiv.org/html/2607.00714#S2.E4)), whereas the second term is a self-conditioned loss that encourages the model to correct its own initial estimate.

In general, the conditioning variable $\mathbf{z}$ in the training loss ([6](https://arxiv.org/html/2607.00714#S3.E6)) does not change the learning target of the denoiser. This is because the Bayes-optimal prediction of $\mathbf{x}_1$ given an interpolant $I_t$ is specified by the ideal denoiser $D_t(I_t)$ ([3](https://arxiv.org/html/2607.00714#S2.E3)) regardless of $\mathbf{z}$, assuming $\mathbf{z}$ does not leak additional information about $\mathbf{x}_1$ beyond what is available from $I_t$, as is the case in ([7](https://arxiv.org/html/2607.00714#S3.E7)). Thus, the objective trains the model to map every on-distribution conditioning $\mathbf{z}$ to the ideal denoising target. We provide a formal statement:

###### Proposition 3.1.
Assume that $\mathbf{z} \perp \mathbf{x}_1 \mid I_t$ for almost every $t$, and assume all second moments are finite. Then every unrestricted population minimizer $\bar{D}$ of ([6](https://arxiv.org/html/2607.00714#S3.E6)) satisfies

$$\bar{D}_t(\mathbf{x}, \mathbf{z}) = D_t(\mathbf{x}) \tag{8}$$

almost surely under the law of $(I_t, \mathbf{z})$, for almost every $t$.

A proof is in [Section A.1](https://arxiv.org/html/2607.00714#A1.SS1). In practice, training does not attain the minimum loss, so the learned model retains some dependence on $\mathbf{z}$. With the self-conditioned loss ([7](https://arxiv.org/html/2607.00714#S3.E7)), this can encourage a self-correcting behavior that produces an initial prediction and then pulls it closer to the Bayes-optimal prediction.

The generative process under self-conditioning is similar to that of ([1](https://arxiv.org/html/2607.00714#S2.E1)), evolving a sample $\hat{\mathbf{x}}_{t_i}$ across a flow-time grid $0 = t_0 < \ldots < t_N = 1$. Yet, the denoising estimate is also maintained and updated as $\hat{\mathbf{z}}_{t_i}$ to condition and bootstrap the denoiser, altering the numerical integration scheme ([2](https://arxiv.org/html/2607.00714#S2.E2)) as follows:

$$\begin{aligned}
\hat{\mathbf{x}}_{t_{i+1}} &= \hat{\mathbf{x}}_{t_i} + (t_{i+1} - t_i)\,\hat{b}_{t_i}, \quad \hat{b}_{t_i} = \frac{\hat{\mathbf{z}}_{t_{i+1}} - \hat{\mathbf{x}}_{t_i}}{1 - t_i}, \quad \hat{\mathbf{x}}_0 \sim p_0, \\
\hat{\mathbf{z}}_{t_{i+1}} &= \hat{D}_{t_i}(\hat{\mathbf{x}}_{t_i}, \hat{\mathbf{z}}_{t_i}), \quad \hat{\mathbf{z}}_0 = \boldsymbol{0}.
\end{aligned} \tag{9}$$

Intuitively, this shares information across flow timesteps via $\hat{\mathbf{z}}$, on top of the flow state $\hat{\mathbf{x}}$.
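The coupled update in (9) can be sketched as a loop that carries both the flow state and the denoising estimate. The self-conditioned denoiser below is a hypothetical stand-in, not the paper's model: it pulls $\mathbf{z}$ toward a fixed target `mu` with an assumed factor `ETA`, which is enough to show how $\hat{\mathbf{z}}$ is refined across flow timesteps.

```python
import numpy as np

ETA = 0.5  # hypothetical strength of the dependence on z (assumption)

def toy_sc_denoiser(t, x, z, mu):
    """Hypothetical self-conditioned denoiser: moves z toward the target mu."""
    target = np.broadcast_to(mu, x.shape)
    return target + ETA * (z - target)

def sc_euler_sample(x0, grid, mu):
    """Self-conditioned Euler scheme of eq. (9)."""
    x = x0
    z = np.zeros_like(x0)                        # z_0 = 0
    for t_cur, t_next in zip(grid[:-1], grid[1:]):
        z = toy_sc_denoiser(t_cur, x, z, mu)     # z_{t_{i+1}} = D(x_{t_i}, z_{t_i})
        b = (z - x) / (1.0 - t_cur)              # velocity built from the new estimate
        x = x + (t_next - t_cur) * b
    return x, z

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])                  # hypothetical clean embedding
grid = np.linspace(0.0, 1.0, 9)                  # N = 8 steps
x1, z_last = sc_euler_sample(rng.standard_normal(3), grid, mu)
residual = np.abs(z_last - mu).max()             # remaining error of the estimate
```

In this toy setting each flow step shrinks the error of $\hat{\mathbf{z}}$ by the factor `ETA`, so the estimate is refined only once per flow timestep rather than to convergence.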
Although self-conditioning empirically leads to substantial improvements in flow language models (Chen et al., [2026](https://arxiv.org/html/2607.00714#bib.bib4); Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3); Batzolis et al., [2026](https://arxiv.org/html/2607.00714#bib.bib11); Meshchaninov et al., [2026](https://arxiv.org/html/2607.00714#bib.bib12); Yang et al., [2026](https://arxiv.org/html/2607.00714#bib.bib15)), understanding of its mechanism is limited (Chen et al., [2022](https://arxiv.org/html/2607.00714#bib.bib7); Dieleman et al., [2022](https://arxiv.org/html/2607.00714#bib.bib8); Shabalin et al., [2025](https://arxiv.org/html/2607.00714#bib.bib20); [2026](https://arxiv.org/html/2607.00714#bib.bib19)). Furthermore, since the self-conditioned velocity ([9](https://arxiv.org/html/2607.00714#S3.E9)) no longer depends on the flow state alone but is also coupled with the updates of the denoising estimate, it is unclear how to use it for few-step distillation, which relies on the existence of a flow map characterized purely as an integral of the flow velocity ([5](https://arxiv.org/html/2607.00714#S2.E5)).
### 3.2 Self-conditioning induces a fixed-point iteration
We present our key intuition that a self\-conditioned denoiser, trained with \([7](https://arxiv.org/html/2607.00714#S3.E7)\), learns to improve its own predictions, implementing an iteration that approximately converges toward the ideal denoiser\.
Formally, a self-conditioned denoiser $\hat{D}_t(\mathbf{x}, \mathbf{z})$, given a choice of flow timestep $t$ and flow state $\mathbf{x}$, can be used to define the following iteration $\mathbf{z}^j$ for $j = 0, 1, \ldots$ that starts at some initialization $\mathbf{z}^0$:

$$\mathbf{z}^{j+1} = \hat{D}_t(\mathbf{x}, \mathbf{z}^j). \tag{10}$$

As the denoiser learns to correct its own prediction ([7](https://arxiv.org/html/2607.00714#S3.E7)), we posit that it characterizes a self-correcting process, eventually reaching a fixed point $\mathbf{z}^\star = \hat{D}_t(\mathbf{x}, \mathbf{z}^\star)$ (Gulrajani and Hashimoto, [2023](https://arxiv.org/html/2607.00714#bib.bib10)). We can think of this fixed point as a self-corrected approximation of the Bayes-optimal prediction $D_t(\mathbf{x})$, since that is the learning target as shown in [Proposition 3.1](https://arxiv.org/html/2607.00714#S3.Thmtheorem1). We make this intuition precise with a series of theoretical results.
First, we recall the notion of a contraction, which suffices for convergence to a unique fixed point\.
###### Definition 3.2 (Contraction).
A function $f : \mathbb{R}^d \to \mathbb{R}^d$ is a contraction on a closed set $O \subseteq \mathbb{R}^d$ with factor $0 \leq \eta < 1$ if it satisfies $f(O) \subseteq O$ and $|f(\mathbf{z}) - f(\mathbf{z}')| \leq \eta\,|\mathbf{z} - \mathbf{z}'|$ for every $\mathbf{z}, \mathbf{z}' \in O$.

By [Proposition 3.1](https://arxiv.org/html/2607.00714#S3.Thmtheorem1), a perfect denoiser $\bar{D}_t(\mathbf{x}, \cdot)$ with minimum loss maps directly to $D_t(\mathbf{x})$, making it a contraction with $\eta = 0$. When the denoiser $\hat{D}$ is learned, it is not always guaranteed to be contractive. Yet, in [Section A.2](https://arxiv.org/html/2607.00714#A1.SS2), we show that optimizing the self-conditioned loss leads to approximate contractivity. To simplify our analysis, we shall henceforth assume that the learned denoiser is contractive. This is in line with common assumptions in looped models (Romano et al., [2017](https://arxiv.org/html/2607.00714#bib.bib35); Ryu et al., [2019](https://arxiv.org/html/2607.00714#bib.bib34); Bai et al., [2019](https://arxiv.org/html/2607.00714#bib.bib25); Fung et al., [2022](https://arxiv.org/html/2607.00714#bib.bib27); Movahedi et al., [2026](https://arxiv.org/html/2607.00714#bib.bib26)) and supported by our results that the self-conditioned models ELF (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)) and LangFlow (Chen et al., [2026](https://arxiv.org/html/2607.00714#bib.bib4)) approximate fixed-point iterations. We show:

###### Proposition 3.3.
Fix $t$, $\mathbf{x}$, and $0 \leq \eta < 1$. Assume that $\hat{D}_t(\mathbf{x}, \cdot)$ is a contraction on a nonempty closed set $O \subseteq \mathbb{R}^{L \times d}$ with factor $\eta$. Then, the iteration $\mathbf{z}^{j+1} = \hat{D}_t(\mathbf{x}, \mathbf{z}^j)$ ([10](https://arxiv.org/html/2607.00714#S3.E10)) from any $\mathbf{z}^0 \in O$ satisfies the following properties:

(i) It converges exponentially to a unique fixed point $\mathbf{z}^\star \in O$.

(ii) Its denoising error at any iteration $j \geq 0$ is bounded as

$$|\mathbf{z}^j - D_t(\mathbf{x})| \leq |\hat{D}_t(\mathbf{x}, \mathbf{z}^0) - D_t(\mathbf{x})| + \frac{\eta - \eta^j}{1 - \eta}\,|\mathbf{z}^1 - \mathbf{z}^0|. \tag{11}$$

(iii) Its denoising error at the fixed point is bounded as

$$|\mathbf{z}^\star - D_t(\mathbf{x})| \leq |\hat{D}_t(\mathbf{x}, \mathbf{z}^0) - D_t(\mathbf{x})| + \frac{\eta}{1 - \eta}\,|\mathbf{z}^1 - \mathbf{z}^0|. \tag{12}$$
A proof is in [Section A.3](https://arxiv.org/html/2607.00714#A1.SS3). The result shows that the denoising loss at the initialization, together with the magnitude of the first iteration, controls the denoising error bound of multi-step iterations during inference up to the fixed point. It further shows that the error bound improves with iterations.
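The iteration (10) and the bound (11) can be checked numerically on a small example. The sketch below assumes a hypothetical affine contraction with factor `ETA` whose fixed point `Z_STAR` sits slightly away from the ideal prediction `D_TRUE`, mimicking an imperfectly trained denoiser; none of these values come from the paper. The check is run for $j \geq 1$, where the bound holds with equality at $j = 1$.

```python
import numpy as np

ETA = 0.6                                     # assumed contraction factor
D_TRUE = np.array([1.0, -1.0])                # stands in for the ideal D_t(x)
Z_STAR = D_TRUE + np.array([0.05, -0.02])     # fixed point of the imperfect denoiser

def sc_denoiser(z):
    """Hypothetical D_hat_t(x, .) for fixed t and x: an affine contraction in z."""
    return Z_STAR + ETA * (z - Z_STAR)

def iterate(z0, num_iters):
    """Run eq. (10) and return the whole trajectory z^0, ..., z^J."""
    zs = [z0]
    for _ in range(num_iters):
        zs.append(sc_denoiser(zs[-1]))
    return zs

zs = iterate(np.zeros(2), 20)
first_err = np.linalg.norm(zs[1] - D_TRUE)    # |D_hat(x, z^0) - D_t(x)|
first_step = np.linalg.norm(zs[1] - zs[0])    # |z^1 - z^0|

# Left- and right-hand sides of eq. (11) for j = 1, ..., 20.
errors = [np.linalg.norm(zs[j] - D_TRUE) for j in range(1, 21)]
bounds = [first_err + (ETA - ETA**j) / (1.0 - ETA) * first_step for j in range(1, 21)]
```

Here the iterates approach `Z_STAR` geometrically, and every entry of `errors` stays below the matching entry of `bounds`.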
Having formulated the relationship between a self-conditioned denoiser $\hat{D}$ and its fixed points for each pair of $t$ and $\mathbf{x}$, it is convenient to think of a function that directly predicts the fixed point. We call such a function the fixed-point denoiser $D^\star$, defined as follows for every $t \in [0,1]$ and $\mathbf{x} \in \mathbb{R}^{L \times d}$:

$$D_t^\star(\mathbf{x}) \coloneqq \mathbf{z}^\star = \hat{D}_t(\mathbf{x}, \mathbf{z}^\star), \tag{13}$$

where $\mathbf{z}^\star$ is the unique fixed point of $\hat{D}_t(\mathbf{x}, \cdot)$. The fixed-point denoiser is a self-conditioning-free denoiser that behaves the same as the given self-conditioned denoiser run until convergence. Its ideal target under [Proposition 3.1](https://arxiv.org/html/2607.00714#S3.Thmtheorem1) is the Bayes-optimal prediction $D_t(\mathbf{x})$, in which case we have $D^\star = D$.
### 3.3 Fixed-point flows
In [Section 3.1](https://arxiv.org/html/2607.00714#S3.SS1), we have seen that a self-conditioned flow appears to involve two states: the flow state and the self-conditioning state. In [Section 3.2](https://arxiv.org/html/2607.00714#S3.SS2), we have established that a self-conditioned denoiser induces a fixed-point iteration. Here, we propose to replace the self-conditioning state by its fixed point. Once this is done, the velocity depends only on the flow state, so the dynamics become an ordinary flow. We call this a fixed-point flow, and prove its fundamental properties as follows.
###### Proposition 3.4.
Assume that, for every $t \in [0,1)$ and $\mathbf{x}$, the self-conditioned denoiser $\hat{D}_t(\mathbf{x}, \cdot)$ has a unique fixed point $D_t^\star(\mathbf{x}) = \hat{D}_t(\mathbf{x}, D_t^\star(\mathbf{x}))$. Define the fixed-point velocity

$$b_t^\star(\mathbf{x}) \coloneqq \frac{D_t^\star(\mathbf{x}) - \mathbf{x}}{1 - t}, \quad t \in [0,1). \tag{14}$$

Then $b_t^\star$ depends only on $t$ and $\mathbf{x}$, and therefore the dynamics

$$\dot{\mathbf{x}}_t = b_t^\star(\mathbf{x}_t) \tag{15}$$

define an ordinary time-dependent ODE. Furthermore, when the fixed-point denoiser is Bayes optimal, $D^\star = D$, the fixed-point velocity recovers the true velocity, $b^\star = b$.

A proof is in [Section A.4](https://arxiv.org/html/2607.00714#A1.SS4). Equation ([14](https://arxiv.org/html/2607.00714#S3.E14)) says that at each flow time, the self-conditioned denoiser $\hat{D}$ runs the fixed-point iteration ([10](https://arxiv.org/html/2607.00714#S3.E10)) until convergence, and the resulting fixed point drives the flow. This suggests an Euler scheme for sampling $\hat{\mathbf{x}}_1 \sim p_1$ over a flow-time grid $0 = t_0 < \ldots < t_N = 1$:

$$\hat{\mathbf{x}}_{t_{i+1}} = \hat{\mathbf{x}}_{t_i} + (t_{i+1} - t_i)\,\frac{\hat{\mathbf{z}}_{t_i}^\star - \hat{\mathbf{x}}_{t_i}}{1 - t_i}, \tag{16}$$

where each $\hat{\mathbf{z}}_{t_i}^\star$ is obtained by solving an inner fixed-point iteration from an initialization $\hat{\mathbf{z}}_{t_i}^0$:

$$\hat{\mathbf{z}}_{t_i}^{j+1} = \hat{D}_{t_i}(\hat{\mathbf{x}}_{t_i}, \hat{\mathbf{z}}_{t_i}^{j}). \tag{17}$$

If the fixed-point iteration is run until convergence, the initialization does not affect the outcome. Therefore, a simple and valid approach is to always initialize with zero, $\hat{\mathbf{z}}_{t_i}^0 = \boldsymbol{0}$, which we call cold-start sampling of a fixed-point flow. On the other hand, choosing a good initialization can improve the efficiency of finding the fixed point. This can be done, for instance, with the fixed-point estimate of the previous flow timestep, $\hat{\mathbf{z}}_{t_i}^0 = \hat{\mathbf{z}}_{t_{i-1}}^\star$. We call this warm-start sampling, and show:

###### Proposition 3.5.
In the setting of [Proposition 3.3](https://arxiv.org/html/2607.00714#S3.Thmtheorem3), if $|\mathbf{z}^0 - \mathbf{z}^\star| > \varepsilon$, then it is enough to take

$$j \geq \frac{\log\left(|\mathbf{z}^0 - \mathbf{z}^\star| / \varepsilon\right)}{\log(1/\eta)} \tag{18}$$

iterations to guarantee $|\mathbf{z}^j - \mathbf{z}^\star| \leq \varepsilon$.
A proof is given in [Section A.5](https://arxiv.org/html/2607.00714#A1.SS5). The result shows that initializing closer to the fixed point reduces the number of iterations needed, which is the case for warm-start sampling if the fixed points are correlated locally in $t$.
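The iteration count in (18) is easy to evaluate for concrete numbers. The sketch below compares a cold start at zero with a warm start near the fixed point. The contraction factor, tolerance, and both starting points are made-up values chosen only for illustration.

```python
import math
import numpy as np

def iterations_needed(dist0, eps, eta):
    """Smallest integer j satisfying eq. (18); 0 if the start is already within eps."""
    if dist0 <= eps:
        return 0
    return math.ceil(math.log(dist0 / eps) / math.log(1.0 / eta))

ETA, EPS = 0.5, 1e-3                      # assumed contraction factor and tolerance
z_star = np.array([1.0, -2.0, 0.5])       # hypothetical fixed point at the current timestep
z_prev_star = z_star + 0.01               # hypothetical fixed point from the previous timestep

cold_dist = np.linalg.norm(np.zeros(3) - z_star)    # cold start: z^0 = 0
warm_dist = np.linalg.norm(z_prev_star - z_star)    # warm start: z^0 = previous fixed point
cold_iters = iterations_needed(cold_dist, EPS, ETA)
warm_iters = iterations_needed(warm_dist, EPS, ETA)

def run(z0, num_iters):
    """Apply a toy affine contraction with fixed point z_star for num_iters steps."""
    z = z0
    for _ in range(num_iters):
        z = z_star + ETA * (z - z_star)
    return z

cold_err = np.linalg.norm(run(np.zeros(3), cold_iters) - z_star)
warm_err = np.linalg.norm(run(z_prev_star, warm_iters) - z_star)
```

Both runs end within `EPS` of the fixed point, but the warm start needs fewer iterations because its initial distance is smaller.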
With this, we can view conventional self-conditioned generation in ([9](https://arxiv.org/html/2607.00714#S3.E9)) as warm-start sampling that approximates fixed points with a single iteration, $\hat{\mathbf{z}}_{t_i}^\star \approx \hat{D}_{t_i}(\hat{\mathbf{x}}_{t_i}, \hat{\mathbf{z}}_{t_{i-1}}^\star)$. This implies that the coupling of the flow state and the self-conditioning state in ([9](https://arxiv.org/html/2607.00714#S3.E9)) is not a defining property of self-conditioned flows: it is merely a byproduct of warm starts, a sampling heuristic. We indeed find that cold-start sampling with sufficient fixed-point iterations can replace warm starts in ELF (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)) and LangFlow (Chen et al., [2026](https://arxiv.org/html/2607.00714#bib.bib4)).
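The samplers discussed above differ only in how the inner iteration (17) is initialized and how long it runs. The following sketch makes this explicit in an assumed one-dimensional-per-coordinate Gaussian toy model (noise $N(0,1)$, data $N(M, \sigma^2)$), where the fixed-point denoiser has a closed form. The self-conditioned denoiser, the constants, and the function names are hypothetical.

```python
import numpy as np

M, SIGMA, ETA = 2.0, 0.5, 0.5  # hypothetical toy constants

def fixed_point_denoiser(t, x):
    """Closed-form E[x1 | I_t = x] for N(0, 1) noise and N(M, SIGMA^2) data."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    return M + t * SIGMA**2 * (x - t * M) / var_t

def sc_denoiser(t, x, z):
    """Hypothetical contractive self-conditioned denoiser with the fixed point above."""
    target = fixed_point_denoiser(t, x)
    return target + ETA * (z - target)

def sample(x0, grid, inner_iters=None, warm_start=False):
    """Euler scheme (16); inner_iters=None uses the exact fixed point."""
    x = x0
    z = np.zeros_like(x0)
    for t_cur, t_next in zip(grid[:-1], grid[1:]):
        if inner_iters is None:
            z = fixed_point_denoiser(t_cur, x)
        else:
            if not warm_start:
                z = np.zeros_like(x)              # cold start: reset to zero
            for _ in range(inner_iters):
                z = sc_denoiser(t_cur, x, z)      # inner iteration, eq. (17)
        x = x + (t_next - t_cur) * (z - x) / (1.0 - t_cur)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
grid = np.linspace(0.0, 1.0, 9)
exact = sample(x0, grid)
cold = sample(x0, grid, inner_iters=40)
warm = sample(x0, grid, inner_iters=40, warm_start=True)
warm_single = sample(x0, grid, inner_iters=1, warm_start=True)  # same update as eq. (9)
```

With enough inner iterations, cold and warm starts both reproduce the exact fixed-point flow sample, while `warm_single` corresponds to conventional self-conditioned generation.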
Taking this idea further, we find that it is possible to take a self-conditioned flow model and turn it into a self-conditioning-free model by distilling the self-conditioned denoiser $\hat{D}$ into the fixed-point denoiser $D^\star$. This can be done, for instance, via the following fixed-point distillation loss:

$$\mathcal{L}(D^\star) = \int_0^1 \mathbb{E}\,|D_t^\star(I_t) - \mathbf{z}^\star|^2 \,\mathrm{d}t, \quad \mathbf{z}^\star = \hat{D}_t(I_t, \mathbf{z}^\star), \tag{19}$$

where the target $\mathbf{z}^\star$ is estimated by iterating $\mathbf{z}^{j+1} = \hat{D}_t(I_t, \mathbf{z}^j)$ from $\mathbf{z}^0 = \boldsymbol{0}$. Any other fixed-point distillation method can be used instead of ([19](https://arxiv.org/html/2607.00714#S3.E19)), such as consistency distillation (Lin et al., [2026](https://arxiv.org/html/2607.00714#bib.bib21)). The resulting fixed-point velocity $b^\star$ depends only on the flow state, so generation can use the usual Euler scheme ([2](https://arxiv.org/html/2607.00714#S2.E2)).
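A Monte Carlo estimate of the loss (19) can be sketched as follows. The teacher is a hypothetical contractive self-conditioned denoiser in an assumed scalar Gaussian toy model, and the student is any callable `student(t, x)`. A real student would be a network trained by gradient descent on this quantity, whereas the sketch only evaluates the loss.

```python
import numpy as np

M, SIGMA, ETA = 2.0, 0.5, 0.5  # hypothetical toy constants

def teacher_fixed_point(t, x):
    """Closed-form fixed point of the toy teacher (Gaussian posterior mean)."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    return M + t * SIGMA**2 * (x - t * M) / var_t

def teacher(t, x, z):
    """Hypothetical self-conditioned teacher D_hat_t(x, z), contractive in z."""
    target = teacher_fixed_point(t, x)
    return target + ETA * (z - target)

def distillation_loss(student, num_samples=4096, num_iters=30, seed=0):
    """Monte Carlo estimate of eq. (19) with targets from cold-start iteration."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, size=num_samples)
    x0 = rng.standard_normal(num_samples)               # noise
    x1 = M + SIGMA * rng.standard_normal(num_samples)   # data
    i_t = (1.0 - t) * x0 + t * x1                       # interpolant I_t
    z = np.zeros(num_samples)                           # z^0 = 0
    for _ in range(num_iters):
        z = teacher(t, i_t, z)                          # z^{j+1} = D_hat_t(I_t, z^j)
    return np.mean((student(t, i_t) - z) ** 2)

loss_exact = distillation_loss(teacher_fixed_point)            # ideal student
loss_zero = distillation_loss(lambda t, x: np.zeros_like(x))   # trivial student
```

The ideal student attains a loss that is zero up to the truncation of the inner iteration, while the trivial student does not.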
### 3\.4Fixed\-point flow maps
In[Section3\.3](https://arxiv.org/html/2607.00714#S3.SS3), we have formalized fixed\-point flows, which express self\-conditioned flows via ordinary velocity fields\. We leverage this insight to define their associated flow maps, which allows us to design few\-step distillation methods for self\-conditioned flow language models\.
###### Proposition 3\.6\. On any interval where the fixed\-point flow ODE \([15](https://arxiv.org/html/2607.00714#S3.E15)\) has unique solutions, thefixed\-point flow map, its solution operatorXs,t⋆\(𝐱s\)=𝐱tX\_\{s,t\}^\{\\star\}\(\{\\bf x\}\_\{s\}\)=\{\\bf x\}\_\{t\}, satisfiesXs,t⋆\(𝐱\)=𝐱\+∫stbτ⋆\(𝐱τ\)dτ\.X^\{\\star\}\_\{s,t\}\(\{\\bf x\}\)=\{\\bf x\}\+\\int\_\{s\}^\{t\}b\_\{\\tau\}^\{\\star\}\(\{\\bf x\}\_\{\\tau\}\)\\,\{\\rm d\}\\tau\.\(20\)Moreover, for0≤s≤u≤t<10\\leq s\\leq u\\leq t<1,Xs,t⋆=Xu,t⋆∘Xs,u⋆\.X\_\{s,t\}^\{\\star\}=X\_\{u,t\}^\{\\star\}\\circ X\_\{s,u\}^\{\\star\}\.\(21\)
A proof is in [Section A.6](https://arxiv.org/html/2607.00714#A1.SS6). The result shows that $X^{\star}$ is a valid flow map and admits conditions such as ([21](https://arxiv.org/html/2607.00714#S3.E21)) that can be leveraged to learn it through distillation from the velocity $b_{t}^{\star}$. In order to learn the flow map, we parameterize it with a *two-time denoiser* $\delta_{s,t}$ defined as follows (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2); Lu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib32)):
$$\delta_{s,t}(\mathbf{x})\coloneqq\mathbf{x}+(1-s)\,v_{s,t}(\mathbf{x}), \tag{22}$$
where $v_{s,t}$ is the average velocity from $s$ to $t$, defined as follows (Boffi et al., [2026](https://arxiv.org/html/2607.00714#bib.bib22)):
$$v_{s,t}(\mathbf{x})\coloneqq\frac{X_{s,t}^{\star}(\mathbf{x})-\mathbf{x}}{t-s},\quad v_{t,t}(\mathbf{x})\coloneqq b_{t}^{\star}(\mathbf{x}). \tag{23}$$
Analogous to the relationship between the single-time denoiser and velocity, $D_{t}^{\star}(\mathbf{x})=\mathbf{x}+(1-t)\,b_{t}^{\star}(\mathbf{x})$, the two-time denoiser can be viewed as a single denoising step to the clean data domain. Its outputs are known to lie in a low-dimensional manifold (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2); Lu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib32)), making learning easier. We now show:
###### Proposition 3.7.
The two-time denoiser $\delta_{s,t}$ satisfies the following properties:

(i) For $0\leq s<t<1$, the flow map is recovered from $\delta_{s,t}$ by
$$X_{s,t}^{\star}(\mathbf{x})=\frac{1-t}{1-s}\,\mathbf{x}+\frac{t-s}{1-s}\,\delta_{s,t}(\mathbf{x}). \tag{24}$$
(ii) On the diagonal,
$$\delta_{t,t}(\mathbf{x})=D_{t}^{\star}(\mathbf{x}). \tag{25}$$
(iii) For $0\leq s<u<t<1$, the semigroup condition ([21](https://arxiv.org/html/2607.00714#S3.E21)) is equivalent to
$$\delta_{s,t}(\mathbf{x})=\gamma\,\delta_{s,u}(\mathbf{x})+(1-\gamma)\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr), \tag{26}$$
where $\gamma=\tfrac{(1-t)(u-s)}{(1-u)(t-s)}\in[0,1]$.
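The weight $\gamma$ in (26) can be checked with a short calculation that uses only (24) and (21). The following is a sketch of that algebra, with no claim to cover the regularity details. Applying (24) twice to the composition gives

$$
\begin{aligned}
X_{u,t}^{\star}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr)
&=\frac{1-t}{1-u}\,X_{s,u}^{\star}(\mathbf{x})+\frac{t-u}{1-u}\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr)\\
&=\frac{1-t}{1-s}\,\mathbf{x}+\frac{(1-t)(u-s)}{(1-u)(1-s)}\,\delta_{s,u}(\mathbf{x})+\frac{t-u}{1-u}\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr).
\end{aligned}
$$

Equating this with $X_{s,t}^{\star}(\mathbf{x})=\frac{1-t}{1-s}\mathbf{x}+\frac{t-s}{1-s}\delta_{s,t}(\mathbf{x})$, cancelling the $\mathbf{x}$ term, and multiplying by $\frac{1-s}{t-s}$ yields

$$
\delta_{s,t}(\mathbf{x})=\underbrace{\frac{(1-t)(u-s)}{(1-u)(t-s)}}_{\gamma}\,\delta_{s,u}(\mathbf{x})+\frac{(t-u)(1-s)}{(1-u)(t-s)}\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr),
\qquad
1-\gamma=\frac{(1-u)(t-s)-(1-t)(u-s)}{(1-u)(t-s)}=\frac{(t-u)(1-s)}{(1-u)(t-s)}.
$$

Both weights are positive for $s<u<t<1$ and sum to one, so the right-hand side of (26) is a convex combination of two denoiser outputs.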
A proof is in [Section A.7](https://arxiv.org/html/2607.00714#A1.SS7). The result shows that learning the two-time denoiser yields the flow map, and that the two-time denoiser matches the fixed-point denoiser $D^{\star}$ ([13](https://arxiv.org/html/2607.00714#S3.E13)) on the diagonal $s=t$, while satisfying a self-consistency criterion off the diagonal. These conditions lead to a learning objective:
###### Proposition 3.8.
Consider the population objective
$$\mathcal{L}(\delta)=\mathbb{E}_{s,u,t,\mathbf{x}}\bigl|\delta_{s,t}(\mathbf{x})-\mathsf{sg}\bigl(\mathcal{T}(\delta)_{s,u,t}(\mathbf{x})\bigr)\bigr|^{2}+\mathbb{E}_{t,\mathbf{x}}\bigl|\delta_{t,t}(\mathbf{x})-D_{t}^{\star}(\mathbf{x})\bigr|^{2} \tag{27}$$
where
$$\mathcal{T}(\delta)_{s,u,t}(\mathbf{x}):=\gamma\,\delta_{s,u}(\mathbf{x})+(1-\gamma)\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr). \tag{28}$$
Then the true two-time denoiser has zero loss. Conversely, any zero-loss solution satisfies the diagonal ([25](https://arxiv.org/html/2607.00714#S3.E25)) and semigroup ([26](https://arxiv.org/html/2607.00714#S3.E26)) conditions almost surely under the training distribution.
A proof is given in [Section A.8](https://arxiv.org/html/2607.00714#A1.SS8). The first loss term enforces the semigroup condition, and the second loss term anchors the diagonal to the fixed point $D_{t}^{\star}(I_{t})=\mathbf{z}^{\star}=\hat{D}_{t}(I_{t},\mathbf{z}^{\star})$. Evaluating the latter requires fixed points $\mathbf{z}^{\star}$, which we may make available in one of two ways: an offline route that first distills the fixed-point denoiser $D^{\star}$ as a separate model via ([19](https://arxiv.org/html/2607.00714#S3.E19)), or an online route that performs the two compressions in ([19](https://arxiv.org/html/2607.00714#S3.E19)) and ([27](https://arxiv.org/html/2607.00714#S3.E27)) jointly, estimating $\mathbf{z}^{\star}$ with a few iterations of $\mathbf{z}^{j+1}=\hat{D}_{t}(I_{t},\mathbf{z}^{j})$ from $\mathbf{z}^{0}=\boldsymbol{0}$. The online approach ultimately achieves $\delta_{t,t}\approx D_{t}^{\star}$ without training a separate model.
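To make the two terms of (27) concrete, here is a rough NumPy sketch evaluated on a toy flow with a closed-form flow map. Several things are assumptions made for illustration: the linear toy velocity $b(\mathbf{x})=A\mathbf{x}$ (which is not a denoising flow), the choice to obtain $X_{s,u}^{\star}$ from $\delta$ itself via (24), the use of the same batch for both expectations, and fixed times instead of sampled ones. NumPy has no autograd, so the stop-gradient $\mathsf{sg}$ appears only as a comment.

```python
import numpy as np

A = -0.7  # hypothetical rate of the toy velocity b(x) = A * x

def exact_delta(s, t, x):
    """Two-time denoiser (22) of the toy flow, whose flow map is x * exp(A * (t - s))."""
    if np.isclose(s, t):
        avg_velocity = A * x  # diagonal case v_{t,t} = b_t
    else:
        avg_velocity = x * np.expm1(A * (t - s)) / (t - s)
    return x + (1.0 - s) * avg_velocity

def d_star(t, x):
    """Single-time denoiser D_t(x) = x + (1 - t) * b_t(x) of the toy flow."""
    return x + (1.0 - t) * A * x

def flow_map_from_delta(delta, s, t, x):
    """Recover the flow map from the two-time denoiser via (24)."""
    return (1.0 - t) / (1.0 - s) * x + (t - s) / (1.0 - s) * delta(s, t, x)

def distillation_loss(delta, d_star, s, u, t, x):
    """Semigroup term plus diagonal term of (27) for fixed times s < u < t."""
    gamma = (1.0 - t) * (u - s) / ((1.0 - u) * (t - s))
    x_u = flow_map_from_delta(delta, s, u, x)
    # In an autograd framework this target would be wrapped in a stop-gradient.
    target = gamma * delta(s, u, x) + (1.0 - gamma) * delta(u, t, x_u)
    semigroup = np.mean(np.sum((delta(s, t, x) - target) ** 2, axis=-1))
    diagonal = np.mean(np.sum((delta(t, t, x) - d_star(t, x)) ** 2, axis=-1))
    return float(semigroup + diagonal)

x = np.random.default_rng(0).normal(size=(4, 3))
loss_exact = distillation_loss(exact_delta, d_star, 0.1, 0.4, 0.8, x)
loss_perturbed = distillation_loss(
    lambda s, t, x: exact_delta(s, t, x) + 0.1, d_star, 0.1, 0.4, 0.8, x
)
```

The exact toy denoiser attains zero loss up to floating-point error, while a shifted denoiser is penalized, mirroring the first statement of Proposition 3.8.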
## 4 Experiments
Through our experiments, we aim to answer the following key questions:
1. Q1. Do self-conditioned flow language models learn fixed-point iterations?
2. Q2. Is self-conditioned generation merely offering better initialization of fixed points?
3. Q3. Can we train self-conditioning-free flows competitive with self-conditioned ones?
4. Q4. Can we train a flow map language model that leverages self-conditioning?
#### Setup\.
We conduct all experiments on the OpenWebText dataset (Gokaslan et al., [2019](https://arxiv.org/html/2607.00714#bib.bib28)), a standard corpus for language modeling, with a sequence length of 1024 tokens. We assess generation quality with generative perplexity (gPPL) (Dieleman et al., [2022](https://arxiv.org/html/2607.00714#bib.bib8); Sahoo et al., [2025](https://arxiv.org/html/2607.00714#bib.bib30)) measured with pretrained GPT-2 Large (Radford et al., [2019](https://arxiv.org/html/2607.00714#bib.bib29)), together with average per-sample unigram entropy. We regard a model as strong only when its generations attain low gPPL while remaining close to the data entropy of 5.44 nats (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2)). We use 256 generated samples for analysis ([Section 4.1](https://arxiv.org/html/2607.00714#S4.SS1)) and 1024 for distillation ([Section 4.2](https://arxiv.org/html/2607.00714#S4.SS2)). See [Appendix B](https://arxiv.org/html/2607.00714#A2) for more experimental details.
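For readers unfamiliar with the entropy metric, the sketch below shows one common way to compute an average per-sample unigram entropy in nats. It assumes the metric is the Shannon entropy of the empirical token-frequency distribution within each generated sequence, averaged over sequences; the function names are hypothetical and the paper's exact implementation may differ.

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (nats) of the empirical token distribution of one sample."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def mean_unigram_entropy(samples):
    """Average of the per-sample unigram entropies over a list of token sequences."""
    return sum(unigram_entropy(s) for s in samples) / len(samples)

# Toy token-id sequences: one with two equally frequent tokens, one degenerate.
samples = [[1, 1, 2, 2], [5, 5, 5]]
avg_entropy = mean_unigram_entropy(samples)
```

A degenerate sample that repeats a single token has entropy zero, which is why low gPPL alone is not taken as evidence of quality.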
### 4.1 Analyzing self-conditioned flow language models
To answer whether self-conditioned denoisers induce fixed-point iterations (Q1) and whether the gains of self-conditioned sampling come from improved initialization of fixed-point iterations (Q2), we study ELF (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)) and LangFlow (Chen et al., [2026](https://arxiv.org/html/2607.00714#bib.bib4)) as representatives of self-conditioned flow language models on learned embeddings and one-hot token embeddings, respectively. We draw gPPL-entropy frontier curves (Pynadath et al., [2025](https://arxiv.org/html/2607.00714#bib.bib36)) by sweeping the sampling parameter $\gamma$ for ELF and the softmax temperature for LangFlow.
#### Self\-conditioning induces a fixed\-point iteration \(Q1\)\.
We first ask whether self-conditioned denoisers learn the fixed-point iteration ([10](https://arxiv.org/html/2607.00714#S3.E10)). Following Lin et al. ([2026](https://arxiv.org/html/2607.00714#bib.bib21)), we track the relative distance to the fixed point, $|\mathbf{z}^{j}-\mathbf{z}^{\star}|/|\mathbf{z}^{\star}|$, which equals one at $j=0$ when cold-start sampling is used ($\mathbf{z}^{0}=\mathbf{0}$). To approximate $\mathbf{z}^{\star}$, we run 200 damped Picard iterations (Lauriere, [2021](https://arxiv.org/html/2607.00714#bib.bib39)) with a fixed damping parameter $0.3$ (see [Section B.1](https://arxiv.org/html/2607.00714#A2.SS1) for details). As shown in [Figure 2](https://arxiv.org/html/2607.00714#S4.F2), the relative distance decays across flow time $t$ for both models. This supports the view that the models have approximately learned a fixed-point iteration, as posited in [Section 3.2](https://arxiv.org/html/2607.00714#S3.SS2).
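The measurement described above can be sketched as follows. This is an illustration under stated assumptions: the damping convention (weight 0.3 on the new iterate) is a guess at one standard form of damped Picard iteration, and `toy_d_hat` is a hypothetical contraction standing in for a trained self-conditioned denoiser at a fixed flow time.

```python
import numpy as np

def damped_picard(d_hat, x_t, damping=0.3, num_iters=200):
    """Reference fixed point via z <- (1 - damping) * z + damping * d_hat(x_t, z)."""
    z = np.zeros_like(x_t)
    for _ in range(num_iters):
        z = (1.0 - damping) * z + damping * d_hat(x_t, z)
    return z

def relative_distances(d_hat, x_t, z_star, num_iters=10):
    """Track |z^j - z*| / |z*| along the plain cold-start iteration z^0 = 0."""
    z = np.zeros_like(x_t)
    dists = [np.linalg.norm(z - z_star) / np.linalg.norm(z_star)]
    for _ in range(num_iters):
        z = d_hat(x_t, z)
        dists.append(np.linalg.norm(z - z_star) / np.linalg.norm(z_star))
    return dists

# Hypothetical toy denoiser: contraction in z with factor 0.5.
def toy_d_hat(x_t, z):
    return 0.5 * z + 0.5 * np.tanh(x_t)

x_t = np.array([0.3, -1.2, 2.0])
z_star = damped_picard(toy_d_hat, x_t)
dists = relative_distances(toy_d_hat, x_t, z_star)
```

By construction the first entry of `dists` is one, and for a contraction the sequence decreases geometrically; for a trained model the decay rate is what the experiment measures.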
We now ask whether the fixed-point iteration converges towards the ideal denoiser ([Section 3.2](https://arxiv.org/html/2607.00714#S3.SS2)). As shown in [Table 1](https://arxiv.org/html/2607.00714#S4.SS1.SSS0.Px1), gPPL decreases as the number of fixed-point iterations increases on both ELF and LangFlow, whereas entropy remains stable near that of the data distribution (5.44 nats). Taking more steps toward the fixed point thus produces a better denoiser, in agreement with [Proposition 3.3](https://arxiv.org/html/2607.00714#S3.Thmtheorem3).


Figure 2: Convergence towards the fixed point across fixed-point iterations.

Table 1: Effect of fixed-point iterations. Increasing the iterations enhances generation quality.

Figure 3: Warm-start and cold-start sampling with 1 and 100 fixed-point iterations.
#### Self\-conditioned generation merely offers a better initialization \(Q2\)\.
In [Section 3.3](https://arxiv.org/html/2607.00714#S3.SS3), we viewed conventional generation of self-conditioned flows as a warm-start sampling scheme that attempts to use a better initialization for the fixed-point iteration. Accordingly, we conjectured that cold-start sampling would reach a similar performance if sufficient iterations are used, as it would eventually find the fixed point regardless of initialization. To validate this, we draw the gPPL–entropy frontier of each sampling scheme in [Figure 3](https://arxiv.org/html/2607.00714#S4.F3). When we use a single fixed-point iteration, warm-start outperforms cold-start sampling, demonstrating that warm-start offers a good initialization around the fixed point. With 100 iterations, however, warm- and cold-start attain the same frontier, indicating that the choice of initialization has a negligible effect on generation once enough fixed-point iterations are run.
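The contrast between the two schemes can be illustrated with a toy Euler sampler. This is a sketch under strong assumptions: warm-start is modeled as carrying the self-conditioning estimate `z` over from the previous flow step, cold-start as resetting it to zero at every step, and `toy_d_hat` is a hypothetical contraction toward a fixed target that ignores `t` and `x`. The velocity is obtained from the denoiser through the relation $D = \mathbf{x} + (1-t)\,b$.

```python
import numpy as np

def sample(d_hat, x0, num_steps, fp_iters, warm_start):
    """Euler sampling with fp_iters fixed-point iterations per flow step."""
    x = x0.copy()
    z = np.zeros_like(x0)
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        if not warm_start:
            z = np.zeros_like(x0)  # cold start: restart from zero at every flow step
        for _ in range(fp_iters):
            z = d_hat(t, x, z)
        velocity = (z - x) / (1.0 - t)  # from D = x + (1 - t) * b
        x = x + (t_next - t) * velocity
    return x

TARGET = np.array([1.0, -2.0, 0.5])  # hypothetical clean sample

def toy_d_hat(t, x, z):
    return 0.5 * z + 0.5 * TARGET  # contraction in z with fixed point TARGET

def error(x):
    return float(np.linalg.norm(x - TARGET))

x0 = np.zeros(3)
warm_1 = sample(toy_d_hat, x0, num_steps=8, fp_iters=1, warm_start=True)
cold_1 = sample(toy_d_hat, x0, num_steps=8, fp_iters=1, warm_start=False)
warm_100 = sample(toy_d_hat, x0, num_steps=8, fp_iters=100, warm_start=True)
cold_100 = sample(toy_d_hat, x0, num_steps=8, fp_iters=100, warm_start=False)
```

In this toy setting a single warm-started iteration per step ends much closer to the target than a single cold-started one, while the two schemes coincide once each step runs enough iterations.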
### 4.2 Distilling self-conditioning into fixed-point flow maps
To test self-conditioning-free distillation (Q3) and self-conditioned flow map distillation (Q4), we train ELF (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)) as a teacher model, replacing its encoder with GPT-2 Large (Radford et al., [2019](https://arxiv.org/html/2607.00714#bib.bib29)) for tokenizer-matched comparison with baselines.
Table 2: Comparison with few-step language models on OpenWebText. FMLM⋆ outperforms all baselines that preserve data-level entropy (5.44 nats).

Figure 4: Comparison between ELF⋆ and its teacher model ELF.

#### Self-conditioning is removable (Q3).
We find that, given a self-conditioned denoiser $\hat{D}$ (ELF in our case), we can learn its associated fixed-point denoiser $D^{\star}$ ([13](https://arxiv.org/html/2607.00714#S3.E13)) through fixed-point distillation, which yields the fixed-point velocity ([14](https://arxiv.org/html/2607.00714#S3.E14)), a self-conditioning-free model. Herein, we consider CDEQ (Lin et al., [2026](https://arxiv.org/html/2607.00714#bib.bib21)), an existing fixed-point distillation method. Using this method, we distill the ELF teacher into a self-conditioning-free model ELF⋆, with implementation details provided in [Appendix B](https://arxiv.org/html/2607.00714#A2). We compare the 8-step generation performance of ELF⋆ against 32-step generation of the teacher using gPPL-entropy frontier curves. As shown in [Figure 4](https://arxiv.org/html/2607.00714#S4.F4), ELF⋆ matches the teacher's frontier, confirming that self-conditioning is removable via fixed-point distillation without degrading quality.

Figure 5: Comparison between FMLM⋆ and teacher model ELF.
#### Self\-conditioning is distillable into a flow map \(Q4\)\.
Finally, we ask whether self-conditioning can be leveraged to train a few-step flow map. To this end, we distill the self-conditioned ELF teacher into a self-conditioned flow map language model FMLM⋆, parameterized by the two-time denoiser $\delta_{s,t}$ ([22](https://arxiv.org/html/2607.00714#S3.E22)). Following the offline route proposed in [Section 3.4](https://arxiv.org/html/2607.00714#S3.SS4), we take the fixed-point denoiser ELF⋆ as teacher and learn its associated fixed-point flow map with the semigroup distillation objective ([27](https://arxiv.org/html/2607.00714#S3.E27)). As illustrated in [Figure 5](https://arxiv.org/html/2607.00714#S4.F5), FMLM⋆ with 8 generation steps approaches the 32-step frontier of the self-conditioned teacher ELF.
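Few-step generation with a two-time denoiser can be sketched by chaining the update (24) over a time grid. The uniform grid, the final step taken all the way to $t=1$ (where (24) reduces to returning $\delta_{s,1}$), and the closed-form `toy_delta` for a linear toy velocity $b(\mathbf{x})=A\mathbf{x}$ are all assumptions for illustration, not the paper's sampler.

```python
import numpy as np

def flow_map_sample(delta, x0, num_steps):
    """Chain x <- (1-t)/(1-s) * x + (t-s)/(1-s) * delta(s, t, x) over a uniform grid."""
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    x = x0
    for s, t in zip(ts[:-1], ts[1:]):
        x = (1.0 - t) / (1.0 - s) * x + (t - s) / (1.0 - s) * delta(s, t, x)
    return x

A = -0.7  # hypothetical rate; the exact flow map of b(x) = A * x is x * exp(A * (t - s))

def toy_delta(s, t, x):
    """Exact two-time denoiser (22) of the toy flow for s < t."""
    return x + (1.0 - s) * x * np.expm1(A * (t - s)) / (t - s)

x0 = np.array([1.5, -0.3, 2.0])
one_step = flow_map_sample(toy_delta, x0, num_steps=1)
eight_step = flow_map_sample(toy_delta, x0, num_steps=8)
```

Because the toy denoiser is exact, the semigroup property makes the one-step and eight-step outputs agree; with a learned denoiser the gap between them reflects how well the semigroup condition was enforced.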
We then compare FMLM⋆ in one-step and few-step generation settings on OpenWebText against recent few-step distillation baselines: discrete diffusion models Duo with DCD (Sahoo et al., [2025](https://arxiv.org/html/2607.00714#bib.bib30)), MDLM with SDTT (Deschenaux and Gulcehre, [2024](https://arxiv.org/html/2607.00714#bib.bib31)), and both with Di4C (Hayakawa et al., [2024](https://arxiv.org/html/2607.00714#bib.bib33)), and continuous flow map language models FMLM (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2)) and DFM (Potaptchik et al., [2026](https://arxiv.org/html/2607.00714#bib.bib6)), reporting gPPL and entropy. As shown in [Table 2](https://arxiv.org/html/2607.00714#S4.T2), FMLM⋆ attains the best gPPL among baselines that preserve data-level entropy (5.44 nats) in the one- and few-step regimes, advancing the state of the art under a matched sampling budget. Qualitative samples can be found in [Appendix C](https://arxiv.org/html/2607.00714#A3).
Table 3: Comparison of FMLM⋆ with online and offline fixed-point distillation. The performance of online distillation saturates around 9 fixed-point iterations and nearly matches offline distillation.

To reduce training cost, we also consider online distillation, which avoids training a separate fixed-point denoiser (ELF⋆ in our case). FMLM⋆ instead distills directly from the self-conditioned teacher (ELF) using a fixed number of cold-start fixed-point iterations at each training step. As shown in [Table 3](https://arxiv.org/html/2607.00714#S4.T3), its performance saturates around 9 fixed-point iterations (FPIs), and the resulting online-distilled FMLM⋆ achieves competitive quality at approximately $0.3\times$ the training cost of offline-distilled FMLM⋆. Together, these results show that self-conditioning can be leveraged to build a state-of-the-art few-step flow map through either two-stage offline distillation or cheaper one-stage online distillation (Q4).
## 5 Conclusion
We demonstrated that the self\-conditioned flow language model implicitly learns a fixed\-point iteration that refines its own denoising estimate, approximately converging toward the ideal denoiser\. Such convergence is provable under a contractivity assumption, and we observe it empirically in existing self\-conditioned models\. Based on this viewpoint, we proposed fixed\-point flows, a class of self\-conditioned flows organized along two axes: the flow over time and the fixed\-point iteration at each flow time, which also offers an understanding of conventional self\-conditioned sampling as a single\-step warm\-started fixed\-point iteration\. Building on fixed\-point flows, we proposed a method to distill self\-conditioned flows into a flow map language model FMLM⋆that enables few\-step generation while inheriting the excellent performance of self\-conditioned flows\. On OpenWebText, FMLM⋆achieves state\-of\-the\-art one\- and few\-step generation under matched sampling budgets\.
#### Limitations and future work\.
We restrict our flow map learning to the distillation setting which requires a separate teacher model \(e\.g\., ELF or ELF⋆\)\. A promising next step is extending to self\-distillation that only uses data supervision, simultaneously learning the self\-conditioned denoiser, fixed\-point denoiser, and two\-time denoiser in a single model\. In addition, we have limited our scope to text data, although our formulation is general and would apply to other modalities such as graphs, images, and videos\. Understanding self\-conditioning in these modalities would prove interesting\.
## References
- M\. Albergo, N\. M\. Boffi, and E\. Vanden\-Eijnden \(2025\)Stochastic interpolants: a unifying framework for flows and diffusions\.Journal of Machine Learning Research26\(209\),pp\. 1–80\.Cited by:[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.13),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.8)\.
- S\. Bai, J\. Z\. Kolter, and V\. Koltun \(2019\)Deep equilibrium models\.Advances in neural information processing systems32\.Cited by:[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p5.4)\.
- G\. Batzolis, M\. Girolami, and L\. Ambrogioni \(2026\)Towards closing the autoregressive gap in language modeling via entropy\-gated continuous bitstream diffusion\.arXiv preprint arXiv:2605\.07013\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p2.1),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1)\.
- N\. Boffi, M\. Albergo, and E\. Vanden\-Eijnden \(2026\)How to build a consistency model: learning flow maps via self\-distillation\.Advances in Neural Information Processing Systems38,pp\. 33346–33382\.Cited by:[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px2.p2.1),[§3\.4](https://arxiv.org/html/2607.00714#S3.SS4.p3.6)\.
- J\. Chemseddine, G\. Kornhardt, and G\. Steidl \(2026\)Spherical flows for sampling categorical data\.arXiv preprint arXiv:2605\.05629\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p1.1)\.
- T\. Chen, R\. Zhang, and G\. Hinton \(2022\)Analog bits: generating discrete data using diffusion models with self\-conditioning\.arXiv preprint arXiv:2208\.04202\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p2.1),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p1.4),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1)\.
- Y\. Chen, C\. Liang, H\. Sui, R\. Guo, C\. Cheng, J\. You, and G\. Liu \(2026\)LangFlow: continuous diffusion rivals discrete in language modeling\.arXiv preprint arXiv:2604\.11748\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p2.1),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1),[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p5.4),[§3\.3](https://arxiv.org/html/2607.00714#S3.SS3.p6.1),[§4\.1](https://arxiv.org/html/2607.00714#S4.SS1.p1.1)\.
- J\. Deschenaux and C\. Gulcehre \(2024\)Beyond autoregression: fast llms via self\-distillation through time\.arXiv preprint arXiv:2410\.21035\.Cited by:[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.SSS0.Px2.p2.2)\.
- J\. Deschenaux and C\. Gulcehre \(2026\)Language modeling with hyperspherical flows\.arXiv preprint arXiv:2605\.11125\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p1.1)\.
- S\. Dieleman, L\. Sartran, A\. Roshannai, N\. Savinov, Y\. Ganin, P\. H\. Richemond, A\. Doucet, R\. Strudel, C\. Dyer, C\. Durkan,et al\.\(2022\)Continuous diffusion for categorical data\.arXiv preprint arXiv:2211\.15089\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p2.1),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.17),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1),[§4](https://arxiv.org/html/2607.00714#S4.SS0.SSS0.Px1.p1.1)\.
- F\. Eijkelboom, G\. Bartosh, C\. A\. Naesseth, M\. Welling, and J\. van de Meent \(2024\)Variational flow matching for graph generation\.Advances in Neural Information Processing Systems37,pp\. 11735–11764\.Cited by:[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.17)\.
- S\. W\. Fung, H\. Heaton, Q\. Li, D\. McKenzie, S\. Osher, and W\. Yin \(2022\)Jfb: jacobian\-free backpropagation for implicit networks\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.36,pp\. 6648–6656\.Cited by:[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p5.4)\.
- A\. Gokaslan, V\. Cohen, E\. Pavlick, and S\. Tellex \(2019\)OpenWebText corpus\.Note:[http://Skylion007\.github\.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus)Cited by:[§4](https://arxiv.org/html/2607.00714#S4.SS0.SSS0.Px1.p1.1)\.
- I\. Gulrajani and T\. B\. Hashimoto \(2023\)Likelihood\-based diffusion language models\.Advances in Neural Information Processing Systems36,pp\. 16693–16715\.Cited by:[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p2.8)\.
- S\. Hayakawa, Y\. Takida, M\. Imaizumi, H\. Wakaki, and Y\. Mitsufuji \(2024\)Distillation of discrete diffusion through dimensional correlations\.arXiv preprint arXiv:2410\.08709\.Cited by:[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.SSS0.Px2.p2.2)\.
- K\. Hu, L\. Qiu, Y\. Lu, H\. Zhao, T\. Li, Y\. Kim, J\. Andreas, and K\. He \(2026\)ELF: embedded language flows\.arXiv preprint arXiv:2605\.10938\.Cited by:[§B\.1](https://arxiv.org/html/2607.00714#A2.SS1.p2.3),[§B\.2](https://arxiv.org/html/2607.00714#A2.SS2.SSS0.Px1.p1.5),[§B\.2](https://arxiv.org/html/2607.00714#A2.SS2.SSS0.Px2.p1.18),[§B\.2](https://arxiv.org/html/2607.00714#A2.SS2.p1.4),[§B\.3](https://arxiv.org/html/2607.00714#A2.SS3.SSS0.Px2.p2.7),[§1](https://arxiv.org/html/2607.00714#S1.p2.1),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.13),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1),[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p5.4),[§3\.3](https://arxiv.org/html/2607.00714#S3.SS3.p6.1),[§4\.1](https://arxiv.org/html/2607.00714#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.p1.1)\.
- D\. Kim, C\. Lai, W\. Liao, N\. Murata, Y\. Takida, T\. Uesaka, Y\. He, Y\. Mitsufuji, and S\. Ermon \(2024\)Consistency trajectory models: learning probability flow ode trajectory of diffusion\.InICLR,Cited by:[§B\.3](https://arxiv.org/html/2607.00714#A2.SS3.SSS0.Px3.p1.4)\.
- M\. Lauriere \(2021\)Numerical methods for mean field games and mean field type control\.arXiv preprint arXiv:2106\.06231\.Cited by:[§B\.1](https://arxiv.org/html/2607.00714#A2.SS1.p1.4),[§4\.1](https://arxiv.org/html/2607.00714#S4.SS1.SSS0.Px1.p1.6)\.
- C\. Lee, J\. Yoo, M\. Agarwal, S\. Shah, J\. Huang, A\. Raghunathan, S\. Hong, N\. M\. Boffi, and J\. Kim \(2026\)Flow map language models: one\-step language modeling via continuous denoising\.arXiv preprint arXiv:2602\.16813\.Cited by:[§B\.3](https://arxiv.org/html/2607.00714#A2.SS3.SSS0.Px2.p1.18),[§B\.3](https://arxiv.org/html/2607.00714#A2.SS3.SSS0.Px3.p1.4),[§1](https://arxiv.org/html/2607.00714#S1.p1.1),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.13),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.17),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px2.p2.1),[§3\.4](https://arxiv.org/html/2607.00714#S3.SS4.p3.3),[§3\.4](https://arxiv.org/html/2607.00714#S3.SS4.p3.7),[§4](https://arxiv.org/html/2607.00714#S4.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.SSS0.Px2.p2.2)\.
- T\. Li and K\. He \(2026\)Back to basics: let denoising generative models denoise\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 36115–36125\.Cited by:[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.13)\.
- J\. Lin, Z\. Ling, J\. Xu, and R\. C\. Qiu \(2026\)Consistency deep equilibrium models\.arXiv preprint arXiv:2602\.03024\.Cited by:[§B\.2](https://arxiv.org/html/2607.00714#A2.SS2.SSS0.Px2.p1.18),[§B\.2](https://arxiv.org/html/2607.00714#A2.SS2.p1.4),[§3\.3](https://arxiv.org/html/2607.00714#S3.SS3.p7.6),[§4\.1](https://arxiv.org/html/2607.00714#S4.SS1.SSS0.Px1.p1.6),[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.SSS0.Px1.p1.5)\.
- Y\. Lipman, R\. T\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le \(2022\)Flow matching for generative modeling\.arXiv preprint arXiv:2210\.02747\.Cited by:[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px1.p2.8)\.
- Y\. Lu, S\. Lu, Q\. Sun, H\. Zhao, Z\. Jiang, X\. Wang, T\. Li, Z\. Geng, and K\. He \(2026\)One\-step latent\-free image generation with pixel mean flows\.arXiv preprint arXiv:2601\.22158\.Cited by:[§3\.4](https://arxiv.org/html/2607.00714#S3.SS4.p3.3),[§3\.4](https://arxiv.org/html/2607.00714#S3.SS4.p3.7)\.
- V\. Meshchaninov, A\. Shabalin, E\. Chimbulatov, N\. Gushchin, I\. Koziev, A\. Korotin, and D\. Vetrov \(2026\)How to train your latent diffusion language model jointly with the latent space\.arXiv preprint arXiv:2605\.07933\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p2.1),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1)\.
- S\. Movahedi, V\. Milovanović, S\. L\. Feigin, A\. Theus, T\. Hofmann, V\. Boeva, T\. K\. Rusch, and A\. Orvieto \(2026\)Fixed\-point reasoners: stable and adaptive deep looped transformers\.arXiv preprint arXiv:2606\.18206\.Cited by:[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p5.4)\.
- P\. Potaptchik, J\. Yim, A\. Saravanan, P\. Holderrieth, E\. Vanden\-Eijnden, and M\. S\. Albergo \(2026\)Discrete flow maps\.arXiv preprint arXiv:2604\.09784\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p1.1),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px2.p2.1),[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.SSS0.Px2.p2.2)\.
- P\. Pynadath, J\. Shi, and R\. Zhang \(2025\)Candi: hybrid discrete\-continuous diffusion models\.arXiv preprint arXiv:2510\.22510\.Cited by:[§4\.1](https://arxiv.org/html/2607.00714#S4.SS1.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§B\.3](https://arxiv.org/html/2607.00714#A2.SS3.SSS0.Px2.p4.1),[§4](https://arxiv.org/html/2607.00714#S4.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[§B\.3](https://arxiv.org/html/2607.00714#A2.SS3.SSS0.Px2.p4.1)\.
- Y\. Romano, M\. Elad, and P\. Milanfar \(2017\)The little engine that could: regularization by denoising \(red\)\.SIAM journal on imaging sciences10\(4\),pp\. 1804–1844\.Cited by:[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p5.4)\.
- D\. Roos, O\. Davis, F\. Eijkelboom, M\. Bronstein, M\. Welling, I\. I\. Ceylan, L\. Ambrogioni, and J\. van de Meent \(2026\)Categorical flow maps\.arXiv preprint arXiv:2602\.12233\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p1.1),[§2](https://arxiv.org/html/2607.00714#S2.SS0.SSS0.Px2.p2.1)\.
- E\. Ryu, J\. Liu, S\. Wang, X\. Chen, Z\. Wang, and W\. Yin \(2019\)Plug\-and\-play methods provably converge with properly trained denoisers\.InInternational Conference on Machine Learning,pp\. 5546–5557\.Cited by:[§3\.2](https://arxiv.org/html/2607.00714#S3.SS2.p5.4)\.
- A\. Sabour, S\. Fidler, and K\. Kreis \(2026\)Align your flow: scaling continuous\-time flow map distillation\.Advances in Neural Information Processing Systems38,pp\. 146459–146512\.Cited by:[§B\.3](https://arxiv.org/html/2607.00714#A2.SS3.SSS0.Px3.p1.4)\.
- S\. S\. Sahoo, J\. Deschenaux, A\. Gokaslan, G\. Wang, J\. Chiu, and V\. Kuleshov \(2025\)The diffusion duality\.arXiv preprint arXiv:2506\.10892\.Cited by:[§4](https://arxiv.org/html/2607.00714#S4.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2607.00714#S4.SS2.SSS0.Px2.p2.2)\.
- A\. Shabalin, S\. Elistratov, V\. Meshchaninov, I\. Sadrtdinov, and D\. Vetrov \(2026\)Why gaussian diffusion models fail on discrete data?\.arXiv preprint arXiv:2604\.02028\.Cited by:[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1)\.
- A\. Shabalin, V\. Meshchaninov, E\. Chimbulatov, V\. Lapikov, R\. Kim, G\. Bartosh, D\. Molchanov, S\. Markov, and D\. Vetrov \(2025\)Tencdm: understanding the properties of the diffusion model in the space of language model encodings\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 25110–25118\.Cited by:[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1)\.
- R\. Strudel, C\. Tallec, F\. Altché, Y\. Du, Y\. Ganin, A\. Mensch, W\. Grathwohl, N\. Savinov, S\. Dieleman, L\. Sifre,et al\.\(2022\)Self\-conditioned embedding diffusion for text generation\.arXiv preprint arXiv:2211\.04236\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p2.1)\.
- Z\. Yang, W\. Guo, S\. Zhang, S\. S\. Sahoo, Y\. Chen, A\. Vahdat, M\. Mardani, and J\. Thickstun \(2026\)Continuous diffusion scales competitively with discrete diffusion for language\.arXiv preprint arXiv:2605\.18530\.Cited by:[§1](https://arxiv.org/html/2607.00714#S1.p2.1),[§3\.1](https://arxiv.org/html/2607.00714#S3.SS1.p6.1)\.
## Appendix A Proofs
### A.1 Proof of Proposition [3.1](https://arxiv.org/html/2607.00714#S3.Thmtheorem1)
See[3\.1](https://arxiv.org/html/2607.00714#S3.Thmtheorem1)
###### Proof\.
Fix $t$ for which the conditional independence assumption holds, and write
$$D\coloneqq D_{t}(I_{t}),\qquad U\coloneqq\hat{D}_{t}(I_{t},\mathbf{z}).$$
Then
$$\mathbf{x}_{1}-U=(\mathbf{x}_{1}-D)+(D-U).$$
Squaring and taking expectations gives
$$\mathbb{E}|\mathbf{x}_{1}-U|^{2}=\mathbb{E}|\mathbf{x}_{1}-D|^{2}+\mathbb{E}|U-D|^{2}+2\,\mathbb{E}\langle\mathbf{x}_{1}-D,D-U\rangle.$$
The cross term vanishes. Indeed,
$$\mathbb{E}[\mathbf{x}_{1}-D\mid I_{t},\mathbf{z}]=\mathbb{E}[\mathbf{x}_{1}\mid I_{t},\mathbf{z}]-D_{t}(I_{t}).$$
By the conditional independence assumption,
$$\mathbb{E}[\mathbf{x}_{1}\mid I_{t},\mathbf{z}]=\mathbb{E}[\mathbf{x}_{1}\mid I_{t}]=D_{t}(I_{t}).$$
Hence
$$\mathbb{E}[\mathbf{x}_{1}-D\mid I_{t},\mathbf{z}]=0,$$
and therefore, since $D-U$ is a function of $(I_{t},\mathbf{z})$, the tower property gives
$$\mathbb{E}\langle\mathbf{x}_{1}-D,D-U\rangle=0.$$
Thus
$$\mathbb{E}|\hat{D}_{t}(I_{t},\mathbf{z})-\mathbf{x}_{1}|^{2}=\mathbb{E}|\mathbf{x}_{1}-D_{t}(I_{t})|^{2}+\mathbb{E}|\hat{D}_{t}(I_{t},\mathbf{z})-D_{t}(I_{t})|^{2}.$$
The first term does not depend on $\hat{D}_{t}$. The second term is minimized exactly when
$$\hat{D}_{t}(I_{t},\mathbf{z})=D_{t}(I_{t})$$
almost surely under the law of $(I_{t},\mathbf{z})$. Integrating over $t$ gives the claim. ∎
### A.2 Proof of approximate contractivity
We show that learning the self\-conditioned objective leads to an approximate notion of contractivity\.
###### Definition A.1 (A simplified self-conditioned regression).
Let $O\subseteq\mathbb{R}^{d}$ be a nonempty and closed set, let $Z\sim\mu$ with $\mu(O)=1$, let $X\in\mathbb{R}^{d}$ be fixed, and let $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ be a continuous function. For some $0<p<1$, we define the following self-conditioned regression objective:
$$\mathcal{L}(f)\coloneqq p\,\mathbb{E}|f(Z)-X|^{2}+(1-p)\,\mathbb{E}|f(f(Z))-X|^{2}. \tag{29}$$

This is a simplification of our problem setting in [Section 3.1](https://arxiv.org/html/2607.00714#S3.SS1). Taking $p=1/2$, $f=\hat{D}_{t}(I_{t},\cdot)$, and $X=D_{t}(\mathbf{x})$ (the Bayes-optimal prediction target; [Proposition 3.1](https://arxiv.org/html/2607.00714#S3.Thmtheorem1)), for fixed $t\in[0,1]$ and $\mathbf{x}$, we recover a practical learning scenario for self-conditioned language models similar to ([7](https://arxiv.org/html/2607.00714#S3.E7)).
Let us fix
- a scale $w>0$,
- a factor $0\leq\eta<1$,
- a tolerance $r>0$,
- and a failure probability $0<\varepsilon<1$.
Let us assume:
$$\bar{B}(X,r)\subseteq O,\quad 2r\leq\eta w,\quad\mathcal{L}(f)\leq\min\{p,1-p\}\,\varepsilon r^{2}. \tag{30}$$
Define
$$A\coloneqq\{\mathbf{z}\in O:|f(\mathbf{z})-X|\leq r\}. \tag{31}$$
The following lemmas show that a sufficiently small loss yields a closed, high-probability set $A$ that is contractive with factor $\eta$ above the scale $w$, and approximately forward-invariant under the training distribution with leakage at most $\varepsilon$. Intuitively, the first term in the self-conditioned loss makes the set $A$ high-probability under $\mu$, whereas the second term makes $f(Z)$ remain in $A$ for most $Z\sim\mu$.
###### Lemma A.2 (Closedness and approximate forward invariance).
$A$ is a closed subset of $\mathbb{R}^{d}$. Furthermore, $f(A)\subseteq O$, and it holds that
$$\mu(A\cap f^{-1}(A))\geq 1-\varepsilon. \tag{32}$$
In particular, $\mu(A)\geq 1-\varepsilon$ and
$$\mu(\{\mathbf{z}\in A:f(\mathbf{z})\notin A\})\leq\varepsilon. \tag{33}$$
###### Proof\.
The closed Euclidean ballB¯\(X,r\)\\bar\{B\}\(X,r\)is closed\. Sinceffis continuous,
f−1\(B¯\(X,r\)\)f^\{\-1\}\(\\bar\{B\}\(X,r\)\)is closed\. Therefore,
A=O∩f−1\(B¯\(X,r\)\)A=O\\cap f^\{\-1\}\(\\bar\{B\}\(X,r\)\)is closed becauseOOis closed\.
For every𝐳∈A\{\\bf z\}\\in A, we have, by definition,
\|f\(𝐳\)−X\|≤r\.\|f\(\{\\bf z\}\)\-X\|\\leq r\.Hence
f\(𝐳\)∈B¯\(X,r\)\.f\(\{\\bf z\}\)\\in\\bar\{B\}\(X,r\)\.By assumption \([30](https://arxiv.org/html/2607.00714#A1.E30)\),
Define the initial and self\-conditioned errors
e1\(𝐳\)≔\|f\(𝐳\)−X\|,e2\(𝐳\)≔\|f\(f\(𝐳\)\)−X\|\.e\_\{1\}\(\{\\bf z\}\)\\coloneqq\|f\(\{\\bf z\}\)\-X\|,\\quad e\_\{2\}\(\{\\bf z\}\)\\coloneqq\|f\(f\(\{\\bf z\}\)\)\-X\|\.and define
G≔\{𝐳∈O:e1\(𝐳\)≤rande2\(𝐳\)≤r\}\.G\\coloneqq\\\{\{\\bf z\}\\in O:e\_\{1\}\(\{\\bf z\}\)\\leq r\\text\{ and \}e\_\{2\}\(\{\\bf z\}\)\\leq r\\\}\.The conditione1\(𝐳\)≤re\_\{1\}\(\{\\bf z\}\)\\leq ris exactly𝐳∈A\{\\bf z\}\\in A\.
Moreover, as we have seen,e1\(𝐳\)≤re\_\{1\}\(\{\\bf z\}\)\\leq rimpliesf\(𝐳\)∈Of\(\{\\bf z\}\)\\in O\. Therefore,
e2\(𝐳\)≤re\_\{2\}\(\{\\bf z\}\)\\leq ris equivalent to
f\(𝐳\)∈A\.f\(\{\\bf z\}\)\\in A\.Consequently,
G=A∩f−1\(A\)\.G=A\\cap f^\{\-1\}\(A\)\.
On the complementGcG^\{c\}, at least one of the inequalities
e1\(𝐳\)≤r,e2\(𝐳\)≤re\_\{1\}\(\{\\bf z\}\)\\leq r,\\quad e\_\{2\}\(\{\\bf z\}\)\\leq rfails\.
Ife1\(𝐳\)\>re\_\{1\}\(\{\\bf z\}\)\>r, then
pe1\(𝐳\)2\>pr2≥min\{p,1−p\}r2\.pe\_\{1\}\(\{\\bf z\}\)^\{2\}\>pr^\{2\}\\geq\\min\\\{p,1\-p\\\}r^\{2\}\.Ife2\(𝐳\)\>re\_\{2\}\(\{\\bf z\}\)\>r, then
\(1−p\)e2\(𝐳\)2\>\(1−p\)r2≥min\{p,1−p\}r2\.\(1\-p\)e\_\{2\}\(\{\\bf z\}\)^\{2\}\>\(1\-p\)r^\{2\}\\geq\\min\\\{p,1\-p\\\}r^\{2\}\.Thus, for every𝐳∉G\{\\bf z\}\\notin G,
pe1\(𝐳\)2\+\(1−p\)e2\(𝐳\)2≥min\{p,1−p\}r2\.pe\_\{1\}\(\{\\bf z\}\)^\{2\}\+\(1\-p\)e\_\{2\}\(\{\\bf z\}\)^\{2\}\\geq\\min\\\{p,1\-p\\\}r^\{2\}\.Equivalently,
𝟏\(𝐳∉G\)≤pe1\(𝐳\)2\+\(1−p\)e2\(𝐳\)2min\{p,1−p\}r2\.\\boldsymbol\{1\}\(\{\\bf z\}\\notin G\)\\leq\\frac\{pe\_\{1\}\(\{\\bf z\}\)^\{2\}\+\(1\-p\)e\_\{2\}\(\{\\bf z\}\)^\{2\}\}\{\\min\\\{p,1\-p\\\}r^\{2\}\}\.Taking expectation with respect toZ∼μZ\\sim\\mu,
μ\(Gc\)≤𝔼\[pe1\(𝐳\)2\+\(1−p\)e2\(𝐳\)2\]min\{p,1−p\}r2=ℒ\(f\)min\{p,1−p\}r2\.\\mu\(G^\{c\}\)\\leq\\frac\{\\mathbb\{E\}\[pe\_\{1\}\(\{\\bf z\}\)^\{2\}\+\(1\-p\)e\_\{2\}\(\{\\bf z\}\)^\{2\}\]\}\{\\min\\\{p,1\-p\\\}r^\{2\}\}=\\frac\{\\mathcal\{L\}\(f\)\}\{\\min\\\{p,1\-p\\\}r^\{2\}\}\.By assumption \([30](https://arxiv.org/html/2607.00714#A1.E30)\),
μ\(Gc\)≤ε\.\\mu\(G^\{c\}\)\\leq\\varepsilon\.Therefore,
μ\(G\)≥1−ε\.\\mu\(G\)\\geq 1\-\\varepsilon\.SinceG=A∩f−1\(A\)G=A\\cap f^\{\-1\}\(A\),
μ\(A∩f−1\(A\)\)≥1−ε\.\\mu\(A\\cap f^\{\-1\}\(A\)\)\\geq 1\-\\varepsilon\.This impliesμ\(A\)≥1−ε\\mu\(A\)\\geq 1\-\\varepsilonbecauseA∩f−1\(A\)⊆AA\\cap f^\{\-1\}\(A\)\\subseteq A\.
It also implies
μ\(A∖f−1\(A\)\)≤ε,\\mu\(A\\setminus f^\{\-1\}\(A\)\)\\leq\\varepsilon,because the event
A∖f−1\(A\)=\{𝐳∈A:f\(𝐳\)∉A\}A\\setminus f^\{\-1\}\(A\)=\\\{\{\\bf z\}\\in A:f\(\{\\bf z\}\)\\notin A\\\}is contained inGcG^\{c\}\. ∎
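The Markov-type step in the proof can be checked numerically. The following is a minimal sketch, assuming a toy one-dimensional map `toy_f` and a uniform sample standing in for $\mu$ (both hypothetical and not taken from the paper). For the empirical measure, the fraction of samples outside $G$ should never exceed the empirical loss divided by $\min\{p, 1-p\}\, r^{2}$.

```python
import random

def markov_check(f, samples, X, r, p):
    """Return (empirical mu(G^c), L(f) / (min{p, 1-p} r^2)) for scalar samples."""
    e1 = [abs(f(z) - X) for z in samples]       # initial error e_1
    e2 = [abs(f(f(z)) - X) for z in samples]    # self-conditioned error e_2
    n = len(samples)
    loss = sum(p * a * a + (1 - p) * b * b for a, b in zip(e1, e2)) / n
    bad = sum(1 for a, b in zip(e1, e2) if a > r or b > r) / n
    return bad, loss / (min(p, 1 - p) * r * r)

random.seed(0)
X = 1.0
toy_f = lambda z: X + 0.3 * (z - X)   # hypothetical map that pulls z toward X
zs = [random.uniform(-1.0, 3.0) for _ in range(10_000)]
bad_fraction, bound = markov_check(toy_f, zs, X, r=0.5, p=0.5)
```

The inequality holds for any choice of `f` and samples, since it is a pointwise bound averaged over the sample; the toy map only makes the two sides easy to compare.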
###### Lemma A.3 (Approximate contractivity).
For every $\mathbf{z}, \mathbf{z}' \in A$ satisfying $|\mathbf{z} - \mathbf{z}'| \geq w$, one has

$$|f(\mathbf{z}) - f(\mathbf{z}')| \leq \eta\,|\mathbf{z} - \mathbf{z}'|. \tag{34}$$

###### Proof.
Take arbitrary $\mathbf{z}, \mathbf{z}' \in A$. By definition of $A$,

$$|f(\mathbf{z}) - X| \leq r, \quad |f(\mathbf{z}') - X| \leq r.$$

By the triangle inequality,

$$|f(\mathbf{z}) - f(\mathbf{z}')| \leq |f(\mathbf{z}) - X| + |f(\mathbf{z}') - X| \leq 2r.$$

By assumption ([30](https://arxiv.org/html/2607.00714#A1.E30)), $2r \leq \eta w$. Therefore, whenever $|\mathbf{z} - \mathbf{z}'| \geq w$,

$$|f(\mathbf{z}) - f(\mathbf{z}')| \leq 2r \leq \eta w \leq \eta\,|\mathbf{z} - \mathbf{z}'|.$$

Hence $|f(\mathbf{z}) - f(\mathbf{z}')| \leq \eta\,|\mathbf{z} - \mathbf{z}'|$ for every pair $\mathbf{z}, \mathbf{z}' \in A$ separated by at least $w$. ∎
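To illustrate why the contraction only holds above the scale $w$, consider a minimal sketch with a hypothetical oscillating map (not from the paper) whose outputs all stay within $r$ of $X$. Nearby inputs can be stretched by a large factor, while pairs at distance at least $w = 2r/\eta$ are always contracted by $\eta$.

```python
import itertools
import math

X, r, eta = 0.0, 0.1, 0.5
w = 2 * r / eta                        # smallest scale allowed by 2r <= eta * w

def f(z):
    return X + r * math.sin(50.0 * z)  # hypothetical map; |f(z) - X| <= r everywhere

grid = [i * 0.01 for i in range(-200, 201)]   # every grid point lies in A

def ratio(a, b):
    return abs(f(a) - f(b)) / abs(a - b)

pairs = list(itertools.combinations(grid, 2))
worst_far = max(ratio(a, b) for a, b in pairs if abs(a - b) >= w)
worst_near = max(ratio(a, b) for a, b in pairs if abs(a - b) < w)
```

Here `worst_far` is bounded by `eta`, whereas `worst_near` exceeds 1, so the map is not a contraction in the usual sense.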
### A.3 Proof of Proposition [3.3](https://arxiv.org/html/2607.00714#S3.Thmtheorem3)
We first show a useful lemma\.
###### Lemma A.4.
Let $f: \mathbb{R}^{d} \to \mathbb{R}^{d}$ be a contraction on a nonempty closed set $O \subseteq \mathbb{R}^{d}$ with factor $0 \leq \eta < 1$. Then, there is a unique fixed point $\mathbf{z}^{\star} \in O$, to which the iteration $\mathbf{z}^{j+1} = f(\mathbf{z}^{j})$ from any $\mathbf{z}^{0} \in O$ converges exponentially. Furthermore, the following holds for any finite $j \geq k \geq 0$:

$$|\mathbf{z}^{j} - \mathbf{z}^{k}| \leq \frac{\eta^{k} - \eta^{j}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

Similarly, the following holds:

$$|\mathbf{z}^{\star} - \mathbf{z}^{k}| \leq \frac{\eta^{k}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

###### Proof.
Because $O$ is a closed subset of $\mathbb{R}^{d}$, it is complete. The map $f$ maps $O$ into itself and is a contraction on $O$. Therefore, by the Banach fixed-point theorem, $f$ has a unique fixed point $\mathbf{z}^{\star} \in O$, and the iterates converge to it.

A geometric convergence bound follows directly from the contraction property:

$$|\mathbf{z}^{j+1} - \mathbf{z}^{\star}| = |f(\mathbf{z}^{j}) - f(\mathbf{z}^{\star})| \leq \eta\,|\mathbf{z}^{j} - \mathbf{z}^{\star}|.$$

Repeating this inequality gives

$$|\mathbf{z}^{j} - \mathbf{z}^{\star}| \leq \eta^{j}\,|\mathbf{z}^{0} - \mathbf{z}^{\star}|. \tag{35}$$

For $j \geq k \geq 0$, telescoping gives

$$\mathbf{z}^{j} - \mathbf{z}^{k} = \sum_{i=k}^{j-1} (\mathbf{z}^{i+1} - \mathbf{z}^{i}).$$

Taking norms and applying the triangle inequality,

$$|\mathbf{z}^{j} - \mathbf{z}^{k}| = \left|\sum_{i=k}^{j-1} (\mathbf{z}^{i+1} - \mathbf{z}^{i})\right| \leq \sum_{i=k}^{j-1} |\mathbf{z}^{i+1} - \mathbf{z}^{i}|. \tag{36}$$

Each increment with $i \geq 1$ can be bounded using the contraction property:

$$|\mathbf{z}^{i+1} - \mathbf{z}^{i}| = |f(\mathbf{z}^{i}) - f(\mathbf{z}^{i-1})| \leq \eta\,|\mathbf{z}^{i} - \mathbf{z}^{i-1}|.$$

Repeating this inequality gives

$$|\mathbf{z}^{i+1} - \mathbf{z}^{i}| \leq \eta^{i}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

Applying this in ([36](https://arxiv.org/html/2607.00714#A1.E36)),

$$|\mathbf{z}^{j} - \mathbf{z}^{k}| \leq \sum_{i=k}^{j-1} \eta^{i}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

The geometric sum is

$$\sum_{i=k}^{j-1} \eta^{i} = \frac{\eta^{k} - \eta^{j}}{1-\eta}.$$

Thus,

$$|\mathbf{z}^{j} - \mathbf{z}^{k}| \leq \frac{\eta^{k} - \eta^{j}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

Finally, since $\mathbf{z}^{j} \to \mathbf{z}^{\star}$, continuity of the norm gives

$$|\mathbf{z}^{\star} - \mathbf{z}^{k}| = \lim_{j \to \infty} |\mathbf{z}^{j} - \mathbf{z}^{k}|.$$

Therefore,

$$|\mathbf{z}^{\star} - \mathbf{z}^{k}| \leq \lim_{j \to \infty} \frac{\eta^{k} - \eta^{j}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}| = \frac{\eta^{k}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

∎
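The two a priori bounds of the lemma can be verified on a simple example. This is a minimal sketch, assuming a hypothetical scalar affine contraction (not from the paper) whose fixed point is known in closed form; for this particular map the bounds hold with equality up to rounding, hence the small tolerance.

```python
def iterate(f, z0, steps):
    """Return the Picard iterates [z^0, z^1, ..., z^steps]."""
    zs = [z0]
    for _ in range(steps):
        zs.append(f(zs[-1]))
    return zs

eta = 0.6
f = lambda z: eta * z + 2.0     # hypothetical contraction on R with factor eta
z_star = 2.0 / (1.0 - eta)      # its unique fixed point
zs = iterate(f, z0=0.0, steps=30)
step0 = abs(zs[1] - zs[0])      # |z^1 - z^0|

def a_priori_bounds_hold(zs, eta, z_star, step0, tol=1e-9):
    """Check both bounds of the lemma for every pair j >= k >= 0."""
    for k in range(len(zs)):
        if abs(z_star - zs[k]) > eta**k / (1 - eta) * step0 + tol:
            return False
        for j in range(k, len(zs)):
            if abs(zs[j] - zs[k]) > (eta**k - eta**j) / (1 - eta) * step0 + tol:
                return False
    return True
```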
See Proposition [3.3](https://arxiv.org/html/2607.00714#S3.Thmtheorem3).
###### Proof\.
Since $\hat{D}_{t}(\mathbf{x}, \cdot)$ is a contraction on $O$, applying [Lemma A.4](https://arxiv.org/html/2607.00714#A1.Thmtheorem4) to it shows that $\mathbf{z}^{j}$ converges exponentially to a unique fixed point $\mathbf{z}^{\star} \in O$. Furthermore, we have

$$|\mathbf{z}^{j} - \mathbf{z}^{k}| \leq \frac{\eta^{k} - \eta^{j}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|, \quad |\mathbf{z}^{\star} - \mathbf{z}^{k}| \leq \frac{\eta^{k}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|. \tag{37}$$

By the triangle inequality,

$$|\mathbf{z}^{j} - D_{t}(\mathbf{x})| \leq |\mathbf{z}^{j} - \hat{D}_{t}(\mathbf{x}, \mathbf{z}^{0})| + |\hat{D}_{t}(\mathbf{x}, \mathbf{z}^{0}) - D_{t}(\mathbf{x})|.$$

By the definition of the iteration, $\hat{D}_{t}(\mathbf{x}, \mathbf{z}^{0}) = \mathbf{z}^{1}$, so

$$|\mathbf{z}^{j} - D_{t}(\mathbf{x})| \leq |\mathbf{z}^{j} - \mathbf{z}^{1}| + |\hat{D}_{t}(\mathbf{x}, \mathbf{z}^{0}) - D_{t}(\mathbf{x})|.$$

Using the first part of ([37](https://arxiv.org/html/2607.00714#A1.E37)) with $k = 1$ (for $j \geq 1$),

$$|\mathbf{z}^{j} - D_{t}(\mathbf{x})| \leq |\hat{D}_{t}(\mathbf{x}, \mathbf{z}^{0}) - D_{t}(\mathbf{x})| + \frac{\eta - \eta^{j}}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

Applying the same argument, with the second part of ([37](https://arxiv.org/html/2607.00714#A1.E37)),

$$|\mathbf{z}^{\star} - D_{t}(\mathbf{x})| \leq |\hat{D}_{t}(\mathbf{x}, \mathbf{z}^{0}) - D_{t}(\mathbf{x})| + \frac{\eta}{1-\eta}\,|\mathbf{z}^{1} - \mathbf{z}^{0}|.$$

∎
### A.4 Proof of Proposition [3.4](https://arxiv.org/html/2607.00714#S3.Thmtheorem4)
See Proposition [3.4](https://arxiv.org/html/2607.00714#S3.Thmtheorem4).
###### Proof\.
By definition, $D_{t}^{\star}(\mathbf{x})$ is obtained by solving the inner self-conditioning fixed-point problem at the pair $(t, \mathbf{x})$. Hence, after taking the fixed point, there is no remaining conditioning variable $\mathbf{z}$. Therefore

$$b_{t}^{\star}(\mathbf{x}) = \frac{D_{t}^{\star}(\mathbf{x}) - \mathbf{x}}{1-t}$$

is a function only of $t$ and $\mathbf{x}$. The sampling dynamics are therefore the ordinary ODE

$$\dot{\mathbf{x}}_{t} = b_{t}^{\star}(\mathbf{x}_{t}).$$

When the fixed point matches the Bayes-optimal denoiser, $D^{\star} = D$, then by ([1](https://arxiv.org/html/2607.00714#S2.E1)), we have that

$$b_{t}^{\star}(\mathbf{x}) = \frac{D_{t}(\mathbf{x}) - \mathbf{x}}{1-t} = b_{t}(\mathbf{x}).$$

Therefore, the fixed-point velocity recovers the true velocity. ∎
### A.5 Proof of Proposition [3.5](https://arxiv.org/html/2607.00714#S3.Thmtheorem5)
See Proposition [3.5](https://arxiv.org/html/2607.00714#S3.Thmtheorem5).
###### Proof\.
From the geometric convergence bound ([35](https://arxiv.org/html/2607.00714#A1.E35)) in [Lemma A.4](https://arxiv.org/html/2607.00714#A1.Thmtheorem4),

$$|\mathbf{z}^{j} - \mathbf{z}^{\star}| \leq \eta^{j}\,|\mathbf{z}^{0} - \mathbf{z}^{\star}|.$$

In the nontrivial case $|\mathbf{z}^{0} - \mathbf{z}^{\star}| > \varepsilon$, it is enough to require

$$\eta^{j}\,|\mathbf{z}^{0} - \mathbf{z}^{\star}| \leq \varepsilon.$$

Taking logarithms and using $\log \eta < 0$ gives

$$j \geq \frac{\log\bigl(|\mathbf{z}^{0} - \mathbf{z}^{\star}| / \varepsilon\bigr)}{\log(1/\eta)}.$$

∎
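For concreteness, the iteration count implied by this bound can be computed directly. The following is a minimal sketch; the helper name `min_iterations` is hypothetical, and it assumes $0 < \eta < 1$ and a known initial distance to the fixed point.

```python
import math

def min_iterations(dist0, eps, eta):
    """Smallest integer j with eta**j * dist0 <= eps (up to rounding), for 0 < eta < 1."""
    if dist0 <= eps:
        return 0   # trivial case: already within tolerance
    return math.ceil(math.log(dist0 / eps) / math.log(1.0 / eta))

j = min_iterations(dist0=1.0, eps=1e-3, eta=0.5)
```

The count grows only logarithmically in $1/\varepsilon$, which is why a modest number of fixed-point iterations suffices when the contraction factor is not close to 1.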
### A.6 Proof of Proposition [3.6](https://arxiv.org/html/2607.00714#S3.Thmtheorem6)
See Proposition [3.6](https://arxiv.org/html/2607.00714#S3.Thmtheorem6).
###### Proof\.
Since ([15](https://arxiv.org/html/2607.00714#S3.E15)) is an ODE, it defines the flow map $X_{s,t}^{\star}(\mathbf{x}_{s}) = \mathbf{x}_{t}$ wherever the ODE solution is unique. Uniqueness also gives the composition law: evolving from $s$ to $t$ is the same as first evolving from $s$ to $u$, and then from $u$ to $t$. Thus

$$X_{s,t}^{\star} = X_{u,t}^{\star} \circ X_{s,u}^{\star}.$$

∎
### A.7 Proof of Proposition [3.7](https://arxiv.org/html/2607.00714#S3.Thmtheorem7)
See Proposition [3.7](https://arxiv.org/html/2607.00714#S3.Thmtheorem7).
###### Proof\.
First, by definition,

$$\delta_{s,t}(\mathbf{x}) = \mathbf{x} + (1-s)\,v_{s,t}(\mathbf{x}).$$

Therefore

$$v_{s,t}(\mathbf{x}) = \frac{\delta_{s,t}(\mathbf{x}) - \mathbf{x}}{1-s}.$$

Substituting this into $X_{s,t}^{\star}(\mathbf{x}) = \mathbf{x} + (t-s)\,v_{s,t}(\mathbf{x})$ gives

$$X_{s,t}^{\star}(\mathbf{x}) = \mathbf{x} + \frac{t-s}{1-s}\bigl(\delta_{s,t}(\mathbf{x}) - \mathbf{x}\bigr).$$

Rearranging gives

$$X_{s,t}^{\star}(\mathbf{x}) = \frac{1-t}{1-s}\,\mathbf{x} + \frac{t-s}{1-s}\,\delta_{s,t}(\mathbf{x}),$$

which is [Equation 24](https://arxiv.org/html/2607.00714#S3.E24).

For the diagonal identity,

$$\delta_{t,t}(\mathbf{x}) = \mathbf{x} + (1-t)\,b_{t}^{\star}(\mathbf{x}).$$

Since

$$b_{t}^{\star}(\mathbf{x}) = \frac{D_{t}^{\star}(\mathbf{x}) - \mathbf{x}}{1-t},$$

we get

$$\delta_{t,t}(\mathbf{x}) = D_{t}^{\star}(\mathbf{x}),$$

which is [Equation 25](https://arxiv.org/html/2607.00714#S3.E25).

It remains to prove the semigroup identity. Let

$$\mathbf{z} := X_{s,u}^{\star}(\mathbf{x}) = \frac{1-u}{1-s}\,\mathbf{x} + \frac{u-s}{1-s}\,\delta_{s,u}(\mathbf{x}).$$

Using [Equation 24](https://arxiv.org/html/2607.00714#S3.E24),

$$X_{s,t}^{\star}(\mathbf{x}) = \frac{1-t}{1-s}\,\mathbf{x} + \frac{t-s}{1-s}\,\delta_{s,t}(\mathbf{x}), \qquad X_{u,t}^{\star}(\mathbf{z}) = \frac{1-t}{1-u}\,\mathbf{z} + \frac{t-u}{1-u}\,\delta_{u,t}(\mathbf{z}).$$

Since the semigroup property [Equation 21](https://arxiv.org/html/2607.00714#S3.E21) holds,

$$X_{s,t}^{\star}(\mathbf{x}) = X_{u,t}^{\star}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr) = X_{u,t}^{\star}(\mathbf{z}).$$

Substituting the expression for $\mathbf{z} = X_{s,u}^{\star}(\mathbf{x})$ into the right-hand side gives

$$\begin{aligned}
X_{s,t}^{\star}(\mathbf{x}) &= \frac{1-t}{1-u}\left(\frac{1-u}{1-s}\,\mathbf{x} + \frac{u-s}{1-s}\,\delta_{s,u}(\mathbf{x})\right) + \frac{t-u}{1-u}\,\delta_{u,t}(\mathbf{z}) \\
&= \frac{1-t}{1-s}\,\mathbf{x} + \frac{(1-t)(u-s)}{(1-u)(1-s)}\,\delta_{s,u}(\mathbf{x}) + \frac{t-u}{1-u}\,\delta_{u,t}(\mathbf{z}).
\end{aligned}$$

Equating this with the expression for $X_{s,t}^{\star}(\mathbf{x})$ from [Equation 24](https://arxiv.org/html/2607.00714#S3.E24) and canceling the common $\mathbf{x}$-term gives

$$\frac{t-s}{1-s}\,\delta_{s,t}(\mathbf{x}) = \frac{(1-t)(u-s)}{(1-u)(1-s)}\,\delta_{s,u}(\mathbf{x}) + \frac{t-u}{1-u}\,\delta_{u,t}(\mathbf{z}).$$

Multiplying by $(1-s)/(t-s)$, we obtain

$$\delta_{s,t}(\mathbf{x}) = \frac{(1-t)(u-s)}{(1-u)(t-s)}\,\delta_{s,u}(\mathbf{x}) + \frac{(1-s)(t-u)}{(1-u)(t-s)}\,\delta_{u,t}(\mathbf{z}).$$

Define

$$\gamma = \frac{(1-t)(u-s)}{(1-u)(t-s)}.$$

Then

$$1-\gamma = \frac{(1-s)(t-u)}{(1-u)(t-s)}.$$

Therefore

$$\delta_{s,t}(\mathbf{x}) = \gamma\,\delta_{s,u}(\mathbf{x}) + (1-\gamma)\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr).$$

Conversely, if this identity for $\delta$ holds, then applying [Equation 24](https://arxiv.org/html/2607.00714#S3.E24) to both sides gives

$$X_{s,t}^{\star}(\mathbf{x}) = X_{u,t}^{\star}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr).$$

Thus the two identities are equivalent. ∎
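The semigroup identity for $\delta$ can be checked on a flow map that is known in closed form. This is a minimal sketch, assuming a hypothetical linear velocity $b(x) = a x$ (not the paper's model), for which $X_{s,t}(x) = x\,e^{a(t-s)}$; the two-time denoiser is then built from the definitions used in the proof.

```python
import math

A = -1.5   # hypothetical coefficient of the linear velocity b(x) = A * x

def flow_map(s, t, x):
    """Exact flow map of dx/dt = A * x."""
    return x * math.exp(A * (t - s))

def delta(s, t, x):
    """Two-time denoiser delta_{s,t}(x) = x + (1 - s) * v_{s,t}(x)."""
    if t == s:
        return x + (1.0 - s) * A * x           # diagonal case: v_{t,t} = b_t
    v = (flow_map(s, t, x) - x) / (t - s)      # average velocity from s to t
    return x + (1.0 - s) * v

def gamma(s, u, t):
    return (1.0 - t) * (u - s) / ((1.0 - u) * (t - s))

s, u, t, x = 0.1, 0.4, 0.8, 2.0
g = gamma(s, u, t)
lhs = delta(s, t, x)
rhs = g * delta(s, u, x) + (1.0 - g) * delta(u, t, flow_map(s, u, x))
# Reconstruct the flow map from delta, as in Equation 24.
recon = (1 - t) / (1 - s) * x + (t - s) / (1 - s) * delta(s, t, x)
```

For $s < u < t < 1$ the weight `g` lies strictly between 0 and 1, so the right-hand side is a convex combination of the two shorter-range denoisers.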
### A.8 Proof of Proposition [3.8](https://arxiv.org/html/2607.00714#S3.Thmtheorem8)
See Proposition [3.8](https://arxiv.org/html/2607.00714#S3.Thmtheorem8).
###### Proof\.
By the two-time denoiser identities [Equation 25](https://arxiv.org/html/2607.00714#S3.E25) and [Equation 26](https://arxiv.org/html/2607.00714#S3.E26), the true two-time denoiser $\delta$ satisfies

$$\delta_{t,t}(\mathbf{x}) = D_{t}^{\star}(\mathbf{x})$$

and

$$\delta_{s,t}(\mathbf{x}) = \gamma\,\delta_{s,u}(\mathbf{x}) + (1-\gamma)\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr).$$

Therefore both squared-error terms in $\mathcal{L}(\delta)$ are zero, and hence $\mathcal{L}(\delta) = 0$.

Conversely, suppose $\mathcal{L}(\delta) = 0$. Since $\mathcal{L}$ is a sum of nonnegative squared norms, each squared norm must vanish almost surely under its corresponding training distribution. The diagonal term gives

$$\delta_{t,t}(\mathbf{x}) = D_{t}^{\star}(\mathbf{x})$$

almost surely. The semigroup term gives

$$\delta_{s,t}(\mathbf{x}) = \mathcal{T}(\delta)_{s,u,t}(\mathbf{x})$$

almost surely. Substituting the definition of $\mathcal{T}$, this becomes

$$\delta_{s,t}(\mathbf{x}) = \gamma\,\delta_{s,u}(\mathbf{x}) + (1-\gamma)\,\delta_{u,t}\bigl(X_{s,u}^{\star}(\mathbf{x})\bigr)$$

almost surely. Thus any zero-loss solution satisfies the diagonal and semigroup consistency identities on the training distributions. ∎
## Appendix B Experiment Details
### B.1 Analysis details
In [Section 4.1](https://arxiv.org/html/2607.00714#S4.SS1) ([Figure 2](https://arxiv.org/html/2607.00714#S4.F2)), we approximate the fixed point with damped Picard iterations (Lauriere, [2021](https://arxiv.org/html/2607.00714#bib.bib39)). The damped iteration is $\mathbf{z}^{j+1} = \alpha\,\mathbf{z}^{j} + (1-\alpha)\,\hat{D}_{t}(\mathbf{x}_{t}, \mathbf{z}^{j})$, where $\alpha$ is a damping hyperparameter. To find the fixed point at each timestep, we run this update for 200 iterations with $\alpha = 0.3$.
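The damped update can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical scalar stand-in `toy_denoiser` for $\hat{D}_{t}$ (the actual model is a neural network over latent sequences); setting `alpha=0.0` recovers the standard Picard iteration.

```python
def damped_picard(denoiser, x_t, z0, alpha=0.3, num_iters=200):
    """Iterate z <- alpha * z + (1 - alpha) * denoiser(x_t, z)."""
    z = z0
    for _ in range(num_iters):
        z = alpha * z + (1.0 - alpha) * denoiser(x_t, z)
    return z

# Hypothetical contraction in z; its fixed point for x_t = 1.2 is z = 1.0.
toy_denoiser = lambda x_t, z: 0.5 * x_t + 0.4 * z
z_fix = damped_picard(toy_denoiser, x_t=1.2, z0=0.0)
z_plain = damped_picard(toy_denoiser, x_t=1.2, z0=0.0, alpha=0.0)
```

Damping leaves the fixed point unchanged, since any $\mathbf{z}$ with $\mathbf{z} = \hat{D}_{t}(\mathbf{x}_{t}, \mathbf{z})$ also satisfies the damped equation; it only changes how the iterates approach it.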
For the analysis of generation performance ([Section 4.1](https://arxiv.org/html/2607.00714#S4.SS1.SSS0.Px1) and [Figure 3](https://arxiv.org/html/2607.00714#S4.F3)), to ensure a fair comparison with the baseline self-conditioning, we use standard Picard iteration (*i.e.,* $\alpha = 0.0$) for all runs except those with 100 iterations, for which we retain damped Picard with $\alpha = 0.3$. For the sampling hyperparameters, we set the self-conditioning guidance weight in ELF (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)) to $w = 1$.
### B.2 Self-conditioning-free model details
We realize the fixed-point denoiser $D^{\star}$ ([13](https://arxiv.org/html/2607.00714#S3.E13)) as a self-conditioning-free model, ELF⋆, by distilling a frozen self-conditioned ELF teacher $\hat{D}$ (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)) with consistency deep equilibrium (CDEQ) distillation (Lin et al., [2026](https://arxiv.org/html/2607.00714#bib.bib21)). The distillation compresses the fixed-point iteration ([10](https://arxiv.org/html/2607.00714#S3.E10)) into a single forward pass that predicts its limit $\mathbf{z}^{\star}$, so the resulting model matches the converged self-conditioned denoiser without iterating at inference.
#### Architecture\.
ELF (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)) conditions on the flow time $t$ through a bank of four learned prefix tokens, prepended to the latent sequence, to which an embedding of $t$ is added; the self-conditioning guidance weight enters through a second such bank. Following this design, to let the student track progress along the iteration, we condition it in the same manner on a consistency time $\tau$, which represents the progress of the fixed-point iteration: we append a parallel bank of four phase tokens that carry an embedding of $\tau$. We zero-initialize both the phase token bank and the $\tau$-embedder output, so the phase pathway is initially inert and the warm-started student nearly reproduces the teacher.
#### Training\.
Following Lin et al. ([2026](https://arxiv.org/html/2607.00714#bib.bib21)), at each step, we run the fixed-point iteration ([10](https://arxiv.org/html/2607.00714#S3.E10)) at the training interpolant, $\mathbf{z}^{j+1} = \hat{D}_{t}(\mathbf{x}, \mathbf{z}^{j})$, from the cold start $\mathbf{z}^{0} = \mathbf{0}$ for $K$ Anderson-accelerated steps, giving a detached trajectory $\mathbf{z}^{0}, \dots, \mathbf{z}^{K}$ whose last iterate we treat as the fixed point $\mathbf{z}^{\star}$. We assign iterate $\mathbf{z}^{j}$ a consistency time $\tau_{j} = \varepsilon + (1 - e^{-\rho j})(T - \varepsilon)$ with $\tau_{0} = \varepsilon$, and let $c_{\mathrm{skip}}(\tau) = (\tau - \varepsilon)/(T - \varepsilon)$ and $c_{\mathrm{out}}(\tau) = 1 - c_{\mathrm{skip}}(\tau)$. The student predicts a consistency function $g_{j} = c_{\mathrm{skip}}(\tau_{j})\,\mathbf{z}^{j} + c_{\mathrm{out}}(\tau_{j})\,P_{j}$, where $P_{j}$ is a closed-form two-point Anderson combination (mixing $1.0$) of two phase-conditioned student passes. At the cold-start phase $\tau_{0}$, we have $c_{\mathrm{skip}} = 0$, so $g_{j}$ collapses to the bare network prediction, which is exactly what inference evaluates. We train $g_{j}$ with a global term anchoring it to the equilibrium and a local term enforcing consistency with the previous phase, $\mathcal{L} = \lambda_{1}|g_{j} - \mathbf{z}^{\star}|^{2} + (1-\lambda_{1})|g_{j} - \mathsf{sg}(g_{j-1})|^{2}$, where $\mathsf{sg}$ denotes the stop-gradient applied to the earlier-phase prediction, and we retain the teacher's decoder head (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)).
We use $K = 20$ teacher iterations with Anderson history $3$, mixing $0.9$, and regularization $10^{-4}$; a schedule with $\varepsilon = 0.002$, $T = 5.0$, $\rho = 0.3$; and loss weight $\lambda_{1} = 0.8$. Because the consistency weighting leaves the cold-start phase weakly supervised, with probability $0.25$ we regress the cold-start point $\tau_{0}$ directly onto $\mathbf{z}^{\star}$. The remaining optimization follows the teacher's recipe ([Section B.3](https://arxiv.org/html/2607.00714#A2.SS3)).
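The schedule and loss above can be written out for scalars as follows. This is a minimal sketch, not the training code: the function names are hypothetical, the Anderson combination $P_{j}$ is passed in as a plain number, and the stop-gradient is represented only by treating the previous prediction as a constant.

```python
import math

EPS, T, RHO = 0.002, 5.0, 0.3   # schedule values quoted in the text

def tau(j):
    """Consistency time assigned to iterate j."""
    return EPS + (1.0 - math.exp(-RHO * j)) * (T - EPS)

def c_skip(tau_value):
    return (tau_value - EPS) / (T - EPS)

def c_out(tau_value):
    return 1.0 - c_skip(tau_value)

def consistency_fn(z_j, p_j, j):
    """g_j = c_skip(tau_j) * z^j + c_out(tau_j) * P_j for scalar inputs."""
    t = tau(j)
    return c_skip(t) * z_j + c_out(t) * p_j

def cdeq_loss(g_j, g_prev_detached, z_star, lam=0.8):
    """lam * |g_j - z*|^2 + (1 - lam) * |g_j - sg(g_{j-1})|^2 for scalars."""
    return lam * (g_j - z_star) ** 2 + (1.0 - lam) * (g_j - g_prev_detached) ** 2
```

With this schedule $c_{\mathrm{skip}}(\tau_{j}) = 1 - e^{-\rho j}$, so the skip weight is exactly zero at the cold start and approaches one as $j$ grows.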
#### Inference\.
At inference, ELF⋆ takes a single forward pass per flow step with a zero self-conditioning slot fixed at the cold-start phase $\tau_{0}$, and carries no self-conditioning across flow steps. Its velocity is therefore autonomous, so generation uses the standard Euler scheme ([2](https://arxiv.org/html/2607.00714#S2.E2)).
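As an illustration of sampling with such a velocity, the sketch below integrates $\dot{\mathbf{x}}_{t} = (D(\mathbf{x}_{t}) - \mathbf{x}_{t})/(1-t)$ with forward Euler. It assumes a uniform time grid on $[0, 1]$ and a hypothetical scalar denoiser; Equation (2) is not reproduced in this appendix, so the exact discretization used in the paper may differ.

```python
def euler_sample(denoiser, x0, num_steps):
    """Integrate dx/dt = (denoiser(t, x) - x) / (1 - t) from t = 0 to t = 1."""
    x = x0
    for i in range(num_steps):
        t, t_next = i / num_steps, (i + 1) / num_steps
        velocity = (denoiser(t, x) - x) / (1.0 - t)   # t < 1 at every grid point used
        x = x + (t_next - t) * velocity
    return x

# Hypothetical denoiser that always predicts the same clean point.
target = 3.0
sample = euler_sample(lambda t, x: target, x0=-1.0, num_steps=4)
```

In this toy case the last Euler step lands exactly on the denoiser output, because the step size equals $1 - t$ at the final grid point.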
### B.3 Self-conditioning flow map details
#### Architecture.
We parameterize the two-time denoiser $\delta_{s,t}$ ([24](https://arxiv.org/html/2607.00714#S3.E24)) by conditioning the ELF backbone on both the source time $s$ and the target time $t$. We reuse ELF's existing bank of flow-time prefix tokens, assigning half to the target $t$ and half to the source $s$ through the shared time-embedder; this introduces no new parameters and reduces to the single-time ELF denoiser on the diagonal $s = t$, realizing the diagonal condition $\delta_{s,s} = D^{\star}_{s}$ ([25](https://arxiv.org/html/2607.00714#S3.E25)) by construction.
#### Training\.
Each training step samples times $0 \leq s \leq u \leq t \leq 1$ and splits the batch per example into diagonal ($s = t$) and off-diagonal rows. Following ([27](https://arxiv.org/html/2607.00714#S3.E27)), diagonal rows regress $\delta_{t,t}$ onto the fixed-point denoiser $D^{\star}_{t}$. Off-diagonal rows regress $\delta_{s,t}$ onto the semigroup teacher $\bar{\delta}_{s,t}$, which we build from two stop-gradient passes of the current student, $\delta_{s,u}$ and $\delta_{u,t}$, chained through the midpoint state $X^{\star}_{s,u}$ ([24](https://arxiv.org/html/2607.00714#S3.E24)) by the convex weight $\gamma$ ([26](https://arxiv.org/html/2607.00714#S3.E26)); here $u$ is the midpoint of $s$ and $t$ in the percentile parameterization. Following Lee et al. ([2026](https://arxiv.org/html/2607.00714#bib.bib2)), we draw a diagonal row with probability $0.5$ and pin a fraction $1/32$ of the off-diagonal rows to the boundary $(s,t) = (0,1)$ to cover full one-step jumps. The times follow the teacher's training-time marginal, a logit-normal distribution with $(\mu, \nu) = (-1.5, 0.8)$. A single student forward pass at $(s,t)$ then carries the gradient, and we minimize the per-branch mean-squared error with equal weights.
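One possible reading of this time-sampling scheme is sketched below. Several details are assumptions rather than statements from the paper: $\nu$ is treated as the standard deviation of the logit, the "percentile parameterization" is taken to be the CDF of the logit-normal marginal, off-diagonal percentiles are drawn independently and uniformly, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

MU, NU = -1.5, 0.8          # logit-normal parameters quoted in the text
STD_NORMAL = NormalDist()   # standard normal, used for its quantile function

def percentile_to_time(q):
    """Logit-normal quantile: sigmoid(MU + NU * Phi^{-1}(q)) for q in (0, 1)."""
    logit = MU + NU * STD_NORMAL.inv_cdf(q)
    return 1.0 / (1.0 + math.exp(-logit))

def clamp(q, lo=1e-6, hi=1.0 - 1e-6):
    return min(max(q, lo), hi)   # keep inv_cdf away from 0 and 1

def sample_times(rng, p_diag=0.5, p_boundary=1.0 / 32.0):
    """Return (s, u, t) with 0 <= s <= u <= t <= 1."""
    if rng.random() < p_diag:                  # diagonal row: s = u = t
        t = percentile_to_time(clamp(rng.random()))
        return t, t, t
    if rng.random() < p_boundary:              # pinned full jump (s, t) = (0, 1)
        return 0.0, percentile_to_time(0.5), 1.0
    q_s, q_t = sorted(clamp(rng.random()) for _ in range(2))
    q_u = 0.5 * (q_s + q_t)                    # midpoint in percentile space
    return percentile_to_time(q_s), percentile_to_time(q_u), percentile_to_time(q_t)
```

Taking the midpoint in percentile space rather than in raw time places $u$ where the training marginal has equal mass on either side between $s$ and $t$.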
The self-conditioned teacher is steered by a self-conditioning guidance weight $w$ (Hu et al., [2026](https://arxiv.org/html/2607.00714#bib.bib3)), which we hold fixed within each training example. We sample a single $w$ per example, from the same log-uniform range $[0.5, 5.0]$ used in ELF training, and feed that one value to every model call that enters its loss: the teacher passes that form the diagonal target $D^{\star}_{t}$, the two student passes that form the off-diagonal semigroup target, and the gradient-carrying student forward pass. Sharing one $w$ across both branches and across teacher and student keeps the targets at a common guidance level, so the student learns a single $w$-conditioned flow map; at inference, $w$ is likewise fixed across all steps and swept to trace the gPPL-entropy frontier.
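Log-uniform sampling of the guidance weight amounts to sampling uniformly in log space. The following is a minimal sketch; the function name and the dictionary of call sites are hypothetical and only illustrate that one draw is shared by every model call of an example.

```python
import math
import random

def sample_guidance_weight(rng, low=0.5, high=5.0):
    """Draw w log-uniformly from [low, high]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
w = sample_guidance_weight(rng)   # one draw per training example
# The same w is passed to every teacher and student call for that example.
calls = {"teacher_diagonal": w, "student_s_u": w, "student_u_t": w, "student_grad": w}
```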
The offline and online routes of [Section 3.4](https://arxiv.org/html/2607.00714#S3.SS4) differ only in how the diagonal target $D^{\star}_{t}$ is supplied. The two-stage FMLM⋆ (offline) uses the separately distilled ELF⋆ ([Section B.2](https://arxiv.org/html/2607.00714#A2.SS2)) as the teacher, which already returns $D^{\star}_{t}$ in a single forward pass. The one-stage online FMLM⋆ instead uses the self-conditioned ELF teacher and forms the target on the fly, cold-starting the iteration $\mathbf{z}^{j+1} = \hat{D}_{t}(\mathbf{x}, \mathbf{z}^{j})$ at $\mathbf{z}^{0} = \mathbf{0}$, running a fixed number of Picard refinements, and reading out $D^{\star}_{t}$ with one final self-conditioned pass; the refinement count is the "# FPIs" in [Table 3](https://arxiv.org/html/2607.00714#S4.T3), where more iterations sharpen the target at a higher training cost.
For distillation in [Section 4.2](https://arxiv.org/html/2607.00714#S4.SS2), we use ELF-B as the teacher model. To ensure tokenizer consistency with the baselines, we train a variant of ELF-B using the GPT-2 tokenizer. Specifically, we replace its original T5 text encoder (Raffel et al., [2020](https://arxiv.org/html/2607.00714#bib.bib41)) with the last hidden states of a pretrained GPT-2 Large model (Radford et al., [2019](https://arxiv.org/html/2607.00714#bib.bib29)). Aside from this modification, we strictly follow the original ELF architecture and train this variant using the identical hyperparameters.
We follow the hyperparameter settings of the original ELF model\. Both models are trained for 5 epochs with a global batch size of 512 using 8 NVIDIA B200 GPUs\. We employ a combination of Muon optimizers with a peak learning rate of 0\.002, incorporating a learning rate warmup over the first 0\.5 epochs\. Additionally, we apply an exponential moving average \(EMA\) with a decay rate of 0\.9999 and use the EMA checkpoint for inference\.
#### Inference\.
To report the gPPL and entropy for the two- and four-step FMLM⋆ ([Tables 2](https://arxiv.org/html/2607.00714#S4.T2) and [3](https://arxiv.org/html/2607.00714#S4.T3)), we use $\gamma$-sampling (Lee et al., [2026](https://arxiv.org/html/2607.00714#bib.bib2); Sabour et al., [2026](https://arxiv.org/html/2607.00714#bib.bib42); Kim et al., [2024](https://arxiv.org/html/2607.00714#bib.bib40)) with $\gamma = 0.75$ and $1.0$, respectively.
## Appendix C Qualitative results
We provide generation samples from the FMLM⋆ model. The one-step, two-step, and four-step samples are shown in [Figure 6](https://arxiv.org/html/2607.00714#A3.F6), [Figure 7](https://arxiv.org/html/2607.00714#A3.F7), and [Figure 8](https://arxiv.org/html/2607.00714#A3.F8), respectively.
Sampling Steps: 1, gPPL: 92\.37, Entropy: 5\.27

are related to increasing the short\-term size of the water cover, or the temperature level of which can promote environmental change impacts\. These waters and major major agricultural agents may be responsible for temperature increase in the ocean\. But the increase of warming from low temperatures are also increasingly increased because of the long\-term temperature increase of glaciers\.More people living in developing countries such such as South Africa and South America, say their more short\-term activities have to have significant impacts on the world environmental systems\.According to most\-relevant findings, the tropics vegetation crops are the presence of seurane vegetation, the south of the St\. Lawrence, the continent’s estuary, according to data reported by Nor se researchers of the Japanese\-1990 era and the current the meeting of \[1ehorn\. And their studies also show that through recent global past activities has caused some 7 million people to water increases through from 20 to 2000 from U\.S\. 2008, 10 million years, including, and fishing\.Sea waters have a major part in the long\-term climate gases due to climate disturbances,, in order to counteract this increase on marine water water\.A significant increase in moreaching, Orests in the late African Sea through the the Mediterranean Sea, may have impacts on water ecosystems, the presence is a major source of culprits such as marine whales\. The outcome could cause much greater water levels \[20\]\. Water in Southern countries and other contributions to water by the rise in the U\.S\. could also lead to a broader extent of the world environmental system, as in Jojal;’le reported in Sept\. \[ 2004\. 
Station areas in wetlands could cause an increase by increasing water usage, by by drying up access to aater \[23 and environmental degradation by by developing improved water quality from floods, the extent of water’s rocks and also water\.To clarify, the specific environmental impacts of Nij’ water over the year will have a significant influence on N&es and H\. says, which indicates that as the water impacts of freshwater through surrngs are particularly promising\.Yet the broader implications of global energy policies may also have large effects on large\-term environmental impacts from climate sources\. Note that the possibility of a share of s oil\-based fuels will emerging from Earth’s water into the mid\-2000 era\. Theort studies are also more than five studies\.We suggest that the fish diets will increase temperature increase and warmth in the 20 years to come, at the range of results, showing the benefit of lower energy approaches for all people studied, Jo’en says\.A fish practice was rated at about 5\.5 percent for people who use fish for others, while environmental effects estimates are very poor\. It is noteworthy, we included evidence of fitness for the long term studies\.Additionally, the researchers found that fish feeding caused increase in temperature in adults, due the increases in water, in addition to a decrease in rainfall in tropical populations, a much higher over the 2010 season\. The growth rate observed among Y fish in individuals increased 30 percent in temperature and temperature by increasing through the longest after consuming more than a month among participants, and a greater increase occurred on a longer\-term regimen\. Other studies also indicated that the UV diet and water are likely to reduce conditions in climate conditions for the O\-species\. 
In addition, according to the original studies showed that these warming does will be involved in the increasing value of climate levels\.In addition, the long\-term study has been turned out to reflect somewhat in pollutants \(H’an, G\. 13\. ”Based on understanding that there are differences in the amount of products being delivered by reulture,”27\. Future efforts to mean for heavy water can be improved the health in relative to marine programs \(20, 32\)\.Therefore, we may also observe an important use of interventions in water health each year in populations, including the efforts to ensure short\-term temperatures\. Thusful feeding is encouraged along with temperature gain increases in marine water\.Some studies over water diets increase the value of people’ energy by as they cause long term changes, including David Eich \(34 and 32\)\. Mediterranean Farmers\. Subrain diets should increase by rate of 10 percent over the next five years \(see F\. 32\)\.Climateforestationing reducing water water levels is a significant threat to the nature of B\- and water systems\. However, the research also indicate that the increasing loss of consumption of refre fish in nature is also likely sustainable \(28, 34\)\.On the other hand, we suggest that fish rewulture use fuels in the \(louific climate, together by reducing the abundance of water and water nutrients that they expect an a decrease in water\-use rates\. Because out of course, fish crops may increase the number of low\-income individuals and cardiovascular disease conditions, emphasizing with populous populations in the Americas

Figure 6: A sample generated by FMLM⋆ with one\-step decoding\.

Sampling Steps: 2, gPPL: 69\.28, Entropy: 5\.42

go forward\. We have done a lot in working with, but we have all the the best of bringing in customers and we in the right people to prove our goals and show how easy it is to be\.The BitcoinFresh team is efficient and also the workhorse of the company\. 
So all this is a company where you can spend anything and countless dollars right out there\. Check up with usJared Upton, Cofounder,At the last time we’ve been surrounded by quite a bunch of people we’ve long been with and loved, and have begun to work on another project\. We have thanked all the customers who wanted to share awareness of our existence within…Unfortunately I couldn’t wait to give so much gratitude to the amazing all of people who have felt so incredibly satisfied putting their focus on how to be part of the Bitcoin Company\. I most of can’t wait to put the rest of time focusing on seeing two people working on us, and giving out our own product\.Jared Upton,WeTell you, it’s been almost four\-years to get this up, but we’ve always had to work quickly\. We tried to do a huge amount to get for the buildings\. Any single look that we want an event comes up early in the back air so we don’t even turn our screen again\.I just said you were excited about it\. We have over $1000\. This is a challenge, but there’s a well trained team\. Our long\-term goal is to get back onto the floor\. So if you want to be inspired, please sign in\. \(Credit: anbyl / Flickr PhotographyIf you read The League games called Hearthstone, it’s seems that Longland’s partner e\-Sport Games is just about to start an its release of The Porsche Racer on Friday 2017\. But it actually had a lot of chatter within two weeks as too, with a lot of rumors appearing this week around– and more than hundreds of people have already played it in the world\. Now that’s all we’ll decide what it does at the end of the coming year\.The Flying GT is a very mysterious title, but got the facts going before the game was released, and the problem is what we’re looking to see\. 
The Street Road car’s air tires are fan\-loaded at $70degF, allowing for extreme maneuvers across the course, requiring less than 30 seconds to drive the car\.Firsting note, both cool details and details show how the team of this actually helps you\. So, we’re working to update it to everyone else\. We’ll see more of the full news\.Like the upcoming Flying GT website, it will release a full review early today, stay kind for you around the world\. It’s also a real enthusiast project, so we can also keep mind some of the important back points before ready to get in with this Porsche car that we play in 2017\.Importantly, it’s not surprising that Muzway will have a high–able micro card at home\. If you also have a serious road driving experience at 27, then it’s one of the best cards I’ve seen\. You can also track these builds on social networks and Twitter, and Twitter, PlayStation Messenger, and Facebook\.Due a near\-native chipset is included for South Road, it will be an co\-ear, much improved controller for both Android and Linux\.Elreo 20x is a new social game from Sanway Entertainment and large\-mono company that allows dynamic movement changes\. When available on the website, these maps path\-siveting is persistent arenas for multiplayer matches\.As with player gaming\. 
As it did with the previous games, teams include a very large array of co\-op features to replicate the individual mechanics of South Formula\.San Francisco itself received around 100 points after release – which is isn’t a good thing for high games, as it seems, and it includes a wide variety of features for the Go game\.Just slipping right off the edge of your car by causing obstacles, making a car mow without trying to push it down an upwards path, or take an whister towards you to overtake it\.O…The fact that you can do things from inside it the Horizon’s map is another instance of engagement, and even on\-screen sessions bring creating a much greater sense of investment on player players in a period when tense online battles\. As a result, contribution is extremely high indeed, and the high experience per population on each map will lower considerably\. As a result of Horizon Racing, we decided in order to develop some individual multiplayer features so that it’s very easy to realize will also be a tougher challenge as the dates build up\.This is what we want to see; it

Figure 7: A sample generated by FMLM⋆ with two\-step decoding\.

Sampling Steps: 4, gPPL: 56\.60, Entropy: 5\.38

the same reason, Prostars continue to be enthusiastic\. But for building a huge brand, getting us back in business is a little back burner, but we have certainly worked all of the same people together\. Prostars and Adidas will continue to make the mark in basketball and motor racing\. And unfortunately, their career has been continually disrupted, and that strength could be greatly improved if just it wasn’t the answer\.So what we’re asking for is game knowledge\. Because NASCAR is a talented team and the wealthy, you just give out the money is ready to work for\.What we want to tell you is know that you will be far better\. 
The NASCAR’s Strategy Course takes 30 minutes per week, looking for a way to make your truly know how to get some way up\.This ability to go fast will take your mind a lot of excitement over the next year\. What is important about this course is you can improve your ability and get you back to the top\. This will only take time to take the end of your life\. It’s one awesome to see the course so often that you are going to take the time to care to your concerns, to bring your brand alive to the public\.For comparison, Pro Sports is fairly awesome\. But it’s also a real source of attention, the big video\. You are demonstrating it, putting your number on the wall – it’s going to be the difference to your life, making that point, and the results of surviving the financial crisis could have changed the end point of your career…Unfortunately, we also know your mindset is no not who you are\. I’ve never been able to be all done due to company updates\. Even if you sit down with your team when thinking about reaching the position you are intended to reach,, it can be helpful to talk with someone you think you like\. Hopefully he can change his company mindset, think about getting down, and scan his eyes, and try to see how things are\. While it’s quite easy to get yourself back off the top of ground, think at the mindset and resources that you have about how to succeed…If you really think about having all the hard efforts to keep it up and your ability to pursue it as well as possible, you would find many people are assuming the same\. My peers realize, after all, there are so many so few people who know how to work well in their life\.ShowbacksPrior to this project, it’s quite different for my own company and it was a difference for me\.I want to mention that you have already put a lot of attention along with the efforts you’ve done for NASCAR Marketing, so asked, after deciding\. Can you help go even further?I have fought with this goal every since\. 
I’ve been at this good gaming park for years, especially when that finally comes to learn how to keep with the company’s long goals\.With focus on public finance, we can continue to pursue the other type of things we need \- making ourselves more active, through what we are able to do, and help us get going higher again\.There are so many ways of doing better rather than great anymore\.To keep things simple, there are plenty of articles to talk about\. But in the end, that doesn’t be something you would necessarily want to see, but I suggest you make serious comments\.In an effort to discuss the relief and recovery problems, of those men who have been able to have heroin treated this year, has published one of the biggest Day of Fame posts, a website specifically for people in America\.One of those she posted was co\-churchist Father John S\. O\-Salgar who works at New York and New Anne, St Pierre and Miel Univ\., in TOWN\. He talked about his given fortune helping him deal with obesity problems, booze still hit a huge sales high\.”The of my goal to the day is putting the money that help to stay here and work in these issues among the people\. A lady who was walking lazy and then came out to help in knowing about several relevant issues and get a job,” Leanger wrote\.The post came from David Leber who served for A\.A in a Connecticut plane and then worked as an aviation flight demonstrator in the United States, 35\-year\-old man who was removed from the Connecticut Air Company’s job after being charged with drug possession for a prior three years, felony drug possession and possession, and a Rolled Euler in Florida\.Salanger quickly hoped he could get back his money as he also had to sign an introductory fee of $200 a month, showing them he worked on hormone monitoring and protecting his body against inflammation\. That year he felt that money was not out of hand\.Devin

Figure 8: A sample generated by FMLM⋆ with four\-step decoding\.