CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
Summary
CRoCoDiL proposes a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving superior generation quality and 10x faster sampling speeds compared to discrete methods like LLaDA.
# CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
Source: https://arxiv.org/html/2603.20210
Omer Belhasin, Itay Levy, Akhiad Bercovich, Ran El-Yaniv, Ran Zilberstein, Michael Elad, NVIDIA
###### Abstract
Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation but often struggle with token dependencies and semantic incoherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose **CRoCoDiL** – Continuous and Robust Conditioned Diffusion for Language – a unified fine-tuning approach that jointly trains an encoder–demasker architecture, grounding the MDM demasking in continuous latent representations. This leads to the formation of a novel autoencoder in which the decoding is obtained by an MDM algorithm. Relying on the same framework, we proceed by introducing two **unconditional** text synthesis algorithms: Continuous-Then-Discrete (**ConThenDisc**), a hybrid-diffusion approach that first generates latent representations in continuous space and then decodes these to tokens via an MDM, and Continuous-Within-Discrete (**ConWithinDisc**), a multi-diffusion strategy that refines latent representations throughout the discrete sampling process. Experiments using LLaDA show that our methods achieve superior generation quality and more than ×10 faster sampling speeds in an unconditional setting.
Machine Learning, ICML
## 1 Introduction
Diffusion-based alternatives to autoregressive large language models have been drawing much attention recently (Li et al., 2022; Yi et al., 2024). Such methods hold the appealing potential to break the causal, one-token-at-a-time paradigm of autoregressive machines, with the general hope of achieving faster and higher-quality text synthesis. The main challenge in bringing diffusion models to text is the evident gap between the continuous formulation of classical diffusion algorithms and the discrete nature of language (Lou et al., 2024). Earlier work addressed the discrete-continuum gap in a wide variety of ways; among these, the commonly used ones are based on **masked** diffusion models (Sahoo et al., 2024; Nie et al., 2025; Ye et al., 2025). This family of algorithms relies on a forward degradation process that masks tokens gradually until the whole sequence is masked out. Text generation is based on a reversed process, in which a demasker iteratively revives tokens, constituting the **Masked Diffusion Models** (MDMs), such as MDLM (Sahoo et al., 2024), LLaDA (Nie et al., 2025), Dream (Ye et al., 2025), and their many followups (e.g., Arriola et al., 2025a; Wu et al., 2025; Liu et al., 2025b, c).
MDMs rely on a demasking model that is trained on partially masked sequences to estimate discrete logits for the missing tokens, representing one-dimensional marginal distributions that lack information on statistical cross-dependencies between tokens. When sampling from these logits, revealing multiple tokens in parallel necessarily produces flawed samples that degrade generation quality (Liu et al., 2025a). Nevertheless, as synthesis speed depends on parallel token sampling, existing algorithms must trade speed against quality. Another, related yet different, weakness of MDM algorithms has to do with their core **modus operandi**: the generated text is constructed by sequentially sampling individual tokens (separately or jointly), each of which is committed to be part of the final sequence. While this is appealing due to its resemblance to the autoregressive strategy, MDMs have no global guidance to drive the overall synthesis, and therefore necessarily struggle to form coherent final sentences.
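To make the marginal-sampling problem concrete, the following is a minimal sketch built on a hypothetical two-token toy distribution (the token pairs and probabilities are illustrative assumptions, not data from the paper). It computes how much probability mass independent per-position sampling places on token pairs that the joint distribution never produces.

```python
import itertools

# Hypothetical toy joint distribution over two masked positions (an assumption
# made for illustration only): just two valid completions exist.
joint = {
    ("New", "York"): 0.5,
    ("Los", "Angeles"): 0.5,
}

first_tokens = sorted({a for a, _ in joint})
second_tokens = sorted({b for _, b in joint})

# Per-position marginals: the only information a factorized demasker output
# (one categorical distribution per position) can represent.
p_first = {a: sum(p for (x, _), p in joint.items() if x == a) for a in first_tokens}
p_second = {b: sum(p for (_, y), p in joint.items() if y == b) for b in second_tokens}

# Mass that independent sampling from the marginals assigns to pairs with
# zero probability under the joint (e.g. "New Angeles").
invalid_mass = sum(
    p_first[a] * p_second[b]
    for a, b in itertools.product(first_tokens, second_tokens)
    if (a, b) not in joint
)

print(f"Mass on invalid pairs under independent sampling: {invalid_mass:.2f}")
```

In this toy case the marginals are individually exact, yet revealing both positions in one step still yields an invalid pair about half of the time; revealing one token first and re-predicting the other avoids the issue at the cost of an extra step.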
In this paper we propose a novel extension to MDMs that addresses these limitations. Our approach operates in the continuum, using a continuous diffusion model to generate sentence-level semantic representations, while the MDM algorithm serves as a decoder translating these latent vectors into token sequences. This way, the burden of capturing long-range, cross-token structure is shifted to a lightweight classical diffusion in the latent space. This representation is then used to guide the MDM for token decoding, enabling effective multi-token sampling per step and yielding better efficiency-quality tradeoffs in text synthesis. We name this methodology **CRoCoDiL**: Continuous and Robust Conditioned Diffusion for Language.

Building on this framework, we introduce a unified encoder-demasker training scheme that encodes sequences into latent representations for effective token decoding. We then present two text synthesis algorithms: (1) Continuous-Then-Discrete (**ConThenDisc**) that generates embeddings via continuous diffusion and uses MDM to decode the latent vector into tokens; (2) Continuous-Within-Discrete (**ConWithinDisc**), that updates the guidance vector during the demasking steps using a continuous diffusion trained to recover valid latent vectors from partially masked sequences. We emphasize that the proposed algorithms are focused on unconditional text generation, leaving conditional synthesis across benchmarks for future work.
We conduct an extensive experimental study using LLaDA-8B (Nie et al., 2025) as the base MDM and Qwen-embedding-0.6B (Ren et al., 2025) as an initial encoder, all jointly retrained with our encoder-demasker framework. We first validate the effectiveness of the continuous guidance for MDM by autoencoding, demonstrating faithful reconstruction. We then evaluate our two proposed algorithms for unconditional text synthesis, showing that our methods achieve superior generation quality and more than ×10 faster sampling.
To summarize, the following are the main contributions of this work, as depicted in Figure 1:
- We propose **CRoCoDiL**, a framework that guides discrete MDMs using a continuous, sentence-level semantic guidance, bridging the gap between global coherence and local token dependencies, thus enabling faithful parallel token sampling.
- We introduce a general-purpose autoencoder that accurately maps sequences to the continuum and back, leaning on MDM as a decoder.
- Consequently, two text synthesis algorithms are proposed: **ConThenDisc** and **ConWithinDisc**, both of which shift the core generative process into a continuous sentence-level semantic space that serves as a global sketched guide for an MDM.
- We demonstrate superior generation quality and sampling speed with LLaDA, with significant gains in the unconditional text generation setting.
## 2 Related Work
In Appendix A we provide a broad overview of the field of diffusion models for text generation. In this section we dive into specific recent work that has direct relevance to this paper's contributions.
The work reported in Meshchaninov et al. (2025) presents COSMOS, a language generation algorithm that relies on a continuous latent space diffusion. While similar to the main theme of our work, COSMOS differs from it substantially. In particular, the decoder that converts the embedding to tokens in COSMOS has no generative capabilities, which implies that the latent representation must be fully informative in order to enable proper text synthesis. In contrast, our latent representation serves as a sketch guide that conditions an iterative MDM-based decoding process, and thus even partially informative representations can lead to valid and high-quality generated text, as the MDM complements and refines the synthesis process. Indeed, in the spirit of the main contrast between COSMOS and our paradigm, the work of Morris et al. (2023) argues that when using embedding representations, decoding must be performed iteratively rather than in a single step, which supports our proposed fusion of continuum and MDM. That said, Morris et al. (2023) is distinct from our work as it focuses on text correction tasks rather than text generation.
Another related work is reported in Arriola et al. (2025b), presenting an autoencoding framework, referred to as E2D2. Under a conditional synthesis setup in which the model receives a prompt and is required to provide an answer, E2D2 encodes the prompt to a continuous vector and uses it to guide a fully discrete MDM decoder that constructs the response. As the synthesis of the answer relies on a plain MDM, the statistical cross-token dependencies are not taken into account – a problem that we tackle in this work.
The algorithms reported in Liu et al. (2025a), Xu et al. (2025), and Xie et al. (2025) tackle the problem of joint token sampling in MDM, as in our work. The first handles the missing dependencies by incorporating a copula model, the second augments the demasker with a learned energy model, and the third introduces a Gaussian-distributed latent variable to account for the token dependencies. All three concentrate on small-scale base models, offering improvements in text synthesis speed or quality.
A related yet different line of reasoning towards the very same goal appears in Azangulova et al. (2025) and Luxembourg et al. (2025), presenting inference-only strategies that prioritize the order in which tokens are unmasked, so as to avoid sampling strongly dependent tokens jointly. These methods are inherently limited, as they seek weakly correlated tokens, which do not necessarily exist. In addition, these inference algorithms are tightly coupled with their base models, operating semi-autoregressively with small block sizes, thus limiting their achievable gain.
In contrast to the above, our work aims to fully harness the potential of diffusion models for language, aiming to overcome the speed and text-quality barriers of MDM. This is achieved by injecting informative guidance into the MDM, which both helps it handle cross-token dependencies and provides a synthesized sketch of the text to be generated.
## 3 Problem Formulation and Background
Let **x** = (x¹, x², …, xⁿ) be a discrete random vector of n tokens, where each xⁱ belongs to a vocabulary V. We assume text sequences are sampled from an unknown joint data distribution q_data, and our objective is to learn a generative model capable of synthesizing samples from q_data. Following recent work on discrete diffusion methods, we adopt the masked diffusion modeling (MDM) (Sahoo et al., 2024) framework. We augment the vocabulary with a special mask token [M] and define the fully masked vector as **m** = (m¹, m², …, mⁿ), where mⁱ := [M] for all i.
The generative algorithm begins with a forward diffusion process that gradually degrades a clean sequence. In MDM, this occurs via progressive masking, factorized across tokens as
q(**x**_t | **x**_0) = ∏ᵢ₌₁ⁿ q(x_t^i | x_0^i), (1)
where each q(x_t^i | x_0^i) defines an independent categorical corruption process interpolating between a clean sample **x**_0 ~ q_data and the masked vector **m**:
q(x_t^i | x_0^i) := α_t **e**_{x_0^i} + (1 − α_t) **e**_{[M]}. (2)
Here, α_t ∈ [0,1] is a strictly decreasing noise schedule over time t ∈ [0,1], with α_0 ≈ 1 and α_1 ≈ 0. The notation **e**_j denotes the one-hot encoding of the j-th vocabulary index.
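The forward process of Eqs. (1)-(2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear schedule α_t = 1 − t, the integer `MASK_ID`, and the function names are all assumptions introduced here.

```python
import numpy as np

MASK_ID = -1  # hypothetical integer id standing in for the [M] token


def alpha(t: float) -> float:
    """Assumed linear noise schedule: alpha_0 = 1, alpha_1 = 0, strictly decreasing."""
    return 1.0 - t


def forward_mask(x0: np.ndarray, t: float, rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) as in Eqs. (1)-(2).

    Each token is kept with probability alpha_t and replaced by [M] otherwise,
    independently across positions.
    """
    keep = rng.random(x0.shape) < alpha(t)
    return np.where(keep, x0, MASK_ID)


rng = np.random.default_rng(0)
x0 = np.array([11, 42, 7, 99, 3, 58])  # toy token ids
x_half = forward_mask(x0, 0.5, rng)    # roughly half the tokens are masked on average
print(x_half)
```

Because the corruption is factorized across positions, the expected fraction of masked tokens at time t is simply 1 − α_t.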
Generative sampling is achieved by reversing the above-described forward process. For any pair of time-steps 0 ≤ s < t ≤ 1, the reverse transition p_θ(**x**_s | **x**_t) is parameterized by a demasker network f_θ that predicts logits for the tokens. The reverse process is typically factorized as:
p_θ(**x**_s | **x**_t) = ∏ᵢ₌₁ⁿ p_θ(x_s^i | **x**_t), (3)
where p_θ(x_s^i | **x**_t) is a categorical distribution defined by the network's predicted logits.
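A single factorized reverse step in the sense of Eq. (3) might look like the sketch below. Several pieces are assumptions not stated in the text above: the linear schedule α_t = 1 − t, the reveal probability (α_s − α_t)/(1 − α_t) commonly used in MDLM-style samplers, and `toy_demasker`, a random stand-in for the trained network f_θ.

```python
import numpy as np

MASK_ID = -1    # hypothetical integer id standing in for the [M] token
VOCAB_SIZE = 8  # toy vocabulary size


def toy_demasker(x_t: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for f_theta: returns logits of shape (n, VOCAB_SIZE)."""
    return rng.normal(size=(len(x_t), VOCAB_SIZE))


def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def reverse_step(x_t: np.ndarray, s: float, t: float, rng: np.random.Generator) -> np.ndarray:
    """One transition x_t -> x_s (0 <= s < t <= 1), factorized over positions.

    With the assumed schedule alpha_t = 1 - t, a masked position is revealed
    with probability (alpha_s - alpha_t) / (1 - alpha_t) = 1 - s / t.
    Already revealed tokens are kept unchanged.
    """
    probs = softmax(toy_demasker(x_t, rng))
    reveal_prob = 1.0 - s / t
    x_s = x_t.copy()
    for i in np.flatnonzero(x_t == MASK_ID):
        if rng.random() < reveal_prob:
            # Each token is drawn from its own marginal, independently of the others.
            x_s[i] = rng.choice(VOCAB_SIZE, p=probs[i])
    return x_s


rng = np.random.default_rng(0)
x = np.full(6, MASK_ID)
for t, s in [(1.0, 0.5), (0.5, 0.0)]:
    x = reverse_step(x, s, t, rng)
print(x)
```

Note that every revealed token is sampled from its own categorical distribution, so tokens revealed in the same step cannot coordinate with one another.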
### 4.3.2 Continuous-Within-Discrete
A delicate weakness (and thus an unexploited opportunity) in Algorithm 2 is the fact that the guidance is kept fixed throughout the T iterations, even though the sequence **x**_t is available, giving additional yet partial information about the text to be created. The **ConWithinDisc** algorithm aims to leverage this opportunity, by updating the guidance vector within the MDM steps. More specifically, within each demasking step, the guidance vector can be updated by drawing from the conditional distribution
**z**_0 ~ P(**z** | h_φ(**x**_t)).
In words, the guidance vector is sharpened to take into account the current intermediate sequence **x**_t. Algorithm 3 provides a description of this variant, and Figure 3 presents ConThenDisc and ConWithinDisc, highlighting their difference.
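The control flow of guidance refreshing can be sketched roughly as below. This is not Algorithm 3 itself: `encode`, `sample_guidance`, and `guided_demasker` are hypothetical toy stand-ins for the encoder h_φ, the conditional continuous diffusion sampler, and the guided demasker, and the `refresh_every` parameter and the equal-share unmasking rule are assumptions made for illustration.

```python
import numpy as np

MASK_ID, VOCAB_SIZE, LATENT_DIM = -1, 8, 4  # toy, hypothetical sizes


def encode(x_t: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for h_phi applied to a partially masked sequence."""
    return np.full(LATENT_DIM, np.mean(x_t != MASK_ID))


def sample_guidance(cond: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for drawing z_0 ~ P(z | h_phi(x_t)).

    A real system would run a conditional continuous diffusion here.
    """
    return cond + 0.1 * rng.normal(size=LATENT_DIM)


def guided_demasker(x_t: np.ndarray, z0: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical guided demasker: per-position token probabilities, shape (n, VOCAB_SIZE)."""
    logits = rng.normal(size=(len(x_t), VOCAB_SIZE))
    logits[:, :LATENT_DIM] += z0  # toy way of letting the guidance bias the logits
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def con_within_disc(n: int, num_steps: int, refresh_every: int, rng: np.random.Generator):
    """Demask n tokens in num_steps steps, redrawing z_0 every refresh_every steps."""
    x = np.full(n, MASK_ID)
    z0 = sample_guidance(encode(x), rng)
    refreshes = 0
    for step in range(num_steps):
        if step > 0 and step % refresh_every == 0:
            z0 = sample_guidance(encode(x), rng)  # sharpen guidance using current x_t
            refreshes += 1
        probs = guided_demasker(x, z0, rng)
        masked = np.flatnonzero(x == MASK_ID)
        # Reveal an equal share of the remaining masked tokens at each step.
        k = int(np.ceil(len(masked) / (num_steps - step)))
        for i in rng.choice(masked, size=k, replace=False):
            x[i] = rng.choice(VOCAB_SIZE, p=probs[i])
    return x, refreshes


tokens, num_refreshes = con_within_disc(n=8, num_steps=4, refresh_every=2, rng=np.random.default_rng(0))
print(tokens, num_refreshes)
```

In this sketch, setting `refresh_every` to a value of at least `num_steps` leaves the guidance fixed after the initial draw, which mirrors the fixed-guidance behavior described for Algorithm 2.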
A few comments are in order:
(i) The update of **z**_0 can be done in a pre-selected subset of the overall T steps, in order to benefit from the improved guidance while reducing the overall complexity of the generative algorithm;
(ii) In drawing the guidance vector, the conditioning we present leans on the **embedding** of the partially masked sequence **x**_t, i.e., **z**_0 ~ P(**z** | h_φ(**x**_t)). Alternatively, we could have conditioned the distribution directly on **x**_t;
(iii) In training the conditional diffusion (Algorithm 4), we use h_φ(**x**_t) for embedding the partially masked sentence. However, this encoder was not trained for such masked content. An improved strategy would be to define a second encoder h_μ(**x**_t) specifically trained to embed partially masked sequences.