Diffusion Language Models: An Experimental Analysis

arXiv cs.AI 06/20/26, 04:00 AM Papers
Summary
A systematic experimental analysis evaluating eight state-of-the-art Diffusion Language Models across multiple benchmarks, analyzing trade-offs between generation quality and computational efficiency.
arXiv:2606.19475v1 Announce Type: new Abstract: Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.
Cached at: 06/20/26, 02:30 PM
# Diffusion Language Models: An Experimental Analysis
Source: [https://arxiv.org/html/2606.19475](https://arxiv.org/html/2606.19475)

Copyright for this paper by its authors\. Use permitted under Creative Commons License Attribution 4\.0 International \(CC BY 4\.0\)\.

Twelfth Italian Conference on Computational Linguistics \(CLiC\-it 2026\), September 14–16, 2026, Palermo, Italy

Authors:

Corresponding author \(ORCID 0009\-0005\-3638\-7862, 282884@studenti\.unimore\.it\)

Davide Bucciarelli \(ORCID 0009\-0002\-9652\-8311, davide\.bucciarelli@unimore\.it\)

Leonardo Zini \(ORCID 0009\-0003\-9439\-9867, leonardo\.zini@unimore\.it\)

Marcella Cornia \(ORCID 0000\-0001\-9640\-9385, marcella\.cornia@unimore\.it\)

Lorenzo Baraldi \(ORCID 0000\-0001\-5125\-4957, lorenzo\.baraldi@unimore\.it\)

\(2026\)

###### Abstract

Large Language Models \(LLMs\) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks\. Recently, Diffusion Language Models \(DLMs\) have emerged as an alternative paradigm that generates text through iterative denoising rather than next\-token prediction, allowing parallel refinement of entire sequences\. While numerous diffusion\-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade\-offs they offer\. In this work, we present a systematic experimental analysis of modern DLMs\. Specifically, we evaluate eight state\-of\-the\-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency\. Beyond downstream evaluation, we analyze the impact of key inference\-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large\-scale experiments with controlled comparisons of smaller models trained under identical conditions\. Our analysis highlights the strengths and limitations of diffusion\-based language modeling across different tasks, architectures, and inference budgets\. We show that the behavior of DLMs is strongly influenced by generation\-time design choices, leading to distinct trade\-offs between performance and computational efficiency\. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs\.

###### keywords:

Diffusion Language Models, Experimental Analysis, Large Language Models, Diffusion Models, Non\-Autoregressive Models

## 1 Introduction

Large Language Models \(LLMs\) are predominantly based on autoregressive generation \[grattafiori2024llama,yang2025qwen3,radford2019language,team2024gemma\], where text is produced sequentially, one token at a time\. While highly effective, this paradigm imposes a strict left\-to\-right dependency during inference, limiting opportunities for parallel generation and global refinement of generated content\. These limitations have motivated the exploration of alternative generation paradigms that can leverage parallelism and iterative refinement during decoding \[leviathan2023fast,chen2023accelerating\]\.

Diffusion Language Models \(DLMs\) have recently emerged as a promising alternative \[arriola2025block,arriola2026encoder,sahoo2024simple,sahoo2025diffusion,ye2025dream,zhu2025llada,nie2026large\]\. Instead of generating text token\-by\-token, DLMs formulate generation as an iterative denoising process that progressively transforms a corrupted sequence into coherent text\. Inspired by the success of diffusion models in domains such as image, video, and audio generation \[dhariwal2021diffusion,rombach2022high,esser2024scaling,kong2020diffwave\], a growing body of work has adapted diffusion techniques to language, first through continuous latent representations and more recently through discrete token\-level formulations\. These approaches offer several attractive properties, including bidirectional context modeling, parallel token refinement, and the ability to naturally support tasks such as infilling, editing, and globally constrained reasoning\.

The rapid development of diffusion language modeling has produced a diverse ecosystem of architectures\. Recent proposals span fully discrete diffusion models, hybrid encoder\-decoder formulations, and block\-diffusion approaches that combine autoregressive conditioning with local diffusion\-based refinement\. While these models have demonstrated increasingly competitive performance, their evaluation remains highly fragmented\. Individual works are often assessed on different benchmarks, using distinct generation budgets, sampling schedules, context lengths, and inference configurations\. As a result, comparing results across papers is difficult, and it remains unclear whether reported gains stem from architectural improvements or from differences in evaluation protocols\. Furthermore, one of the defining characteristics of diffusion\-based language generation is the explicit trade\-off between quality and computational cost\. Unlike autoregressive models, whose generation procedure is largely fixed, diffusion models expose several inference\-time parameters \(*e\.g\.*, the number of denoising steps, sequence length, block size, and unmasking schedules\) that directly influence both performance and efficiency\. Despite the importance of these factors, their impact has not yet been systematically characterized across modern diffusion architectures\.

To address this gap, we present a comprehensive experimental analysis of state\-of\-the\-art DLMs\. We evaluate representative pure diffusion and block\-diffusion architectures under a unified protocol and compare them against strong autoregressive baselines\. Our study examines performance across general knowledge \[gema2025we,hendrycks2020measuring\], reasoning \[cobbe2021training,ye2025beyond,zellers2019hellaswag\], coding \[chen2021evaluating,austin2021program\], and machine translation \[bojar2016findings\] benchmarks, while explicitly analyzing the trade\-offs between quality and computational efficiency\. Beyond downstream evaluation, we conduct controlled scaling experiments to quantify the effects of denoising schedules, sequence lengths, block sizes, and parallel unmasking strategies on model behavior\. We also investigate the practical computational requirements of these architectures, providing insights into their memory consumption and inference costs under different generation settings\.

Our contributions can be summarized as follows: \(i\) we provide a unified evaluation of modern DLMs across a diverse set of downstream benchmarks, enabling direct comparison between competing architectural paradigms; \(ii\) we systematically analyze the quality–efficiency trade\-offs induced by key diffusion hyperparameters, including denoising steps, context length, block size, and unmasking ratios; \(iii\) we complement large\-scale downstream evaluations with controlled small\-scale experiments, allowing us to study architectural properties through perplexity and scaling analyses under identical training conditions; \(iv\) we provide a comparative analysis of the computational requirements of DLMs, reporting peak memory usage and floating\-point operations for both single forward passes and complete generation, highlighting the practical deployment trade\-offs between pure diffusion and block\-diffusion architectures\.

## 2 Background

### 2\.1 Autoregressive Language Modeling

Autoregressive language models define a probability distribution over token sequences by factorizing the joint distribution into a product of conditional distributions:

$$p(x_1,\ldots,x_T)=\prod_{t=1}^{T}p(x_t\mid x_{<t}). \tag{1}$$

This formulation enables straightforward maximum likelihood training and has scaled effectively with Transformer architectures, leading to strong empirical performance across a wide range of natural language processing tasks \[austin2021program,caffagni2024revolution,zini2026vhector,bucciarelli2024personalizing\]\. In particular, large\-scale autoregressive language models exhibit emergent zero\-shot and in\-context learning capabilities, making them the dominant paradigm for language modeling\.

Despite these strengths, autoregressive generation is inherently sequential: tokens must be generated one at a time conditioned on previous context\. This limits parallelism during inference and results in latency that grows linearly with sequence length\. Additionally, the left\-to\-right generation process requires models to commit to decisions sequentially, making it difficult to revisit earlier predictions once produced\. While these limitations have not prevented autoregressive models from achieving state\-of\-the\-art performance, they motivate the exploration of alternative generation paradigms that may offer improved efficiency or different quality–compute trade\-offs\.
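As a minimal sketch of the factorization in Eq\. \(1\) and of the sequential decoding it implies, the example below scores and decodes sequences with a toy bigram model\. The probability table and the function names are hypothetical and chosen only for illustration; a real language model conditions on the full prefix rather than on the previous token alone\.

```python
import math

# Hypothetical toy bigram model: P(next | previous). "<s>" marks the sequence start.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}


def sequence_log_prob(tokens):
    """Chain rule: sum of log p(x_t | x_<t). A bigram model only looks at x_{t-1}."""
    log_p, prev = 0.0, "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p


def greedy_decode(max_len=3):
    """Left-to-right decoding: one model lookup per generated token."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        if prev not in BIGRAM:
            break
        prev = max(BIGRAM[prev], key=BIGRAM[prev].get)
        out.append(prev)
    return out


print(math.exp(sequence_log_prob(["the", "cat", "sleeps"])))
print(greedy_decode())
```

Each decoded token requires the previous one, which is the source of the linear growth in latency discussed above\.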

### 2\.2 Diffusion as a Generative Paradigm

Diffusion models provide an alternative approach to generative modeling based on iterative denoising\. A data sample is gradually corrupted through a forward noising process, and a neural network is trained to reverse this process step by step \[esser2024scaling,bucciarelli2026tiny\]\. Generation is performed by starting from noise and iteratively refining the sample into a coherent output\. A key conceptual advantage of diffusion models is their ability to refine an entire representation in parallel at each denoising step\. Rather than progressively extending a partial output, the model repeatedly updates all components of the sample simultaneously, whether they correspond to image latents or textual tokens\. This global refinement process allows information to propagate across the entire representation throughout generation, potentially enabling more coherent outputs and greater parallelism than autoregressive decoding\. However, diffusion models introduce a different computational bottleneck: generation typically requires multiple denoising iterations\. As a result, inference efficiency depends not only on the computational cost of a single forward pass but also on the total number of refinement steps, creating an explicit trade\-off between generation quality and computational efficiency\.

### 2\.3 Continuous Diffusion for Language

Early attempts to apply diffusion models to language modeling extended continuous diffusion techniques from vision to text by operating in a continuous embedding space\. In these approaches, discrete tokens are mapped into continuous vector representations, and Gaussian noise is applied in this latent space\. A neural network is then trained to denoise corrupted embeddings\.

This design enables the reuse of standard diffusion mechanisms developed for continuous data domains and allows for parallel updates across the sequence\. However, several fundamental challenges arise due to the mismatch between continuous noise processes and discrete linguistic structure\.

First, language is inherently discrete, and perturbations in embedding space do not correspond to well\-defined symbolic transformations\. As a result, the denoising trajectory may traverse regions of the embedding space that do not map cleanly to valid or meaningful tokens\. Second, the final step of projecting continuous representations back into discrete tokens introduces additional quantization error and can degrade generation quality\. Finally, small perturbations in embedding space can induce disproportionately large semantic changes, making the denoising process difficult to stabilize\.

These issues suggest that continuous diffusion may be misaligned with the discrete nature of language, motivating approaches that define diffusion processes directly over token sequences\.
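The rounding problem described above can be illustrated with a small NumPy sketch\. It assumes a hypothetical random embedding table and the standard Gaussian forward process \(a scaled clean sample plus scaled noise\); it is not taken from any of the evaluated models\. Tokens are embedded, noised, and then projected back to the nearest embedding, and the fraction of recovered tokens is printed for several noise levels\.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50, 16
# Hypothetical embedding table; a real model would learn it.
E = rng.normal(size=(VOCAB, DIM))


def add_gaussian_noise(x0, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps


def round_to_tokens(x):
    """Project each continuous vector onto its nearest embedding (squared L2 distance)."""
    dists = ((x[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


tokens = rng.integers(0, VOCAB, size=12)
x0 = E[tokens]
for alpha_bar in (1.0, 0.9, 0.5, 0.1):
    recovered = round_to_tokens(add_gaussian_noise(x0, alpha_bar))
    print(f"alpha_bar={alpha_bar}: fraction recovered = {(recovered == tokens).mean():.2f}")
```

With no noise the projection returns the original tokens exactly; once noise is added, the nearest embedding may belong to a different token, which is the quantization error mentioned above\.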

### 2\.4 Discrete Diffusion Language Models

Discrete diffusion models address the limitations of continuous approaches by operating directly in token space\. Instead of adding Gaussian noise to embeddings, tokens are progressively corrupted through discrete stochastic processes, such as masking, token replacement, or categorical noise injection\. A neural model is trained to reconstruct the original sequence from these corrupted versions\.

Two common corruption strategies are *uniform* and *absorbing* diffusion\. In uniform diffusion, tokens are replaced with randomly sampled vocabulary items, gradually transforming the sequence toward a uniform distribution over tokens\. In absorbing diffusion, tokens are progressively replaced by a special absorbing state \(*e\.g\.*, a \[MASK\] token\), allowing the model to learn reconstruction from partially masked inputs\. These choices influence both the learning dynamics and the generation behavior of the resulting models\. In practice, absorbing diffusion has become the predominant approach in recent DLMs, as empirical studies have generally found it to provide stronger performance and more stable training dynamics than uniform corruption schemes \[austin2021structured\]\.

This formulation aligns naturally with language structure and removes the need for continuous\-to\-discrete output projection\. Moreover, discrete diffusion enables parallel prediction of multiple tokens during each denoising step, offering a potential efficiency advantage over autoregressive generation\.
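The uniform and absorbing corruption strategies described above can be sketched in a few lines\. The function names, the noise level `t` \(the per\-token corruption probability\), and the toy vocabulary are illustrative assumptions rather than the exact schedules used by any specific model\.

```python
import random

MASK = "[MASK]"


def corrupt_absorbing(tokens, t, rng):
    """Absorbing diffusion: each token independently becomes [MASK] with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def corrupt_uniform(tokens, t, vocab, rng):
    """Uniform diffusion: each token is independently resampled from the vocabulary
    with probability t (the resampled token may coincide with the original)."""
    return [rng.choice(vocab) if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
vocab = ["the", "cat", "dog", "sat", "on", "mat"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
for t in (0.0, 0.5, 1.0):
    print(t, corrupt_absorbing(sentence, t, rng), corrupt_uniform(sentence, t, vocab, rng))
```

In the absorbing case the corrupted positions are visible to the model as masks, whereas in the uniform case the model must also infer which tokens were replaced\.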

Like other diffusion\-based approaches, discrete diffusion models rely on an iterative denoising process at inference time, making generation quality and computational efficiency closely linked to the number of denoising steps performed\. While reducing the number of steps can accelerate generation, it may also affect output quality, motivating research into architectures and sampling strategies that better balance these objectives\. Despite this trade\-off, recent advances have shown that discrete diffusion models can achieve competitive performance across a variety of language generation tasks\.
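The link between the number of denoising steps and the number of model calls can be made concrete with a toy decoding loop\. This is a minimal sketch under stated assumptions: `toy_denoiser` is a hypothetical stand\-in that returns placeholder tokens with random confidences, and committing the most confident predictions first is only one of several possible unmasking strategies\.

```python
import math
import random

MASK = None  # sentinel for a masked position


def toy_denoiser(seq, rng):
    """Hypothetical stand-in for a trained model.

    Returns one (token, confidence) guess per position; the token is a
    placeholder string and the confidence is random.
    """
    return [(f"tok{i}", rng.random()) for i in range(len(seq))]


def generate(length, steps, rng):
    """Start fully masked and commit ceil(length / steps) tokens per model call."""
    seq = [MASK] * length
    per_step = math.ceil(length / steps)
    calls = 0
    while any(tok is MASK for tok in seq):
        preds = toy_denoiser(seq, rng)  # one parallel forward pass over the sequence
        calls += 1
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Keep the most confident predictions; the rest stay masked for later steps.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = preds[i][0]
    return seq, calls


rng = random.Random(0)
for steps in (16, 4, 1):
    _, calls = generate(16, steps, rng)
    print(f"steps={steps:2d} -> model calls={calls}")
```

Fewer steps mean fewer forward passes but more tokens committed per pass, which is where the quality cost of aggressive parallel unmasking can arise\.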

### 2\.5 Block and Hybrid Diffusion Approaches

To address the inference costs associated with iterative diffusion\-based generation, recent work has explored hybrid architectures that combine autoregressive and diffusion\-based modeling\.

Rather than generating a sequence token\-by\-token as in conventional autoregressive language models, these approaches partition the output into blocks of tokens\. Generation then proceeds autoregressively at the block level: each block is conditioned on previously generated blocks, preserving a causal structure that facilitates long\-range coherence and efficient context modeling\. Within each block, however, tokens are generated using a diffusion process rather than a left\-to\-right decoding procedure\. The model iteratively denoises the tokens of a block in parallel, allowing local refinement and reducing the sequential dependencies that characterize autoregressive decoding\. Consequently, block diffusion can be viewed as a compromise between the two paradigms: autoregressive generation provides global structure across blocks, while diffusion\-based generation enables parallel refinement within each block\.

By combining these mechanisms, block diffusion aims to achieve a more favorable quality–efficiency trade\-off than either paradigm alone\. Compared to autoregressive models, it reduces sequential decoding steps during inference, while avoiding full\-sequence iterative refinement typical of diffusion models\. Nevertheless, its effectiveness depends on the chosen block structure and denoising schedule, and the balance between autoregressive and diffusion components remains an active area of research\.
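The block\-level procedure described in this section can be sketched as two nested loops: an outer autoregressive loop over blocks and an inner denoising loop within each block\. The denoiser below is a hypothetical toy that simply reveals a growing share of the block and writes each position's absolute index, so the result is easy to verify; the function names and the fixed number of steps per block are assumptions of this sketch\.

```python
import math

MASK = None  # sentinel for a masked position


def toy_denoise_block(prefix, block, step, total_steps):
    """Hypothetical denoiser: reveals a growing share of the block at each step.

    It "predicts" the absolute position index of each token. A real model
    would choose which positions to reveal, for example by confidence.
    """
    target = math.ceil(len(block) * (step + 1) / total_steps)
    out = list(block)
    for i in range(target):
        if out[i] is MASK:
            out[i] = len(prefix) + i
    return out


def block_diffusion_generate(total_len, block_size, steps_per_block, denoise_block):
    """Autoregressive over blocks, iterative denoising inside each block."""
    prefix, calls = [], 0
    for start in range(0, total_len, block_size):
        block = [MASK] * min(block_size, total_len - start)
        for step in range(steps_per_block):
            block = denoise_block(prefix, block, step, steps_per_block)
            calls += 1
        prefix.extend(block)  # the finished block becomes fixed context
    return prefix, calls


for block_size in (1, 4, 16):
    _, calls = block_diffusion_generate(16, block_size, 4, toy_denoise_block)
    print(f"block_size={block_size:2d} -> denoiser calls={calls}")
```

In this sketch the number of denoiser calls is the number of blocks times the steps per block, so the block size and the denoising schedule jointly determine the inference cost\. The count does not capture per\-call cost, which also depends on how much context each call processes\.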

## 3 Related Work

### 3\.1 Generative Paradigms

**Autoregressive Language Models\.** Autoregressive language models form the standard paradigm for neural text generation \[radford2019language,brown2020language,grattafiori2024llama,team2024gemma,guo2025deepseek,yang2025qwen3\], originating from early neural language models and scaling through Transformer\-based architectures \[vaswani2017attention\]\. These models have demonstrated strong scaling behavior and emergent in\-context learning abilities at large scale \[kaplan2020scaling\]\. In our experiments, we adopt GPT\-2 \[radford2019language\] and Qwen3 \[yang2025qwen3\] as representative autoregressive baselines against which DLMs are evaluated\. However, their strictly sequential decoding process has motivated a long line of work on improving inference efficiency, including speculative decoding, caching strategies, and parallel decoding approximations \[pope2023efficiently,leviathan2023fast,chen2023accelerating\]\. Despite these efforts, token\-by\-token factorization still limits parallel generation\.

**Continuous Diffusion for Language\.** Early diffusion\-based language models explore the application of diffusion processes to text by operating in continuous embedding spaces, where discrete tokens are mapped to continuous representations and corrupted with Gaussian noise before iterative denoising \[li2022diffusion,gong2022diffuseq,lin2023text\]\. These approaches are largely motivated by the success of diffusion models in continuous domains such as image generation \[ho2020denoising,song2020score\], and attempt to reuse the same iterative refinement framework within a continuous relaxation of discrete sequences\. While these approaches enable global parallel refinement, they are limited by the mismatch between continuous noise processes and discrete token semantics, as well as challenges in decoding continuous representations back into valid text\.

**Discrete Diffusion Language Models\.** Discrete diffusion models define generative processes directly over token sequences by progressively corrupting discrete symbols and learning to reverse this corruption through iterative denoising\. Foundational formulations include categorical diffusion processes and masked diffusion objectives, which generalize masked language modeling and define stochastic transition kernels over discrete vocabularies \[austin2021structured,sahoo2024simple,sahoo2025diffusion\]\. Related approaches further connect diffusion\-style denoising with energy\-based and score\-matching objectives in discrete settings, providing alternative views of iterative sequence reconstruction \[lou2023discrete\]\. A key insight in this line of work is the close relationship between autoregressive factorization and discrete diffusion processes, where autoregressive generation can be interpreted as a special case of sequential denoising under a fixed ordering\. This connection has motivated methods that initialize or adapt diffusion models from pretrained autoregressive language models, enabling improved optimization and scalability in large\-scale settings \[gong2025scaling\]\.

Recent work has demonstrated that discrete diffusion can be scaled to large language modeling regimes, extending these ideas to billions of parameters and diverse text generation tasks \[nie2026large,ye2025dream\]\. Large\-scale analyses further suggest that diffusion\-based language models benefit substantially from large\-scale training data, with performance improving markedly as data and model size increase \[ni2025diffusion\]\. However, compared to autoregressive models, their scaling behavior and efficiency–quality trade\-offs remain less well understood\.

Despite these advances, discrete diffusion models remain computationally expensive at inference time due to the need for multiple iterations over full sequences\. This iterative refinement introduces a trade\-off between quality and efficiency, and makes scaling behavior less predictable compared to autoregressive models, motivating further exploration of more efficient hybrid approaches\.

**Block and Hybrid Diffusion Models\.** A recent line of work investigates block\-structured diffusion models that introduce hierarchical generation schemes to improve the efficiency of diffusion\-based language modeling\. These approaches decompose sequence generation into coarse\-grained block\-level generation combined with fine\-grained intra\-block denoising, reducing the cost of full\-sequence iterative refinement while preserving parallel token updates within blocks\.

This direction is represented by a variety of block\-wise diffusion formulations and hierarchical extensions that explore different strategies for partitioning sequences and scheduling denoising steps \[arriola2025block,arriola2026encoder,wu2025fast\]\. Across these approaches, a common design principle is the introduction of structure into the diffusion process to limit the scope of iterative refinement while maintaining expressive modeling capacity\. Closely related to this family are pseudo\-autoregressive diffusion models, which introduce a directional component into diffusion\-based generation by iteratively refining a sliding window of future tokens conditioned on a growing prefix \[liu2025sequential\]\. These methods blur the boundary between autoregressive decoding and diffusion by combining causal structure with iterative denoising\.

Overall, these works span a spectrum of hybrid design choices that integrate autoregressive structure and diffusion\-based generation at different granularities\. Rather than treating generation as purely sequential or fully parallel, they explore intermediate formulations that trade off between inference efficiency, structural conditioning, and the degree of iterative refinement\.

### 3\.2 Evaluation Protocols in Current Literature

The development of autoregressive language modeling has been closely coupled with the establishment of standardized evaluation protocols, including unified multi\-task benchmarks that enable consistent and reproducible comparison across models \[biderman2024lessons,hendrycks2020measuring\]\. These frameworks have played a central role in ensuring that progress in autoregressive modeling is measured under comparable settings and well\-defined evaluation criteria\.

In contrast, diffusion\-based language modeling has not yet converged on a consistent evaluation standard\. Existing studies are conducted under heterogeneous experimental setups, differing in task collections, generation budgets, and sampling configurations, which limits the comparability of reported results \[nie2026large,ye2025dream,wu2025fast\]\. As a consequence, observed performance gains are often entangled with evaluation\-specific choices rather than reflecting purely architectural improvements\.

This lack of standardization is particularly problematic given that diffusion models introduce an explicit computational control variable in the form of iterative denoising steps, which directly governs the trade\-off between generation quality and inference cost\. Without a unified protocol for varying and reporting this parameter, it remains difficult to characterize the true quality–efficiency frontier of diffusion\-based language models\. The issue is further exacerbated in hybrid block\-diffusion settings, where additional factors such as block granularity and scheduling strategies introduce further degrees of freedom that are rarely controlled consistently across studies\.

To address these limitations, this work introduces a unified experimental framework for evaluating diffusion\-based language models under consistent task, budget, and inference settings, enabling systematic analysis of their quality–efficiency trade\-offs\.

## 4 Experimental Setup

To provide a comprehensive evaluation of DLMs, we design an experimental setup that spans both large\-scale pretrained systems and controlled small\-scale models trained under standardized conditions\. This dual\-tier structure allows us to assess performance in realistic downstream settings while also isolating architectural differences under identical data regimes\. To complement this model comparison, we evaluate performance across a diverse set of benchmarks covering knowledge, reasoning, coding, translation, and structured problem solving\.

Table 1: Summary of architectural and training configurations for the evaluated DLMs, categorized into large\-scale and small\-scale regimes\. The \# Tokens column details the training data scale\. Note that underlined values indicate the number of fine\-tuning samples or instruction pairs, as opposed to the number of tokens\.

| Model | Venue | \# Tokens | DS Training | Parameters | Block | Initialization | Denoising type |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Large scale models* | | | | | | | |
| LLaDA \[nie2026large\] | NeurIPS 2025 | 2\.3T | Proprietary | 8B | ✗ | From Scratch | Absorbing |
| Dream \[ye2025dream\] | arXiv 2025 | 580B | Open Source | 7B | ✗ | AR | Absorbing |
| SDLM \[liu2025sequential\] | arXiv 2025 | 2\.3B | Open Source | 3B | ✓ | AR | Absorbing |
| LLaDA\-1\.5 \[zhu2025llada\] | arXiv 2025 | 350K | Proprietary | 8B | ✓ | Diffusion | Absorbing |
| Fast\-dLLM \[wu2025fast\] | ICLR 2026 | 30M | Open Source | 7B | ✓ | AR | Absorbing |
| *Small scale models* | | | | | | | |
| MDLM \[sahoo2024simple\] | NeurIPS 2024 | 9B | OpenWebText | 200M | ✗ | From Scratch | Absorbing |
| BD3\-LM \[arriola2025block\] | ICLR 2025 | 9B | OpenWebText | 200M | ✓ | From Scratch | Absorbing |
| E2D2 \[arriola2026encoder\] | NeurIPS 2025 | 9B | OpenWebText | 170M | ✓ | From Scratch | Absorbing |
| Duo \[sahoo2025diffusion\] | ICML 2025 | 9B | OpenWebText | 200M | ✗ | From Scratch | Uniform |

### 4\.1 Evaluated Models

**Large\-Scale Downstream Models\.** This tier includes recent large\-scale language and diffusion models evaluated on standard reasoning, generation, and coding benchmarks, alongside autoregressive baselines\. LLaDA \[nie2026large\] is a large\-scale discrete diffusion model trained from scratch using fully bidirectional attention and a low\-confidence remasking strategy, and is evaluated under both standard full\-sequence diffusion and block\-based sampling configurations\. LLaDA 1\.5 \[zhu2025llada\] extends this architecture with reinforcement\-learning\-based optimization for improved alignment and inference stability, and is evaluated in a block\-diffusion setting\. Dream \[ye2025dream\] is a discrete diffusion language model initialized from a pretrained autoregressive checkpoint \(Qwen2\.5 7B\)\. This initialization provides a strong linguistic prior, which is then adapted to iterative denoising through diffusion training\. Fast\-dLLM\-v2 \[wu2025fast\] introduces a hierarchical block structure with nested sub\-blocks designed to reduce inference overhead and enable more efficient sequential decoding\. SDLM \[liu2025sequential\] represents a hybrid formulation that combines autoregressive and diffusion\-style generation by iteratively unmasking a fixed number of future tokens conditioned on a growing prefix, guided by internal confidence estimates\. As a reference point, Qwen3 \[yang2025qwen3\] models are included as state\-of\-the\-art autoregressive baselines to quantify the performance and efficiency gap relative to diffusion\-based approaches\.

**Small\-Scale Controlled Models\.** This tier consists of compact architectures trained from scratch on a unified corpus \(OpenWebText \[Gokaslan2019OpenWeb\]\) to enable controlled comparisons under identical data conditions and allow precise perplexity evaluation without confounding factors from heterogeneous pretraining data\. MDLM \[sahoo2024simple\] serves as a baseline masked diffusion model that performs iterative corruption and denoising over full sequences, representing a standard discrete diffusion formulation\. BD3\-LM \[arriola2025block\] combines autoregressive block\-level generation with intra\-block diffusion, enabling parallel token refinement within each segment while preserving sequential block dependencies\. Duo \[sahoo2025diffusion\] is a discrete diffusion model based on uniform\-state corruption dynamics, leveraging a structured noise schedule that improves training stability and self\-correction behavior\. E2D2 \[arriola2026encoder\] separates computation between an encoder processing the conditioning context and a lightweight decoder responsible for iterative denoising of target tokens\. GPT\-2 \[radford2019language\] is included as a standard autoregressive baseline trained under the same data conditions, serving as a reference point for comparing sequential and diffusion\-based generation under controlled settings\.

Table [1](https://arxiv.org/html/2606.19475#S4.T1) summarizes the architectural and training characteristics of the evaluated DLMs, including parameter scale, training data, masking strategy, and whether a block\-based generation scheme is employed\. This overview provides a unified reference for comparing model design choices across both large\-scale and controlled experimental settings\.

### 4\.2 Datasets

The benchmark suite includes MMLU \[hendrycks2020measuring\] and MMLU Redux \[gema2025we\] for evaluating factual knowledge and reasoning abilities, HellaSwag \[zellers2019hellaswag\] for commonsense reasoning and scenario completion, GSM8K \[cobbe2021training\] for multi\-step mathematical reasoning, and HumanEval \[chen2021evaluating\] together with MBPP \[austin2021program\] for code generation\. To assess performance beyond reasoning and coding tasks, we additionally include WMT16 En–De \[bojar2016findings\] as a machine translation benchmark and Sudoku \[ye2025beyond\] as a structured logical reasoning task requiring constraint satisfaction\. Together, these benchmarks provide a diverse evaluation setting covering both discriminative and generative tasks, allowing us to analyze the behavior of DLMs across a wide range of capabilities\.

### 4\.3 Evaluation Protocol

All experiments are conducted using the widely adopted open\-source evaluation framework lm\-evaluation\-harness \[biderman2024lessons\], which provides a standardized interface for evaluating both autoregressive and diffusion\-based language models\.

MMLU is evaluated in the 5\-shot setting using teacher\-forced log\-likelihood scoring over the candidate answers, with performance reported as accuracy\. HellaSwag is evaluated in the 0\-shot setting using log\-likelihood scoring, with performance measured through normalized accuracy to account for differences in candidate completion length and avoid bias toward shorter responses\. MMLU Redux is evaluated in the 5\-shot setting, where the model generates an answer and accuracy is computed from the first generated token corresponding to the predicted option label\.
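The difference between plain and length\-normalized log\-likelihood scoring can be illustrated as follows\. This is a minimal sketch, not the evaluation framework's code: the function name is hypothetical, the log\-likelihood values are made up, and dividing by the byte length of each completion is an assumption about how the normalization is carried out\.

```python
def pick_answer(log_likelihoods, completions, normalize=True):
    """Return the index of the highest-scoring candidate completion.

    With normalize=True each log-likelihood is divided by the completion's
    length in bytes, which removes the built-in advantage of short answers.
    """
    scores = []
    for ll, text in zip(log_likelihoods, completions):
        length = len(text.encode("utf-8")) if normalize else 1
        scores.append(ll / length)
    return max(range(len(scores)), key=lambda i: scores[i])


completions = ["yes", "yes, certainly"]
log_likelihoods = [-6.0, -9.0]  # made-up values for illustration

print(pick_answer(log_likelihoods, completions, normalize=False))
print(pick_answer(log_likelihoods, completions, normalize=True))
```

Without normalization the shorter completion wins because it accumulates fewer negative log\-probability terms; after normalization the longer completion has the better per\-byte score\.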

GSM8K is evaluated in the 4-shot setting using the flexible-extract protocol, which identifies the final numerical value in the generated response and compares it against the reference answer. HumanEval is evaluated in the 0-shot setting and MBPP in the 3-shot setting, with both benchmarks measured using pass@1 functional correctness after applying the standard output filtering procedures provided by the evaluation framework. WMT16 En–De is evaluated in the 0-shot setting using the chrF metric. Finally, Sudoku is evaluated in the 0-shot setting by verifying that generated solutions remain consistent with the input clues and satisfy all puzzle constraints.
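Two of the checks described above can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's implementation: the regular expression is a simplified stand-in for the flexible-extract filter, the grid is assumed to be a square list of integer rows with 0 marking an empty clue cell, and both function names are hypothetical.

```python
import math
import re
from typing import List, Optional

# Simplified number pattern (assumption): optional sign, digits with optional
# thousands separators, optional decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[str]:
    """Return the last number in the response, with commas removed."""
    matches = NUMBER.findall(response)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_valid_sudoku(solution: List[List[int]], clues: List[List[int]]) -> bool:
    """Check clue consistency plus row, column, and box constraints."""
    n = len(solution)
    box = math.isqrt(n)
    if box * box != n or len(clues) != n:
        return False
    if any(len(row) != n for row in solution) or any(len(row) != n for row in clues):
        return False
    # Every given clue (non-zero cell) must be preserved in the solution.
    for r in range(n):
        for c in range(n):
            if clues[r][c] != 0 and clues[r][c] != solution[r][c]:
                return False
    target = set(range(1, n + 1))
    rows_ok = all(set(row) == target for row in solution)
    cols_ok = all({solution[r][c] for r in range(n)} == target for c in range(n))
    boxes_ok = all(
        {solution[r][c] for r in range(br, br + box) for c in range(bc, bc + box)} == target
        for br in range(0, n, box)
        for bc in range(0, n, box)
    )
    return rows_ok and cols_ok and boxes_ok


# Hypothetical 4x4 example.
clues = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
solution = [[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
```

Taking the last match mirrors the idea that the final number in a chain-of-thought response is the answer; a stricter protocol would instead require a fixed answer template.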

Log-likelihood Estimation. For all models, we compute log-likelihoods following the procedures described in their original papers. For Dream, LLaDa, and LLaDa 1.5, likelihoods are estimated through a Monte Carlo procedure, since exact autoregressive likelihoods are not directly available. For Fast-dLLM and SDLM, we follow the authors' evaluation protocol by masking all target tokens and computing the sequence log-likelihood in a single forward pass. Unless otherwise stated, we retain the original hyperparameters and scoring configurations.
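As a rough illustration of what a Monte Carlo procedure of this kind can look like, the sketch below implements one common estimator for masked diffusion models: sample a number of target positions to mask, mask them uniformly at random, score only the masked positions, and reweight by the ratio of target length to mask count. The text above does not specify the exact estimator used by each model, so this form is an assumption, as are the model interface `LogProbFn`, the `MASK_ID` value, and the toy model.

```python
import math
import random
from typing import Callable, List, Sequence

MASK_ID = -1  # hypothetical id for the mask token

# Hypothetical model interface: given the prompt, the partially masked target,
# and the clean target, return log p(target[i] | prompt, noised) for every i.
LogProbFn = Callable[[Sequence[int], Sequence[int], Sequence[int]], List[float]]


def mc_log_likelihood(
    logprob_fn: LogProbFn,
    prompt: Sequence[int],
    target: Sequence[int],
    num_samples: int = 32,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of a log-likelihood bound for a masked DLM."""
    length = len(target)
    if length == 0:
        raise ValueError("target must contain at least one token")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Sample how many target tokens to mask, then which ones.
        num_masked = rng.randint(1, length)
        masked = set(rng.sample(range(length), num_masked))
        noised = [MASK_ID if i in masked else tok for i, tok in enumerate(target)]
        logps = logprob_fn(prompt, noised, target)
        # Score masked positions only, reweighted by length / num_masked.
        total += (length / num_masked) * sum(logps[i] for i in masked)
    return total / num_samples


def uniform_toy_model(prompt, noised, target):
    # Toy stand-in: every token gets probability 1/4 regardless of context.
    return [math.log(0.25)] * len(target)


estimate = mc_log_likelihood(uniform_toy_model, prompt=[1, 2], target=[3, 4, 5, 6])
```

With the toy model every sample returns the same value, so the estimate has zero variance; with a real model, more samples trade extra forward passes for a tighter estimate, in contrast to the single-pass scoring used for Fast-dLLM and SDLM.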

## 5 Large-Scale Analysis

### 5.1 Architectural Paradigm Comparison

The results reported in Table [2](https://arxiv.org/html/2606.19475#S5.T2) summarize the peak performance achieved by each model under its best-performing configuration in terms of diffusion steps and block structure. This allows us to compare the intrinsic capabilities of three architectural paradigms: autoregressive modeling, full-sequence discrete diffusion, and block-based hybrid diffusion. Autoregressive models (Qwen3) are included as strong reference baselines representing standard causal language modeling, against which diffusion-based approaches are evaluated.

Pure Discrete Diffusion: Strong Knowledge Retention and Global Reasoning. The pure diffusion models, Dream and LLaDa, exhibit markedly different profiles. Dream consistently emerges as the strongest full-sequence diffusion model, achieving the highest diffusion performance on MMLU, MMLU-Redux, HellaSwag, MBPP, and Sudoku. In particular, its 75.00% Sudoku accuracy substantially exceeds both autoregressive and block-based approaches, suggesting that full-sequence iterative refinement is particularly effective for tasks requiring global constraint satisfaction. More broadly, Dream remains competitive with the larger Qwen3 baselines across several reasoning and knowledge benchmarks, indicating that diffusion-based generation can approach autoregressive performance when paired with strong initialization and sufficient inference compute.

Block-Level Diffusion: Specialization through Structured Generation. Block-based diffusion architectures exhibit a more specialized performance profile. Fast-dLLM achieves the strongest diffusion results on GSM8K (83.39%) and HumanEval (69.51%), matching or surpassing much larger models on algorithmic reasoning and code generation. However, this strength comes at the expense of linguistic tasks, most notably HellaSwag, where performance drops to 30.82%. In contrast, LLaDa-1.5 achieves the strongest translation performance among all diffusion models (54.85 chrF) while maintaining competitive reasoning accuracy, suggesting that different block-generation strategies induce distinct trade-offs between sequential language modeling and structured reasoning. SDLM occupies an intermediate position, delivering competitive performance despite its smaller parameter count, particularly on coding and language understanding tasks.

Overall, the results indicate that no single diffusion paradigm dominates across all benchmarks. Full-sequence diffusion appears most effective for globally constrained and knowledge-intensive tasks, while block-based approaches can achieve superior reasoning and coding performance at the cost of greater task specialization.

Table 2: Comparison of peak benchmark accuracy (%) across autoregressive models and DLMs. Autoregressive baselines (Qwen3) are shaded in gray, while bold text indicates the highest performance achieved among the diffusion-based architectures.

| Model | GSM8K | HumanEval | MBPP | MMLU | MMLU-Redux | HellaSwag | Sudoku | WMT16 En–De |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 81.19 | 71.95 | 62.60 | 70.06 | 74.80 | 68.48 | 2.00 | 55.37 |
| SDLM | 62.00 | 63.41 | 56.60 | 64.96 | 62.55 | 69.70 | 2.00 | 50.11 |
| Qwen3-8B | 87.41 | 63.41 | 64.80 | 74.78 | 78.42 | 74.97 | 8.00 | 58.21 |
| Dream | 77.75 | 57.92 | **57.00** | **71.73** | **76.00** | **73.77** | **75.00** | 45.61 |
| LLaDa | 65.42 | 33.53 | 40.60 | 65.84 | 61.33 | 70.91 | 46.00 | 51.31 |
| LLaDa-1.5 | 82.41 | 50.00 | 42.40 | 64.14 | 61.38 | 69.70 | 28.00 | **54.85** |
| Fast-dLLM | **83.39** | **69.51** | 49.00 | 68.00 | 71.18 | 30.82 | 1.00 | 43.62 |

### 5.2 Scaling Analysis

![Figure 1](https://arxiv.org/html/2606.19475v1/x1.png)

Figure 1: Performance scaling across tasks when synchronously increasing diffusion steps and context length at a 1:1 ratio.

![Figure 2](https://arxiv.org/html/2606.19475v1/x2.png)

Figure 2: Impact of scaling diffusion steps (varying the parallel unmasking ratio) while maintaining a fixed context length of N=1024.

Unlike autoregressive models, DLMs expose several inference-time control variables, including the number of denoising steps, generation length, block size, and the degree of parallel token prediction. These parameters directly affect both computational cost and output quality, making them central to understanding the practical behavior of diffusion-based generation. In this section, we systematically vary these factors to characterize how different architectures utilize additional inference compute and to identify the regimes in which performance gains saturate or degrade.

Joint Scaling of Steps and Context Length. We restrict this analysis to Dream and LLaDa, the two full-sequence diffusion models in our evaluation, in order to study scaling behavior independently of block-based generation. Figure [1](https://arxiv.org/html/2606.19475#S5.F1) shows performance as a function of generation length, with one token denoised per step so that diffusion steps and output length scale together. For reasoning and coding tasks (GSM8K, MBPP, HumanEval), both models benefit from larger budgets initially, but performance saturates or declines beyond 256–512 tokens. The main exception is Dream on HumanEval, which continues improving at larger budgets, whereas LLaDa plateaus and eventually degrades. Translation quality (WMT16) scales poorly throughout, declining almost monotonically and collapsing at the largest budgets, suggesting that translation does not benefit from longer outputs and is particularly sensitive to error accumulation over extended generation sequences.

Scaling the Global Unmasking Ratio. Figure [2](https://arxiv.org/html/2606.19475#S5.F2) isolates the effect of the parallel unmasking ratio by fixing the generation length at N=1024 and varying the number of denoising steps. Across GSM8K, MBPP, and HumanEval, both models perform poorly at low step counts, indicating that aggressive parallel token prediction is detrimental to reasoning and code generation. Performance improves steadily as the number of denoising steps increases, with no clear saturation within the evaluated range. This suggests that in the joint scaling experiment (Figure [1](https://arxiv.org/html/2606.19475#S5.F1)), it is the generation length rather than the step count that drives performance degradation. Dream consistently outperforms LLaDa on reasoning and coding tasks across all step budgets. Machine translation exhibits a qualitatively different pattern: both models perform poorly at low step counts, but LLaDa benefits more substantially from additional steps and eventually surpasses Dream on WMT16 at 1024 steps.
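The relationship between step count and parallel unmasking at a fixed generation length can be made concrete with a small sketch. It assumes the simplest possible schedule, in which the tokens to unmask are spread as evenly as possible across steps; the function name is hypothetical, and the evaluated models may use different schedules.

```python
from typing import List


def unmask_schedule(gen_length: int, steps: int) -> List[int]:
    """Number of tokens to unmask at each denoising step (even split)."""
    if steps < 1 or steps > gen_length:
        raise ValueError("steps must be between 1 and gen_length")
    base, extra = divmod(gen_length, steps)
    # The first `extra` steps take one additional token so the total matches.
    return [base + 1 if i < extra else base for i in range(steps)]


# At N=1024, fewer steps means more tokens committed in parallel per step.
sequential = unmask_schedule(1024, 1024)  # one token per step
aggressive = unmask_schedule(1024, 128)   # eight tokens per step
```

Each step costs one forward pass over the sequence, so halving the step count roughly halves compute while doubling the number of tokens committed per step.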

Scaling Block Size under Constant Compute. We focus this analysis on LLaDa, LLaDa 1.5, and Fast-dLLM, as these block-based models allow fixed block sizes, unlike SDLM. To isolate chunking effects, we scaled block sizes from 8 to 128 tokens while maintaining a strict 1:1 ratio of generated tokens to denoising steps. As Figure [3](https://arxiv.org/html/2606.19475#S5.F3) shows, performance is largely robust to these changes. This stability is most evident on WMT16, where all models maintain perfectly flat curves. GSM8K and HumanEval show only minor fluctuations, implying that block sizes can be flexibly tuned to meet hardware constraints (such as KV-cache limits) with minimal quality degradation. MBPP is the primary exception, demonstrating higher sensitivity: Fast-dLLM spikes at block size 16, while both LLaDa models dip at smaller sizes before recovering. Regardless of these variations, relative model rankings remain strictly consistent: LLaDa 1.5 outperforms base LLaDa, and Fast-dLLM dominates reasoning and coding while trailing LLaDa 1.5 in translation.

![Figure 3](https://arxiv.org/html/2606.19475v1/x3.png)

Figure 3: Performance invariance when scaling absolute block size under a constant compute budget (1:1 ratio of generated tokens to diffusion steps).

Scaling the Intra-Block Unmasking Ratio. We exclude Fast-dLLM from this analysis, as it does not allow setting a fixed number of diffusion steps per block. To examine the cost of intra-block parallel unmasking, we scaled the ratio of denoising steps to block length (from 1/8 to 1/1) at fixed optimal block sizes. As Figure [4](https://arxiv.org/html/2606.19475#S5.F4) shows, block boundaries largely preserve the qualitative behaviors seen in the global setting. Code generation (MBPP, HumanEval) heavily penalizes intra-block parallelism: performance scales near-linearly up to 1/1, indicating that fine-grained sequential refinement is necessary even within spatial chunks. LLaDa 1.5 widens its lead on HumanEval at 1/1, though both models converge to the same score on MBPP at this ratio. Math reasoning (GSM8K) is slightly more tolerant of parallelism; accuracy continues to grow, but the rate of improvement slows noticeably after 1/4 for both models. On machine translation (WMT16), LLaDa 1.5 improves consistently across the entire range, whereas the base LLaDa model peaks at 1/2 before experiencing a distinct performance drop at 1/1.
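To show how the step-to-block-length ratio translates into a decoding budget, the following sketch derives the number of blocks, steps per block, tokens unmasked per step, and total forward passes. It assumes the generation length is a multiple of the block size and that every step unmasks the same number of tokens; `BlockPlan` and `plan_block_decoding` are hypothetical names, and the pass count ignores any caching a particular model may apply.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class BlockPlan:
    num_blocks: int
    steps_per_block: int
    tokens_per_step: int
    total_forward_passes: int


def plan_block_decoding(gen_length: int, block_size: int, ratio: Fraction) -> BlockPlan:
    """Derive a block-wise decoding budget from a steps-to-block-length ratio."""
    if gen_length % block_size != 0:
        raise ValueError("gen_length must be a multiple of block_size")
    steps = ratio * block_size
    if steps.denominator != 1 or steps < 1:
        raise ValueError("ratio * block_size must be a positive integer")
    steps_per_block = int(steps)
    if block_size % steps_per_block != 0:
        raise ValueError("block_size must be divisible by steps_per_block")
    num_blocks = gen_length // block_size
    return BlockPlan(
        num_blocks=num_blocks,
        steps_per_block=steps_per_block,
        tokens_per_step=block_size // steps_per_block,
        total_forward_passes=num_blocks * steps_per_block,
    )


# Hypothetical settings: 256 generated tokens in blocks of 32.
parallel = plan_block_decoding(256, 32, Fraction(1, 8))    # 8 tokens per step
sequential = plan_block_decoding(256, 32, Fraction(1, 1))  # 1 token per step
```

Moving from 1/8 to 1/1 multiplies the number of forward passes by eight, which is the cost side of the accuracy gains reported for code generation.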

![Figure 4](https://arxiv.org/html/2606.19475v1/x4.png)

Figure 4: Intra-block parallel unmasking effects: scaling the ratio of diffusion steps to block length within fixed block boundaries.

### 5.3 Computational Cost Analysis

Table [3](https://arxiv.org/html/2606.19475#S5.T3) compares the computational requirements of the evaluated DLMs in terms of peak GPU memory consumption and floating-point operations, measured both for a single forward pass and for 100 complete GSM8K generations. For a single forward pass, Dream, LLaDa, and LLaDa 1.5 exhibit similar computational profiles, requiring approximately 16 GB of memory and 23–25 TFLOPS. In contrast, the block-diffusion architectures Fast-dLLM and SDLM are substantially more efficient, with SDLM requiring less than half the memory and compute of the pure diffusion models.

The gap widens considerably during generation. Because pure diffusion models repeatedly execute the denoising network over many refinement steps, their cumulative computational cost increases dramatically despite their parallel generation capabilities. Block-diffusion architectures mitigate this overhead by restricting diffusion to local blocks, resulting in significantly lower end-to-end generation costs. These results highlight the fundamental quality–efficiency trade-off of DLMs: while pure diffusion architectures provide the greatest flexibility for iterative refinement, block-based approaches offer a substantially more practical inference profile.

Table 3: Computational cost comparison across the evaluated models. Metrics represent peak GPU memory (VRAM) and floating-point operations (TFLOPS) for a single forward pass versus full generation on GSM8K, with autoregressive baselines shaded in gray.

| Model | 1 Forward Pass: Peak VRAM (GB) | 1 Forward Pass: TFLOPS | GSM8K Generation: Peak VRAM (GB) | GSM8K Generation: TFLOPS |
|---|---|---|---|---|
| Qwen3-4B | 8.36 | 5.04 | 8.36 | 9.85 |
| SDLM | 7.49 | 4.15 | 7.51 | 6.72 |
| Qwen3-8B | 16.68 | 9.47 | 16.68 | 19.47 |
| Dream | 15.80 | 23.23 | 20.45 | 23726.95 |
| LLaDa | 16.48 | 24.82 | 17.31 | 25357.94 |
| LLaDa 1.5 | 16.48 | 25.01 | 17.32 | 25557.57 |
| Fast-dLLM | 15.50 | 9.73 | 15.55 | 33.38 |
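One way to read Table 3 is to divide each model's generation cost by its single-forward-pass cost, which gives a rough measure of how much iterative decoding multiplies compute. The sketch below does this arithmetic with the TFLOPS values copied from the table; the ratio itself is an illustrative quantity introduced here, not a metric reported in the paper.

```python
# (forward-pass TFLOPS, GSM8K-generation TFLOPS), copied from Table 3.
COSTS = {
    "Qwen3-4B": (5.04, 9.85),
    "SDLM": (4.15, 6.72),
    "Qwen3-8B": (9.47, 19.47),
    "Dream": (23.23, 23726.95),
    "LLaDa": (24.82, 25357.94),
    "LLaDa 1.5": (25.01, 25557.57),
    "Fast-dLLM": (9.73, 33.38),
}


def generation_overhead(forward_tflops: float, generation_tflops: float) -> float:
    """Generation cost expressed as a multiple of one forward pass."""
    return generation_tflops / forward_tflops


overheads = {name: generation_overhead(*cost) for name, cost in COSTS.items()}

# List models from the smallest to the largest overhead.
for name, ratio in sorted(overheads.items(), key=lambda item: item[1]):
    print(f"{name:10s} {ratio:8.1f}x")
```

Under this reading, the pure diffusion models spend roughly a thousand forward-pass equivalents on generation, while the block-based and autoregressive models stay in the single digits.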

### 5.4 Small-Scale Models Analysis

To evaluate the raw predictive confidence of compact architectures independently of specific sampling strategies, we computed perplexity on a 1000-sample ensemble dataset composed of GSM8K, MBPP, HumanEval, WMT16 En–De, MMLU, and HellaSwag.
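For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities are already available; for diffusion models these would typically come from a likelihood bound rather than an exact likelihood, which is an assumption not spelled out above, and the function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)
```

Lower values mean the model spreads its probability mass over fewer plausible tokens on average.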

Table 4: Ensemble perplexity and computational cost comparison for small-scale architectures. The table reports raw predictive performance (PPL, lower is better) alongside peak VRAM and TFLOPS for both a single forward pass and unconditioned generation over a fixed 1024-token sequence. The autoregressive baseline (GPT-2) is shaded in gray.

| Model | PPL ↓ | 1 Forward Pass: Peak VRAM (GB) | 1 Forward Pass: TFLOPS | Generation: Peak VRAM (GB) | Generation: TFLOPS |
|---|---|---|---|---|---|
| GPT-2 | 20.98 | 2.39 | 0.292 | 2.34 | 1.75 |
| E2D2 | 36.82 | 5.45 | 1.03 | 5.45 | 263.15 |
| BD3-LM | 36.16 | 1.20 | 0.253 | 1.20 | 259.04 |
| MDLM | 28.45 | 1.07 | 0.253 | 1.07 | 259.16 |
| Duo | 24.36 | 1.08 | 0.253 | 1.08 | 259.22 |

The results reveal a clear hierarchy. GPT-2 achieves the lowest perplexity (20.98), reflecting the advantage of autoregressive models on next-token prediction. Among diffusion models, Duo obtains the strongest result (24.36), substantially improving over MDLM (28.45) and demonstrating the effectiveness of its curriculum-learning strategy. In contrast, the hybrid architectures BD3-LM (36.16) and E2D2 (36.82) exhibit considerably higher perplexities, suggesting that architectural modifications introduced to improve efficiency come at the cost of reduced likelihood modeling performance. Computationally, MDLM, Duo, and BD3-LM exhibit nearly identical costs, whereas E2D2 requires additional memory and compute due to its encoder–decoder architecture. These differences are less pronounced than those observed in perplexity.

## 6 Conclusion

In this work, we presented a unified evaluation of modern DLMs across diverse downstream benchmarks. Our results highlight the distinct strengths and limitations of diffusion-based generation, showing that its effectiveness depends strongly on the task, inference strategy, and computational budget. While pure diffusion models benefit from global refinement, block-diffusion architectures offer a practical compromise between performance and efficiency. These findings provide a clearer understanding of the trade-offs underlying contemporary diffusion language models.

## Declaration on Generative AI

During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly for grammar and spelling checks and for rephrasing. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.

## References