$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
Summary
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.
# $R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

Source: [https://arxiv.org/abs/2604.18995](https://arxiv.org/abs/2604.18995) [View PDF](https://arxiv.org/pdf/2604.18995)

> Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.

## Submission history

From: Zhenbang Du \[[view email](https://arxiv.org/show-email/5432f1fd/2604.18995)\]

**\[v1\]** Tue, 21 Apr 2026 02:26:08 UTC (2,373 KB)
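The abstract names two training-free ideas: aggregating confidence over neighboring positions, and finalizing tokens whose predictions have stopped changing. The Python sketch below is only a toy illustration of those two ideas and is not the paper's actual decoding rules, which the abstract does not specify. The function names `local_confidence` and `finalize_stable`, the `radius` and `patience` parameters, the mean-over-window aggregation, and the use of `None` for a still-masked position are all assumptions made for this example.

```python
from typing import List, Optional


def local_confidence(conf: List[float], radius: int = 1) -> List[float]:
    """Average each position's confidence with its neighbors (assumed rule).

    A windowed mean is one simple way to aggregate local confidence;
    the paper may use a different aggregation.
    """
    out = []
    for i in range(len(conf)):
        lo = max(0, i - radius)
        hi = min(len(conf), i + radius + 1)
        window = conf[lo:hi]
        out.append(sum(window) / len(window))
    return out


def finalize_stable(
    history: List[List[int]],
    finalized: List[Optional[int]],
    patience: int = 2,
) -> List[Optional[int]]:
    """Freeze positions whose predicted token was identical over the
    last `patience` decoding steps (assumed stability criterion).

    `history[t][pos]` is the token predicted at step t for position pos.
    `finalized[pos]` is None while the position is still masked.
    """
    out = list(finalized)
    if len(history) < patience:
        return out  # not enough steps observed to judge stability
    recent = history[-patience:]
    for pos, tok in enumerate(recent[-1]):
        if out[pos] is None and all(step[pos] == tok for step in recent):
            out[pos] = tok  # stable: stop remasking this position
    return out


if __name__ == "__main__":
    # Three decoding steps over three positions (token ids are made up).
    steps = [[5, 7, 9], [5, 8, 9], [5, 8, 1]]
    print(finalize_stable(steps, [None, None, None], patience=2))
    print(local_confidence([0.9, 0.3, 0.9], radius=1))
```

In this toy setup, a position that is finalized would be skipped in later steps, which is where a reduction in decoding steps could come from. How the two signals are combined and which thresholds are used are left open here.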
Similar Articles
Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
This paper proposes Dynamic-dLLM, a training-free framework that accelerates diffusion large language models by dynamically allocating cache-update budgets and calibrating decoding thresholds, achieving over 3x speedup on models like LLaDA and Dream while maintaining performance.
Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation
This paper introduces Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) to accelerate inference in diffusion-based large language models by dynamically deciding when tokens have converged and forecasting logit trends, reducing unnecessary denoising steps while preserving output quality.
Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models
This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.
@DailyDoseOfDS_: Turn any Autoregressive LLM into a Diffusion LM. dLLM is a Python library that unifies the training & evaluation of dif…
dLLM is an open-source Python library that allows converting any autoregressive language model into a diffusion language model with minimal compute, unifying training and evaluation.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
This paper introduces Parallel Speculative Decoding (PSD), a training-free framework that accelerates diffusion LLM inference by jointly improving spatial and temporal efficiency, achieving up to 5.5× tokens per forward pass with comparable quality to greedy decoding.