$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

arXiv cs.CL 04/22/26, 04:00 AM Papers

Summary

R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.

arXiv:2604.18995v1 Announce Type: new Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.

Original Article

Cached at: 04/22/26, 08:29 AM

# $R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
Source: [https://arxiv.org/abs/2604.18995](https://arxiv.org/abs/2604.18995)
[View PDF](https://arxiv.org/pdf/2604.18995)

> Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.
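
The abstract names two training-free ideas at inference time: aggregating local confidence, and finalizing tokens whose predictions have stopped changing. The abstract does not give the actual rules, so the following is only an illustrative sketch of what such rules might look like. Every function name, the neighbourhood average, the threshold values, and the "same prediction for k steps" criterion are assumptions made for this example, not the authors' method.

```python
from typing import List, Sequence


def smooth_confidence(conf: Sequence[float], radius: int = 1) -> List[float]:
    """Average each position's confidence with its neighbours within `radius`.

    Hypothetical stand-in for "aggregating local confidence".
    """
    n = len(conf)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = conf[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def update_streak(prev_pred: Sequence[str], curr_pred: Sequence[str],
                  streak: Sequence[int]) -> List[int]:
    """Count consecutive steps for which each position kept the same prediction."""
    return [s + 1 if p == c else 1
            for p, c, s in zip(prev_pred, curr_pred, streak)]


def select_to_finalize(conf: Sequence[float], streak: Sequence[int],
                       masked: Sequence[bool], conf_threshold: float = 0.9,
                       stable_steps: int = 3, radius: int = 1) -> List[int]:
    """Return masked positions to finalize at this decoding step.

    A position is chosen if its neighbourhood-averaged confidence clears
    `conf_threshold` (assumed spatial rule) or if its prediction has been
    unchanged for `stable_steps` steps (assumed temporal rule).
    """
    smoothed = smooth_confidence(conf, radius)
    return [i for i, is_masked in enumerate(masked)
            if is_masked and (smoothed[i] >= conf_threshold
                              or streak[i] >= stable_steps)]


if __name__ == "__main__":
    conf = [0.95, 0.92, 0.97, 0.40, 0.50]   # made-up per-position confidences
    streak = [1, 1, 1, 3, 1]                # made-up stability counters
    masked = [True] * 5
    print(select_to_finalize(conf, streak, masked))  # [0, 1, 3]
```

In this toy input, positions 0 and 1 pass the averaged-confidence test, and position 3 is finalized only because its prediction has been stable for three steps despite low confidence. Finalizing such a token early, rather than remasking it, is the kind of saved step the abstract attributes to temporal redundancy.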

## Submission history

From: Zhenbang Du [[view email](https://arxiv.org/show-email/5432f1fd/2604.18995)] **[v1]** Tue, 21 Apr 2026 02:26:08 UTC (2,373 KB)

Similar Articles

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

@DailyDoseOfDS_: Turn any Autoregressive LLM into a Diffusion LM. dLLM is a Python library that unifies the training & evaluation of dif…

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
