Rethinking Cross-Layer Information Routing in Diffusion Transformers
Summary
This paper proposes Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive replacement for residual addition that improves cross-layer information flow in Diffusion Transformers. On ImageNet 256×256, DAR improves the FID of SiT-XL/2 by 2.11 (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations.
Cached at: 05/25/26, 06:36 AM
Source: https://huggingface.co/papers/2605.20708
Abstract
Diffusion Transformers suffer from inefficient cross-layer information flow that traditional residual connections cannot address, prompting the introduction of a learnable, timestep-adaptive routing mechanism that improves training efficiency and model quality.
Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
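The abstract does not give DAR's exact formulation, so the following is only an illustrative sketch of what "learnable, timestep-adaptive, non-incremental aggregation over the history of sublayer outputs" could look like, set against plain residual addition. Everything here is an assumption made for illustration and is not taken from the paper: the function names (residual_add, routed_aggregate), the softmax-weighted mixing, and the linear dependence of the routing logits on the timestep t.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def residual_add(history):
    """Baseline residual stream: the next sublayer sees the plain sum of all
    earlier outputs, so magnitude can only accumulate with depth."""
    dim = len(history[0])
    return [sum(h[d] for h in history) for d in range(dim)]


def routed_aggregate(history, t, base_logits, time_slopes):
    """Hypothetical routing step (an assumption, not the paper's method).

    Each earlier output gets a logit base + slope * t, where base_logits and
    time_slopes stand in for learned parameters and t is the denoising
    timestep in [0, 1]. The result is recomputed from the full history as a
    convex combination instead of being added onto a running sum.
    """
    logits = [b + s * t for b, s in zip(base_logits, time_slopes)]
    weights = softmax(logits)
    dim = len(history[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
    return mixed, weights


# Toy history of three sublayer outputs, each a 2-dimensional vector.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base_logits = [0.0, 0.0, 0.0]   # placeholder "learned" values
time_slopes = [-2.0, 0.0, 2.0]  # placeholder "learned" values

summed = residual_add(history)
mixed_early, weights_early = routed_aggregate(history, 0.0, base_logits, time_slopes)
mixed_late, weights_late = routed_aggregate(history, 1.0, base_logits, time_slopes)
```

In this toy setup the routing weights are uniform at t = 0 and shift toward the most recent output at t = 1, while the mixed vector stays within the range of the individual outputs instead of growing with the number of layers. That mirrors, in a very simplified way, the magnitude-inflation symptom the abstract attributes to residual addition.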
Similar Articles
Adaptive Computation Depth via Learned Token Routing in Transformers
This paper presents Token-Selective Attention (TSA), a differentiable token routing mechanism that learns to skip unnecessary computations per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
This paper introduces Learned Relay Representations (Relay), a method that allows masked diffusion models to propagate latent information across denoising steps, overcoming the hard reset problem and improving performance-latency trade-offs. The method is shown to outperform standard supervised finetuning on coding tasks while reducing inference latency by up to 32%.
Delta Attention Residuals
Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than cumulative hidden states, achieving 1.7–8.2% validation perplexity gains across scales from 220M to 7.6B parameters.
Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
This paper introduces DiHAL, a diffusion-transformer hybrid that uses geometry-based proxies to select a layer in a pretrained language model for hidden-state replacement with a diffusion bridge, improving continuous diffusion language modeling by avoiding direct token recovery.
Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention
This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.