Tag
This paper introduces Multi-Resolution Residual Routing (WAV v1), an extension of Block Attention Residuals that augments block representations with directional detail bases, improving deep decoder-only Transformer training.
This paper proposes Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive replacement for residual connections that improves cross-layer information flow in Diffusion Transformers, yielding faster training and higher quality.
DeepSeek has published a paper introducing mHC (Manifold-Constrained Hyper-Connections), which replaces the Transformer's standard residual connections with mathematically constrained multi-stream pathways to stabilize the training of large models.