SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
Summary
SEGA is a training-free method that improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps.
Source: https://huggingface.co/papers/2605.22668
Abstract
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent’s spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
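The abstract does not give SEGA's actual scaling rule, so the following is only a toy sketch of the general idea it describes: measure how a latent's spectral energy is distributed, then give each RoPE frequency component its own attention scale instead of one uniform factor. Everything here is an assumption made for illustration, not the paper's algorithm: the function names, the 1-D latent row, the equal-width band split, the band-to-component mapping, and the linear `strength` rule are all hypothetical. Only the RoPE frequency formula and the per-pair rotated dot product are standard.

```python
import cmath
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, highest first, one per rotated pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def band_energy_fractions(signal, num_bands):
    """Share of non-DC spectral energy per band, lowest band first (naive DFT)."""
    n = len(signal)
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        power.append(abs(coeff) ** 2)
    total = sum(power) or 1.0
    fractions = [0.0] * num_bands
    for idx, p in enumerate(power):
        fractions[idx * num_bands // len(power)] += p / total
    return fractions


def component_scales(fractions, base_scale=1.0, strength=1.0):
    """Hypothetical rule: raise the scale of bands holding more than a uniform share."""
    uniform = 1.0 / len(fractions)
    per_band = [base_scale * (1.0 + strength * (f - uniform)) for f in fractions]
    # Assumed mapping: spectral band b pairs with the RoPE component of matching
    # rank, so reverse to RoPE order (highest frequency first).
    return per_band[::-1]


def scaled_rope_logit(q, k, pos_q, pos_k, freqs, scales):
    """Attention logit as a sum of per-component RoPE terms, each with its own scale."""
    logit = 0.0
    for i, (theta, s) in enumerate(zip(freqs, scales)):
        a, b = q[2 * i], q[2 * i + 1]
        c, d = k[2 * i], k[2 * i + 1]
        delta = (pos_q - pos_k) * theta  # relative rotation angle for this pair
        logit += s * ((a * c + b * d) * math.cos(delta)
                      - (b * c - a * d) * math.sin(delta))
    return logit / math.sqrt(len(q))


# Toy usage: a smooth (low-frequency) latent row shifts weight toward the
# low-frequency RoPE component under the hypothetical rule above.
freqs = rope_frequencies(head_dim=4)
row = [math.sin(2 * math.pi * t / 16) for t in range(16)]
scales = component_scales(band_energy_fractions(row, num_bands=len(freqs)))
```

Setting every entry of `scales` to the same value recovers the uniform, content-agnostic scaling that the abstract attributes to existing training-free approaches; the sketch differs only in letting the scales vary per component and per latent.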
Similar Articles
Spectral Guidance for Flexible and Efficient Control of Diffusion Models
Introduces Spectral Guidance, a framework for controlling diffusion models by leveraging low-dimensional representations of the diffusion process, enabling flexible and stable control without task-specific retraining or backpropagation through the denoiser.
Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
This paper introduces SemanticSeg, a large-scale dataset for semantic segmentation of long texts, and block distillation, a training framework that enables block attention models to approach full-attention performance, improving KV cache reuse in RAG and long-context scenarios.
Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention
This paper proposes Energy-Gated Attention (EGA) and Morlet Positional Encoding (MoPE) to address missing inductive biases in transformer attention: token salience and scale-adaptive locality. Experiments on TinyShakespeare show superadditive gains when combined, highlighting complementarity.
Linearizing Vision Transformer with Test-Time Training
This paper proposes a method to convert pretrained Softmax attention models into linear-complexity Test-Time Training (TTT) architectures, achieving comparable text-to-image quality to fine-tuned Softmax models while significantly accelerating inference. The approach is validated by linearizing Stable Diffusion 3.5, resulting in SD3.5-T^5 with 1.32x speedup at 1K resolution.
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
VGGT-Edit proposes a feed-forward framework for text-conditioned native 3D scene editing using depth-synchronized text injection and residual field prediction, achieving superior quality and efficiency over 2D-lifting approaches.