SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Hugging Face Daily Papers

Summary

SEGA is a training-free method that improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps.

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
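
The abstract does not give SEGA's actual formulas, so the following is only an illustrative sketch of the general idea of content-dependent, per-component attention scaling. It is not the authors' implementation. Every name here (`radial_band_energy`, `component_scales`, `scaled_logit`, the `strength` parameter) is hypothetical. The rule that maps band energy to a scale is an assumption, as is the pairing of one radial frequency band with one RoPE component pair, ordered from low to high frequency. A plain DFT is used so the example needs only the standard library; it is practical only for tiny grids.

```python
import cmath
import math


def radial_band_energy(latent, num_bands):
    """Share of non-DC spectral energy in each radial frequency band of a 2D grid."""
    h, w = len(latent), len(latent[0])
    energy = [0.0] * num_bands
    max_radius = math.hypot(0.5, 0.5)  # largest normalized frequency magnitude
    for u in range(h):
        for v in range(w):
            if u == 0 and v == 0:
                continue  # skip the DC term (mean value)
            # Naive 2D DFT coefficient at frequency index (u, v).
            coeff = sum(
                latent[y][x] * cmath.exp(-2j * math.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            # Convert indices to signed normalized frequencies.
            fu = (u if u <= h // 2 else u - h) / h
            fv = (v if v <= w // 2 else v - w) / w
            radius = math.hypot(fu, fv) / max_radius
            band = min(int(radius * num_bands), num_bands - 1)
            energy[band] += abs(coeff) ** 2
    total = sum(energy)
    if total == 0.0:
        return [1.0 / num_bands] * num_bands
    return [e / total for e in energy]


def component_scales(band_energy, strength=0.5):
    """Assumed rule: raise the scale of bands holding more than an equal share."""
    mean = 1.0 / len(band_energy)
    return [1.0 + strength * (e - mean) for e in band_energy]


def scaled_logit(q, k, scales):
    """Attention logit where each RoPE component pair has its own scale.

    q and k are assumed to be already rotated by RoPE; pair i occupies
    indices 2*i and 2*i + 1.
    """
    if len(q) != len(k) or len(q) != 2 * len(scales):
        raise ValueError("expected len(q) == len(k) == 2 * len(scales)")
    total = 0.0
    for i, s in enumerate(scales):
        total += s * (q[2 * i] * k[2 * i] + q[2 * i + 1] * k[2 * i + 1])
    return total / math.sqrt(len(q))
```

In a denoising loop, a sketch like this would recompute `radial_band_energy` on the current latent at each step and pass the resulting scales to the attention computation. With all scales equal to 1, `scaled_logit` reduces to the usual scaled dot product, which corresponds to the uniform scaling the abstract contrasts against.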

Cached at: 05/22/26, 02:20 PM

Source: https://huggingface.co/papers/2605.22668
