SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Hugging Face Daily Papers 05/21/26, 12:00 AM Papers

diffusion-transformers text-to-image attention resolution-extrapolation training-free rope spectral-energy

Summary

SEGA is a training-free method that improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps.

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
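The summary does not say how the latent's spatial-frequency structure is measured. As a rough illustration only, the sketch below computes the share of a latent's spectral energy that falls in a few radial frequency bands, using a 2D FFT. The function name `radial_band_energy`, the equal-width banding, and the band count are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def radial_band_energy(latent: np.ndarray, num_bands: int = 4) -> np.ndarray:
    """Fraction of spectral energy in each of `num_bands` radial frequency bands.

    latent: array of shape (C, H, W). Returns an array of shape (num_bands,)
    that sums to 1, ordered from lowest to highest spatial frequency.
    NOTE: hypothetical helper for illustration; not the paper's definition.
    """
    _, h, w = latent.shape
    # Power spectrum of each channel, summed over channels.
    power = (np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2).sum(axis=0)
    # Radial frequency (cycles per sample) of every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Equal-width bands from 0 up to the largest radius present (assumption).
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    band = np.digitize(radius, edges[1:-1])  # integer band index per bin
    energy = np.bincount(band.ravel(), weights=power.ravel(), minlength=num_bands)
    return energy / max(energy.sum(), 1e-12)

# A constant latent has only a DC component; a checkerboard sits at Nyquist.
flat = np.ones((4, 16, 16))
yy, xx = np.meshgrid(np.arange(16), np.arange(16), indexing="ij")
checker = np.broadcast_to(((-1.0) ** (yy + xx))[None], (4, 16, 16))
low_heavy = radial_band_energy(flat)
high_heavy = radial_band_energy(checker)
```

A statistic of this kind could, in principle, be recomputed at every denoising step to tell whether the current latent is dominated by coarse structure or by fine detail. How SEGA turns such a measurement into attention scales is not specified in the text above.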

Original Article


Cached at: 05/22/26, 02:20 PM


Source: https://huggingface.co/papers/2605.22668

Abstract

SEGA improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps.

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent’s spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
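To make the contrast between uniform and per-component scaling concrete, the sketch below splits a RoPE attention logit into one contribution per rotary frequency and applies a separate scale to each. It is a simplified 1D illustration, not the paper's implementation: DiTs typically use 2D or axial RoPE, and the function names, the `base` value, and the example scale values are assumptions chosen for this example.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply 1D RoPE to a vector x of even length d at integer position `pos`.

    Returns d/2 complex numbers, one per rotary (frequency) component.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2D pair
    pairs = x[0::2] + 1j * x[1::2]             # treat each pair as a complex number
    return pairs * np.exp(1j * pos * theta)    # rotate each pair by pos * theta

def component_logits(q, k, pos_q, pos_k) -> np.ndarray:
    """Per-RoPE-component contributions to the q.k dot product, shape (d/2,)."""
    rq = rope_rotate(q, pos_q)
    rk = rope_rotate(k, pos_k)
    # The real part of z1 * conj(z2) is the dot product of the two 2D pairs.
    return np.real(rq * np.conj(rk))

def scaled_logit(q, k, pos_q, pos_k, scales) -> float:
    """Attention logit with a separate scale on each RoPE component."""
    d = q.shape[-1]
    return float(np.sum(scales * component_logits(q, k, pos_q, pos_k)) / np.sqrt(d))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
ones = np.ones(4)
plain = scaled_logit(q, k, 5, 3, ones)           # no extra scaling
uniform = scaled_logit(q, k, 5, 3, 1.3 * ones)   # one scale for every component
# Hypothetical per-component scales; index 0 is the highest rotary frequency.
adaptive = scaled_logit(q, k, 5, 3, np.array([1.0, 1.1, 1.2, 1.3]))
```

With equal scales the per-component form reduces to ordinary uniform attention scaling, which is the baseline behavior the abstract criticizes. The adaptive case differs only in letting each frequency component carry its own factor; how SEGA chooses those factors from the latent is not given here.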


Get this paper in your agent:

hf papers read 2605.22668

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Spectral Guidance for Flexible and Efficient Control of Diffusion Models

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

Linearizing Vision Transformer with Test-Time Training

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
