SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
May 21, 2026
Authors: Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati
cs.AI
Abstract
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating images at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying attention behavior at inference time, often through Rotary Position Embedding (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform, content-agnostic scaling across RoPE components that have distinct frequency characteristics, which induces a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
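The abstract does not give SEGA's formulas, so the following is only an illustrative sketch of the general idea it describes: measure how a latent's spectral energy is distributed across spatial-frequency bands, then derive one attention scale per band instead of a single uniform scale. Everything here is an assumption made for illustration, not the authors' algorithm: the function names `band_energy_fractions` and `per_component_scales`, the radial band layout, the linear mapping from energy to scale, the `strength` parameter, and the pairing of one band with one group of RoPE components.

```python
import numpy as np


def band_energy_fractions(latent, num_bands):
    """Fraction of spectral energy in each radial frequency band.

    latent: array of shape (C, H, W). Bands are equal-width rings in
    normalized frequency radius (an assumed layout, not from the paper).
    """
    _, h, w = latent.shape
    # Power spectrum over the spatial axes, summed over channels.
    power = (np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2).sum(axis=0)

    # Radial distance of every FFT bin from the zero-frequency (DC) bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Assign each bin to a band index in 0 .. num_bands - 1.
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    band = np.digitize(radius, edges[1:-1])

    energy = np.bincount(band.ravel(), weights=power.ravel(),
                         minlength=num_bands)
    return energy / max(energy.sum(), 1e-12)


def per_component_scales(fractions, base_scale, strength=0.5):
    """Hypothetical mapping from band energy to per-band attention scales.

    Bands holding more than the average share of energy get a scale above
    base_scale, the others a scale below it; the mean stays at base_scale.
    """
    deviation = fractions - fractions.mean()
    return base_scale * (1.0 + strength * deviation)
```

In a sampler, such a computation would be repeated at every denoising step on the current latent, which is what would make the scaling content-dependent rather than fixed. For example, `per_component_scales(band_energy_fractions(latent, 4), base_scale=1.2)` returns four scales whose mean is 1.2, where `1.2` stands in for whatever uniform scale a baseline method would use.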