SEGA: Spectral-Energy-Guided Attention for Resolution Extrapolation in Diffusion Transformers

Abstract

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
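
The abstract does not give the rule that maps spectral energy to per-component scales, so the following is only a minimal illustrative sketch of the general idea, not SEGA's actual algorithm. It assumes that (i) the latent's spectral energy is summarized in radial frequency bands of its 2D FFT, (ii) each band is paired one-to-one with a RoPE component, (iii) the hypothetical function `component_scales` perturbs a uniform base scale in proportion to each band's energy share, and (iv) each RoPE component occupies two adjacent channels of the rotated queries and keys. All function names and the `strength` parameter are placeholders introduced for illustration.

```python
import numpy as np


def band_energies(latent, num_bands):
    """Share of spectral energy in each radial frequency band, low to high.

    latent: array of shape (C, H, W). Returns an array of shape (num_bands,).
    """
    _, h, w = latent.shape
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    power = (np.abs(spectrum) ** 2).sum(axis=0)  # (H, W), summed over channels

    # Radial frequency (cycles per sample) of every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    # Assign each bin to one of num_bands equal-width radial bands.
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    band = np.clip(np.digitize(radius, edges[1:-1]), 0, num_bands - 1)

    energy = np.bincount(band.ravel(), weights=power.ravel(), minlength=num_bands)
    return energy / max(energy.sum(), 1e-12)


def component_scales(energy, base_scale, strength=0.5):
    """Hypothetical mapping (an assumption, not from the paper).

    Bands holding more than a uniform share of energy get a scale above
    base_scale, the others below it; the mean scale stays at base_scale.
    """
    uniform = 1.0 / len(energy)
    return base_scale * (1.0 + strength * (energy - uniform))


def scaled_logits(q, k, scales):
    """Attention logits with one scale per RoPE component.

    q, k: (N, D) queries and keys already rotated by RoPE, with
    D == 2 * len(scales) and each component on two adjacent channels (assumed).
    """
    _, d = q.shape
    per_channel = np.repeat(scales, 2)
    return (q * per_channel) @ k.T / np.sqrt(d)
```

In this sketch, `band_energies` would be recomputed from the current latent at every denoising step, which is what makes the scaling content-dependent. When all entries of `scales` are equal, `scaled_logits` reduces to the uniform attention scaling that the abstract attributes to existing training-free methods.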