SEGA: Spectral Energy-Guided Attention for Resolution Extrapolation in Diffusion Transformers

Abstract

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance degrades when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying attention behavior at inference time, often by combining Rotary Position Embedding (RoPE) extrapolation with attention scaling. However, these strategies apply uniform, content-agnostic scaling across RoPE components that have distinct frequency characteristics, which induces a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the spatial-frequency structure of the latent at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
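The abstract does not specify how the latent's spectrum is measured or how it is mapped to scaling factors, so the following NumPy sketch is purely illustrative and should not be read as the authors' algorithm. It assumes (hypothetically) that spectral structure is summarized as the energy fraction in radial frequency bands, that each band's scale deviates linearly from a given base scale, and that RoPE uses a 1D layout with interleaved rotary pairs ordered from highest to lowest frequency. All function names and the `strength` parameter are made up for illustration.

```python
import numpy as np

def band_energy(latent: np.ndarray, num_bands: int) -> np.ndarray:
    """Fraction of spectral energy per radial frequency band of a (C, H, W) latent."""
    _, h, w = latent.shape
    # Power spectrum summed over channels.
    power = (np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2).sum(axis=0)
    fy = np.fft.fftfreq(h)[:, None]          # cycles per sample along height
    fx = np.fft.fftfreq(w)[None, :]          # cycles per sample along width
    radius = np.sqrt(fy ** 2 + fx ** 2)      # radial spatial frequency
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    idx = np.digitize(radius, edges[1:-1])   # band index in 0..num_bands-1
    energy = np.bincount(idx.ravel(), weights=power.ravel(), minlength=num_bands)
    return energy / max(energy.sum(), 1e-12)

def component_scales(energy: np.ndarray, base_scale: float, strength: float = 0.5) -> np.ndarray:
    """Hypothetical mapping: bands with above-average energy get a larger scale."""
    uniform = 1.0 / energy.size
    return base_scale * (1.0 + strength * (energy - uniform))

def per_dim_scales(band_scales: np.ndarray, head_dim: int) -> np.ndarray:
    """Expand per-band scales to head dimensions (assumed interleaved rotary pairs)."""
    num_pairs = head_dim // 2
    # Assumption: pair 0 has the highest RoPE frequency, so it maps to the last band.
    band_of_pair = (np.arange(num_pairs)[::-1] * band_scales.size) // num_pairs
    return np.repeat(band_scales[band_of_pair], 2)

def scaled_logits(q: np.ndarray, k: np.ndarray, dim_scales: np.ndarray) -> np.ndarray:
    """Attention logits in which each rotary component's contribution is rescaled."""
    return (q * dim_scales) @ k.T / np.sqrt(q.shape[-1])
```

In a denoising loop, one would recompute `band_energy` on the current latent at every step and feed the resulting per-dimension scales into attention, which is what makes the scaling content-dependent rather than uniform. With `strength=0` the sketch reduces to uniform scaling by `base_scale`, the content-agnostic behavior the abstract attributes to existing methods.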