SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

May 21, 2026
Authors: Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati
cs.AI

Abstract

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
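
The contrast between uniform and content-aware attention scaling can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not specify how spectral energy is mapped to scale factors, so the rule below (`component_scales`, its `strength` parameter, and the band-to-component mapping) is a hypothetical stand-in, and a 1-D signal stands in for a 2-D latent. Only the RoPE frequency formula is the standard one. The sketch measures how a signal's spectral energy is spread over frequency bands, then raises the scale for RoPE components whose band holds more than its share of energy and lowers it for the rest.

```python
import cmath
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, one per rotated pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def band_energy_shares(signal, num_bands):
    """Fraction of spectral energy in each equal-width band (low to high).

    Uses a naive DFT over the positive frequencies and skips the DC term.
    """
    n = len(signal)
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        power.append(abs(coeff) ** 2)
    bands = [0.0] * num_bands
    for idx, p in enumerate(power):
        bands[idx * num_bands // len(power)] += p
    total = sum(bands)
    if total < 1e-12:
        # No measurable structure: treat the spectrum as flat.
        return [1.0 / num_bands] * num_bands
    return [b / total for b in bands]


def component_scales(signal, head_dim, base_scale=1.0, strength=0.5):
    """Hypothetical rule: one attention scale per RoPE component.

    Component 0 has the highest RoPE frequency, so it is paired with the
    highest spatial-frequency band (an assumption of this sketch). A flat
    spectrum gives every component exactly base_scale, which recovers
    uniform scaling.
    """
    num_components = head_dim // 2
    shares = band_energy_shares(signal, num_components)
    scales = []
    for i in range(num_components):
        band = num_components - 1 - i
        relative = shares[band] * num_components  # 1.0 means "fair share"
        scales.append(base_scale * (1.0 + strength * (relative - 1.0)))
    return scales


if __name__ == "__main__":
    smooth = [math.sin(2 * math.pi * t / 32) for t in range(32)]
    print(rope_frequencies(8))
    print(component_scales(smooth, head_dim=8))
```

For a smooth, low-frequency input, this rule boosts the lowest-frequency RoPE component and damps the others, while the mean scale across components stays at `base_scale`. In a real pipeline the energy shares would be recomputed from the latent at every denoising step, as the abstract describes.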