SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Abstract

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance degrades when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying attention behavior at inference time, typically by combining Rotary Position Embedding (RoPE) extrapolation with attention scaling. However, these strategies apply uniform, content-agnostic scaling across RoPE components that have distinct frequency characteristics, which forces a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the spatial-frequency structure of the latent at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
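
To make the idea of content-dependent, per-component attention scaling concrete, the following is a minimal illustrative sketch and not the authors' implementation. The abstract does not specify how spectral energy is measured or how it maps to scale factors, so everything here is an assumption: the function names (`band_energies`, `component_scales`, `scaled_attention_logits`), the radial banding of the 2D FFT, the linear energy-to-scale mapping with its `strength` parameter, and the use of contiguous channel groups as stand-ins for RoPE frequency components.

```python
import numpy as np


def band_energies(latent, num_bands):
    """Share of spectral energy in each radial frequency band of a 2D latent (H, W)."""
    h, w = latent.shape
    power = np.abs(np.fft.fft2(latent)) ** 2
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)  # radial frequency in cycles per sample
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    # Interior edges give band indices 0 .. num_bands - 1.
    idx = np.digitize(radius, edges[1:-1])
    energy = np.bincount(idx.ravel(), weights=power.ravel(), minlength=num_bands)
    total = energy.sum()
    if total == 0.0:
        # Degenerate all-zero latent: fall back to a uniform distribution.
        return np.full(num_bands, 1.0 / num_bands)
    return energy / total


def component_scales(energy, base_scale, strength=0.5):
    """Assumed mapping: shift a uniform base scale by each band's excess energy share."""
    uniform = 1.0 / len(energy)
    return base_scale * (1.0 + strength * (energy - uniform))


def scaled_attention_logits(q, k, scales):
    """Attention logits where each channel group gets its own scale.

    q: (N, D), k: (M, D); D must be divisible by len(scales).
    Contiguous channel groups stand in for RoPE frequency components (assumption).
    """
    n, d = q.shape
    groups = len(scales)
    qg = q.reshape(n, groups, d // groups)
    kg = k.reshape(k.shape[0], groups, d // groups)
    per_group = np.einsum("igd,jgd->gij", qg, kg)  # (groups, N, M)
    return np.tensordot(scales, per_group, axes=1) / np.sqrt(d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((16, 16))  # stand-in for one latent channel
    energy = band_energies(latent, num_bands=4)
    scales = component_scales(energy, base_scale=1.2)
    q = rng.standard_normal((5, 8))
    k = rng.standard_normal((6, 8))
    print(scaled_attention_logits(q, k, scales).shape)
```

In this sketch, a latent whose energy is spread evenly across bands reproduces uniform scaling at `base_scale`, while a latent dominated by a particular band raises the scale of the corresponding component group. Recomputing `band_energies` at every denoising step is what would make the scaling step-dependent.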