Fine-Grained Perturbation Guidance via Attention Head Selection

Abstract

Recent guidance methods for diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and then guiding generation away from it. Among these approaches, attention perturbation has shown strong empirical performance in unconditional scenarios, where classifier-free guidance is not applicable. However, existing attention perturbation methods lack a principled way to determine where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures, where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, from the layer level down to individual attention heads, and find that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework that iteratively selects attention heads aligned with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob for tuning perturbation strength and suppressing artifacts. Our approach not only mitigates the oversmoothing of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models, including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling the practical design of effective perturbation strategies.
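The SoftPAG interpolation and the "guide away from the weak model" step described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `soft_pag_attention` and `guide_away`, the parameter names `strength` and `scale`, the tensor layout (heads, tokens, dim), and the specific extrapolation form used for guidance are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_pag_attention(q, k, v, selected_heads, strength):
    """Self-attention with selected heads pulled toward identity attention.

    q, k, v: arrays of shape (heads, tokens, dim) (layout is an assumption).
    selected_heads: indices of the heads to perturb (hypothetical input,
        e.g. the result of some head-selection procedure).
    strength: interpolation weight in [0, 1] (hypothetical name);
        0 leaves the head untouched, 1 replaces its map with the identity.
    """
    num_heads, num_tokens, dim = q.shape
    # Standard scaled dot-product attention maps, one per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dim))
    identity = np.eye(num_tokens)
    for h in selected_heads:
        # Linear interpolation between the original map and the identity.
        attn[h] = (1.0 - strength) * attn[h] + strength * identity
    return attn @ v, attn


def guide_away(pred, pred_perturbed, scale):
    """Extrapolate away from the perturbed (weak) prediction.

    The exact guidance rule is an assumption here; this is the common
    form pred + scale * (pred - pred_perturbed).
    """
    return pred + scale * (pred - pred_perturbed)
```

Because both the attention map and the identity matrix have rows that sum to one, every interpolated row still sums to one, so `strength` behaves as a continuous knob between the unperturbed head and a head where each token attends only to itself.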
