Fine-Grained Perturbation Guidance via Attention Head Selection

Abstract

Recent guidance methods for diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and then guiding generation away from it. Among these approaches, attention perturbation has shown strong empirical performance in unconditional scenarios, where classifier-free guidance is not applicable. However, existing attention perturbation methods lack a principled way to decide where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures, where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, from the layer level down to individual attention heads, and find that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework that iteratively selects attention heads aligned with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce "SoftPAG", which linearly interpolates each selected head's attention map toward the identity matrix, providing a continuous knob for tuning perturbation strength and suppressing artifacts. Our approach not only mitigates the oversmoothing of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models, including Stable Diffusion 3 and FLUX.1, and demonstrate superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling the practical design of effective perturbation strategies.
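The interpolation toward the identity matrix and the step of guiding away from the perturbed model can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`soft_pag`, `perturb_heads`, `guide`), the interpolation weight `u`, the `(layer, head)` keying of attention maps, and the extrapolation form `pred + scale * (pred - pred_perturbed)` are all assumptions made for illustration, and attention maps are assumed to be square, row-stochastic matrices.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_pag(attn, u):
    """Linearly interpolate a square attention map toward the identity.

    u = 0.0 leaves the map unchanged; u = 1.0 replaces it with the
    identity matrix. The parameter name `u` is an assumption.
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    n = len(attn)
    return [
        [(1.0 - u) * attn[i][j] + u * (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def perturb_heads(attn_maps, selected, u):
    """Perturb only the selected heads.

    attn_maps maps a hypothetical (layer, head) key to an attention map;
    heads outside `selected` pass through untouched.
    """
    return {
        key: soft_pag(a, u) if key in selected else a
        for key, a in attn_maps.items()
    }


def guide(pred, pred_perturbed, scale):
    """Extrapolate away from the perturbed (weak) prediction.

    This extrapolation form is an assumed, commonly used one, not a
    formula taken from the abstract.
    """
    return [p + scale * (p - q) for p, q in zip(pred, pred_perturbed)]


if __name__ == "__main__":
    attn = [softmax([0.0, 1.0]), softmax([2.0, 0.0])]
    maps = {(0, 0): attn, (0, 1): attn}
    perturbed = perturb_heads(maps, selected={(0, 1)}, u=0.5)
    print(perturbed[(0, 1)])
    print(guide([1.0, 2.0], [0.5, 2.5], scale=2.0))
```

Because both the original map and the identity have rows summing to one, every interpolated map remains row-stochastic for any `u` in [0, 1], which is what lets the perturbation strength vary continuously.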
