

Fine-Grained Perturbation Guidance via Attention Head Selection

June 12, 2025
作者: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
cs.AI

Abstract

Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
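The SoftPAG mechanism described in the abstract amounts to a per-head linear interpolation between each selected head's attention map and the identity matrix. The snippet below is a minimal sketch of that idea, not the authors' released code; the function name `softpag_attention`, the `selected_heads` argument, the tensor layout, and the strength parameter `lam` are illustrative assumptions.

```python
# Minimal sketch of SoftPAG-style head perturbation (illustrative, not the authors' code).
# Each selected head's softmax attention map is pulled toward the identity matrix,
# with `lam` acting as a continuous knob on perturbation strength.
import torch

def softpag_attention(attn: torch.Tensor, selected_heads, lam: float) -> torch.Tensor:
    """Interpolate attention maps of selected heads toward identity.

    attn: (batch, heads, tokens, tokens) softmax-normalized attention maps
    selected_heads: indices of heads to perturb (e.g., those picked by a HeadHunter-style search)
    lam: perturbation strength in [0, 1]; 0 leaves the head unchanged, 1 replaces it with identity
    """
    identity = torch.eye(attn.shape[-1], device=attn.device, dtype=attn.dtype)
    perturbed = attn.clone()
    perturbed[:, selected_heads] = (1.0 - lam) * attn[:, selected_heads] + lam * identity
    return perturbed
```

In a perturbation-guidance setup, the model would be run once normally and once with the perturbed attention maps, and the two predictions combined so that sampling is steered away from the weakened (perturbed) output.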