

Fine-Grained Perturbation Guidance via Attention Head Selection

June 12, 2025
作者: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
cs.AI

Abstract

Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
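The abstract pins down one concrete operation for SoftPAG: each selected head's attention map is linearly interpolated toward the identity matrix, with the interpolation weight acting as a continuous perturbation-strength knob. Below is a minimal sketch of that interpolation, together with one common way perturbation guidance extrapolates away from the perturbed (weak) prediction; the function and parameter names (`softpag_attention_map`, `lam`, `guidance_scale`) are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def softpag_attention_map(attn: torch.Tensor, lam: float) -> torch.Tensor:
    """Linearly interpolate a head's attention map toward the identity matrix.

    attn: (..., seq_len, seq_len) attention weights of one selected head.
    lam:  strength in [0, 1]; lam = 0 leaves the map unchanged,
          lam = 1 replaces it with the identity (a hard perturbation).
    """
    seq_len = attn.shape[-1]
    identity = torch.eye(seq_len, device=attn.device, dtype=attn.dtype)
    return (1.0 - lam) * attn + lam * identity


def perturbation_guidance(pred_original: torch.Tensor,
                          pred_perturbed: torch.Tensor,
                          guidance_scale: float) -> torch.Tensor:
    """Steer sampling away from the implicit weak (perturbed) model by
    extrapolating from its prediction toward the unperturbed one."""
    return pred_original + guidance_scale * (pred_original - pred_perturbed)
```

In a DiT sampling loop, the interpolation would presumably be applied only to the attention heads selected by HeadHunter during a second, perturbed forward pass, leaving all other heads untouched.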