
Group Relative Attention Guidance for Image Editing

October 28, 2025
Authors: Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Haoxiang Cao, Kai Wu, An-an Liu
cs.AI

Abstract

Image editing based on Diffusion-in-Transformer (DiT) models has developed rapidly in recent years. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance (GRAG), a simple yet effective method that reweights the delta values of different tokens to modulate the model's focus on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code and consistently enhances editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
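The abstract describes GRAG as reweighting each token's delta from a shared, layer-dependent bias inside MM-Attention. Since the authors' code is not yet released, the following PyTorch sketch is only an illustration under stated assumptions: the bias is approximated by the per-group token mean, the reweighting is applied to the Key tokens of the input-image group, and all names (`apply_grag`, `img_slice`, `grag_scale`) are hypothetical, not the paper's API.

```python
import torch

def apply_grag(k: torch.Tensor, img_slice: slice, grag_scale: float = 1.0) -> torch.Tensor:
    """Sketch of Group Relative Attention Guidance on Key tokens.

    k:          Key tokens from one MM-Attention layer, shape (batch, heads, seq, head_dim).
    img_slice:  sequence positions holding the input-image tokens (hypothetical layout).
    grag_scale: > 1 shifts attention toward the input image, < 1 toward the instruction.
    """
    # Approximate the shared, layer-dependent bias by the group mean
    # (the paper identifies this bias empirically; using the mean is an assumption here).
    bias = k[..., img_slice, :].mean(dim=-2, keepdim=True)
    # Reweight the content-specific delta around the bias, leaving the bias itself intact.
    k = k.clone()
    k[..., img_slice, :] = bias + grag_scale * (k[..., img_slice, :] - bias)
    return k
```

Applied inside each attention layer, a single scalar would sweep the editing strength continuously, which is consistent with the abstract's claim of a roughly four-line integration; the exact token group, bias estimate, and insertion point in the authors' implementation may differ.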