ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
October 20, 2025
Authors: Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai
cs.AI
Abstract
Recent advances in training-free attention control methods have enabled
flexible and efficient text-guided editing capabilities for existing generation
models. However, current approaches struggle to deliver strong editing strength
while preserving consistency with the source. This limitation
becomes particularly critical in multi-round and video editing, where visual
errors can accumulate over time. Moreover, most existing methods enforce global
consistency, which limits their ability to modify individual attributes such as
texture while preserving others, thereby hindering fine-grained editing.
Recently, the architectural shift from U-Net to MM-DiT has brought significant
improvements in generative performance and introduced a novel mechanism for
integrating text and vision modalities. These advancements pave the way for
overcoming challenges that previous methods failed to resolve. Through an
in-depth analysis of MM-DiT, we identify three key insights into its attention
mechanisms. Building on these, we propose ConsistEdit, a novel attention
control method specifically tailored for MM-DiT. ConsistEdit incorporates
vision-only attention control, mask-guided pre-attention fusion, and
differentiated manipulation of the query, key, and value tokens to produce
consistent, prompt-aligned edits. Extensive experiments demonstrate that
ConsistEdit achieves state-of-the-art performance across a wide range of image
and video editing tasks, including both structure-consistent and
structure-inconsistent scenarios. Unlike prior methods, it is the first approach
to perform editing across all inference steps and attention layers without
handcrafted adjustments, significantly enhancing reliability and consistency, which
enables robust multi-round and multi-region editing. Furthermore, it supports
progressive adjustment of structural consistency, enabling finer control.
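The method sentence in the abstract is dense, so the following is a rough, illustrative sketch (not the authors' implementation) of how the three ingredients — vision-only attention control, mask-guided pre-attention fusion, and differentiated handling of query, key, and value tokens — could combine in a single-head MM-DiT joint-attention layer. The function name, tensor layout, the edit_mask, and the consistency blending scheme are all assumptions made for illustration.

```python
# Minimal sketch, assuming a single-head MM-DiT joint-attention layer where
# text and vision tokens attend together. "Source" projections come from
# reconstructing the original content, "target" projections from the edited
# prompt. This is an illustration of the idea, not the authors' code.
import torch
import torch.nn.functional as F

def edited_joint_attention(q_src, k_src, v_src,   # source-branch projections
                           q_tgt, k_tgt, v_tgt,   # target-branch projections
                           n_text, edit_mask, consistency=1.0):
    """All tensors are (tokens, dim); the first n_text tokens are text tokens,
    the rest are vision tokens. edit_mask is (n_vision, 1) with 1.0 inside the
    region to edit. consistency in [0, 1] controls how much source structure is
    kept inside the edited region (hypothetical knob for progressive control)."""
    q, k, v = q_tgt.clone(), k_tgt.clone(), v_tgt.clone()
    m = edit_mask  # (n_vision, 1), broadcasts over the feature dimension

    # Vision-only, mask-guided pre-attention fusion: text tokens are never
    # touched. Outside the mask, source Q/K/V are restored so the unedited
    # region is reproduced; inside the mask, Q/K are blended toward the source
    # to keep structure, while V stays from the target branch so the new
    # prompt can still change appearance (e.g. texture or color).
    q[n_text:] = (1 - m) * q_src[n_text:] + m * (
        (1 - consistency) * q_tgt[n_text:] + consistency * q_src[n_text:])
    k[n_text:] = (1 - m) * k_src[n_text:] + m * (
        (1 - consistency) * k_tgt[n_text:] + consistency * k_src[n_text:])
    v[n_text:] = (1 - m) * v_src[n_text:] + m * v_tgt[n_text:]

    # Standard scaled dot-product attention over the fused joint sequence.
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v
```

In this sketch, lowering consistency toward 0 relaxes the structural alignment inside the masked region while the background remains pinned to the source, which is one plausible way to realize the progressive structural-consistency adjustment the abstract describes.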