

ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

October 20, 2025
Authors: Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai
cs.AI

Abstract

Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing for existing generation models. However, current approaches struggle to deliver strong edits while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating the text and vision modalities. These advances pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanism. Building on these insights, we propose ConsistEdit, a novel attention control method tailored to MM-DiT. ConsistEdit combines vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, covering both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without hand-crafted step or layer selection, which significantly enhances reliability and consistency and enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer-grained control.
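To make the described mechanism concrete, below is a minimal PyTorch sketch of how mask-guided pre-attention fusion with differentiated query/key/value handling could look inside an MM-DiT joint-attention block. It is an illustration under stated assumptions, not the paper's released implementation: the function name `consistency_fused_attention`, the `structure_weight` parameter, the convention that text tokens precede vision tokens in the joint sequence, and the specific blending rule are all hypothetical.

```python
import torch

def consistency_fused_attention(
    q_edit, k_edit, v_edit,   # (batch, heads, seq, dim) projections from the editing pass
    q_src, k_src, v_src,      # cached projections from the source-generation pass
    edit_mask,                # (n_vision,) bool, True where the prompt edit applies
    n_text,                   # number of text tokens prepended to the joint sequence
    structure_weight=1.0,     # 1.0 keeps source structure inside the mask; 0.0 frees it
):
    """Illustrative mask-guided pre-attention fusion applied to vision tokens only."""
    # 1.0 outside the edit region, 0.0 inside it; broadcastable over (b, h, n, d).
    keep = (~edit_mask).float().view(1, 1, -1, 1)

    def fuse(edit, src, w_inside):
        # Vision-only control: text tokens pass through untouched. Vision tokens
        # outside the mask copy the source outright; those inside blend by w_inside.
        vis_edit, vis_src = edit[:, :, n_text:], src[:, :, n_text:]
        w = keep + (1.0 - keep) * w_inside
        vis = vis_edit + w * (vis_src - vis_edit)
        return torch.cat([edit[:, :, :n_text], vis], dim=2)

    # Differentiated Q/K/V handling (an assumption for this sketch): queries and
    # keys carry structure, so they are pulled toward the source even inside the
    # edit region; values carry appearance, so inside the region they are left
    # to follow the new prompt.
    q = fuse(q_edit, q_src, structure_weight)
    k = fuse(k_edit, k_src, structure_weight)
    v = fuse(v_edit, v_src, 0.0)

    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```

In this reading, sweeping `structure_weight` between 1.0 and 0.0 would correspond to the abstract's progressive adjustment of structural consistency, from structure-consistent attribute edits down to structure-inconsistent ones.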