ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
October 20, 2025
Authors: Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai
cs.AI
Abstract
Recent advances in training-free attention control methods have enabled
flexible and efficient text-guided editing capabilities for existing generation
models. However, current approaches struggle to deliver strong editing strength
while preserving consistency with the source content. This limitation
becomes particularly critical in multi-round and video editing, where visual
errors can accumulate over time. Moreover, most existing methods enforce global
consistency, which limits their ability to modify individual attributes such as
texture while preserving others, thereby hindering fine-grained editing.
Recently, the architectural shift from U-Net to MM-DiT has brought significant
improvements in generative performance and introduced a novel mechanism for
integrating text and vision modalities. These advancements pave the way for
overcoming challenges that previous methods failed to resolve. Through an
in-depth analysis of MM-DiT, we identify three key insights into its attention
mechanisms. Building on these, we propose ConsistEdit, a novel attention
control method specifically tailored for MM-DiT. ConsistEdit incorporates
vision-only attention control, mask-guided pre-attention fusion, and
differentiated manipulation of the query, key, and value tokens to produce
consistent, prompt-aligned edits. Extensive experiments demonstrate that
ConsistEdit achieves state-of-the-art performance across a wide range of image
and video editing tasks, including both structure-consistent and
structure-inconsistent scenarios. Unlike prior methods, it is the first
approach to perform editing across all inference steps and attention layers
without manual intervention, significantly enhancing reliability and
consistency, which enables robust multi-round and multi-region editing.
Furthermore, it supports progressive adjustment of structural consistency,
enabling finer control.
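
The three components named above (vision-only control, mask-guided pre-attention fusion, and differentiated handling of query, key, and value tokens) can be illustrated with a minimal single-head NumPy sketch. The abstract does not specify which tokens are swapped, so the assignment used here is an assumption made for illustration only: source Q and K are reused for all vision tokens, and source V is reused only outside the edit mask. The function names `fuse_vision_tokens` and `controlled_joint_attention` are hypothetical and do not come from the paper or any library.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_vision_tokens(src, tgt, edit_mask, n_text, src_inside, src_outside):
    """Blend source and target tokens before attention is computed.

    src, tgt:    (n_text + n_vis, d) token matrices from the source and
                 target (edited) branches; text tokens come first.
    edit_mask:   (n_vis,) bool, True for vision tokens in the edited region.
    src_inside:  take source tokens inside the mask if True.
    src_outside: take source tokens outside the mask if True.
    Text tokens are always left as in the target branch (vision-only control).
    """
    out = tgt.copy()
    take_src = np.where(edit_mask, src_inside, src_outside)  # (n_vis,) bool
    out[n_text:] = np.where(take_src[:, None], src[n_text:], tgt[n_text:])
    return out


def controlled_joint_attention(q_s, k_s, v_s, q_t, k_t, v_t, edit_mask, n_text):
    """Joint text+vision attention with assumed Q/K/V-specific fusion rules."""
    # Assumption: Q and K carry structure, so reuse the source everywhere.
    q = fuse_vision_tokens(q_s, q_t, edit_mask, n_text, True, True)
    k = fuse_vision_tokens(k_s, k_t, edit_mask, n_text, True, True)
    # Assumption: V carries appearance, so keep the source only outside the mask.
    v = fuse_vision_tokens(v_s, v_t, edit_mask, n_text, False, True)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_text, n_vis, d = 2, 4, 8
    src = rng.normal(size=(n_text + n_vis, d))
    tgt = rng.normal(size=(n_text + n_vis, d))
    mask = np.array([True, False, True, False])
    out = controlled_joint_attention(src, src, src, tgt, tgt, tgt, mask, n_text)
    print(out.shape)
```

Because the fusion happens on the token matrices before the softmax, the same rule can be applied uniformly at every layer and step in this sketch; a real MM-DiT would apply it per head and per block.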