ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Abstract

Recent advances in training-free attention control have enabled flexible and efficient text-guided editing with existing generative models. However, current approaches struggle to deliver strong edits while preserving consistency with the source. This limitation is especially critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, and thereby hinders fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating the text and vision modalities. These advances pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored to MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without manual intervention, which significantly enhances reliability and consistency and enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.
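To make the three ingredients named in the abstract more concrete, the following is a minimal sketch of what mask-guided fusion applied before attention, restricted to vision tokens, with separate choices for Q/K and for V, could look like. It is an illustration under stated assumptions and not the authors' implementation. The function names (`fuse_vision_tokens`, `attention`, `controlled_attention`), the token layout (text tokens first, then vision tokens), and the choice of which positions take source tokens are all hypothetical. The abstract does not specify the actual fusion rule, so the sketch leaves it to the caller. Plain Python lists are used to keep the example dependency-free.

```python
import math


def fuse_vision_tokens(src, tgt, n_text, use_src):
    """Keep all text tokens from `tgt`; for vision token i, take the row from
    `src` when use_src[i] is True, otherwise from `tgt`.

    Assumed layout (hypothetical): the first `n_text` rows are text tokens,
    the remaining len(use_src) rows are vision tokens.
    """
    fused = [row[:] for row in tgt[:n_text]]  # text tokens are never modified
    for i, flag in enumerate(use_src):
        chosen = src if flag else tgt
        fused.append(chosen[n_text + i][:])
    return fused


def attention(q, k, v):
    """Plain scaled dot-product attention over lists of float vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def controlled_attention(src_qkv, tgt_qkv, n_text, qk_use_src, v_use_src):
    """Fuse source and target tokens BEFORE attention, then attend once.

    `qk_use_src` and `v_use_src` are per-vision-token boolean lists. Passing
    different lists for Q/K and for V is what "differentiated" means in this
    sketch; the specific policy is an assumption supplied by the caller.
    """
    (sq, sk, sv), (tq, tk, tv) = src_qkv, tgt_qkv
    q = fuse_vision_tokens(sq, tq, n_text, qk_use_src)
    k = fuse_vision_tokens(sk, tk, n_text, qk_use_src)
    v = fuse_vision_tokens(sv, tv, n_text, v_use_src)
    return attention(q, k, v)


if __name__ == "__main__":
    # One text token followed by two vision tokens, each of dimension 2.
    src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    tgt = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
    edit_mask = [True, False]  # hypothetical: first vision token is edited
    out = controlled_attention(
        (src, src, src), (tgt, tgt, tgt), n_text=1,
        qk_use_src=edit_mask,                      # illustrative choice only
        v_use_src=[not m for m in edit_mask],      # illustrative choice only
    )
    print(out)
```

Because fusion happens on the token lists before the attention call, a single attention pass sees a consistent mixture of source and target tokens, and the text tokens always come from the target prompt. Which mask is used for which of Q, K, and V is the part this sketch does not take from the source text.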
