Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model

Abstract

Balancing fidelity and editability is essential in text-based image editing (TIE): when the balance fails, the result is typically over-editing or under-editing. Existing methods typically rely on attention injection for structure preservation and on the inherent text-alignment capability of pre-trained text-to-image (T2I) models for editability, but they lack an explicit, unified mechanism for balancing the two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to integrate fidelity and editability in a balanced way within a single framework. Instead of injecting attention directly, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint that strengthens text alignment for improved editability. Applying both constraints simultaneously, however, can cause gradient conflicts, in which one constraint dominates and the result is over- or under-edited. To address this, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of the two constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments show that our approach achieves a more robust balance between structure preservation and text alignment than other state-of-the-art methods across a variety of editing tasks. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.
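
The interplay between the two constraints can be illustrated with a toy sketch. This is not the UnifyEdit implementation: the abstract does not give the loss definitions or the scheduling rule, so everything here is an assumption made for illustration. Two quadratic losses stand in for the SA preservation and CA alignment constraints, and the function names (`quadratic_constraint`, `adaptive_weights`, `optimize_latent`), the weighting rule, and the step size are hypothetical. The sketch only shows the general idea of reweighting two competing gradients per step so that neither dominates the latent update.

```python
import numpy as np


def quadratic_constraint(z, anchor):
    """Toy constraint: squared distance to an anchor. Returns (loss, gradient)."""
    diff = z - anchor
    return 0.5 * float(diff @ diff), diff


def adaptive_weights(step, num_steps, loss_sa, loss_ca, eps=1e-8):
    """Hypothetical scheduler (an assumption, not the paper's rule).

    Returns (w_sa, w_ca) with w_sa + w_ca == 1. The preservation weight grows
    with the step index and with the share of the total loss that the
    preservation term currently accounts for.
    """
    progress = step / max(num_steps - 1, 1)          # 0 at first step, 1 at last
    share_sa = loss_sa / (loss_sa + loss_ca + eps)   # how violated SA is, relatively
    w_sa = 0.5 * (progress + share_sa)
    return w_sa, 1.0 - w_sa


def optimize_latent(z_src, z_tgt, num_steps=20, lr=0.2):
    """Gradient descent on a latent under two reweighted toy constraints."""
    z = z_src.copy()
    history = []
    for step in range(num_steps):
        # Stand-in for the SA preservation constraint: stay near the source.
        loss_sa, grad_sa = quadratic_constraint(z, z_src)
        # Stand-in for the CA alignment constraint: move toward the edit target.
        loss_ca, grad_ca = quadratic_constraint(z, z_tgt)
        w_sa, w_ca = adaptive_weights(step, num_steps, loss_sa, loss_ca)
        z = z - lr * (w_sa * grad_sa + w_ca * grad_ca)
        history.append((w_sa, w_ca))
    return z, history
```

Calling `optimize_latent(np.zeros(4), np.ones(4))` returns a latent that lies strictly between the source and target anchors, together with the per-step weights. In an actual diffusion pipeline the reweighting would presumably happen at each denoising time step, with losses computed from attention maps rather than from distances to fixed vectors.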
