MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance

Abstract

Recent diffusion-based image editing approaches have shown impressive editing capabilities on images with simple compositions. However, localized editing in complex scenarios has not been well studied in the literature, despite growing real-world demand. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region. Meanwhile, mask-free attention-based methods often exhibit editing leakage and misalignment in more complex compositions. In this work, we develop MAG-Edit, a training-free, inference-stage optimization method that enables localized image editing in complex scenarios. In particular, MAG-Edit optimizes the noise latent feature in diffusion models by maximizing two mask-based cross-attention constraints of the edit token, which in turn gradually enhances the local alignment with the desired prompt. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in achieving both text alignment and structure preservation for localized editing within complex scenarios.
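The general idea of steering a latent by maximizing the edit token's cross-attention inside a mask can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define the two constraints, so the single objective below (mean attention on the edit token inside the mask) is an assumption chosen for illustration, there is no diffusion model involved, and all names (cross_attention, masked_attention, optimize_latent, edit_idx) are hypothetical.

```python
import numpy as np


def cross_attention(z, keys):
    """Softmax attention of each spatial position (row of z) over prompt tokens."""
    logits = z @ keys.T / np.sqrt(z.shape[1])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def masked_attention(z, keys, mask, edit_idx):
    """Toy objective (an assumption, not the paper's constraints):
    mean attention paid to the edit token inside the mask."""
    attn = cross_attention(z, keys)
    return float(attn[mask, edit_idx].mean())


def masked_attention_grad(z, keys, mask, edit_idx):
    """Analytic gradient of masked_attention with respect to the latent z."""
    attn = cross_attention(z, keys)
    a_edit = attn[:, edit_idx:edit_idx + 1]      # (P, 1) attention on edit token
    direction = keys[edit_idx] - attn @ keys     # (P, d) softmax derivative term
    grad = a_edit * direction / np.sqrt(z.shape[1])
    return grad * mask[:, None] / mask.sum()     # zero outside the mask


def optimize_latent(z, keys, mask, edit_idx, lr=0.5, steps=100):
    """Plain gradient ascent on the latent; returns an updated copy."""
    z = z.copy()
    for _ in range(steps):
        z += lr * masked_attention_grad(z, keys, mask, edit_idx)
    return z


rng = np.random.default_rng(0)
z0 = rng.normal(size=(16, 8))        # 4x4 grid of latent positions, 8 channels
keys = rng.normal(size=(4, 8))       # 4 prompt tokens (random stand-ins)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True                  # edit region: top-left 2x2 block
mask = mask.ravel()

z1 = optimize_latent(z0, keys, mask, edit_idx=2)
before = masked_attention(z0, keys, mask, edit_idx=2)
after = masked_attention(z1, keys, mask, edit_idx=2)
print(f"masked attention of edit token: {before:.3f} -> {after:.3f}")
```

Because the gradient is multiplied by the mask, latent positions outside the edit region are left untouched in this toy setting, which mirrors the locality goal described in the abstract. A real system would obtain the attention maps from the diffusion model's cross-attention layers and backpropagate through them rather than use a closed-form gradient.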
