LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Abstract

Text-guided image editing aims to modify specific regions of an image according to natural language instructions while preserving the overall structure and background fidelity. Existing methods use masks derived from cross-attention maps produced by diffusion models to identify the regions to modify. However, because cross-attention mechanisms focus on semantic relevance, they struggle to maintain image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach that uses patch relationships derived from self-attention to keep attention smooth and coherent across image regions. This ensures that alterations are limited to the designated items while the surrounding structure is retained. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating state-of-the-art performance and effectiveness on various editing tasks. Code is available at https://github.com/LOCATEdit/LOCATEdit/
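The abstract does not spell out the optimization itself, so the following is only a minimal sketch of one standard way to smooth a cross-attention map over a patch graph, and it should not be read as the authors' implementation. It assumes the self-attention matrix is symmetrized into edge weights W, builds the combinatorial Laplacian L = D - W, and minimizes ||m - a||^2 + lam * m^T L m, whose closed-form solution is m = (I + lam * L)^{-1} a. The function name `laplacian_smooth` and the parameter `lam` are hypothetical and chosen purely for illustration.

```python
import numpy as np


def laplacian_smooth(cross_attn, self_attn, lam=1.0):
    """Smooth a flattened cross-attention map over a patch graph.

    cross_attn: shape (N,), one attention value per image patch.
    self_attn:  shape (N, N), non-negative patch-to-patch affinities.
    lam:        hypothetical smoothing strength (an assumption, not from the paper).
    """
    a = np.asarray(cross_attn, dtype=float)
    s = np.asarray(self_attn, dtype=float)

    # Symmetrize the affinities so they define an undirected graph.
    w = 0.5 * (s + s.T)
    np.fill_diagonal(w, 0.0)  # drop self-loops

    # Combinatorial graph Laplacian L = D - W.
    lap = np.diag(w.sum(axis=1)) - w

    # The minimizer of ||m - a||^2 + lam * m^T L m solves (I + lam * L) m = a.
    return np.linalg.solve(np.eye(a.size) + lam * lap, a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 16  # e.g. a 4x4 grid of patches, flattened
    scores = rng.random((n, n))
    self_attn = scores / scores.sum(axis=1, keepdims=True)  # row-normalized, like softmax output
    cross_attn = rng.random(n)
    smoothed = laplacian_smooth(cross_attn, self_attn, lam=5.0)
    print(smoothed.reshape(4, 4))
```

Under these assumptions, a larger `lam` pulls strongly connected patches toward similar attention values, while `lam = 0` returns the input map unchanged. Because the rows of L sum to zero, the total attention mass is preserved by the solve.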
