Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Abstract

Adding objects to images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.
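
To make the idea of attending over three sources concrete, the following is a minimal sketch of one possible weighted extended-attention step for a single query token. It is an illustration under stated assumptions, not the Add-it implementation: the function name `weighted_extended_attention`, the per-source `weights` dictionary, and the choice to apply each weight as a log-bias on the attention logits are all hypothetical. The abstract does not specify how the weighting is computed.

```python
import math


def weighted_extended_attention(query, sources, weights):
    """Attend from one query vector over tokens pooled from several sources.

    query:   list of d floats (e.g. a token of the image being generated)
    sources: dict name -> (keys, values); keys is a list of d-vectors and
             values a list of vectors sharing one length
    weights: dict name -> positive float (hypothetical knob) that scales the
             unnormalised attention mass of every token from that source
    Returns (output_vector, attention_probabilities).
    """
    d = len(query)
    logits, values = [], []
    for name, (keys, vals) in sources.items():
        # Scaling a source's mass by w is the same as adding log(w) to its logits.
        bias = math.log(weights[name])
        for k, v in zip(keys, vals):
            score = sum(q * ki for q, ki in zip(query, k)) / math.sqrt(d)
            logits.append(score + bias)
            values.append(v)

    # Numerically stable softmax over the concatenated token sequence.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Weighted sum of value vectors.
    out = [sum(p * v[i] for p, v in zip(probs, values))
           for i in range(len(values[0]))]
    return out, probs


# Toy example: one token per source, identical keys, one-hot values so the
# output directly shows how much each source contributes.
toy_sources = {
    "scene": ([[1.0, 0.0]], [[1.0, 0.0, 0.0]]),
    "prompt": ([[1.0, 0.0]], [[0.0, 1.0, 0.0]]),
    "generated": ([[1.0, 0.0]], [[0.0, 0.0, 1.0]]),
}
toy_weights = {"scene": 2.0, "prompt": 1.0, "generated": 1.0}
output, attention = weighted_extended_attention([1.0, 0.0], toy_sources, toy_weights)
```

In this toy setup all three keys match the query equally, so the weights alone decide the split: the scene token receives about half of the attention and the prompt and generated tokens about a quarter each. Raising the scene weight favors preserving the original image, while lowering it gives the prompt and the generated image more influence.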
