SpotEdit: Evaluating Visually-Guided Image Editing Methods

Abstract

Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and do not adequately represent real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.
