SpotEdit: Evaluating Visually-Guided Image Editing Methods
August 25, 2025
Authors: Sara Ghazanfari, Wei-An Lin, Haitong Tian, Ersin Yumer
cs.AI
Abstract
Visually-guided image editing, where edits are conditioned on both visual
cues and textual prompts, has emerged as a powerful paradigm for fine-grained,
controllable content generation. Although recent generative models have shown
remarkable capabilities, existing evaluations remain simple and insufficiently
representative of real-world editing challenges. We present SpotEdit, a
comprehensive benchmark designed to systematically assess visually-guided image
editing methods across diverse diffusion, autoregressive, and hybrid generative
models, uncovering substantial performance disparities. To address a critical
yet underexplored challenge, our benchmark includes a dedicated component on
hallucination, highlighting how leading models, such as GPT-4o, often
hallucinate the existence of a visual cue and erroneously perform the editing
task. Our code and benchmark are publicly released at
https://github.com/SaraGhazanfari/SpotEdit.