Editor's Choice: Evaluating Abstract Intent in Image Editing via Atomic Entity Analysis

Abstract

Humans naturally communicate through abstract concepts such as "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely unexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that decomposes abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, and commonly default to under-editing or over-editing. Our analysis indicates that meaningful improvements depend heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment: it can serve as a reward model, help models correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.