Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

May 14, 2026
Authors: Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart
cs.AI

Abstract

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.
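
To make the idea of entity-level assessment more concrete, the following is a minimal sketch of how per-entity verdicts might be aggregated into an intent score and a preservation score, and how under-editing and over-editing could be flagged. The abstract does not specify the actual Entity-Rubrics scoring procedure, so everything here is an assumption for illustration: the `EntityJudgment` fields, the `score_edit` function, the simple fraction-based aggregation, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class EntityJudgment:
    """One per-entity verdict. All fields are hypothetical, not from the paper."""
    name: str
    should_change: bool  # does the abstract instruction call for editing this entity?
    changed: bool        # did the edited image actually alter this entity?


def fraction(flags):
    """Share of True values; an empty group counts as fully satisfied."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 1.0


def score_edit(judgments, threshold=0.5):
    """Aggregate per-entity verdicts (assumed scheme, not the paper's)."""
    # Intent: share of entities that needed editing and were edited.
    intent = fraction(j.changed for j in judgments if j.should_change)
    # Preservation: share of entities that should stay fixed and did.
    preservation = fraction(
        not j.changed for j in judgments if not j.should_change
    )
    failures = []
    if intent < threshold:
        failures.append("under-editing")
    if preservation < threshold:
        failures.append("over-editing")
    return {"intent": intent, "preservation": preservation, "failures": failures}


# Hypothetical verdicts for an instruction such as "make the scene feel cozier".
cozy_edit = [
    EntityJudgment("lighting", should_change=True, changed=True),
    EntityJudgment("color palette", should_change=True, changed=False),
    EntityJudgment("sky", should_change=True, changed=False),
    EntityJudgment("person", should_change=False, changed=False),
    EntityJudgment("dog", should_change=False, changed=False),
]
result = score_edit(cozy_edit)
print(result)
```

In this toy case only one of three target entities was edited while both non-target entities were left intact, so the sketch reports high preservation, low intent, and an under-editing flag. A real evaluator would obtain the per-entity verdicts from a judge model or human raters rather than hard-coded booleans.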