Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

May 14, 2026
Authors: Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart
cs.AI

Abstract

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.