Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing

Abstract

Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.