

GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

March 12, 2026
Authors: Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang
cs.AI

Abstract

Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.