GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Abstract

Unified multimodal models aim to integrate understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, and so offer limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, ranging from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations of current models under implicit, knowledge-intensive editing settings, along with large performance gaps between models. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the factors that constrain disciplinary editing. GRADE thus pinpoints key directions for the future development of unified multimodal models and advances research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
