GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Abstract

Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, so they offer only a limited assessment of these capabilities under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, ranging from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations and large performance gaps under implicit, knowledge-intensive editing settings. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, these results pinpoint key directions for the future development of unified multimodal models and advance research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
