Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Abstract

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility using both human judges and an LMM-as-a-judge approach. Our experiments show that although GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, which highlights an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. We are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluation of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.
