Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Abstract

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility, using both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Although the benchmark is still in its early stages, we are committed to continuously expanding and refining it to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.
