MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Abstract

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.


