MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

Abstract

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce the MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. The MMMR comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, the MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
