Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Abstract

Multimodal large language models (MLLMs) have recently made significant progress on visual tasks, including semantic scene understanding and text-image alignment, and their reasoning variants improve performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks that require fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap comprises high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that assesses both answer correctness and answer quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.
