Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
May 24, 2025
Authors: Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang
cs.AI
Abstract
Multimodal large language models (MLLMs) have recently achieved significant
progress in visual tasks, including semantic scene understanding and text-image
alignment, with reasoning variants enhancing performance on complex tasks
involving mathematics and logic. However, their capacity for reasoning tasks
involving fine-grained visual understanding remains insufficiently evaluated.
To address this gap, we introduce ReasonMap, a benchmark designed to assess the
fine-grained visual understanding and spatial reasoning abilities of MLLMs.
ReasonMap encompasses high-resolution transit maps from 30 cities across 13
countries and includes 1,008 question-answer pairs spanning two question types
and three templates. Furthermore, we design a two-level evaluation pipeline
that properly assesses answer correctness and quality. Comprehensive
evaluations of 15 popular MLLMs, including both base and reasoning variants,
reveal a counterintuitive pattern: among open-source models, base models
outperform reasoning ones, while the opposite trend is observed in
closed-source models. Additionally, performance generally degrades when visual
inputs are masked, indicating that while MLLMs can leverage prior knowledge to
answer some questions, fine-grained visual reasoning tasks still require
genuine visual perception for strong performance. Our benchmark study offers
new insights into visual reasoning and contributes to investigating the gap
between open-source and closed-source models.
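For concreteness, the sketch below illustrates what a two-level evaluation of route answers could look like: level one checks correctness (the route starts and ends at the right stations and uses only hops that exist on the map), and level two scores quality among correct routes (shorter paths, fewer transfers). This is a minimal, hypothetical sketch, not the paper's actual pipeline; all names (`RouteAnswer`, `is_correct`, `quality_score`, `valid_edges`, `shortest_hops`) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class RouteAnswer:
    lines_taken: list[str]   # transit lines in boarding order, e.g. ["Line 2", "Line 10"]
    stations: list[str]      # stations traversed, origin first

def is_correct(pred: RouteAnswer, origin: str, destination: str,
               valid_edges: set[tuple[str, str]]) -> bool:
    """Level 1: accept only routes that start at the origin, end at the
    destination, and use station-to-station hops present on the map."""
    if not pred.stations or pred.stations[0] != origin or pred.stations[-1] != destination:
        return False
    return all((a, b) in valid_edges
               for a, b in zip(pred.stations, pred.stations[1:]))

def quality_score(pred: RouteAnswer, shortest_hops: int) -> float:
    """Level 2: among correct routes, reward shorter paths and fewer transfers.
    (Hypothetical scoring; the benchmark's actual criteria may differ.)"""
    hops = max(len(pred.stations) - 1, 1)
    length_ratio = shortest_hops / hops                     # 1.0 when optimal
    transfer_penalty = 0.1 * max(len(pred.lines_taken) - 1, 0)
    return max(length_ratio - transfer_penalty, 0.0)
```

In such a design, a grader would call `is_correct` first and compute `quality_score` only for routes that pass, so a high-quality score can never rescue an invalid answer.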