
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

May 24, 2025
Authors: Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang
cs.AI

Abstract

Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.
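To make the two-level idea concrete, the sketch below shows one hypothetical way such a pipeline could be structured for transit-route answers: a first level that checks whether the predicted route is correct at all, and a second level that scores the quality of correct answers. The function name, route representation, and scoring rule are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
# Hypothetical two-level evaluation sketch (NOT the paper's implementation).
# A route is modeled as an ordered list of stop names.

def evaluate(pred_route, gold_route):
    """Return (correct, quality) for a predicted transit route.

    Level 1 (correctness): the predicted route must be non-empty and
    share the reference route's origin and destination.
    Level 2 (quality): among correct answers, shorter routes score
    higher, capped at 1.0 when the prediction is at least as short
    as the reference (illustrative scoring rule only).
    """
    correct = (
        bool(pred_route)
        and pred_route[0] == gold_route[0]
        and pred_route[-1] == gold_route[-1]
    )
    if not correct:
        return False, 0.0
    quality = min(1.0, len(gold_route) / len(pred_route))
    return True, quality
```

For example, a prediction that reaches the right destination via an extra stop would pass level 1 but receive a reduced level-2 score, while a prediction ending at the wrong station fails level 1 outright.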

