Can Large Vision Language Models Read Maps Like a Human?

March 18, 2025
Authors: Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, Zhengzhong Tu
cs.AI

Abstract

In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based outdoor map navigation, curated from complex pathfinding scenarios. MapBench comprises over 1,600 pixel-space map pathfinding problems drawn from 100 diverse maps. In MapBench, large vision language models (LVLMs) generate language-based navigation instructions given a map image and a query containing beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure for converting between natural language and map space and for evaluating LVLM-generated results. We demonstrate that MapBench poses a significant challenge to state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all code and the dataset at https://github.com/taco-group/MapBench.
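
The abstract describes the MSSG only at a high level. As a rough illustration of the idea, the minimal Python sketch below models a map as a graph of named landmarks and checks whether a route parsed from an LVLM's navigation instruction only steps between connected landmarks. Everything here (the ToyMSSG class, its methods, and the example landmarks) is a hypothetical stand-in, not the paper's actual data structure or evaluation code; see the repository for the real implementation.

# Hypothetical sketch of an MSSG-style index; not the MapBench code.
from collections import defaultdict

class ToyMSSG:
    """Toy scene graph: nodes are landmarks, edges are traversable segments."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, a: str, b: str) -> None:
        # Undirected edge: the segment is walkable in both directions.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def is_valid_route(self, landmarks: list[str]) -> bool:
        # A route is valid if every consecutive pair of landmarks
        # (e.g., extracted from an LVLM's instruction) is connected.
        return all(b in self.adj[a] for a, b in zip(landmarks, landmarks[1:]))

# Usage: grade a model's route from "Fountain" to "Library".
g = ToyMSSG()
g.add_edge("Fountain", "Rose Garden")
g.add_edge("Rose Garden", "Library")
g.add_edge("Fountain", "Parking Lot")

route = ["Fountain", "Rose Garden", "Library"]  # parsed from model output
print(g.is_valid_route(route))  # True

In this toy form, the graph serves both roles the abstract assigns to the MSSG: landmark names anchor the natural-language query to map space, and the edge structure gives a ground truth against which a generated route can be scored.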
