Can Large Vision Language Models Read Maps Like a Human?
March 18, 2025
Authors: Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, Zhengzhong Tu
cs.AI
Abstract
In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based outdoor map navigation, curated from complex path-finding scenarios. MapBench comprises over 1,600 pixel-space map path-finding problems drawn from 100 diverse maps. In MapBench, large vision language models (LVLMs) generate language-based navigation instructions given a map image and a query containing beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure for converting between natural language and map space and for evaluating LVLM-generated results. We demonstrate that MapBench poses a significant challenge to state-of-the-art LVLMs, both under zero-shot prompting and under a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and the dataset at https://github.com/taco-group/MapBench.
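To make the role of the MSSG concrete, here is a minimal sketch of what a landmark-indexed scene graph for a map could look like: nodes grounded to pixel coordinates, edges for traversable connections, and a route check that could support evaluating model-generated directions. The class and field names are illustrative assumptions, not the paper's actual schema.

```python
# A minimal, hypothetical sketch of a Map Space Scene Graph (MSSG):
# landmarks as nodes tied to pixel coordinates on the map image,
# traversable connections as undirected edges.
from dataclasses import dataclass, field


@dataclass
class Landmark:
    name: str                  # natural-language landmark name, e.g. "Main Gate"
    pixel_xy: tuple[int, int]  # location on the rendered map image


@dataclass
class MapSpaceSceneGraph:
    landmarks: dict[str, Landmark] = field(default_factory=dict)
    edges: dict[str, set[str]] = field(default_factory=dict)  # adjacency by name

    def add_landmark(self, lm: Landmark) -> None:
        self.landmarks[lm.name] = lm
        self.edges.setdefault(lm.name, set())

    def connect(self, a: str, b: str) -> None:
        # Record an undirected traversable connection between two landmarks.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def is_valid_route(self, route: list[str]) -> bool:
        # Check that a proposed landmark sequence only follows graph edges --
        # one way an indexing structure could score generated directions.
        return all(route[i + 1] in self.edges.get(route[i], set())
                   for i in range(len(route) - 1))


if __name__ == "__main__":
    g = MapSpaceSceneGraph()
    for lm in (Landmark("Main Gate", (120, 430)), Landmark("Library", (305, 210))):
        g.add_landmark(lm)
    g.connect("Main Gate", "Library")
    print(g.is_valid_route(["Main Gate", "Library"]))  # True
```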
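The abstract contrasts zero-shot prompting with a CoT-augmented framework that decomposes navigation into sequential cognitive steps. The sketch below shows one plausible way such prompts could differ; the wording and step list are assumptions, not the benchmark's actual prompts.

```python
# Hypothetical prompt builders for the two evaluation setups the abstract
# contrasts. The phrasing and decomposition steps are illustrative only.
def zero_shot_prompt(start: str, end: str) -> str:
    return (f"Given the attached map image, give turn-by-turn walking "
            f"directions from '{start}' to '{end}'.")


def cot_prompt(start: str, end: str) -> str:
    steps = [
        "1. Locate the start and end landmarks on the map.",
        "2. Identify the roads or paths connecting them.",
        "3. Choose a route and note each turn and intermediate landmark.",
        "4. Write the route as turn-by-turn directions.",
    ]
    return ("Given the attached map image, reason step by step:\n"
            + "\n".join(steps)
            + f"\nThen give directions from '{start}' to '{end}'.")
```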