Can Large Vision Language Models Read Maps Like a Human?

Abstract

In this paper, we introduce MapBench, the first dataset specifically designed for outdoor navigation on human-readable, pixel-based maps, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems drawn from 100 diverse maps. In MapBench, an LVLM (large vision language model) is given a map image and a query containing start and end landmarks, and generates language-based navigation instructions. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure, used to convert to and from natural language and to evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all code and the dataset at https://github.com/taco-group/MapBench.
