ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Abstract

A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations remain limited in both paradigm and coverage, which hinders rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution and frame both as generative problems. This stands in stark contrast to the predominant discriminative evaluations (e.g., visual question answering), which rely on distractors and discard execution. The decomposition also enables fine-grained analysis that goes beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE at both the instruction level and the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide an in-depth analysis of their spatial reasoning behaviors.
