
ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

March 13, 2026
Authors: Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng
cs.AI

Abstract

A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.
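The abstract's key design point is scoring each task in two generative stages, localization then execution, rather than via multiple-choice VQA. A minimal sketch of that stage-wise scoring, with purely illustrative names (this is not the benchmark's actual API), might look like:

```python
# Hypothetical sketch of ESPIRE-style generative evaluation. Names and
# structure are illustrative assumptions, not the benchmark's real code.
# Each task is scored in two stages -- localization (where to act) and
# execution (acting there in simulation) -- instead of by picking among
# multiple-choice distractors.

from dataclasses import dataclass

@dataclass
class TaskResult:
    localized: bool   # did the model generate the correct target region?
    executed: bool    # did the resulting action succeed in simulation?

def score(results):
    """Aggregate stage-wise success rates for fine-grained diagnosis."""
    n = len(results)
    loc = sum(r.localized for r in results) / n
    # execution success only counts when localization also succeeded,
    # so end-to-end success is conditioned on the first stage
    e2e = sum(r.localized and r.executed for r in results) / n
    return {"localization": loc, "end_to_end": e2e}

print(score([TaskResult(True, True),
             TaskResult(True, False),
             TaskResult(False, False)]))
```

Separating the two rates is what enables the "fine-grained analysis beyond passive spatial reasoning toward reasoning to act" mentioned above: a model may localize correctly yet fail to execute, and the decomposition makes that gap visible.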