ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
March 13, 2026
Authors: Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng
cs.AI
Abstract
A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.