From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
June 11, 2025
Authors: Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng
cs.AI
Abstract
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable because they lack language instructions, while emerging VLA benchmarks that incorporate language often offer only a limited set of evaluation tasks and are not designed to investigate how much VLM pretraining truly contributes to the generalization capability of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier to reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/.
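
As a rough illustration of how such a probing suite might be consumed, the sketch below shows a minimal, hypothetical evaluation loop that rolls a VLA policy out on simulated tasks grouped by subcategory and reports per-subcategory success rates. All names here (VLAPolicy, ProbingTask, evaluate) are illustrative placeholders, not the actual INT-ACT API; see the repository linked above for the real task suite and evaluation code.

# Hypothetical sketch of a probing-suite evaluation loop; not the INT-ACT API.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


class VLAPolicy:
    """Placeholder for a vision-language-action policy under evaluation."""

    def act(self, observation, instruction: str):
        """Map an observation and a language instruction to a robot action."""
        raise NotImplementedError


@dataclass
class ProbingTask:
    name: str          # e.g. a single language-conditioned manipulation task
    subcategory: str   # e.g. a language, vision, or object probing category
    # Runs one simulated episode with the given policy and returns success.
    rollout: Callable[[VLAPolicy], bool]


def evaluate(policy: VLAPolicy,
             tasks: List[ProbingTask],
             episodes_per_task: int = 10) -> Dict[str, float]:
    """Return per-subcategory success rates for a policy on the task suite."""
    successes: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for task in tasks:
        for _ in range(episodes_per_task):
            successes[task.subcategory] += int(task.rollout(policy))
            totals[task.subcategory] += 1
    return {cat: successes[cat] / totals[cat] for cat in totals}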