From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
June 11, 2025
Authors: Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng
cs.AI
Abstract
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable because they lack language instructions, while emerging VLA benchmarks that incorporate language often offer only a limited set of evaluation tasks and are not designed to investigate how much VLM pretraining truly contributes to the generalization capability of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier to reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/.
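
As a rough illustration of how such a probing suite might be consumed, the sketch below shows a minimal, hypothetical evaluation loop that rolls a VLA policy out on simulated tasks grouped by subcategory and reports per-subcategory success rates. All names here (VLAPolicy, ProbingTask, evaluate) are illustrative placeholders, not the actual INT-ACT API; see the repository linked above for the real task suite and evaluation code.

# Hypothetical sketch of a probing-suite evaluation loop; not the INT-ACT API.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


class VLAPolicy:
    """Placeholder for a vision-language-action policy under evaluation."""

    def act(self, observation, instruction: str):
        """Map an observation and a language instruction to a robot action."""
        raise NotImplementedError


@dataclass
class ProbingTask:
    name: str          # e.g. a single language-conditioned manipulation task
    subcategory: str   # e.g. a language, vision, or object probing category
    # Runs one simulated episode with the given policy and returns success.
    rollout: Callable[[VLAPolicy], bool]


def evaluate(policy: VLAPolicy,
             tasks: List[ProbingTask],
             episodes_per_task: int = 10) -> Dict[str, float]:
    """Return per-subcategory success rates for a policy on the task suite."""
    successes: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for task in tasks:
        for _ in range(episodes_per_task):
            successes[task.subcategory] += int(task.rollout(policy))
            totals[task.subcategory] += 1
    return {cat: successes[cat] / totals[cat] for cat in totals}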