From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

June 11, 2025
Authors: Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng
cs.AI

Abstract

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/
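The abstract distinguishes "intention" (coherent high-level behavior) from "execution" (precise motor success) and scores policies per perturbation subcategory. The snippet below is a minimal, hypothetical sketch of how such a probing suite could be aggregated; it is not the INT-ACT code, and every name in it (ProbeTask, run_episode, the dummy policy) is invented for illustration only.

```python
"""Hypothetical sketch: scoring a probing suite of simulation tasks
per subcategory, logging intention and execution success separately.
Not the authors' implementation; all names are illustrative."""
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ProbeTask:
    name: str          # e.g. "put the carrot on the plate"
    subcategory: str   # e.g. "unseen instruction", "unseen object"


def run_episode(task: ProbeTask, policy) -> dict:
    # A real harness would roll the policy out in simulation; here we
    # only fake the two outcomes the paper separates: did the policy
    # approach the right target (intention) and finish the task (execution)?
    intended = policy(task) > 0.2
    executed = intended and policy(task) > 0.6
    return {"intention": intended, "execution": executed}


def evaluate(tasks, policy, episodes: int = 10) -> dict:
    # Aggregate success rates per subcategory to expose gaps between
    # intention and execution.
    scores = defaultdict(lambda: {"intention": 0, "execution": 0, "n": 0})
    for task in tasks:
        for _ in range(episodes):
            result = run_episode(task, policy)
            agg = scores[task.subcategory]
            agg["intention"] += result["intention"]
            agg["execution"] += result["execution"]
            agg["n"] += 1
    return {sub: {m: agg[m] / agg["n"] for m in ("intention", "execution")}
            for sub, agg in scores.items()}


if __name__ == "__main__":
    suite = [ProbeTask("put the carrot on the plate", "unseen instruction"),
             ProbeTask("stack the green cube", "unseen object")]
    dummy_policy = lambda task: random.random()  # stand-in for a VLA policy
    print(evaluate(suite, dummy_policy))
```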