From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

Abstract

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable because they lack language instructions. Emerging VLA benchmarks that do incorporate language often offer only a limited set of evaluation tasks and are not designed to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robot policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier to reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as "good intentions", this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/.
