LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

Abstract

Real-world embodied agents face long-horizon tasks, characterized by high-level goals that demand multi-step solutions rather than single actions. Completing these tasks requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision-language-action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both of which hamper performance. To overcome these limitations, we introduce LoHoVLA, a new unified VLA framework for long-horizon tasks. LoHoVLA uses a large pretrained vision-language model (VLM) as its backbone to jointly generate language tokens for sub-task generation and action tokens for robot action prediction. This shared representation promotes better generalization across tasks. Additionally, LoHoVLA adopts a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator that contains 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.
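
The abstract names a hierarchical closed-loop control mechanism but does not spell out its steps. The following is a minimal sketch of one plausible reading, not the authors' implementation: plan a sub-task, predict an action, execute it, re-observe, retry the action on failure, and re-plan once a retry budget is exhausted. Every name here (`ToyEnv`, `plan_sub_task`, `predict_action`, `run_closed_loop`, `max_retries`) is hypothetical, the retry threshold is an assumption, and the toy environment is a stand-in rather than the Ravens simulator.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulator: move blocks into a bowl."""
    remaining: List[str]
    fail_first_n: int = 1   # the first n executions fail, deterministically
    attempts: int = 0

    def observe(self) -> List[str]:
        return list(self.remaining)

    def execute(self, action: str) -> None:
        self.attempts += 1
        if self.attempts <= self.fail_first_n:
            return          # simulated low-level failure: nothing moves
        block = action.split(":", 1)[1]
        if block in self.remaining:
            self.remaining.remove(block)


def plan_sub_task(obs: List[str], goal: str) -> Optional[str]:
    """High-level step (assumed): choose the next sub-task, or None when done."""
    return f"put the {obs[0]} block in the bowl" if obs else None


def predict_action(obs: List[str], sub_task: str) -> str:
    """Low-level step (assumed): turn a sub-task into an executable action."""
    block = sub_task.split()[2]
    return f"pick_place:{block}"


def run_closed_loop(env: ToyEnv, goal: str, max_steps: int = 10,
                    max_retries: int = 2) -> List[Tuple[str, str, bool]]:
    """Re-observe after every action; retry, then re-plan after repeated failure."""
    log: List[Tuple[str, str, bool]] = []
    sub_task: Optional[str] = None
    retries = 0
    for _ in range(max_steps):
        obs = env.observe()
        if sub_task is None or retries > max_retries:
            sub_task, retries = plan_sub_task(obs, goal), 0   # plan or re-plan
        if sub_task is None:
            return log                                        # nothing left to do
        action = predict_action(obs, sub_task)
        env.execute(action)
        done = action.split(":", 1)[1] not in env.observe()   # check the outcome
        log.append((sub_task, action, done))
        if done:
            sub_task = None                                   # move to next sub-task
        else:
            retries += 1                                      # same sub-task, new action
    return log


env = ToyEnv(remaining=["red", "blue"])
log = run_closed_loop(env, "put all blocks in the bowl")
```

In this toy run the first execution fails, so the loop re-predicts an action for the same sub-task before moving on. In a unified model both `plan_sub_task` and `predict_action` would be calls into the same backbone, which is the contrast with hierarchical systems that use separate planner and controller models.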
