LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
May 31, 2025
Authors: Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng
cs.AI
Abstract
Real-world embodied agents face long-horizon tasks, characterized by high-level goals that demand multi-step solutions beyond single actions. Successfully completing these tasks requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision-language-action (VLA) models and hierarchical architectures show promise on embodied tasks, the former often falter in planning and the latter can suffer from coordination issues, both of which hamper performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision-language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes better generalization across tasks. Additionally, LoHoVLA adopts a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator containing 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.
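
The abstract describes two concrete pieces: the structure of a LoHoSet demonstration (visual observation, language goal, sub-task, robot action) and the hierarchical closed-loop control scheme, in which a single VLM backbone alternately plans sub-tasks in language and predicts low-level actions, replanning from a fresh observation at every step. The sketch below is a minimal illustration of that control pattern under stated assumptions; every name in it (Step, plan_subtask, predict_action, execute) is a hypothetical stand-in invented for this example, not the authors' actual model API or the Ravens interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One timestep in a LoHoSet-style demonstration: a visual
    observation, the language goal, the current sub-task, and the
    robot action taken. Field names and types are illustrative."""
    observation: bytes        # e.g., an encoded RGB frame
    goal: str                 # high-level language goal
    subtask: str              # current sub-task, in language
    action: List[float]       # low-level robot action vector


def plan_subtask(observation: bytes, goal: str) -> Optional[str]:
    """Hypothetical high-level head: the shared VLM backbone would emit
    language tokens naming the next sub-task; stubbed here. Returning
    None stands in for the model judging the goal complete."""
    return "pick up the red block"  # placeholder model output


def predict_action(observation: bytes, subtask: str) -> List[float]:
    """Hypothetical low-level head: the same backbone would emit action
    tokens decoded into a robot command; stubbed as a 7-DoF zero vector."""
    return [0.0] * 7


def execute(action: List[float]) -> bytes:
    """Stub for stepping the environment: send the action to the robot
    or simulator and read back the next observation."""
    return b""


def run_episode(goal: str, max_steps: int = 50) -> List[Step]:
    """Hierarchical closed loop: replan the sub-task from the latest
    observation before every action, so errors made at either the
    planning or the control level can be corrected on later steps."""
    observation = b""  # initial observation from the environment
    trajectory: List[Step] = []
    for _ in range(max_steps):
        subtask = plan_subtask(observation, goal)      # high-level replanning
        if subtask is None:                            # goal judged complete
            break
        action = predict_action(observation, subtask)  # low-level control
        trajectory.append(Step(observation, goal, subtask, action))
        observation = execute(action)                  # environment feedback
    return trajectory


if __name__ == "__main__":
    # With these stubs the goal is never judged complete, so the loop
    # runs for all max_steps iterations.
    print(len(run_episode("stack all the blocks into a tower")))
```

The design point the abstract emphasizes is captured by the call to plan_subtask inside the loop: the plan is recomputed from environment feedback each step rather than fixed up front, which is what lets the closed loop absorb errors from both levels.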