LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

May 31, 2025
Authors: Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng
cs.AI

Abstract

Real-world embodied agents face long-horizon tasks, characterized by high-level goals that demand multi-step solutions beyond single actions. Successfully navigating these tasks requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision-language-action (VLA) models and hierarchical architectures show promise on embodied tasks, the former often falter in planning and the latter can suffer from coordination issues, both of which hamper performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision-language model (VLM) as the backbone to jointly generate language tokens and action tokens for sub-task generation and robot action prediction, respectively; this shared representation promotes better generalization across tasks. Additionally, LoHoVLA adopts a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator containing 20 long-horizon tasks, each with 1,000 expert demonstrations comprising visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.
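The abstract gives no implementation details, but its two key ideas, a single backbone that emits both language tokens (sub-task plans) and action tokens (robot commands), and a hierarchical closed loop that re-plans from fresh observations, can be illustrated with a minimal sketch. Everything below is a hypothetical illustration under assumed interfaces: the class names, the plan_subtask/predict_action split, and the toy environment are not LoHoVLA's actual API.

```python
# Minimal sketch of a unified VLA model with hierarchical closed-loop control.
# All names here are illustrative assumptions, not the paper's real interfaces.
from dataclasses import dataclass


@dataclass
class Observation:
    image: object  # visual observation (e.g., an RGB array)
    goal: str      # high-level language goal


class UnifiedVLA:
    """Stand-in for one shared VLM backbone that decodes both language
    tokens (sub-tasks) and action tokens (robot actions)."""

    def plan_subtask(self, obs: Observation, done: list) -> str:
        # High-level planning: decode language tokens into the next sub-task.
        return f"step {len(done) + 1} toward: {obs.goal}"

    def predict_action(self, obs: Observation, subtask: str) -> dict:
        # Low-level control: decode action tokens from the same backbone and
        # de-tokenize them into a continuous end-effector command.
        return {"pose": (0.0, 0.0, 0.1), "gripper": "close"}


class DummyEnv:
    """Toy stand-in for a Ravens-style environment; the goal is reached
    after three sub-tasks succeed."""

    def reset(self) -> Observation:
        self.completed = 0
        return Observation(image=None, goal="stack the blocks in a pyramid")

    def step(self, action: dict) -> Observation:
        self.completed += 1  # pretend each action completes its sub-task
        return Observation(image=None, goal="stack the blocks in a pyramid")

    def goal_reached(self) -> bool:
        return self.completed >= 3


def run_episode(model: UnifiedVLA, env: DummyEnv, max_subtasks: int = 10) -> bool:
    obs, done = env.reset(), []
    for _ in range(max_subtasks):
        # Hierarchical closed loop: re-plan from the *current* observation,
        # so errors in planning or control are corrected rather than compounded.
        subtask = model.plan_subtask(obs, done)
        obs = env.step(model.predict_action(obs, subtask))
        done.append(subtask)
        if env.goal_reached():
            return True
    return False


if __name__ == "__main__":
    print("success:", run_episode(UnifiedVLA(), DummyEnv()))
```

The design point the sketch tries to capture is that planning and acting share one model and one representation, rather than a separate planner handing fixed sub-goals to a separate controller, which is where the coordination issues of hierarchical pipelines tend to arise.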