Video Language Planning

October 16, 2023
Authors: Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, Jonathan Tompson
cs.AI

Abstract

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and the current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications describing how to complete the final task. VLP scales with increasing computation budget, with more computation time yielding improved video plans, and can synthesize long-horizon video plans across different robotics domains: from multi-object rearrangement to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods, on both simulated and real robots (across 3 hardware platforms).
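The abstract outlines VLP's core loop: a tree search that interleaves a vision-language model acting as a policy (proposing language subgoals), a text-to-video model acting as a dynamics model (rolling each subgoal out into a short clip), and a vision-language model acting as a value function (scoring progress and pruning branches). Below is a minimal Python sketch of that loop under stated assumptions; the model interfaces (vlm_policy, text_to_video, vlm_value) and the branch/depth parameters are hypothetical placeholders for illustration, not the authors' actual API.

```python
from typing import Callable, List, Tuple

# Hypothetical model interfaces (illustrative placeholders, not the paper's API):
#   vlm_policy(instruction, obs, n)  -> list of n candidate language subgoals
#   text_to_video(subgoal, obs)      -> list of frames rolling out the subgoal
#   vlm_value(instruction, obs)      -> scalar estimate of task progress

def video_language_plan(
    instruction: str,
    image,
    vlm_policy: Callable,
    text_to_video: Callable,
    vlm_value: Callable,
    branch: int = 4,   # subgoals sampled per node (also used as beam width here)
    depth: int = 3,    # number of subgoal steps in the final plan
) -> Tuple[List, List[str]]:
    """Beam-style tree search over (video, language) plans, as sketched in the
    abstract: expand with the policy and dynamics model, prune with the value
    function, and return the highest-value complete plan."""
    # Each beam entry: (frames so far, language steps so far, last frame)
    beams = [([image], [], image)]
    for _ in range(depth):
        candidates = []
        for frames, steps, obs in beams:
            # (i) VLM as policy: propose several candidate next subgoals.
            for subgoal in vlm_policy(instruction, obs, branch):
                # (ii) text-to-video model as dynamics: synthesize a short
                # clip showing the subgoal being carried out from `obs`.
                clip = text_to_video(subgoal, obs)
                candidates.append((frames + clip, steps + [subgoal], clip[-1]))
        # VLM as value function: keep only the branches whose final frame
        # makes the most progress toward completing the full instruction.
        candidates.sort(key=lambda c: vlm_value(instruction, c[2]), reverse=True)
        beams = candidates[:branch]
    frames, steps, _ = beams[0]
    return frames, steps
```

Per the abstract, the returned frames would then be consumed pairwise by a goal-conditioned policy that outputs low-level robot actions for each intermediate frame of the plan; that execution stage is omitted from the sketch above.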