This&That: Language-Gesture Controlled Video Generation for Robot Planning

July 8, 2024
Authors: Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park
cs.AI

Abstract

We propose a robot learning method for communicating, planning, and executing a wide range of tasks, dubbed This&That. We achieve robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual planning into robot actions. We propose language-gesture conditioning to generate videos, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then suggest a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness in addressing the above three challenges, and justifies the use of video generation as an intermediate representation for generalizable task planning and execution. Project website: https://cfeng16.github.io/this-and-that/.
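The abstract names two technical components: language-gesture conditioning for video generation and a behavior-cloning design that consumes the generated video plans. Below is a minimal, hypothetical sketch of how such interfaces might be wired together; it is not the authors' code, and every module name, dimension, and fusion choice is an assumption made only for illustration.

```python
# Hypothetical sketch (not the paper's implementation): fusing language and
# gesture ("this"/"that" clicks) into a conditioning vector, and a simple
# behavior-cloning head that consumes a generated video plan.
import torch
import torch.nn as nn


class LanguageGestureConditioner(nn.Module):
    """Fuses a text embedding with 2D gesture points clicked on the first
    frame into one conditioning vector for a video generation model."""

    def __init__(self, text_dim=512, gesture_points=2, cond_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        # Each gesture point is an (x, y) coordinate normalized to [0, 1].
        self.gesture_proj = nn.Linear(gesture_points * 2, cond_dim)
        self.fuse = nn.Linear(2 * cond_dim, cond_dim)

    def forward(self, text_emb, gestures):
        t = self.text_proj(text_emb)                 # (B, cond_dim)
        g = self.gesture_proj(gestures.flatten(1))   # (B, cond_dim)
        return self.fuse(torch.cat([t, g], dim=-1))  # (B, cond_dim)


class VideoConditionedPolicy(nn.Module):
    """Behavior-cloning head mapping the current observation plus features
    of a few generated future frames to a robot action."""

    def __init__(self, obs_dim=128, frame_dim=128, n_frames=4, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_frames * frame_dim, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs_feat, plan_feats):
        x = torch.cat([obs_feat, plan_feats.flatten(1)], dim=-1)
        return self.net(x)


# Toy usage with random tensors standing in for real encoders.
cond = LanguageGestureConditioner()
policy = VideoConditionedPolicy()
text_emb = torch.randn(1, 512)       # e.g., a CLIP-style text embedding
gestures = torch.rand(1, 2, 2)       # two clicks: "this" and "that"
c = cond(text_emb, gestures)         # would condition the video generator
plan = torch.randn(1, 4, 128)        # features of 4 generated plan frames
action = policy(torch.randn(1, 128), plan)
print(c.shape, action.shape)         # (1, 256) and (1, 7)
```

The design intuition matches the abstract's claim: a couple of gesture points disambiguate "which object" and "where" far more cheaply than a longer language prompt, while the policy treats the generated video purely as an intermediate plan representation.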
