
This&That: Language-Gesture Controlled Video Generation for Robot Planning

July 8, 2024
作者: Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park
cs.AI

Abstract

We propose a robot learning method for communicating, planning, and executing a wide range of tasks, dubbed This&That. We achieve robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual planning into robot actions. We propose language-gesture conditioning to generate videos, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then suggest a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness in addressing the above three challenges, and justifies the use of video generation as an intermediate representation for generalizable task planning and execution. Project website: https://cfeng16.github.io/this-and-that/.
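The abstract describes a two-stage pipeline: a language-gesture conditioned video model produces a visual plan, which a behavioral-cloning policy then translates into robot actions. The sketch below illustrates that interface only; the class and function names, the gesture representation (pixel coordinates for "this" and "that"), and the stub model are all assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LanguageGestureCondition:
    """Joint conditioning signal: a short instruction plus 2D gesture points
    on the first frame (e.g. "this" = object to move, "that" = target spot)."""
    instruction: str                   # e.g. "put this there"
    gesture_px: List[Tuple[int, int]]  # pixel coordinates of the deictic gestures

def generate_video_plan(first_frame, cond: LanguageGestureCondition,
                        num_frames: int = 16):
    """Stand-in for a language-gesture conditioned video generative model.
    A real model would synthesize future frames; here we repeat the first
    frame to keep the sketch self-contained."""
    return [first_frame for _ in range(num_frames)]

def execute_with_behavior_cloning(video_plan, policy):
    """Feed each planned frame to a behavioral-cloning policy that maps
    (current observation, planned goal frame) -> low-level action."""
    observation = video_plan[0]
    return [policy(observation, frame) for frame in video_plan]

# Toy usage with stub inputs.
frame0 = [[0.0]]  # stand-in for an RGB image
cond = LanguageGestureCondition("put this there",
                                gesture_px=[(120, 80), (300, 210)])
plan = generate_video_plan(frame0, cond)
actions = execute_with_behavior_cloning(plan, policy=lambda obs, goal: "move")
```

The point of the design, per the abstract, is that the gesture points disambiguate referents ("this", "there") that language alone leaves uncertain in cluttered scenes, while the video plan serves as a model-agnostic intermediate representation between instruction and action.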
