
Video Language Planning

October 16, 2023
Authors: Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, Jonathan Tompson
cs.AI

Abstract

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications describing how to complete the final task. VLP scales with increasing computation budget, where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains: from multi-object rearrangement to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
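The abstract compresses the full algorithm, so the following Python sketch illustrates one way the described tree search could be organized: a VLM proposes language subgoals (policy), a text-to-video model rolls each subgoal forward into frames (dynamics), and the VLM scores the extended plan (value function). All interfaces here (`vlm.propose_subgoals`, `vlm.score`, `video_model.rollout`) and the beam-search parameters are hypothetical stand-ins for the paper's components, not the authors' released implementation.

```python
# Hypothetical sketch of VLP-style tree search over video rollouts.
# All class and method names are illustrative assumptions, not the
# authors' actual API.

from dataclasses import dataclass


@dataclass
class Branch:
    frames: list          # accumulated video frames (the plan so far)
    value: float = 0.0    # VLM-estimated progress toward the task goal


def video_language_plan(instruction, image, vlm, video_model,
                        depth=5, branch_factor=4, beam_width=2):
    """Grow a tree of video rollouts; keep the highest-value beam.

    Assumed interfaces:
      vlm.propose_subgoals(instruction, frame, k) -> k subgoal strings
          (VLM acting as a policy over language subgoals)
      vlm.score(instruction, frames) -> float
          (VLM acting as a value function on a partial video plan)
      video_model.rollout(subgoal, frame) -> list of frames
          (text-to-video model acting as a dynamics model)
    """
    beam = [Branch(frames=[image])]
    for _ in range(depth):
        candidates = []
        for branch in beam:
            last_frame = branch.frames[-1]
            # Policy step: sample language subgoals from the current frame.
            for subgoal in vlm.propose_subgoals(instruction, last_frame,
                                                k=branch_factor):
                # Dynamics step: synthesize a short clip for the subgoal.
                clip = video_model.rollout(subgoal, last_frame)
                frames = branch.frames + clip
                # Value step: score the extended plan against the task.
                candidates.append(
                    Branch(frames, vlm.score(instruction, frames)))
        # Keep the best partial plans; larger depth, branch_factor, and
        # beam_width spend more compute for better plans, matching the
        # scaling behavior the abstract describes.
        beam = sorted(candidates, key=lambda b: b.value,
                      reverse=True)[:beam_width]
    return max(beam, key=lambda b: b.value).frames
```

The returned frame sequence corresponds to the "long video plan"; per the abstract, a separate goal-conditioned policy would then convert each intermediate frame into robot actions.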