Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
September 24, 2024
Authors: Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
cs.AI
Abstract
How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how we can leverage video generation models trained on easily available web data to enable generalization. Our approach, Gen2Act, casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act does not require fine-tuning the video model at all; we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/
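
To make the two-stage pipeline described in the abstract concrete, here is a minimal sketch of how generation and execution fit together. All class and function names below (`PretrainedHumanVideoGenerator`, `VideoConditionedPolicy`, `gen2act_step`) are hypothetical placeholders, not the authors' released API: a frozen, pre-trained human-video model is queried zero-shot from the scene image and language instruction, and a separately trained policy, conditioned only on the generated video, outputs robot actions.

```python
# Minimal sketch of a Gen2Act-style pipeline (hypothetical names, not the paper's code).
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GeneratedVideo:
    frames: List[np.ndarray]  # RGB frames of the generated human video


class PretrainedHumanVideoGenerator:
    """Frozen video generation model trained on web data; used zero-shot (no fine-tuning)."""

    def generate(self, scene_image: np.ndarray, task_instruction: str) -> GeneratedVideo:
        # Placeholder: a real model would synthesize a human demonstrating the task
        # in the given scene, conditioned on the language instruction.
        return GeneratedVideo(frames=[scene_image])


class VideoConditionedPolicy:
    """Policy trained on (comparatively little) robot data, conditioned on the generated video."""

    def act(self, video: GeneratedVideo, current_observation: np.ndarray) -> np.ndarray:
        # Placeholder: a real policy would predict a robot action
        # (e.g., an end-effector pose delta plus gripper command).
        return np.zeros(7)


def gen2act_step(scene_image: np.ndarray,
                 instruction: str,
                 observation: np.ndarray) -> np.ndarray:
    """One control step: generate a human video once, then execute conditioned on it."""
    video_model = PretrainedHumanVideoGenerator()  # pre-trained, used as-is
    policy = VideoConditionedPolicy()              # trained on ~10x less robot data
    human_video = video_model.generate(scene_image, instruction)
    return policy.act(human_video, observation)
```

In practice the video would be generated once per task and cached, with the policy queried at every control step; the sketch above collapses both into a single call only for brevity.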