Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations
July 12, 2023
Authors: Moo Jin Kim, Jiajun Wu, Chelsea Finn
cs.AI
Abstract
Eye-in-hand cameras have shown promise in enabling greater sample efficiency
and generalization in vision-based robotic manipulation. However, for robotic
imitation, it is still expensive to have a human teleoperator collect large
amounts of expert demonstrations with a real robot. Videos of humans performing
tasks, on the other hand, are much cheaper to collect since they eliminate the
need for expertise in robotic teleoperation and can be quickly captured in a
wide range of scenarios. Therefore, human video demonstrations are a promising
data source for learning generalizable robotic manipulation policies at scale.
In this work, we augment narrow robotic imitation datasets with broad unlabeled
human video demonstrations to greatly enhance the generalization of eye-in-hand
visuomotor policies. Although a clear visual domain gap exists between human
and robot data, our framework does not need to employ any explicit domain
adaptation method, as we leverage the partial observability of eye-in-hand
cameras as well as a simple fixed image masking scheme. On a suite of eight
real-world tasks involving both 3-DoF and 6-DoF robot arm control, our method
improves the success rates of eye-in-hand manipulation policies by 58%
(absolute) on average, enabling robots to generalize to both new environment
configurations and new tasks that are unseen in the robot demonstration data.
See video results at https://giving-robots-a-hand.github.io/ .