Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction
September 26, 2024
Authors: Justin Kerr, Chung Min Kim, Mingxuan Wu, Brent Yi, Qianqian Wang, Ken Goldberg, Angjoo Kanazawa
cs.AI
Abstract
Humans can learn to manipulate new objects by simply watching others;
providing robots with the ability to learn from such demonstrations would
enable a natural interface for specifying new behaviors. This work develops Robot
See Robot Do (RSRD), a method for imitating articulated object manipulation
from a single monocular RGB human demonstration given a single static
multi-view object scan. We first propose 4D Differentiable Part Models
(4D-DPM), a method for recovering 3D part motion from a monocular video with
differentiable rendering. This analysis-by-synthesis approach uses part-centric
feature fields in an iterative optimization, which enables the use of geometric
regularizers to recover 3D motions from only a single video. Given this 4D
reconstruction, the robot replicates object trajectories by planning bimanual
arm motions that induce the demonstrated object part motion. By representing
demonstrations as part-centric trajectories, RSRD focuses on replicating the
demonstration's intended behavior while considering the robot's own
morphological limits, rather than attempting to reproduce the hand's motion. We
evaluate 4D-DPM's 3D tracking accuracy on ground-truth-annotated 3D part
trajectories and RSRD's physical execution performance on 9 objects across 10
trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average
success rate of 87%, for a total end-to-end success rate of 60% across 90 trials.
Notably, this is accomplished using only feature fields distilled from large
pretrained vision models -- without any task-specific training, fine-tuning,
dataset collection, or annotation. Project page:
https://robot-see-robot-do.github.io
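For intuition, here is a minimal sketch of the analysis-by-synthesis tracking loop that 4D-DPM describes: per-part rigid poses are optimized by gradient descent so that differentiably rendered part-centric feature fields match features extracted from each video frame, subject to a geometric regularizer. This is not the authors' code: `render_part_features` and `extract_frame_features` are hypothetical stand-ins for a differentiable feature-field renderer and a pretrained vision backbone, and the temporal-smoothness term is an assumed placeholder for the paper's geometric regularizers.

```python
# Hedged sketch of a 4D-DPM-style analysis-by-synthesis tracker (PyTorch).
import torch

def track_part_poses(frames, init_poses, render_part_features,
                     extract_frame_features, n_steps=50, lr=5e-3,
                     reg_weight=0.1):
    """Recover per-frame part poses from a monocular video.

    frames:     list of RGB frames (length T)
    init_poses: (P, 6) tensor of per-part se(3) parameters at frame 0
    Returns a list of T detached (P, 6) pose tensors.
    """
    poses_per_frame = []
    prev = init_poses.detach().clone()
    for frame in frames:
        target = extract_frame_features(frame).detach()   # (H, W, C) observed features
        pose = prev.clone().requires_grad_(True)          # warm-start from previous frame
        opt = torch.optim.Adam([pose], lr=lr)
        for _ in range(n_steps):
            opt.zero_grad()
            rendered = render_part_features(pose)         # differentiable (H, W, C) render
            feat_loss = (rendered - target).square().mean()  # feature-matching term
            # Geometric regularizer: penalize large per-part pose jumps between
            # frames (an assumed smoothness term; the paper's regularizers may differ).
            reg = (pose - prev).square().mean()
            (feat_loss + reg_weight * reg).backward()
            opt.step()
        prev = pose.detach()
        poses_per_frame.append(prev.clone())
    return poses_per_frame
```

Warm-starting each frame from the previous frame's solution is what keeps a per-frame gradient-descent tracker stable; the feature-matching loss alone has many local minima.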
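Similarly, a minimal sketch of how a recovered part trajectory could drive execution: if a gripper holds the part at a fixed grasp pose expressed in the part frame, composing that grasp with the per-frame part poses yields end-effector waypoints that induce the demonstrated part motion. `T_part_to_grasp` is a hypothetical input; the paper's bimanual planner additionally handles grasp selection and the robot's morphological limits.

```python
import numpy as np

def part_traj_to_ee_waypoints(part_poses_world, T_part_to_grasp):
    """Compose per-frame world-frame part poses T_w_part(t) (4x4 rigid
    transforms) with a fixed part-frame grasp pose:

        T_w_ee(t) = T_w_part(t) @ T_part_to_grasp

    so that a gripper holding the part replays the demonstrated trajectory.
    """
    return [T_w_part @ T_part_to_grasp for T_w_part in part_poses_world]

# Example: a part translating 1 cm along x each frame, grasped at its origin.
traj = []
for t in range(5):
    T = np.eye(4)
    T[0, 3] = 0.01 * t
    traj.append(T)
waypoints = part_traj_to_ee_waypoints(traj, np.eye(4))
```

Targeting part poses rather than hand poses is the design choice the abstract highlights: it frees the robot to pick grasps and arm motions suited to its own kinematics instead of mimicking the human hand.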