Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Abstract

Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization, which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy on ground-truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average success rate of 87%, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: https://robot-see-robot-do.github.io
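The analysis-by-synthesis idea in the abstract (iteratively adjusting part motion until a synthesized observation matches the video, under a regularizer) can be illustrated with a deliberately small toy. The sketch below is not 4D-DPM: it assumes a single 2D part rotating about a hinge, uses noisy 2D point observations in place of rendered feature fields, and substitutes a simple temporal smoothness penalty for the geometric regularizers. The names `rotate` and `track_part_angles` and all parameter values are hypothetical, chosen only for illustration.

```python
import math
import random


def rotate(points, theta):
    """'Synthesize' an observation: rotate canonical 2D points by theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def track_part_angles(canonical, observed, smooth_weight=0.1, lr=0.2, steps=300):
    """Recover one hinge angle per frame by gradient descent (toy assumption).

    Loss = mean squared point error per frame (data term)
         + smooth_weight * sum of squared differences of consecutive angles.
    """
    num_frames = len(observed)
    thetas = [0.0] * num_frames
    for _ in range(steps):
        grads = []
        for theta, frame in zip(thetas, observed):
            c, s = math.cos(theta), math.sin(theta)
            total = 0.0
            for (x, y), (ox, oy) in zip(canonical, frame):
                rx, ry = c * x - s * y, s * x + c * y    # synthesized point
                dx, dy = -s * x - c * y, c * x - s * y   # its derivative w.r.t. theta
                total += (rx - ox) * dx + (ry - oy) * dy
            grads.append(2.0 * total / len(canonical))
        # Gradient of the temporal smoothness regularizer.
        for t in range(num_frames - 1):
            diff = thetas[t + 1] - thetas[t]
            grads[t] -= 2.0 * smooth_weight * diff
            grads[t + 1] += 2.0 * smooth_weight * diff
        thetas = [theta - lr * g for theta, g in zip(thetas, grads)]
    return thetas


# Simulated demonstration: a thin rectangular part opening from 0 to 1.2 rad.
random.seed(0)
canonical = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.2), (0.0, 0.2)]
true_thetas = [1.2 * t / 19 for t in range(20)]
observed = [
    [(x + random.gauss(0.0, 0.005), y + random.gauss(0.0, 0.005))
     for x, y in rotate(canonical, th)]
    for th in true_thetas
]

estimated = track_part_angles(canonical, observed)
max_error = max(abs(e - t) for e, t in zip(estimated, true_thetas))
print(f"max angle error: {max_error:.4f} rad")
```

In this toy the smoothness term plays the role the abstract assigns to regularization: it couples neighboring frames so that noise in a single frame has less influence on the recovered trajectory.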
