Pre-training Auto-regressive Robotic Models with 4D Representations

Abstract

Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language processing and computer vision, exhibiting remarkable generalization capabilities and highlighting the importance of pre-training. Yet efforts in robotics have struggled to achieve similar success, limited either by the need for costly robotic annotations or by the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on 3D point tracking representations obtained from videos by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R transfers efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
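To make the lifting step concrete, the following is a minimal sketch of how 2D point tracks could be back-projected into 3D using per-frame depth maps. It is not the authors' pipeline: the function name `lift_tracks_to_3d`, the pinhole camera model, the known intrinsics (`fx`, `fy`, `cx`, `cy`), and the integer pixel coordinates are all assumptions made for illustration. In practice the depth maps would come from a monocular depth estimator and the 2D tracks from a point tracker.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[int, int]            # (u, v) pixel coordinates, assumed integer
Point3D = Tuple[float, float, float]  # (x, y, z) in the camera frame


def lift_tracks_to_3d(
    tracks_2d: Sequence[Sequence[Point2D]],
    depth_maps: Sequence[Sequence[Sequence[float]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[List[Point3D]]:
    """Back-project 2D tracks into 3D, frame by frame (hypothetical helper).

    tracks_2d[t][i] is the pixel location of point i in frame t.
    depth_maps[t][v][u] is the estimated depth at pixel (u, v) in frame t.
    The result, indexed by time and point, is a 3D track over time,
    i.e. a simple form of 4D representation.
    """
    tracks_3d: List[List[Point3D]] = []
    for frame_points, depth in zip(tracks_2d, depth_maps):
        lifted: List[Point3D] = []
        for u, v in frame_points:
            z = depth[v][u]            # depth lookup: row v, column u
            x = (u - cx) * z / fx      # assumed pinhole back-projection
            y = (v - cy) * z / fy
            lifted.append((x, y, z))
        tracks_3d.append(lifted)
    return tracks_3d
```

Calling the function with one tracked point over two frames returns a list of two frames, each holding a single `(x, y, z)` tuple, which is the point's 3D trajectory through time.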
