

Pre-training Auto-regressive Robotic Models with 4D Representations

February 18, 2025
Authors: Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig
cs.AI

Abstract

Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
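The core mechanism described above, lifting 2D point tracks into 3D camera space with per-frame monocular depth to obtain a 4D (3D-over-time) representation, can be illustrated with a short sketch. This is a minimal illustration assuming a pinhole camera model with known intrinsics K; the function name `lift_tracks_to_3d`, the array shapes, and the nearest-pixel depth lookup are assumptions made for clarity, not the paper's actual interface.

```python
import numpy as np

def lift_tracks_to_3d(tracks_2d, depth_maps, intrinsics):
    """Lift 2D point tracks to 3D camera coordinates using monocular depth.

    tracks_2d:  (T, N, 2) pixel coordinates (u, v) of N tracked points over T frames
    depth_maps: (T, H, W) estimated depth per frame (e.g. from a monocular depth model)
    intrinsics: (3, 3) pinhole camera matrix K
    returns:    (T, N, 3) 3D points over time, i.e. a 4D representation
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    T, N, _ = tracks_2d.shape
    H, W = depth_maps.shape[1:]

    tracks_3d = np.zeros((T, N, 3), dtype=np.float32)
    for t in range(T):
        u, v = tracks_2d[t, :, 0], tracks_2d[t, :, 1]
        # Nearest-pixel depth lookup; a real pipeline might interpolate instead.
        rows = np.clip(np.round(v).astype(int), 0, H - 1)
        cols = np.clip(np.round(u).astype(int), 0, W - 1)
        z = depth_maps[t, rows, cols]
        # Pinhole back-projection: X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z
        tracks_3d[t, :, 0] = (u - cx) * z / fx
        tracks_3d[t, :, 1] = (v - cy) * z / fy
        tracks_3d[t, :, 2] = z
    return tracks_3d
```

Because robot end-effector states also live in 3D space, the lifted point tracks and the robot state vectors share geometric structure up to a linear transformation, which is what allows pre-training on human video to transfer to low-level robotic control.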
