Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
March 19, 2024
Authors: Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi
cs.AI
Abstract
While large-scale robotic systems typically rely on textual instructions for
tasks, this work explores a different approach: can robots infer the task
directly from observing humans? This shift necessitates the robot's ability to
decode human intent and translate it into executable actions within its
physical constraints and environment. We introduce Vid2Robot, a novel
end-to-end video-based learning framework for robots. Given a video
demonstration of a manipulation task and current visual observations, Vid2Robot
directly produces robot actions. This is achieved through a unified
representation model trained on a large dataset of human videos and robot
trajectories. The model leverages cross-attention mechanisms to fuse prompt
video features into the robot's current state and generate appropriate actions
that
mimic the observed task. To further improve policy performance, we propose
auxiliary contrastive losses that enhance the alignment between human and robot
video representations. We evaluate Vid2Robot on real-world robots,
demonstrating a 20% improvement in performance compared to other
video-conditioned policies when using human demonstration videos. Additionally,
our model exhibits emergent capabilities, such as successfully transferring
observed motions from one object to another, and long-horizon composition, thus
showcasing its potential for real-world applications. Project website:
vid2robot.github.io
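
The cross-attention fusion described in the abstract can be pictured with a minimal sketch. The module below is illustrative only — the class name, token shapes, and the discretized action head are assumptions, not the paper's released implementation — but it shows the core idea: robot-state tokens attend as queries over encoded prompt-video tokens, and the fused tokens are decoded into per-dimension action logits.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: fuses prompt-video tokens into the robot's
    current-state tokens via cross-attention, then predicts discretized
    action tokens. Names and dimensions are illustrative."""

    def __init__(self, dim=512, num_heads=8, num_action_bins=256, action_dims=7):
        super().__init__()
        # Queries come from the robot's current observation; keys/values
        # come from the encoded prompt video of the demonstrated task.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Linear(dim, num_action_bins)
        self.action_dims = action_dims

    def forward(self, state_tokens, prompt_tokens):
        # state_tokens:  (B, S, dim) tokens from the current robot observation
        # prompt_tokens: (B, P, dim) tokens from the prompt video
        fused, _ = self.cross_attn(query=state_tokens,
                                   key=prompt_tokens,
                                   value=prompt_tokens)
        fused = self.norm(state_tokens + fused)  # residual connection
        # Read out the first action_dims tokens (assumes S >= action_dims)
        # and map each to logits over discretized action bins.
        action_logits = self.action_head(fused[:, :self.action_dims])
        return action_logits  # (B, action_dims, num_action_bins)

# Usage with made-up shapes:
policy = CrossAttentionFusion()
state = torch.randn(2, 16, 512)   # 16 tokens from current camera frame(s)
prompt = torch.randn(2, 64, 512)  # 64 tokens from the prompt video
logits = policy(state, prompt)    # -> (2, 7, 256)
```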
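
The auxiliary contrastive losses that align human and robot video representations can likewise be sketched. Below is one plausible form — a symmetric InfoNCE-style objective over paired embeddings, where row i of each batch shows the same task. The function name, temperature, and batch-pairing scheme are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def video_alignment_loss(human_emb, robot_emb, temperature=0.07):
    """Hypothetical sketch of a symmetric contrastive loss: pull together
    embeddings of a human video and a robot video showing the same task,
    push apart embeddings of different tasks within the batch.
    human_emb, robot_emb: (B, D), with row i of each depicting task i."""
    human_emb = F.normalize(human_emb, dim=-1)
    robot_emb = F.normalize(robot_emb, dim=-1)
    logits = human_emb @ robot_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; score alignment as classification
    # in both directions (human -> robot and robot -> human).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```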