Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
March 19, 2024
Authors: Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi
cs.AI
Abstract
While large-scale robotic systems typically rely on textual instructions for
tasks, this work explores a different approach: can robots infer the task
directly from observing humans? This shift necessitates the robot's ability to
decode human intent and translate it into executable actions within its
physical constraints and environment. We introduce Vid2Robot, a novel
end-to-end video-based learning framework for robots. Given a video
demonstration of a manipulation task and current visual observations, Vid2Robot
directly produces robot actions. This is achieved through a unified
representation model trained on a large dataset of human videos and robot
trajectories. The model leverages cross-attention mechanisms to fuse prompt
video features into the robot's current state and generate appropriate actions
that
mimic the observed task. To further improve policy performance, we propose
auxiliary contrastive losses that enhance the alignment between human and robot
video representations. We evaluate Vid2Robot on real-world robots,
demonstrating a 20% improvement in performance compared to other
video-conditioned policies when using human demonstration videos. Additionally,
our model exhibits emergent capabilities, such as successfully transferring
observed motions from one object to another, and long-horizon composition, thus
showcasing its potential for real-world applications. Project website:
vid2robot.github.io
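
The cross-attention fusion described in the abstract can be pictured with a minimal sketch. The module below is illustrative only — the class name, token shapes, and the discretized action head are assumptions, not the paper's released implementation — but it shows the core idea: robot-state tokens attend as queries over encoded prompt-video tokens, and the fused tokens are decoded into per-dimension action logits.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: fuses prompt-video tokens into the robot's
    current-state tokens via cross-attention, then predicts discretized
    action tokens. Names and dimensions are illustrative."""

    def __init__(self, dim=512, num_heads=8, num_action_bins=256, action_dims=7):
        super().__init__()
        # Queries come from the robot's current observation; keys/values
        # come from the encoded prompt video of the demonstrated task.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Linear(dim, num_action_bins)
        self.action_dims = action_dims

    def forward(self, state_tokens, prompt_tokens):
        # state_tokens:  (B, S, dim) tokens from the current robot observation
        # prompt_tokens: (B, P, dim) tokens from the prompt video
        fused, _ = self.cross_attn(query=state_tokens,
                                   key=prompt_tokens,
                                   value=prompt_tokens)
        fused = self.norm(state_tokens + fused)  # residual connection
        # Read out the first action_dims tokens (assumes S >= action_dims)
        # and map each to logits over discretized action bins.
        action_logits = self.action_head(fused[:, :self.action_dims])
        return action_logits  # (B, action_dims, num_action_bins)

# Usage with made-up shapes:
policy = CrossAttentionFusion()
state = torch.randn(2, 16, 512)   # 16 tokens from current camera frame(s)
prompt = torch.randn(2, 64, 512)  # 64 tokens from the prompt video
logits = policy(state, prompt)    # -> (2, 7, 256)
```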
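
The auxiliary contrastive losses that align human and robot video representations can likewise be sketched. Below is one plausible form — a symmetric InfoNCE-style objective over paired embeddings, where row i of each batch shows the same task. The function name, temperature, and batch-pairing scheme are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def video_alignment_loss(human_emb, robot_emb, temperature=0.07):
    """Hypothetical sketch of a symmetric contrastive loss: pull together
    embeddings of a human video and a robot video showing the same task,
    push apart embeddings of different tasks within the batch.
    human_emb, robot_emb: (B, D), with row i of each depicting task i."""
    human_emb = F.normalize(human_emb, dim=-1)
    robot_emb = F.normalize(robot_emb, dim=-1)
    logits = human_emb @ robot_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; score alignment as classification
    # in both directions (human -> robot and robot -> human).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```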