Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
March 19, 2024
Authors: Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi
cs.AI
Abstract
While large-scale robotic systems typically rely on textual instructions for
tasks, this work explores a different approach: can robots infer the task
directly from observing humans? This shift necessitates the robot's ability to
decode human intent and translate it into executable actions within its
physical constraints and environment. We introduce Vid2Robot, a novel
end-to-end video-based learning framework for robots. Given a video
demonstration of a manipulation task and current visual observations, Vid2Robot
directly produces robot actions. This is achieved through a unified
representation model trained on a large dataset of human videos and robot
trajectories. The model leverages cross-attention mechanisms to fuse prompt
video features with the robot's current state and generate appropriate actions
that
mimic the observed task. To further improve policy performance, we propose
auxiliary contrastive losses that enhance the alignment between human and robot
video representations. We evaluate Vid2Robot on real-world robots,
demonstrating a 20% improvement in performance compared to other
video-conditioned policies when using human demonstration videos. Additionally,
our model exhibits emergent capabilities, such as successfully transferring
observed motions from one object to another, and long-horizon composition, thus
showcasing its potential for real-world applications. Project website:
vid2robot.github.io
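
To make the cross-attention fusion described in the abstract concrete, the sketch below shows one plausible way to let robot-state tokens attend to prompt-video tokens and decode an action. This is a minimal illustration in PyTorch; the module names, token counts, dimensions, pooling scheme, and action parameterization are all assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionPolicy(nn.Module):
    """Minimal sketch: robot-state tokens (queries) attend to prompt-video
    tokens (keys/values) via cross-attention, then a head decodes an action.
    Sizes and structure are illustrative, not the authors' architecture."""

    def __init__(self, dim=512, num_heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, state_tokens, prompt_tokens):
        # state_tokens:  (B, S, dim) tokens from the robot's current observation
        # prompt_tokens: (B, P, dim) tokens from the human demonstration video
        fused, _ = self.cross_attn(query=state_tokens,
                                   key=prompt_tokens,
                                   value=prompt_tokens)
        fused = self.norm(state_tokens + fused)  # residual connection
        # Pool over state tokens and predict one action (e.g., end-effector delta)
        return self.action_head(fused.mean(dim=1))

# Usage: batch of 2, 16 state tokens, 64 prompt-video tokens
policy = CrossAttentionPolicy()
action = policy(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
print(action.shape)  # torch.Size([2, 7])
```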
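The auxiliary contrastive losses that align human and robot video representations could take an InfoNCE-like form. Below is a minimal sketch assuming human and robot clips of the same task are paired by index within a batch; the pairing scheme, temperature, and symmetric formulation are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def video_alignment_loss(human_emb, robot_emb, temperature=0.07):
    """InfoNCE-style sketch: embeddings of a human video and a robot video
    of the same task (paired by batch index) are pulled together; all
    mismatched pairs are pushed apart."""
    human = F.normalize(human_emb, dim=-1)    # (B, D) unit-norm embeddings
    robot = F.normalize(robot_emb, dim=-1)    # (B, D)
    logits = human @ robot.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match human->robot and robot->human
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: 8 paired human/robot clip embeddings of width 512
loss = video_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```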