Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Abstract

Large-scale robotic systems typically rely on textual instructions to specify tasks. This work explores a different approach: can robots infer the task directly from observing humans? Doing so requires the robot to decode human intent and translate it into actions that are executable within its own physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots. Given a video demonstration of a manipulation task and the current visual observations, Vid2Robot directly produces robot actions. This is achieved through a unified representation model trained on a large dataset of human videos and robot trajectories. The model uses cross-attention mechanisms to fuse prompt video features with the robot's current state and generate appropriate actions that mimic the observed task. To further improve policy performance, we propose auxiliary contrastive losses that enhance the alignment between human and robot video representations. We evaluate Vid2Robot on real-world robots and demonstrate a 20% improvement in performance over other video-conditioned policies when using human demonstration videos. Additionally, our model exhibits emergent capabilities, such as successfully transferring observed motions from one object to another and long-horizon composition, showcasing its potential for real-world applications. Project website: vid2robot.github.io
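As a rough illustration of the two mechanisms the abstract names, the sketch below shows cross-attention in which tokens for the robot's current state attend over prompt-video tokens, followed by a contrastive loss that pulls paired human and robot video embeddings together. It is not the Vid2Robot implementation. The function names, the omission of learned query/key/value projections, the temperature value, and the InfoNCE-style form of the loss are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def cross_attention(state_tokens, prompt_tokens):
    """Fuse prompt-video features into each robot-state token.

    Each state token acts as a query; the prompt tokens act as both keys
    and values. Learned projections are omitted (assumption: identity
    projections) to keep the sketch short.
    """
    dim = len(prompt_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in state_tokens:
        # Scaled dot-product scores of this query against every prompt token.
        weights = softmax([dot(query, key) * scale for key in prompt_tokens])
        # Weighted average of the prompt tokens (the values).
        fused.append([
            sum(w * value[i] for w, value in zip(weights, prompt_tokens))
            for i in range(dim)
        ])
    return fused


def contrastive_loss(human_embs, robot_embs, temperature=0.1):
    """InfoNCE-style loss (assumed form): human_embs[i] pairs with robot_embs[i].

    Every other robot embedding in the batch serves as a negative.
    """
    losses = []
    for i, human in enumerate(human_embs):
        probs = softmax([cosine(human, robot) / temperature for robot in robot_embs])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)
```

Calling `cross_attention(state_tokens, prompt_tokens)` returns one fused vector per state token, each a convex combination of the prompt tokens. `contrastive_loss` is small when the i-th human embedding is most similar to the i-th robot embedding and grows when the pairs are mismatched.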
