Affordance-based Robot Manipulation with Flow Matching

Abstract

We present a framework for assistive robot manipulation that addresses two fundamental challenges. The first is efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort. The second is effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge with a parameter-efficient prompt tuning method that prepends learnable text prompts to a frozen vision model to predict manipulation affordances in multi-task scenarios. For the second, we propose a supervised flow matching method that learns robot trajectories guided by affordances. Flow matching represents a robot visuomotor policy as a conditional process that flows random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation shows that the proposed prompt tuning method, which learns manipulation affordances with a language prompter, achieves competitive performance and even outperforms other finetuning protocols across data scales while remaining parameter-efficient. Learning multi-task robot trajectories with a single flow matching policy also yields consistently better performance than alternative behavior cloning methods, especially when the robot action distribution is multimodal. Our framework unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
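As a rough illustration of what "flowing random waypoints to desired robot trajectories" can mean, the sketch below uses the common straight-line (linear interpolation) formulation of flow matching. This is an assumption, since the abstract does not state which probability path or network the authors use. All names (`interpolate`, `fm_loss`, `generate`, `oracle_velocity`) and the four waypoint values are hypothetical. The `oracle_velocity` function stands in for a trained, affordance-conditioned network: here the condition is simply the target trajectory itself, which makes the example checkable without any training.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 to demonstration x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target along that line: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity for one sample."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t, cond)
    tgt = target_velocity(x0, x1)
    return sum((p - g) ** 2 for p, g in zip(pred, tgt)) / len(tgt)

def generate(velocity_fn, x0, cond, steps=10):
    """Euler-integrate dx/dt = v(x, t, cond) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_velocity(x, t, cond):
    """Hypothetical stand-in for a trained network: the exact velocity that
    carries x to cond by t = 1. A real policy would instead condition on
    visual affordance features and be fit by minimizing fm_loss."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

rng = random.Random(0)
demo = [0.1, 0.4, 0.9, 1.2]                   # flattened waypoints (made up)
noise = [rng.gauss(0.0, 1.0) for _ in demo]   # random initial waypoints
traj = generate(oracle_velocity, noise, demo)
```

In an actual policy, `velocity_fn` would be a learned model trained by averaging `fm_loss` over sampled noise, demonstrations, and times `t`. Generation would then start from fresh noise and integrate the learned field, which is how a single policy can cover multimodal action distributions.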
