GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Abstract

We introduce a pipeline that enhances a general-purpose vision-language model, GPT-4V(ision), by integrating observations of human actions to facilitate robotic manipulation. The system analyzes videos of humans performing tasks and creates executable robot programs that incorporate affordance insights. The computation starts with GPT-4V analyzing the videos and converting environmental and action details into text, which is then passed to a GPT-4-empowered task planner. In the subsequent analyses, vision systems reanalyze the video using the task plan. Object names are grounded with an open-vocabulary object detector, while a focus on the hand-object relation helps detect the moments of grasping and releasing. This spatiotemporal grounding allows the vision systems to gather further affordance data (e.g., grasp type, waypoints, and body postures). Experiments across various scenarios demonstrate the method's efficacy in achieving real-robot operation from human demonstrations in a zero-shot manner. The prompts for GPT-4V/GPT-4 are available at the project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
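To make the temporal grounding step concrete, the following is a minimal sketch of how grasp and release moments could be located from per-frame hand and object bounding boxes. It is an illustration only, not the authors' implementation: the abstract does not say how the hand-object relation is evaluated, so the distance threshold, the box format, and the function names (`hand_object_distance`, `find_grasp_and_release`) are all assumptions introduced here. The boxes are assumed to come from a hand detector and an open-vocabulary object detector run on each frame.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def hand_object_distance(hand: Box, obj: Box) -> float:
    """Euclidean distance between the centers of the hand and object boxes."""
    hx, hy = center(hand)
    ox, oy = center(obj)
    return ((hx - ox) ** 2 + (hy - oy) ** 2) ** 0.5


def find_grasp_and_release(
    hand_boxes: List[Box],
    object_boxes: List[Box],
    contact_dist: float = 30.0,  # hypothetical threshold, in pixels
) -> Tuple[Optional[int], Optional[int]]:
    """Return (grasp_frame, release_frame) as frame indices.

    Grasp is the first frame where the hand is within contact_dist of the
    object; release is the first later frame where it is no longer within
    that distance. Either value is None if the event is never observed.
    """
    grasp: Optional[int] = None
    release: Optional[int] = None
    for i, (hand, obj) in enumerate(zip(hand_boxes, object_boxes)):
        near = hand_object_distance(hand, obj) <= contact_dist
        if grasp is None:
            if near:
                grasp = i
        elif not near:
            release = i
            break
    return grasp, release


if __name__ == "__main__":
    # Toy track: the hand approaches, carries the object, then withdraws.
    hands = [(0, 0, 10, 10), (50, 0, 60, 10), (100, 0, 110, 10),
             (150, 0, 160, 10), (50, 0, 60, 10)]
    objects = [(100, 0, 110, 10), (100, 0, 110, 10), (100, 0, 110, 10),
               (150, 0, 160, 10), (150, 0, 160, 10)]
    print(find_grasp_and_release(hands, objects))
```

Under these assumptions, the frames between the two returned indices bound the segment in which affordance data such as grasp type and waypoints would be collected. A real system would likely need smoothing or a learned contact classifier rather than a single fixed threshold.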

