GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
November 20, 2023
Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
cs.AI
Abstract
We introduce a pipeline that enhances a general-purpose Vision Language
Model, GPT-4V(ision), by integrating observations of human actions to
facilitate robotic manipulation. This system analyzes videos of humans
performing tasks and creates executable robot programs that incorporate
affordance insights. The computation starts by analyzing the videos with GPT-4V
to convert environmental and action details into text, followed by a
GPT-4-empowered task planner. In the following analyses, vision systems
reanalyze the video with the task plan. Object names are grounded using an
open-vocabulary object detector, while focus on the hand-object relation helps
to detect the moment of grasping and releasing. This spatiotemporal grounding
allows the vision systems to further gather affordance data (e.g., grasp type,
waypoints, and body postures). Experiments across various scenarios
demonstrate this method's efficacy in achieving real robots' operations from
human demonstrations in a zero-shot manner. The prompts of GPT-4V/GPT-4 are
available at this project page:
https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
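
To make the described pipeline more concrete, a minimal Python sketch of its stages follows. Every function, class, and file name here is a hypothetical placeholder used to illustrate the data flow (video → textual scene description → task plan → spatiotemporal grounding → affordances → robot program); the authors' actual prompts and implementation are linked at the project page above.

```python
# Illustrative sketch of the pipeline described in the abstract.
# All names below are assumptions for explanation, not the authors' code.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Affordance:
    grasp_type: str                     # e.g., a coarse grasp-taxonomy label
    waypoints: List[Tuple[float, ...]]  # approach/retreat waypoints in task space
    body_posture: str                   # coarse posture label for the manipulator


def video_to_text(video_path: str) -> str:
    """Step 1 (assumed interface): GPT-4V converts environmental and
    action details in the human demonstration video into text."""
    ...


def plan_tasks(scene_description: str) -> List[str]:
    """Step 2 (assumed interface): a GPT-4-empowered task planner turns
    the textual description into an ordered list of robot task steps."""
    ...


def ground_and_extract_affordances(video_path: str,
                                   task_plan: List[str]) -> List[Affordance]:
    """Step 3 (assumed interface): the vision systems reanalyze the video
    with the task plan: object names are grounded by an open-vocabulary
    detector, hand-object relations localize grasp/release moments, and
    affordance data (grasp type, waypoints, postures) are collected."""
    ...


def compile_robot_program(task_plan: List[str],
                          affordances: List[Affordance]) -> str:
    """Step 4 (assumed interface): combine the plan and affordances into
    an executable robot program for zero-shot execution."""
    ...


if __name__ == "__main__":
    video = "human_demo.mp4"  # hypothetical input recording
    description = video_to_text(video)
    plan = plan_tasks(description)
    affordances = ground_and_extract_affordances(video, plan)
    program = compile_robot_program(plan, affordances)
```

The stubs only mark where each model or detector would be invoked; the key design point conveyed by the abstract is that grounding runs twice over the same video, first to produce text for planning and then, guided by the plan, to attach affordance parameters to each step.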