GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
November 20, 2023
Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
cs.AI
Abstract
We introduce a pipeline that enhances a general-purpose Vision Language
Model, GPT-4V(ision), by integrating observations of human actions to
facilitate robotic manipulation. This system analyzes videos of humans
performing tasks and creates executable robot programs that incorporate
affordance insights. The computation starts by analyzing the videos with GPT-4V
to convert environmental and action details into text, followed by a
GPT-4-empowered task planner. In the following analyses, vision systems
reanalyze the video with the task plan. Object names are grounded using an
open-vocabulary object detector, while focus on the hand-object relation helps
to detect the moment of grasping and releasing. This spatiotemporal grounding
allows the vision systems to further gather affordance data (e.g., grasp type,
waypoints, and body postures). Experiments across various scenarios
demonstrate this method's efficacy in achieving real robots' operations from
human demonstrations in a zero-shot manner. The prompts of GPT-4V/GPT-4 are
available at this project page:
https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
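
To make the described pipeline more concrete, a minimal Python sketch of its stages follows. Every function, class, and file name here is a hypothetical placeholder used to illustrate the data flow (video → textual scene description → task plan → spatiotemporal grounding → affordances → robot program); the authors' actual prompts and implementation are linked at the project page above.

```python
# Illustrative sketch of the pipeline described in the abstract.
# All names below are assumptions for explanation, not the authors' code.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Affordance:
    grasp_type: str                     # e.g., a coarse grasp-taxonomy label
    waypoints: List[Tuple[float, ...]]  # approach/retreat waypoints in task space
    body_posture: str                   # coarse posture label for the manipulator


def video_to_text(video_path: str) -> str:
    """Step 1 (assumed interface): GPT-4V converts environmental and
    action details in the human demonstration video into text."""
    ...


def plan_tasks(scene_description: str) -> List[str]:
    """Step 2 (assumed interface): a GPT-4-empowered task planner turns
    the textual description into an ordered list of robot task steps."""
    ...


def ground_and_extract_affordances(video_path: str,
                                   task_plan: List[str]) -> List[Affordance]:
    """Step 3 (assumed interface): the vision systems reanalyze the video
    with the task plan: object names are grounded by an open-vocabulary
    detector, hand-object relations localize grasp/release moments, and
    affordance data (grasp type, waypoints, postures) are collected."""
    ...


def compile_robot_program(task_plan: List[str],
                          affordances: List[Affordance]) -> str:
    """Step 4 (assumed interface): combine the plan and affordances into
    an executable robot program for zero-shot execution."""
    ...


if __name__ == "__main__":
    video = "human_demo.mp4"  # hypothetical input recording
    description = video_to_text(video)
    plan = plan_tasks(description)
    affordances = ground_and_extract_affordances(video, plan)
    program = compile_robot_program(plan, affordances)
```

The stubs only mark where each model or detector would be invoked; the key design point conveyed by the abstract is that grounding runs twice over the same video, first to produce text for planning and then, guided by the plan, to attach affordance parameters to each step.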