GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Abstract

We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), by integrating observations of human actions to facilitate robotic manipulation. The system analyzes videos of humans performing tasks and produces executable robot programs that incorporate affordance information. Computation starts with GPT-4V analyzing the videos and converting environmental and action details into text, which is then passed to a GPT-4-based task planner. In the subsequent analyses, vision systems reanalyze the video using the task plan. Object names are grounded with an open-vocabulary object detector, while attention to the hand-object relation helps detect the moments of grasping and releasing. This spatiotemporal grounding allows the vision systems to gather further affordance data (e.g., grasp type, waypoints, and body postures). Experiments across various scenarios demonstrate that this method enables real robots to operate from human demonstrations in a zero-shot manner. The GPT-4V/GPT-4 prompts are available at the project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
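
The abstract says the hand-object relation is used to find the moments of grasping and releasing, but it does not say how. As a purely illustrative sketch, and not the authors' method, one simple approach is to threshold a per-frame hand-to-object distance. The function name `detect_grasp_release`, the distance values, and the 0.05 threshold are all hypothetical assumptions introduced here.

```python
from typing import List, Optional, Tuple


def detect_grasp_release(
    distances: List[float], threshold: float = 0.05
) -> Tuple[Optional[int], Optional[int]]:
    """Return (grasp_frame, release_frame) from per-frame hand-object distances.

    Assumption (not from the paper): the hand is treated as holding the object
    while the distance is at or below `threshold`.
    - grasp_frame: first frame whose distance is <= threshold.
    - release_frame: first later frame whose distance is > threshold.
    Either value is None if no such frame exists.
    """
    grasp: Optional[int] = None
    release: Optional[int] = None
    for i, d in enumerate(distances):
        if grasp is None:
            # Still searching for the first contact frame.
            if d <= threshold:
                grasp = i
        elif d > threshold:
            # Contact was established earlier and is now lost.
            release = i
            break
    return grasp, release


# Hypothetical distances (arbitrary units) for a seven-frame clip.
distances = [0.30, 0.18, 0.04, 0.02, 0.03, 0.12, 0.25]
grasp_frame, release_frame = detect_grasp_release(distances)
print(grasp_frame, release_frame)
```

In a pipeline like the one described, the two frame indices would bound the segment of video from which affordance data such as grasp type and waypoints are extracted. A real system would likely need smoothing or hysteresis to cope with noisy detections.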
