GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
November 20, 2023
Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
cs.AI
Abstract
We introduce a pipeline that enhances a general-purpose Vision Language
Model, GPT-4V(ision), by integrating observations of human actions to
facilitate robotic manipulation. This system analyzes videos of humans
performing tasks and creates executable robot programs that incorporate
affordance insights. The computation starts by analyzing the videos with GPT-4V
to convert environmental and action details into text, followed by a
GPT-4-empowered task planner. In the subsequent analyses, vision systems
reanalyze the video using the task plan. Object names are grounded using an
open-vocabulary object detector, while a focus on the hand-object relation
helps detect the moments of grasping and releasing. This spatiotemporal
grounding allows the vision systems to further gather affordance data (e.g.,
grasp type, waypoints, and body postures). Experiments across various
scenarios demonstrate the method's efficacy in operating real robots from
human demonstrations in a zero-shot manner. The prompts for GPT-4V/GPT-4 are
available at the project page:
https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
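
The abstract notes that attending to the hand-object relation helps detect the moments of grasping and releasing. As a rough illustration only, the sketch below assumes a per-frame hand-object distance has already been computed and applies a simple threshold rule. The function name `detect_grasp_release`, the distance values, and the threshold are all hypothetical; this is not the detector used by the authors.

```python
from typing import List, Optional, Tuple


def detect_grasp_release(
    distances: List[float], threshold: float
) -> Tuple[Optional[int], Optional[int]]:
    """Return (grasp_frame, release_frame) from per-frame hand-object distances.

    Assumption for illustration: the grasp is the first frame where the
    distance drops to the threshold or below, and the release is the first
    later frame where the distance rises above it again.
    """
    grasp_frame: Optional[int] = None
    release_frame: Optional[int] = None
    for i, d in enumerate(distances):
        if grasp_frame is None:
            # Still searching for the moment the hand reaches the object.
            if d <= threshold:
                grasp_frame = i
        elif d > threshold:
            # Hand has moved away from the object after a grasp.
            release_frame = i
            break
    return grasp_frame, release_frame


# Made-up distances (in meters) for an eight-frame clip.
distances = [0.30, 0.12, 0.04, 0.03, 0.03, 0.05, 0.20, 0.35]
grasp, release = detect_grasp_release(distances, threshold=0.05)
print(grasp, release)  # prints "2 6"
```

Frame indices found this way could then bound the segment of the video from which affordance data such as waypoints are extracted, although how the actual system does this is described in the paper, not here.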