Octopus: Embodied Vision-Language Programmer from Environmental Feedback
October 12, 2023
Authors: Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
cs.AI
Abstract
Large vision-language models (VLMs) have achieved substantial progress in
multimodal perception and reasoning. When integrated into an embodied agent,
such models mark a crucial step towards autonomous, context-aware systems
capable of formulating plans and executing commands with precision. In this
paper, we introduce Octopus, a novel VLM designed to interpret an agent's
visual input and textual task objectives, formulate action sequences, and
generate executable code. Our
design allows the agent to adeptly handle a wide spectrum of tasks, ranging
from mundane daily chores in simulators to sophisticated interactions in
complex video games. Octopus is trained by leveraging GPT-4 to control an
explorative agent to generate training data, i.e., action blueprints and the
corresponding executable code, within our experimental environment called
OctoVerse. We also collect environmental feedback, which enables an enhanced
training scheme, Reinforcement Learning with Environmental Feedback (RLEF).
Through a series of experiments, we demonstrate Octopus's functionality and
show that the proposed RLEF refines the agent's decision-making. By
open-sourcing our model architecture, simulator, and
dataset, we aspire to ignite further innovation and foster collaborative
applications within the broader embodied AI community.
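
The data-collection idea described in the abstract (a planner proposes an action blueprint plus executable code, the code runs in an environment, and the outcome is recorded as feedback) can be illustrated with a minimal sketch. Everything below is hypothetical: `ToyEnv`, `stub_planner`, `Sample`, and `collect` are illustrative names, the stub stands in for the GPT-4 planner, and the binary success flag is only an assumed form of feedback. It is not the OctoVerse API or the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulator; not the OctoVerse API."""
    holding: str = ""
    fridge_open: bool = False
    items_in_fridge: List[str] = field(default_factory=list)

    def open_fridge(self) -> None:
        self.fridge_open = True

    def pick_up(self, item: str) -> None:
        self.holding = item

    def put_in_fridge(self) -> None:
        if not self.fridge_open:
            raise RuntimeError("fridge is closed")
        if not self.holding:
            raise RuntimeError("nothing in hand")
        self.items_in_fridge.append(self.holding)
        self.holding = ""


@dataclass
class Sample:
    """One training record: the plan, its code, and the observed feedback."""
    task: str
    plan: List[str]
    code: str
    success: bool  # assumed feedback format: did the task goal hold afterwards?
    error: str     # runtime error message, empty if the code ran cleanly


def stub_planner(task: str, attempt: int) -> Tuple[List[str], str]:
    """Placeholder for the planner (GPT-4 in the paper); returns (plan, code)."""
    if attempt == 0:
        # Deliberately flawed first attempt: forgets to open the fridge.
        plan = ["pick up the apple", "put it in the fridge"]
        code = "env.pick_up('apple')\nenv.put_in_fridge()"
    else:
        plan = ["open the fridge", "pick up the apple", "put it in the fridge"]
        code = "env.open_fridge()\nenv.pick_up('apple')\nenv.put_in_fridge()"
    return plan, code


def collect(task: str, goal: Callable[[ToyEnv], bool], attempts: int = 2) -> List[Sample]:
    """Run each proposed program in a fresh environment and record feedback."""
    samples = []
    for attempt in range(attempts):
        env = ToyEnv()
        plan, code = stub_planner(task, attempt)
        error = ""
        try:
            # exec is used only for illustration; never run untrusted code this way.
            exec(code, {"env": env})
        except Exception as exc:
            error = str(exc)
        samples.append(Sample(task, plan, code, goal(env), error))
    return samples


if __name__ == "__main__":
    data = collect("put the apple in the fridge",
                   lambda env: "apple" in env.items_in_fridge)
    for s in data:
        print(s.success, repr(s.error), s.plan)
```

In this toy setup, both successful and failed attempts are kept together with their feedback, which is the kind of paired record a feedback-based training scheme could later consume. How RLEF actually uses such records is not specified in the abstract.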