JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
November 10, 2023
Authors: Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
cs.AI
Abstract
Achieving human-like planning and control with multimodal observations in an
open world is a key milestone for more functional generalist agents. Existing
approaches can handle certain long-horizon tasks in an open world. However,
they still struggle when the number of open-world tasks is potentially
infinite, and they lack the capability to progressively enhance task completion
as game time progresses. We introduce JARVIS-1, an open-world agent that can
perceive multimodal input (visual observations and human instructions),
generate sophisticated plans, and perform embodied control, all within the
popular yet challenging open-world Minecraft universe. Specifically, we develop
JARVIS-1 on top of pre-trained multimodal language models, which map visual
observations and textual instructions to plans. These plans are ultimately
dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a
multimodal memory, which facilitates planning using both pre-trained knowledge
and its actual game survival experiences. In our experiments, JARVIS-1 exhibits
nearly perfect performance across over 200 varying tasks from the Minecraft
Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has
achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task.
This represents an increase of up to 5 times over previous records.
Furthermore, we show that JARVIS-1 is able to self-improve following a
lifelong learning paradigm thanks to its multimodal memory, sparking a
more general intelligence and improved autonomy. The project page is available
at https://craftjarvis-jarvis1.github.io.
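
To make the loop described in the abstract concrete (retrieve related experiences from a multimodal memory, plan with the language model, dispatch sub-goals to a goal-conditioned controller, then write the outcome back to memory), here is a minimal Python sketch. Every name in it (Experience, MultimodalMemory, plan_with_mlm, GoalConditionedController, run_episode) is a hypothetical stand-in, and the retrieval and planning logic is deliberately naive; the abstract does not specify the actual interfaces, so treat this as an assumption-laden illustration rather than the authors' implementation.

```python
"""Illustrative sketch of a memory-augmented planning loop in the spirit of
JARVIS-1. All names here are hypothetical stand-ins, not the paper's API."""

from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored game experience: the task, the plan tried, and the outcome."""
    task: str
    plan: list[str]
    success: bool


@dataclass
class MultimodalMemory:
    """Minimal text-keyed memory; the real system also stores visual state."""
    experiences: list[Experience] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        # Naive keyword overlap as a placeholder for multimodal retrieval.
        def score(e: Experience) -> int:
            return len(set(task.lower().split()) & set(e.task.lower().split()))
        ranked = sorted(self.experiences, key=score, reverse=True)
        return [e for e in ranked[:k] if score(e) > 0]

    def add(self, experience: Experience) -> None:
        self.experiences.append(experience)


def plan_with_mlm(task: str, observation: str,
                  exemplars: list[Experience]) -> list[str]:
    """Placeholder for the multimodal language model planner: it would condition
    on the visual observation, the instruction, and retrieved exemplars."""
    for e in exemplars:
        if e.success:
            # Reuse the most relevant successful plan as a prior, if one exists.
            return list(e.plan)
    # Fallback: a hand-written plan standing in for the model's generation.
    return [f"gather materials for: {task}", f"craft: {task}"]


class GoalConditionedController:
    """Stand-in for the low-level controller that turns a sub-goal into actions."""

    def execute(self, goal: str, observation: str) -> bool:
        print(f"[controller] pursuing goal: {goal} (obs: {observation})")
        return True  # Assume success for the sake of the sketch.


def run_episode(task: str, observation: str, memory: MultimodalMemory) -> bool:
    exemplars = memory.retrieve(task)                    # 1. recall related experiences
    plan = plan_with_mlm(task, observation, exemplars)   # 2. plan with the (M)LM
    controller = GoalConditionedController()
    success = all(controller.execute(g, observation) for g in plan)  # 3. act
    memory.add(Experience(task, plan, success))          # 4. store the outcome
    return success


if __name__ == "__main__":
    memory = MultimodalMemory()
    run_episode("craft a wooden pickaxe", "spawned in a forest biome", memory)
    run_episode("craft a stone pickaxe", "standing near a cave entrance", memory)
```

The point of the sketch is the control flow rather than any single component: memory retrieval conditions the planner, and every episode's outcome is appended back to memory, which is the mechanism the abstract credits for lifelong self-improvement across tasks.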