JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
November 10, 2023
Authors: Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
cs.AI
Abstract
Achieving human-like planning and control with multimodal observations in an
open world is a key milestone for more functional generalist agents. Existing
approaches can handle certain long-horizon tasks in an open world. However,
they still struggle when the number of open-world tasks is potentially
infinite, and they lack the capability to progressively enhance task completion
as game time progresses. We introduce JARVIS-1, an open-world agent that can
perceive multimodal input (visual observations and human instructions),
generate sophisticated plans, and perform embodied control, all within the
popular yet challenging open-world Minecraft universe. Specifically, we develop
JARVIS-1 on top of pre-trained multimodal language models, which map visual
observations and textual instructions to plans. These plans are ultimately
dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a
multimodal memory, which facilitates planning using both pre-trained knowledge
and its actual game survival experiences. In our experiments, JARVIS-1 exhibits
nearly perfect performance across over 200 varying tasks from the Minecraft
Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has
achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task.
This represents an increase of up to 5 times over previous records.
Furthermore, we show that JARVIS-1 is able to self-improve following a
lifelong learning paradigm thanks to its multimodal memory, sparking a
more general intelligence and improved autonomy. The project page is available
at https://craftjarvis-jarvis1.github.io.
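
To make the loop described in the abstract concrete (retrieve related experiences from a multimodal memory, plan with the language model, dispatch sub-goals to a goal-conditioned controller, then write the outcome back to memory), here is a minimal Python sketch. Every name in it (Experience, MultimodalMemory, plan_with_mlm, GoalConditionedController, run_episode) is a hypothetical stand-in, and the retrieval and planning logic is deliberately naive; the abstract does not specify the actual interfaces, so treat this as an assumption-laden illustration rather than the authors' implementation.

```python
"""Illustrative sketch of a memory-augmented planning loop in the spirit of
JARVIS-1. All names here are hypothetical stand-ins, not the paper's API."""

from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored game experience: the task, the plan tried, and the outcome."""
    task: str
    plan: list[str]
    success: bool


@dataclass
class MultimodalMemory:
    """Minimal text-keyed memory; the real system also stores visual state."""
    experiences: list[Experience] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        # Naive keyword overlap as a placeholder for multimodal retrieval.
        def score(e: Experience) -> int:
            return len(set(task.lower().split()) & set(e.task.lower().split()))
        ranked = sorted(self.experiences, key=score, reverse=True)
        return [e for e in ranked[:k] if score(e) > 0]

    def add(self, experience: Experience) -> None:
        self.experiences.append(experience)


def plan_with_mlm(task: str, observation: str,
                  exemplars: list[Experience]) -> list[str]:
    """Placeholder for the multimodal language model planner: it would condition
    on the visual observation, the instruction, and retrieved exemplars."""
    for e in exemplars:
        if e.success:
            # Reuse the most relevant successful plan as a prior, if one exists.
            return list(e.plan)
    # Fallback: a hand-written plan standing in for the model's generation.
    return [f"gather materials for: {task}", f"craft: {task}"]


class GoalConditionedController:
    """Stand-in for the low-level controller that turns a sub-goal into actions."""

    def execute(self, goal: str, observation: str) -> bool:
        print(f"[controller] pursuing goal: {goal} (obs: {observation})")
        return True  # Assume success for the sake of the sketch.


def run_episode(task: str, observation: str, memory: MultimodalMemory) -> bool:
    exemplars = memory.retrieve(task)                    # 1. recall related experiences
    plan = plan_with_mlm(task, observation, exemplars)   # 2. plan with the (M)LM
    controller = GoalConditionedController()
    success = all(controller.execute(g, observation) for g in plan)  # 3. act
    memory.add(Experience(task, plan, success))          # 4. store the outcome
    return success


if __name__ == "__main__":
    memory = MultimodalMemory()
    run_episode("craft a wooden pickaxe", "spawned in a forest biome", memory)
    run_episode("craft a stone pickaxe", "standing near a cave entrance", memory)
```

The point of the sketch is the control flow rather than any single component: memory retrieval conditions the planner, and every episode's outcome is appended back to memory, which is the mechanism the abstract credits for lifelong self-improvement across tasks.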