Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Abstract

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address the above challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarizes historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and Experience-Driven Reflector, contributing to better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.
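
To make the idea of a directed knowledge graph guiding a long-horizon plan more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that world knowledge can be stored as a mapping from each Minecraft item to the items it depends on, and that a plan is simply a dependency-respecting ordering of those items. The names `RECIPES`, `prerequisites`, and `plan` are hypothetical, and the recipe table is a simplified illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified knowledge graph: item -> items needed beforehand.
# This is an illustrative assumption, not data taken from the paper.
RECIPES = {
    "log": set(),
    "planks": {"log"},
    "stick": {"planks"},
    "crafting_table": {"planks"},
    "wooden_pickaxe": {"planks", "stick", "crafting_table"},
    "cobblestone": {"wooden_pickaxe"},
    "stone_pickaxe": {"cobblestone", "stick", "crafting_table"},
}


def prerequisites(goal, graph):
    """Collect the goal and every item it transitively depends on."""
    needed, stack = set(), [goal]
    while stack:
        item = stack.pop()
        if item not in needed:
            needed.add(item)
            stack.extend(graph[item])
    return needed


def plan(goal, graph=RECIPES):
    """Return the needed items in an order where dependencies come first."""
    needed = prerequisites(goal, graph)
    subgraph = {item: graph[item] for item in needed}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(plan("stone_pickaxe"))
```

In this sketch, `plan("stone_pickaxe")` always starts with `log` and ends with `stone_pickaxe`; the order of independent intermediate items may vary. A real planner would also need item quantities, tools, and visual observations, which are omitted here.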
