Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
August 7, 2024
Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
cs.AI
Abstract
Building a general-purpose agent is a long-standing vision in the field of
artificial intelligence. Existing agents have made remarkable progress in many
domains, yet they still struggle to complete long-horizon tasks in an open
world. We attribute this to the lack of necessary world knowledge and
multimodal experience that can guide agents through a variety of long-horizon
tasks. In this paper, we propose a Hybrid Multimodal Memory module to address
the above challenges. It 1) transforms knowledge into a Hierarchical Directed
Knowledge Graph that allows agents to explicitly represent and learn world
knowledge, and 2) summarizes historical information into an Abstracted
Multimodal Experience Pool that provides agents with rich references for
in-context learning. On top of the Hybrid Multimodal Memory module, a
multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-Guided
Planner and Experience-Driven Reflector, contributing to better planning and
reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results
show that Optimus-1 significantly outperforms all existing agents on
challenging long-horizon task benchmarks, and exhibits near human-level
performance on many tasks. In addition, we introduce various Multimodal Large
Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show
that Optimus-1 exhibits strong generalization with the help of the Hybrid
Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.
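
To make the idea of a directed knowledge graph guiding a planner more concrete, here is a minimal sketch. It is not the authors' implementation: the `RECIPES` table, the `prerequisites` and `plan` helpers, and the choice of a plain topological sort are all assumptions made for illustration. The abstract does not specify how the graph is stored or queried.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency edges (item -> items it requires).
# These entries are illustrative and are not taken from the paper.
RECIPES = {
    "planks": {"log"},
    "stick": {"planks"},
    "crafting_table": {"planks"},
    "wooden_pickaxe": {"planks", "stick", "crafting_table"},
}


def prerequisites(goal, recipes):
    """Collect the goal and every item it transitively depends on."""
    needed, stack = set(), [goal]
    while stack:
        item = stack.pop()
        if item not in needed:
            needed.add(item)
            stack.extend(recipes.get(item, ()))
    return needed


def plan(goal, recipes):
    """Return the needed items ordered so each comes after its prerequisites."""
    needed = prerequisites(goal, recipes)
    subgraph = {item: recipes.get(item, set()) for item in needed}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    # Prints one valid ordering of sub-goals, starting from raw materials.
    print(plan("wooden_pickaxe", RECIPES))
```

In this sketch, asking for a long-horizon goal such as `wooden_pickaxe` yields an ordered list of sub-goals that begins with the raw material and ends with the goal. A real planner would also need item quantities and feedback from the environment, which are omitted here.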