Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Abstract

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address these challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarizes historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, we construct a multimodal agent, Optimus-1, with a dedicated Knowledge-Guided Planner and Experience-Driven Reflector, which contribute to better planning and reflection on long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that, with the help of the Hybrid Multimodal Memory module, Optimus-1 exhibits strong generalization, outperforming the GPT-4V baseline on many tasks.
