Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

June 12, 2025
Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie
cs.AI

Abstract

Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging: domain-specific data is insufficient, heterogeneous tasks interfere with one another, and open-world settings exhibit high visual diversity. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable, high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability under the visual diversity of Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/
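The abstract's second contribution is an MoE layer routed per task rather than per token. The sketch below is a hypothetical minimal illustration of that idea (not the paper's implementation; class and parameter names are invented): each sample carries a task id (e.g. planning, grounding, reflection), and the whole sample is dispatched to the one feed-forward expert assigned to that task, so heterogeneous tasks never share expert weights.

```python
import numpy as np

class TaskLevelMoE:
    """Illustrative sketch of task-level routing in a Mixture-of-Experts layer:
    every sample is routed in full to the expert matching its task id, so
    experts for different tasks keep disjoint feed-forward weights."""

    def __init__(self, d_model, d_ff, num_tasks, seed=0):
        rng = np.random.default_rng(seed)
        # One (W1, W2) two-layer feed-forward expert per task.
        self.experts = [
            (rng.standard_normal((d_model, d_ff)) * 0.02,
             rng.standard_normal((d_ff, d_model)) * 0.02)
            for _ in range(num_tasks)
        ]

    def forward(self, x, task_ids):
        # x: (batch, seq, d_model) hidden states; task_ids: per-sample expert index.
        out = np.empty_like(x)
        for i, t in enumerate(task_ids):
            w1, w2 = self.experts[t]
            h = np.maximum(x[i] @ w1, 0.0)  # ReLU feed-forward of expert t
            out[i] = h @ w2                 # project back to d_model
        return out
```

Because the routing decision is made once per task rather than per token, gradient updates for one task type touch only that task's expert, which is the interference-mitigation mechanism the abstract describes.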