MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft
May 29, 2026
Authors: Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang
cs.AI
Abstract
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation, but their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer, a benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the remaining tasks better reflect general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that this workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty correlates with agent completion, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
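As a rough illustration of what a task graph with a rule-based milestone evaluator might look like, here is a minimal Python sketch. It is not taken from the MineExplorer codebase: the `Milestone` class, the `evaluate` function, the dictionary-based state, and the key/door/chest task are all hypothetical names chosen for this example. The sketch assumes a milestone counts only once its prerequisites have been reached, which is one plausible way to score hidden multi-hop dependencies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Hypothetical world-state snapshot: flag or item name -> integer value.
State = Dict[str, int]


@dataclass
class Milestone:
    """One node of a (hypothetical) task graph."""
    name: str
    check: Callable[[State], bool]  # rule deciding whether the milestone is met
    prerequisites: List[str] = field(default_factory=list)


def evaluate(milestones: List[Milestone], trajectory: List[State]) -> float:
    """Return the fraction of milestones reached over a trajectory.

    A milestone is credited only after all of its prerequisites
    have been credited, in the same state or an earlier one.
    """
    if not milestones:
        return 0.0
    reached: Set[str] = set()
    for state in trajectory:
        progressed = True
        # Repeat until no new milestone unlocks in this state,
        # so chained milestones satisfied together are all credited.
        while progressed:
            progressed = False
            for m in milestones:
                if m.name in reached:
                    continue
                prereqs_met = all(p in reached for p in m.prerequisites)
                if prereqs_met and m.check(state):
                    reached.add(m.name)
                    progressed = True
    return len(reached) / len(milestones)


# Illustrative three-hop task: find a key, open a door, open a chest.
milestones = [
    Milestone("find_key", lambda s: s.get("key", 0) >= 1),
    Milestone("open_door", lambda s: s.get("door_open", 0) == 1, ["find_key"]),
    Milestone("open_chest", lambda s: s.get("chest_open", 0) == 1, ["open_door"]),
]

# The agent finds the key and opens the door but never opens the chest.
trajectory = [
    {"key": 0},
    {"key": 1},
    {"key": 1, "door_open": 1},
]

score = evaluate(milestones, trajectory)
```

Scoring partial progress this way gives credit for intermediate hops, which is useful when full task completion is rare on long trajectories.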