MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft
May 29, 2026
Authors: Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang
cs.AI
Abstract
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer, a benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the remaining tasks better reflect general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that this workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that agent completion tracks task difficulty, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
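To make the idea of an implicit multi-hop task with a rule-based milestone evaluator more concrete, here is a minimal Python sketch. It is not taken from the MineExplorer repository: the `Milestone` class, the `hop_order` and `completion` functions, the dictionary-based world state, and the three-hop example task are all hypothetical names and simplifications chosen for illustration. The sketch assumes that a task graph is a set of milestones with prerequisite edges and that each milestone is checked by a simple predicate over the final state.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Hypothetical world state: flag or item name -> count.
State = Dict[str, int]


@dataclass(frozen=True)
class Milestone:
    name: str
    requires: Tuple[str, ...]       # prerequisite milestones, not shown to the agent
    check: Callable[[State], bool]  # rule-based predicate over the state


def hop_order(milestones: List[Milestone]) -> List[str]:
    """Return one valid order in which the milestones can be completed."""
    graph = {m.name: set(m.requires) for m in milestones}
    return list(TopologicalSorter(graph).static_order())


def completion(milestones: List[Milestone], state: State) -> float:
    """Fraction of milestones whose rule is satisfied by the given state."""
    if not milestones:
        return 0.0
    passed = sum(1 for m in milestones if m.check(state))
    return passed / len(milestones)


# Hypothetical three-hop task: the instruction would name only the last step,
# so the first two milestones act as hidden prerequisites.
TASK = [
    Milestone("find_key", (), lambda s: s.get("key", 0) >= 1),
    Milestone("open_door", ("find_key",), lambda s: s.get("door_open", 0) >= 1),
    Milestone("reach_chest", ("open_door",), lambda s: s.get("chest_reached", 0) >= 1),
]

if __name__ == "__main__":
    print(hop_order(TASK))
    print(completion(TASK, {"key": 1, "door_open": 1}))
```

Scoring by the fraction of satisfied milestones, rather than by final success alone, gives partial credit to an agent that resolves some hidden prerequisites but stalls later in the trajectory. Whether the benchmark scores instances this way is an assumption of this sketch.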