MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Abstract

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer, a benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that this workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty correlates with agent completion rate, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
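To make the idea of a task graph with hidden prerequisites and a rule-based milestone evaluator more concrete, the following is a minimal illustrative sketch. It is not taken from the MineExplorer codebase: the names `TaskGraph`, `evaluate`, the milestone labels, and the scoring rule (a milestone counts only after all of its prerequisites have been reached) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TaskGraph:
    """Hypothetical multi-hop task: milestones plus prerequisite edges."""
    goal: str
    # Maps every milestone to the set of milestones that must come before it.
    prerequisites: dict[str, set[str]] = field(default_factory=dict)

    def milestone_order(self) -> list[str]:
        # Raises graphlib.CycleError if the prerequisite graph has a cycle.
        return list(TopologicalSorter(self.prerequisites).static_order())


def evaluate(graph: TaskGraph, events: list[str]) -> dict:
    """Rule-based check (assumed rule): an observed milestone is credited
    only if all of its prerequisites were credited earlier in the trajectory."""
    reached: set[str] = set()
    for event in events:
        needed = graph.prerequisites.get(event)
        if needed is not None and needed <= reached:
            reached.add(event)
    total = len(graph.prerequisites) or 1  # avoid division by zero
    return {
        "success": graph.goal in reached,
        "progress": len(reached) / total,
        "reached": reached,
    }


# Illustrative three-hop task; the agent is told only the final goal.
graph = TaskGraph(
    goal="reach_room",
    prerequisites={
        "find_key": set(),
        "open_door": {"find_key"},
        "reach_room": {"open_door"},
    },
)

# The first "open_door" attempt is not credited because the key is missing.
result = evaluate(graph, ["open_door", "find_key", "open_door"])
print(result["success"], round(result["progress"], 2))
```

In this sketch the partial-progress score separates an agent that completes some hops from one that completes none, which is one plausible reason to evaluate milestones rather than only final success.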