MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Abstract

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer, a benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the remaining tasks better reflect general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that this workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks, but their performance degrades sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion rates, and that larger models or thinking modes do not consistently translate into better performance. The code and dataset are available at https://github.com/Jometeorie/MineExplorer.
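To make the idea of a task graph with rule-based milestone evaluation concrete, the following is a minimal sketch and not MineExplorer's actual implementation. It assumes a multi-hop task can be modeled as a list of atomic milestones, each with a rule over a simplified world state and a list of prerequisite milestones that are never stated to the agent. All names here (`Milestone`, `evaluate`, the state keys) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical simplified world state: flag or item name -> integer value.
State = Dict[str, int]


@dataclass
class Milestone:
    """One atomic step with a rule-based success check (illustrative only)."""
    name: str
    check: Callable[[State], bool]
    requires: List[str] = field(default_factory=list)  # hidden prerequisites


def evaluate(milestones: List[Milestone], trajectory: List[State]) -> Dict[str, bool]:
    """Mark a milestone as reached once its rule holds in some state and
    all of its prerequisites have already been marked as reached."""
    reached = {m.name: False for m in milestones}
    for state in trajectory:
        for m in milestones:
            prerequisites_met = all(reached[r] for r in m.requires)
            if not reached[m.name] and prerequisites_met and m.check(state):
                reached[m.name] = True
    return reached


# A three-hop task: only the final goal would be given to the agent.
task = [
    Milestone("find_key", lambda s: s.get("has_key", 0) >= 1),
    Milestone("open_door", lambda s: s.get("door_open", 0) >= 1, requires=["find_key"]),
    Milestone("collect_item", lambda s: s.get("has_item", 0) >= 1, requires=["open_door"]),
]

# A hypothetical trajectory in which the agent stops after opening the door.
trajectory = [
    {"has_key": 1},
    {"has_key": 1, "door_open": 1},
    {"has_key": 1, "door_open": 1, "has_item": 0},
]

result = evaluate(task, trajectory)
progress = sum(result.values()) / len(result)  # fraction of milestones reached
```

Scoring by milestones rather than by final success alone gives partial credit on long trajectories, which is one plausible way to measure how far an agent gets before a hidden prerequisite blocks it.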