GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
April 8, 2026
Authors: Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou
cs.AI
Abstract
Toward embodied generalist agents that interact with the real world, Multimodal Large Language Model (MLLM) agents still face challenges of response latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed, with rich visual observations and closed-loop interaction demanding fine-grained perception, long-horizon planning, and precise control. However, systematic evaluation of these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. We study two game-agent interfaces: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. Results across 18 model-interface pairs show that even the best-performing agent remains far from human-level play. Extensive experiments with repeated full-benchmark reruns demonstrate the benchmark's robustness, while further studies on real-time interaction, context-memory sensitivity, and action validity expose additional challenges for game agents. By offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a solid foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
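To illustrate the second interface, the following is a minimal sketch of what deterministic Semantic Action Parsing could look like: a model's free-form action string is matched against a fixed grammar and mapped onto a small semantic action space, which is then translated into low-level key or mouse events. The action names, grammar, and key bindings below are illustrative assumptions, not GameWorld's actual implementation.

```python
import re

# Hypothetical semantic action space (names and arguments are assumptions).
SEMANTIC_ACTIONS = {"move", "jump", "click"}

# Hypothetical mapping from semantic actions to low-level key events.
KEY_BINDINGS = {
    ("move", "left"): [("keydown", "ArrowLeft"), ("keyup", "ArrowLeft")],
    ("move", "right"): [("keydown", "ArrowRight"), ("keyup", "ArrowRight")],
    ("jump",): [("keydown", "Space"), ("keyup", "Space")],
}

def parse_semantic_action(text: str):
    """Deterministically parse output like 'move(left)' or 'click(10, 20)'.

    Returns ("keyboard"|"mouse", events) on success, or None for any
    string outside the grammar -- invalid actions are rejected, never
    heuristically repaired.
    """
    m = re.fullmatch(r"\s*(\w+)\(([^)]*)\)\s*", text)
    if not m:
        return None
    name, raw_args = m.group(1).lower(), m.group(2)
    args = [a.strip() for a in raw_args.split(",")] if raw_args.strip() else []
    if name not in SEMANTIC_ACTIONS:
        return None
    if name == "click":
        # Clicks take integer pixel coordinates and map to a mouse event.
        if len(args) != 2 or not all(a.isdigit() for a in args):
            return None
        return ("mouse", [("click", int(args[0]), int(args[1]))])
    key = (name, *args)
    if key not in KEY_BINDINGS:
        return None
    return ("keyboard", KEY_BINDINGS[key])
```

Because the parser is a pure function of the action string, the same model output always yields the same control events, which is what makes outcome-based evaluation reproducible across reruns.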