AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Abstract

Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, and saturate quickly as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how, and how well, they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of the space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory, and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
