

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

February 19, 2026
作者: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
cs.AI

Abstract

Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how, and how well, they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy -- the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games by automatically sourcing and adapting standardized, containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory, and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
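The headline result above compares each model's score to the human average on the same game. As a minimal sketch of how such a human-normalized comparison could be computed (the `GameResult` interface and numbers here are hypothetical, not the paper's actual evaluation code):

```python
from dataclasses import dataclass

@dataclass
class GameResult:
    game: str
    model_score: float
    human_avg_score: float

    @property
    def normalized(self) -> float:
        # Model score as a fraction of the human average on the same game.
        return self.model_score / self.human_avg_score if self.human_avg_score else 0.0

def fraction_below(results: list[GameResult], threshold: float = 0.10) -> float:
    """Fraction of games where the model scores below `threshold`
    of the human average (the paper's '<10% on the majority' statistic)."""
    normed = [r.normalized for r in results]
    return sum(n < threshold for n in normed) / len(normed)

# Hypothetical results for three games:
results = [
    GameResult("puzzle-a", model_score=12.0, human_avg_score=400.0),  # 3%
    GameResult("arcade-b", model_score=5.0, human_avg_score=30.0),    # ~17%
    GameResult("sim-c", model_score=2.0, human_avg_score=100.0),      # 2%
]
print(fraction_below(results))  # 2 of 3 games fall below the 10% line
```

Normalizing per game keeps games with very different score scales comparable, which matters when aggregating over 100 heterogeneous titles.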