AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Abstract

Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, and they saturate quickly as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how, and how well, they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" as a game designed by humans for humans, and argue that the space of all such games people can imagine and enjoy, the "Multiverse of Human Games", is well suited to this kind of evaluation. Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and struggled especially with games that challenge world-model learning, memory, and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
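To make the headline comparison concrete, the following is a minimal sketch of one way a model's score could be expressed relative to the human average, and of how the share of games falling below a 10% threshold could be computed. The abstract does not specify how scores are aggregated, so the function names, the per-game averaging, and all numbers below are hypothetical and serve only as illustration.

```python
from statistics import mean


def human_normalized_score(model_score: float, human_scores: list[float]) -> float:
    """Return the model's score as a fraction of the human average score."""
    human_avg = mean(human_scores)
    if human_avg == 0:
        raise ValueError("human average score is zero; normalization is undefined")
    return model_score / human_avg


def fraction_of_games_below(normalized: dict[str, float], threshold: float = 0.10) -> float:
    """Return the fraction of games whose normalized score is below the threshold."""
    below = sum(1 for score in normalized.values() if score < threshold)
    return below / len(normalized)


# Made-up illustrative data (assumption): game -> (model score, human scores).
# These are not results from the paper.
games = {
    "game_a": (5.0, [100.0, 80.0, 120.0]),
    "game_b": (30.0, [50.0, 70.0]),
    "game_c": (2.0, [40.0, 60.0]),
}

normalized = {
    name: human_normalized_score(model, humans)
    for name, (model, humans) in games.items()
}
share_below = fraction_of_games_below(normalized)
```

Under this reading, a claim such as "less than 10% of the human average score on the majority of the games" corresponds to `share_below` exceeding 0.5 for a given model.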
