
Measuring General Intelligence with Generated Games

May 12, 2025
Authors: Vivek Verma, David Huang, William Chen, Dan Klein, Nicholas Tomlin
cs.AI

Abstract

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process in which new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their winrate against these RL agents: models are prompted with the game description, the current board state, and a list of valid moves, and then output the move they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, the data generation process, and the evaluation code to support future modeling work and expansion of our benchmark.
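The evaluation protocol in the abstract (an LLM player prompted with the game description, state, and valid moves, playing against a self-play RL opponent inside a Gym-style environment) can be sketched as follows. This is a minimal illustration, not the actual gg-bench code: the stub environment, the `llm_choose_move` placeholder, the `valid_moves` info key, and the random stand-in opponent are all assumptions made for the example.

```python
# Minimal sketch of a gg-bench-style winrate evaluation.
# All interfaces here are illustrative assumptions, not the real gg-bench API.
import random


class TinyRaceEnv:
    """Stub two-player game standing in for a generated Gym environment:
    players alternate adding 1-3 to a counter; whoever reaches 10 wins."""

    def reset(self):
        self.total = 0
        return self.total, {"valid_moves": [1, 2, 3]}

    def step(self, action):
        self.total += action
        done = self.total >= 10
        reward = 1.0 if done else 0.0  # reward goes to the player who just moved
        return self.total, reward, done, {"valid_moves": [1, 2, 3]}


def llm_choose_move(description, state, valid_moves):
    """Placeholder for prompting a language model with the game description,
    current state, and valid moves, then parsing its chosen move.
    Here it simply picks a random valid move."""
    return random.choice(valid_moves)


def play_episode(env, description, rl_agent_act):
    """Alternate turns between the LLM player and the RL opponent.
    Returns True if the LLM player wins."""
    obs, info = env.reset()
    llm_to_move = True
    while True:
        moves = info["valid_moves"]
        if llm_to_move:
            action = llm_choose_move(description, obs, moves)
        else:
            action = rl_agent_act(obs, moves)
        obs, reward, done, info = env.step(action)
        if done:
            # The player who made the final move receives the win reward.
            return llm_to_move and reward > 0
        llm_to_move = not llm_to_move


def winrate(env, description, rl_agent_act, n=1000):
    wins = sum(play_episode(env, description, rl_agent_act) for _ in range(n))
    return wins / n


if __name__ == "__main__":
    opponent = lambda obs, moves: random.choice(moves)  # stand-in RL agent
    print(winrate(TinyRaceEnv(), "Race to 10 by adding 1-3 each turn.", opponent))
```

In the benchmark itself, the opponent would be an RL agent trained via self-play on the generated game, and the winrate would be averaged across the collection of generated environments.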

