
Measuring General Intelligence with Generated Games

May 12, 2025
作者: Vivek Verma, David Huang, William Chen, Dan Klein, Nicholas Tomlin
cs.AI

Abstract

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process in which new evaluation instances can be produced at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their win rate against these RL agents: models are prompted with the game description, the current board state, and a list of valid moves, and then output the move they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve win rates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average win rates of 31-36%. We release the generated games, the data generation process, and the evaluation code in order to support future modeling work and expansion of our benchmark.
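To make the pipeline concrete, the sketch below shows what a generated Gym-style game environment and a win-rate evaluation loop might look like. This is a minimal illustration, not the authors' actual code: the `NimEnv` game, the agent functions, and the `play_match` helper are all hypothetical stand-ins for a generated game, and the class mimics the Gym `reset`/`step` interface in plain Python rather than subclassing `gym.Env`.

```python
import random


class NimEnv:
    """Minimal Gym-style two-player environment (a Nim variant standing
    in for one of the LLM-generated games)."""

    def reset(self):
        self.pile = 10           # tokens remaining on the board
        self.current_player = 1  # players alternate: 1 and -1
        return self.pile

    def valid_moves(self):
        # A player may take 1, 2, or 3 tokens, but no more than remain.
        return [n for n in (1, 2, 3) if n <= self.pile]

    def step(self, action):
        # Mirrors Gym's step contract: (observation, reward, done, info).
        assert action in self.valid_moves()
        self.pile -= action
        if self.pile == 0:
            # Whoever takes the last token wins.
            return self.pile, 1, True, {"winner": self.current_player}
        self.current_player *= -1
        return self.pile, 0, False, {}


def play_match(env, agent_a, agent_b):
    """Play one game; agents map env -> action. Returns the winner (1 or -1)."""
    env.reset()
    done = False
    while not done:
        agent = agent_a if env.current_player == 1 else agent_b
        _, _, done, info = env.step(agent(env))
    return info["winner"]


def random_agent(env):
    # Baseline opponent: picks uniformly among the listed valid moves,
    # analogous to an LLM choosing from the prompted move list.
    return random.choice(env.valid_moves())


def optimal_agent(env):
    # Perfect play for this Nim variant: always leave a multiple of 4.
    target = env.pile % 4
    return target if target in env.valid_moves() else random.choice(env.valid_moves())


random.seed(0)
wins = sum(play_match(NimEnv(), optimal_agent, random_agent) == 1 for _ in range(200))
print(f"optimal agent win rate vs. random: {wins / 200:.2f}")
```

In gg-bench the roles are reversed relative to this toy: the trained self-play RL agent plays one side, the evaluated language model plays the other, and the reported score is the model's win rate aggregated over many generated games.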

