
Measuring General Intelligence with Generated Games

May 12, 2025
Authors: Vivek Verma, David Huang, William Chen, Dan Klein, Nicholas Tomlin
cs.AI

Abstract

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process in which new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their winrate against these RL agents: models are prompted with the game description, the current board state, and a list of valid moves, and then output the move they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, the data generation process, and the evaluation code to support future modeling work and expansion of our benchmark.
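The evaluation protocol in the abstract (an LLM player prompted with the game description, state, and valid moves, playing against a self-play RL opponent inside a Gym-style environment) can be sketched as follows. This is a minimal illustration, not the actual gg-bench code: the stub environment, the `llm_choose_move` placeholder, the `valid_moves` info key, and the random stand-in opponent are all assumptions made for the example.

```python
# Minimal sketch of a gg-bench-style winrate evaluation.
# All interfaces here are illustrative assumptions, not the real gg-bench API.
import random


class TinyRaceEnv:
    """Stub two-player game standing in for a generated Gym environment:
    players alternate adding 1-3 to a counter; whoever reaches 10 wins."""

    def reset(self):
        self.total = 0
        return self.total, {"valid_moves": [1, 2, 3]}

    def step(self, action):
        self.total += action
        done = self.total >= 10
        reward = 1.0 if done else 0.0  # reward goes to the player who just moved
        return self.total, reward, done, {"valid_moves": [1, 2, 3]}


def llm_choose_move(description, state, valid_moves):
    """Placeholder for prompting a language model with the game description,
    current state, and valid moves, then parsing its chosen move.
    Here it simply picks a random valid move."""
    return random.choice(valid_moves)


def play_episode(env, description, rl_agent_act):
    """Alternate turns between the LLM player and the RL opponent.
    Returns True if the LLM player wins."""
    obs, info = env.reset()
    llm_to_move = True
    while True:
        moves = info["valid_moves"]
        if llm_to_move:
            action = llm_choose_move(description, obs, moves)
        else:
            action = rl_agent_act(obs, moves)
        obs, reward, done, info = env.step(action)
        if done:
            # The player who made the final move receives the win reward.
            return llm_to_move and reward > 0
        llm_to_move = not llm_to_move


def winrate(env, description, rl_agent_act, n=1000):
    wins = sum(play_episode(env, description, rl_agent_act) for _ in range(n))
    return wins / n


if __name__ == "__main__":
    opponent = lambda obs, moves: random.choice(moves)  # stand-in RL agent
    print(winrate(TinyRaceEnv(), "Race to 10 by adding 1-3 each turn.", opponent))
```

In the benchmark itself, the opponent would be an RL agent trained via self-play on the generated game, and the winrate would be averaged across the collection of generated environments.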

