
Measuring General Intelligence with Generated Games

May 12, 2025
作者: Vivek Verma, David Huang, William Chen, Dan Klein, Nicholas Tomlin
cs.AI

Abstract

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process in which new evaluation instances can be produced at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their win rate against these RL agents: models are prompted with the game description, the current board state, and a list of valid moves, and then output the move they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve win rates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average win rates of 31-36%. We release the generated games, the data generation process, and the evaluation code in order to support future modeling work and expansion of our benchmark.
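To make the pipeline concrete, the sketch below shows what a generated Gym-style game environment and a win-rate evaluation loop might look like. This is a minimal illustration, not the authors' actual code: the `NimEnv` game, the agent functions, and the `play_match` helper are all hypothetical stand-ins for a generated game, and the class mimics the Gym `reset`/`step` interface in plain Python rather than subclassing `gym.Env`.

```python
import random


class NimEnv:
    """Minimal Gym-style two-player environment (a Nim variant standing
    in for one of the LLM-generated games)."""

    def reset(self):
        self.pile = 10           # tokens remaining on the board
        self.current_player = 1  # players alternate: 1 and -1
        return self.pile

    def valid_moves(self):
        # A player may take 1, 2, or 3 tokens, but no more than remain.
        return [n for n in (1, 2, 3) if n <= self.pile]

    def step(self, action):
        # Mirrors Gym's step contract: (observation, reward, done, info).
        assert action in self.valid_moves()
        self.pile -= action
        if self.pile == 0:
            # Whoever takes the last token wins.
            return self.pile, 1, True, {"winner": self.current_player}
        self.current_player *= -1
        return self.pile, 0, False, {}


def play_match(env, agent_a, agent_b):
    """Play one game; agents map env -> action. Returns the winner (1 or -1)."""
    env.reset()
    done = False
    while not done:
        agent = agent_a if env.current_player == 1 else agent_b
        _, _, done, info = env.step(agent(env))
    return info["winner"]


def random_agent(env):
    # Baseline opponent: picks uniformly among the listed valid moves,
    # analogous to an LLM choosing from the prompted move list.
    return random.choice(env.valid_moves())


def optimal_agent(env):
    # Perfect play for this Nim variant: always leave a multiple of 4.
    target = env.pile % 4
    return target if target in env.valid_moves() else random.choice(env.valid_moves())


random.seed(0)
wins = sum(play_match(NimEnv(), optimal_agent, random_agent) == 1 for _ in range(200))
print(f"optimal agent win rate vs. random: {wins / 200:.2f}")
```

In gg-bench the roles are reversed relative to this toy: the trained self-play RL agent plays one side, the evaluated language model plays the other, and the reported score is the model's win rate aggregated over many generated games.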

