Measuring General Intelligence with Generated Games

Abstract

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process from which new evaluation instances can be generated at will. Specifically, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their win rate against these RL agents: each model is prompted with the game description, the current board state, and a list of valid moves, and then outputs the move it wishes to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve win rates of 7-9% using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average win rates of 31-36%. We release the generated games, the data generation process, and the evaluation code to support future modeling work and expansion of the benchmark.
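
As a rough illustration of the evaluation protocol described above, the following Python sketch plays a model against an opponent and reports the model's win rate. It is a minimal sketch under stated assumptions, not the released evaluation code: `CountdownGame`, `build_prompt`, `play_episode`, `win_rate`, and `random_policy` are hypothetical names, the toy game is not one of the generated games, and the random opponent merely stands in for an RL agent trained via self-play. The model is represented as a plain callable that receives the prompt text and the list of valid moves.

```python
import random
from typing import Callable, List, Tuple

# A policy maps (prompt text, valid moves) to one chosen move.
Policy = Callable[[str, List[int]], int]


class CountdownGame:
    """Hypothetical toy game with a Gym-like reset/step interface."""

    description = (
        "Players alternate subtracting 1, 2, or 3 from a counter. "
        "The player who brings the counter to exactly 0 wins."
    )

    def __init__(self, start: int = 10) -> None:
        self.start = start
        self.counter = start

    def reset(self) -> int:
        self.counter = self.start
        return self.counter

    def valid_moves(self) -> List[int]:
        return [m for m in (1, 2, 3) if m <= self.counter]

    def step(self, move: int) -> Tuple[int, bool]:
        # Assumption: an invalid move is treated as an error here.
        if move not in self.valid_moves():
            raise ValueError(f"invalid move: {move}")
        self.counter -= move
        return self.counter, self.counter == 0


def build_prompt(env: CountdownGame) -> str:
    """Assemble the three prompt parts: rules, current state, valid moves."""
    return (
        f"Game rules: {env.description}\n"
        f"Current state: counter = {env.counter}\n"
        f"Valid moves: {env.valid_moves()}\n"
        "Reply with exactly one valid move."
    )


def play_episode(env: CountdownGame, model: Policy, opponent: Policy) -> bool:
    """Play one game with the model moving first; return True if it wins."""
    env.reset()
    players = [model, opponent]
    turn = 0
    while True:
        move = players[turn](build_prompt(env), env.valid_moves())
        _, done = env.step(move)
        if done:
            return turn == 0  # the player who made the last move wins
        turn = 1 - turn


def win_rate(model: Policy, opponent: Policy, episodes: int = 100) -> float:
    """Fraction of episodes won by the model."""
    env = CountdownGame()
    wins = sum(play_episode(env, model, opponent) for _ in range(episodes))
    return wins / episodes


def random_policy(prompt: str, valid_moves: List[int]) -> int:
    """Stand-in opponent; the benchmark itself uses trained RL agents."""
    return random.choice(valid_moves)
```

In practice, the `model` callable would send the prompt to a language model and parse its reply into one of the valid moves; how the released code handles unparseable or invalid replies is not specified here.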
