Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Abstract

We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains, including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG for both evaluating reasoning models and training them with reinforcement learning.
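
To make the pairing of a procedural generator with a verifier concrete, the following is a minimal sketch in plain Python. It does not use RG's actual API: the names `Task`, `generate_task`, and `verify`, as well as the `difficulty` parameter, are hypothetical and chosen only for illustration. The sketch assumes a simple addition task in which difficulty controls both the number of terms and their magnitude.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical container for one generated problem (not RG's data format).
    question: str
    answer: str


def generate_task(seed: int, difficulty: int) -> Task:
    """Procedurally build one addition task; difficulty sets term count and size."""
    # A seeded generator makes each (seed, difficulty) pair reproducible,
    # while varying the seed yields an effectively unbounded supply of tasks.
    rng = random.Random(seed)
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    question = " + ".join(str(t) for t in terms) + " = ?"
    return Task(question=question, answer=str(sum(terms)))


def verify(task: Task, response: str) -> float:
    """Verifiable reward: 1.0 if the response matches the computed answer, else 0.0."""
    return 1.0 if response.strip() == task.answer else 0.0
```

Because the answer is computed while the task is generated, the reward needs no human labels, and raising `difficulty` produces harder instances from the same generator. A call such as `verify(generate_task(seed=0, difficulty=2), "123")` returns a reward that could be fed to a reinforcement learning loop.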