Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Abstract

We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains, including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG both for evaluating reasoning models and for training them with reinforcement learning.
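
To make the idea of procedural generation with a verifier concrete, the following is a minimal sketch and not RG's actual API. The names `Task`, `generate_task`, `verify`, and the `num_terms` difficulty knob are hypothetical and chosen only for illustration. The sketch assumes a toy addition domain in which a seed fixes the task, a single parameter adjusts complexity, and the reward is an exact-match check.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    """A generated problem together with its reference answer (hypothetical type)."""
    question: str
    answer: str


def generate_task(seed: int, num_terms: int) -> Task:
    """Procedurally generate one addition task; num_terms controls difficulty."""
    rng = random.Random(seed)  # same seed and difficulty always yield the same task
    terms = [rng.randint(1, 99) for _ in range(num_terms)]
    question = " + ".join(str(t) for t in terms) + " = ?"
    return Task(question=question, answer=str(sum(terms)))


def verify(task: Task, model_answer: str) -> float:
    """Return a verifiable reward: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if model_answer.strip() == task.answer else 0.0


if __name__ == "__main__":
    # Sweep the difficulty knob; each setting yields a fresh, checkable task.
    for difficulty in (2, 4, 8):
        task = generate_task(seed=0, num_terms=difficulty)
        print(difficulty, task.question, verify(task, task.answer))
```

Because tasks are a function of a seed and a difficulty parameter, the supply of distinct tasks in this sketch is limited only by the seed space, and the same verifier can serve as a reward signal during training and as a scorer during evaluation.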