Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Abstract

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data because their complexity is controllable and their answers are straightforward to verify. We make three key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it generalizes to the challenging math benchmarks AIME and AMC.
