Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
February 20, 2025
作者: Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo
cs.AI
Abstract
Inspired by the success of DeepSeek-R1, we explore the potential of
rule-based reinforcement learning (RL) in large reasoning models. To analyze
reasoning dynamics, we use synthetic logic puzzles as training data due to
their controllable complexity and straightforward answer verification. We make
several key technical contributions that lead to effective and stable RL
training: a system prompt that emphasizes the thinking and answering process,
a stringent format reward function that penalizes outputs that take shortcuts,
and a straightforward training recipe that achieves stable convergence. Our 7B
model develops advanced reasoning skills, such as reflection, verification,
and summarization, that are absent from the logic corpus. Remarkably, after
training on just 5K logic problems, it demonstrates generalization to the
challenging math benchmarks AIME and AMC.