Process Reinforcement through Implicit Rewards
February 3, 2025
Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
cs.AI
Abstract
Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained feedback has the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement over the SFT model across several key reasoning benchmarks. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with only 10% of its training data.
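As an informal illustration of the "implicit process rewards" the abstract refers to, the sketch below shows one way dense per-token rewards can be obtained without process labels: as a scaled log-probability ratio between a reward model trained only on outcome labels and a frozen reference model, with the reward model kept in sync online via a binary cross-entropy loss on each rollout's outcome. The function names, the beta value of 0.05, and the toy usage are illustrative assumptions, not the authors' released implementation.

```python
import torch

def implicit_process_rewards(logp_phi: torch.Tensor,
                             logp_ref: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    # Per-token implicit reward r_t = beta * (log pi_phi(y_t|y_<t) - log pi_ref(y_t|y_<t)),
    # i.e. a scaled log-probability ratio between the outcome-trained model and the reference.
    return beta * (logp_phi - logp_ref)

def outcome_only_loss(logp_phi: torch.Tensor,
                      logp_ref: torch.Tensor,
                      outcome_label: float,
                      beta: float = 0.05) -> torch.Tensor:
    # Update the reward model online using nothing but an outcome label:
    # treat the rollout's summed implicit reward as a logit for "final answer correct"
    # and apply binary cross-entropy against the 0/1 outcome label.
    seq_logit = implicit_process_rewards(logp_phi, logp_ref, beta).sum()
    target = torch.tensor(float(outcome_label))
    return torch.nn.functional.binary_cross_entropy_with_logits(seq_logit, target)

# Toy usage with random token log-probabilities for a 6-token rollout:
logp_phi = torch.randn(6)
logp_ref = torch.randn(6)
print(implicit_process_rewards(logp_phi, logp_ref))   # dense per-token rewards
print(outcome_only_loss(logp_phi, logp_ref, 1.0))     # loss for a rollout judged correct
```

In this reading, the per-token rewards come "for free" from a model trained only on outcome labels, which is what lets PRIME skip a dedicated PRM training phase and update the reward model with ordinary policy rollouts.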