Process Reinforcement through Implicit Rewards
February 3, 2025
Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
cs.AI
Abstract
Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained feedback has the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement over the SFT model across several key reasoning benchmarks. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with only 10% of its training data.
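As an informal illustration of the "implicit process rewards" the abstract refers to, the sketch below shows one way dense per-token rewards can be obtained without process labels: as a scaled log-probability ratio between a reward model trained only on outcome labels and a frozen reference model, with the reward model kept in sync online via a binary cross-entropy loss on each rollout's outcome. The function names, the beta value of 0.05, and the toy usage are illustrative assumptions, not the authors' released implementation.

```python
import torch

def implicit_process_rewards(logp_phi: torch.Tensor,
                             logp_ref: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    # Per-token implicit reward r_t = beta * (log pi_phi(y_t|y_<t) - log pi_ref(y_t|y_<t)),
    # i.e. a scaled log-probability ratio between the outcome-trained model and the reference.
    return beta * (logp_phi - logp_ref)

def outcome_only_loss(logp_phi: torch.Tensor,
                      logp_ref: torch.Tensor,
                      outcome_label: float,
                      beta: float = 0.05) -> torch.Tensor:
    # Update the reward model online using nothing but an outcome label:
    # treat the rollout's summed implicit reward as a logit for "final answer correct"
    # and apply binary cross-entropy against the 0/1 outcome label.
    seq_logit = implicit_process_rewards(logp_phi, logp_ref, beta).sum()
    target = torch.tensor(float(outcome_label))
    return torch.nn.functional.binary_cross_entropy_with_logits(seq_logit, target)

# Toy usage with random token log-probabilities for a 6-token rollout:
logp_phi = torch.randn(6)
logp_ref = torch.randn(6)
print(implicit_process_rewards(logp_phi, logp_ref))   # dense per-token rewards
print(outcome_only_loss(logp_phi, logp_ref, 1.0))     # loss for a rollout judged correct
```

In this reading, the per-token rewards come "for free" from a model trained only on outcome labels, which is what lets PRIME skip a dedicated PRM training phase and update the reward model with ordinary policy rollouts.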