Process Reinforcement through Implicit Rewards

Abstract

Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. Dense rewards are also an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained signal has the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, yet this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online: collecting high-quality process labels is prohibitively expensive, which leaves PRMs particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
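The abstract does not give the formula behind "implicit process rewards". As a rough illustration only, the sketch below assumes the log-ratio parameterization commonly used for implicit PRMs: each token's reward is a scaled difference between the log-probability under the reward model and under a frozen reference model, and the sum of those rewards is trained against the outcome label with a cross-entropy loss. The function names, the `beta` value, and the sample log-probabilities are hypothetical and are not taken from the PRIME implementation.

```python
import math


def implicit_process_rewards(logp_prm, logp_ref, beta=0.05):
    """Token-level rewards, assumed form: beta * (log pi_prm - log pi_ref)."""
    if len(logp_prm) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    return [beta * (p - r) for p, r in zip(logp_prm, logp_ref)]


def outcome_loss(token_rewards, label):
    """Binary cross-entropy between sigmoid(sum of rewards) and a 0/1 outcome label."""
    score = sum(token_rewards)  # sequence-level score implied by the token rewards
    prob = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# Hypothetical per-token log-probabilities for a three-token rollout.
logp_prm = [-1.0, -0.5, -2.0]  # under the implicit PRM (assumed values)
logp_ref = [-1.5, -0.5, -1.0]  # under the frozen reference model (assumed values)

rewards = implicit_process_rewards(logp_prm, logp_ref, beta=0.5)
loss = outcome_loss(rewards, label=1)  # label=1 marks a correct final answer
```

Under this assumed form, only the outcome label enters the loss, while the per-token terms can still be read off as dense rewards during RL, which is consistent with the abstract's claim that no process labels are needed.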

