Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
September 25, 2025
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
cs.AI
Abstract
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signal to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code is available at https://github.com/Jun-Kai-Zhang/rubrics.git.
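As a rough illustration of the general idea only (not the authors' implementation; see the repository above for their code), a rubric-based reward scores a response against a checklist of criteria rather than against any single exemplar, so off-policy exemplars can inform the criteria without their surface artifacts leaking into the reward. The minimal Python sketch below assumes a hypothetical per-criterion judge callable and simple weighted aggregation; all names and weights are illustrative assumptions.

    # Minimal sketch, assuming a per-criterion judge; not the paper's method.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Criterion:
        description: str   # e.g. "cites evidence for every factual claim" (illustrative)
        weight: float      # relative importance of this criterion (illustrative)

    def rubric_reward(
        response: str,
        rubric: List[Criterion],
        judge: Callable[[str, str], float],  # maps (response, criterion) to a score in [0, 1]
    ) -> float:
        """Weighted average of per-criterion judgments: the rubric, not the
        surface form of any exemplar, determines the reward."""
        total_weight = sum(c.weight for c in rubric)
        if total_weight == 0:
            return 0.0
        score = sum(c.weight * judge(response, c.description) for c in rubric)
        return score / total_weight

In practice the judge would typically be an LLM grader prompted with one criterion at a time, but any scorer with that interface fits the sketch.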