

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

September 25, 2025
作者: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
cs.AI

Abstract

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git.
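To make the general idea concrete, the sketch below shows one common way a rubric-based reward can be computed: each rubric item is a criterion checked by a judge model, and the reward is the weighted fraction of criteria the response satisfies. This is a minimal illustrative sketch only; the names `RubricItem` and `judge_satisfies`, and the simple weighted-average aggregation, are assumptions for illustration, not the paper's actual implementation (see the linked repository for that).

```python
# Minimal sketch of a rubric-based reward (illustrative only, not the paper's
# implementation). Each rubric item is a yes/no criterion; an LLM judge
# (stubbed here as `judge_satisfies`) checks the response against each item,
# and the reward is the weighted fraction of satisfied criteria.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str   # e.g. "Addresses all sub-questions in the prompt"
    weight: float = 1.0


def rubric_reward(
    prompt: str,
    response: str,
    rubric: List[RubricItem],
    judge_satisfies: Callable[[str, str, str], bool],
) -> float:
    """Score `response` to `prompt` as the weighted fraction of rubric
    criteria the judge marks as satisfied; the result lies in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(
        item.weight
        for item in rubric
        if judge_satisfies(prompt, response, item.criterion)
    )
    return earned / total
```

Because the reward depends only on whether criteria are met, not on surface similarity to any exemplar, off-policy exemplars can inform which criteria go into the rubric without the policy being rewarded for imitating their artifacts.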