Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
September 25, 2025
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
cs.AI
Abstract
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signal to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code is available at https://github.com/Jun-Kai-Zhang/rubrics.git.
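As a rough illustration of the general idea only (not the authors' implementation; see the repository above for their code), a rubric-based reward scores a response against a checklist of criteria rather than against any single exemplar, so off-policy exemplars can inform the criteria without their surface artifacts leaking into the reward. The minimal Python sketch below assumes a hypothetical per-criterion judge callable and simple weighted aggregation; all names and weights are illustrative assumptions.

    # Minimal sketch, assuming a per-criterion judge; not the paper's method.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Criterion:
        description: str   # e.g. "cites evidence for every factual claim" (illustrative)
        weight: float      # relative importance of this criterion (illustrative)

    def rubric_reward(
        response: str,
        rubric: List[Criterion],
        judge: Callable[[str, str], float],  # maps (response, criterion) to a score in [0, 1]
    ) -> float:
        """Weighted average of per-criterion judgments: the rubric, not the
        surface form of any exemplar, determines the reward."""
        total_weight = sum(c.weight for c in rubric)
        if total_weight == 0:
            return 0.0
        score = sum(c.weight * judge(response, c.description) for c in rubric)
        return score / total_weight

In practice the judge would typically be an LLM grader prompted with one criterion at a time, but any scorer with that interface fits the sketch.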