Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
September 25, 2025
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
cs.AI
Abstract
Reinforcement fine-tuning (RFT) often suffers from reward
over-optimization, where a policy model hacks the reward signals to achieve
high scores while producing low-quality outputs. Our theoretical analysis shows
that the key lies in reward misspecification at the high-reward tail: the
inability to reliably distinguish Excellent responses from merely Great ones.
This motivates us to focus on the high-reward region. However, such tail
examples are scarce under the base LLM. While off-policy exemplars (e.g., from
stronger models or rewrites) are easier to obtain, naively training on them
yields a misspecified reward for the policy we aim to align. To address this,
we study rubric-based rewards. By design, rubrics can leverage off-policy
examples while remaining insensitive to their artifacts. To elicit rubrics that
capture the high-reward tail, we highlight the importance of distinguishing
among great and diverse responses, and introduce a workflow to implement this
idea. We empirically demonstrate that rubric-based rewards substantially
mitigate reward over-optimization and deliver effective LLM post-training
improvements. Our code can be accessed at
https://github.com/Jun-Kai-Zhang/rubrics.git.
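
To make the idea of a rubric-based reward concrete, the sketch below scores a response by judging it against each rubric criterion independently and aggregating the weighted judgments into a scalar reward. This is a minimal illustration only: the names (RubricCriterion, rubric_reward, judge) and the aggregation rule are assumptions for exposition, not the paper's implementation; the actual rubric-elicitation workflow is in the linked repository.

# Illustrative sketch of a rubric-based reward (names and aggregation are
# assumptions, not the paper's implementation).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricCriterion:
    description: str   # e.g. "Supports the main claim with a concrete, verifiable example"
    weight: float = 1.0


def rubric_reward(
    prompt: str,
    response: str,
    rubric: List[RubricCriterion],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Return a reward in [0, 1] as the weighted fraction of satisfied criteria.

    `judge(prompt, response, criterion_description)` is assumed to be an
    LLM-based verifier that answers whether the response satisfies the
    criterion. Because criteria describe desired properties rather than
    imitating a reference answer, off-policy exemplars can inform which
    criteria to include without their stylistic artifacts leaking into
    the reward.
    """
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    satisfied = sum(c.weight for c in rubric if judge(prompt, response, c.description))
    return satisfied / total

In this reading, criteria that separate merely great responses from excellent ones give the reward resolution in the high-reward tail, which is where the abstract locates the misspecification behind reward over-optimization.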