Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

September 25, 2025
作者: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
cs.AI

Abstract

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git.
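
The abstract does not spell out how rubric items are turned into a scalar reward, so the snippet below is only a minimal, hypothetical sketch of the general idea: each criterion is scored independently (e.g., by an LLM judge) and the per-criterion scores are combined into a weighted average. The names `RubricItem`, `rubric_reward`, and `toy_judge` are illustrative assumptions, not the authors' implementation; see the linked repository for the actual workflow.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One criterion in a rubric, with a weight reflecting its importance."""
    description: str
    weight: float


def rubric_reward(
    response: str,
    rubric: List[RubricItem],
    judge: Callable[[str, str], float],
) -> float:
    """Aggregate per-criterion judge scores (each in [0, 1]) into a scalar reward.

    `judge(criterion, response)` is a stand-in for an LLM grader; any scorer
    returning a value in [0, 1] can be plugged in.
    """
    total_weight = sum(item.weight for item in rubric)
    if total_weight == 0.0:
        return 0.0
    weighted_sum = sum(
        item.weight * judge(item.description, response) for item in rubric
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Toy keyword-based judge, used only to make the sketch runnable;
    # in practice this would be an LLM judging the response against the criterion.
    def toy_judge(criterion: str, response: str) -> float:
        return 1.0 if any(
            word in response.lower() for word in criterion.lower().split()
        ) else 0.0

    rubric = [
        RubricItem("cites a concrete example", weight=2.0),
        RubricItem("states the key assumption explicitly", weight=1.0),
    ]
    reward = rubric_reward(
        "Here is a concrete example with the assumption stated.", rubric, toy_judge
    )
    print(f"rubric-based reward: {reward:.2f}")
```

Because each criterion is judged on its own, a rubric like this can be distilled from off-policy exemplars (stronger models or rewrites) without the aggregate reward inheriting their stylistic artifacts, which is the property the abstract attributes to rubric-based rewards.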