Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

September 25, 2025
作者: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
cs.AI

Abstract

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git.
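
The abstract does not spell out how rubric items are turned into a scalar reward, so the snippet below is only a minimal, hypothetical sketch of the general idea: each criterion is scored independently (e.g., by an LLM judge) and the per-criterion scores are combined into a weighted average. The names `RubricItem`, `rubric_reward`, and `toy_judge` are illustrative assumptions, not the authors' implementation; see the linked repository for the actual workflow.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One criterion in a rubric, with a weight reflecting its importance."""
    description: str
    weight: float


def rubric_reward(
    response: str,
    rubric: List[RubricItem],
    judge: Callable[[str, str], float],
) -> float:
    """Aggregate per-criterion judge scores (each in [0, 1]) into a scalar reward.

    `judge(criterion, response)` is a stand-in for an LLM grader; any scorer
    returning a value in [0, 1] can be plugged in.
    """
    total_weight = sum(item.weight for item in rubric)
    if total_weight == 0.0:
        return 0.0
    weighted_sum = sum(
        item.weight * judge(item.description, response) for item in rubric
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Toy keyword-based judge, used only to make the sketch runnable;
    # in practice this would be an LLM judging the response against the criterion.
    def toy_judge(criterion: str, response: str) -> float:
        return 1.0 if any(
            word in response.lower() for word in criterion.lower().split()
        ) else 0.0

    rubric = [
        RubricItem("cites a concrete example", weight=2.0),
        RubricItem("states the key assumption explicitly", weight=1.0),
    ]
    reward = rubric_reward(
        "Here is a concrete example with the assumption stated.", rubric, toy_judge
    )
    print(f"rubric-based reward: {reward:.2f}")
```

Because each criterion is judged on its own, a rubric like this can be distilled from off-policy exemplars (stronger models or rewrites) without the aggregate reward inheriting their stylistic artifacts, which is the property the abstract attributes to rubric-based rewards.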