P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
February 12, 2026
Authors: Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Ze Xu, Fei Huang, Kai Zhang, Yongbin Li
cs.AI
Abstract
Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling to generalize to new users with limited feedback. To address these limitations, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment while retaining test-time scalability.
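The abstract does not spell out how the two granularities are combined, so the following is only a minimal sketch, under our own assumptions, of what a dual-granularity blend of individual-level and prototype-level scores could look like. All names here (dual_granularity_score, alpha, the cosine-similarity prototype lookup) are hypothetical illustrations and are not taken from the P-GenRM implementation.

```python
import numpy as np

def dual_granularity_score(
    user_scores: np.ndarray,         # scores from K sampled scoring schemes for this user, shape (K,)
    user_embedding: np.ndarray,      # preference embedding of the target user, shape (d,)
    prototype_centroids: np.ndarray, # centroids of the User Prototype clusters, shape (P, d)
    prototype_scores: np.ndarray,    # each prototype's score for the candidate response, shape (P,)
    alpha: float = 0.7,              # hypothetical weight on the individual-level estimate
) -> float:
    """Illustrative two-level aggregation: average the user's own sampled
    scoring schemes, then blend in the score of the most similar User Prototype."""
    # Individual level: aggregate the user's sampled scoring schemes.
    individual_estimate = user_scores.mean()

    # Prototype level: pick the nearest prototype by cosine similarity.
    sims = prototype_centroids @ user_embedding / (
        np.linalg.norm(prototype_centroids, axis=1) * np.linalg.norm(user_embedding) + 1e-8
    )
    prototype_estimate = prototype_scores[np.argmax(sims)]

    # Blend the two granularities into a single reward signal.
    return float(alpha * individual_estimate + (1 - alpha) * prototype_estimate)

# Toy usage with synthetic inputs (illustration only).
rng = np.random.default_rng(0)
score = dual_granularity_score(
    user_scores=rng.uniform(1, 5, size=4),
    user_embedding=rng.normal(size=8),
    prototype_centroids=rng.normal(size=(3, 8)),
    prototype_scores=np.array([3.2, 4.1, 2.7]),
)
print(f"blended reward: {score:.2f}")
```

In this toy setting, alpha trades off trust in the user's own (possibly noisy) inferred preferences against evidence borrowed from similar users, which mirrors the noise-mitigation and prototype-based transfer motivation described in the abstract.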