

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

April 8, 2026
Authors: Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao
cs.AI

Abstract

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
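The abstract's headline metric (accuracy on chosen/rejected pairs) and the Best-of-N downstream setup follow standard reward-model evaluation practice. The sketch below is a minimal illustration of both ideas, not the paper's implementation: `score_fn` stands in for an arbitrary reward model, and all names and example data are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's code): pairwise accuracy on a
# preference benchmark and Best-of-N selection with a generic reward scorer.
from typing import Callable, Iterable


def pairwise_accuracy(
    pairs: Iterable[tuple[str, str, str]],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of (prompt, chosen, rejected) pairs where the reward model
    scores the chosen response strictly above the rejected one."""
    pairs = list(pairs)
    correct = sum(
        score_fn(prompt, chosen) > score_fn(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)


def best_of_n(
    prompt: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate response the reward model scores highest."""
    return max(candidates, key=lambda response: score_fn(prompt, response))


if __name__ == "__main__":
    # Toy stand-in for a reward model: prefers longer responses.
    toy_score = lambda prompt, response: float(len(response))
    demo_pairs = [
        ("Explain BoN sampling.", "A detailed, rubric-following answer.", "Short.")
    ]
    print(pairwise_accuracy(demo_pairs, toy_score))  # 1.0
    print(best_of_n("Explain BoN sampling.", ["Short.", "A longer candidate."], toy_score))
```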