Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
April 8, 2026
Authors: Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao
cs.AI
Abstract
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
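As a rough sketch of how a benchmark of this kind is typically consumed, the snippet below computes a reward model's pairwise accuracy over chosen/rejected pairs and performs Best-of-N selection. The model name, the dataset field names (`user_profile`, `prompt`, `chosen`, `rejected`), and the `score` helper are illustrative assumptions for this sketch, not the paper's released interface.

```python
# Minimal sketch, assuming a sequence-classification reward model loaded via
# Hugging Face transformers and a list of personalized preference examples with
# the fields used below; model name and data layout are illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example RM (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def score(prompt: str, response: str) -> float:
    """Scalar reward the RM assigns to a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

def pairwise_accuracy(examples: list[dict]) -> float:
    """Fraction of pairs where the user-preferred (chosen) response outscores the rejected one."""
    hits = 0
    for ex in examples:
        # Prepend the user-specific rubric so the RM can condition on individual preferences.
        prompt = f"{ex['user_profile']}\n\n{ex['prompt']}"
        hits += score(prompt, ex["chosen"]) > score(prompt, ex["rejected"])
    return hits / len(examples)

def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Best-of-N sampling: return the candidate the reward model ranks highest."""
    return max(candidates, key=lambda r: score(prompt, r))
```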