Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Abstract

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between the two responses in a pair is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance, and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) than existing baselines do. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
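To illustrate the kind of pairwise accuracy reported above, the following is a minimal sketch, not the authors' evaluation code. It assumes each benchmark item holds a user rubric, a prompt, a chosen response, and a rejected response, and that a reward model can be called as a function returning a scalar score. The names `PreferencePair`, `pairwise_accuracy`, and the toy scorer `keyword_reward` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One benchmark item (hypothetical schema, not taken from the paper)."""
    rubric: str    # user-specific preference the response should follow
    prompt: str
    chosen: str    # response that adheres to the rubric
    rejected: str  # response that violates the rubric


# Assumption: a reward model maps (rubric, prompt, response) to a scalar score.
RewardFn = Callable[[str, str, str], float]


def pairwise_accuracy(pairs: Sequence[PreferencePair], reward: RewardFn) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        reward(p.rubric, p.prompt, p.chosen) > reward(p.rubric, p.prompt, p.rejected)
        for p in pairs
    )
    return correct / len(pairs)


def keyword_reward(rubric: str, prompt: str, response: str) -> float:
    """Toy stand-in for a real reward model: counts rubric words in the response."""
    rubric_words = set(rubric.lower().split())
    return float(sum(word in rubric_words for word in response.lower().split()))


# Two made-up items; the toy scorer ranks the first correctly and the second wrongly.
pairs = [
    PreferencePair(
        rubric="prefers concise bullet answers",
        prompt="How do I brew tea?",
        chosen="concise bullet steps: boil, steep, pour",
        rejected="A long essay on tea history",
    ),
    PreferencePair(
        rubric="likes formal tone",
        prompt="Greet me.",
        chosen="Good day. It is a pleasure to meet you.",
        rejected="Hey, formal stuff is boring!",
    ),
]
accuracy = pairwise_accuracy(pairs, keyword_reward)
```

The second item shows why surface-level scoring is not enough: the rejected response mentions the rubric word while violating the preference, so the toy scorer ranks the pair incorrectly. A real evaluation would replace `keyword_reward` with calls to the reward model under test.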
