Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Abstract

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, so that each preference distinction is tailored to the individual user. Human evaluations confirm that the primary discriminative factor between the two responses in a pair is personal preference, with both responses maintaining high general quality (e.g., correctness, relevance, and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) than existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
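The accuracy figure quoted above is a pairwise metric over chosen and rejected responses. As a rough illustration only, the sketch below shows one common way such a metric can be computed; it is not the authors' evaluation code. The names `pairwise_accuracy`, `toy_score`, and `toy_pairs` are hypothetical, the toy scorer merely stands in for a real reward model, and counting ties as errors is an assumption.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical pair layout: (prompt including user context, chosen, rejected).
Pair = Tuple[str, str, str]


def pairwise_accuracy(score: Callable[[str, str], float],
                      pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the scorer ranks chosen above rejected.

    Assumption: a tie counts as an error (strict inequality).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    # Stand-in for a reward model: always prefers the shorter response
    # and ignores the user context in the prompt entirely.
    return -float(len(response))


# Two made-up pairs whose users want opposite things.
toy_pairs = [
    ("User prefers brief answers. Q: What is the capital of France?",
     "Paris.",
     "The capital of France is the city of Paris."),
    ("User prefers detailed answers. Q: What is the capital of France?",
     "The capital of France is Paris, which lies on the Seine.",
     "Paris."),
]

accuracy = pairwise_accuracy(toy_score, toy_pairs)
print(f"pairwise accuracy: {accuracy:.2f}")
```

Because the toy scorer ignores the user context, it gets one pair right and the other wrong. This is the failure mode the benchmark is meant to expose: a scorer that tracks only general properties of a response cannot separate pairs that differ only in personal preference.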
