Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Abstract

Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed reward hacking. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are underspecified: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their pretraining seeds generalize better than ensembles that differ only by their fine-tuning seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
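To make the inference-time setting concrete, the following is a minimal sketch of reranking with an aggregated ensemble reward. It is an illustration under stated assumptions, not the authors' implementation: the function names (`ensemble_reward`, `rerank_best_of_n`), the two toy reward functions, and the choice of mean and min as aggregators are all hypothetical, since the abstract does not say which aggregation functions are used.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: a reward model maps (prompt, response) to a scalar.
RewardFn = Callable[[str, str], float]


def ensemble_reward(prompt: str, response: str,
                    reward_models: Sequence[RewardFn],
                    aggregate: Callable[[Sequence[float]], float] = mean) -> float:
    """Score one response with every ensemble member, then aggregate."""
    scores = [rm(prompt, response) for rm in reward_models]
    return aggregate(scores)


def rerank_best_of_n(prompt: str, candidates: Sequence[str],
                     reward_models: Sequence[RewardFn],
                     aggregate: Callable[[Sequence[float]], float] = mean) -> str:
    """Return the candidate with the highest aggregated ensemble reward."""
    return max(candidates,
               key=lambda c: ensemble_reward(prompt, c, reward_models, aggregate))


# Toy stand-ins for learned reward models (assumptions for illustration only).
def rm_length(prompt: str, response: str) -> float:
    # Flawed model: longer responses always score higher, so it can be hacked.
    return len(response) / 10.0


def rm_reason(prompt: str, response: str) -> float:
    # Rewards responses that give a justification.
    return 1.0 if "because" in response else 0.0


candidates = [
    "yes",
    "yes because it follows from the premise",
    "yes " * 20,  # padded response that exploits rm_length
]

single_pick = rerank_best_of_n("Is the claim valid?", candidates, [rm_length])
ensemble_pick = rerank_best_of_n("Is the claim valid?", candidates,
                                 [rm_length, rm_reason], aggregate=min)
```

In this toy setup the single flawed model selects the padded response, while the conservative min aggregate selects the justified one, because the second model disagrees with the first on the padded response. If every ensemble member shared the length bias, no aggregate could correct it, which is the limitation the abstract's final sentence describes.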
