Secrets of RLHF in Large Language Models Part II: Reward Modeling

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training. In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with different preference strengths affect reward model performance differently. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and to fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning so that the reward model retains the ability to distinguish subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.
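
To make the data-side idea concrete, the following is a minimal sketch of how a voting mechanism over multiple reward models might be turned into a preference-strength estimate. The abstract does not specify the statistic, so everything here is an assumption for illustration: the function names `preference_strength` and `label_pair`, the use of the mean and standard deviation of the chosen-minus-rejected score difference, and the `margin` threshold are all hypothetical, not the authors' stated procedure.

```python
from statistics import mean, pstdev


def preference_strength(chosen_scores, rejected_scores):
    """Summarize how strongly an ensemble of reward models prefers `chosen`.

    chosen_scores[i] and rejected_scores[i] are the scores that the i-th
    reward model assigns to the chosen and rejected response of one pair.
    ASSUMPTION: strength is summarized by the mean and spread of the score
    difference; the abstract only says "voting mechanism".
    """
    if not chosen_scores or len(chosen_scores) != len(rejected_scores):
        raise ValueError("need one (chosen, rejected) score pair per model")
    diffs = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    votes_for_chosen = sum(d > 0 for d in diffs)
    return {
        "mean_diff": mean(diffs),      # average preference margin
        "std_diff": pstdev(diffs),     # disagreement between models
        "agreement": votes_for_chosen / len(diffs),  # fraction voting "chosen"
    }


def label_pair(strength, margin=0.25):
    """Bucket a pair by its mean margin (hypothetical threshold)."""
    if strength["mean_diff"] < -margin:
        return "likely_incorrect"  # ensemble prefers the rejected response
    if abs(strength["mean_diff"]) <= margin:
        return "ambiguous"         # ensemble has no clear preference
    return "clear"
```

Under these assumptions, a pair whose label the ensemble contradicts would be a candidate for down-weighting or relabeling, while pairs with a clear margin would be treated as high-quality data.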
