Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses that gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system that combines reasoning task verifiers (RTV) with a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate the effectiveness and scalability of our methods. Results show that RTV is the most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable the model to rapidly capture subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods for overcoming performance barriers in RLHF.


