A Long Way to Go: Investigating Length Correlations in RLHF

Abstract

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they aren't uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
