SLiC-HF: Sequence Likelihood Calibration with Human Feedback

Abstract

Learning from human feedback has been shown to be effective at aligning language models with human preferences. Past work has often relied on Reinforcement Learning from Human Feedback (RLHF), which optimizes the language model using reward scores assigned by a reward model trained on human preference data. In this work we show how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to learn effectively from human preferences, an approach we call SLiC-HF. Furthermore, we demonstrate that this can be done with human feedback data collected for a different model, similar to off-policy, offline RL data. Automatic and human evaluation experiments on the TL;DR summarization task show that SLiC-HF significantly improves upon supervised fine-tuning (SFT) baselines. In addition, SLiC-HF is a competitive alternative to the PPO-based RLHF implementation used in past work, while being much simpler to implement, easier to tune, and more computationally efficient in practice.
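As a rough illustration of what calibrating sequence likelihoods on human preference pairs can look like, the sketch below assumes a hinge-style rank loss with margin `delta` on the log-likelihood gap between a preferred output y+ and a dispreferred output y-, plus a cross-entropy term on a reference target weighted by `reg_weight`. The loss form, the function names (`sequence_logprob`, `slic_hf_rank_loss`), and the default hyperparameter values are assumptions for illustration and are not taken from the abstract. The inputs are sequence log-likelihoods that a model is assumed to have already computed.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Hypothetical helper: log P(y | x) as the sum of per-token log-probs."""
    return float(sum(token_logprobs))


def slic_hf_rank_loss(
    logp_pos: Sequence[float],
    logp_neg: Sequence[float],
    logp_ref: Sequence[float],
    delta: float = 1.0,
    reg_weight: float = 0.5,
) -> float:
    """Hypothetical batch-averaged calibration loss (assumed form).

    logp_pos: log P(y+ | x) for the human-preferred output.
    logp_neg: log P(y- | x) for the dispreferred output.
    logp_ref: log P(y_ref | x) for a reference target (e.g. an SFT target).
    delta: assumed margin by which y+ should outscore y-.
    reg_weight: assumed weight of the cross-entropy regularizer.
    """
    n = len(logp_pos)
    if n == 0 or not (n == len(logp_neg) == len(logp_ref)):
        raise ValueError("inputs must be non-empty and equally sized")
    total = 0.0
    for lp, ln, lr in zip(logp_pos, logp_neg, logp_ref):
        # Hinge on the likelihood gap: zero once y+ beats y- by at least delta.
        rank_term = max(0.0, delta - lp + ln)
        # Negative log-likelihood of the reference keeps the model near it.
        reg_term = -reg_weight * lr
        total += rank_term + reg_term
    return total / n


if __name__ == "__main__":
    pos = [sequence_logprob([-4.0, -6.0]), -12.0]  # -10.0, -12.0
    neg = [-12.0, -10.0]
    ref = [-11.0, -11.0]
    print(slic_hf_rank_loss(pos, neg, ref))
```

The loss only needs log-likelihoods of fixed sequences. Under this sketch, the preference pairs could therefore come from data collected for a different model, which is consistent with the off-policy setting described in the abstract.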
