SLiC-HF: Sequence Likelihood Calibration with Human Feedback
May 17, 2023
Authors: Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, Peter J. Liu
cs.AI
Abstract
Learning from human feedback has been shown to be effective at aligning
language models with human preferences. Past work has often relied on
Reinforcement Learning from Human Feedback (RLHF), which optimizes the language
model using reward scores assigned from a reward model trained on human
preference data. In this work we show how the recently introduced Sequence
Likelihood Calibration (SLiC) can also be used to effectively learn from human
preferences (SLiC-HF). Furthermore, we demonstrate this can be done with human
feedback data collected for a different model, similar to off-policy, offline
RL data. Automatic and human evaluation experiments on the TL;DR summarization
task show that SLiC-HF significantly improves supervised fine-tuning baselines.
Moreover, SLiC-HF presents a competitive alternative to the PPO RLHF
implementation used in past work, while being much simpler to implement, easier
to tune, and more computationally efficient in practice.
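
As a rough illustration of the calibration objective the abstract refers to, the sketch below assumes a hinge-style rank-calibration loss: the log-likelihood of the human-preferred sequence is pushed above that of the dispreferred one by a margin, with a cross-entropy regularizer toward the supervised fine-tuning (SFT) targets. It assumes a PyTorch setup; the function names, argument names, and hyperparameters (delta, lam) are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(logits: torch.Tensor, tokens: torch.Tensor,
                      pad_id: int = 0) -> torch.Tensor:
    """Sum of per-token log-probabilities of `tokens` under `logits`,
    ignoring padding. Shapes: logits [B, T, V], tokens [B, T] -> [B]."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    mask = (tokens != pad_id).float()
    return (token_log_probs * mask).sum(dim=-1)


def slic_hf_loss(logits_pos, tokens_pos, logits_neg, tokens_neg,
                 logits_sft, tokens_sft, delta=1.0, lam=0.5, pad_id=0):
    """Hinge-style calibration loss on (preferred, dispreferred) pairs:
    max(0, delta - log p(y+) + log p(y-)), averaged over the batch,
    plus a cross-entropy regularizer toward the SFT targets."""
    lp_pos = sequence_log_prob(logits_pos, tokens_pos, pad_id)
    lp_neg = sequence_log_prob(logits_neg, tokens_neg, pad_id)
    rank_loss = torch.clamp(delta - lp_pos + lp_neg, min=0.0).mean()
    reg_loss = -sequence_log_prob(logits_sft, tokens_sft, pad_id).mean()
    return rank_loss + lam * reg_loss
```

In this sketch, one would pass the model's logits for the preferred, dispreferred, and SFT-target sequences (all conditioned on the same input) into slic_hf_loss; lam trades off preference calibration against staying close to the supervised baseline.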