
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

September 1, 2023
Authors: Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF vs. RL from AI Feedback (RLAIF) - a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans, and we find that they result in similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in ~70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.
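
The core idea of RLAIF is to replace the human preference labeler with an off-the-shelf LLM: the LLM is shown a pair of candidate outputs and asked which it prefers, and the resulting preference labels are used to train a reward model for RL fine-tuning, just as in RLHF. The sketch below illustrates such an AI preference-labeling step for the summarization task. It is a minimal, hypothetical example, not the authors' exact setup: the `llm_complete` callable, the prompt wording, and the answer parsing are all assumptions introduced for illustration.

```python
# Minimal sketch of AI preference labeling for RLAIF (illustrative only).
# `llm_complete` is a hypothetical callable that sends a prompt to an
# off-the-shelf LLM and returns its text completion.
from typing import Callable


def label_preference(
    llm_complete: Callable[[str], str],
    post: str,
    summary_a: str,
    summary_b: str,
) -> str:
    """Ask an off-the-shelf LLM which of two candidate summaries is better.

    Returns "A" or "B". Collected preference pairs can then be used to train
    a reward model, which drives RL fine-tuning as in standard RLHF.
    """
    prompt = (
        "A good summary is concise, accurate, and covers the key points "
        "of the original text.\n\n"
        f"Text:\n{post}\n\n"
        f"Summary A:\n{summary_a}\n\n"
        f"Summary B:\n{summary_b}\n\n"
        "Which summary is better? Answer with a single letter, A or B."
    )
    answer = llm_complete(prompt).strip().upper()
    # Fall back to "B" if the reply does not start with "A".
    return "A" if answer.startswith("A") else "B"
```

In practice, each preference produced this way plays the same role as a human label in the RLHF pipeline, which is what allows the labeling step to scale without additional human annotation.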
