RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
September 1, 2023
Authors: Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is effective at aligning
large language models (LLMs) to human preferences, but gathering high quality
human preference labels is a key bottleneck. We conduct a head-to-head
comparison of RLHF vs. RL from AI Feedback (RLAIF) - a technique where
preferences are labeled by an off-the-shelf LLM in lieu of humans, and we find
that they result in similar improvements. On the task of summarization, human
evaluators prefer generations from both RLAIF and RLHF over a baseline
supervised fine-tuned model in ~70% of cases. Furthermore, when asked to rate
RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results
suggest that RLAIF can yield human-level performance, offering a potential
solution to the scalability limitations of RLHF.
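As a rough illustration of the labeling step the abstract describes (an off-the-shelf LLM choosing between two candidate outputs in lieu of a human rater), here is a minimal Python sketch. The helper `query_llm` and the prompt template are hypothetical placeholders, not the paper's released code or prompts.

```python
# Minimal sketch of AI preference labeling (the core idea behind RLAIF).
# `query_llm` is a hypothetical stand-in for a call to an off-the-shelf LLM;
# the prompt wording below is an illustrative assumption, not the paper's prompt.

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an off-the-shelf LLM and return its reply."""
    raise NotImplementedError("Plug in your preferred LLM API here.")

LABELING_TEMPLATE = (
    "A good summary is concise and faithful to the original text.\n"
    "Text: {text}\n"
    "Summary 1: {summary_a}\n"
    "Summary 2: {summary_b}\n"
    'Which summary is better? Answer with "1" or "2".'
)

def ai_preference(text: str, summary_a: str, summary_b: str) -> int:
    """Ask the LLM which of two candidate summaries it prefers.

    Returns 0 if the first summary is preferred, 1 otherwise. Labels collected
    this way could then train a reward model for RL, replacing the human
    preference labels used in standard RLHF.
    """
    reply = query_llm(
        LABELING_TEMPLATE.format(text=text, summary_a=summary_a, summary_b=summary_b)
    )
    return 0 if reply.strip().startswith("1") else 1
```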