
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

September 1, 2023
Authors: Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF vs. RL from AI Feedback (RLAIF) - a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans, and we find that they result in similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in ~70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.
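
As a rough illustration of the AI-feedback step described in the abstract, the sketch below shows how an off-the-shelf LLM might be prompted to choose between two candidate summaries, yielding preference pairs where human labels would otherwise be collected. The `llm_complete` helper and the prompt wording are hypothetical placeholders, not the paper's actual labeling setup.

```python
# Minimal sketch of AI preference labeling (the core idea behind RLAIF):
# an off-the-shelf LLM, rather than a human rater, picks the preferred
# of two model-generated summaries. `llm_complete` is a placeholder for
# whatever LLM API is available; the prompt is illustrative only.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    context: str   # the document being summarized
    chosen: str    # summary preferred by the AI labeler
    rejected: str  # the other summary


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an off-the-shelf LLM (assumed available)."""
    raise NotImplementedError("Plug in your LLM provider here.")


def ai_preference_label(document: str, summary_a: str, summary_b: str) -> PreferencePair:
    """Ask the AI labeler which of two candidate summaries is better."""
    prompt = (
        "You will be shown a document and two candidate summaries.\n"
        "Reply with only 'A' or 'B' to indicate the better summary.\n\n"
        f"Document:\n{document}\n\n"
        f"Summary A:\n{summary_a}\n\n"
        f"Summary B:\n{summary_b}\n\n"
        "Preferred summary:"
    )
    verdict = llm_complete(prompt).strip().upper()
    if verdict.startswith("A"):
        return PreferencePair(document, chosen=summary_a, rejected=summary_b)
    return PreferencePair(document, chosen=summary_b, rejected=summary_a)
```

The resulting (chosen, rejected) pairs can then train a reward model and drive RL fine-tuning in the same pipeline that would otherwise consume human preference labels.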
