

Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

March 28, 2025
Authors: Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, Lin Yan
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
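To make the hybrid reward idea concrete, the sketch below routes reasoning-task prompts (math, coding) with available ground truth to a rule-based verifier (RTV) and all other prompts to a generative reward model (GenRM), producing a single scalar reward for PPO-style training. This is a minimal illustration under assumed interfaces, not the paper's implementation; the names `Prompt`, `hybrid_reward`, `verify`, and `genrm_score` are hypothetical placeholders.

```python
# Illustrative sketch of a hybrid reward system (RTV + GenRM).
# All function names and signatures here are hypothetical.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Prompt:
    text: str
    task_type: str              # e.g. "math", "coding", "chat"
    reference: Optional[str]    # ground-truth answer or test cases, if any


def hybrid_reward(
    prompt: Prompt,
    response: str,
    verify: Callable[[str, str], bool],       # RTV: checks response against reference
    genrm_score: Callable[[str, str], float],  # GenRM: scores response given prompt
) -> float:
    """Return a scalar reward for a single (prompt, response) pair."""
    if prompt.task_type in {"math", "coding"} and prompt.reference is not None:
        # Reasoning task verifier: binary signal, harder to reward-hack.
        return 1.0 if verify(response, prompt.reference) else 0.0
    # Generative reward model handles open-ended tasks.
    return genrm_score(prompt.text, response)
```

In this reading, the verifier-backed branch is the most hack-resistant signal, which is consistent with the abstract's ordering of robustness (RTV, then GenRM with ground truth, then GenRM with SFT Best-of-N responses).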

