Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

February 24, 2025
Authors: Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
cs.AI

Abstract

Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
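
The core idea can be illustrated with a minimal PyTorch-style sketch. Everything below is hypothetical: the tiny `policy` and `gvm` modules stand in for the actor LLM and the pretrained global value model, and the advantage proxy (successive differences of frozen return-to-go estimates) plus the clipped surrogate are illustrative choices, not the paper's exact objective.

```python
import torch

# Illustrative stand-ins for the actor LLM and the pretrained GVM;
# module names and sizes are hypothetical, not from the paper's code.
vocab_size, hidden = 100, 32
policy = torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                             torch.nn.Linear(hidden, vocab_size))
gvm = torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                          torch.nn.Linear(hidden, 1))
for p in gvm.parameters():   # the GVM is pretrained and frozen,
    p.requires_grad_(False)  # so there is no critic update below
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

tokens = torch.randint(0, vocab_size, (4, 16))  # sampled policy trajectories

with torch.no_grad():
    # Behavior-policy log-probs of each generated token (shifted by one).
    old_logp = (policy(tokens)[:, :-1].log_softmax(-1)
                .gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1))
    # Frozen token-level return-to-go estimates from the GVM.
    values = gvm(tokens).squeeze(-1)                 # (batch, seq_len)
    # One simple advantage proxy: per-token change in return-to-go.
    adv = values[:, 1:] - values[:, :-1]

clip_eps = 0.2
for _ in range(3):  # a few policy-only epochs; no value network is trained
    logp = (policy(tokens)[:, :-1].log_softmax(-1)
            .gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1))
    ratio = (logp - old_logp).exp()
    # PPO-style clipped surrogate driven entirely by the frozen GVM signal.
    loss = -torch.min(ratio * adv,
                      ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the value signal is precomputed and frozen, the inner loop carries gradients and optimizer state only for the policy, which is where the reported savings in GPU memory and training time over joint actor-critic training come from.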
