Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

February 24, 2025
Authors: Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
cs.AI

Abstract

Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
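
The core idea can be illustrated with a minimal PyTorch-style sketch. Everything below is hypothetical: the tiny `policy` and `gvm` modules stand in for the actor LLM and the pretrained global value model, and the advantage proxy (successive differences of frozen return-to-go estimates) plus the clipped surrogate are illustrative choices, not the paper's exact objective.

```python
import torch

# Illustrative stand-ins for the actor LLM and the pretrained GVM;
# module names and sizes are hypothetical, not from the paper's code.
vocab_size, hidden = 100, 32
policy = torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                             torch.nn.Linear(hidden, vocab_size))
gvm = torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                          torch.nn.Linear(hidden, 1))
for p in gvm.parameters():   # the GVM is pretrained and frozen,
    p.requires_grad_(False)  # so there is no critic update below
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

tokens = torch.randint(0, vocab_size, (4, 16))  # sampled policy trajectories

with torch.no_grad():
    # Behavior-policy log-probs of each generated token (shifted by one).
    old_logp = (policy(tokens)[:, :-1].log_softmax(-1)
                .gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1))
    # Frozen token-level return-to-go estimates from the GVM.
    values = gvm(tokens).squeeze(-1)                 # (batch, seq_len)
    # One simple advantage proxy: per-token change in return-to-go.
    adv = values[:, 1:] - values[:, :-1]

clip_eps = 0.2
for _ in range(3):  # a few policy-only epochs; no value network is trained
    logp = (policy(tokens)[:, :-1].log_softmax(-1)
            .gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1))
    ratio = (logp - old_logp).exp()
    # PPO-style clipped surrogate driven entirely by the frozen GVM signal.
    loss = -torch.min(ratio * adv,
                      ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the value signal is precomputed and frozen, the inner loop carries gradients and optimizer state only for the policy, which is where the reported savings in GPU memory and training time over joint actor-critic training come from.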
