Iterative Value Function Optimization for Guided Decoding
March 4, 2025
Authors: Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Wenliang Chen, Jing Shao
cs.AI
Abstract
While Reinforcement Learning from Human Feedback (RLHF) has become the
predominant method for controlling language model outputs, it suffers from high
computational costs and training instability. Guided decoding, especially
value-guided methods, offers a cost-effective alternative by controlling
outputs without re-training models. However, the accuracy of the value function
is crucial for value-guided decoding, as inaccuracies can lead to suboptimal
decision-making and degraded performance. Existing methods struggle with
accurately estimating the optimal value function, leading to less effective
control. We propose Iterative Value Function Optimization, a novel framework
that addresses these limitations through two key components: Monte Carlo Value
Estimation, which reduces estimation variance by exploring diverse
trajectories, and Iterative On-Policy Optimization, which progressively
improves value estimation through collecting trajectories from value-guided
policies. Extensive experiments on text summarization, multi-turn dialogue, and
instruction following demonstrate the effectiveness of value-guided decoding
approaches in aligning language models. These approaches not only achieve
alignment but also significantly reduce computational costs by leveraging
principled value function optimization for efficient and effective control.
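The abstract describes value-guided decoding only at a high level. As a minimal, self-contained sketch of the general idea, the Python toy example below reweights a base policy's next-token probabilities by Monte Carlo value estimates obtained from sampled continuations. It is not the paper's implementation: the vocabulary, base_policy, reward, BETA, and N_ROLLOUTS are all assumptions invented purely for illustration.

# Toy sketch of value-guided decoding with Monte Carlo value estimation.
# Not the authors' method: the vocabulary, policy, reward, and
# hyperparameters below are assumptions made only for demonstration.
import math
import random

VOCAB = ["good", "bad", "ok", "<eos>"]
BETA = 2.0        # guidance strength (assumed hyperparameter)
N_ROLLOUTS = 16   # Monte Carlo rollouts per candidate token (assumed)
MAX_LEN = 8       # maximum sequence length (assumed)

def base_policy(prefix):
    """Toy 'language model': uniform distribution over the next token."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def reward(sequence):
    """Toy sequence-level reward: prefer 'good', penalize 'bad'."""
    return sequence.count("good") - sequence.count("bad")

def mc_value(prefix):
    """Monte Carlo value estimate: average reward over sampled continuations.
    Averaging across diverse trajectories reduces estimation variance."""
    total = 0.0
    for _ in range(N_ROLLOUTS):
        seq = list(prefix)
        while len(seq) < MAX_LEN and (not seq or seq[-1] != "<eos>"):
            probs = base_policy(seq)
            seq.append(random.choices(list(probs),
                                      weights=list(probs.values()))[0])
        total += reward(seq)
    return total / N_ROLLOUTS

def value_guided_step(prefix):
    """Reweight base-policy probabilities by exp(BETA * V) and sample a token."""
    probs = base_policy(prefix)
    weights = {tok: p * math.exp(BETA * mc_value(prefix + [tok]))
               for tok, p in probs.items()}
    z = sum(weights.values())
    toks = list(weights)
    return random.choices(toks, weights=[weights[t] / z for t in toks])[0]

if __name__ == "__main__":
    random.seed(0)
    seq = []
    while len(seq) < MAX_LEN and (not seq or seq[-1] != "<eos>"):
        seq.append(value_guided_step(seq))
    print("guided sample:", seq)

In the iterative on-policy optimization the abstract mentions, trajectories produced by the guided policy itself would additionally be collected and used to refit the value estimator, with decoding then guided by the improved estimate; the sketch above keeps the estimator fixed for brevity.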