Iterative Value Function Optimization for Guided Decoding
March 4, 2025
Authors: Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Wenliang Chen, Jing Shao
cs.AI
Abstract
While Reinforcement Learning from Human Feedback (RLHF) has become the
predominant method for controlling language model outputs, it suffers from high
computational costs and training instability. Guided decoding, especially
value-guided methods, offers a cost-effective alternative by controlling
outputs without re-training models. However, the accuracy of the value function
is crucial for value-guided decoding, as inaccuracies can lead to suboptimal
decision-making and degraded performance. Existing methods struggle with
accurately estimating the optimal value function, leading to less effective
control. We propose Iterative Value Function Optimization, a novel framework
that addresses these limitations through two key components: Monte Carlo Value
Estimation, which reduces estimation variance by exploring diverse
trajectories, and Iterative On-Policy Optimization, which progressively
improves value estimation through collecting trajectories from value-guided
policies. Extensive experiments on text summarization, multi-turn dialogue, and
instruction following demonstrate the effectiveness of value-guided decoding
approaches in aligning language models. These approaches not only achieve
alignment but also significantly reduce computational costs by leveraging
principled value function optimization for efficient and effective control.
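The abstract describes value-guided decoding only at a high level. As a minimal, self-contained sketch of the general idea, the Python toy example below reweights a base policy's next-token probabilities by Monte Carlo value estimates obtained from sampled continuations. It is not the paper's implementation: the vocabulary, base_policy, reward, BETA, and N_ROLLOUTS are all assumptions invented purely for illustration.

# Toy sketch of value-guided decoding with Monte Carlo value estimation.
# Not the authors' method: the vocabulary, policy, reward, and
# hyperparameters below are assumptions made only for demonstration.
import math
import random

VOCAB = ["good", "bad", "ok", "<eos>"]
BETA = 2.0        # guidance strength (assumed hyperparameter)
N_ROLLOUTS = 16   # Monte Carlo rollouts per candidate token (assumed)
MAX_LEN = 8       # maximum sequence length (assumed)

def base_policy(prefix):
    """Toy 'language model': uniform distribution over the next token."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def reward(sequence):
    """Toy sequence-level reward: prefer 'good', penalize 'bad'."""
    return sequence.count("good") - sequence.count("bad")

def mc_value(prefix):
    """Monte Carlo value estimate: average reward over sampled continuations.
    Averaging across diverse trajectories reduces estimation variance."""
    total = 0.0
    for _ in range(N_ROLLOUTS):
        seq = list(prefix)
        while len(seq) < MAX_LEN and (not seq or seq[-1] != "<eos>"):
            probs = base_policy(seq)
            seq.append(random.choices(list(probs),
                                      weights=list(probs.values()))[0])
        total += reward(seq)
    return total / N_ROLLOUTS

def value_guided_step(prefix):
    """Reweight base-policy probabilities by exp(BETA * V) and sample a token."""
    probs = base_policy(prefix)
    weights = {tok: p * math.exp(BETA * mc_value(prefix + [tok]))
               for tok, p in probs.items()}
    z = sum(weights.values())
    toks = list(weights)
    return random.choices(toks, weights=[weights[t] / z for t in toks])[0]

if __name__ == "__main__":
    random.seed(0)
    seq = []
    while len(seq) < MAX_LEN and (not seq or seq[-1] != "<eos>"):
        seq.append(value_guided_step(seq))
    print("guided sample:", seq)

In the iterative on-policy optimization the abstract mentions, trajectories produced by the guided policy itself would additionally be collected and used to refit the value estimator, with decoding then guided by the improved estimate; the sketch above keeps the estimator fixed for brevity.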