

Controlled Decoding from Language Models

October 25, 2023
Authors: Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami
cs.AI

Abstract

We propose controlled decoding (CD), a novel off-policy reinforcement learning method to control the autoregressive generation from language models towards high reward outcomes. CD solves an off-policy reinforcement learning problem through a value function for the reward, which we call a prefix scorer. The prefix scorer is used at inference time to steer the generation towards higher reward outcomes. We show that the prefix scorer may be trained on (possibly) off-policy data to predict the expected reward when decoding is continued from a partially decoded response. We empirically demonstrate that CD is effective as a control mechanism on the Reddit conversations corpus. We also show that the modularity of the design of CD makes it possible to control for multiple rewards, effectively solving a multi-objective reinforcement learning problem with no additional complexity. Finally, we show that CD can be applied in a novel blockwise fashion at inference-time, again without the need for any training-time changes, essentially bridging the gap between the popular best-of-K strategy and token-level reinforcement learning. This makes CD a promising approach for alignment of language models.
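
To make the mechanism concrete, below is a minimal, hypothetical Python sketch of how a trained prefix scorer could steer decoding at inference time, first token by token and then blockwise. The callables `base_lm` (next-token log-probabilities), `prefix_scorer` (expected reward of a partial response), and `sample_block` (samples a short continuation), along with the `reward_weight` knob, are illustrative assumptions, not the paper's actual interface or implementation.

```python
import math
import random

def cd_decode(base_lm, prefix_scorer, prompt_tokens, max_len=64,
              top_k=8, reward_weight=1.0, eos_id=0):
    """Token-level controlled decoding (illustrative sketch): re-rank the base
    LM's top-k candidate tokens by adding the prefix scorer's expected-reward
    estimate to their log-probabilities, then sample from the adjusted scores."""
    tokens = list(prompt_tokens)
    for _ in range(max_len):
        logprobs = base_lm(tokens)  # assumed: dict of token_id -> log p(token | prefix)
        candidates = sorted(logprobs, key=logprobs.get, reverse=True)[:top_k]

        # Combine the base model's log-probability with the scorer's value estimate.
        scores = [logprobs[t] + reward_weight * prefix_scorer(tokens + [t])
                  for t in candidates]

        # Softmax over the adjusted scores and sample the next token.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        next_token = random.choices(candidates, weights=weights, k=1)[0]

        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens

def cd_decode_blockwise(sample_block, prefix_scorer, prompt_tokens,
                        num_blocks=4, k=4):
    """Blockwise controlled decoding (illustrative sketch): sample k candidate
    blocks from the base LM, keep the one the prefix scorer rates highest,
    append it, and repeat."""
    tokens = list(prompt_tokens)
    for _ in range(num_blocks):
        candidates = [sample_block(tokens) for _ in range(k)]  # each a list of token ids
        best = max(candidates, key=lambda block: prefix_scorer(tokens + block))
        tokens.extend(best)
    return tokens
```

Under these assumptions, the blockwise routine with a single block and k = K candidates reduces to best-of-K sampling scored by the prefix scorer, which is one way to read the abstract's claim that blockwise CD bridges best-of-K and token-level reinforcement learning.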