Controlled Decoding from Language Models
October 25, 2023
Authors: Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami
cs.AI
Abstract
We propose controlled decoding (CD), a novel off-policy reinforcement
learning method to control the autoregressive generation from language models
towards high reward outcomes. CD solves an off-policy reinforcement learning
problem through a value function for the reward, which we call a prefix scorer.
The prefix scorer is used at inference time to steer the generation towards
higher reward outcomes. We show that the prefix scorer may be trained on
(possibly) off-policy data to predict the expected reward when decoding is
continued from a partially decoded response. We empirically demonstrate that CD
is effective as a control mechanism on the Reddit conversations corpus. We also
show that the modularity of the design of CD makes it possible to control for
multiple rewards, effectively solving a multi-objective reinforcement learning
problem with no additional complexity. Finally, we show that CD can be applied
in a novel blockwise fashion at inference-time, again without the need for any
training-time changes, essentially bridging the gap between the popular
best-of-K strategy and token-level reinforcement learning. This makes CD a
promising approach for alignment of language models.
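For illustration, below is a minimal sketch of how a token-level controlled-decoding loop could use a prefix scorer at inference time: at each step, the base model's top-k candidate tokens are re-ranked by adding the prefix scorer's value estimate for each candidate continuation, scaled by a control strength. The function names (`base_lm_logprobs`, `prefix_scorer`), the toy vocabulary, and the blending rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

VOCAB = ["<eos>", "hello", "world", "thanks", "sorry"]  # toy vocabulary (assumption)

def base_lm_logprobs(prefix):
    # Stand-in for the frozen base LM: log-probabilities over the toy vocabulary.
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    logits = rng.normal(size=len(VOCAB))
    return logits - np.logaddexp.reduce(logits)

def prefix_scorer(prefix):
    # Stand-in for a trained prefix scorer: estimated expected reward if decoding
    # continues from this partial response (here, a toy proxy reward).
    return float(sum(tok == "thanks" for tok in prefix))

def controlled_decode(prompt, max_steps=10, top_k=3, strength=1.0):
    # Token-level controlled decoding (illustrative): re-rank the base LM's top-k
    # candidates by log p(token | prefix) + strength * value(prefix + token).
    prefix = list(prompt)
    for _ in range(max_steps):
        logp = base_lm_logprobs(prefix)
        candidates = np.argsort(logp)[-top_k:]  # indices of the top-k tokens
        scores = [logp[i] + strength * prefix_scorer(prefix + [VOCAB[i]])
                  for i in candidates]
        next_tok = VOCAB[candidates[int(np.argmax(scores))]]
        prefix.append(next_tok)
        if next_tok == "<eos>":
            break
    return prefix

if __name__ == "__main__":
    print(controlled_decode(["hello"]))
```

A blockwise variant of the same idea would sample K candidate continuation blocks from the base model and keep the block with the highest prefix-scorer value, which is how the abstract's bridge between best-of-K sampling and token-level control can be read.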