DeAL: Decoding-time Alignment for Large Language Models
February 5, 2024
Authors: James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth
cs.AI
Abstract
Large Language Models (LLMs) are nowadays expected to generate content
aligned with human preferences. Current work focuses on alignment at model
training time, through techniques such as Reinforcement Learning with Human
Feedback (RLHF). However, it is unclear if such methods are an effective choice
to teach alignment objectives to the model. First, the inability to incorporate
multiple, custom rewards and reliance on a model developer's view of universal
and static principles are key limitations. Second, the residual gaps in model
training and the reliability of such approaches are also questionable (e.g.,
susceptibility to jail-breaking even after safety training). To address these,
we propose DeAL, a framework that allows the user to customize reward functions
and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view
decoding as a heuristic-guided search process and facilitate the use of a wide
variety of alignment objectives. Our experiments with programmatic constraints
such as keyword and length constraints (studied widely in the pre-LLM era) and
abstract objectives such as harmlessness and helpfulness (proposed in the
post-LLM era) show that we can DeAL with fine-grained trade-offs, improve
adherence to alignment objectives, and address residual gaps in LLMs. Lastly,
while DeAL can be effectively paired with RLHF and prompting techniques, its
generality makes decoding slower, an optimization we leave for future work.
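The abstract's central technical idea is to treat decoding as a heuristic-guided search in which a user-supplied reward steers which continuations are kept. The sketch below is a minimal illustration of that view, not the paper's implementation: guided_beam_search, next_token_probs, alignment_reward, and reward_weight are hypothetical names, and a toy next-token distribution stands in for a real LLM.

# Minimal sketch (assumed, not the authors' code): decoding as heuristic-guided
# beam search, where beams are ranked by LM log-probability plus a weighted,
# user-defined alignment reward computed on the partial hypothesis.

import math
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM's next-token distribution: maps a token
# prefix to (token, probability) pairs.
NextTokenFn = Callable[[List[str]], List[Tuple[str, float]]]
RewardFn = Callable[[List[str]], float]


def guided_beam_search(
    prefix: List[str],
    next_token_probs: NextTokenFn,
    alignment_reward: RewardFn,
    beam_width: int = 4,
    max_new_tokens: int = 20,
    reward_weight: float = 1.0,
    eos: str = "<eos>",
) -> List[str]:
    """Return the highest-scoring continuation of `prefix`."""
    # Each beam is (tokens, cumulative LM log-probability).
    beams: List[Tuple[List[str], float]] = [(list(prefix), 0.0)]
    for _ in range(max_new_tokens):
        candidates: List[Tuple[List[str], float]] = []
        for tokens, logp in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, logp))  # keep finished beams as-is
                continue
            for tok, p in next_token_probs(tokens):
                candidates.append((tokens + [tok], logp + math.log(p)))
        # Heuristic guidance: rank by LM score plus the weighted alignment
        # reward of the partial generation, then keep the top beams.
        candidates.sort(
            key=lambda c: c[1] + reward_weight * alignment_reward(c[0]),
            reverse=True,
        )
        beams = candidates[:beam_width]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams[0][0]


# Toy usage: a fixed next-token distribution and a keyword-style reward that
# prefers outputs containing "politely" (a stand-in for an alignment objective).
def toy_lm(tokens: List[str]) -> List[Tuple[str, float]]:
    return [("politely", 0.3), ("quickly", 0.5), ("<eos>", 0.2)]


def keyword_reward(tokens: List[str]) -> float:
    return 1.0 if "politely" in tokens else 0.0


print(guided_beam_search(["respond"], toy_lm, keyword_reward, max_new_tokens=3))

In this framing, swapping the reward function is all that is needed to move between programmatic constraints (keywords, length) and learned scorers for abstract objectives such as harmlessness or helpfulness; the extra reward calls per candidate are also where the decoding slowdown noted in the abstract comes from.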