

Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

June 21, 2024
Authors: Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
cs.AI

Abstract

The widespread applicability and increasing omnipresence of LLMs have created a need to align LLM responses with user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is difficult in that setting. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of the reward, often because these two aspects are formulated jointly, and thus fail to produce well-aligned responses. To remedy this, we decouple the two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is represented as the periodic replacement of poorly rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
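The exploration/exploitation loop described in the abstract can be pictured as a simple evolutionary search over candidate responses. Below is a minimal Python sketch under strong simplifying assumptions: generate, mutate_instruction, and reward_model are hypothetical stand-ins (toy functions here) for an LLM decoder, an instruction mutator, and a trained reward model, and the loop structure only illustrates the decoupled idea, not the authors' released Darwin implementation.

```python
import random

# Placeholder components: in practice these would be an LLM decoder, an
# instruction mutator (e.g. paraphrasing or prompt perturbation), and a
# trained reward model. The names and toy bodies are assumptions for
# illustration, not the authors' code.
def generate(instruction: str) -> str:
    return f"response to: {instruction}"

def mutate_instruction(instruction: str) -> str:
    return f"{instruction} [variant {random.randint(0, 9999)}]"

def reward_model(instruction: str, response: str) -> float:
    return random.random()

def evolve_responses(instruction: str,
                     population_size: int = 8,
                     generations: int = 4,
                     keep_top: int = 4) -> str:
    """Sketch of decoupled exploration/exploitation at decoding time.

    Exploration: decode candidate responses from mutated instructions.
    Exploitation: periodically replace poorly rewarded candidates,
    keeping only the well-rewarded ones across generations.
    """
    # Initial population: responses decoded from mutated instructions (exploration).
    population = []
    for _ in range(population_size):
        response = generate(mutate_instruction(instruction))
        population.append((reward_model(instruction, response), response))

    for _ in range(generations):
        # Exploitation: rank by reward and keep only the top candidates.
        population.sort(key=lambda pair: pair[0], reverse=True)
        survivors = population[:keep_top]

        # Exploration: refill the population with responses decoded from
        # freshly mutated instructions, replacing the poorly rewarded ones.
        refill = []
        for _ in range(population_size - keep_top):
            response = generate(mutate_instruction(instruction))
            refill.append((reward_model(instruction, response), response))
        population = survivors + refill

    # Return the highest-reward response found.
    return max(population, key=lambda pair: pair[0])[1]

if __name__ == "__main__":
    print(evolve_responses("Explain decoding-time alignment in one paragraph."))
```

Keeping reward-based selection (exploitation) in a step separate from instruction mutation (exploration) is what the abstract refers to as decoupling the two aspects; the specific population size, number of generations, and selection rule above are illustrative choices only.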
