BOND: Aligning LLMs with Best-of-N Distillation
July 19, 2024
Authors: Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is a key driver of quality
and safety in state-of-the-art large language models. Yet, a surprisingly
simple and strong inference-time strategy is Best-of-N sampling that selects
the best generation among N candidates. In this paper, we propose Best-of-N
Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but
without its significant computational overhead at inference time. Specifically,
BOND is a distribution matching algorithm that forces the distribution of
generations from the policy to get closer to the Best-of-N distribution. We use
the Jeffreys divergence (a linear combination of forward and backward KL) to
balance between mode-covering and mode-seeking behavior, and derive an
iterative formulation that utilizes a moving anchor for efficiency. We
demonstrate the effectiveness of our approach and several design choices
through experiments on abstractive summarization and Gemma models. Aligning
Gemma policies with BOND outperforms other RLHF algorithms by improving results
on several benchmarks.
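
To make the abstract concrete, the following is a sketch in our own shorthand (the symbols $\pi_{\mathrm{ref}}$, $\pi_\theta$, $\pi_{\mathrm{BoN}}$, $r$, and the mixing weight $\beta$ are notational assumptions, not necessarily the paper's). Write $\pi_{\mathrm{ref}}$ for the reference policy, $r(x, y)$ for the reward of generation $y$ on prompt $x$, $\pi_{\mathrm{BoN}}$ for the distribution obtained by drawing $N$ candidates from $\pi_{\mathrm{ref}}$ and keeping the highest-reward one, and $\pi_\theta$ for the policy being trained. For a continuous, tie-free reward, standard order statistics give
\[
\pi_{\mathrm{BoN}}(y \mid x) \;=\; N\,\pi_{\mathrm{ref}}(y \mid x)\,\big[F(x, y)\big]^{N-1},
\qquad
F(x, y) \;=\; \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\big[r(x, y') \le r(x, y)\big],
\]
and the "linear combination of forward and backward KL" mentioned above corresponds, for some mixing weight $\beta \in [0, 1]$, to a Jeffreys-type objective of the form
\[
J_\beta(\theta) \;=\; (1-\beta)\,\mathrm{KL}\!\big(\pi_{\mathrm{BoN}} \,\|\, \pi_\theta\big) \;+\; \beta\,\mathrm{KL}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{BoN}}\big),
\]
where the forward term $\mathrm{KL}(\pi_{\mathrm{BoN}} \| \pi_\theta)$ encourages mode-covering behavior and the backward term $\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{BoN}})$ encourages mode-seeking behavior; setting $\beta = 0$ or $\beta = 1$ recovers a pure forward or backward KL. This is only a sketch of the objective family the abstract describes; the paper's exact formulation, its estimators, and the moving-anchor schedule may differ.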