BOND: Aligning LLMs with Best-of-N Distillation
July 19, 2024
Authors: Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is a key driver of quality
and safety in state-of-the-art large language models. Yet, a surprisingly
simple and strong inference-time strategy is Best-of-N sampling that selects
the best generation among N candidates. In this paper, we propose Best-of-N
Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but
without its significant computational overhead at inference time. Specifically,
BOND is a distribution matching algorithm that forces the distribution of
generations from the policy to get closer to the Best-of-N distribution. We use
the Jeffreys divergence (a linear combination of forward and backward KL) to
balance between mode-covering and mode-seeking behavior, and derive an
iterative formulation that utilizes a moving anchor for efficiency. We
demonstrate the effectiveness of our approach and several design choices
through experiments on abstractive summarization and Gemma models. Aligning
Gemma policies with BOND outperforms other RLHF algorithms by improving results
on several benchmarks.
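
To make the abstract concrete, the following is a sketch in our own shorthand (the symbols $\pi_{\mathrm{ref}}$, $\pi_\theta$, $\pi_{\mathrm{BoN}}$, $r$, and the mixing weight $\beta$ are notational assumptions, not necessarily the paper's). Write $\pi_{\mathrm{ref}}$ for the reference policy, $r(x, y)$ for the reward of generation $y$ on prompt $x$, $\pi_{\mathrm{BoN}}$ for the distribution obtained by drawing $N$ candidates from $\pi_{\mathrm{ref}}$ and keeping the highest-reward one, and $\pi_\theta$ for the policy being trained. For a continuous, tie-free reward, standard order statistics give
\[
\pi_{\mathrm{BoN}}(y \mid x) \;=\; N\,\pi_{\mathrm{ref}}(y \mid x)\,\big[F(x, y)\big]^{N-1},
\qquad
F(x, y) \;=\; \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\big[r(x, y') \le r(x, y)\big],
\]
and the "linear combination of forward and backward KL" mentioned above corresponds, for some mixing weight $\beta \in [0, 1]$, to a Jeffreys-type objective of the form
\[
J_\beta(\theta) \;=\; (1-\beta)\,\mathrm{KL}\!\big(\pi_{\mathrm{BoN}} \,\|\, \pi_\theta\big) \;+\; \beta\,\mathrm{KL}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{BoN}}\big),
\]
where the forward term $\mathrm{KL}(\pi_{\mathrm{BoN}} \| \pi_\theta)$ encourages mode-covering behavior and the backward term $\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{BoN}})$ encourages mode-seeking behavior; setting $\beta = 0$ or $\beta = 1$ recovers a pure forward or backward KL. This is only a sketch of the objective family the abstract describes; the paper's exact formulation, its estimators, and the moving-anchor schedule may differ.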