BOND: Aligning LLMs with Best-of-N Distillation

July 19, 2024
Authors: Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.
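The abstract contrasts inference-time Best-of-N sampling with BOND's training-time distribution matching under a Jeffreys divergence (a linear combination of forward and backward KL). The sketch below is a minimal Python illustration of these two ingredients, not the paper's implementation: `generate` and `reward_fn` are hypothetical stand-ins, the divergence is computed over discrete probability vectors, and the weighting convention for the two KL terms is an assumption.

```python
import numpy as np

def best_of_n_sample(generate, reward_fn, n):
    """Inference-time Best-of-N: draw n candidates and keep the highest-reward one.
    `generate` and `reward_fn` are placeholder callables, not from the paper."""
    candidates = [generate() for _ in range(n)]
    rewards = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(rewards))]

def jeffreys_divergence(p, q, beta=0.5, eps=1e-12):
    """Weighted combination of forward KL(p || q) and backward KL(q || p)
    between two discrete distributions given as probability vectors.
    The (1 - beta) / beta split is illustrative, not necessarily the paper's convention."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    forward = np.sum(p * np.log(p / q))   # mode-covering term
    backward = np.sum(q * np.log(q / p))  # mode-seeking term
    return (1.0 - beta) * forward + beta * backward
```

In this toy form, BOND's idea corresponds to adjusting the policy so that `jeffreys_divergence(best_of_n_distribution, policy_distribution)` decreases, recovering Best-of-N quality without sampling N candidates at inference time.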
