BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
June 18, 2024
Authors: Wenda Xu, Jiachen Li, William Yang Wang, Lei Li
cs.AI
Abstract
Direct alignment from preferences (DAP) has emerged as a promising paradigm
for aligning large language models (LLMs) to human desiderata from
pre-collected, offline preference datasets. While recent studies indicate that
existing offline DAP methods can directly benefit from online training samples,
we highlight the need to develop specific online DAP algorithms to fully
harness the power of online training. Specifically, we identify that the
learned LLM should adhere to the proximity of the behavior LLM, which collects
the training samples. To this end, we propose online Preference Optimization in
proximity to the Behavior LLM (BPO), emphasizing the importance of constructing
a proper trust region for LLM alignment.
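As a rough, illustrative sketch of this trust-region idea (an assumption on our part, not the paper's released implementation): a DPO-style preference loss in which the reference log-probabilities come from the behavior LLM that generated the online samples, so the learned policy stays in its proximity. The function name bpo_style_loss, the tensor arguments, and the beta coefficient below are hypothetical placeholders.

```python
# Minimal sketch (assumed, not the authors' code): a DPO-style preference loss
# whose trust region is anchored at the behavior LLM that sampled the
# responses, instead of a fixed offline reference model.
import torch
import torch.nn.functional as F

def bpo_style_loss(policy_logp_chosen: torch.Tensor,
                   policy_logp_rejected: torch.Tensor,
                   behavior_logp_chosen: torch.Tensor,
                   behavior_logp_rejected: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Each tensor holds per-example sequence log-probabilities (summed over
    tokens) of the preferred (chosen) and dispreferred (rejected) responses."""
    # Implicit rewards are log-ratios against the behavior LLM, keeping the
    # learned policy close to the model that actually collected the samples.
    chosen_reward = beta * (policy_logp_chosen - behavior_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - behavior_logp_rejected)
    # Bradley-Terry style logistic loss on the reward margin.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```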
We conduct extensive experiments to validate the effectiveness and
applicability of our approach by integrating it with various DAP methods,
resulting in significant performance improvements across a wide range of tasks
when training with the same amount of preference data. Even when only
introducing one additional data collection phase, our online BPO improves its
offline DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on
Anthropic Helpfulness in terms of win rate against human reference text.