Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
September 27, 2025
Authors: Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng
cs.AI
Abstract
Preference optimization is crucial for aligning large language models (LLMs)
with human values and intentions. A significant challenge in this process is
the distribution mismatch between pre-collected offline preference data and the
evolving model policy. Existing methods attempt to reduce this gap using static
heuristics or decoupled online sampling strategies, but they often fail to
adapt to the model's dynamic learning state. To bridge this gap, we propose
Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework
that dynamically couples data generation with model training. MetaAPO employs a
lightweight meta-learner as an "alignment gap estimator" to evaluate the
potential benefits of on-policy sampling relative to offline data. This
guides targeted online generation and assigns sample-wise meta-weights to the
optimization objective, dynamically balancing the quality and distribution of
online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench
demonstrate that MetaAPO consistently outperforms existing preference
optimization approaches across various settings, while reducing online
annotation costs by 42%.
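
The core idea described in the abstract, a lightweight meta-learner that assigns sample-wise weights to a preference-optimization objective mixing offline and on-policy data, can be sketched as follows. This is a hedged illustration, not the paper's implementation: the feature set, the `GapEstimator` architecture, and the use of a DPO-style loss are assumptions made purely for exposition.

```python
# Minimal sketch (not the authors' code) of meta-weighted preference optimization:
# a small "gap estimator" network scores each preference pair, and its output is
# used as a per-sample weight on a DPO-style objective over offline and online pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GapEstimator(nn.Module):
    """Lightweight meta-learner mapping per-pair features to a weight in (0, 1)."""

    def __init__(self, feat_dim: int = 4, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid()
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)  # shape: (batch,)


def weighted_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                      weights, beta: float = 0.1) -> torch.Tensor:
    """Per-sample weighted DPO loss; `weights` come from the gap estimator."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_sample = -F.logsigmoid(margin)    # standard DPO term for each pair
    return (weights * per_sample).mean()  # meta-weighted aggregation


if __name__ == "__main__":
    torch.manual_seed(0)
    batch = 8
    # Hypothetical per-pair features, e.g. reward margin and policy/reference log-prob gaps.
    feats = torch.randn(batch, 4)
    estimator = GapEstimator()
    w = estimator(feats)  # larger weight -> pair judged more useful to optimize on

    # Dummy log-probabilities standing in for policy and reference model outputs.
    logp_c, logp_r = torch.randn(batch), torch.randn(batch)
    ref_c, ref_r = torch.randn(batch), torch.randn(batch)
    loss = weighted_dpo_loss(logp_c, logp_r, ref_c, ref_r, w)
    print(f"meta-weighted DPO loss: {loss.item():.4f}")
```

In this sketch the estimator's weights only reweight the loss; in the paper they additionally guide which prompts are sent for targeted online generation, a step omitted here for brevity.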