Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
September 27, 2025
Authors: Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng
cs.AI
Abstract
Preference optimization is crucial for aligning large language models (LLMs)
with human values and intentions. A significant challenge in this process is
the distribution mismatch between pre-collected offline preference data and the
evolving model policy. Existing methods attempt to reduce this gap using static
heuristics or decoupled online sampling strategies, but they often fail to
adapt to the model's dynamic learning state. To bridge this gap, we propose
Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework
that dynamically couples data generation with model training. MetaAPO employs a
lightweight meta-learner as an "alignment gap estimator" to evaluate the
potential benefits of on-policy sampling relative to offline data. This
guides targeted online generation and assigns sample-wise meta-weights to the
optimization objective, dynamically balancing the quality and distribution of
online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench
demonstrate that MetaAPO consistently outperforms existing preference
optimization approaches across various settings, while reducing online
annotation costs by 42%.
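
The core idea described in the abstract, a lightweight meta-learner that assigns sample-wise weights to a preference-optimization objective mixing offline and on-policy data, can be sketched as follows. This is a hedged illustration, not the paper's implementation: the feature set, the `GapEstimator` architecture, and the use of a DPO-style loss are assumptions made purely for exposition.

```python
# Minimal sketch (not the authors' code) of meta-weighted preference optimization:
# a small "gap estimator" network scores each preference pair, and its output is
# used as a per-sample weight on a DPO-style objective over offline and online pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GapEstimator(nn.Module):
    """Lightweight meta-learner mapping per-pair features to a weight in (0, 1)."""

    def __init__(self, feat_dim: int = 4, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid()
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)  # shape: (batch,)


def weighted_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                      weights, beta: float = 0.1) -> torch.Tensor:
    """Per-sample weighted DPO loss; `weights` come from the gap estimator."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_sample = -F.logsigmoid(margin)    # standard DPO term for each pair
    return (weights * per_sample).mean()  # meta-weighted aggregation


if __name__ == "__main__":
    torch.manual_seed(0)
    batch = 8
    # Hypothetical per-pair features, e.g. reward margin and policy/reference log-prob gaps.
    feats = torch.randn(batch, 4)
    estimator = GapEstimator()
    w = estimator(feats)  # larger weight -> pair judged more useful to optimize on

    # Dummy log-probabilities standing in for policy and reference model outputs.
    logp_c, logp_r = torch.randn(batch), torch.randn(batch)
    ref_c, ref_r = torch.randn(batch), torch.randn(batch)
    loss = weighted_dpo_loss(logp_c, logp_r, ref_c, ref_r, w)
    print(f"meta-weighted DPO loss: {loss.item():.4f}")
```

In this sketch the estimator's weights only reweight the loss; in the paper they additionally guide which prompts are sent for targeted online generation, a step omitted here for brevity.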