Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Abstract

Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods try to reduce this gap with static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO uses a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefit of on-policy sampling relative to offline data. This estimate guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard, and MT-Bench show that MetaAPO consistently outperforms existing preference optimization approaches across various settings while reducing online annotation costs by 42%.
