Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
September 27, 2025
Authors: Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng
cs.AI
Abstract
Preference optimization is crucial for aligning large language models (LLMs)
with human values and intentions. A significant challenge in this process is
the distribution mismatch between pre-collected offline preference data and the
evolving model policy. Existing methods attempt to reduce this gap using static
heuristics or decoupled online sampling strategies, but they often fail to
adapt to the model's dynamic learning state. To bridge this gap, we propose
Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework
that dynamically couples data generation with model training. MetaAPO employs a
lightweight meta-learner as an "alignment gap estimator" to evaluate the
potential benefit of on-policy sampling relative to offline data. This
guides targeted online generation and assigns sample-wise meta-weights to the
optimization objective, dynamically balancing the quality and distribution of
online and offline data. Experiments on AlpacaEval 2, Arena-Hard, and MT-Bench
demonstrate that MetaAPO consistently outperforms existing preference
optimization approaches across various settings, while reducing online
annotation costs by 42%.
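
To make the meta-weighting idea concrete, the following is a minimal sketch, not the authors' implementation: a small "alignment gap estimator" network produces a per-sample weight that blends an offline and an on-policy DPO-style preference loss. The class and function names, the input features, and the sigmoid weighting are illustrative assumptions; the meta-level update that would train the estimator itself is omitted.

# Hypothetical sketch (assumed names/features, not the paper's code):
# a per-sample meta-weight blends offline and on-policy DPO-style losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentGapEstimator(nn.Module):
    """Tiny meta-learner mapping per-sample features (e.g. log-prob
    margins) to a weight in (0, 1). Feature choice is an assumption."""
    def __init__(self, in_dim: int = 2, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps the meta-weight in (0, 1); shape (batch,)
        return torch.sigmoid(self.net(feats)).squeeze(-1)

def dpo_loss(margin_policy: torch.Tensor, margin_ref: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Per-sample DPO loss, given chosen-minus-rejected log-prob
    margins under the policy and the reference model."""
    return -F.logsigmoid(beta * (margin_policy - margin_ref))

def meta_weighted_loss(off_margin, on_margin, off_ref, on_ref,
                       estimator, feats):
    """Blend offline and on-policy losses with sample-wise meta-weights.
    Here a larger weight means a larger estimated gain from the
    on-policy sample (an assumed convention)."""
    w = estimator(feats)                          # (batch,)
    loss_off = dpo_loss(off_margin, off_ref)      # (batch,)
    loss_on = dpo_loss(on_margin, on_ref)         # (batch,)
    return ((1.0 - w) * loss_off + w * loss_on).mean()

# Toy usage with random margins for a batch of 4 preference pairs
est = AlignmentGapEstimator()
feats = torch.randn(4, 2)  # e.g. [offline margin, online margin] per sample
loss = meta_weighted_loss(torch.randn(4), torch.randn(4),
                          torch.randn(4), torch.randn(4), est, feats)
loss.backward()

In the paper's framework the estimator itself is trained with a meta-objective and also gates which prompts receive targeted online generation; that outer loop is not shown in this sketch.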