Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
February 24, 2025
作者: Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, Roy Ka-wei Lee
cs.AI
Abstract
Iterative data generation and model retraining are widely used to align large
language models (LLMs). This process typically involves a policy model to generate
on-policy responses and a reward model to guide training data selection. Direct
Preference Optimization (DPO) further enhances this process by constructing
preference pairs of chosen and rejected responses. In this work, we aim to
scale up the number of on-policy samples via repeated random sampling to
improve alignment performance. Conventional practice selects the sample with
the highest reward as chosen and the lowest as rejected for DPO. However, our
experiments reveal that this strategy leads to a decline in performance
as the sample size increases. To address this, we investigate preference data
construction through the lens of the underlying normal distribution of sample
rewards. We categorize the reward space into seven representative points and
systematically explore all 21 (C_7^2) pairwise combinations. Through
evaluations on four models using AlpacaEval 2, we find that selecting the
rejected response at the reward position μ − 2σ, rather than at the minimum
reward, is crucial for optimal performance. Finally, we introduce a scalable
preference data construction strategy that consistently enhances model
performance as the sample scale increases.
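
A minimal sketch of the pair-construction rule described in the abstract, assuming a hypothetical `policy_model.generate` / `reward_model.score` interface and a fixed sample budget; it takes the highest-reward sample as the chosen response (the conventional choice the abstract contrasts against) and picks the rejected response nearest to μ − 2σ of the empirical reward distribution. This is an illustrative reconstruction, not the paper's exact recipe.

```python
import numpy as np

def build_preference_pair(prompt, policy_model, reward_model, n_samples=32):
    """Illustrative sketch of reward-distribution-based pair construction.

    `policy_model.generate` and `reward_model.score` are hypothetical
    interfaces standing in for your own sampling and scoring code.
    """
    # 1. Repeated random sampling of on-policy responses.
    responses = [policy_model.generate(prompt) for _ in range(n_samples)]

    # 2. Score every response with the reward model.
    rewards = np.array([reward_model.score(prompt, r) for r in responses])

    # 3. Empirical normal statistics of the sample rewards.
    mu, sigma = rewards.mean(), rewards.std()

    # 4. Chosen: the highest-reward sample (conventional choice; the paper's
    #    final scalable strategy may use a different representative point).
    chosen = responses[int(rewards.argmax())]

    # 5. Rejected: the sample whose reward is closest to mu - 2*sigma,
    #    rather than the minimum-reward sample.
    target = mu - 2.0 * sigma
    rejected = responses[int(np.abs(rewards - target).argmin())]

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

The resulting {prompt, chosen, rejected} triples can then be fed to a standard DPO training loop; the key difference from the conventional max/min pairing is only in step 5.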