Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
February 24, 2025
作者: Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, Roy Ka-wei Lee
cs.AI
Abstract
Iterative data generation and model retraining are widely used to align large
language models (LLMs). This process typically involves a policy model to generate
on-policy responses and a reward model to guide training data selection. Direct
Preference Optimization (DPO) further enhances this process by constructing
preference pairs of chosen and rejected responses. In this work, we aim to
scale up the number of on-policy samples via repeated random sampling to
improve alignment performance. Conventional practice selects the sample with
the highest reward as chosen and the lowest as rejected for DPO. However, our
experiments reveal that this strategy leads to a decline in performance
as the sample size increases. To address this, we investigate preference data
construction through the lens of the underlying normal distribution of sample
rewards. We categorize the reward space into seven representative points and
systematically explore all 21 (C_7^2) pairwise combinations. Through
evaluations on four models using AlpacaEval 2, we find that selecting the
rejected response at the reward position μ − 2σ, rather than at the minimum
reward, is crucial for optimal performance. Finally, we introduce a scalable
preference data construction strategy that consistently enhances model
performance as the sample scale increases.
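
A minimal sketch of the pair-construction rule described in the abstract, assuming a hypothetical `policy_model.generate` / `reward_model.score` interface and a fixed sample budget; it takes the highest-reward sample as the chosen response (the conventional choice the abstract contrasts against) and picks the rejected response nearest to μ − 2σ of the empirical reward distribution. This is an illustrative reconstruction, not the paper's exact recipe.

```python
import numpy as np

def build_preference_pair(prompt, policy_model, reward_model, n_samples=32):
    """Illustrative sketch of reward-distribution-based pair construction.

    `policy_model.generate` and `reward_model.score` are hypothetical
    interfaces standing in for your own sampling and scoring code.
    """
    # 1. Repeated random sampling of on-policy responses.
    responses = [policy_model.generate(prompt) for _ in range(n_samples)]

    # 2. Score every response with the reward model.
    rewards = np.array([reward_model.score(prompt, r) for r in responses])

    # 3. Empirical normal statistics of the sample rewards.
    mu, sigma = rewards.mean(), rewards.std()

    # 4. Chosen: the highest-reward sample (conventional choice; the paper's
    #    final scalable strategy may use a different representative point).
    chosen = responses[int(rewards.argmax())]

    # 5. Rejected: the sample whose reward is closest to mu - 2*sigma,
    #    rather than the minimum-reward sample.
    target = mu - 2.0 * sigma
    rejected = responses[int(np.abs(rewards - target).argmin())]

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

The resulting {prompt, chosen, rejected} triples can then be fed to a standard DPO training loop; the key difference from the conventional max/min pairing is only in step 5.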