
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

May 5, 2025
Authors: Tianjian Li, Daniel Khashabi
cs.AI

Abstract

Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data may be task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths in preference optimization: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on open-ended tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SIMPLEMIX, an approach that combines the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SIMPLEMIX substantially improves language model alignment. Specifically, SIMPLEMIX improves upon on-policy DPO and off-policy DPO by an average of 6.03% on Alpaca Eval 2.0. Moreover, it outperforms prior, more complex approaches to combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05%.
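
Since the abstract describes SIMPLEMIX as simply mixing on-policy and off-policy preference pairs before standard preference training, the sketch below illustrates that data-mixing step. It is a minimal illustration under stated assumptions: the function name simple_mix, its parameters, the 50/50 default ratio, and the prompt/chosen/rejected record schema are hypothetical and not taken from the paper.

```python
# Minimal sketch of the data-mixing step described in the abstract: build one
# preference-training set from an on-policy pool (pairs whose responses were
# sampled from the current policy and then ranked) and an off-policy pool
# (pre-collected pairs). The mixing ratio and schema here are assumptions,
# not the authors' exact recipe.
import random
from typing import Dict, List, Optional

Pair = Dict[str, str]  # {"prompt": ..., "chosen": ..., "rejected": ...}

def simple_mix(on_policy: List[Pair],
               off_policy: List[Pair],
               on_policy_fraction: float = 0.5,
               total: Optional[int] = None,
               seed: int = 0) -> List[Pair]:
    """Draw `on_policy_fraction` of the pairs from the on-policy pool and the
    rest from the off-policy pool, then shuffle them into a single dataset."""
    rng = random.Random(seed)
    if total is None:
        total = 2 * min(len(on_policy), len(off_policy))
    n_on = int(round(total * on_policy_fraction))
    n_off = total - n_on
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed
```

The mixed list can then be fed to an ordinary DPO training loop (for example, TRL's DPOTrainer) exactly as a single preference dataset would be; the abstract does not suggest any change to the preference-learning objective itself.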