

SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

May 5, 2025
作者: Tianjian Li, Daniel Khashabi
cs.AI

Abstract

Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data may be task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths in preference optimization: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on open-ended tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SIMPLEMIX, an approach that combines the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SIMPLEMIX substantially improves language model alignment. Specifically, SIMPLEMIX improves upon on-policy DPO and off-policy DPO by an average of 6.03% on Alpaca Eval 2.0. Moreover, it outperforms prior, more complex approaches to combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05%.
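
As the abstract describes, SIMPLEMIX combines the two data sources by simply mixing them before running standard preference optimization. Below is a minimal sketch of such a mixing step, assuming each preference pair is a dict with "prompt", "chosen", and "rejected" fields; the function name, field names, and the 50/50 default ratio are illustrative assumptions, not the authors' released implementation.

```python
import random

def mix_preference_data(on_policy, off_policy, on_fraction=0.5, seed=0):
    """Return a shuffled mixture of on- and off-policy preference pairs.

    `on_fraction` is the share of the mixture drawn from the on-policy
    source (responses sampled from the current policy and ranked), the
    rest comes from the off-policy source (pre-collected pairs).
    """
    rng = random.Random(seed)
    # Largest mixture achievable at the requested ratio without repeating pairs.
    total = int(min(len(on_policy) / max(on_fraction, 1e-9),
                    len(off_policy) / max(1.0 - on_fraction, 1e-9)))
    n_on = min(len(on_policy), int(total * on_fraction))
    n_off = min(len(off_policy), total - n_on)
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed  # pass unchanged to any standard DPO training loop
```

The mixed dataset is then optimized with the usual DPO objective; the abstract's comparison against HyPO and DPO-Mix-P suggests the benefit comes from the data composition itself rather than from a more complex training procedure.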
