Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Abstract

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by optimizing human preferences directly, without an explicit reward model. We find that during DPO training, the reference model acts as a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on the AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
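To make the "data weight adjuster" observation concrete, the sketch below computes the per-sample scaling factor that appears in the gradient of the standard DPO loss, sigmoid(r(y_l) - r(y_w)) with implicit reward r(y) = beta * (log pi(y) - log pi_ref(y)). It is a minimal illustration under stated assumptions, not the Pre-DPO implementation: the function name `dpo_sample_weight`, the scalar sequence log-probability inputs, and the default `beta = 0.1` are hypothetical choices made for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_sample_weight(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Scaling factor on one preference pair in the standard DPO gradient.

    weight = sigmoid(r_rejected - r_chosen),
    where r(y) = beta * (log pi(y) - log pi_ref(y)).
    """
    # How strongly each model prefers the chosen response over the rejected one.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # r_rejected - r_chosen simplifies to beta * (ref_margin - policy_margin).
    return sigmoid(beta * (ref_margin - policy_margin))


# Reference identical to the policy (the usual DPO initialization):
# both margins cancel, so the pair gets weight 0.5 regardless of its content.
same_init = dpo_sample_weight(-10.0, -12.0, -10.0, -12.0)

# A reference that prefers the chosen response more strongly than the policy
# does pushes the weight above 0.5; a weaker preference pushes it below.
stronger_ref = dpo_sample_weight(-10.0, -12.0, -8.0, -14.0)
weaker_ref = dpo_sample_weight(-10.0, -12.0, -11.0, -11.5)
```

With identical policy and reference, every pair starts at the same weight of 0.5, which is one way to read the abstract's point about inefficient data utilization. Changing only the reference log-probabilities changes the weight a pair receives, which is the lever a guiding reference model would act on.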
