Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Abstract

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
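The "data weight adjuster" role of the reference model can be made concrete with the standard DPO objective. The sketch below is illustrative only, not the authors' implementation: it computes the per-example DPO loss and the scalar factor that scales that example's gradient, given sequence log-probabilities under the policy and the reference model. The function name `dpo_loss_and_weight`, the default `beta=0.1`, and all log-probability values are assumptions chosen for illustration. The procedure Pre-DPO uses to obtain its guiding reference model is not described in the abstract, so it is not reproduced here.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def dpo_loss_and_weight(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> tuple[float, float]:
    """Return (loss, gradient weight) for one preference pair under standard DPO.

    Inputs are sequence log-probabilities log pi(y | x) of the chosen and
    rejected responses under the policy and under the reference model.
    """
    # How strongly each model prefers the chosen response over the rejected one.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp

    # Implicit reward margin used by DPO.
    logits = beta * (policy_margin - ref_margin)

    # DPO loss: -log sigmoid(implicit reward margin).
    loss = -math.log(sigmoid(logits))

    # The gradient of the loss is scaled by sigmoid(-logits), so the
    # reference margin directly shifts how much this example contributes.
    weight = sigmoid(-logits)
    return loss, weight


# Policy and reference identical (the usual DPO initialization):
# every example starts with the same weight of 0.5.
loss_same, w_same = dpo_loss_and_weight(-10.0, -12.0, -10.0, -12.0)

# Hypothetical reference that prefers the chosen response more strongly
# than the policy does: the example is up-weighted.
_, w_up = dpo_loss_and_weight(-10.0, -12.0, -8.0, -14.0)

# Hypothetical reference that prefers the rejected response:
# the example is down-weighted.
_, w_down = dpo_loss_and_weight(-10.0, -12.0, -13.0, -9.0)

print(w_same, w_up, w_down)
```

With identical policy and reference models the weight is uniform across examples at initialization, which is one way to read the abstract's point about inefficient data utilization. A reference model that differs from the policy makes the weights vary per example from the first training step.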
