Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Abstract

Self-Rewarding Language Models propose an architecture in which a large language model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: because chosen and rejected responses improve in sync, the representational difference between contrasting samples progressively narrows, which undermines effective preference learning. We propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces (1) Anchored Rejection, which fixes rejected responses using the outputs of the past initial model, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when models are trained with our method rather than with Self-Rewarding under the same computational resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75 points. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
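
To make the two phases concrete, the following is a minimal sketch of how a single preference pair might be assembled for DPO. It is not the authors' implementation: the function `build_temporal_pair`, the `Generator` and `Judge` type aliases, and the `PreferencePair` container are hypothetical names introduced here. The selection rules (lowest-judged past output as rejected, highest-judged current-or-future output as chosen) are assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (not from the paper):
# a generator maps a prompt to candidate responses,
# a judge maps (prompt, response) to a scalar score.
Generator = Callable[[str], List[str]]
Judge = Callable[[str, str], float]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_temporal_pair(
    prompt: str,
    past_model: Generator,
    current_model: Generator,
    future_model: Generator,
    judge: Judge,
) -> PreferencePair:
    """Assemble one (chosen, rejected) pair from three model generations."""
    # Anchored Rejection: draw the rejected response from the past
    # (initial) model. Taking the lowest-scoring output is an assumption.
    past_outputs = past_model(prompt)
    rejected = min(past_outputs, key=lambda r: judge(prompt, r))

    # Future-Guided Chosen: pool candidates from the current and the
    # next-generation model. Taking the highest-scoring one is an assumption.
    candidates = current_model(prompt) + future_model(prompt)
    chosen = max(candidates, key=lambda r: judge(prompt, r))

    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)
```

In this sketch the rejected side stays tied to the initial model across iterations while the chosen side can keep improving, which is one way to read the abstract's goal of keeping the gap between contrasting samples from narrowing.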
