

Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

August 8, 2025
Authors: Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang
cs.AI

Abstract

Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection - fixing rejected responses using the past initial model's outputs, and (2) Future-Guided Chosen - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama 3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding under the same computational budget. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75 points. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
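To make the dual-phase pairing scheme concrete, the snippet below is a minimal, self-contained Python sketch of the idea as stated in the abstract, not the authors' implementation. All helper names (`initial_generate`, `future_generate`, `judge`, `build_temporal_pairs`) are hypothetical stubs, and the exact way next-generation predictions guide the chosen sample is assumed here to be best-of-N selection under an LLM-as-a-Judge score.

```python
# Sketch of the temporal pair construction described in the abstract.
# Model calls are toy stubs; in practice they would be LLM generations
# scored by an LLM-as-a-Judge prompt, and the pairs would feed iterative DPO.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # curated with guidance from the next-generation model
    rejected: str  # anchored to the frozen initial model's output


def build_temporal_pairs(
    prompts: List[str],
    initial_generate: Callable[[str], str],       # past: frozen initial model (Anchored Rejection)
    future_generate: Callable[[str], List[str]],  # future-guided candidate responses
    judge: Callable[[str, str], float],           # LLM-as-a-Judge reward, stubbed here
) -> List[PreferencePair]:
    pairs = []
    for prompt in prompts:
        # (1) Anchored Rejection: the rejected response always comes from the
        #     past initial model, so the chosen-rejected gap does not collapse.
        rejected = initial_generate(prompt)

        # (2) Future-Guided Chosen: keep the highest-scoring candidate among
        #     responses guided by the next-generation model's predictions
        #     (best-of-N selection is an assumption of this sketch).
        candidates = future_generate(prompt)
        chosen = max(candidates, key=lambda resp: judge(prompt, resp))

        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins for real model calls, just to show the data flow.
    demo_pairs = build_temporal_pairs(
        prompts=["Explain DPO in one sentence."],
        initial_generate=lambda p: "DPO is a training method.",
        future_generate=lambda p: [
            "DPO optimizes a policy directly from preference pairs.",
            "DPO is about preferences.",
        ],
        judge=lambda p, r: float(len(r)),  # placeholder score; a real judge would be an LLM
    )
    for pair in demo_pairs:
        print(pair)
```

The point of the sketch is the decoupling: the rejected side stays pinned to a past checkpoint while the chosen side is pulled toward future-guided generations, keeping the contrast between the two samples from shrinking across DPO iterations.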