

Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

August 8, 2025
Authors: Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang
cs.AI

Abstract

Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection, which fixes rejected responses using the past initial model's outputs, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama 3B/8B/70B) demonstrate significant improvements when training with our method compared to Self-Rewarding under the same computational budget. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75 points. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
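To make the dual-phase idea concrete, here is a minimal sketch of how the preference pairs described in the abstract could be assembled before a DPO update: the rejected response is anchored to the frozen initial (past) model, while the chosen response is selected from current-model candidates by a next-generation (future) judge. All names (`build_temporal_pairs`, the `Generator`/`Judge` callables) are hypothetical illustrations, not the authors' released code.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces, assumed for illustration only:
Generator = Callable[[str], List[str]]   # prompt -> candidate responses
Judge = Callable[[str, str], float]      # (prompt, response) -> scalar reward


def build_temporal_pairs(
    prompts: List[str],
    past_model: Generator,     # frozen initial model M0 (Anchored Rejection)
    present_model: Generator,  # current policy M_t being trained
    future_judge: Judge,       # next-generation model acting as judge
) -> List[Dict[str, str]]:
    """Assemble DPO preference pairs with past-anchored rejected responses
    and future-guided chosen responses (a sketch of the paper's framework)."""
    pairs = []
    for prompt in prompts:
        # (1) Anchored Rejection: fix the rejected response to an output of the
        #     past initial model so the chosen-rejected contrast does not collapse.
        rejected = past_model(prompt)[0]

        # (2) Future-Guided Chosen: sample candidates from the current model and
        #     keep the one the next-generation judge scores highest.
        candidates = present_model(prompt)
        chosen = max(candidates, key=lambda r: future_judge(prompt, r))

        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting `{"prompt", "chosen", "rejected"}` records can be fed to any standard DPO trainer; the key design choice this sketch highlights is that only the chosen side advances with the model lineage, keeping the learning signal wide.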