LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
October 16, 2025
Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu, Saiyong Yang, Yankai Lin
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as
a core paradigm for enhancing the reasoning capabilities of Large Language
Models (LLMs). To address the lack of verification signals at test time, prior
studies incorporate the training of the model's self-verification capability into
the standard RLVR process, thereby unifying reasoning and verification
capabilities within a single LLM. However, previous practice requires the LLM
to sequentially generate solutions and self-verifications using two separate
prompt templates, which significantly reduces efficiency. In this work, we
theoretically reveal that the closed-form solution to the RL objective of
self-verification can be reduced to a remarkably simple form: the true
reasoning reward of a solution is equal to its last-token self-rewarding score,
which is computed as the difference between the policy model's next-token
log-probability assigned to any pre-specified token at the solution's last
token and a pre-calculated constant, scaled by the KL coefficient. Based on
this insight, we propose LaSeR (Reinforcement Learning with Last-Token
Self-Rewarding), an algorithm that simply augments the original RLVR loss with
an MSE loss that aligns the last-token self-rewarding scores with verifier-based
reasoning rewards, jointly optimizing the reasoning and self-rewarding
capabilities of LLMs. The optimized self-rewarding scores can be utilized in
both training and testing to enhance model performance. Notably, our algorithm
derives these scores from the predicted next-token probability distribution of
the last token immediately after generation, incurring only the minimal extra
cost of one additional token inference. Experiments show that our method not
only improves the model's reasoning performance but also equips it with
remarkable self-rewarding capability, thereby boosting its inference-time
scaling performance.
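
As a rough illustration of the mechanism described in the abstract, the sketch below shows how a last-token self-rewarding score could be read off the model's next-token distribution and aligned with verifier-based rewards via an MSE term. This is a minimal PyTorch-style sketch under assumed names (`last_token_self_reward`, `yes_token_id`, `log_z_const`, `beta`, `lambda_aux`); it is not the authors' implementation.

```python
# Minimal sketch of the last-token self-rewarding score and the auxiliary MSE
# loss described in the abstract. All names (yes_token_id, log_z_const, beta,
# lambda_aux) are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def last_token_self_reward(logits_last, yes_token_id, log_z_const, beta):
    """Self-rewarding score r_hat = beta * (log p(pre-specified token) - C).

    logits_last: [batch, vocab] next-token logits predicted right after the
                 final token of each generated solution (one extra inference step).
    yes_token_id: any pre-specified token id whose log-probability is read off.
    log_z_const:  the pre-calculated constant C from the closed-form solution.
    beta:         the KL coefficient of the RL objective.
    """
    log_probs = F.log_softmax(logits_last, dim=-1)   # [batch, vocab]
    token_logp = log_probs[:, yes_token_id]          # [batch]
    return beta * (token_logp - log_z_const)

def laser_aux_loss(logits_last, verifier_rewards, yes_token_id, log_z_const, beta):
    """MSE term aligning self-rewarding scores with verifier-based rewards."""
    r_hat = last_token_self_reward(logits_last, yes_token_id, log_z_const, beta)
    return F.mse_loss(r_hat, verifier_rewards)

# Usage (schematic): total_loss = rlvr_loss + lambda_aux * laser_aux_loss(...),
# where rlvr_loss is the original verifiable-reward RL loss and lambda_aux
# weights the auxiliary alignment term.
```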