

Likelihood-Based Reward Designs for General LLM Reasoning

February 3, 2026
Authors: Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, Yann Ollivier
cs.AI

Abstract

Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks, and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards bring comparable or better success rates than reinforcing with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline on non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.
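
The core quantity in the abstract is the log-probability the model assigns to the reference answer, conditioned on the prompt and a sampled chain of thought, used as the RL reward. The sketch below illustrates how such a reward could be computed with a Hugging Face causal LM; it is a minimal illustration, not the authors' implementation, and the model name and function names (`log_prob_reward`) are assumptions chosen for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch (not the paper's code): score a sampled chain of thought
# by the log-probability the model assigns to the reference answer after it.
MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # placeholder model, an assumption for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def log_prob_reward(prompt: str, chain_of_thought: str, reference_answer: str) -> float:
    """Summed log-probability of the reference-answer tokens, conditioned on
    the prompt and the sampled chain of thought."""
    context = prompt + chain_of_thought
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    answer_ids = tokenizer(
        reference_answer, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # Logits at position t predict token t+1, so take the slice covering the answer tokens.
    answer_logits = logits[:, context_ids.shape[1] - 1 : -1, :]
    log_probs = torch.log_softmax(answer_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()
```

Exponentiating this sum would give the probability-based variant of the reward; as the abstract notes, that quantity vanishes for long reference answers, which is why probability-based methods such as VeriFree flatline in the non-verifiable setting while the log-probability reward does not.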