Likelihood-Based Reward Designs for General LLM Reasoning
February 3, 2026
Authors: Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, Yann Ollivier
cs.AI
Abstract
Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards yield success rates comparable to or better than reinforcing with standard binary rewards, and much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline in non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and the long, non-verifiable answer settings.
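For concreteness, the sketch below shows one way such a reward can be computed: the summed log-probability of the reference-answer tokens conditioned on the prompt and a sampled chain of thought. This is a minimal illustration, not the authors' implementation; the model id and the function name logprob_reward are placeholders, and it assumes a Hugging Face causal LM.

```python
# Minimal sketch of a log-probability reward for a sampled chain of thought.
# Assumes a Hugging Face causal LM; names below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def logprob_reward(prompt: str, cot: str, reference_answer: str) -> float:
    """Sum of log-probabilities of the reference-answer tokens,
    conditioned on the prompt and the sampled chain of thought."""
    prefix_ids = tok(prompt + cot, return_tensors="pt").input_ids
    answer_ids = tok(reference_answer, add_special_tokens=False,
                     return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)

    logits = model(input_ids).logits            # (1, seq_len, vocab)
    # Position i of the shifted logits predicts token i+1 of input_ids.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    n_ans = answer_ids.shape[1]
    answer_log_probs = log_probs[:, -n_ans:, :]  # rows predicting answer tokens
    token_logps = answer_log_probs.gather(
        -1, answer_ids.unsqueeze(-1)).squeeze(-1)

    # Probability-based variants (VeriFree-style) would instead use
    # token_logps.sum().exp(), which vanishes for long reference answers.
    return token_logps.sum().item()              # log P(answer | prompt, CoT)
```

In an RL loop, this scalar would replace the binary verifier reward for each sampled CoT; a per-token average rather than the sum is another plausible variant the paper's comparison covers under "likelihood-based reward" designs.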