
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

July 22, 2025
Authors: Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas
cs.AI

Abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
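
To make the reward structure described above concrete, here is a minimal sketch of a correctness-plus-Brier reward of the kind the abstract describes. The function name `rlcr_reward`, the equal weighting of the two terms, and the sign convention are illustrative assumptions, not the paper's reference implementation.

```python
def rlcr_reward(prediction_correct: bool, confidence: float) -> float:
    """Illustrative RLCR-style reward: a binary correctness score
    augmented with a Brier-score term on the model's verbalized confidence.

    prediction_correct: whether the model's final answer matched the reference.
    confidence: the model's stated probability (in [0, 1]) that its answer is correct.
    """
    correctness = 1.0 if prediction_correct else 0.0
    # Brier score for a binary outcome: squared error between the stated
    # confidence and the realized correctness indicator (lower is better).
    brier = (confidence - correctness) ** 2
    # Add the correctness term and subtract the Brier penalty, so calibrated
    # confidence statements are incentivized alongside accurate answers.
    return correctness - brier


# A confidently correct answer scores higher than a confidently wrong one,
# and hedging a wrong answer is penalized less than overclaiming it.
print(rlcr_reward(True, 0.9))   # 0.99
print(rlcr_reward(False, 0.9))  # -0.81
print(rlcr_reward(False, 0.3))  # -0.09
```

Because the Brier score is a bounded, proper scoring rule, reporting one's true probability of being correct maximizes the expected value of this term, which is the property the paper's theoretical result relies on.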