Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score, a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations, outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
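The reward described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the form used here (binary correctness minus the squared gap between stated confidence and correctness) is an assumption, and the function names `rlcr_reward`, `expected_reward`, and `confidence_weighted_vote` are hypothetical. The last function shows one simple reading of "confidence-weighted scaling", namely summing verbalized confidences per answer across samples; the paper's actual test-time methods may differ.

```python
from collections import defaultdict


def rlcr_reward(correct: bool, confidence: float) -> float:
    """Assumed reward: correctness minus the Brier penalty (q - y)^2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * rlcr_reward(True, confidence)
            + (1.0 - p_correct) * rlcr_reward(False, confidence))


def confidence_weighted_vote(samples):
    """Pick the answer whose samples carry the largest total confidence.

    `samples` is an iterable of (answer, confidence) pairs.
    """
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # A confidently wrong answer is penalized more than a hedged wrong answer.
    print(rlcr_reward(False, 1.0), rlcr_reward(False, 0.2))
    # With a 70% chance of being right, reporting 0.7 maximizes expected reward.
    grid = [i / 100 for i in range(101)]
    print(max(grid, key=lambda q: expected_reward(0.7, q)))
    # One confident sample can outweigh two hesitant ones.
    print(confidence_weighted_vote([("A", 0.9), ("B", 0.5), ("B", 0.3)]))
```

Under this assumed form, the expected reward for a fixed answer is maximized when the stated confidence equals the true probability of being correct, which is the property of proper scoring rules that the abstract relies on.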
