Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Abstract

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with a confidence estimate for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning (RL) approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to baselines trained to produce a single answer. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
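The abstract does not state the modified objective, so the following is only an illustrative sketch of what a set-level reward over several (answer, confidence) pairs might look like. It combines coverage of the reference answers with a Brier-score calibration penalty. The names `Candidate`, `normalize`, `set_level_reward`, and `calibration_weight` are hypothetical, and the formula is an assumption made for illustration, not the reward used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One proposed answer plus the model's stated probability of being correct."""
    answer: str
    confidence: float


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def set_level_reward(candidates, gold_answers, calibration_weight=1.0):
    """Hypothetical reward: coverage of gold answers minus a Brier penalty.

    This is an assumed, illustrative formula, not the paper's objective.
    """
    gold = {normalize(g) for g in gold_answers}
    if not candidates or not gold:
        return 0.0

    # Deduplicate so that repeating one answer earns no extra credit.
    unique = {}
    for cand in candidates:
        unique.setdefault(normalize(cand.answer), cand.confidence)

    # Coverage: fraction of gold answers that appear in the candidate set.
    coverage = sum(1 for g in gold if g in unique) / len(gold)

    # Brier score: mean squared gap between confidence and correctness (0 or 1).
    brier = sum(
        (conf - (1.0 if ans in gold else 0.0)) ** 2
        for ans, conf in unique.items()
    ) / len(unique)

    return coverage - calibration_weight * brier


if __name__ == "__main__":
    proposals = [Candidate("Paris", 0.9), Candidate("Lyon", 0.1)]
    print(set_level_reward(proposals, ["paris"]))
```

Under this assumed formula, a set that covers every reference answer with well-matched confidences scores close to 1, while a single confidently wrong answer scores -1, so both missing answers and overconfidence are penalized at the level of the whole set.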
