

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

March 25, 2026
作者: Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
cs.AI

Abstract

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single-answer-trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
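The abstract does not spell out the modified RL objective, but the core idea — rewarding a *set* of candidate answers emitted in one pass rather than a single answer — can be sketched with a toy set-level reward: a coverage term (did any candidate match the gold answer?) minus a size penalty so the policy is not rewarded for enumerating every guess. This is a minimal illustration under those assumptions, not the authors' actual objective; the names `set_reward` and `size_penalty` are hypothetical.

```python
def set_reward(candidates: list[str], gold: str, size_penalty: float = 0.05) -> float:
    """Toy set-level reward for multi-answer RL (illustrative, not the paper's).

    Coverage term: 1.0 if any candidate in the generated set matches the
    gold answer, else 0.0. The penalty scales with set size, so a small
    set that covers the answer beats both a miss and a padded set.
    """
    coverage = 1.0 if gold in candidates else 0.0
    return coverage - size_penalty * len(candidates)


# A two-candidate set covering the gold answer earns a positive reward;
# a single wrong answer earns a (slightly) negative one.
r_hit = set_reward(["pneumonia", "bronchitis"], "pneumonia")
r_miss = set_reward(["asthma"], "pneumonia")
```

Under a reward of this shape, a policy trained with any standard policy-gradient method is pushed toward sets that are diverse enough to cover the true answer but small enough to stay token-efficient — the trade-off the abstract contrasts with best-of-k sampling.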