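Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Abstract

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. This is generally not a problem for benchmark-style evaluations that assume one correct answer, but many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with a confidence estimate for each one, and without computationally intensive repeated sampling to reach non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to baselines trained to produce a single answer. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information are available at https://multi-answer-rl.github.io/.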
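As a rough illustration of the set-level quantities named above (diversity, coverage, calibration), the sketch below scores one list of candidate answers with stated confidences. The function names, the string normalization, and the use of a Brier-style score are assumptions made for illustration; they are not the paper's definitions or its RL reward.

```python
from typing import Sequence, Set


def _norm(answer: str) -> str:
    # Hypothetical normalization: trim whitespace and ignore case.
    return answer.strip().lower()


def distinct_fraction(answers: Sequence[str]) -> float:
    """Share of candidates that remain unique after normalization."""
    if not answers:
        return 0.0
    normalized = [_norm(a) for a in answers]
    return len(set(normalized)) / len(normalized)


def coverage(answers: Sequence[str], valid: Set[str]) -> float:
    """1.0 if at least one candidate is a valid answer, else 0.0."""
    return 1.0 if any(_norm(a) in valid for a in answers) else 0.0


def brier_score(answers: Sequence[str],
                confidences: Sequence[float],
                valid: Set[str]) -> float:
    """Mean squared gap between each stated confidence and 0/1 correctness."""
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have equal length")
    if not answers:
        return 0.0
    errors = [
        (c - (1.0 if _norm(a) in valid else 0.0)) ** 2
        for a, c in zip(answers, confidences)
    ]
    return sum(errors) / len(errors)


# Toy example: three candidate diagnoses, one of which is valid.
candidates = ["Flu", "flu ", "Cold"]
valid_answers = {"cold"}
diversity = distinct_fraction(candidates)
covered = coverage(candidates, valid_answers)
calibration_error = brier_score(["flu", "cold"], [0.8, 0.2], valid_answers)
```

In this toy case two of the three candidates collapse to the same normalized string, so the distinct fraction drops below 1 even though the set still covers the valid answer. A lower Brier-style score would indicate confidences that track correctness more closely.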
