The Majority is not always right: RL training for solution aggregation

Abstract

Scaling up test-time compute by generating multiple independent solutions and selecting or aggregating among them has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. Most prior work aggregates solutions with simple majority voting or reward-model ranking, approaches that may yield only limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, which allows the model to learn to recover both minority-but-correct answers and easy majority-correct answers. Empirically, our method, AggLM, outperforms strong rule-based and reward-model baselines across multiple benchmarks. Furthermore, it generalizes effectively to solutions from different models, including models stronger than those in the training data, while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
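To make the ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows the majority-voting baseline, a binary verifiable reward, and one possible way to balance easy and hard training examples. It assumes that an "easy" example is one where the majority answer is already correct and a "hard" example is one where it is not. The function names, the dictionary fields `answers` and `truth`, and the `easy_keep_fraction` parameter are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def verifiable_reward(predicted, ground_truth):
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if predicted == ground_truth else 0.0


def is_easy(answers, ground_truth):
    """Assumption: an example is 'easy' when majority voting is already correct."""
    return majority_vote(answers) == ground_truth


def balance_examples(examples, easy_keep_fraction, seed=0):
    """Keep every hard example and a random fraction of the easy ones.

    `easy_keep_fraction` is a hypothetical knob; the abstract only says that
    easy and hard examples are carefully balanced, not how.
    """
    rng = random.Random(seed)
    hard = [ex for ex in examples if not is_easy(ex["answers"], ex["truth"])]
    easy = [ex for ex in examples if is_easy(ex["answers"], ex["truth"])]
    n_keep = round(len(easy) * easy_keep_fraction)
    return hard + rng.sample(easy, n_keep)


# Toy data: each example holds candidate final answers and the reference answer.
examples = [
    {"answers": ["4", "4", "5"], "truth": "4"},  # easy: majority is correct
    {"answers": ["7", "7", "9"], "truth": "9"},  # hard: correct answer is in the minority
    {"answers": ["1", "1", "1"], "truth": "1"},  # easy: unanimous and correct
]
balanced = balance_examples(examples, easy_keep_fraction=0.5)
```

In the second toy example, majority voting returns "7" and earns zero reward, while an aggregator that recovers the minority answer "9" would earn a reward of 1.0. This is the kind of case the trained aggregator is meant to handle.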
