The Majority is not always right: RL training for solution aggregation

Abstract

Scaling up test-time compute by generating multiple independent solutions and then selecting or aggregating among them has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. Most prior work aggregates solutions with simple majority voting or reward-model ranking, approaches that may yield only limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, so that the model learns both to recover minority-but-correct answers and to handle easy cases in which the majority answer is already correct. Empirically, our method, AggLM, outperforms strong rule-based and reward-model baselines across multiple benchmarks. Furthermore, it generalizes effectively to solutions from different models, including models stronger than those represented in the training data, while requiring substantially fewer tokens than majority voting over larger numbers of solutions.
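
To make the terms in the abstract concrete, the following is a minimal Python sketch of the majority-voting baseline, a binary verifiable reward, and one possible way to balance easy and hard training examples. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define "easy" and "hard" precisely, so the sketch assumes an example is easy when the majority answer matches the reference and hard otherwise. The function names and the `easy_keep_fraction` parameter are hypothetical.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def is_easy(answers, ground_truth):
    """Assumed definition: 'easy' means majority voting is already correct."""
    return majority_vote(answers) == ground_truth


def verifiable_reward(aggregated_answer, ground_truth):
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if aggregated_answer == ground_truth else 0.0


def balance_examples(examples, easy_keep_fraction, seed=0):
    """Keep every hard example and a random fraction of the easy ones.

    Each example is a (candidate_answers, ground_truth) pair.
    easy_keep_fraction is a hypothetical knob in [0, 1], not a value
    taken from the paper.
    """
    rng = random.Random(seed)
    easy = [ex for ex in examples if is_easy(*ex)]
    hard = [ex for ex in examples if not is_easy(*ex)]
    n_keep = round(len(easy) * easy_keep_fraction)
    return hard + rng.sample(easy, n_keep)
```

For instance, with the candidate answers `["3", "3", "7"]` and reference `"7"`, majority voting returns `"3"` and earns zero reward, so under the assumed definition this is a hard example: the correct answer is present but in the minority, which is the case a learned aggregator is meant to recover.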
