
The Majority is not always right: RL training for solution aggregation

September 8, 2025
作者: Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, Ilia Kulikov
cs.AI

Abstract

Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than those contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
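
To make the recipe in the abstract concrete, the following is a minimal, hypothetical Python sketch, not code from the paper: it labels each set of candidate solutions as majority-correct ("easy") or minority-correct ("hard"), mixes the two pools so that hard cases are well represented, and scores an aggregator's synthesized answer with a binary verifiable reward. The function and field names (`make_example`, `balance_easy_and_hard`, `verifiable_reward`, `final_answer`) and the answer-extraction heuristic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' released code): building balanced training
# examples for a learned aggregator and scoring its output with a verifiable reward.
from collections import Counter
from dataclasses import dataclass
import random


@dataclass
class AggregationExample:
    question: str
    candidates: list[str]       # k independently sampled solutions from a base model
    reference_answer: str       # ground truth used for the verifiable reward
    majority_is_correct: bool   # "easy" case if True, "hard" (minority-correct) if False


def final_answer(solution: str) -> str:
    """Toy heuristic: take the last token of the last non-empty line as the final answer."""
    lines = [ln.strip() for ln in solution.splitlines() if ln.strip()]
    return lines[-1].split()[-1] if lines else ""


def make_example(question: str, candidates: list[str], reference_answer: str) -> AggregationExample:
    """Label a candidate set by whether its majority-vote answer matches the reference."""
    answers = [final_answer(c) for c in candidates]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return AggregationExample(question, candidates, reference_answer,
                              majority_is_correct=(majority_answer == reference_answer))


def balance_easy_and_hard(examples: list[AggregationExample],
                          hard_fraction: float = 0.5, seed: int = 0) -> list[AggregationExample]:
    """Mix majority-correct ("easy") and minority-correct ("hard") examples so the
    aggregator is trained on both, rather than only on cases the majority already solves."""
    easy = [e for e in examples if e.majority_is_correct]
    hard = [e for e in examples if not e.majority_is_correct]
    rng = random.Random(seed)
    # Keep all (usually rarer) hard cases and downsample the easy pool to the target ratio.
    n_easy = min(len(easy), int(len(hard) * (1 - hard_fraction) / max(hard_fraction, 1e-9)))
    mixed = hard + rng.sample(easy, n_easy)
    rng.shuffle(mixed)
    return mixed


def verifiable_reward(aggregator_output: str, reference_answer: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the synthesized final answer matches the reference."""
    return 1.0 if final_answer(aggregator_output) == reference_answer else 0.0


if __name__ == "__main__":
    ex = make_example(
        question="What is 7 * 8?",
        candidates=["7 * 8 should be 54", "7 * 8 equals 56", "I believe it is 54"],
        reference_answer="56",
    )
    print(ex.majority_is_correct)  # False: the correct answer is held by the minority
    print(verifiable_reward("Checking each candidate's arithmetic, 7 * 8 = 56", "56"))  # 1.0
```

In this sketch, plain majority voting would return the most common candidate answer directly, whereas the trained aggregator is rewarded only when its reconciled answer matches the reference, which is what would let it recover minority-but-correct solutions.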