The Majority is not always right: RL training for solution aggregation

Abstract

Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may yield only limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, which allows the model to learn both to recover minority-but-correct answers and to confirm easy majority-correct answers. Empirically, we find that our method, AggLM, outperforms both strong rule-based and reward-model baselines across multiple benchmarks. Furthermore, it generalizes effectively to solutions from different models, including models stronger than those in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
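To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of the majority-voting baseline, a binary verifiable reward, and one possible way to balance easy and hard training examples. It is an illustration under stated assumptions, not the AggLM implementation: the helper names (`majority_vote`, `verifiable_reward`, `is_hard`, `balance`), the exact-match reward, the definition of "hard" as "the majority answer is wrong", and the `easy_fraction` parameter are all hypothetical and do not come from the abstract.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the candidate solutions."""
    return Counter(answers).most_common(1)[0][0]


def verifiable_reward(predicted, ground_truth):
    """Assumed reward: 1.0 if the aggregated answer matches the reference, else 0.0."""
    return 1.0 if predicted == ground_truth else 0.0


def is_hard(answers, ground_truth):
    """Assumed definition: an example is hard when majority voting gets it wrong."""
    return majority_vote(answers) != ground_truth


def balance(examples, easy_fraction, seed=0):
    """Keep every hard example and a random fraction of the easy ones.

    Each example is a dict with "answers" (final answers of the candidate
    solutions) and "truth" (the reference answer). Hypothetical schema.
    """
    rng = random.Random(seed)
    hard = [ex for ex in examples if is_hard(ex["answers"], ex["truth"])]
    easy = [ex for ex in examples if not is_hard(ex["answers"], ex["truth"])]
    keep = round(len(easy) * easy_fraction)
    return hard + rng.sample(easy, keep)


examples = [
    {"answers": ["4", "4", "5"], "truth": "4"},  # easy: the majority is correct
    {"answers": ["7", "7", "9"], "truth": "9"},  # hard: only a minority is correct
    {"answers": ["1", "1", "1"], "truth": "1"},  # easy: unanimous and correct
]
training_mix = balance(examples, easy_fraction=0.5)
```

In this sketch the second example is the case majority voting cannot solve: the correct answer appears only once among the candidates, so a trained aggregator would earn the reward only by recovering the minority answer. Keeping some easy examples in the mix is meant to reflect the abstract's point that the model should also handle majority-correct cases.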
