Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament

Abstract

Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, which limits their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, Pairwise RM takes one math problem and evaluates the correctness of two candidate solutions simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through side-by-side comparison. In the knockout tournament, Pairwise RM conducts pairwise comparisons between candidate solutions and iteratively eliminates the incorrect ones. We construct \ourdataset, a large-scale dataset of 443K pairwise comparisons derived from NuminaMath and annotated using gemini-1.5-flash, and train Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models, including a 40% to 60% relative improvement on the most challenging 50% of problems.
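The knockout procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knockout_best_of_n`, the `judge` callable signature, the random initial pairing, the bye rule for an odd candidate, and the `toy_judge` stand-in for a trained Pairwise RM are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical comparator signature: (problem, solution_a, solution_b) -> True if solution_a wins.
Judge = Callable[[str, str, str], bool]


def knockout_best_of_n(
    problem: str,
    candidates: Sequence[str],
    judge: Judge,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the survivor of a single-elimination tournament over the candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = rng or random.Random(0)
    pool: List[str] = list(candidates)
    rng.shuffle(pool)  # random initial pairing (an assumption of this sketch)
    while len(pool) > 1:
        survivors: List[str] = []
        # Pair up neighbours; the comparator decides which of the two advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            survivors.append(a if judge(problem, a, b) else b)
        if len(pool) % 2 == 1:
            survivors.append(pool[-1])  # the unpaired candidate gets a bye (assumption)
        pool = survivors
    return pool[0]


def toy_judge(problem: str, a: str, b: str) -> bool:
    """Toy stand-in for a trained pairwise model: the answer "4" always wins; ties go to a."""
    return a.strip() == "4" or b.strip() != "4"


if __name__ == "__main__":
    print(knockout_best_of_n("What is 2 + 2?", ["5", "4", "3", "22", "7"], toy_judge))
```

A single-elimination bracket over N candidates needs N - 1 pairwise comparisons, since each comparison eliminates exactly one candidate. In practice the `judge` callable would wrap a call to the trained model rather than a string check.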
