

Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament

January 22, 2025
Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
cs.AI

Abstract

Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, the Pairwise RM evaluates the correctness of two candidate solutions to a given math problem simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, the Pairwise RM conducts pairwise comparisons between candidate solutions and iteratively eliminates the incorrect ones. We construct \ourdataset, a large-scale dataset of 443K pairwise comparisons derived from NumiaMath and annotated using gemini-1.5-flash, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models, with a 40% to 60% relative improvement on the top 50% most challenging problems.
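To make the knockout procedure concrete, here is a minimal sketch in Python. The function names `knockout_best_of_n` and `pairwise_compare` are hypothetical and not from the paper's released code; `pairwise_compare` stands in for the Pairwise RM, returning which of two candidate solutions it judges correct for the given problem.

```python
# Minimal sketch of Best-of-N selection via a knockout tournament,
# assuming a hypothetical pairwise_compare(problem, a, b) -> 0 or 1
# that plays the role of the Pairwise RM.

import random
from typing import Callable, List


def knockout_best_of_n(
    problem: str,
    candidates: List[str],
    pairwise_compare: Callable[[str, str, str], int],
) -> str:
    """Repeatedly pair up candidates and keep each comparison's winner
    until a single solution remains."""
    pool = list(candidates)
    random.shuffle(pool)  # random initial bracket
    while len(pool) > 1:
        next_round = []
        # With an odd-sized pool, the last candidate gets a bye this round.
        if len(pool) % 2 == 1:
            next_round.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            winner = a if pairwise_compare(problem, a, b) == 0 else b
            next_round.append(winner)
        pool = next_round
    return pool[0]
```

Since the comparisons within a round are independent, they can be issued to the reward model in parallel, which is what makes the parallel cross-validation described in the abstract practical.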
