DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning

Abstract

In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). Correctness-based signals emphasize the identification of stepwise errors, while potential-based signals focus on the likelihood of reaching the correct final answer. We develop an automated pipeline for constructing a large-scale reward modeling dataset that contains both signals. We explore a unified multi-head architecture that trains the two reward models in a multi-task setup, and we show that learning correctness and potential in parallel is beneficial. By combining the two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that the combined reward significantly outperforms models trained on either reward type alone, achieving state-of-the-art performance under comparable resource constraints.
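As a rough illustration of what combining the two signals into a compound probability could look like, the sketch below multiplies a per-step correctness probability by a per-step potential probability. The abstract does not spell out the exact combination rule, so the product form, the function names `compound_reward` and `score_steps`, and the example values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def compound_reward(p_correct: float, p_potential: float) -> float:
    """Combine two step-level probabilities into one score.

    Assumption: the compound probability is the plain product of the
    correctness probability and the potential probability.
    """
    for p in (p_correct, p_potential):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return p_correct * p_potential


def score_steps(step_probs: List[Tuple[float, float]]) -> List[float]:
    """Score each reasoning step from (correctness, potential) pairs.

    In a multi-head reward model, each pair would come from the two
    output heads evaluated at the same step (hypothetical setup).
    """
    return [compound_reward(pc, pp) for pc, pp in step_probs]


if __name__ == "__main__":
    # Hypothetical head outputs for a three-step solution.
    steps = [(0.95, 0.80), (0.90, 0.60), (0.40, 0.70)]
    for i, score in enumerate(score_steps(steps), start=1):
        print(f"step {i}: compound score = {score:.3f}")
```

Under this assumed rule, a step scores highly only when both heads agree: a step that looks locally correct but is unlikely to lead to the right answer, or the reverse, is penalized.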
