DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning
June 21, 2025
Authors: Yuanhao Wu, Juntong Song, Hanning Zhang, Tong Zhang, Cheng Niu
cs.AI
Abstract
In this paper, we propose DuaShepherd, a novel reward modeling framework that
integrates two complementary reward signals, correctness and potential, to
enhance the mathematical reasoning capabilities of Large Language Models
(LLMs). While correctness-based signals emphasize identification of stepwise
errors, potential-based signals focus on the likelihood of reaching the correct
final answer. We developed an automated pipeline for constructing large-scale
reward modeling dataset with both signals. A unified, multi-head architecture
was explored to train the two reward models in a multi-task setup,
demonstrating benefits from learning both correctness and potential in
parallel. By combining these two signals into a compound probability, our model
achieves consistent performance improvements across multiple benchmarks.
Empirical evaluations on MATH500 and ProcessBench confirm that this combined
reward significantly outperforms models trained on either reward type alone,
achieving state-of-the-art performance under comparable resource constraints.
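The abstract does not specify the exact architecture or the formula used to combine the two signals. The sketch below is a minimal, hypothetical PyTorch illustration of the general idea: a shared per-step representation feeds two heads, one scoring stepwise correctness and one scoring the potential of reaching the correct final answer, and the two probabilities are merged into a compound reward. The class name `DualHeadRewardModel`, the input shape, and the product-of-probabilities combination are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DualHeadRewardModel(nn.Module):
    """Hypothetical two-head reward model: one head scores stepwise
    correctness, the other scores the potential of reaching the correct
    final answer. Both heads read a shared per-step hidden state."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.correctness_head = nn.Linear(hidden_size, 1)
        self.potential_head = nn.Linear(hidden_size, 1)

    def forward(self, step_hidden: torch.Tensor) -> torch.Tensor:
        # Per-step probabilities from each head.
        p_correct = torch.sigmoid(self.correctness_head(step_hidden)).squeeze(-1)
        p_potential = torch.sigmoid(self.potential_head(step_hidden)).squeeze(-1)
        # Compound reward: product of the two probabilities (one plausible
        # way to combine the signals; the abstract does not give the formula).
        return p_correct * p_potential


# Usage on dummy step representations: 2 solutions, 4 steps, hidden size 16.
model = DualHeadRewardModel(hidden_size=16)
rewards = model(torch.randn(2, 4, 16))
print(rewards.shape)  # torch.Size([2, 4])
```

In a multi-task setup, each head would be trained against its own label (step correctness vs. eventual answer correctness) while sharing the backbone representation, which is consistent with the multi-head training described in the abstract.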