DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning
June 21, 2025
Authors: Yuanhao Wu, Juntong Song, Hanning Zhang, Tong Zhang, Cheng Niu
cs.AI
Abstract
In this paper, we propose DuaShepherd, a novel reward modeling framework that
integrates two complementary reward signals, correctness and potential, to
enhance the mathematical reasoning capabilities of Large Language Models
(LLMs). While correctness-based signals emphasize identification of stepwise
errors, potential-based signals focus on the likelihood of reaching the correct
final answer. We developed an automated pipeline for constructing large-scale
reward modeling dataset with both signals. A unified, multi-head architecture
was explored to train the two reward models in a multi-task setup,
demonstrating benefits from learning both correctness and potential in
parallel. By combining these two signals into a compound probability, our model
achieves consistent performance improvements across multiple benchmarks.
Empirical evaluations on MATH500 and ProcessBench confirm that this combined
reward significantly outperforms models trained on either reward type alone,
achieving state-of-the-art performance under comparable resource constraints.
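Since the abstract only sketches the approach, the snippet below is a minimal illustration rather than the authors' implementation: it assumes a shared per-step encoder representation, two sigmoid heads for correctness and potential, a summed binary cross-entropy multi-task loss, and a simple product of the two probabilities as the compound reward. The class name, head layout, and combination rule are all assumptions made for illustration.

```python
# Illustrative sketch only; architecture details and the multiplicative
# combination rule are assumptions, not the paper's published implementation.
import torch
import torch.nn as nn


class DualHeadRewardModel(nn.Module):
    """Two scalar heads on a shared step representation: one predicts
    stepwise correctness, the other the potential of reaching the
    correct final answer."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.correctness_head = nn.Linear(hidden_size, 1)
        self.potential_head = nn.Linear(hidden_size, 1)

    def forward(self, step_hidden: torch.Tensor) -> dict:
        # step_hidden: (batch, hidden_size) representation of a reasoning
        # step, e.g. the last-token hidden state from a base LLM.
        p_correct = torch.sigmoid(self.correctness_head(step_hidden)).squeeze(-1)
        p_potential = torch.sigmoid(self.potential_head(step_hidden)).squeeze(-1)
        # Assumed combination: take the product of the two probabilities
        # as the compound step reward.
        compound = p_correct * p_potential
        return {"correct": p_correct, "potential": p_potential, "compound": compound}


if __name__ == "__main__":
    model = DualHeadRewardModel(hidden_size=768)
    # Toy batch of 4 step representations with binary labels per signal.
    h = torch.randn(4, 768)
    y_correct = torch.tensor([1.0, 0.0, 1.0, 1.0])
    y_potential = torch.tensor([1.0, 0.0, 0.0, 1.0])

    out = model(h)
    bce = nn.BCELoss()
    # Assumed multi-task objective: sum of the two binary cross-entropy losses.
    loss = bce(out["correct"], y_correct) + bce(out["potential"], y_potential)
    loss.backward()
    print(out["compound"].detach(), loss.item())
```

In this sketch the compound score is what would be used to rank or verify candidate reasoning steps at inference time, while the two heads are supervised separately during training.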