DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning

Abstract

In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of large language models (LLMs). Correctness-based signals emphasize the identification of stepwise errors, whereas potential-based signals focus on the likelihood of reaching the correct final answer. We developed an automated pipeline for constructing a large-scale reward modeling dataset that contains both signals. We also explored a unified multi-head architecture that trains the two reward models in a multi-task setup, and demonstrated the benefits of learning correctness and potential in parallel. By combining the two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that the combined reward significantly outperforms models trained on either reward type alone, achieving state-of-the-art performance under comparable resource constraints.
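The abstract does not give the exact form of the compound probability. As a minimal sketch only, the following assumes that the two per-step probabilities are multiplied, and that a candidate solution is scored by its weakest step when ranking candidates. Both choices are illustrative assumptions, and the function names `compound_step_scores`, `solution_score`, and `best_of_n` are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def compound_step_scores(
    correctness: Sequence[float], potential: Sequence[float]
) -> List[float]:
    """Combine the two signals per step.

    ASSUMPTION: the compound probability is the product of the
    correctness and potential probabilities for each step.
    """
    if len(correctness) != len(potential):
        raise ValueError("expected one correctness and one potential score per step")
    return [c * p for c, p in zip(correctness, potential)]


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate step scores into one solution-level score.

    ASSUMPTION: the minimum over steps is used, so a solution is
    judged by its weakest step. Other aggregations are possible.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one step")
    return min(step_scores)


def best_of_n(
    candidates: Sequence[Tuple[Sequence[float], Sequence[float]]]
) -> int:
    """Return the index of the candidate with the highest compound score.

    Each candidate is a (correctness_scores, potential_scores) pair.
    """
    scores = [solution_score(compound_step_scores(c, p)) for c, p in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical two-step candidate solutions.
candidates = [
    ([0.5, 1.0], [0.5, 0.5]),   # weakest compound step: 0.25
    ([1.0, 1.0], [0.5, 0.75]),  # weakest compound step: 0.5
]
chosen = best_of_n(candidates)
```

Under these assumptions the second candidate is selected, because a step that looks locally correct but has little chance of leading to the right answer, or the reverse, lowers the compound score.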
