Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant paradigm for mathematical reasoning tasks, offering stable improvements in reasoning ability. However, Outcome Reward Models (ORMs) in RLVR are too coarse-grained to detect flawed reasoning within correct answers or valid reasoning within incorrect answers. This lack of granularity introduces substantially noisy and misleading gradients and hinders further progress in the quality of the reasoning process. Process Reward Models (PRMs) offer fine-grained guidance for intermediate steps, but they frequently suffer from inaccuracies and are susceptible to reward hacking. To resolve this dilemma, we introduce PRocess cOnsistency Filter (PROF), an effective data curation method that harmonizes noisy, fine-grained process rewards with accurate, coarse-grained outcome rewards. Rather than naively blending PRM and ORM in the objective function (arXiv:2506.18896), PROF leverages their complementary strengths through consistency-driven sample selection. Our approach retains correct responses with higher averaged process values and incorrect responses with lower averaged process values, while maintaining the balance between positive and negative training samples. Extensive experiments demonstrate that our method not only consistently improves final accuracy by more than 4% compared with the blending approaches, but also strengthens the quality of intermediate reasoning steps. Code and training recipes are available at https://github.com/Chenluye99/PROF.
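The selection rule described above can be illustrated with a minimal sketch. This is not the released implementation: the `Rollout` container, the `consistency_filter` function, and the `keep_ratio` parameter are hypothetical names introduced here. The sketch assumes that each response carries a verifiable correctness flag and a non-empty list of per-step PRM scores, and that keeping the same fraction of correct and incorrect responses is an acceptable stand-in for maintaining positive/negative balance.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Rollout:
    """Hypothetical container for one sampled response."""
    text: str
    is_correct: bool          # outcome reward from the verifier
    step_scores: List[float]  # PRM score per reasoning step (assumed non-empty)


def consistency_filter(rollouts: List[Rollout], keep_ratio: float = 0.5) -> List[Rollout]:
    """Keep responses whose averaged process value agrees with their outcome.

    Correct responses are ranked by mean PRM score from high to low,
    incorrect ones from low to high. The same fraction of each group is
    kept, so the positive/negative ratio stays roughly unchanged
    (an assumption of this sketch, not a detail taken from the paper).
    """
    def avg(r: Rollout) -> float:
        return mean(r.step_scores)

    correct = sorted((r for r in rollouts if r.is_correct), key=avg, reverse=True)
    incorrect = sorted((r for r in rollouts if not r.is_correct), key=avg)

    def head(group: List[Rollout]) -> List[Rollout]:
        if not group:
            return []
        # Keep at least one sample so a non-empty group never vanishes.
        return group[:max(1, math.ceil(keep_ratio * len(group)))]

    return head(correct) + head(incorrect)


if __name__ == "__main__":
    batch = [
        Rollout("a", True, [0.9, 0.8]),   # correct, sound steps
        Rollout("b", True, [0.2, 0.3]),   # correct, weak steps
        Rollout("c", False, [0.7, 0.9]),  # incorrect, steps look plausible
        Rollout("d", False, [0.1, 0.2]),  # incorrect, steps look flawed
    ]
    print([r.text for r in consistency_filter(batch)])
```

In this sketch the PRM scores are used only to rank and select samples; they never enter the training objective, which is what separates filtering from blending the two rewards.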
