StepWiser: Stepwise Generative Judges for Wiser Reasoning

Abstract

As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) provides better judgment accuracy on intermediate steps than existing methods, (ii) can be used to improve the policy model at training time, and (iii) improves inference-time search.
