StepWiser: Stepwise Generative Judges for Wiser Reasoning
August 26, 2025
Authors: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
As models increasingly leverage multi-step reasoning strategies to solve
complex problems, supervising the logical validity of these intermediate steps
has become a critical research challenge. Process reward models address this by
providing step-by-step feedback, but current approaches have two major
drawbacks: they typically function as classifiers without providing
explanations, and their reliance on supervised fine-tuning with static datasets
limits generalization. Inspired by recent advances, we reframe stepwise reward
modeling from a classification task to a reasoning task itself. We thus propose
a generative judge that reasons about the policy model's reasoning steps (i.e.,
meta-reasons), outputting thinking tokens before delivering a final verdict.
Our model, StepWiser, is trained by reinforcement learning using relative
outcomes of rollouts. We show that it (i) provides better judgment accuracy on
intermediate steps than existing methods; (ii) can be used to improve the
policy model at training time; and (iii) improves inference-time search.
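To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the paper's implementation: a `judge_step` helper that asks a generative judge to emit thinking tokens before a final `VERDICT` tag, and a `relative_rollout_label` helper that labels a step by comparing rollout success rates with and without it, in the spirit of the relative-outcome RL signal described above. The `judge_model` and `policy_model` objects, their `generate`/`complete` methods, the `VERDICT` prompt format, and the `is_correct` checker are all hypothetical assumptions, not StepWiser's actual interface.

```python
from dataclasses import dataclass


@dataclass
class StepJudgment:
    thinking: str  # the judge's own chain of thought about the step
    verdict: bool  # final label: is the step logically valid?


def judge_step(judge_model, problem: str, steps_so_far: list[str]) -> StepJudgment:
    """Ask a generative judge to meta-reason about the latest step,
    emitting thinking tokens before delivering a final verdict."""
    prompt = (
        f"Problem: {problem}\n"
        "Solution so far:\n" + "\n".join(steps_so_far) + "\n"
        "Think about whether the last step is logically valid, "
        "then finish with 'VERDICT: good' or 'VERDICT: bad'."
    )
    text = judge_model.generate(prompt)  # hypothetical .generate() API
    thinking, _, verdict = text.rpartition("VERDICT:")
    return StepJudgment(thinking.strip(), "good" in verdict.lower())


def rollout_success_rate(policy_model, problem: str, prefix: list[str],
                         is_correct, n: int) -> float:
    """Fraction of n policy completions from `prefix` that reach a correct answer."""
    wins = 0
    for _ in range(n):
        answer = policy_model.complete(problem, prefix)  # hypothetical .complete() API
        wins += int(is_correct(problem, answer))  # task-specific answer checker (assumed)
    return wins / n


def relative_rollout_label(policy_model, problem: str, prefix: list[str],
                           step: str, is_correct, n_rollouts: int = 8) -> bool:
    """Derive a binary target for the judge by comparing rollout success
    rates with and without the candidate step (a relative comparison)."""
    base = rollout_success_rate(policy_model, problem, prefix, is_correct, n_rollouts)
    with_step = rollout_success_rate(policy_model, problem, prefix + [step],
                                     is_correct, n_rollouts)
    return with_step >= base
```

In an inference-time search setting, per-step verdicts of this kind could, for example, prune partial chains whose latest step is judged invalid before spending further rollouts on them.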