
StepWiser: Stepwise Generative Judges for Wiser Reasoning

August 26, 2025
作者: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
cs.AI

Abstract

As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) achieves better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
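
The training signal described above, scoring a step by the relative outcomes of rollouts branched from it, can be pictured with a short sketch. The Python below is a minimal illustration under stated assumptions: `sample_rollout`, `is_correct`, and the thresholding rule are hypothetical stand-ins, not the authors' exact recipe.

```python
# Minimal sketch of rollout-based step labeling (hypothetical; the paper's
# exact procedure may differ). A step is scored by how the policy's chance
# of reaching a correct final answer changes after taking that step.
from typing import Callable, List

def estimate_success_rate(
    prefix_steps: List[str],
    sample_rollout: Callable[[List[str]], str],   # assumed: policy completion from a prefix
    is_correct: Callable[[str], bool],            # assumed: checks the rollout's final answer
    num_rollouts: int = 8,
) -> float:
    """Monte Carlo estimate of the policy's success rate from this prefix."""
    wins = sum(is_correct(sample_rollout(prefix_steps)) for _ in range(num_rollouts))
    return wins / num_rollouts

def label_step(rate_before: float, rate_after: float) -> int:
    """Relative label: 1 if the step does not reduce the estimated chance
    of a correct answer, 0 otherwise. Such a binary target could then serve
    as the reward signal when training a generative judge with RL."""
    return 1 if rate_after >= rate_before else 0
```

Comparing the success rate after the step against the rate before it is what makes the signal relative: a step is judged by whether it helps or hurts the downstream chance of a correct answer, rather than by an absolute score.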