StepWiser: Stepwise Generative Judges for Wiser Reasoning
August 26, 2025
Authors: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
As models increasingly leverage multi-step reasoning strategies to solve
complex problems, supervising the logical validity of these intermediate steps
has become a critical research challenge. Process reward models address this by
providing step-by-step feedback, but current approaches have two major
drawbacks: they typically function as classifiers without providing
explanations, and their reliance on supervised fine-tuning with static datasets
limits generalization. Inspired by recent advances, we reframe stepwise reward
modeling from a classification task to a reasoning task itself. We thus propose
a generative judge that reasons about the policy model's reasoning steps (i.e.,
meta-reasons), outputting thinking tokens before delivering a final verdict.
Our model, StepWiser, is trained by reinforcement learning using relative
outcomes of rollouts. We show that it (i) provides better judgment accuracy on
intermediate steps than existing methods; (ii) can be used to improve the
policy model at training time; and (iii) improves inference-time search.
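To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the paper's implementation: a `judge_step` helper that asks a generative judge to emit thinking tokens before a final `VERDICT` tag, and a `relative_rollout_label` helper that labels a step by comparing rollout success rates with and without it, in the spirit of the relative-outcome RL signal described above. The `judge_model` and `policy_model` objects, their `generate`/`complete` methods, the `VERDICT` prompt format, and the `is_correct` checker are all hypothetical assumptions, not StepWiser's actual interface.

```python
from dataclasses import dataclass


@dataclass
class StepJudgment:
    thinking: str  # the judge's own chain of thought about the step
    verdict: bool  # final label: is the step logically valid?


def judge_step(judge_model, problem: str, steps_so_far: list[str]) -> StepJudgment:
    """Ask a generative judge to meta-reason about the latest step,
    emitting thinking tokens before delivering a final verdict."""
    prompt = (
        f"Problem: {problem}\n"
        "Solution so far:\n" + "\n".join(steps_so_far) + "\n"
        "Think about whether the last step is logically valid, "
        "then finish with 'VERDICT: good' or 'VERDICT: bad'."
    )
    text = judge_model.generate(prompt)  # hypothetical .generate() API
    thinking, _, verdict = text.rpartition("VERDICT:")
    return StepJudgment(thinking.strip(), "good" in verdict.lower())


def rollout_success_rate(policy_model, problem: str, prefix: list[str],
                         is_correct, n: int) -> float:
    """Fraction of n policy completions from `prefix` that reach a correct answer."""
    wins = 0
    for _ in range(n):
        answer = policy_model.complete(problem, prefix)  # hypothetical .complete() API
        wins += int(is_correct(problem, answer))  # task-specific answer checker (assumed)
    return wins / n


def relative_rollout_label(policy_model, problem: str, prefix: list[str],
                           step: str, is_correct, n_rollouts: int = 8) -> bool:
    """Derive a binary target for the judge by comparing rollout success
    rates with and without the candidate step (a relative comparison)."""
    base = rollout_success_rate(policy_model, problem, prefix, is_correct, n_rollouts)
    with_step = rollout_success_rate(policy_model, problem, prefix + [step],
                                     is_correct, n_rollouts)
    return with_step >= base
```

In an inference-time search setting, per-step verdicts of this kind could, for example, prune partial chains whose latest step is judged invalid before spending further rollouts on them.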