

StepWiser: Stepwise Generative Judges for Wiser Reasoning

August 26, 2025
作者: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
cs.AI

Abstract

As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) achieves better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
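The "relative outcomes of rollouts" signal can be illustrated with a minimal sketch: estimate, by Monte Carlo rollouts, how often the policy model reaches a correct answer from the partial solution before and after each step, and label the step by the relative change. Everything here is hypothetical scaffolding (the function names, the boolean `rollout_fn`, and the good/bad labeling rule are assumptions for illustration, not the paper's exact training procedure).

```python
def estimate_success_rate(prefix, rollout_fn, n=8):
    """Monte Carlo estimate: fraction of n rollouts continued from
    `prefix` that reach a correct final answer. `rollout_fn` is a
    hypothetical sampler returning True iff one rollout succeeds."""
    return sum(rollout_fn(prefix) for _ in range(n)) / n

def stepwise_label(steps, rollout_fn, n=8):
    """Label each reasoning step by the relative change in rollout
    success rate before vs. after appending the step (a sketch of
    a relative-rollout-outcome signal, not the authors' recipe)."""
    labels = []
    prefix = []
    prev = estimate_success_rate(tuple(prefix), rollout_fn, n)
    for step in steps:
        prefix.append(step)
        cur = estimate_success_rate(tuple(prefix), rollout_fn, n)
        # A step that does not lower the success rate is marked good.
        labels.append("good" if cur >= prev else "bad")
        prev = cur
    return labels
```

For example, with a toy deterministic `rollout_fn` that fails whenever the prefix contains a step named `"wrong"`, `stepwise_label(["a", "wrong", "b"], lambda p: "wrong" not in p)` flags the second step as `"bad"`. A generative judge replaces such numeric labels with natural-language reasoning followed by a verdict.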