StepWiser: Stepwise Generative Judges for Wiser Reasoning
August 26, 2025
Authors: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
As models increasingly leverage multi-step reasoning strategies to solve
complex problems, supervising the logical validity of these intermediate steps
has become a critical research challenge. Process reward models address this by
providing step-by-step feedback, but current approaches have two major
drawbacks: they typically function as classifiers without providing
explanations, and their reliance on supervised fine-tuning with static datasets
limits generalization. Inspired by recent advances, we reframe stepwise reward
modeling from a classification task to a reasoning task itself. We thus propose
a generative judge that reasons about the policy model's reasoning steps (i.e.,
meta-reasons), outputting thinking tokens before delivering a final verdict.
Our model, StepWiser, is trained by reinforcement learning using relative
outcomes of rollouts. We show that it (i) yields better judgment accuracy on
intermediate steps than existing methods; (ii) can be used to improve the
policy model at training time; and (iii) improves inference-time search.
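The phrase "relative outcomes of rollouts" suggests labeling each reasoning step by how much it changes the chance that Monte Carlo continuations from that point reach a correct answer. The toy sketch below illustrates that idea with a simulated policy; all names (`rollout_success_rate`, `label_steps`, the per-step "soundness" model) are illustrative assumptions, not details from the paper.

```python
import random

# Toy stand-in for a policy model: each reasoning step carries a hidden
# "soundness", and completing a solution from a prefix succeeds with
# probability equal to the product of the soundness of the steps taken.

def rollout_success_rate(step_soundness, prefix_len, n_rollouts=2000, rng=None):
    """Estimate the success rate of Monte Carlo continuations that start
    from the first `prefix_len` steps of the chain-of-thought."""
    rng = rng or random.Random(0)
    p_correct = 1.0
    for s in step_soundness[:prefix_len]:
        p_correct *= s
    wins = sum(rng.random() < p_correct for _ in range(n_rollouts))
    return wins / n_rollouts

def label_steps(step_soundness, threshold=-0.2):
    """Label each step by the *relative* change in rollout success rate:
    a large drop after adding a step flags it as a bad step."""
    labels = []
    prev = rollout_success_rate(step_soundness, 0)
    for i in range(1, len(step_soundness) + 1):
        cur = rollout_success_rate(step_soundness, i)
        labels.append("bad" if cur - prev < threshold else "good")
        prev = cur
    return labels

# A four-step chain where step 3 is logically flawed (soundness 0.3):
# its rollouts succeed far less often, so it is flagged.
print(label_steps([0.95, 0.9, 0.3, 0.95]))
```

Such relative labels could then serve as the reward signal when training the generative judge with reinforcement learning, rather than as direct classification targets.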