StepWiser: Stepwise Generative Judges for Wiser Reasoning
August 26, 2025
Authors: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
As models increasingly leverage multi-step reasoning strategies to solve
complex problems, supervising the logical validity of these intermediate steps
has become a critical research challenge. Process reward models address this by
providing step-by-step feedback, but current approaches have two major
drawbacks: they typically function as classifiers without providing
explanations, and their reliance on supervised fine-tuning with static datasets
limits generalization. Inspired by recent advances, we reframe stepwise reward
modeling from a classification task to a reasoning task itself. We thus propose
a generative judge that reasons about the policy model's reasoning steps (i.e.,
meta-reasons), outputting thinking tokens before delivering a final verdict.
Our model, StepWiser, is trained by reinforcement learning using relative
outcomes of rollouts. We show that it (i) yields better judgment accuracy on
intermediate steps than existing methods; (ii) can be used to improve the
policy model at training time; and (iii) improves inference-time search.
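The phrase "relative outcomes of rollouts" suggests labeling each reasoning step by how much it changes the chance that Monte Carlo continuations from that point reach a correct answer. The toy sketch below illustrates that idea with a simulated policy; all names (`rollout_success_rate`, `label_steps`, the per-step "soundness" model) are illustrative assumptions, not details from the paper.

```python
import random

# Toy stand-in for a policy model: each reasoning step carries a hidden
# "soundness", and completing a solution from a prefix succeeds with
# probability equal to the product of the soundness of the steps taken.

def rollout_success_rate(step_soundness, prefix_len, n_rollouts=2000, rng=None):
    """Estimate the success rate of Monte Carlo continuations that start
    from the first `prefix_len` steps of the chain-of-thought."""
    rng = rng or random.Random(0)
    p_correct = 1.0
    for s in step_soundness[:prefix_len]:
        p_correct *= s
    wins = sum(rng.random() < p_correct for _ in range(n_rollouts))
    return wins / n_rollouts

def label_steps(step_soundness, threshold=-0.2):
    """Label each step by the *relative* change in rollout success rate:
    a large drop after adding a step flags it as a bad step."""
    labels = []
    prev = rollout_success_rate(step_soundness, 0)
    for i in range(1, len(step_soundness) + 1):
        cur = rollout_success_rate(step_soundness, i)
        labels.append("bad" if cur - prev < threshold else "good")
        prev = cur
    return labels

# A four-step chain where step 3 is logically flawed (soundness 0.3):
# its rollouts succeed far less often, so it is flagged.
print(label_steps([0.95, 0.9, 0.3, 0.95]))
```

Such relative labels could then serve as the reward signal when training the generative judge with reinforcement learning, rather than as direct classification targets.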