SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

December 2, 2025
作者: Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng, Yang Liu
cs.AI

Abstract

Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
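The step-level aggregation of independent verifications described above can be pictured with a minimal sketch. The snippet below shows one way to combine K verifier verdicts per solution step via a self-consistency-style majority vote and to derive a ProcessBench-style "first erroneous step" target; the function name, verdict strings, and agreement threshold are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def aggregate_step_verifications(
    verdicts_per_step: List[List[str]],  # verdicts_per_step[i] = K verdicts ("correct"/"incorrect") for step i
    min_agreement: float = 0.5,          # fraction of verifiers that must flag a step as incorrect (assumed threshold)
) -> Dict[str, object]:
    """Aggregate independent step-level verdicts into labels plus the index of the first erroneous step (-1 if none)."""
    step_labels = []
    for verdicts in verdicts_per_step:
        counts = Counter(verdicts)
        incorrect_frac = counts["incorrect"] / max(len(verdicts), 1)
        step_labels.append("incorrect" if incorrect_frac > min_agreement else "correct")

    # ProcessBench-style target: the first step judged erroneous, or -1 if the solution looks clean.
    first_error = next((i for i, lab in enumerate(step_labels) if lab == "incorrect"), -1)
    return {"first_error_step": first_error, "step_labels": step_labels}


# Example: a 3-step solution, each step judged by 4 independent verifier samples.
example = [
    ["correct", "correct", "correct", "correct"],
    ["correct", "incorrect", "incorrect", "incorrect"],
    ["incorrect", "incorrect", "correct", "incorrect"],
]
print(aggregate_step_verifications(example))  # first_error_step == 1
```

Labels produced this way could then serve as synthetic supervision for fine-tuning a generative PRM, per the second stage of the framework; the actual aggregation used by SPARK may differ in its voting rule and thresholds.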