
SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

December 2, 2025
Authors: Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng, Yang Liu
cs.AI

Abstract

Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
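The second-stage recipe — aggregating several independent verifier judgments per step into synthetic training labels for the generative PRM — can be illustrated with a minimal sketch. This is an illustrative reconstruction under assumptions, not the paper's released code: the `aggregate_step_verdicts` helper, the boolean per-step verdicts, and the simple majority threshold are hypothetical choices, and the sequential-scaling (meta-critique) pass described in the abstract is not shown.

```python
def aggregate_step_verdicts(verifications, threshold=0.5):
    """Majority-vote step-level labels from multiple independent verifier runs.

    `verifications` is a list of runs; each run is a list of per-step verdicts
    (True = step judged correct, False = step judged erroneous) for the same
    candidate solution. Returns one aggregated label per step plus the vote
    fraction as a rough confidence score.
    """
    n_steps = len(verifications[0])
    assert all(len(run) == n_steps for run in verifications), "runs must align step-for-step"

    labels = []
    for step_idx in range(n_steps):
        votes = [run[step_idx] for run in verifications]
        frac_correct = sum(votes) / len(votes)
        labels.append({
            "step": step_idx,
            "label": frac_correct >= threshold,  # synthetic training label for the PRM
            "confidence": frac_correct,          # agreement among independent verifiers
        })
    return labels


# Example: 5 independent verifier runs over a 4-step candidate solution.
runs = [
    [True, True, False, False],
    [True, True, True,  False],
    [True, True, False, False],
    [True, False, False, False],
    [True, True, False, False],
]
print(aggregate_step_verdicts(runs))
```

The sketch captures only the parallel-scaling (self-consistency) half of stage one; in the paper these aggregated step labels then supervise fine-tuning of the generative PRM used as the reward model in stage three.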