ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
June 23, 2025
Authors: Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
cs.AI
Abstract
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher-quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.
Project: https://github.com/Gen-Verse/ReasonFlux
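
To make the reward-guided Best-of-N use case concrete, below is a minimal Python sketch of how a trajectory-aware reward (blending step-level and trajectory-level scores) could be used to rank sampled trajectory-response candidates. It is not the released ReasonFlux-PRM API: the Candidate structure, the two scoring functions, and the aggregation weight alpha are hypothetical placeholders standing in for the model's step-level and trajectory-level supervision signals.

# Minimal sketch (assumptions only, not the official ReasonFlux-PRM interface):
# reward-guided Best-of-N selection over trajectory-response candidates.
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    thinking_steps: List[str]   # intermediate thinking trajectory, split into steps
    response: str               # final answer produced after the trajectory

def step_level_score(step: str, response: str) -> float:
    """Hypothetical stand-in for a PRM head scoring one thinking step in [0, 1]."""
    return min(1.0, len(step) / 200.0)  # dummy heuristic, not the real model

def trajectory_level_score(steps: List[str], response: str) -> float:
    """Hypothetical stand-in for a PRM head scoring the full trajectory-response pair."""
    return min(1.0, len(steps) / 10.0)  # dummy heuristic, not the real model

def trajectory_aware_reward(c: Candidate, alpha: float = 0.5) -> float:
    """Blend the mean step-level reward with the trajectory-level reward (alpha is assumed)."""
    step_rewards = [step_level_score(s, c.response) for s in c.thinking_steps]
    mean_step = sum(step_rewards) / max(len(step_rewards), 1)
    return alpha * mean_step + (1.0 - alpha) * trajectory_level_score(c.thinking_steps, c.response)

def best_of_n(candidates: List[Candidate], alpha: float = 0.5) -> Candidate:
    """Return the sampled candidate with the highest trajectory-aware reward."""
    return max(candidates, key=lambda c: trajectory_aware_reward(c, alpha))

if __name__ == "__main__":
    samples = [
        Candidate(["Set up the equation.", "Solve for x."], "x = 4"),
        Candidate(["Guess the answer."], "x = 5"),
    ]
    print(best_of_n(samples).response)

The same reward function could, under these assumptions, also serve the offline data-selection use case by keeping only candidates whose trajectory-aware reward exceeds a chosen threshold before supervised fine-tuning.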