
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

June 23, 2025
Authors: Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
cs.AI

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on models' final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher-quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Project: https://github.com/Gen-Verse/ReasonFlux
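
To make the abstract's usage settings concrete, the minimal sketch below illustrates one way a trajectory-aware reward could blend step-level and trajectory-level PRM scores and then drive reward-guided Best-of-N selection. It is not the authors' implementation: `Candidate`, `score_step`, `score_trajectory`, and the mixing weight `alpha` are hypothetical stand-ins for the actual PRM scoring interface.

```python
# Sketch only: hypothetical interface for a trajectory-aware PRM score.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """One sampled output: the intermediate thinking trajectory plus the final response."""
    trajectory_steps: List[str]  # intermediate reasoning steps
    response: str                # final answer produced after the trajectory


def trajectory_aware_reward(
    prompt: str,
    cand: Candidate,
    score_step: Callable[[str, List[str]], float],             # hypothetical: PRM score for the latest step given prior steps
    score_trajectory: Callable[[str, List[str], str], float],  # hypothetical: PRM score for the full trajectory + response
    alpha: float = 0.5,                                        # assumed mixing weight between the two signals
) -> float:
    """Blend the mean step-level reward with a trajectory-level reward."""
    step_scores = [
        score_step(prompt, cand.trajectory_steps[: i + 1])
        for i in range(len(cand.trajectory_steps))
    ]
    step_reward = sum(step_scores) / max(len(step_scores), 1)
    traj_reward = score_trajectory(prompt, cand.trajectory_steps, cand.response)
    return alpha * step_reward + (1.0 - alpha) * traj_reward


def best_of_n(prompt, candidates, score_step, score_trajectory):
    """Reward-guided Best-of-N: keep the candidate with the highest blended reward."""
    return max(
        candidates,
        key=lambda c: trajectory_aware_reward(prompt, c, score_step, score_trajectory),
    )
```

The same blended score could, in principle, rank trajectory-response traces when filtering distillation data for supervised fine-tuning, or serve as a dense process-level reward during reinforcement learning; for the authors' actual scoring and aggregation, see the linked project repository.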