ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

June 23, 2025
Authors: Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
cs.AI

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Projects: https://github.com/Gen-Verse/ReasonFlux
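The abstract lists three uses of the PRM; the sketch below illustrates the third, reward-guided Best-of-N test-time scaling, with a scoring interface that blends step-level and trajectory-level rewards. This is a minimal sketch of the general technique, not the released ReasonFlux-PRM API: the helper names (`step_scorer`, `trajectory_scorer`), the per-step averaging, and the blending weight `alpha` are illustrative assumptions.

```python
# Minimal sketch (assumed interfaces, not the authors' implementation):
# reward-guided Best-of-N selection with a trajectory-aware scorer that
# combines step-level and trajectory-level rewards.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    thinking_steps: List[str]  # intermediate thinking trajectory (chain-of-thought steps)
    response: str              # final answer produced after the trajectory


def trajectory_aware_score(
    cand: Candidate,
    step_scorer: Callable[[str], float],              # hypothetical step-level reward head
    trajectory_scorer: Callable[[Candidate], float],  # hypothetical trajectory-level reward head
    alpha: float = 0.5,                               # assumed weighting between the two signals
) -> float:
    """Blend the mean per-step reward with a whole-trajectory reward."""
    if not cand.thinking_steps:
        return trajectory_scorer(cand)
    step_reward = sum(step_scorer(s) for s in cand.thinking_steps) / len(cand.thinking_steps)
    return alpha * step_reward + (1.0 - alpha) * trajectory_scorer(cand)


def best_of_n(candidates: List[Candidate], score_fn: Callable[[Candidate], float]) -> Candidate:
    """Best-of-N: keep the sampled trajectory-response pair with the highest reward."""
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    # Toy stand-ins for the reward heads, only to make the sketch executable.
    step_scorer = lambda step: min(1.0, len(step) / 100.0)
    trajectory_scorer = lambda cand: 1.0 if "42" in cand.response else 0.3

    samples = [
        Candidate(["Let x be the unknown.", "Solve 2x + 2 = 86."], "x = 42"),
        Candidate(["Guess an answer."], "x = 7"),
    ]
    best = best_of_n(samples, lambda c: trajectory_aware_score(c, step_scorer, trajectory_scorer))
    print(best.response)
```

The same scoring function could, in principle, serve use case (i) by ranking candidate distillation traces and keeping only the top-scoring ones for supervised fine-tuning; the paper and repository should be consulted for the actual scoring and selection procedure.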