ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are trained primarily on final model responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models such as DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision in both offline and online settings: (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher-quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release an efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Project: https://github.com/Gen-Verse/ReasonFlux
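
To make the reward-guided Best-of-N setting in (iii) concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a candidate is a list of thinking steps plus a final response, and that two caller-supplied scoring functions return step-level and trajectory-level rewards. The names `Candidate`, `aggregate_reward`, `best_of_n`, and the mixing weight `alpha` are hypothetical, and the weighted-mean aggregation is an illustrative assumption; the abstract does not specify how the two reward levels are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """One sampled trajectory-response pair (hypothetical container)."""
    steps: List[str]  # intermediate thinking trajectory, split into steps
    response: str     # final response that follows the trajectory


def aggregate_reward(
    step_rewards: Sequence[float],
    trajectory_reward: float,
    alpha: float = 0.5,
) -> float:
    """Blend the mean step-level reward with the trajectory-level reward.

    The weighted mean is an assumption made for illustration only.
    """
    if not step_rewards:
        # No step scores available: fall back to the trajectory-level reward.
        return trajectory_reward
    step_mean = sum(step_rewards) / len(step_rewards)
    return alpha * step_mean + (1.0 - alpha) * trajectory_reward


def best_of_n(
    candidates: Sequence[Candidate],
    score_steps: Callable[[Candidate], Sequence[float]],
    score_trajectory: Callable[[Candidate], float],
    alpha: float = 0.5,
) -> Candidate:
    """Return the candidate with the highest aggregated reward."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(
        candidates,
        key=lambda c: aggregate_reward(score_steps(c), score_trajectory(c), alpha),
    )
```

In practice `score_steps` and `score_trajectory` would wrap calls to a PRM. Under the same assumptions, sorting candidates by `aggregate_reward` and keeping the top-scoring ones gives a rough picture of the offline data selection in (i).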
