Reasoning-Aware GRPO using Process Mining
October 29, 2025
Authors: Taekhyun Park, Yongjae Lee, Hyerim Bae
cs.AI
Abstract
Reinforcement learning (RL)-based post-training has been crucial for enabling
multi-step reasoning in large reasoning models (LRMs), yet current reward
schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware
Group Relative Policy Optimization (GRPO) method that augments standard
answer/format rewards with signals over the reasoning procedure. To this end,
process mining techniques are utilized to compute a scalar conformance reward
that measures how closely a policy model's reasoning aligns with that of the
pretrained teacher model. Empirical results on five benchmarks demonstrate
that PM4GRPO
significantly outperforms existing methodologies for GRPO-based post-training.
These results highlight that leveraging process mining for reasoning-aware GRPO
effectively enhances the reasoning capabilities of policy models.
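As a rough illustration of the reward design described above, the following Python sketch (not the paper's implementation) combines an answer/format reward with a scalar conformance score and converts a group of rollout rewards into group-relative advantages as in standard GRPO. The conformance_reward function, the reward weights, and the step names are illustrative assumptions: a simple sequence-similarity ratio stands in here for a process-mining conformance measure (e.g., alignment-based fitness against a model discovered from teacher traces).

    # Hypothetical sketch of a conformance-augmented GRPO reward, assuming
    # reasoning traces are available as ordered lists of step labels.
    from difflib import SequenceMatcher
    from statistics import mean, pstdev

    def conformance_reward(policy_steps, teacher_steps):
        """Scalar in [0, 1] measuring how closely the policy's reasoning-step
        trace matches the teacher's trace (stand-in for a process-mining
        conformance check)."""
        if not policy_steps or not teacher_steps:
            return 0.0
        return SequenceMatcher(None, policy_steps, teacher_steps).ratio()

    def total_reward(answer_ok, format_ok, policy_steps, teacher_steps,
                     w_answer=1.0, w_format=0.2, w_conf=0.5):
        # Weights are illustrative assumptions, not values from the paper.
        return (w_answer * float(answer_ok)
                + w_format * float(format_ok)
                + w_conf * conformance_reward(policy_steps, teacher_steps))

    def group_relative_advantages(rewards, eps=1e-8):
        """GRPO-style normalization: subtract the group mean reward and
        divide by the group standard deviation."""
        mu, sigma = mean(rewards), pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]

    if __name__ == "__main__":
        teacher = ["parse", "set_equation", "solve", "verify"]
        group = [  # one sampled rollout per group member (hypothetical)
            {"answer_ok": True,  "format_ok": True,  "steps": ["parse", "set_equation", "solve", "verify"]},
            {"answer_ok": True,  "format_ok": False, "steps": ["parse", "solve"]},
            {"answer_ok": False, "format_ok": True,  "steps": ["guess"]},
        ]
        rewards = [total_reward(g["answer_ok"], g["format_ok"], g["steps"], teacher)
                   for g in group]
        print(group_relative_advantages(rewards))

In this toy setup, a rollout whose reasoning trace conforms to the teacher's receives a higher reward than one that reaches the same answer through an unrecognized trace, so the group-relative advantage also favors procedure-consistent reasoning rather than the final answer alone.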