
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

August 21, 2025
Authors: Yuanchen Zhou, Shuo Jiang, Jie Zhu, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
cs.AI

Abstract

Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
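To make the trajectory-selection idea concrete, here is a minimal Python sketch of PRM-guided Best-of-N inference. The names (`Candidate`, `score_steps`, `score_trace`) and the weighted aggregation of step-level and trajectory-level rewards are illustrative assumptions, not the paper's actual API or scoring rule.

```python
# Sketch of Best-of-N selection guided by a process reward model.
# All names and the aggregation below are hypothetical; the paper's
# Fin-PRM may combine step- and trajectory-level rewards differently.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    steps: List[str]   # intermediate reasoning steps of one sampled trace
    answer: str        # final answer extracted from the trace


def combined_score(step_rewards: List[float], trajectory_reward: float,
                   alpha: float = 0.5) -> float:
    """Blend the mean step-level reward with the trajectory-level reward."""
    mean_step = sum(step_rewards) / max(len(step_rewards), 1)
    return alpha * mean_step + (1.0 - alpha) * trajectory_reward


def best_of_n(candidates: List[Candidate],
              score_steps: Callable[[List[str]], List[float]],
              score_trace: Callable[[List[str]], float]) -> Candidate:
    """Return the candidate whose combined PRM score is highest."""
    def score(c: Candidate) -> float:
        return combined_score(score_steps(c.steps), score_trace(c.steps))
    return max(candidates, key=score)
```

The same combined score could, in principle, serve the other two applications described above: filtering traces for distillation-style supervised fine-tuning, or providing dense per-step rewards during reinforcement learning.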