Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
August 21, 2025
作者: Yuanchen Zhou, Shuo Jiang, Jie Zhu, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
cs.AI
Abstract
Process Reward Models (PRMs) have emerged as a promising framework for
supervising intermediate reasoning in large language models (LLMs), yet
existing PRMs are primarily trained on general or Science, Technology,
Engineering, and Mathematics (STEM) domains and fall short in domain-specific
contexts such as finance, where reasoning is more structured, symbolic, and
sensitive to factual and regulatory correctness. We introduce Fin-PRM,
a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate
reasoning steps in financial tasks. Fin-PRM integrates step-level and
trajectory-level reward supervision, enabling fine-grained evaluation of
reasoning traces aligned with financial logic. We apply Fin-PRM in both offline
and online reward learning settings, supporting three key applications: (i)
selecting high-quality reasoning trajectories for distillation-based supervised
fine-tuning, (ii) providing dense process-level rewards for reinforcement
learning, and (iii) guiding reward-informed Best-of-N inference at test time.
Experimental results on financial reasoning benchmarks, including CFLUE and
FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs
and strong domain baselines in trajectory selection quality. Downstream models
trained with Fin-PRM yield substantial improvements over baselines, with gains
of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in
test-time performance. These findings highlight the value of domain-specialized
reward modeling for aligning LLMs with expert-level financial reasoning. Our
project resources will be available at https://github.com/aliyun/qwen-dianjin.
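
To make the test-time use case concrete, below is a minimal sketch of reward-guided Best-of-N selection with a process reward model that combines step-level and trajectory-level scores, as the abstract describes. All function names (`generate_candidates`, `score_step`, `score_trajectory`) and the mixing weight `alpha` are hypothetical placeholders for illustration, not Fin-PRM's actual API.

```python
# Hypothetical sketch of reward-informed Best-of-N inference with a PRM.
# Not the paper's implementation; names and scoring interface are assumed.
from typing import Callable, List


def best_of_n(
    question: str,
    generate_candidates: Callable[[str, int], List[List[str]]],
    score_step: Callable[[str, List[str], int], float],
    score_trajectory: Callable[[str, List[str]], float],
    n: int = 8,
    alpha: float = 0.5,
) -> List[str]:
    """Sample N candidate reasoning trajectories and return the one with the
    highest combined step-level and trajectory-level reward."""
    candidates = generate_candidates(question, n)  # each candidate is a list of reasoning steps
    best, best_score = [], float("-inf")
    for steps in candidates:
        # Average the step-level rewards over intermediate reasoning steps.
        step_score = sum(
            score_step(question, steps, i) for i in range(len(steps))
        ) / max(len(steps), 1)
        # Trajectory-level reward scores the reasoning trace as a whole.
        traj_score = score_trajectory(question, steps)
        combined = alpha * step_score + (1 - alpha) * traj_score
        if combined > best_score:
            best, best_score = steps, combined
    return best
```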