Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

Abstract

Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs). Existing PRMs, however, are trained primarily on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
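As a rough illustration of reward-informed Best-of-N selection, the sketch below scores each candidate reasoning trajectory with a step-level and a trajectory-level reward and keeps the highest-scoring one. The abstract does not say how Fin-PRM aggregates the two signals, so the weighted average used here is an assumption, and the names `Trajectory`, `combined_score`, `best_of_n`, and the `weight` parameter are hypothetical rather than part of the Fin-PRM release.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    """One candidate reasoning trace: its intermediate steps and final answer."""
    steps: List[str]
    answer: str


def combined_score(step_scores: Sequence[float],
                   trajectory_score: float,
                   weight: float = 0.5) -> float:
    """Blend step-level and trajectory-level rewards.

    ASSUMPTION: a weighted average of the mean step score and the
    trajectory score. The source does not specify the aggregation rule.
    """
    if not step_scores:
        raise ValueError("a trajectory needs at least one scored step")
    return weight * mean(step_scores) + (1.0 - weight) * trajectory_score


def best_of_n(candidates: Sequence[Trajectory],
              score_steps: Callable[[Trajectory], Sequence[float]],
              score_trajectory: Callable[[Trajectory], float],
              weight: float = 0.5) -> Trajectory:
    """Return the candidate with the highest combined reward.

    `score_steps` and `score_trajectory` stand in for calls to a reward
    model; they are placeholders supplied by the caller.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(
        candidates,
        key=lambda t: combined_score(score_steps(t), score_trajectory(t), weight),
    )
```

In practice the two scoring callables would wrap a trained reward model. The same combined score could also serve as the filter for choosing distillation data, which is the first application listed in the abstract.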

