Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Abstract

Large Language Models (LLMs) show considerable promise in financial applications; however, existing models often fall short in scenarios that require sophisticated reasoning, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), built on the Qwen3 foundation model and specifically engineered to enhance reasoning capability, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. We evaluate our models on mainstream financial benchmarks, including Fineva, FinEval, and FinanceIQ, as well as on general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we also propose Finova, a new evaluation benchmark that focuses on agent-level financial reasoning and compliance verification. Experimental results show that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits strong general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova benchmark is available at https://github.com/antgroup/Finova.