FinForge: Semi-Synthetic Financial Benchmark Generation

January 11, 2026
Authors: Glenn Matlin, Akhil Theerthala, Anant Gupta, Anirudh JM, Rayan Castilla, Yi Mei Ng, Sudheer Chava
cs.AI

Abstract

Evaluating Language Models (LMs) in specialized, high-stakes domains such as finance remains a significant challenge due to the scarcity of open, high-quality, and domain-specific datasets. Existing general-purpose benchmarks provide broad coverage but lack the depth and domain fidelity needed to assess LMs' capabilities for real-world financial reasoning, which requires both conceptual understanding and quantitative rigor. To address this gap, we introduce FinForge, a scalable, semi-synthetic pipeline for constructing finance-specific evaluation benchmarks through a hybrid of expert-guided data curation and controlled LM-based synthesis. FinForge combines manual and programmatic corpus construction from authoritative financial sources with structured question generation and validation using Gemini 2.5 Flash. To demonstrate the pipeline's efficacy, we produce FinForge-5k, a snapshot benchmark comprising over 5,000 human-validated question-answer pairs across 11 finance subdomains, derived from a curated corpus of 100,000 verified documents totaling 143M tokens. Evaluation of state-of-the-art open-source and closed-source models on FinForge-5k reveals significant differences in financial reasoning, with leading models achieving accuracy levels near 80%. These findings underscore the framework's utility for diagnosing current model limitations and guiding future improvements in financial domain competence. All code and data are available at https://github.com/gtfintechlab/FinForge.
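
The accuracy figures mentioned above could be computed per subdomain roughly as in the following minimal sketch. It is not the authors' evaluation code: the record keys (`subdomain`, `gold`, `predicted`), the subdomain names, and the exact-match scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_subdomain(records):
    """Return (per-subdomain accuracy, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'subdomain', 'gold', and 'predicted'. Scoring is exact match,
    which is an assumption, not the benchmark's documented rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["subdomain"]] += 1
        if record["predicted"] == record["gold"]:
            correct[record["subdomain"]] += 1
    per_subdomain = {name: correct[name] / total[name] for name in total}
    # Guard against an empty input so the overall score is still defined.
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_subdomain, overall


# Toy records with made-up subdomain names and answers.
sample_records = [
    {"subdomain": "derivatives", "gold": "B", "predicted": "B"},
    {"subdomain": "derivatives", "gold": "A", "predicted": "C"},
    {"subdomain": "accounting", "gold": "D", "predicted": "D"},
    {"subdomain": "accounting", "gold": "A", "predicted": "A"},
]
per_subdomain, overall = accuracy_by_subdomain(sample_records)
```

Reporting the per-subdomain breakdown next to the overall score helps show where a model's financial reasoning is weakest, which a single aggregate number would hide.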