FinForge: Semi-Synthetic Financial Benchmark Generation
January 11, 2026
Authors: Glenn Matlin, Akhil Theerthala, Anant Gupta, Anirudh JM, Rayan Castilla, Yi Mei Ng, Sudheer Chava
cs.AI
Abstract
Evaluating Language Models (LMs) in specialized, high-stakes domains such as finance remains a significant challenge due to the scarcity of open, high-quality, and domain-specific datasets. Existing general-purpose benchmarks provide broad coverage but lack the depth and domain fidelity needed to assess LMs' capabilities for real-world financial reasoning, which requires both conceptual understanding and quantitative rigor. To address this gap, we introduce FinForge, a scalable, semi-synthetic pipeline for constructing finance-specific evaluation benchmarks through a hybrid of expert-guided data curation and controlled LM-based synthesis. FinForge combines manual and programmatic corpus construction from authoritative financial sources with structured question generation and validation using Gemini 2.5 Flash. To demonstrate the pipeline's efficacy, we produce FinForge-5k, a snapshot benchmark comprising over 5,000 human-validated question-answer pairs across 11 finance subdomains, derived from a curated corpus of 100,000 verified documents totaling 143M tokens. Evaluation of state-of-the-art open-source and closed-source models on FinForge-5k reveals significant differences in financial reasoning, with leading models achieving accuracy levels near 80%. These findings underscore the framework's utility for diagnosing current model limitations and guiding future improvements in financial domain competence. All code and data are available at https://github.com/gtfintechlab/FinForge.
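As a rough illustration of how accuracy on a benchmark of this kind might be tallied, the sketch below scores predictions overall and per subdomain. It is not taken from the FinForge repository: the record keys (`id`, `subdomain`, `answer`), the subdomain names, and the exact-match scoring rule are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_subdomain(items, predictions):
    """Return {subdomain: accuracy} using exact-match scoring.

    items: list of dicts with hypothetical keys 'id', 'subdomain', 'answer'.
    predictions: dict mapping question id -> predicted answer.
    A missing prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["subdomain"]] += 1
        if predictions.get(item["id"]) == item["answer"]:
            correct[item["subdomain"]] += 1
    return {name: correct[name] / total[name] for name in total}


def overall_accuracy(items, predictions):
    """Return the fraction of items answered correctly (0.0 if empty)."""
    if not items:
        return 0.0
    hits = sum(predictions.get(item["id"]) == item["answer"] for item in items)
    return hits / len(items)


# Toy data; the subdomain labels are placeholders, not FinForge's categories.
items = [
    {"id": "q1", "subdomain": "derivatives", "answer": "B"},
    {"id": "q2", "subdomain": "derivatives", "answer": "A"},
    {"id": "q3", "subdomain": "accounting", "answer": "C"},
]
predictions = {"q1": "B", "q2": "C", "q3": "C"}

per_subdomain = accuracy_by_subdomain(items, predictions)
overall = overall_accuracy(items, predictions)
```

Reporting the per-subdomain breakdown next to the overall figure is one way to surface the kind of model differences the abstract describes, since a single aggregate score can hide weak areas.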