BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment
January 10, 2026
Authors: Xin Guo, Rongjunchen Zhang, Guilong Lu, Xuntao Guo, Shuai Jia, Zhi Yang, Liwen Zhang
cs.AI
Abstract
Large language models (LLMs) have evolved rapidly and become a pivotal technology for intelligent financial operations. However, existing benchmarks often rely on simulated or general-purpose samples and focus on a single, offline, static scenario. As a result, they fail to reflect the authenticity and real-time responsiveness that financial services require, leaving a significant gap between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large-scale evaluation benchmark that is grounded in authentic business data from both the Chinese and U.S. equity markets and that integrates online assessment. By clustering authentic user queries from financial platforms, we derive eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert-level Q&A pairs. Experimental results show that ChatGPT-5 achieves a notable 61.5% accuracy on the main tasks, though a substantial gap relative to financial experts remains; on the online tasks, DeepSeek-R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models in practical financial business contexts. BizFinBench.v2 overcomes the limitations of current benchmarks, decomposes the financial capabilities of LLMs at the business level, and provides a precise basis for evaluating how effectively LLMs can be deployed at scale in the financial domain. The data and code are available at https://github.com/HiThink-Research/BizFinBench.v2.