BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment

January 10, 2026
Authors: Xin Guo, Rongjunchen Zhang, Guilong Lu, Xuntao Guo, Shuai Jia, Zhi Yang, Liwen Zhang
cs.AI

Abstract

Large language models have undergone rapid evolution, emerging as a pivotal technology for intelligence in financial operations. However, existing benchmarks are often constrained by pitfalls such as reliance on simulated or general-purpose samples and a focus on singular, offline static scenarios. Consequently, they fail to align with the requirements for authenticity and real-time responsiveness in financial services, leading to a significant discrepancy between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large-scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets, integrating online assessment. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert-level Q&A pairs. Experimental results demonstrate that ChatGPT-5 achieves a prominent 61.5% accuracy in main tasks, though a substantial gap relative to financial experts persists; in online tasks, DeepSeek-R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models within practical financial business contexts. BizFinBench.v2 transcends the limitations of current benchmarks, achieving a business-level deconstruction of LLM financial capabilities and providing a precise basis for evaluating efficacy in the widespread deployment of LLMs within the financial domain. The data and code are available at https://github.com/HiThink-Research/BizFinBench.v2.