
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

May 26, 2025
作者: Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu
cs.AI

Abstract

Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.

