BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
May 26, 2025
Authors: Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu
cs.AI
Abstract
Large language models excel in general tasks, yet assessing their reliability
in logic-heavy, precision-critical domains like finance, law, and healthcare
remains challenging. To address this, we introduce BizFinBench, the first
benchmark specifically designed to evaluate LLMs in real-world financial
applications. BizFinBench consists of 6,781 well-annotated queries in Chinese,
spanning five dimensions: numerical calculation, reasoning, information
extraction, prediction recognition, and knowledge-based question answering,
grouped into nine fine-grained categories. The benchmark includes both
objective and subjective metrics. We also introduce IteraJudge, a novel LLM
evaluation method that reduces bias when LLMs serve as evaluators in objective
metrics. We benchmark 25 models, including both proprietary and open-source
systems. Extensive experiments show that no model dominates across all tasks.
Our evaluation reveals distinct capability patterns: (1) In Numerical
Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while
smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning,
proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with
open-source models trailing by up to 19.49 points; (3) In Information
Extraction, the performance spread is the largest, with DeepSeek-R1 scoring
71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition,
performance variance is minimal, with top models scoring between 39.16 and
50.00. We find that while current LLMs handle routine finance queries
competently, they struggle with complex scenarios requiring cross-concept
reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future
research. The code and dataset are available at
https://github.com/HiThink-Research/BizFinBench.
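The per-dimension scores quoted above lend themselves to a simple comparison pattern: macro-average each model's scores over the dimensions it was evaluated on. The sketch below is purely illustrative and is not BizFinBench's actual evaluation harness (which lives in the linked repository); the dimension names follow the abstract, while the data layout and helper function are assumptions.

```python
# Hypothetical sketch: macro-averaging per-dimension benchmark scores.
# Dimension names follow the abstract; the scoring layout is illustrative,
# not BizFinBench's real evaluation code.

DIMENSIONS = [
    "numerical_calculation",
    "reasoning",
    "information_extraction",
    "prediction_recognition",
    "knowledge_qa",
]

def macro_average(scores: dict) -> float:
    """Average a model's scores over the dimensions it was scored on."""
    present = [scores[d] for d in DIMENSIONS if d in scores]
    if not present:
        raise ValueError("no dimension scores provided")
    return sum(present) / len(present)

# Partial, illustrative numbers taken from the abstract (DeepSeek-R1's
# reported Numerical Calculation and Information Extraction scores).
deepseek_r1 = {
    "numerical_calculation": 64.04,
    "information_extraction": 71.46,
}
print(round(macro_average(deepseek_r1), 2))  # → 67.75
```

Because no model dominates every task, a macro average like this is only a coarse summary; the per-dimension breakdown the paper reports is the more informative view.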