

BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

May 26, 2025
作者: Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu
cs.AI

Abstract

Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
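The abstract reports per-dimension scores (e.g., numerical calculation, reasoning) aggregated over annotated queries. A minimal sketch of such an aggregation is shown below; the record layout (`dimension`, `gold`, `prediction` fields) and the exact-match scoring rule are illustrative assumptions, not the paper's actual protocol, which also uses subjective metrics and the IteraJudge method. See the linked repository for the real evaluation code.

```python
# Hedged sketch: mean exact-match accuracy per evaluation dimension for a
# BizFinBench-style benchmark. The record schema here is hypothetical.
from collections import defaultdict

def dimension_scores(records):
    """Return {dimension: accuracy in [0, 100]} over exact-match hits."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [correct, total]
    for r in records:
        hit = int(r["prediction"].strip() == r["gold"].strip())
        totals[r["dimension"]][0] += hit
        totals[r["dimension"]][1] += 1
    return {d: 100.0 * c / n for d, (c, n) in totals.items()}

# Toy usage with made-up records:
sample = [
    {"dimension": "numerical calculation", "gold": "64.04", "prediction": "64.04"},
    {"dimension": "numerical calculation", "gold": "15.92", "prediction": "16.0"},
    {"dimension": "reasoning", "gold": "A", "prediction": "A"},
]
print(dimension_scores(sample))
# {'numerical calculation': 50.0, 'reasoning': 100.0}
```

Exact match is only suitable for the objective subset; free-form answers would instead be routed to an LLM judge, which is where debiasing methods like IteraJudge apply.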

