BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Abstract

Large language models excel at general tasks, yet assessing their reliability in logic-heavy, precision-critical domains such as finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions (numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering) that are grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, DeepSeek-R1 (64.04) and Claude-3.5-Sonnet (63.18) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46 while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
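To make the spread claims concrete, the following minimal sketch computes the gap between the highest and lowest score quoted in the abstract for three of the dimensions. It uses only the figures stated above; the dictionary layout and the `spread` helper are illustrative assumptions, not part of the BizFinBench codebase, and the Prediction Recognition entries are the quoted bounds for top models rather than named model scores.

```python
# Scores quoted in the abstract. Only the figures the abstract reports are
# listed, so each gap below is between the quoted values, not across all
# 25 benchmarked models.
reported = {
    "Numerical Calculation": {
        "DeepSeek-R1": 64.04,
        "Claude-3.5-Sonnet": 63.18,
        "Qwen2.5-VL-3B": 15.92,
    },
    "Information Extraction": {
        "DeepSeek-R1": 71.46,
        "Qwen3-1.7B": 11.23,
    },
    # Hypothetical labels: the abstract gives only a range for top models.
    "Prediction Recognition": {
        "top models (upper bound)": 50.00,
        "top models (lower bound)": 39.16,
    },
}


def spread(scores: dict[str, float]) -> float:
    """Return the gap between the highest and lowest quoted score."""
    return round(max(scores.values()) - min(scores.values()), 2)


spreads = {dimension: spread(scores) for dimension, scores in reported.items()}

for dimension, gap in spreads.items():
    print(f"{dimension}: {gap:.2f} points")
```

Under these assumptions, the quoted Information Extraction gap is the widest of the three and the Prediction Recognition range the narrowest, which is consistent with patterns (3) and (4) in the abstract.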
