BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Abstract

Large language models (LLMs) excel at general tasks, yet assessing their reliability in logic-heavy, precision-critical domains such as finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions (numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering) that are grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no single model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) in Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models such as Qwen2.5-VL-3B (15.92) lag significantly; (2) in Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) in Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46 and Qwen3-1.7B scoring 11.23; (4) in Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
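The gaps implied by the scores quoted above can be checked with simple arithmetic. The sketch below is illustrative only. It uses just the figures quoted in the abstract, and the names `quoted_scores` and `score_gap` are hypothetical. The Numerical Calculation pair compares the leading score with one example of a small model, and the Prediction Recognition pair is the stated range among top models, so the three gaps are not strictly like-for-like.

```python
# Scores quoted in the abstract, stored as (higher, lower) pairs.
# Assumption: these pairs are only the figures the abstract mentions,
# not the full per-model results of the benchmark.
quoted_scores = {
    "Numerical Calculation": (64.04, 15.92),   # DeepSeek-R1 vs Qwen2.5-VL-3B
    "Information Extraction": (71.46, 11.23),  # DeepSeek-R1 vs Qwen3-1.7B
    "Prediction Recognition": (50.00, 39.16),  # range among top models only
}


def score_gap(higher: float, lower: float) -> float:
    """Return the point difference between two scores, rounded to 2 decimals."""
    return round(higher - lower, 2)


# Compute the gap for each dimension.
gaps = {name: score_gap(hi, lo) for name, (hi, lo) in quoted_scores.items()}

# Print dimensions from the widest gap to the narrowest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap:.2f} points")
```

On the quoted figures, Information Extraction shows the widest gap and Prediction Recognition the narrowest, which matches the ordering described in the abstract.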
