FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models
May 14, 2026
Authors: Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov
cs.AI
Abstract
Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.
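The unified evaluation protocol mentioned above covers three answer formats. The following is only a minimal sketch of how such a dispatcher might look, not the authors' implementation: the item schema (`type`, `gold`, `question`, `reference`), the function names, the 1% relative tolerance, and the `judge` callable standing in for an LLM-as-judge call are all hypothetical assumptions made for illustration.

```python
import math
import re
from typing import Callable, Optional

# Hypothetical judge signature: (question, reference, response) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def score_multiple_choice(response: str, gold: str) -> float:
    """Score 1.0 if the response starts with the gold option letter.

    Assumes the model was asked to reply with the letter first, e.g. "B" or "(c)".
    A reply that opens with the article "A" would be misread as option A.
    """
    match = re.match(r"^\(?([A-Z])\)?(?:[.:)\s]|$)", response.strip().upper())
    return 1.0 if match and match.group(1) == gold.strip().upper() else 0.0


def score_numeric(response: str, gold: float, rel_tol: float = 0.01) -> float:
    """Score 1.0 if the last number in the response is within rel_tol of gold.

    The tolerance is an assumed value, not one taken from the benchmark.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if math.isclose(float(numbers[-1]), gold, rel_tol=rel_tol) else 0.0


def score_item(item: dict, response: str, judge: Optional[Judge] = None) -> float:
    """Route an item to the scorer that matches its (assumed) answer type."""
    kind = item["type"]
    if kind == "multiple_choice":
        return score_multiple_choice(response, item["gold"])
    if kind == "numeric":
        return score_numeric(response, float(item["gold"]))
    if kind == "open_ended":
        if judge is None:
            raise ValueError("open-ended items need a judge callable")
        return judge(item["question"], item["reference"], response)
    raise ValueError(f"unknown item type: {kind}")
```

Passing the judge in as a plain callable keeps the deterministic scorers testable without a model call; in practice the judge would wrap a prompt to a grading LLM and parse its verdict.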