FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models
May 14, 2026
Authors: Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov
cs.AI
Abstract
Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.