FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Abstract

Large language models (LLMs) are increasingly applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet evaluating the transition from foundational knowledge to expert-level financial reasoning remains an open problem. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of the financial competencies of LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for free-form answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.
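
To make the unified evaluation protocol more concrete, the following is a minimal sketch of how the three answer formats named in the abstract (multiple-choice, numerical, and short open-ended) might be scored through a single entry point. It is an illustration under stated assumptions and not the FINESSE-Bench implementation: the `Item` record, the function names, the 1% relative tolerance for numerical answers, and the `judge` callback interface are all hypothetical. The LLM-as-judge step is represented only by a caller-supplied function, since the abstract does not specify the judge model or prompt.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical judge interface: (question, reference, prediction) -> score in [0, 1].
# In practice this would wrap a call to a judge LLM; that call is not shown here.
Judge = Callable[[str, str, str], float]


@dataclass
class Item:
    kind: str       # assumed labels: "mcq", "numeric", or "open"
    question: str
    reference: str  # gold answer stored as text


def score_mcq(pred: str, ref: str) -> float:
    """Exact match on the option letter, ignoring case and surrounding whitespace."""
    return 1.0 if pred.strip().upper() == ref.strip().upper() else 0.0


def score_numeric(pred: str, ref: str, rel_tol: float = 1e-2) -> float:
    """Match within a relative tolerance (1% is an assumed default)."""
    try:
        p = float(pred.replace(",", "").strip())
        r = float(ref.replace(",", "").strip())
    except ValueError:
        return 0.0  # unparseable predictions are scored as incorrect
    return 1.0 if math.isclose(p, r, rel_tol=rel_tol) else 0.0


def score_item(item: Item, pred: str, judge: Optional[Judge] = None) -> float:
    """Dispatch one prediction to the scorer that matches the item format."""
    if item.kind == "mcq":
        return score_mcq(pred, item.reference)
    if item.kind == "numeric":
        return score_numeric(pred, item.reference)
    if item.kind == "open":
        if judge is None:
            raise ValueError("open-ended items require a judge callback")
        return judge(item.question, item.reference, pred)
    raise ValueError(f"unknown item kind: {item.kind}")
```

Keeping the judge behind a plain callback means the deterministic formats can be scored without any model call, and the judge model can be swapped or mocked in tests without touching the dispatch logic.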