FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Abstract

Large language models (LLMs) are increasingly applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for free-form answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.