From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

Abstract

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven because existing benchmarks are inadequate. These benchmarks rely solely on score-level evaluation, summarizing performance with a single score that obscures what models actually know and where their precise limitations lie. They also rely on datasets that cover only a narrow subset of financial concepts and overlook other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs. FinCDM evaluates LLMs at the knowledge-skill level, identifying which financial skills and knowledge they have or lack from their response patterns across skill-tagged tasks rather than from a single aggregated number. We also construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. Domain experts author, validate, and annotate its questions with fine-grained knowledge labels, achieving high inter-annotator agreement. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning that traditional benchmarks overlook, and uncovers behavioral clusters among models. By enabling interpretable, skill-aware diagnosis, FinCDM introduces a new paradigm for financial LLM evaluation that supports more trustworthy and targeted model development. All datasets and evaluation scripts will be publicly released to support further research.
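The contrast between a single aggregated score and a skill-level view can be illustrated with a minimal sketch. This is not FinCDM's diagnostic model, which the abstract does not specify; it only computes per-skill accuracy over skill-tagged items. The item IDs, skill names, responses, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data (not from CPA-QKA): each item is tagged with the
# skills it exercises, and each response records whether one model
# answered that item correctly.
item_skills = {
    "q1": ["tax"],
    "q2": ["tax", "regulation"],
    "q3": ["auditing"],
    "q4": ["auditing"],
}
responses = {"q1": False, "q2": False, "q3": True, "q4": True}


def aggregate_score(responses):
    """Single-number summary: fraction of items answered correctly."""
    return sum(responses.values()) / len(responses)


def per_skill_accuracy(item_skills, responses):
    """Fraction of correct answers among the items tagged with each skill."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, skills in item_skills.items():
        for skill in skills:
            total[skill] += 1
            correct[skill] += int(responses[item])
    return {skill: correct[skill] / total[skill] for skill in total}


overall = aggregate_score(responses)
profile = per_skill_accuracy(item_skills, responses)
print(overall)  # one number for the whole test
print(profile)  # one number per tagged skill
```

On this toy data the aggregate score is a middling 0.5, while the per-skill breakdown shows that every auditing item was answered correctly and every tax or regulation item was missed. A cognitive diagnosis model goes further than this simple tally by inferring latent skill mastery from the response pattern.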
