From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
August 19, 2025
Authors: Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
cs.AI
Abstract
Large Language Models (LLMs) have shown promise for financial applications,
yet their suitability for this high-stakes domain remains largely unproven
due to inadequacies in existing benchmarks. Existing benchmarks rely solely
on score-level evaluation, summarizing performance with a single score that
obscures what models truly know and where precisely they fall short. They
also rely on datasets that cover only a narrow subset of financial concepts
while overlooking other skills essential to real-world applications. To
address these gaps, we introduce FinCDM, the first cognitive diagnosis
evaluation framework tailored for financial LLMs. Rather than reducing
performance to a single aggregated number, FinCDM evaluates LLMs at the
knowledge-skill level, identifying which financial skills and knowledge a
model has or lacks from its response patterns across skill-tagged tasks. We
construct CPA-QKA, the first cognitively informed financial evaluation
dataset, derived from the Certified Public Accountant (CPA) examination and
comprehensively covering real-world accounting and financial skills. Domain
experts rigorously author, validate, and annotate its questions, yielding
high inter-annotator agreement and fine-grained knowledge labels. Extensive
experiments on 30 proprietary, open-source, and domain-specific LLMs show
that FinCDM reveals hidden knowledge gaps, identifies under-tested areas
such as tax and regulatory reasoning that traditional benchmarks overlook,
and uncovers behavioral clusters among models. By enabling interpretable,
skill-aware diagnosis, FinCDM introduces a new paradigm for financial LLM
evaluation and supports more trustworthy, targeted model development. All
datasets and evaluation scripts will be publicly released to support
further research.
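To make the skill-level diagnosis concrete: the key idea behind cognitive diagnosis is that each question carries skill tags (a Q-matrix), so a model's binary response pattern can be projected onto skills instead of being averaged into a single score. The Python sketch below is a simplified, hypothetical illustration of that idea only; the toy response matrix, the two skill names, the 0.5 mastery threshold, and the profile-based clustering are assumptions for exposition, not FinCDM's actual fitting procedure, which the abstract does not specify.

```python
from collections import defaultdict

import numpy as np

# Hypothetical toy data for illustration only.
# R[m, i] = 1 if model m answered item i correctly, else 0.
R = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

# Q[i, s] = 1 if item i is tagged with skill s (the "skill tags").
Q = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 1],
])
skills = ["tax reasoning", "regulatory reasoning"]  # illustrative labels

# Per-skill accuracy: correct answers on skill-tagged items,
# normalized by how many items carry each tag.
hits = R @ Q                    # (models x skills) correct-answer counts
totals = Q.sum(axis=0)          # number of items tagged with each skill
mastery = hits / totals         # soft mastery estimates in [0, 1]

# Threshold into binary mastery profiles -- a crude stand-in for a
# fitted cognitive diagnosis model such as DINA.
profiles = (mastery >= 0.5).astype(int)

for m, row in enumerate(profiles):
    has = [s for s, v in zip(skills, row) if v]
    lacks = [s for s, v in zip(skills, row) if not v]
    print(f"model {m}: has {has}, lacks {lacks}")

# Models sharing a mastery profile form crude "behavioral clusters".
clusters = defaultdict(list)
for m, row in enumerate(profiles):
    clusters[tuple(row)].append(m)
print(dict(clusters))
```

In practice, cognitive diagnosis models estimate mastery probabilistically (e.g., DINA- or IRT-family models) rather than by thresholding raw accuracies, but the structural move is the same: response patterns are interpreted through a skill-tagging Q-matrix rather than collapsed into one aggregate score.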