

From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

August 19, 2025
Authors: Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
cs.AI

Abstract

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks rely solely on score-level evaluation, summarizing performance with a single number that obscures what models truly know and where their precise limitations lie. They also draw on datasets that cover only a narrow subset of financial concepts, overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs. FinCDM evaluates LLMs at the knowledge-skill level, identifying which financial skills and knowledge a model has or lacks from its response patterns across skill-tagged tasks, rather than from a single aggregated score. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. Domain experts author, validate, and annotate its questions, yielding high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning that traditional benchmarks overlook, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development. All datasets and evaluation scripts will be publicly released to support further research.
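To make the core idea concrete, here is a minimal, hypothetical sketch of skill-level diagnosis from skill-tagged responses. It is not the authors' FinCDM implementation: the skill names, the toy Q-matrix (question-to-skill tags), and the binary response matrix are all invented for illustration. It shows how aggregating correctness per skill yields a mastery profile per model instead of one overall score.

```python
# Toy sketch of skill-level diagnosis from skill-tagged responses.
# All data below is hypothetical; real cognitive diagnosis would use
# expert-annotated skill tags such as those in CPA-QKA.
import numpy as np

# Hypothetical responses: 3 models x 6 questions (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1],
])

# Q-matrix: 6 questions x 3 skills; q[i, k] = 1 if question i tests skill k.
skills = ["accounting", "tax", "regulatory_reasoning"]
q_matrix = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
])

# Per-skill accuracy: for each skill, average correctness over the questions
# tagged with that skill. Result shape: (models, skills).
questions_per_skill = q_matrix.sum(axis=0)
mastery = (responses @ q_matrix) / questions_per_skill

for m, profile in enumerate(mastery):
    diagnosis = ", ".join(f"{s}={p:.2f}" for s, p in zip(skills, profile))
    print(f"model_{m}: {diagnosis}")
```

This per-skill averaging is only the crudest form of the idea; established cognitive diagnosis models (e.g., DINA or IRT variants) instead fit latent mastery variables probabilistically from the same response patterns and Q-matrix, which is the family of techniques the paper's framework builds on.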