From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
August 19, 2025
Authors: Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
cs.AI
Abstract
Large Language Models (LLMs) have shown promise for financial applications,
yet their suitability for this high-stakes domain remains largely unproven
due to inadequacies in existing benchmarks. Existing benchmarks rely solely
on score-level evaluation, summarizing performance with a single score that
obscures what models truly know and where precisely they fall short. They
also rely on datasets that cover only a narrow subset of financial concepts
while overlooking other skills essential to real-world applications. To
address these gaps, we introduce FinCDM, the first cognitive diagnosis
evaluation framework tailored for financial LLMs. Rather than reducing
performance to a single aggregated number, FinCDM evaluates LLMs at the
knowledge-skill level, identifying which financial skills and knowledge a
model has or lacks from its response patterns across skill-tagged tasks. We
construct CPA-QKA, the first cognitively informed financial evaluation
dataset, derived from the Certified Public Accountant (CPA) examination and
comprehensively covering real-world accounting and financial skills. Domain
experts rigorously author, validate, and annotate its questions, yielding
high inter-annotator agreement and fine-grained knowledge labels. Extensive
experiments on 30 proprietary, open-source, and domain-specific LLMs show
that FinCDM reveals hidden knowledge gaps, identifies under-tested areas
such as tax and regulatory reasoning that traditional benchmarks overlook,
and uncovers behavioral clusters among models. By enabling interpretable,
skill-aware diagnosis, FinCDM introduces a new paradigm for financial LLM
evaluation and supports more trustworthy, targeted model development. All
datasets and evaluation scripts will be publicly released to support
further research.
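To make the skill-level diagnosis concrete: the key idea behind cognitive diagnosis is that each question carries skill tags (a Q-matrix), so a model's binary response pattern can be projected onto skills instead of being averaged into a single score. The Python sketch below is a simplified, hypothetical illustration of that idea only; the toy response matrix, the two skill names, the 0.5 mastery threshold, and the profile-based clustering are assumptions for exposition, not FinCDM's actual fitting procedure, which the abstract does not specify.

```python
from collections import defaultdict

import numpy as np

# Hypothetical toy data for illustration only.
# R[m, i] = 1 if model m answered item i correctly, else 0.
R = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

# Q[i, s] = 1 if item i is tagged with skill s (the "skill tags").
Q = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 1],
])
skills = ["tax reasoning", "regulatory reasoning"]  # illustrative labels

# Per-skill accuracy: correct answers on skill-tagged items,
# normalized by how many items carry each tag.
hits = R @ Q                    # (models x skills) correct-answer counts
totals = Q.sum(axis=0)          # number of items tagged with each skill
mastery = hits / totals         # soft mastery estimates in [0, 1]

# Threshold into binary mastery profiles -- a crude stand-in for a
# fitted cognitive diagnosis model such as DINA.
profiles = (mastery >= 0.5).astype(int)

for m, row in enumerate(profiles):
    has = [s for s, v in zip(skills, row) if v]
    lacks = [s for s, v in zip(skills, row) if not v]
    print(f"model {m}: has {has}, lacks {lacks}")

# Models sharing a mastery profile form crude "behavioral clusters".
clusters = defaultdict(list)
for m, row in enumerate(profiles):
    clusters[tuple(row)].append(m)
print(dict(clusters))
```

In practice, cognitive diagnosis models estimate mastery probabilistically (e.g., DINA- or IRT-family models) rather than by thresholding raw accuracies, but the structural move is the same: response patterns are interpreted through a skill-tagging Q-matrix rather than collapsed into one aggregate score.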