
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

August 19, 2025
Authors: Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
cs.AI

Abstract

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.
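The abstract describes inferring which skills a model has or lacks from its response patterns over skill-tagged questions, which is the classical cognitive diagnosis setting. The paper's exact diagnostic model is not specified here, so the sketch below uses the standard DINA model as a stand-in: a hypothetical Q-matrix tags each question with the skills it requires, and a maximum-likelihood search recovers a binary skill-mastery profile from one LLM's graded answers. The skill names, Q-matrix, slip/guess parameters, and response vector are all illustrative assumptions, not CPA-QKA data.

```python
# Minimal DINA-style cognitive diagnosis sketch, in the spirit of FinCDM.
# All concrete values below (skills, Q-matrix, noise parameters, responses)
# are hypothetical stand-ins for the paper's actual CPA-QKA annotations.
from itertools import product

import numpy as np

SKILLS = ["tax", "audit", "reporting"]   # hypothetical skill tags
# Q-matrix: rows = questions, cols = skills; 1 if the question needs the skill.
Q = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
])
SLIP, GUESS = 0.1, 0.2                   # assumed per-item slip/guess rates

def dina_likelihood(responses: np.ndarray, alpha: np.ndarray) -> float:
    """P(observed responses | skill-mastery profile alpha) under DINA."""
    # eta_j = 1 iff the profile masters every skill question j requires.
    eta = np.all(alpha >= Q, axis=1).astype(int)
    p_correct = np.where(eta == 1, 1 - SLIP, GUESS)
    return float(np.prod(np.where(responses == 1, p_correct, 1 - p_correct)))

def diagnose(responses: np.ndarray) -> dict:
    """Return the maximum-likelihood mastery profile for one model's answers."""
    best_alpha, best_ll = None, -1.0
    for alpha in product([0, 1], repeat=len(SKILLS)):  # enumerate 2^K profiles
        ll = dina_likelihood(responses, np.array(alpha))
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return dict(zip(SKILLS, best_alpha))

# Example: an LLM answers the five tagged questions (1 = correct, 0 = wrong).
print(diagnose(np.array([1, 1, 0, 1, 0])))
# -> {'tax': 1, 'audit': 1, 'reporting': 0}
```

In this toy setup, the diagnostic output is a per-skill mastery profile rather than a single aggregate accuracy, which is what lets a framework like FinCDM surface gaps (for example, weak tax or regulatory reasoning) that a score-level benchmark would average away.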