
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

August 19, 2025
Authors: Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
cs.AI

Abstract

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.
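The abstract describes inferring which skills a model has or lacks from its response patterns over skill-tagged questions, which is the classical cognitive diagnosis setting. The paper's exact diagnostic model is not specified here, so the sketch below uses the standard DINA model as a stand-in: a hypothetical Q-matrix tags each question with the skills it requires, and a maximum-likelihood search recovers a binary skill-mastery profile from one LLM's graded answers. The skill names, Q-matrix, slip/guess parameters, and response vector are all illustrative assumptions, not CPA-QKA data.

```python
# Minimal DINA-style cognitive diagnosis sketch, in the spirit of FinCDM.
# All concrete values below (skills, Q-matrix, noise parameters, responses)
# are hypothetical stand-ins for the paper's actual CPA-QKA annotations.
from itertools import product

import numpy as np

SKILLS = ["tax", "audit", "reporting"]   # hypothetical skill tags
# Q-matrix: rows = questions, cols = skills; 1 if the question needs the skill.
Q = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
])
SLIP, GUESS = 0.1, 0.2                   # assumed per-item slip/guess rates

def dina_likelihood(responses: np.ndarray, alpha: np.ndarray) -> float:
    """P(observed responses | skill-mastery profile alpha) under DINA."""
    # eta_j = 1 iff the profile masters every skill question j requires.
    eta = np.all(alpha >= Q, axis=1).astype(int)
    p_correct = np.where(eta == 1, 1 - SLIP, GUESS)
    return float(np.prod(np.where(responses == 1, p_correct, 1 - p_correct)))

def diagnose(responses: np.ndarray) -> dict:
    """Return the maximum-likelihood mastery profile for one model's answers."""
    best_alpha, best_ll = None, -1.0
    for alpha in product([0, 1], repeat=len(SKILLS)):  # enumerate 2^K profiles
        ll = dina_likelihood(responses, np.array(alpha))
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return dict(zip(SKILLS, best_alpha))

# Example: an LLM answers the five tagged questions (1 = correct, 0 = wrong).
print(diagnose(np.array([1, 1, 0, 1, 0])))
# -> {'tax': 1, 'audit': 1, 'reporting': 0}
```

In this toy setup, the diagnostic output is a per-skill mastery profile rather than a single aggregate accuracy, which is what lets a framework like FinCDM surface gaps (for example, weak tax or regulatory reasoning) that a score-level benchmark would average away.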