From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

Abstract

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven because existing benchmarks are inadequate. These benchmarks rely solely on score-level evaluation: they summarize performance with a single score, which obscures what models actually know and where exactly their limitations lie. They also draw on datasets that cover only a narrow subset of financial concepts and overlook other elements essential for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs. FinCDM evaluates LLMs at the knowledge-skill level: rather than reporting a single aggregated number, it identifies which financial skills and knowledge a model has or lacks from its response patterns across skill-tagged tasks. We also construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. Domain experts author, validate, and annotate its questions with fine-grained knowledge labels, achieving high inter-annotator agreement. Extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning that traditional benchmarks overlook, and uncovers behavioral clusters among models. By enabling interpretable, skill-aware diagnosis, FinCDM introduces a new paradigm for financial LLM evaluation that supports more trustworthy and targeted model development. All datasets and evaluation scripts will be publicly released to support further research.
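
To make the contrast between a single aggregated score and a skill-level view concrete, the following minimal Python sketch compares two hypothetical models on a handful of skill-tagged items. Everything in it is an illustrative assumption: the item names, skill tags, model names, and responses are invented, and plain per-skill accuracy is used only as a crude stand-in. It is not the cognitive diagnosis model behind FinCDM, which the abstract does not specify.

```python
from collections import defaultdict

# Hypothetical skill-tagged items: each question lists the skills it exercises.
ITEM_SKILLS = {
    "q1": ["tax"],
    "q2": ["tax", "regulation"],
    "q3": ["auditing"],
    "q4": ["auditing"],
}

# Hypothetical graded responses (1 = correct, 0 = incorrect) for two models.
RESPONSES = {
    "model_a": {"q1": 1, "q2": 1, "q3": 0, "q4": 0},
    "model_b": {"q1": 0, "q2": 0, "q3": 1, "q4": 1},
}


def aggregate_score(responses):
    """Score-level view: fraction of all items answered correctly."""
    return sum(responses.values()) / len(responses)


def skill_profile(responses, item_skills):
    """Skill-level view: fraction correct on the items tagged with each skill.

    This is simple per-skill accuracy, an assumption made for illustration,
    not a fitted cognitive diagnosis model.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, score in responses.items():
        for skill in item_skills[item]:
            correct[skill] += score
            total[skill] += 1
    return {skill: correct[skill] / total[skill] for skill in total}


if __name__ == "__main__":
    for model, responses in RESPONSES.items():
        print(model, aggregate_score(responses),
              skill_profile(responses, ITEM_SKILLS))
```

In this toy setup both models share the same aggregate score, yet their per-skill profiles are opposite. That kind of difference is what a single number hides.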
