FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

October 10, 2025
Authors: Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie
cs.AI

Abstract

The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.
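
To make the notion of numerical consistency more concrete, the sketch below shows one way a parent/child summation check over reported facts might look. It is an illustration only and is not taken from the benchmark: the function name `check_calculation`, the tolerance, the fact values, and the choice of concepts are all assumptions made for this example, and the actual FinMR task definition may differ.

```python
from decimal import Decimal


def check_calculation(facts, parent, children, tolerance=Decimal("0.5")):
    """Check whether a parent fact equals the weighted sum of its children.

    facts:    mapping of concept name -> reported Decimal value (hypothetical data)
    children: list of (concept_name, weight) pairs, with weight +1 or -1
    Returns (is_consistent, expected, reported).
    """
    expected = sum(
        (Decimal(weight) * facts[name] for name, weight in children),
        Decimal(0),
    )
    reported = facts[parent]
    return abs(expected - reported) <= tolerance, expected, reported


# Made-up values for illustration; not drawn from any real filing.
facts = {
    "Assets": Decimal("1500"),
    "AssetsCurrent": Decimal("600"),
    "AssetsNoncurrent": Decimal("800"),
}
children = [("AssetsCurrent", 1), ("AssetsNoncurrent", 1)]

ok, expected, reported = check_calculation(facts, "Assets", children)
print(ok, expected, reported)
```

Here the reported total does not match the sum of its components, so the check flags the parent fact as inconsistent. A real auditing pipeline would read the summation relationships and weights from the filing's taxonomy rather than hard-coding them.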