FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
October 10, 2025
Authors: Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie
cs.AI
Abstract
The complexity of the Generally Accepted Accounting Principles (GAAP) and the
hierarchical structure of eXtensible Business Reporting Language (XBRL) filings
make financial auditing increasingly difficult to automate and verify. While
large language models (LLMs) have demonstrated strong capabilities in
unstructured text understanding, their ability to reason over structured,
interdependent, and taxonomy-driven financial documents remains largely
unexplored. To fill this gap, we introduce FinAuditing, the first
taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs
on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings,
FinAuditing defines three complementary subtasks, FinSM for semantic
consistency, FinRE for relational consistency, and FinMR for numerical
consistency, each targeting a distinct aspect of structured auditing reasoning.
We further propose a unified evaluation framework integrating retrieval,
classification, and reasoning metrics across these subtasks. Extensive
zero-shot experiments on 13 state-of-the-art LLMs reveal that current models
perform inconsistently across semantic, relational, and mathematical
dimensions, with accuracy drops of up to 60-90% when reasoning over
hierarchical multi-document structures. Our findings expose the systematic
limitations of modern LLMs in taxonomy-grounded financial reasoning and
establish FinAuditing as a foundation for developing trustworthy,
structure-aware, and regulation-aligned financial intelligence systems. The
benchmark dataset is available at Hugging Face.
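
To make the idea of numerical consistency concrete, the sketch below checks XBRL-style summation relationships, where a parent concept should equal the weighted sum of its child concepts. It is a minimal illustration, not the FinMR task definition or the authors' evaluation code: the `CalcArc` type, the `find_calc_inconsistencies` function, the tolerance value, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalcArc:
    """One summation relationship: the child contributes weight * value to the parent."""
    parent: str
    child: str
    weight: float  # typically +1.0 or -1.0 in XBRL calculation relationships


def find_calc_inconsistencies(facts, arcs, tolerance=0.5):
    """Return (parent, reported, expected) for every violated summation.

    `facts` maps concept names to reported values. Parents with a missing
    value, or with any unreported child, are skipped rather than flagged.
    """
    arcs_by_parent = {}
    for arc in arcs:
        arcs_by_parent.setdefault(arc.parent, []).append(arc)

    issues = []
    for parent, parent_arcs in arcs_by_parent.items():
        if parent not in facts:
            continue
        if any(arc.child not in facts for arc in parent_arcs):
            continue
        expected = sum(arc.weight * facts[arc.child] for arc in parent_arcs)
        if abs(facts[parent] - expected) > tolerance:
            issues.append((parent, facts[parent], expected))
    return issues


# Hypothetical figures: the reported total does not match its components.
facts = {"Assets": 1000.0, "AssetsCurrent": 400.0, "AssetsNoncurrent": 500.0}
arcs = [
    CalcArc("Assets", "AssetsCurrent", 1.0),
    CalcArc("Assets", "AssetsNoncurrent", 1.0),
]
issues = find_calc_inconsistencies(facts, arcs)
print(issues)
```

A rule-based check like this only works when the relevant facts and relationships have already been located. The benchmark, by contrast, asks an LLM to reason over the filing's interdependent documents itself, which is where the abstract reports the largest accuracy drops.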