FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
October 17, 2025
Authors: Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan, Chen Zhao
cs.AI
Abstract
Recent LLMs have demonstrated promising ability in solving finance-related
problems. However, applying LLMs in real-world finance applications remains
challenging because of the domain's high-risk, high-stakes nature. This paper
introduces FinTrust, a comprehensive benchmark specifically designed for
evaluating the trustworthiness of LLMs in finance applications. Our benchmark
focuses on a wide range of alignment issues grounded in practical contexts and
features fine-grained tasks for each dimension of trustworthiness evaluation.
We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini
perform best on most tasks, such as safety, while open-source models like
DeepSeek-V3 have an advantage in specific areas like industry-level fairness.
On challenging tasks like fiduciary alignment and disclosure, all LLMs fall
short, showing a significant gap in legal awareness. We believe that FinTrust
can be a valuable benchmark for evaluating LLMs' trustworthiness in the finance
domain.