FinToolBench: An Evaluation Platform for LLM Agents in Real-World Financial Tool Use

Abstract

The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has seen a surge of benchmarks, the financial sector, characterized by high stakes, strict compliance requirements, and rapidly changing data, remains critically underserved. Existing financial evaluations focus predominantly on static textual analysis or document-based QA and ignore the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior work limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem that couples 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success and assesses agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.