FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

Abstract

The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has seen a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior work limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.
