FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
September 16, 2025
Authors: Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang
cs.AI
Abstract
Search has emerged as core infrastructure for LLM-based agents and is widely
viewed as critical on the path toward more general intelligence. Finance is a
particularly demanding proving ground: analysts routinely conduct complex,
multi-step searches over time-sensitive, domain-specific data, making it ideal
for assessing both search proficiency and knowledge-grounded reasoning. Yet no
existing open financial datasets evaluate data searching capability of
end-to-end agents, largely because constructing realistic, complicated tasks
requires deep financial expertise, and time-sensitive data is hard to evaluate.
We present FinSearchComp, the first fully open-source agent benchmark for
realistic, open-domain financial search and reasoning. FinSearchComp comprises
three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and
Complex Historical Investigation -- that closely reproduce real-world financial
analyst workflows. To ensure difficulty and reliability, we engage 70
professional financial experts for annotation and implement a rigorous
multi-stage quality-assurance pipeline. The benchmark includes 635 questions
spanning global and Greater China markets, and we evaluate 21 models (products)
on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy.
DouBao (web) leads on the Greater China subset. Experimental analyses show that
equipping agents with web search and financial plugins substantially improves
results on FinSearchComp, and that the country of origin of models and tools
significantly impacts performance. By aligning with realistic analyst tasks and
providing end-to-end evaluation, FinSearchComp offers a professional,
high-difficulty testbed for complex financial search and reasoning.