FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

September 16, 2025
Authors: Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang
cs.AI

Abstract

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, which makes the domain ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-search capability of end-to-end agents, largely because constructing realistic, complex tasks requires deep financial expertise and because time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly affects performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
September 19, 2025