FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
September 16, 2025
Authors: Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang
cs.AI
Abstract
Search has emerged as core infrastructure for LLM-based agents and is widely
viewed as critical on the path toward more general intelligence. Finance is a
particularly demanding proving ground: analysts routinely conduct complex,
multi-step searches over time-sensitive, domain-specific data, making it ideal
for assessing both search proficiency and knowledge-grounded reasoning. Yet no
existing open financial datasets evaluate data searching capability of
end-to-end agents, largely because constructing realistic, complicated tasks
requires deep financial expertise, and time-sensitive data is hard to evaluate.
We present FinSearchComp, the first fully open-source agent benchmark for
realistic, open-domain financial search and reasoning. FinSearchComp comprises
three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and
Complex Historical Investigation -- that closely reproduce real-world financial
analyst workflows. To ensure difficulty and reliability, we engage 70
professional financial experts for annotation and implement a rigorous
multi-stage quality-assurance pipeline. The benchmark includes 635 questions
spanning global and Greater China markets, and we evaluate 21 models (products)
on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy.
DouBao (web) leads on the Greater China subset. Experimental analyses show that
equipping agents with web search and financial plugins substantially improves
results on FinSearchComp, and that the country of origin of models and tools
significantly impacts performance. By aligning with realistic analyst tasks and
providing end-to-end evaluation, FinSearchComp offers a professional,
high-difficulty testbed for complex financial search and reasoning.