FinSearchComp: Towards Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Abstract

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, which makes the domain well suited to assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-search capability of end-to-end agents, largely because constructing realistic, complex tasks requires deep financial expertise and because time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly affects performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.