FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Abstract

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, which makes the domain ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-search capability of end-to-end agents, largely because constructing realistic, complex tasks requires deep financial expertise and because time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly affects performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

