

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

March 16, 2026
Authors: Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen
cs.AI

Abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the broader research community's progress in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., both model and data) that achieves frontier-level performance through two core technical innovations: (1) fact-grounded, scalable, and controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex multi-hop reasoning tasks with controllable coverage and complexity; and (2) denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise interaction trajectories, thereby guiding teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks, including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent, DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
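The QA-synthesis pipeline described above can be pictured as a walk over a fact graph followed by an obfuscation step. The sketch below is purely illustrative and is not the paper's implementation: the toy graph, the `WEB_GRAPH` and `OBFUSCATIONS` tables, and all function names are assumptions. It only shows the shape of the idea, i.e. that hop count controls question complexity (topological expansion) and that replacing the seed entity with a descriptive clue forces multi-hop search (entity obfuscation).

```python
# Illustrative sketch of fact-grounded multi-hop QA synthesis.
# All data and names below are hypothetical, not from the paper.

# Toy web-derived fact graph: entity -> list of (relation, neighbor) facts.
WEB_GRAPH = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("member_of", "European Union")],
}

# Descriptive clues substituted for entity names (the obfuscation step).
OBFUSCATIONS = {
    "Marie Curie": "the first person to win Nobel Prizes in two sciences",
}

def expand_chain(seed, hops):
    """Topological expansion: walk `hops` facts outward from the seed."""
    chain, entity = [], seed
    for _ in range(hops):
        facts = WEB_GRAPH.get(entity, [])
        if not facts:
            break
        relation, neighbor = facts[0]  # deterministic pick for the sketch
        chain.append((entity, relation, neighbor))
        entity = neighbor
    return chain

def synthesize_qa(seed, hops=3):
    """Compose a question whose complexity is controlled by `hops`."""
    chain = expand_chain(seed, hops)
    if not chain:
        return None
    subject = OBFUSCATIONS.get(seed, seed)  # entity obfuscation
    relations = " -> ".join(rel for _, rel, _ in chain)
    question = (f"Starting from {subject}, follow the relations "
                f"[{relations}]: what entity do you reach?")
    return {"question": question, "answer": chain[-1][2], "hops": len(chain)}

qa = synthesize_qa("Marie Curie", hops=3)
# answer: "European Union" after a 3-hop chain
```

Complexity is tuned by the hop budget, and coverage by the choice of seed entities; the grounding comes from every answer being the endpoint of a verifiable fact chain.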