OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Abstract

Deep search capabilities have become indispensable for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants because transparent, high-quality training data is scarce. This persistent data scarcity has fundamentally hindered the broader research community's progress and innovation in this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., both model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded, scalable, and controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby prompting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks, including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, although trained with simple supervised fine-tuning (SFT), OpenSeeker significantly outperforms the second-best fully open-source agent, DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
