SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Abstract

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows are inherently finite. Recent work explores a paradigm in which a main agent decomposes a task and dispatches subtasks to subagents; each subagent executes its subtask and returns only a summarized result, which conserves the main agent's context budget. Doing this well requires delegation intelligence: the ability to decompose complex tasks, decide when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire the capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results in a form that supports the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, and we use them as supervised fine-tuning data to internalize delegation intelligence into the model weights. The resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.
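To make the delegation paradigm described above concrete, the following is a minimal toy sketch, not the SearchSwarm harness itself. It assumes a hard-coded task decomposition and a simulated subagent that needs no LLM or search tool; the names `run_subagent`, `MainAgent`, and `hidden_chars` are hypothetical and exist only for illustration. The sketch shows one thing: the subagent's long working trace never enters the main agent's context, only its short summary does.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def run_subagent(subtask: str, summary_limit: int = 80) -> Tuple[str, str]:
    """Toy subagent (hypothetical): returns a long working trace and a short summary.

    A real subagent would call an LLM and search tools; here the trace is
    simulated filler text so the example runs offline.
    """
    trace = " ".join(f"[step {i}] browsing for '{subtask}'" for i in range(50))
    summary = f"result for '{subtask}'"[:summary_limit]
    return trace, summary


@dataclass
class MainAgent:
    # Only these entries count against the main agent's context budget.
    context: List[str] = field(default_factory=list)
    # Size of subagent traces that were produced but never added to the context.
    hidden_chars: int = 0

    def context_chars(self) -> int:
        return sum(len(entry) for entry in self.context)

    def solve(self, task: str, subtasks: List[str]) -> str:
        # Assumption: the decomposition is given; a trained model would produce it.
        self.context.append(f"task: {task}")
        for subtask in subtasks:
            trace, summary = run_subagent(subtask)
            self.hidden_chars += len(trace)  # the trace is discarded
            self.context.append(summary)     # only the summary is kept
        return "; ".join(self.context[1:])


agent = MainAgent()
answer = agent.solve(
    "who founded X and when",
    ["founder of X", "founding year of X"],
)
print(answer)
print(agent.context_chars(), agent.hidden_chars)
```

In this toy setting the main agent's context grows by one short summary per subtask, regardless of how long each subagent's trace is. Deciding when and what to delegate, which the abstract identifies as the hard part, is deliberately left out of the sketch.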