SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Abstract

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm in which a main agent decomposes a task and dispatches subtasks to subagents, which execute them and return only summarized results, conserving the main agent's context budget. Doing this well, however, requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results in a form that supports the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, and we use them as supervised fine-tuning data to internalize delegation intelligence into the model weights. The resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.
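The delegation paradigm described in the abstract can be illustrated with a minimal toy sketch. It is not the SearchSwarm harness: every name here (`run_subagent`, `MainAgent`, `SubagentResult`, the semicolon-based decomposition, the 80-character summary limit) is a hypothetical placeholder, and no model is called. It only shows the bookkeeping idea that the verbose work stays inside the subagent while the main agent's context receives the summary alone.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SubagentResult:
    subtask: str
    summary: str          # the only part returned to the main agent
    raw_trace_chars: int  # size of the work that stayed inside the subagent


def run_subagent(subtask: str, summary_limit: int = 80) -> SubagentResult:
    """Hypothetical stand-in for a subagent: verbose work in, short summary out."""
    # Simulate a long search/browse trace that would otherwise fill the main context.
    raw_trace = "\n".join(f"[step {i}] notes on '{subtask}'" for i in range(200))
    summary = f"Finding for '{subtask}'"[:summary_limit]
    return SubagentResult(subtask, summary, len(raw_trace))


@dataclass
class MainAgent:
    context: List[str] = field(default_factory=list)

    def decompose(self, task: str) -> List[str]:
        # Placeholder decomposition (assumption): a trained model would decide this.
        return [part.strip() for part in task.split(";") if part.strip()]

    def solve(self, task: str) -> Tuple[int, int]:
        delegated_chars = 0
        for subtask in self.decompose(task):
            result = run_subagent(subtask)
            self.context.append(result.summary)  # integrate only the summary
            delegated_chars += result.raw_trace_chars
        main_chars = sum(len(entry) for entry in self.context)
        return main_chars, delegated_chars


agent = MainAgent()
main_chars, delegated_chars = agent.solve(
    "find the author; find the publication year; cross-check the venue"
)
print(main_chars, delegated_chars)
```

In this toy setup the main agent's context grows with the number and length of summaries rather than with the length of each subagent's trace. Deciding how to decompose, when to delegate, and how to use the returned summaries is the part the abstract calls delegation intelligence, and the sketch leaves it as a placeholder.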