Harness-1: Reinforcement Learning for Search Agents with a State-Externalizing Harness

Abstract

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.
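
To make the division of labor concrete, the following is a minimal sketch of environment-side working memory of the kind described above. It is not the authors' implementation: the class `SearchState`, its method names, the integer importance tags, the truncation used as a stand-in for compression, and the character-based budget are all hypothetical. The helper `curated_recall` assumes a plain set-recall reading of "curated recall" (the fraction of relevant documents that end up in the curated set); the abstract does not define the metric, so treat that as an assumption too.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class SearchState:
    """Hypothetical environment-side working memory (illustrative only)."""

    candidates: Dict[str, str] = field(default_factory=dict)  # doc_id -> compressed text
    curated: Dict[str, int] = field(default_factory=dict)     # doc_id -> importance tag
    verified: Dict[str, bool] = field(default_factory=dict)   # claim -> check outcome

    def observe(self, doc_id: str, text: str, max_chars: int = 200) -> bool:
        """Add a search hit to the candidate pool; return False for a duplicate."""
        if doc_id in self.candidates:
            return False
        # Truncation is a crude stand-in for real observation compression.
        self.candidates[doc_id] = text[:max_chars]
        return True

    def keep(self, doc_id: str, importance: int) -> None:
        """Policy decision: move a known candidate into the curated set."""
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        self.curated[doc_id] = importance

    def discard(self, doc_id: str) -> None:
        """Policy decision: drop a document from the curated set."""
        self.curated.pop(doc_id, None)

    def record_check(self, claim: str, ok: bool) -> None:
        """Store the outcome of a verification step."""
        self.verified[claim] = ok

    def render(self, budget_chars: int) -> str:
        """Render curated documents, most important first, within a budget."""
        lines, used = [], 0
        ranked = sorted(self.curated.items(), key=lambda kv: (-kv[1], kv[0]))
        for doc_id, importance in ranked:
            line = f"[{importance}] {doc_id}: {self.candidates[doc_id]}"
            if used + len(line) > budget_chars:
                break
            lines.append(line)
            used += len(line)
        return "\n".join(lines)


def curated_recall(curated_ids: Iterable[str], relevant_ids: Iterable[str]) -> float:
    """Assumed definition: share of relevant documents present in the curated set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(curated_ids)) / len(relevant)
```

In this sketch the policy only issues `keep`, `discard`, and `record_check` decisions, while deduplication, storage, and context rendering stay in the environment, which is the split the abstract argues for.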