Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

June 1, 2026
Authors: Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han
cs.AI

Abstract

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.
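
To make the division of labor concrete, the following is a minimal sketch of what environment-side working memory might look like. It is not the authors' implementation: the class `SearchHarness`, its methods, the three importance tags, the character-based budget, and the definition of `curated_recall` as the fraction of relevant documents present in the curated set are all assumptions made for illustration. The abstract does not specify any of these details.

```python
# Hypothetical sketch of a stateful search harness; all names are illustrative.

RANK = {"high": 0, "medium": 1, "low": 2}  # assumed importance tags


class SearchHarness:
    """Environment-side bookkeeping; the policy only issues decisions."""

    def __init__(self, budget_chars=200, snippet_chars=80):
        self.budget_chars = budget_chars    # assumed context budget, in characters
        self.snippet_chars = snippet_chars  # crude stand-in for compression
        self.candidates = {}                # doc_id -> compressed snippet
        self.curated = {}                   # doc_id -> importance tag
        self.verifications = []             # (claim, doc_id, supported)
        self._seen = set()                  # normalized texts, for deduplication

    def observe(self, doc_id, text):
        """Add a search result to the candidate pool; False if duplicate."""
        normalized = " ".join(text.lower().split())
        if doc_id in self.candidates or normalized in self._seen:
            return False
        self._seen.add(normalized)
        self.candidates[doc_id] = normalized[: self.snippet_chars]
        return True

    def keep(self, doc_id, importance):
        """Policy decision: move a candidate into the curated set."""
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        if importance not in RANK:
            raise ValueError(importance)
        self.curated[doc_id] = importance

    def discard(self, doc_id):
        """Policy decision: drop a document from the curated set."""
        self.curated.pop(doc_id, None)

    def record_verification(self, claim, doc_id, supported):
        """Log that a claim was checked against a document."""
        self.verifications.append((claim, doc_id, supported))

    def render(self):
        """Budget-aware rendering: most important curated docs first."""
        lines, used = [], 0
        for doc_id in sorted(self.curated, key=lambda d: RANK[self.curated[d]]):
            line = f"[{self.curated[doc_id]}] {doc_id}: {self.candidates[doc_id]}"
            if used + len(line) > self.budget_chars:
                break
            lines.append(line)
            used += len(line)
        return "\n".join(lines)


def curated_recall(curated_ids, relevant_ids):
    """Assumed metric: share of relevant documents found in the curated set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(curated_ids)) / len(relevant)
```

In this sketch the policy would call `keep`, `discard`, and `record_verification`, while deduplication, truncation, and context rendering happen without any policy involvement, which is the kind of recoverable bookkeeping the abstract argues should live in the environment.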