Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

June 1, 2026
Authors: Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han
cs.AI

Abstract

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B-parameter search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by 11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.
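
To make the division of labor concrete, the following is a minimal sketch of what environment-side working memory of this kind could look like. It is not taken from the Harness-1 repository: the class `SearchHarnessState`, its method names, the integer importance scale, and the character-based budget are all hypothetical choices made for illustration. The environment handles deduplication, storage, and rendering, while the caller (standing in for the policy) makes only the semantic choices of what to keep, discard, or verify.

```python
from dataclasses import dataclass, field


@dataclass
class SearchHarnessState:
    """Hypothetical working memory kept by the environment, not the policy."""

    candidates: dict[str, str] = field(default_factory=dict)       # doc_id -> text
    curated: dict[str, int] = field(default_factory=dict)          # doc_id -> importance
    evidence: dict[str, list[str]] = field(default_factory=dict)   # claim -> doc_ids
    verified: dict[str, bool] = field(default_factory=dict)        # claim -> outcome
    _seen_texts: set[str] = field(default_factory=set)

    def observe(self, doc_id: str, text: str) -> bool:
        """Add a search result to the candidate pool; return False for duplicates."""
        key = " ".join(text.lower().split())  # case- and whitespace-insensitive key
        if doc_id in self.candidates or key in self._seen_texts:
            return False
        self._seen_texts.add(key)
        self.candidates[doc_id] = text
        return True

    def keep(self, doc_id: str, importance: int) -> None:
        """Policy decision: move a candidate into the curated set with a tag."""
        if doc_id not in self.candidates:
            raise KeyError(f"unknown document: {doc_id}")
        self.curated[doc_id] = importance

    def discard(self, doc_id: str) -> None:
        """Policy decision: drop a document from the curated set."""
        self.curated.pop(doc_id, None)

    def link_evidence(self, claim: str, doc_id: str) -> None:
        """Record that a document supports a claim."""
        self.evidence.setdefault(claim, []).append(doc_id)

    def record_verification(self, claim: str, ok: bool) -> None:
        """Record whether a claim has been checked and the outcome."""
        self.verified[claim] = ok

    def render(self, budget_chars: int) -> str:
        """Render curated documents, most important first, within a budget.

        The budget counts characters in the rendered lines (newline
        separators excluded); the last line that fits is truncated.
        """
        lines: list[str] = []
        used = 0
        ranked = sorted(self.curated.items(), key=lambda kv: (-kv[1], kv[0]))
        for doc_id, importance in ranked:
            line = f"[{importance}] {doc_id}: {self.candidates[doc_id]}"
            if used + len(line) > budget_chars:
                line = line[: max(0, budget_chars - used)]
            if not line:
                break
            lines.append(line)
            used += len(line)
        return "\n".join(lines)
```

In a training loop built this way, the policy would see only the output of `render` at each step rather than the full transcript, so the bookkeeping never has to be learned through reinforcement learning. How the actual harness compresses observations or allocates its context budget is not specified in the abstract.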