Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window
October 9, 2025
Authors: Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Yaojie Lu, Xianpei Han, Le Sun, WenJuan Zhang, Pengbo Wang, Shixuan Liu, Zhenru Zhang, Jianhong Tu, Hongyu Lin, Junyang Lin
cs.AI
Abstract
While recent advances in reasoning models have demonstrated cognitive
behaviors through reinforcement learning, existing approaches struggle to
invoke deep reasoning capabilities in multi-turn agents with long-horizon
interactions. We propose DeepMiner, a novel framework that elicits such
abilities by introducing high-difficulty training tasks and a dynamic context
window. DeepMiner uses a reverse construction method to generate complex
but verifiable question-answer pairs from authentic web sources, which ensures
that the training data is both challenging and reliable while injecting cognitive
capabilities into multi-turn reasoning scenarios. We further design an elegant
yet effective dynamic context management strategy for both training and
inference, utilizing sliding window mechanisms while eliminating the dependency
on external summarization models, thereby efficiently empowering the model to
handle continuously expanding long-horizon contexts. Through reinforcement
learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial
performance improvements across multiple search agent benchmarks. DeepMiner
attains 33.5% accuracy on BrowseComp-en, surpassing the previous best
open-source agent by almost 20 percentage points, and demonstrates consistent
improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our
dynamic context management enables sustained interactions of nearly 100 turns
within a standard 32k context length, effectively addressing the context
limitations that constrain existing multi-turn interaction systems.
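
To make the sliding-window idea concrete, here is a minimal Python sketch. The abstract does not say what the window drops or when, so everything here is an assumption for illustration: the `Message` type, the `slide_window` function, the `keep_last` parameter, and the placeholder text are hypothetical names, and the choice to elide older tool outputs while keeping the assistant's own reasoning is one plausible reading, not the authors' confirmed implementation.

```python
from dataclasses import dataclass

# Hypothetical marker text; the abstract does not specify one.
PLACEHOLDER = "[tool response omitted to save context]"


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def slide_window(history, keep_last=3):
    """Return a copy of `history` in which every tool message except the
    newest `keep_last` is replaced by a short placeholder.

    Assistant, user, and system messages are left unchanged, so the
    reasoning trace survives while bulky page contents are dropped.
    No external summarization model is involved.
    """
    tool_idx = [i for i, m in enumerate(history) if m.role == "tool"]
    # Indices of tool messages that have fallen out of the window.
    stale = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    return [
        Message(m.role, PLACEHOLDER) if i in stale else m
        for i, m in enumerate(history)
    ]
```

Under these assumptions, an agent loop would call `slide_window(history)` before each model call, so the prompt grows only by the short assistant turns and placeholders rather than by full tool outputs. That is how a long interaction could plausibly stay within a fixed context budget.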