ターン制限を超えて：動的コンテキストウィンドウを用いた深層探索エージェントの訓練

要旨

近年の推論モデルの進歩は、強化学習を通じて認知的行動を示してきたが、既存のアプローチでは、長期的な相互作用を伴うマルチターンエージェントにおいて深い推論能力を引き出すことに苦戦している。我々は、DeepMinerという新しいフレームワークを提案し、高難度のトレーニングタスクと動的コンテキストウィンドウを導入することで、そのような能力を引き出す。DeepMinerは、信頼性のあるウェブソースから複雑だが検証可能な質問-回答ペアを生成する逆構築法を提示し、トレーニングデータの難易度と信頼性を確保しながら、マルチターン推論シナリオに認知能力を注入する。さらに、トレーニングと推論の両方において、外部の要約モデルへの依存を排除しつつスライディングウィンドウメカニズムを活用した、洗練されながらも効果的な動的コンテキスト管理戦略を設計し、モデルが継続的に拡大する長期的なコンテキストを効率的に処理できるようにする。Qwen3-32Bに対する強化学習を通じて、DeepMiner-32Bを開発し、複数の検索エージェントベンチマークで大幅な性能向上を達成した。DeepMinerは、BrowseComp-enで33.5%の精度を達成し、従来の最高のオープンソースエージェントを約20ポイント上回り、BrowseComp-zh、XBench-DeepSearch、GAIAでも一貫した改善を示した。特に、我々の動的コンテキスト管理により、標準の32kコンテキスト長内でほぼ100ターンにわたる持続的な相互作用が可能となり、既存のマルチターン相互作用システムを制約するコンテキストの限界を効果的に解決する。

English

While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.

ターン制限を超えて：動的コンテキストウィンドウを用いた深層探索エージェントの訓練

Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

要旨

Support