턴 제한을 넘어서: 동적 컨텍스트 윈도우를 활용한 심층 탐색 에이전트 훈련

초록

최근 추론 모델의 발전은 강화 학습을 통해 인지적 행동을 보여주었지만, 기존 접근 방식은 장기적 상호작용이 필요한 다중 턴 에이전트에서 깊은 추론 능력을 발휘하는 데 어려움을 겪고 있습니다. 우리는 이러한 능력을 이끌어내기 위해 고난이도 훈련 작업과 동적 컨텍스트 창을 도입한 새로운 프레임워크인 DeepMiner를 제안합니다. DeepMiner는 실제 웹 소스에서 복잡하지만 검증 가능한 질문-답변 쌍을 생성하기 위한 역구성 방법을 제시하여, 훈련 데이터의 도전성과 신뢰성을 보장하면서 다중 턴 추론 시나리오에 인지 능력을 주입합니다. 또한, 우리는 훈련과 추론 모두를 위한 우아하면서도 효과적인 동적 컨텍스트 관리 전략을 설계하여, 슬라이딩 윈도우 메커니즘을 활용하면서 외부 요약 모델에 대한 의존성을 제거함으로써, 모델이 지속적으로 확장되는 장기적 컨텍스트를 효율적으로 처리할 수 있도록 합니다. Qwen3-32B에 대한 강화 학습을 통해 DeepMiner-32B를 개발하였으며, 이는 여러 검색 에이전트 벤치마크에서 상당한 성능 향상을 달성했습니다. DeepMiner는 BrowseComp-en에서 33.5%의 정확도를 달성하여 이전 최고의 오픈소스 에이전트를 거의 20% 포인트 앞섰으며, BrowseComp-zh, XBench-DeepSearch, GAIA에서도 일관된 개선을 보여주었습니다. 특히, 우리의 동적 컨텍스트 관리는 표준 32k 컨텍스트 길이 내에서 거의 100턴에 이르는 지속적인 상호작용을 가능하게 하여, 기존 다중 턴 상호작용 시스템을 제한하는 컨텍스트 한계를 효과적으로 해결합니다.

English

While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.

턴 제한을 넘어서: 동적 컨텍스트 윈도우를 활용한 심층 탐색 에이전트 훈련

Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

초록

Support