IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

Abstract

Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch outperforms existing open-source agents by an average of 14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and it also serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
