IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
November 10, 2025
Authors: Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Recent advances in deep-research agents have shown promise for autonomous
knowledge construction through dynamic reasoning over external sources.
However, existing approaches rely on a mono-contextual paradigm that
accumulates all information in a single, expanding context window, leading to
context suffocation and noise contamination that limit their effectiveness on
long-horizon tasks. We introduce IterResearch, a novel iterative deep-research
paradigm that reformulates long-horizon research as a Markov Decision Process
with strategic workspace reconstruction. By maintaining an evolving report as
memory and periodically synthesizing insights, our approach preserves
consistent reasoning capacity across arbitrary exploration depths. We further
develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning
framework that incentivizes efficient exploration through geometric reward
discounting and enables stable distributed training via adaptive downsampling.
Extensive experiments demonstrate that IterResearch achieves substantial
improvements over existing open-source agents, with an average gain of +14.5 percentage points across six
benchmarks and narrows the gap with frontier proprietary systems. Remarkably,
our paradigm exhibits unprecedented interaction scaling, extending to 2048
interactions with dramatic performance gains (from 3.5% to 42.5%), and serves
as an effective prompting strategy, improving frontier models by up to 19.2pp
over ReAct on long-horizon tasks. These findings position IterResearch as a
versatile solution for long-horizon reasoning, effective both as a trained
agent and as a prompting paradigm for frontier models.
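The two mechanisms the abstract names can be sketched in a few lines: a round loop in which the policy conditions only on a reconstructed workspace (the question plus an evolving report) rather than on an ever-growing history, and an EAPO-style return in which the final reward is discounted geometrically by the number of interactions. This is a minimal illustrative sketch, not the paper's implementation; all names here (`Workspace`, `Action`, `run_episode`, `eapo_return`, `gamma`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "search", "visit", or "answer" (hypothetical action set)
    payload: str

@dataclass
class Workspace:
    """Markovian state: the question plus an evolving report.

    Unlike a mono-contextual agent, the workspace is rebuilt every round,
    so its size does not grow with the number of past interactions."""
    question: str
    report: str = ""  # periodically synthesized insights act as memory

def run_episode(policy, tools, question, max_rounds=8):
    """One research episode: each round the policy sees only the
    reconstructed workspace and emits an updated report plus one action."""
    ws = Workspace(question)
    for t in range(max_rounds):
        report, action = policy(ws)
        ws = Workspace(question, report)   # strategic workspace reconstruction
        if action.kind == "answer":
            return action.payload, t + 1   # answer and interaction count
        obs = tools[action.kind](action.payload)
        # Fold the new observation into the report; noise from earlier raw
        # tool outputs is not carried forward verbatim.
        ws = Workspace(question, report + "\n- " + str(obs))
    return None, max_rounds

def eapo_return(final_reward, num_interactions, gamma=0.95):
    """Geometric reward discounting: the same final reward is worth less
    the more interactions an episode used, rewarding efficient exploration."""
    return final_reward * gamma ** num_interactions
```

Under this discounting, a correct answer found in 5 interactions earns a strictly larger return than the same answer found in 10, which is the incentive the abstract attributes to EAPO.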