

IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

November 10, 2025
作者: Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI

Abstract

Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents, averaging +14.5pp across six benchmarks, and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
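To make the "geometric reward discounting" idea concrete, here is a minimal sketch of how such an efficiency-aware return could be computed. This is an illustration only: the function name, the default `gamma`, and the choice to discount a single terminal task reward by the number of agent-environment interactions are assumptions, not the paper's exact EAPO formulation.

```python
def discounted_return(task_reward: float, num_interactions: int, gamma: float = 0.99) -> float:
    """Hypothetical geometric discounting: the terminal task reward is scaled
    by gamma raised to the number of interactions, so trajectories that solve
    the task in fewer steps receive a strictly higher return."""
    return task_reward * (gamma ** num_interactions)

# A short trajectory earns more credit than a long one for the same outcome.
short_run = discounted_return(1.0, num_interactions=10)
long_run = discounted_return(1.0, num_interactions=200)
```

Under this sketch, a policy optimized against `discounted_return` is pushed toward efficient exploration, since every extra tool call geometrically shrinks the reward it can collect.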