OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation
June 16, 2026
Authors: Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan
cs.AI
Abstract
Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.
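The abstract does not state the training objective used in the slow loop. As a rough illustration only, one common formulation of on-policy distillation scores tokens sampled by the student with a per-token reverse KL divergence against a teacher distribution. The sketch below assumes the teacher is the same model conditioned on privileged hindsight. The function names, the toy logits, and the choice of reverse KL are all assumptions for illustration, not details taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (hypothetical objective)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def on_policy_distill_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over one trajectory sampled from the student.

    student_steps: per-position logits from the deployable policy.
    teacher_steps: per-position logits from the same model given privileged
                   hindsight (an assumption about how the teacher is built).
    """
    if len(student_steps) != len(teacher_steps) or not student_steps:
        raise ValueError("need two non-empty trajectories of equal length")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy trajectory of two token positions over a vocabulary of size 2.
student = [[2.0, 0.0], [0.5, 0.5]]
teacher = [[0.0, 2.0], [0.5, 0.5]]
loss = on_policy_distill_loss(student, teacher)
```

In this toy case the second position contributes zero because student and teacher agree there, so the loss comes entirely from the first position, where the two disagree. In a real system the logits would come from model forward passes, and the gradient would flow only through the student.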