OPD-Evolver: Cultivating a Holistic Agent Evolver via On-Policy Distillation

Abstract

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but they often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver outperforms memory systems such as ReasoningBank by up to 11.5% and training-based methods such as Skill0 by about 5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge much larger models such as Qwen3.5-397B-A17B and Step-3.5-Flash, and marking a step from memory-augmented agents toward genuinely qualified agent evolvers.