OPD-Evolver: Cultivating a Holistic Agent Evolver via On-Policy Distillation

Abstract

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.