OPD-Evolver: Cultivating a Holistic Agent Evolver via On-Policy Distillation

Abstract

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but they often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates an agent evolver with this competence through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by approximately 5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge far larger counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, and pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.
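
To make the four fast-loop abilities concrete, the following is a minimal sketch of one read, use, write, maintain cycle over a toy memory store. It is an illustration under stated assumptions, not the OPD-Evolver implementation: the abstract does not describe the four memory levels, the retrieval rule, or the attribution signal, so the sketch collapses the hierarchy into a single flat store, retrieves by word overlap, and uses a plus or minus one utility credit. The names `MemoryItem`, `MemoryStore`, `credit`, and `fast_loop_step` are all hypothetical, and the slow loop (distillation into the policy) is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    utility: float = 0.0  # hypothetical running credit from task outcomes


@dataclass
class MemoryStore:
    capacity: int = 3  # assumed size budget; not taken from the paper
    items: list = field(default_factory=list)

    def read(self, query, k=2):
        """Read: return up to k items sharing at least one word with the query."""
        words = set(query.lower().split())

        def score(item):
            overlap = len(words & set(item.text.lower().split()))
            return (overlap, item.utility)

        ranked = sorted(self.items, key=score, reverse=True)
        return [it for it in ranked[:k] if score(it)[0] > 0]

    def credit(self, used, success):
        """Toy outcome-based attribution: +1 on success, -1 on failure."""
        for it in used:
            it.utility += 1.0 if success else -1.0

    def write(self, text):
        """Write: append a new piece of reusable knowledge."""
        self.items.append(MemoryItem(text))

    def maintain(self):
        """Maintain: keep the `capacity` best items; newer items win ties."""
        ranked = sorted(
            enumerate(self.items),
            key=lambda pair: (pair[1].utility, pair[0]),
            reverse=True,
        )
        kept = sorted(ranked[: self.capacity], key=lambda pair: pair[0])
        self.items = [it for _, it in kept]


def fast_loop_step(store, task, act):
    """One cycle; `act(task, hints)` must return (success, lesson_or_None)."""
    used = store.read(task)                                # read
    success, lesson = act(task, [it.text for it in used])  # use
    store.credit(used, success)
    if lesson:
        store.write(lesson)                                # write
    store.maintain()                                       # maintain
    return success
```

In this sketch `act` stands in for the agent policy. Passing a stub such as `lambda task, hints: (True, "some lesson")` is enough to watch the store credit the retrieved items, append the new lesson, and evict the lowest-utility entry once the capacity is exceeded.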