Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks
October 9, 2025
Authors: Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, Yu Qiao, Haifeng Li
cs.AI
Abstract
Large Language Models have demonstrated remarkable capabilities across
diverse domains, yet significant challenges persist when deploying them as AI
agents for real-world long-horizon tasks. Existing LLM agents suffer from a
critical limitation: they are static at test time and cannot learn from
experience, lacking the ability to accumulate knowledge and continuously
improve on the job. To address this challenge, we propose MUSE, a novel agent
framework that introduces an experience-driven, self-evolving system centered
around a hierarchical Memory Module. MUSE organizes diverse levels of
experience and leverages them to plan and execute long-horizon tasks across
multiple applications. After each sub-task execution, the agent autonomously
reflects on its trajectory, converting the raw trajectory into structured
experience and integrating it back into the Memory Module. This mechanism
enables the agent to evolve beyond its static pretrained parameters, fostering
continuous learning and self-evolution. We evaluate MUSE on the long-horizon
productivity benchmark TAC, where it achieves a new state-of-the-art result by a
significant margin using only the lightweight Gemini-2.5 Flash model. Extensive
experiments show that as the agent autonomously accumulates experience, its
task completion capability steadily improves, demonstrating robust
continual learning and self-evolution. Moreover, the accumulated
experience from MUSE exhibits strong generalization properties, enabling
zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI
agents capable of real-world productivity task automation.
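
The execute, reflect, and integrate loop described in the abstract can be illustrated with a minimal sketch. This is not the MUSE implementation: the names `Experience`, `MemoryModule`, `reflect`, `run_task`, and `toy_execute` are hypothetical, as are the level names `"procedure"` and `"strategy"`. Keyword-overlap retrieval and a rule-based `reflect` stand in for whatever LLM-driven components the framework actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Experience:
    level: str    # hypothetical level name, e.g. "procedure" or "strategy"
    subtask: str  # the sub-task this experience was distilled from
    lesson: str   # structured summary of the raw trajectory


@dataclass
class MemoryModule:
    # Experiences grouped by level; a stand-in for a hierarchical memory.
    store: Dict[str, List[Experience]] = field(default_factory=dict)

    def integrate(self, exp: Experience) -> None:
        self.store.setdefault(exp.level, []).append(exp)

    def retrieve(self, subtask: str) -> List[Experience]:
        # Assumption: naive keyword overlap replaces real retrieval.
        words = set(subtask.lower().split())
        return [
            e
            for exps in self.store.values()
            for e in exps
            if words & set(e.subtask.lower().split())
        ]


def reflect(subtask: str, trajectory: List[str], succeeded: bool) -> Experience:
    # Assumption: a rule-based stub in place of LLM-driven reflection.
    level = "procedure" if succeeded else "strategy"
    return Experience(level, subtask, " -> ".join(trajectory))


Executor = Callable[[str, List[Experience]], Tuple[List[str], bool]]


def run_task(subtasks: List[str], execute: Executor, memory: MemoryModule) -> List[bool]:
    results = []
    for subtask in subtasks:
        hints = memory.retrieve(subtask)            # use prior experience
        trajectory, ok = execute(subtask, hints)    # act on the sub-task
        memory.integrate(reflect(subtask, trajectory, ok))  # learn from it
        results.append(ok)
    return results


def toy_execute(subtask: str, hints: List[Experience]) -> Tuple[List[str], bool]:
    # Toy executor: with a relevant hint it takes a shorter path.
    if hints:
        return ["reuse known steps"], True
    return ["explore", "open app", "finish"], True


memory = MemoryModule()
run_task(["export sales report"], toy_execute, memory)
run_task(["export sales report"], toy_execute, memory)
```

In this toy run the second attempt finds the experience stored by the first and follows a shorter trajectory, which mirrors, in a very simplified way, the claim that task completion improves as experience accumulates.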