

Reinforcement Learning for Self-Improving Agent with Skill Library

December 18, 2025
作者: Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong
cs.AI

Abstract

Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents' self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework's key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to a supervised fine-tuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.
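
The paper's implementation details are not reproduced on this page, but a minimal Python sketch may help make the abstract concrete: a Sequential Rollout walks the agent over a chain of similar tasks, carries a growing skill library forward, and mixes the outcome reward with a skill-integration bonus for the RL update. All names, fields, and weights below (Trajectory, agent_fn, alpha, skill_integrated_reward) are illustrative assumptions, not the authors' actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """What one rollout on one task returns (fields are assumed, not from the paper)."""
    task_completed: bool
    new_skills: List[str]
    skills_used: List[str]


@dataclass
class SkillLibrary:
    """Skills accumulated from earlier tasks in the chain."""
    skills: List[str] = field(default_factory=list)

    def add(self, new_skills: List[str]) -> None:
        for skill in new_skills:
            if skill not in self.skills:
                self.skills.append(skill)


def skill_integrated_reward(traj: Trajectory, library: SkillLibrary) -> float:
    """Toy stand-in for the skill-integrated reward: credit reuse of
    library skills plus generation of new ones."""
    reuse = sum(1 for s in traj.skills_used if s in library.skills)
    return reuse + 0.5 * len(traj.new_skills)


def sequential_rollout(
    agent_fn: Callable[[str, List[str]], Trajectory],
    task_chain: List[str],
    alpha: float = 0.5,  # assumed weight between outcome and skill rewards
) -> List[float]:
    """Run the agent over a chain of similar tasks, carrying skills forward,
    and return the combined per-task rewards used for the policy update."""
    library = SkillLibrary()
    rewards = []
    for task in task_chain:
        traj = agent_fn(task, library.skills)          # agent is prompted with skills gathered so far
        outcome = 1.0 if traj.task_completed else 0.0  # original outcome-based reward
        rewards.append(outcome + alpha * skill_integrated_reward(traj, library))
        library.add(traj.new_skills)                   # later tasks can call these skills
    return rewards
```

In this sketch the skill library persists only across one task chain; how SAGE scores skill quality and folds the combined reward into the GRPO objective is specified in the paper itself.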