
Reinforcement Learning for Self-Improving Agent with Skill Library

December 18, 2025
Authors: Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong
cs.AI

Abstract

Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents' self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework's key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to a supervised fine-tuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.
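
To make the described mechanism concrete, the sketch below illustrates how a Sequential Rollout over a chain of similar tasks, an accumulating skill library, and a Skill-integrated Reward combined with an outcome-based reward might be organized. All names (SkillLibrary, run_agent, task_chain, alpha) and the specific reward weighting are illustrative assumptions based only on the abstract, not the authors' actual implementation or the AppWorld API.

```python
# Minimal, assumption-laden sketch of Sequential Rollout with a skill library
# and a skill-integrated reward; names and reward shapes are hypothetical.
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Accumulates skills produced while solving earlier tasks in the chain."""
    skills: dict = field(default_factory=dict)

    def add(self, name: str, code: str) -> None:
        self.skills[name] = code

    def retrieve(self) -> dict:
        return dict(self.skills)


def run_agent(task, library: SkillLibrary):
    """Placeholder for one multi-turn episode: the agent may call skills
    already in `library` and may write new ones; returns the trajectory,
    a success flag, and the newly generated skills."""
    trajectory, success, new_skills = [], False, {}
    return trajectory, success, new_skills


def sequential_rollout(task_chain, library: SkillLibrary, alpha: float = 0.5):
    """Roll the agent through a chain of similar tasks. Skills generated on
    earlier tasks become available to later ones, and each trajectory is
    scored with an outcome reward plus an illustrative skill bonus that
    would feed into GRPO-style policy updates."""
    rollouts = []
    for task in task_chain:
        trajectory, success, new_skills = run_agent(task, library)

        outcome_reward = 1.0 if success else 0.0      # task completion signal
        skill_reward = alpha * len(new_skills)        # hypothetical skill bonus
        total_reward = outcome_reward + skill_reward  # combined training reward

        for name, code in new_skills.items():         # grow the library
            library.add(name, code)

        rollouts.append((trajectory, total_reward))
    return rollouts
```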