
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

May 7, 2026
作者: Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi GU, Xunliang Cai, Xiang Wang, An Zhang
cs.AI

Abstract

A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities: the agent must select a relevant skill, utilize it during execution, and distill new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal: its low-frequency trend credits selection, while its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing either credit signal degrades the evolution.
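The loop the abstract describes (query the library, re-rank and select a skill, act, then distill a new skill) and the split of one outcome signal into a low-frequency and a high-frequency credit can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and function names (`SkillLibrary`, `credit_signals`), the keyword-overlap retrieval, and the moving-average decomposition are all our assumptions.

```python
from collections import deque

class SkillLibrary:
    """Hypothetical persistent skill store; keyword overlap stands in
    for whatever retrieval the policy's generated query actually drives."""
    def __init__(self):
        self.skills = []  # list of (description, procedure) pairs

    def search(self, query, k=3):
        # Rank candidates by how many query words their description contains.
        words = query.split()
        scored = sorted(self.skills,
                        key=lambda s: -sum(w in s[0] for w in words))
        return scored[:k]

    def add(self, description, procedure):
        # Distillation step: a new skill extracted from a trajectory.
        self.skills.append((description, procedure))


def credit_signals(outcomes, window=5):
    """Split a stream of task outcomes into a slow trend (credits skill
    selection) and a fast residual (credits distillation). The abstract
    names the two components; the moving-average split is an assumption."""
    buf = deque(maxlen=window)
    trend, variation = [], []
    for r in outcomes:
        buf.append(r)
        mean = sum(buf) / len(buf)
        trend.append(mean)           # low-frequency component
        variation.append(r - mean)   # high-frequency component
    return trend, variation
```

Under this reading, selection is rewarded for moving the running success rate, while distillation is rewarded for the per-task deviation a newly added skill produces.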
PDF · May 9, 2026