
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

May 7, 2026
Authors: Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi GU, Xunliang Cai, Xiang Wang, An Zhang
cs.AI

Abstract

A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities: the agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks the candidates to select one, solves the task conditioned on it, and distills a new skill from the resulting trajectory. All learning derives from a single task-outcome signal: its low-frequency trend credits selection, and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing either credit signal degrades the evolution.
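The select → utilize → distill loop described in the abstract can be sketched as a minimal toy implementation. This is not the authors' code: the `Skill`/`SkillLibrary` types, the keyword-overlap retrieval, and the callback names (`make_query`, `rerank`, `solve`, `distill`) are all illustrative assumptions standing in for the learned policy's components.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A stored skill: a name, retrieval keywords, and a reusable hint."""
    name: str
    keywords: set
    hint: str


@dataclass
class SkillLibrary:
    """Toy persistent skill library with keyword-overlap retrieval."""
    skills: list = field(default_factory=list)

    def search(self, query: str, k: int = 3) -> list:
        # Rank stored skills by keyword overlap with the query; return top-k.
        words = set(query.lower().split())
        ranked = sorted(self.skills,
                        key=lambda s: len(words & s.keywords),
                        reverse=True)
        return ranked[:k]

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)


def agent_step(library, task, make_query, rerank, solve, distill):
    """One iteration of the select -> utilize -> distill loop.

    In Skill1, one policy would implement all four callbacks; here they
    are plain functions so the control flow is easy to follow.
    """
    query = make_query(task)                   # selection: form a retrieval query
    candidates = library.search(query)         # selection: retrieve candidates
    chosen = rerank(task, candidates)          # selection: re-rank, pick one
    trajectory, success = solve(task, chosen)  # utilization: act on the skill
    new_skill = distill(trajectory, success)   # distillation: extract a skill
    if new_skill is not None:
        library.add(new_skill)                 # library persists across tasks
    return success
```

In the paper, a single task-outcome reward trains all of these steps jointly; the sketch only makes the shared control flow explicit.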