
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

April 19, 2026
Authors: Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Haibo Qiu, Wei Yang, Qingnan Ren, Shun Zou, Wenxuan Huang, Lin Chen, Zehui Chen, Feng Zhao
cs.AI

Abstract

As the capability frontier of autonomous agents expands, agents increasingly complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families. Tasks within each family are constructed from a Domain-Agnostic Execution Flow (DAEF), an agent workflow framework, so that they share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol: they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
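
The evaluation protocol described in the abstract (start with no skills, solve a family's tasks in order, patch the library after each task, carry it forward) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: `SkillLibrary`, `run_family`, `success_rate`, and the `solve` / `make_patch` callables are hypothetical names, and the toy agent at the end only demonstrates the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillLibrary:
    """Hypothetical container mapping a skill name to its text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch adds new skills or overwrites existing ones.
        self.skills.update(patch)


# Hypothetical signatures (assumptions, not taken from the paper):
#   solve(task, library) -> (success, trajectory)
#   make_patch(task, trajectory, success) -> patch
Solve = Callable[[str, SkillLibrary], Tuple[bool, str]]
MakePatch = Callable[[str, str, bool], Dict[str, str]]


def run_family(tasks: List[str], solve: Solve, make_patch: MakePatch) -> List[bool]:
    """Solve one family's tasks in order, carrying the library forward."""
    library = SkillLibrary()  # the agent begins without skills
    results: List[bool] = []
    for task in tasks:
        success, trajectory = solve(task, library)
        results.append(success)
        # Externalize the lesson from this task before moving to the next one.
        library.apply_patch(make_patch(task, trajectory, success))
    return results


def success_rate(results: List[bool]) -> float:
    """Task success as a percentage."""
    return 100.0 * sum(results) / len(results)


# Toy stand-ins for an agent, purely illustrative.
def toy_solve(task: str, library: SkillLibrary) -> Tuple[bool, str]:
    ok = task.startswith("easy") or "workflow" in library.skills
    return ok, f"trace of {task}"


def toy_patch(task: str, trajectory: str, success: bool) -> Dict[str, str]:
    # In this toy, a skill is written only after a failure.
    return {} if success else {"workflow": f"lesson from {trajectory}"}


results = run_family(["easy-1", "hard-2", "hard-3"], toy_solve, toy_patch)
print(results, round(success_rate(results), 2))
```

In the toy run, the second task fails with an empty library, the resulting patch adds a skill, and the third task then succeeds. As a side note on the reported figures, 62.65% and 71.08% are consistent with 104 and 118 successes out of 166 tasks, although the abstract does not state raw counts.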