SkillFlow: A Benchmark Framework for Lifelong Skill Discovery and Evolution in Autonomous Agents

Abstract

As autonomous agents grow more capable, they increasingly complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families. Task construction within each family follows a Domain-Agnostic Execution Flow (DAEF), which defines an agent workflow framework so that the family's tasks share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only 0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
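
The Agentic Lifelong Learning protocol described above can be pictured as a simple loop. The following is only a minimal sketch under stated assumptions, not the benchmark's actual harness: `SkillLibrary`, `run_family`, `success_rate`, and the `solve` and `propose_patch` callables are hypothetical names introduced here for illustration. The sketch also assumes that the library starts empty for each family and that rubric feedback is reduced to a single success flag.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class SkillLibrary:
    """Hypothetical skill store mapping a skill name to its text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch may add new skills or overwrite (repair) existing ones.
        self.skills.update(patch)


# Assumed interface: solve(task, library) -> (success, trajectory)
Solver = Callable[[str, SkillLibrary], Tuple[bool, str]]
# Assumed interface: propose_patch(task, trajectory, success) -> patch
Patcher = Callable[[str, str, bool], Dict[str, str]]


def run_family(tasks: Sequence[str], solve: Solver,
               propose_patch: Patcher) -> List[bool]:
    """Solve one family's tasks in order, patching the library after each."""
    library = SkillLibrary()  # assumption: each family starts with no skills
    results: List[bool] = []
    for task in tasks:
        success, trajectory = solve(task, library)
        results.append(success)
        # Externalize the lesson and carry the updated library forward.
        library.apply_patch(propose_patch(task, trajectory, success))
    return results


def success_rate(results_by_family: Sequence[Sequence[bool]]) -> float:
    """Task success over all families, in percent."""
    flat = [r for family in results_by_family for r in family]
    return 100.0 * sum(flat) / len(flat)
```

In this framing, the vanilla setting would correspond to running the same loop with a `propose_patch` that always returns an empty patch, so the gap between the two success rates reflects the utility of the evolved skills.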