SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
February 13, 2026
Authors: Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee
cs.AI
Abstract
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains, each paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (from +4.5pp for Software Engineering to +51.9pp for Healthcare), and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2–3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
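For concreteness, the sketch below illustrates how per-condition pass rates and the percentage-point (pp) deltas quoted above could be computed from verifier outcomes. This is a hypothetical reconstruction, not the authors' code: the condition labels and the `pass_rates` helper are assumptions for illustration; the only grounded idea is that each trajectory gets a binary pass/fail from a task's deterministic verifier.

```python
# Hypothetical sketch (not from the paper) of aggregating deterministic
# verifier outcomes into per-condition pass rates and pp deltas.
from collections import defaultdict

# Condition names are assumed labels, mirroring the three evaluation
# conditions described in the abstract.
CONDITIONS = ("no_skills", "curated_skills", "self_generated_skills")

def pass_rates(trajectories):
    """trajectories: iterable of (condition, passed) pairs, where
    `passed` is the boolean output of a task's deterministic verifier.
    Returns pass rate per condition as a percentage."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for condition, ok in trajectories:
        total[condition] += 1
        passed[condition] += int(ok)
    return {c: 100.0 * passed[c] / total[c] for c in CONDITIONS if total[c]}

# Toy example: a delta is reported in percentage points, i.e. the
# arithmetic difference of two pass rates (e.g. 56.2% - 40.0% = +16.2pp).
rates = pass_rates([
    ("no_skills", False), ("no_skills", True),
    ("curated_skills", True), ("curated_skills", True),
])
delta_pp = rates["curated_skills"] - rates["no_skills"]
print(f"curated-vs-none delta: {delta_pp:+.1f}pp")  # +50.0pp on this toy data
```

Under this reading, a "negative delta" for a task simply means its curated-Skills pass rate falls below its no-Skills pass rate, which the abstract reports for 16 of 84 tasks.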