

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

February 13, 2026
作者: Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee
cs.AI

Abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains, each paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (from +4.5 pp for Software Engineering to +51.9 pp for Healthcare), and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
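The per-domain deltas reported above (pass rate under curated Skills minus pass rate without Skills, in percentage points) can be sketched with a short aggregation over trajectory records. This is a minimal illustration, not the paper's actual evaluation code; the record fields (`domain`, `condition`, `passed`) and condition labels are assumptions for the example.

```python
from collections import defaultdict

def pass_rate_deltas(trajectories):
    """Per-domain percentage-point delta of curated Skills over no Skills.

    `trajectories`: iterable of dicts with keys 'domain',
    'condition' ('none' | 'curated' | 'self'), and 'passed' (bool).
    """
    # (domain, condition) -> [number passed, number of runs]
    counts = defaultdict(lambda: [0, 0])
    for t in trajectories:
        key = (t["domain"], t["condition"])
        counts[key][0] += t["passed"]  # bool counts as 0/1
        counts[key][1] += 1

    def rate(domain, condition):
        passed, total = counts[(domain, condition)]
        return 100.0 * passed / total

    domains = {d for d, _ in counts}
    return {d: rate(d, "curated") - rate(d, "none") for d in domains}

# Toy example: four runs in one hypothetical domain.
runs = [
    {"domain": "healthcare", "condition": "none", "passed": False},
    {"domain": "healthcare", "condition": "none", "passed": False},
    {"domain": "healthcare", "condition": "curated", "passed": True},
    {"domain": "healthcare", "condition": "curated", "passed": False},
]
print(pass_rate_deltas(runs))  # {'healthcare': 50.0}
```

Averaging these per-domain deltas over all 86 tasks would yield the headline +16.2 pp figure; a negative value for a task flags a case where Skills hurt.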