
SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

March 16, 2026
Authors: Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, Lijie Hu
cs.AI

Abstract

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
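The controlled paired evaluation described above runs each task instance twice on the same pinned repository and requirement document, once with the skill injected and once without, then compares execution-based pass rates and token usage. A minimal sketch of how such per-skill metrics could be aggregated is shown below; the `TrialResult` structure and `paired_metrics` helper are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of the with/without-skill paired evaluation: each task
# instance is run once per condition on the same repo commit and requirement
# document, and per-skill pass-rate delta and token overhead are aggregated.
from dataclasses import dataclass


@dataclass
class TrialResult:
    passed: bool  # did the execution-based acceptance tests pass?
    tokens: int   # tokens consumed by the agent on this run


def paired_metrics(baseline: list[TrialResult],
                   with_skill: list[TrialResult]) -> tuple[float, float]:
    """Return (pass-rate delta, relative token overhead) for one skill.

    Both lists hold one entry per task instance, aligned by index, so the
    comparison is a controlled pair on identical inputs.
    """
    assert len(baseline) == len(with_skill)
    n = len(baseline)
    base_pass = sum(r.passed for r in baseline) / n
    skill_pass = sum(r.passed for r in with_skill) / n
    base_tokens = sum(r.tokens for r in baseline)
    skill_tokens = sum(r.tokens for r in with_skill)
    delta_pass = skill_pass - base_pass                       # e.g. +0.30 = +30%
    token_overhead = (skill_tokens - base_tokens) / base_tokens  # e.g. 4.51 = +451%
    return delta_pass, token_overhead
```

Under this scheme, a skill like the reported worst case would show `delta_pass == 0.0` alongside `token_overhead == 4.51`, making "cost without benefit" directly measurable per skill.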