SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Abstract

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and with requirement documents that state explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that the benefits of skill injection are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) because version-mismatched guidance conflicts with the project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
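
To make the paired evaluation concrete, the following is a minimal sketch of how a per-skill pass-rate change could be computed when each task instance is run once without and once with the skill. It is not the benchmark's actual code: the `PairedResult` type, the `pass_rate_delta` function, and the toy data are all hypothetical, and the sketch assumes the reported gains are differences in percentage points, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Outcome of one task instance run twice: without and with the skill.

    Hypothetical record type for illustration only.
    """
    skill: str
    passed_without_skill: bool
    passed_with_skill: bool


def pass_rate_delta(results: list[PairedResult]) -> dict[str, float]:
    """Per skill: pass rate with the skill minus pass rate without it.

    The result is expressed in percentage points (an assumption).
    """
    # Group the paired runs by the skill they evaluate.
    grouped: dict[str, list[PairedResult]] = {}
    for r in results:
        grouped.setdefault(r.skill, []).append(r)

    deltas: dict[str, float] = {}
    for skill, rows in grouped.items():
        n = len(rows)
        with_rate = sum(r.passed_with_skill for r in rows) / n
        without_rate = sum(r.passed_without_skill for r in rows) / n
        deltas[skill] = 100.0 * (with_rate - without_rate)
    return deltas


# Made-up toy data, NOT results from the paper.
toy = [
    PairedResult("skill-a", passed_without_skill=False, passed_with_skill=True),
    PairedResult("skill-a", passed_without_skill=True, passed_with_skill=True),
    PairedResult("skill-b", passed_without_skill=True, passed_with_skill=True),
    PairedResult("skill-b", passed_without_skill=False, passed_with_skill=False),
]
deltas = pass_rate_delta(toy)
```

Because both runs of a pair share the same repository commit, requirement document, and tests, any nonzero delta in such a setup can be attributed to the skill rather than to differences between tasks.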
