SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Abstract

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
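The paired evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that each task instance is run once with the skill and once without, that the pass-rate gain is reported in percentage points, and that token overhead is the relative change in total tokens; the abstract does not spell out these definitions. The `PairedRun` schema and both function names are hypothetical and are not taken from the SWE-Skills-Bench repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairedRun:
    """One task instance evaluated twice (hypothetical schema, not the benchmark's)."""
    passed_with: bool      # acceptance tests passed with the skill injected
    passed_without: bool   # acceptance tests passed without the skill
    tokens_with: int       # tokens consumed with the skill
    tokens_without: int    # tokens consumed without the skill


def pass_rate_delta(runs: List[PairedRun]) -> float:
    """Pass rate with the skill minus pass rate without it, in percentage points."""
    n = len(runs)
    rate_with = sum(r.passed_with for r in runs) / n
    rate_without = sum(r.passed_without for r in runs) / n
    return 100.0 * (rate_with - rate_without)


def token_overhead(runs: List[PairedRun]) -> float:
    """Relative change in total tokens caused by skill injection, in percent."""
    total_with = sum(r.tokens_with for r in runs)
    total_without = sum(r.tokens_without for r in runs)
    return 100.0 * (total_with - total_without) / total_without


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    toy_runs = [
        PairedRun(passed_with=(i < 8), passed_without=(i < 7),
                  tokens_with=1500, tokens_without=1000)
        for i in range(10)
    ]
    print(f"pass-rate delta: {pass_rate_delta(toy_runs):+.1f} points")
    print(f"token overhead:  {token_overhead(toy_runs):+.1f}%")
```

Under these assumptions, a skill with zero pass-rate delta and a large positive token overhead corresponds to the case the abstract highlights, where cost rises while pass rates stay unchanged.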
