How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Abstract

Agent skills, reusable domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill usage remain scarce. Existing skill benchmarks focus on overly idealized conditions in which the LLM is handed hand-crafted, narrowly tailored skills for each task. In many realistic settings, by contrast, the agent must search for and select relevant skills on its own, and even the closest-matching skills may fit the task poorly. In this paper, we conduct the first comprehensive study of skill utility under progressively more challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
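To make the retrieval setting concrete, the following is a minimal sketch of an agent picking skills from a library by matching a task query against skill descriptions. It is an illustration only and not the retriever used in the paper: the bag-of-words cosine scoring, the function names (`tokenize`, `cosine`, `retrieve_skills`), and the three-entry `SKILLS` library are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_skills(query, skills, k=2):
    """Return the names of up to k skills whose descriptions best match the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(description))), name)
        for name, description in skills.items()
    ]
    # Highest score first; ties broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Skills with no token overlap at all are dropped.
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical skill library (illustrative entries, not from the paper's collection).
SKILLS = {
    "git-bisect": "find the commit that introduced a bug using git bisect",
    "csv-cleanup": "clean and deduplicate rows in a csv file with pandas",
    "docker-debug": "debug a failing docker container build and inspect logs",
}

top = retrieve_skills("deduplicate rows in a large csv file", SKILLS, k=1)
print(top)  # prints ['csv-cleanup']
```

Even in this toy version, the closest match is only as good as the library allows: a query with no overlapping skill returns an empty list, and a weak match may still be returned if nothing better exists, which is the kind of imperfect fit the abstract describes.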
